WEBVTT captioned by sachac
NOTE Intro
00:00:00.000 --> 00:00:10.799
Hi everyone, this is EmacsConf 2024. I'm Colin, and today
00:00:10.800 --> 00:00:17.319
I'll be talking about transducers.
00:00:17.320 --> 00:00:21.879
After introducing them, I'll share a bit of history about
00:00:21.880 --> 00:00:25.359
transducers and the problems that they solve, some basics
00:00:25.360 --> 00:00:28.879
about how we can use them, how they work, like how they're
00:00:28.880 --> 00:00:32.399
implemented, some demonstrations of how we can actually
00:00:32.400 --> 00:00:36.959
use them in the wild, and then some other discussions about
00:00:36.960 --> 00:00:41.519
issues that they have.
NOTE What are transducers?
00:00:41.520 --> 00:00:46.399
Okay, let's get right in. What are transducers?
00:00:46.400 --> 00:00:49.679
Transducers are a way to do streaming iteration with a
00:00:49.680 --> 00:00:55.679
modern API.
00:00:55.680 --> 00:01:00.359
Who are transducers for, and thereby, who is
00:01:00.360 --> 00:01:05.599
this talk for? Well, it's for people who want to do streamed
00:01:05.600 --> 00:01:10.519
data processing in Emacs. It's for people who perhaps
00:01:10.520 --> 00:01:14.199
aren't satisfied with the existing APIs, for example, the
00:01:14.200 --> 00:01:19.359
seq API, or some other common libraries that provide
00:01:19.360 --> 00:01:23.719
similar functionality. Maybe you're not a fan of the loop
00:01:23.720 --> 00:01:29.079
macro. Some people find it difficult to understand. Or
00:01:29.080 --> 00:01:32.719
maybe you've done a bunch of Clojure before, and you'd like
00:01:32.720 --> 00:01:36.879
more aspects of Clojure in your Emacs Lisp. Or maybe you're
00:01:36.880 --> 00:01:40.239
just interested in transducers in general, because the
00:01:40.240 --> 00:01:48.839
pattern has now been ported to multiple different Lisps.
00:01:48.840 --> 00:01:55.039
So I'm Colin. I'm fosskers on everything online, and I do
00:01:55.040 --> 00:01:58.519
mainly back-end programming work and a lot of open source
00:01:58.520 --> 00:02:05.159
software. I wrote Haskell for a long time, both as a hobbyist
00:02:05.160 --> 00:02:09.079
and professionally. Since the COVID years, I've been
00:02:09.080 --> 00:02:13.439
writing Rust, both open source and professionally. But now
00:02:13.440 --> 00:02:19.719
I find that in my spare time, I'm mostly writing Common Lisp.
00:02:19.720 --> 00:02:22.719
One thing I learned from my years of Haskell was that a lot
00:02:22.720 --> 00:02:27.519
of programming is just altering the shape of data. You know,
00:02:27.520 --> 00:02:31.359
sometimes we work through our algorithm line by line. We're
00:02:31.360 --> 00:02:36.239
trying to just tell the computer exactly what to do. But if we
00:02:36.240 --> 00:02:39.639
step back, a lot of the time we're just getting in data of some
00:02:39.640 --> 00:02:44.119
shape, changing it, and then passing it along. A lot of
00:02:44.120 --> 00:02:49.279
these patterns are common, identified
00:02:49.280 --> 00:02:53.639
decades ago. For instance, we have some collection, and we
00:02:53.640 --> 00:02:56.999
want to transform every element of that collection and then
00:02:57.000 --> 00:03:01.199
pass it on. Or maybe we're trying to filter out bad elements
00:03:01.200 --> 00:03:04.799
in that collection. Or maybe we're looking for a specific
00:03:04.800 --> 00:03:07.759
element in that collection. Yes, you could write all that
00:03:07.760 --> 00:03:11.839
with for loops, but these kinds of common patterns were
00:03:11.840 --> 00:03:18.559
identified and given names decades ago. So why not use them?
00:03:18.560 --> 00:03:21.879
They say that there are two major problems in computer
00:03:21.880 --> 00:03:25.759
science, one being cache invalidation and the other being
00:03:25.760 --> 00:03:27.589
naming things.
NOTE Common issues
00:03:27.590 --> 00:03:29.799
I've identified five other problems that
00:03:29.800 --> 00:03:33.199
come up when we're trying to deal with collections of data,
00:03:33.200 --> 00:03:40.599
or big streams of data. One is that if we were trying to
00:03:40.600 --> 00:03:45.279
load a whole file into memory all at once and process the whole
00:03:45.280 --> 00:03:48.279
thing, sometimes we can have memory problems. You've
00:03:48.280 --> 00:03:54.999
probably seen out-of-memory errors or such things.
00:03:55.000 --> 00:03:58.199
A second issue that comes up is that if we were looking at a
00:03:58.200 --> 00:04:01.799
giant for loop, in particular a nested for loop or such
00:04:01.800 --> 00:04:06.079
things, it can be hard to tell just by looking at the code what
00:04:06.080 --> 00:04:11.039
it's trying to do, what it intends. If we don't go character
00:04:11.040 --> 00:04:16.439
by character or line by line, it can be hard to understand it.
00:04:16.440 --> 00:04:20.039
Furthermore, and this is particularly an issue with Emacs
00:04:20.040 --> 00:04:26.399
Lisp, one call, for instance, to seq-map, then
00:04:26.400 --> 00:04:29.319
piped into seq-filter, for instance, will have an
00:04:29.320 --> 00:04:33.599
intermediate allocation: the map will take the source
00:04:33.600 --> 00:04:37.639
container, allocate a new one, and then the filter will
00:04:37.640 --> 00:04:40.319
operate over the second one. This is wasteful.
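NOTE Example sketch (not from the talk): the intermediate allocation described above, using only the built-in seq and cl-lib libraries. The variable names are hypothetical. The second form is a hand-written single pass shown for contrast; it is not how transducers.el is implemented.

```elisp
(require 'seq)
(require 'cl-lib)

;; Two passes: `seq-map' builds a complete intermediate list,
;; which `seq-filter' then walks a second time.
(defvar my-two-pass
  (seq-filter #'cl-evenp (seq-map #'1+ '(1 2 3 4 5))))

;; One pass: each element is mapped and tested before the next one
;; is touched, so no intermediate list is ever built.
(defvar my-one-pass
  (let (acc)
    (dolist (x '(1 2 3 4 5))
      (let ((y (1+ x)))
        (when (cl-evenp y)
          (push y acc))))
    (nreverse acc)))

;; Both variables now hold (2 4 6).
```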
00:04:40.320 --> 00:04:48.879
Furthermore, it can often be difficult to abort a stream.
00:04:48.880 --> 00:04:53.199
For instance, if we were filtering through our collection,
00:04:53.200 --> 00:04:57.319
but we knew we only wanted to go halfway, for instance, for
00:04:57.320 --> 00:05:01.759
some reason, we have no way to stop it halfway through. We
00:05:01.760 --> 00:05:05.479
just have to process the whole thing, even if we know we don't
00:05:05.480 --> 00:05:11.919
need to. Another issue is that for languages that have
00:05:11.920 --> 00:05:18.039
traits, or in Haskell they're called type classes, if you
00:05:18.040 --> 00:05:22.399
are defining what it means to map over something, you often
00:05:22.400 --> 00:05:27.039
have to redefine that for every kind of container or thing
00:05:27.040 --> 00:05:31.239
that you're iterating over. Wouldn't it be nice if we could
00:05:31.240 --> 00:05:34.719
define things like map just once and then reuse them
00:05:34.720 --> 00:05:39.839
everywhere? Now, transducers solve all five of these,
00:05:39.840 --> 00:05:44.039
without the addition of new language features, and with
00:05:44.040 --> 00:05:47.279
little more than plain old function composition.
NOTE Transducers
00:05:47.280 --> 00:05:53.119
If this is your first time hearing of transducers, yeah,
00:05:53.120 --> 00:05:57.439
no problem. They were originally invented in Clojure by
00:05:57.440 --> 00:06:01.039
Rich Hickey, and this is a quote from him. He thinks
00:06:01.040 --> 00:06:05.439
transducers are a fundamental primitive that decouples
00:06:05.440 --> 00:06:10.079
critical logic from list or sequence processing, and if he
00:06:10.080 --> 00:06:13.999
had to do Clojure all over, he'd put them at the bottom, at the
00:06:14.000 --> 00:06:19.279
very bottom of all the fundamental primitives. Now, that's
00:06:19.280 --> 00:06:24.599
Rich speaking quite highly of them. And I think he has a point
00:06:24.600 --> 00:06:25.159
here.
00:06:25.160 --> 00:06:32.399
They were invented originally in Clojure. In more
00:06:32.400 --> 00:06:34.772
recent years, they were brought over to Scheme
00:06:34.773 --> 00:06:38.774
via SRFI 171. That's where I found them
00:06:38.775 --> 00:06:41.521
when I was learning the Guile language.
00:06:41.522 --> 00:06:43.919
In the process of submitting a patch, I realized
00:06:43.920 --> 00:06:48.199
that there were other things to be improved. So I ported the
00:06:48.200 --> 00:06:51.399
pattern to Common Lisp, then Fennel, and then more
00:06:51.400 --> 00:06:56.639
recently, Emacs Lisp. The Common Lisp and Emacs Lisp APIs
00:06:56.640 --> 00:07:01.199
are identical. And the Fennel one is not identical, but
00:07:01.200 --> 00:07:05.799
fairly similar. Overall, everywhere you find
00:07:05.800 --> 00:07:10.279
transducers, they should basically be fairly uniform.
00:07:10.280 --> 00:07:15.759
When I first made the Common Lisp variant, I
00:07:15.760 --> 00:07:18.799
sampled the APIs from a number of different languages and
00:07:18.800 --> 00:07:23.439
came up with what I believed to be a representative sample of
00:07:23.440 --> 00:07:27.959
what most people would want out of such a library. I gave
00:07:27.960 --> 00:07:32.439
functions their common modern names. For instance, map
00:07:32.440 --> 00:07:35.279
is map and filter is filter and so on.
NOTE Using transducers
00:07:35.280 --> 00:07:42.599
What does the usage of transducers look like? Well,
00:07:42.600 --> 00:07:48.959
these examples will all be the Emacs Lisp variant, but the
00:07:48.960 --> 00:07:52.359
Common Lisp will look basically exactly the same, minus
00:07:52.360 --> 00:07:54.079
this little t- prefix.
00:07:54.080 --> 00:08:00.919
Running transducers requires three things. It requires a
00:08:00.920 --> 00:08:06.439
source. This could be an obvious thing like a list or a
00:08:06.440 --> 00:08:11.479
vector, but it could be other things like a file, or in Emacs
00:08:11.480 --> 00:08:16.348
Lisp in particular, a buffer.
00:08:16.349 --> 00:08:20.112
A reducer is a function. It's something like
00:08:20.113 --> 00:08:22.639
the + operator or the * operator,
00:08:22.640 --> 00:08:26.785
or certain constructors of various containers.
00:08:26.786 --> 00:08:32.125
It takes values and collates them into some final version.
00:08:32.126 --> 00:08:33.946
Now, finally, we have what we're calling here
00:08:33.947 --> 00:08:37.567
a transducer chain. This could be one transducer function
00:08:37.568 --> 00:08:43.479
or it could be multiple composed together. These are the
00:08:43.480 --> 00:08:47.079
functions that actually take data and transform them
00:08:47.080 --> 00:08:55.279
somehow. For instance, this. We have a list of three
00:08:55.280 --> 00:09:04.199
elements. We want to reduce it into a vector. How we are
00:09:04.200 --> 00:09:07.519
going to transform the elements along the way: we are doing
00:09:07.520 --> 00:09:13.359
plus one to each of them. If this syntax is new to you, just
00:09:13.360 --> 00:09:18.039
know that this #' just means that this thing that
00:09:18.040 --> 00:09:22.079
comes after it is the name of the function. In Common Lisp and
00:09:22.080 --> 00:09:26.079
Emacs Lisp, this is necessary, but for Clojure and Scheme,
00:09:26.080 --> 00:09:32.719
it is not. So we can see here that just this example is not much
00:09:32.720 --> 00:09:36.119
different than any other normal map call you might see made,
00:09:36.120 --> 00:09:40.239
but if nothing else, it's a handy way to convert a list to a
00:09:40.240 --> 00:09:44.999
vector or anything else. There are many, many reducers
00:09:45.000 --> 00:09:48.239
available and many different forms that we can
00:09:48.240 --> 00:09:52.624
collate the final value into.
NOTE A more involved example with comp
00:09:52.625 --> 00:09:55.086
Let's see a more involved example.
00:09:55.087 --> 00:09:58.049
Okay, now we've got some more meat here.
00:09:58.050 --> 00:10:01.772
Here we can see usage of the comp function
00:10:01.773 --> 00:10:05.255
and a custom source, ints.
00:10:05.256 --> 00:10:11.079
Ints is an infinite generator of integer values. That's not
00:10:11.080 --> 00:10:14.783
like a list or a file. It will generate infinitely.
00:10:14.784 --> 00:10:19.439
Comp is letting us compose multiple transducer functions
00:10:19.440 --> 00:10:23.759
together. Notice that this is the opposite order of what
00:10:23.760 --> 00:10:28.079
we'd usually be used to from a function like comp. The order
00:10:28.080 --> 00:10:32.679
here is top to bottom, basically, so that the map goes first,
00:10:32.680 --> 00:10:37.839
then the filter, and then the take. So effectively what
00:10:37.840 --> 00:10:40.919
we're doing is taking all the integers that exist,
00:10:40.920 --> 00:10:45.399
positive, adding one to them, filtering out only the even
00:10:45.400 --> 00:10:50.039
ones, but then just taking 10. Cons here is a function that
00:10:50.040 --> 00:10:57.039
just produces the ending result as a list. So what happens
00:10:57.040 --> 00:11:00.479
here specifically is how we are avoiding intermediate
00:11:00.480 --> 00:11:04.238
allocations. First, the number 0 will come through.
00:11:04.239 --> 00:11:07.879
It will be pulled out of this source internally by transduce.
00:11:07.880 --> 00:11:10.919
It will make its way into the map. The map will add one to it. Then it
00:11:10.920 --> 00:11:15.799
will immediately go into this filter step. So it's not like
00:11:15.800 --> 00:11:19.119
all the maps occur, and then all the filters occur. We do
00:11:19.120 --> 00:11:24.039
everything for each element. So the 0 comes in, now it's 1.
00:11:24.040 --> 00:11:27.559
The filter would occur. Well, it's going to fail that
00:11:27.560 --> 00:11:31.119
because it's not even, so it will just bail there. Now we'll
00:11:31.120 --> 00:11:35.239
go to the next one. Now 1 will come, it will become 2, then
00:11:35.240 --> 00:11:39.119
it will be saved by this evenp call, and then the take will
00:11:39.120 --> 00:11:42.599
capture it, because we only want 10 values here. You can
00:11:42.600 --> 00:11:45.239
see 2, 4, 6, 8, and so on is the result that we
00:11:45.240 --> 00:11:49.332
expect. So let's play around a little bit.
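NOTE Example sketch (not from the talk): a self-contained model of the element-by-element flow just described. Every my- name is a hypothetical stand-in, not the transducers.el API. As a simplification, the starting value is passed in explicitly and the infinite integer source is a plain counter loop. It assumes lexical binding, hence the first line.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'cl-lib)

(defun my-map (f)
  "Transducer: apply F to each element, then pass it down the chain."
  (lambda (next)
    (lambda (acc x) (funcall next acc (funcall f x)))))

(defun my-filter (pred)
  "Transducer: pass X down the chain only when PRED accepts it."
  (lambda (next)
    (lambda (acc x) (if (funcall pred x) (funcall next acc x) acc))))

(defun my-take (n)
  "Transducer: let N elements through (N >= 1), then abort the run."
  (lambda (next)
    (let ((seen 0))
      (lambda (acc x)
        (setq seen (1+ seen))
        (let ((acc (funcall next acc x)))
          (if (>= seen n) (throw 'my-done acc) acc))))))

(defun my-comp (&rest xforms)
  "Compose XFORMS so that the first one listed sees each element first."
  (lambda (reducer)
    (let ((chain reducer))
      (dolist (xf (reverse xforms) chain)
        (setq chain (funcall xf chain))))))

(defun my-run (xform reducer init start)
  "Feed START, START+1, ... through XFORM into REDUCER.
The loop only ends when some step throws to `my-done'."
  (let ((step (funcall xform reducer))
        (acc init)
        (i start))
    (catch 'my-done
      (while t
        (setq acc (funcall step acc i))
        (setq i (1+ i))))))

;; 0 becomes 1 and is dropped, 1 becomes 2 and is kept, and so on;
;; the run stops as soon as the tenth even number arrives.
(defvar my-result
  (nreverse
   (my-run (my-comp (my-map #'1+) (my-filter #'cl-evenp) (my-take 10))
           (lambda (acc x) (cons x acc))
           nil
           0)))
;; my-result is (2 4 6 8 10 12 14 16 18 20)
```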
NOTE In Emacs
00:11:49.333 --> 00:11:53.336
Let's jump into Emacs and see what we can do.
00:11:53.337 --> 00:11:58.500
Alright, you should see my Emacs screen here.
00:11:58.501 --> 00:12:04.359
These are the actual notes for the actual
00:12:04.360 --> 00:12:08.959
presentation done in Org Mode. I'll boost that up in size for
00:12:08.960 --> 00:12:12.639
a little bit. That should be more than big enough for you.
00:12:12.640 --> 00:12:17.719
Just by changing the reducer, we can change the result.
00:12:17.720 --> 00:12:21.079
Okay, now it's a vector. Well, what else can we do to it? Well,
00:12:21.080 --> 00:12:25.959
let's just add up the results. Maybe we just want to count the
00:12:25.960 --> 00:12:30.919
results. Oh, indeed, there were 10. What if we want to find
00:12:30.920 --> 00:12:36.959
the average of the results? What if we want to find the median
00:12:36.960 --> 00:12:40.959
of the results? And so on. Here's some more interesting
00:12:40.960 --> 00:12:45.839
things that we could do. We could add different steps. So
00:12:45.840 --> 00:12:51.239
here we have all the integers. Let's add, hmm, okay, we'll
00:12:51.240 --> 00:12:57.399
keep that. We're going to add t-enumerate. What enumerate does
00:12:57.400 --> 00:13:00.879
is for each item that comes through, it is
00:13:00.880 --> 00:13:06.039
going to add a sort of index to it and make it a pair. In this
00:13:06.040 --> 00:13:08.719
case, it's going to be equal to what came in here. Well, we can
00:13:08.720 --> 00:13:12.399
change it. If we start this at 1, now it will be different.
00:13:12.400 --> 00:13:15.519
1 will be paired with 0, and then 2 would be paired
00:13:15.520 --> 00:13:19.559
with 1, and so on. Well, except that the even call will change
00:13:19.560 --> 00:13:24.039
that a little bit. Why we're doing this is because we want
00:13:24.040 --> 00:13:27.279
to form a hash table. Let's move that down to 3, maybe
00:13:27.280 --> 00:13:31.439
we'll get a better result. What do we see? Okay, here now the
00:13:31.440 --> 00:13:37.359
result is a hash table. What are its values? Well, 0 seems
00:13:37.360 --> 00:13:40.479
to have... The key of 0 seems to be paired with 2, the key of
00:13:40.480 --> 00:13:42.909
1 seems to be paired with 4,
00:13:42.910 --> 00:13:47.411
and 2 seems to be paired with 6.
00:13:47.412 --> 00:13:51.293
Maybe let's jazz that up even a little bit more.
00:13:51.294 --> 00:13:52.973
We're going to start from a string
00:13:52.974 --> 00:13:57.943
and we'll call it hello.
00:13:57.944 --> 00:13:59.564
That's not going to work anymore
00:13:59.565 --> 00:14:02.585
and neither is that, but what we could do is
00:14:02.586 --> 00:14:05.498
we could say t-map #'string.
00:14:05.499 --> 00:14:08.627
I believe we'll do that.
00:14:08.628 --> 00:14:08.959
Let's see if that works. It did. So that's
00:14:08.960 --> 00:14:13.589
going to convert a character into a string.
00:14:13.590 --> 00:14:14.679
Let's just go two
00:14:14.680 --> 00:14:18.399
just to make it a little easier. Now you can see that we've
00:14:18.400 --> 00:14:21.919
constructed a hash table here. The key of 0 is mapped to the
00:14:21.920 --> 00:14:27.079
string of h and 1 is mapped to e. Now, I really like having
00:14:27.080 --> 00:14:29.468
this reducer in particular.
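NOTE Example sketch (not from the talk): the same table built by hand with core Emacs Lisp only, to make concrete what the hash-table result holds. The variable name my-table is hypothetical, and this loop is an equivalent written for illustration, not what the library does internally.

```elisp
(defvar my-table (make-hash-table :test #'eql))

;; Take the first two characters of "hello", turn each into a
;; one-character string, and store it under its position.
(let ((index 0))
  (dolist (ch (string-to-list (substring "hello" 0 2)))
    (puthash index (string ch) my-table)
    (setq index (1+ index))))

(gethash 0 my-table) ; "h"
(gethash 1 my-table) ; "e"
```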
NOTE Hash tables
00:14:29.469 --> 00:14:30.639
Know that hash tables are
00:14:30.640 --> 00:14:34.199
also legal sources. I find that both in Emacs Lisp and in
00:14:34.200 --> 00:14:37.119
Common Lisp, dealing with hash tables--like creating them
00:14:37.120 --> 00:14:41.599
and altering them--can be a bit of a pain. Having them
00:14:41.600 --> 00:14:45.679
immediately available like this with transducers is very
00:14:45.680 --> 00:14:49.079
handy, I find. We can work with something that wasn't a hash
00:14:49.080 --> 00:14:53.279
table. We can construct it in a way that makes it amenable to
00:14:53.280 --> 00:14:56.199
that, and then reduce it down into a hash table, and here you
00:14:56.200 --> 00:14:58.039
go. Very handy.
NOTE Clarity
00:14:58.040 --> 00:15:06.399
One last point is that you can see very clearly what
00:15:06.400 --> 00:15:10.479
this is attempting to do, as opposed to, say, a for loop. It's
00:15:10.480 --> 00:15:12.719
very clear what that step is doing, and then you can see what
00:15:12.720 --> 00:15:15.119
that is doing, and you know that the result is going to be two.
00:15:15.120 --> 00:15:18.559
Each line is kind of its own declarative step, and it should
00:15:18.560 --> 00:15:22.159
be clear, just by staring at this, basically what you're
00:15:22.160 --> 00:15:25.399
going to get out. One main difference from other
00:15:25.400 --> 00:15:29.599
languages that have things--say, for instance, Rust's
00:15:29.600 --> 00:15:35.439
iterator API--is the difference between the transducers
00:15:35.440 --> 00:15:41.639
and the reducers. If we go up here, for example, the
00:15:41.640 --> 00:15:44.679
difference between the transducers and the reducers and
00:15:44.680 --> 00:15:48.119
the sources is not explicitly laid out, whereas with
00:15:48.120 --> 00:15:53.119
transducers, it is. You have to be aware of how these things
00:15:53.120 --> 00:15:55.799
are different. I think that that helps clarity.
NOTE How do transducers work?
00:15:55.800 --> 00:16:01.999
Moving on. How do transducers work? Well,
00:16:02.000 --> 00:16:09.857
we want to go see the README.
00:16:09.858 --> 00:16:11.399
So, what we're going to do is
00:16:11.400 --> 00:16:19.102
we're going to go to here.
00:16:19.103 --> 00:16:21.959
You should still be able to see this.
00:16:21.960 --> 00:16:28.583
This is the CL example, actually.
00:16:28.584 --> 00:16:32.279
Let's go to transducers.el.
00:16:32.280 --> 00:16:37.744
Their APIs and READMEs are the same,
00:16:37.745 --> 00:16:39.919
but just for the sake of it, we will go see
00:16:39.920 --> 00:16:45.726
how this looks on the Emacs side,
00:16:45.727 --> 00:16:48.046
just so that nothing is a surprise.
00:16:48.047 --> 00:16:50.239
But recall that the APIs are essentially the same
00:16:50.240 --> 00:16:53.679
between the two. If you go to this section, writing your
00:16:53.680 --> 00:16:56.839
own primitives, you can read about how transducers are
00:16:56.840 --> 00:17:00.999
actually formed, whether or not you want to write them
00:17:01.000 --> 00:17:06.799
yourself. We can see here t-map. We accept the
00:17:06.800 --> 00:17:10.239
function that you want to operate with. Then you've got
00:17:10.240 --> 00:17:13.319
this extra little lambda here that's coming in, and it's
00:17:13.320 --> 00:17:17.079
receiving a thing that is named reducer. Now, while here
00:17:17.080 --> 00:17:20.439
we're calling it reducer, it's actually the chain of all the
00:17:20.440 --> 00:17:25.159
composed functions together. It's all those main
00:17:25.160 --> 00:17:28.479
transducer steps and, finally, the reducer, all
00:17:28.480 --> 00:17:31.879
composed together with normal function composition.
00:17:31.880 --> 00:17:35.877
That will matter very soon. Now here's the actual meat.
00:17:35.878 --> 00:17:40.519
We can see the accumulative result that's coming in with the
00:17:40.520 --> 00:17:45.739
current element. Now we need to operate on this.
00:17:45.740 --> 00:17:47.840
Were it a normal map, we would see us
00:17:47.841 --> 00:17:49.919
applying the F to the input.
00:17:49.920 --> 00:17:53.519
But here, you can see us applying the F to the input and then
00:17:53.520 --> 00:17:58.679
continuing on. So us calling the rest of the composed chain
00:17:58.680 --> 00:18:03.159
here is the effect of, in the previous slide, moving to the
00:18:03.160 --> 00:18:07.156
next step. We could ignore this line for now.
00:18:07.157 --> 00:18:13.819
If you're curious, please read the README in detail.
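NOTE Example sketch (not from the talk): a map transducer with the shape described above, written from the spoken description rather than copied from the README. The names my-map-sketch and my-step are hypothetical. The one-argument branch is presumably the line the talk says can be ignored for now. It assumes lexical binding, hence the first line.

```elisp
;;; -*- lexical-binding: t; -*-

(defun my-map-sketch (f)
  "Return a transducer that applies F to each element."
  (lambda (reducer)
    ;; REDUCER is really the rest of the composed chain,
    ;; with the actual reducer at the very end.
    (lambda (result &rest inputs)
      (if inputs
          ;; Normal step: transform the element, then continue on.
          (funcall reducer result (apply f inputs))
        ;; Done step (no element): hand the result onward unchanged.
        (funcall reducer result)))))

;; `+' works as the end of the chain because it accepts
;; zero, one, or two arguments.
(defvar my-step (funcall (my-map-sketch #'1+) #'+))

(funcall my-step 10 1) ; same as (+ 10 (1+ 1)), which is 12
(funcall my-step 10)   ; same as (+ 10), which is 10
```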
00:18:13.820 --> 00:18:15.579
Now, what about reducers?
00:18:15.580 --> 00:18:18.879
What do those look like? Well, let's just scroll
00:18:18.880 --> 00:18:22.439
down here. Recall that a reducer is a function that's
00:18:22.440 --> 00:18:26.959
consuming a stream, right? Zoom that up for you a little bit.
00:18:26.960 --> 00:18:33.919
Now, in the case of count, recall that this is how it's
00:18:33.920 --> 00:18:37.679
working, as we saw a moment ago. So clearly this list of five
00:18:37.680 --> 00:18:42.199
elements only has five things in it. Well, a reducer by
00:18:42.200 --> 00:18:47.599
structure is a function of two, one, or zero arguments. So we
00:18:47.600 --> 00:18:50.639
can see here in the case of two, this is the normal iterative
00:18:50.640 --> 00:18:54.519
case. We don't care about the input for count, we just care
00:18:54.520 --> 00:18:58.559
about the current accumulated count that we're doing, and
00:18:58.560 --> 00:19:02.879
we add one to it, and that's it. This then goes back to
00:19:02.880 --> 00:19:06.359
the loop and the whole process starts again with the next
00:19:06.360 --> 00:19:10.879
element. This kind of done case is used internally by
00:19:10.880 --> 00:19:16.879
the supervising function, transduce. It's just
00:19:16.880 --> 00:19:19.639
confirming the final result. Sometimes some
00:19:19.640 --> 00:19:21.839
post-processing is necessary here, but in the case of
00:19:21.840 --> 00:19:26.039
count, as it is so simple, that is not necessary. And now
00:19:26.040 --> 00:19:29.359
here's the base case. This is also used within that
00:19:29.360 --> 00:19:34.319
supervising transduce function at the very top. Well, if
00:19:34.320 --> 00:19:36.679
you're counting, you have to start from somewhere, right?
00:19:36.680 --> 00:19:37.349
In this case, well, what you're starting with is zero.
00:19:37.350 --> 00:19:40.251
In the case of cons, you'd be starting with an empty list.
00:19:40.252 --> 00:19:44.434
In the case of vector, you'd be starting
00:19:44.435 --> 00:19:53.999
with an empty vector and so on.
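NOTE Example sketch (not from the talk): a counting reducer with the three arities just described, plus a tiny list-only driver standing in for the supervising function. Both names, my-count and my-reduce-list, are hypothetical and do not come from transducers.el.

```elisp
(defun my-count (&rest args)
  "Reducer of zero, one, or two arguments that counts elements."
  (pcase (length args)
    (0 0)                ; base case: counting starts from zero
    (1 (car args))       ; done case: confirm the final result as is
    (2 (1+ (car args)))  ; iterative case: ignore the input, add one
    (_ (error "my-count: expected 0, 1, or 2 arguments"))))

(defun my-reduce-list (reducer items)
  "Drive REDUCER over the list ITEMS, one element at a time."
  (let ((acc (funcall reducer)))           ; ask for the starting value
    (dolist (item items)
      (setq acc (funcall reducer acc item)))
    (funcall reducer acc)))                ; let the reducer finish up

(my-reduce-list #'my-count '(a b c d e)) ; 5
```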
00:19:54.000 --> 00:19:56.799
Once again, if you are more curious, please take a look at
00:19:56.800 --> 00:19:57.679
the README.
NOTE Transducers in the wild - CSV
00:20:00.520 --> 00:20:06.039
Okay, transducers in the wild. Well, let's go take a look at
00:20:06.040 --> 00:20:07.639
processing some CSV data.
00:20:07.640 --> 00:20:21.319
We're going to open up a new Emacs Lisp bracket here. So I have
00:20:21.320 --> 00:20:28.839
a file. And in this file, let's just go look at C-x b right
00:20:28.840 --> 00:20:34.839
there, you will see that we've got some bank transaction
00:20:34.840 --> 00:20:37.879
information. It's got these transactions from a whole
00:20:37.880 --> 00:20:40.199
bunch of different people into different accounts,
00:20:40.200 --> 00:20:43.879
whether it's money coming in, money going out, and then a
00:20:43.880 --> 00:20:47.839
basic description. How's your Latin? But for this little
00:20:47.840 --> 00:20:53.679
test, what we want to do is we want to find Bob's final bank
00:20:53.680 --> 00:20:59.679
balance. Let's get on to it. First of all, let's
00:20:59.680 --> 00:21:04.444
just confirm, let's do some basic stuff.
00:21:04.445 --> 00:21:10.844
with-current-buffer, find-file-noselect.
00:21:10.845 --> 00:21:15.542
What's the name of that file?
00:21:15.543 --> 00:21:17.439
This is pre-organized, so you
00:21:17.440 --> 00:21:20.879
will just see it right here.
00:21:20.880 --> 00:21:26.999
t-transduce and t-comp. We don't know what we're going to comp
00:21:27.000 --> 00:21:33.039
yet. Actually, I'll just pass to show you. And then we will
00:21:33.040 --> 00:21:36.999
see, let's just do a little t-count just to confirm. What's
00:21:37.000 --> 00:21:45.112
our source? Well, our source is a buffer, t-buffer-read.
00:21:45.113 --> 00:21:50.153
And note that because we're using with-current-buffer,
00:21:50.154 --> 00:21:55.079
if we go like this, if we go current-buffer, this will just work. So
00:21:55.080 --> 00:21:59.919
now let's... Well, that was odd. I should have done it like
00:21:59.920 --> 00:22:02.159
that. There we go. So now we should make that a little smaller
00:22:02.160 --> 00:22:04.799
so you can see what it is. Now if we hit RET, we should get the
00:22:04.800 --> 00:22:09.559
right result. Okay, so there are 50,001 lines in this file,
00:22:09.560 --> 00:22:13.516
but the one extra one is the name of the headers, right?
00:22:13.517 --> 00:22:18.079
We want to process this file in more detail. So how can we do
00:22:18.080 --> 00:22:22.079
that? Well, let's start by just automatically
00:22:22.080 --> 00:22:28.799
interpreting the results as CSV. If we do that, okay, well
00:22:28.800 --> 00:22:31.559
now we only have 50,000 entries as we expected, right?
00:22:31.560 --> 00:22:36.759
Because it's going to pull out the header line. If we now say
00:22:36.760 --> 00:22:42.679
we want to just filter out, you know, we only want Bob, right?
00:22:42.680 --> 00:22:53.679
So if... gethash, it was in the row of name. Each line here is
00:22:53.680 --> 00:22:57.079
made into, at least by default, is made into a hash map. So if
00:22:57.080 --> 00:23:02.759
we go like this, we should see that. Okay, so 12,000 of these
00:23:02.760 --> 00:23:05.639
lines or thereabout belong to Bob.
00:23:05.640 --> 00:23:13.839
Let's just move that over a little bit. Actually, I suppose we don't even
00:23:13.840 --> 00:23:17.799
need that anymore. I'll just keep that full size for you.
00:23:17.800 --> 00:23:24.399
Okay, so all right, there are about 12,000 results for Bob of
00:23:24.400 --> 00:23:32.479
the 50,000. What's next? Well, we want to confirm,
00:23:32.480 --> 00:23:40.039
we want to pull out everything,
00:23:40.040 --> 00:23:43.079
all of the in and the out entries.
00:23:43.080 --> 00:23:56.279
Thank you. So, string to number, because we know that
00:23:56.280 --> 00:24:01.239
everything came in as strings. Unfortunately, the from-csv
00:24:01.240 --> 00:24:03.799
doesn't try to be smart at all, it's just pulling everything
00:24:03.800 --> 00:24:09.479
in as string values. If you want actual things to be
00:24:09.480 --> 00:24:13.399
numbers or whatever, that is up to you to do the parsing
00:24:13.400 --> 00:24:20.679
yourself. Okay, so we have those two values now. We know
00:24:20.680 --> 00:24:23.879
that we saw from the data just a moment ago that you're only
00:24:23.880 --> 00:24:26.999
going to have a value in one column or the other. It's either
00:24:27.000 --> 00:24:29.119
going to be 0 in the empty one, or you're going to have some
00:24:29.120 --> 00:24:32.159
number in the other. So we know that we can just naively add
00:24:32.160 --> 00:24:35.479
them. If it was in, it would always be positive. So we'll just
00:24:35.480 --> 00:24:41.519
add that. But in the negative case, we want to just make it
00:24:41.520 --> 00:24:45.279
negative really briefly before we add them all together.
00:24:45.280 --> 00:24:50.519
Let's now just prove to ourselves that we are sane here. What
00:24:50.520 --> 00:24:52.479
we're going to do is we're going to quickly go say take
00:24:52.480 --> 00:24:57.039
5 just to convince ourselves, and we'll go cons, and let's
00:24:57.040 --> 00:24:59.839
see if we get the kind of results that make sense. Okay, these
00:24:59.840 --> 00:25:02.799
sort of make sense. It looks like, you know, Bob's got some big
00:25:02.800 --> 00:25:07.679
expenses here. If we take say 15, does it look any better?
00:25:07.680 --> 00:25:10.319
Okay, looks like he had a payday. All right, good job Bob.
00:25:10.320 --> 00:25:15.439
Let's get back in there. Now we only really care about
00:25:15.440 --> 00:25:20.119
adding the final result, right? So there we go. Add that all
00:25:20.120 --> 00:25:24.559
together and we'll see what we get in a moment. Okay, wow,
00:25:24.560 --> 00:25:27.519
Bob's rich. Okay, so it looks like in his 12,000
00:25:27.520 --> 00:25:32.279
transactions, Bob has an overall net worth of $8.5 million.
00:25:32.280 --> 00:25:34.439
Looking pretty good.
00:25:34.440 --> 00:25:38.999
So here's an example of how you can, particularly in Emacs
00:25:39.000 --> 00:25:42.959
Lisp, how you can very easily just get a file, consider it the
00:25:42.960 --> 00:25:45.879
current buffer, and then just do whatever you want to it.
00:25:45.880 --> 00:25:50.359
Note that there is sort of first-class support for both CSV
00:25:50.360 --> 00:25:54.359
and JSON, and both of those bring in their
00:25:54.360 --> 00:25:57.719
values as hash maps, and then you're just free to do whatever
00:25:57.720 --> 00:26:00.439
you want and process them, potentially both writing them
00:26:00.440 --> 00:26:03.239
back out as CSV or JSON once again.
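NOTE Example sketch (not from the talk): the same balance calculation written by hand with core Emacs Lisp only, to spell out the steps of the demo. The sample rows, the column order (Name, Account, In, Out, Description), and the names my-csv-text and my-balance-for are all assumptions made for illustration; the real file from the demo is not shown in the transcript. The naive comma split also assumes no quoted fields.

```elisp
(defvar my-csv-text
  "Name,Account,In,Out,Description
Bob,checking,100,,salary
Alice,savings,50,,gift
Bob,checking,,30,groceries
")

(defun my-balance-for (name csv-text)
  "Sum In minus Out over every row of CSV-TEXT that belongs to NAME."
  (with-temp-buffer
    (insert csv-text)
    (goto-char (point-min))
    (forward-line 1)                       ; skip the header line
    (let ((total 0))
      (while (not (eobp))
        (let* ((line (buffer-substring-no-properties
                      (line-beginning-position) (line-end-position)))
               (fields (split-string line ",")))
          (when (equal (nth 0 fields) name)
            ;; Every field arrives as a string; an empty one parses as 0.
            (setq total (+ total
                           (- (string-to-number (nth 2 fields))
                              (string-to-number (nth 3 fields)))))))
        (forward-line 1))
      total)))

(my-balance-for "Bob" my-csv-text) ; 70
```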
NOTE Issues and next steps
00:26:03.240 --> 00:26:10.719
Some issues can come up with transducers.
00:26:10.720 --> 00:26:14.919
One is that a zip operator is missing, but I'm working on it.
00:26:14.920 --> 00:26:19.399
Two is that performance, particularly in Emacs Lisp, isn't
00:26:19.400 --> 00:26:24.119
that great. It could be due to the sort of nested lambda calls
00:26:24.120 --> 00:26:27.759
that have to occur internally, but the Common Lisp
00:26:27.760 --> 00:26:32.239
implementation is quite good. And there's as yet no support
00:26:32.240 --> 00:26:35.399
for parallelism. You can imagine that a lot of those steps
00:26:35.400 --> 00:26:38.559
you could potentially perform in parallel depending on the
00:26:38.560 --> 00:26:44.399
platform, but research has not yet gotten that far. Okay,
00:26:44.400 --> 00:26:47.639
that's all. Thank you very much. If you have any questions,
00:26:47.640 --> 00:26:51.240
please contact me.