WEBVTT captioned by sachac
00:00:00.000 --> 00:00:13.359
Hello, I'm Danny McClanahan. This is EmacsConf 2024. And
00:00:13.360 --> 00:00:17.159
this presentation is ostensibly about Emacs Regex
00:00:17.160 --> 00:00:22.639
compilation. But it'll lead a lot more in future
00:00:22.640 --> 00:00:30.879
directions. Thanks for coming on this journey with me.
00:00:30.880 --> 00:00:36.719
This presentation is 50 slides, 50 footnotes, and that's
00:00:36.720 --> 00:00:40.679
intended for it to be a resource later on for your perusal. We
00:00:40.680 --> 00:00:44.199
are unfortunately not going to be able to go into all of it,
00:00:44.200 --> 00:00:49.439
but I will try to be within 20 minutes so we can make it
00:00:49.440 --> 00:00:56.199
throughout Q&A. This is the structure of the talk.
00:00:56.200 --> 00:01:03.519
But enough about me. Who are you? And why are you here?
00:01:03.520 --> 00:01:09.479
I'm Danny McClanahan.
00:01:09.480 --> 00:01:13.439
My experience is a lot in build tools, especially in the
00:01:13.440 --> 00:01:19.399
package managers. That started because I realized I was
00:01:19.400 --> 00:01:23.319
wasting a lot of time. Then I didn't like that. I
00:01:23.320 --> 00:01:29.439
started wasting a lot of time, trying to avoid wasting time.
00:01:29.440 --> 00:01:35.479
Then I ended up... going so far around that I ended up
00:01:35.480 --> 00:01:40.319
stopping other people from wasting their own time, in this
00:01:40.320 --> 00:01:44.359
case, regarding failing builds. But this is a kind of
00:01:44.360 --> 00:01:47.479
pattern that you'll see. I'm talking a lot about patterns in
00:01:47.480 --> 00:01:52.399
this presentation. Parsing in text is another one of
00:01:52.400 --> 00:01:57.479
those tendencies that I have. Why am I here? I've got a lot
00:01:57.480 --> 00:02:00.639
of feelings about text. For the next 20 minutes, I'm
00:02:00.640 --> 00:02:06.079
making it your problem.
00:02:06.080 --> 00:02:09.639
First off, a huge shout out to Emacs Devel and the Emacs
00:02:09.640 --> 00:02:12.919
community in general. I spent a lot of time learning about
00:02:12.920 --> 00:02:15.559
what I'm about to talk about. I was definitely super
00:02:15.560 --> 00:02:19.439
confused at first. Then when I became less confused and I
00:02:19.440 --> 00:02:23.919
decided I was going to look at the regular expressions of the
00:02:23.920 --> 00:02:28.039
Regex engine, I was like, oh, it's old C code. It's
00:02:28.040 --> 00:02:33.559
Emacs. We can just use modern techniques. Turns out that's
00:02:33.560 --> 00:02:37.839
wrong for kind of two reasons. One, because using modern
00:02:37.840 --> 00:02:41.479
techniques or other engines don't necessarily do what
00:02:41.480 --> 00:02:44.799
Emacs regex engine currently does. Then secondarily,
00:02:44.800 --> 00:02:48.719
that's not actually as interesting as the other kind of
00:02:48.720 --> 00:02:52.359
larger goals that emacs-devel discussed. Thank you, Eli
00:02:52.360 --> 00:02:56.279
Zaretskii, so, so much, especially Pip Cet and everyone else
00:02:56.280 --> 00:02:59.319
as well--I believe--Pip Cet, I hope I'm pronouncing that
00:02:59.320 --> 00:03:01.799
correctly. Thank you so much. I'll be shouting you out
00:03:01.800 --> 00:03:04.319
later as well. Then these larger goals ended up
00:03:04.320 --> 00:03:07.199
overlapping a lot with my own research interests. And
00:03:07.200 --> 00:03:09.879
that's very exciting. I'm hoping it's exciting for you
00:03:09.880 --> 00:03:14.079
too. What is a regular expression? And when and how does
00:03:14.080 --> 00:03:16.559
implementation match formal theory? So what does formal
00:03:16.560 --> 00:03:24.079
theory mean? And we'll talk about that.
00:03:24.080 --> 00:03:27.519
What is a regular expression? So I might ask you this
00:03:27.520 --> 00:03:30.799
question, and you might give an answer. Then I might ask
00:03:30.800 --> 00:03:33.519
someone else, and they might have an answer. Then I might
00:03:33.520 --> 00:03:38.039
ask myself, and I might try to think of an answer. Our
00:03:38.040 --> 00:03:41.799
answers would, you know, see, the thing is, they'd all be
00:03:41.800 --> 00:03:45.359
correct, but they'd probably be slightly different, and
00:03:45.360 --> 00:03:50.319
they'd be different in kind of important ways. I'm
00:03:50.320 --> 00:03:55.039
using formal theory to kind of describe what unifies these
00:03:55.040 --> 00:04:00.119
interpretations and what causes this sort of divergence,
00:04:00.120 --> 00:04:05.439
both over time and then across code bases. I'm kind of
00:04:05.440 --> 00:04:09.319
putting a flag in the ground here and saying formal theory is
00:04:09.320 --> 00:04:12.999
actually a really, really negative influence, I think, but
00:04:13.000 --> 00:04:15.999
it can be better. That's what I'm going to talk about in
00:04:16.000 --> 00:04:19.519
this talk, in this presentation. We might ask, how did
00:04:19.520 --> 00:04:26.679
this happen? and we might try to find a start state. We
00:04:26.680 --> 00:04:30.519
might put that place at the theories of formal languages
00:04:30.520 --> 00:04:34.679
that kind of arose, especially post Turing and post
00:04:34.680 --> 00:04:37.519
Chomsky. Especially there was this really, really
00:04:37.520 --> 00:04:40.119
interesting and powerful relationship with formal
00:04:40.120 --> 00:04:43.959
languages between representation and computation. And
00:04:43.960 --> 00:04:48.599
then on top of that, we have regex as this really powerful
00:04:48.600 --> 00:04:52.159
union of theory and practice And then, like I mentioned,
00:04:52.160 --> 00:04:55.799
this is kind of divergence that kind of occurs. This
00:04:55.800 --> 00:04:58.079
divergence happens for a good reason. This happens because
00:04:58.080 --> 00:05:01.999
people were adding implementations and people adding
00:05:02.000 --> 00:05:04.639
features to implementations. While the people adding
00:05:04.640 --> 00:05:06.679
these features were often academics, they were
00:05:06.680 --> 00:05:09.199
industries, people that were hobbyists, they were
00:05:09.200 --> 00:05:11.999
interested in building practical tools. This is a good
00:05:12.000 --> 00:05:14.879
thing. This is still a good thing, even though it moves a
00:05:14.880 --> 00:05:18.199
little bit away from formal theory. But we start seeing some
00:05:18.200 --> 00:05:22.639
cracks developing, and we'll go into that in a second. We're
00:05:22.640 --> 00:05:27.519
just going to kind of electric slide into the 1980s here, and
00:05:27.520 --> 00:05:31.879
we're going to be confronted with two occurrences very
00:05:31.880 --> 00:05:35.639
similarly. We might call it simultaneous discovery. In
00:05:35.640 --> 00:05:38.559
1983, you have Michael Jackson demonstrating the
00:05:38.560 --> 00:05:41.999
moonwalk. Three years later, we have backtracking
00:05:42.000 --> 00:05:44.999
developed to stimulate EGREP-style regular expressions.
00:05:45.000 --> 00:05:48.599
These would both be incredibly influential in their own
00:05:48.600 --> 00:05:54.039
kind of branching paths. Here's where the gloves come
00:05:54.040 --> 00:06:00.759
off. Formal theory, I claim, remains largely concerned
00:06:00.760 --> 00:06:03.359
with incremental improvements to artificial benchmarks,
00:06:03.360 --> 00:06:07.279
and much less with expanding models to cover actual user
00:06:07.280 --> 00:06:11.799
needs. This isn't just about, oh, if you listened to
00:06:11.800 --> 00:06:15.999
users, that you'd be a nicer person, you'd be a better
00:06:16.000 --> 00:06:19.359
engineer. What I'm actually saying is that they're missing
00:06:19.360 --> 00:06:23.919
out. When you don't listen to applications, you miss out on a
00:06:23.920 --> 00:06:26.639
lot of fantastic opportunities for novel theory. So
00:06:26.640 --> 00:06:30.839
this is, again, my complaint with formal theory as it
00:06:30.840 --> 00:06:34.599
stands. But we're gonna do better. Before we get better,
00:06:34.600 --> 00:06:36.959
we're gonna get, a little bit worse for a bit. We're going to
00:06:36.960 --> 00:06:40.359
actually get a little bit worse is better. What I mean by
00:06:40.360 --> 00:06:43.239
that is, by the 1990s, we start looking into these
00:06:43.240 --> 00:06:46.479
non-backtracking engines. This is a bit of a reaction to
00:06:46.480 --> 00:06:50.399
backtracking. The current ones include RE2,
00:06:50.400 --> 00:06:53.919
hyperscan, and the rust regex library. These are all
00:06:53.920 --> 00:06:56.439
great. I'll talk about them later as well. They make use
00:06:56.440 --> 00:06:58.719
of these. They kind of call back to the earlier formal
00:06:58.720 --> 00:07:01.479
theory. They have linear runtimes for well-specified
00:07:01.480 --> 00:07:02.519
search tasks.
00:07:02.520 --> 00:07:08.079
What happens if that doesn't fit your needs? We're going to
00:07:08.080 --> 00:07:11.479
talk about that. We're going to table that for a second,
00:07:11.480 --> 00:07:15.319
and we're going to focus more on Emacs, the subject of this
00:07:15.320 --> 00:07:19.359
conference. What are regex used for? And in this
00:07:19.360 --> 00:07:22.439
particular case, they're used for lots of things, with
00:07:22.440 --> 00:07:25.199
practically, and I think they should be. But more
00:07:25.200 --> 00:07:29.559
specifically, how do Emacs users use them? And I'm going to
00:07:29.560 --> 00:07:32.679
focus in on this text as input and output. I'll be kind of
00:07:32.680 --> 00:07:38.959
elaborating on this analogy as we continue. Why is text
00:07:38.960 --> 00:07:43.399
powerful? Text as I/O. The reason text programming
00:07:43.400 --> 00:07:45.759
languages and not just programming languages, but
00:07:45.760 --> 00:07:49.159
languages themselves, the reason why they're successful
00:07:49.160 --> 00:07:52.279
and why they propagate, I claim, is because text is both
00:07:52.280 --> 00:07:56.439
input readable and output writable. What this means
00:07:56.440 --> 00:08:01.199
is that if you receive something in text, you can read it, And
00:08:01.200 --> 00:08:04.239
then you can also write it, you can modify it, and you can
00:08:04.240 --> 00:08:06.959
produce a new version of it. You're on a kind of level
00:08:06.960 --> 00:08:10.959
playing field. That's not always the case, though. You
00:08:10.960 --> 00:08:15.959
recall that I've worked a lot with build systems and package
00:08:15.960 --> 00:08:20.999
managers. There's a discussion that goes by the name of
00:08:21.000 --> 00:08:25.319
software supply chain security. I think it's a massive
00:08:25.320 --> 00:08:29.079
joke. The reason why is because people largely raise it
00:08:29.080 --> 00:08:34.279
to explain why their for-profit company with their
00:08:34.280 --> 00:08:38.079
for-profit product is going to solve the problem for you, as
00:08:38.080 --> 00:08:41.959
opposed to the commons of open source. If you are unable to
00:08:41.960 --> 00:08:44.999
modify or deploy your code without employing an opaque
00:08:45.000 --> 00:08:48.599
external system, I think, then you have a hidden
00:08:48.600 --> 00:08:53.879
dependency. you don't remove a dependency, you just, by,
00:08:53.880 --> 00:08:59.239
for example, paying into a for-profit product or using a
00:08:59.240 --> 00:09:01.519
closed-off supply chain, you end up just having a hidden
00:09:01.520 --> 00:09:04.719
dependency, you end up just displacing that. This can
00:09:04.720 --> 00:09:07.639
actually exert arbitrary control over your programming
00:09:07.640 --> 00:09:11.279
output and potentially even your thoughts. This is really
00:09:11.280 --> 00:09:15.839
important. I'm going to dive in a little bit deeper and I'm
00:09:15.840 --> 00:09:18.999
going to overload the term locality here. I'm going to
00:09:19.000 --> 00:09:22.239
say, if you cannot reproduce a system locally, it becomes an
00:09:22.240 --> 00:09:24.999
opaque external system. I'm going to give examples
00:09:25.000 --> 00:09:27.479
here, and these are going to be a bit of a hot take. First
00:09:27.480 --> 00:09:30.519
off, GUI IDEs. I think we might, well, some of us might agree
00:09:30.520 --> 00:09:34.519
with that here. I say development environments that only
00:09:34.520 --> 00:09:38.119
allow you to use a graphical interface, do not expose
00:09:38.120 --> 00:09:42.799
interaction with text, are explicitly trying to kind of
00:09:42.800 --> 00:09:46.239
place you on a separate kind of plane where you're not an
00:09:46.240 --> 00:09:50.439
equal contributor to the people who make the development
00:09:50.440 --> 00:09:53.079
environment, make the development kind of frameworks
00:09:53.080 --> 00:09:57.399
here. We'll go one further. Cloud services are precisely,
00:09:57.400 --> 00:10:00.039
you know, they're useful for things that, you know, that
00:10:00.040 --> 00:10:04.399
require large domain computation, but, you know, Twitter,
00:10:04.400 --> 00:10:08.679
for example, didn't actually ever use any cloud services,
00:10:08.680 --> 00:10:12.199
external ones, because it was really important for them to
00:10:12.200 --> 00:10:14.999
actually own their own hardware, their own computation,
00:10:15.000 --> 00:10:20.199
their own thinking. Cloud services are a way to ensure
00:10:20.200 --> 00:10:24.919
that you're unable to reproduce a system without paying an
00:10:24.920 --> 00:10:28.039
amount per month, an amount per day, an amount per second, an
00:10:28.040 --> 00:10:32.439
amount per cycle to an external entity. I'm just going to
00:10:32.440 --> 00:10:35.559
conclude this with, I'd say, the argumentum ad absurdum,
00:10:35.560 --> 00:10:39.239
here, where large language models are all of these at once.
00:10:39.240 --> 00:10:42.879
They are a cloud service, specifically, and this is what
00:10:42.880 --> 00:10:48.439
makes them very evil, to make it so that, similar to GUI IDEs,
00:10:48.440 --> 00:10:52.919
so that text itself loses that ability to be both readable
00:10:52.920 --> 00:10:56.199
and writable. Instead, text is both unreadable, because
00:10:56.200 --> 00:10:59.519
it's produced by a machine, and then also unwritable,
00:10:59.520 --> 00:11:02.999
because you're subservient and subjugated to the machine,
00:11:03.000 --> 00:11:05.359
to the large language model to produce the code in the first
00:11:05.360 --> 00:11:08.919
place. You lose this input, output, readable, writable
00:11:08.920 --> 00:11:13.359
behavior that I claim text has specifically. To
00:11:13.360 --> 00:11:19.439
underline this, what is text? Text is local. Finally,
00:11:19.440 --> 00:11:23.639
we're at the subject of this conference. Emacs, I have
00:11:23.640 --> 00:11:27.479
double hearts with text. I start off the slide saying Emacs
00:11:27.480 --> 00:11:31.519
is a text editor. I think that's a good start. Which
00:11:31.520 --> 00:11:34.319
implements much of its own logic and user interface via
00:11:34.320 --> 00:11:38.399
text. What this means is that, you know, I say without
00:11:38.400 --> 00:11:42.639
trying, Emacs tries very hard, but without trying so hard,
00:11:42.640 --> 00:11:47.639
Emacs, is imbued with all of the capabilities that text has
00:11:47.640 --> 00:11:51.319
specifically. When you use text like Emacs does, and
00:11:51.320 --> 00:11:55.519
particularly you then start offering mechanisms to query,
00:11:55.520 --> 00:11:59.999
to transform, and to generally metaprogram text itself,
00:12:00.000 --> 00:12:03.319
you don't just have the ability to edit code in new ways. And
00:12:03.320 --> 00:12:06.999
this is something that I think is often lost, maybe not by
00:12:07.000 --> 00:12:11.239
participants of this conference, you particularly start
00:12:11.240 --> 00:12:14.319
being able to not only just edit code differently, but to
00:12:14.320 --> 00:12:16.599
change the way that you think about code and actually to
00:12:16.600 --> 00:12:20.239
expand your range of thought, the range of actions that you
00:12:20.240 --> 00:12:22.719
can perform. You can actually start then editing at the
00:12:22.720 --> 00:12:25.799
speed of thought. This is where especially Regex kind of
00:12:25.800 --> 00:12:30.319
comes into play. Finally, we get to the subject of the
00:12:30.320 --> 00:12:33.599
title of this talk. I'm about to disappoint a lot of
00:12:33.600 --> 00:12:38.759
people. I claim for good reason. Unfortunately, it's a
00:12:38.760 --> 00:12:41.599
very brief walkthrough, but I'm going to go over what the
00:12:41.600 --> 00:12:43.799
current Emacs Redix engine is. This is going to give us
00:12:43.800 --> 00:12:48.119
enough context for the next section on future directions.
00:12:48.120 --> 00:12:51.799
Quickly, it's a backtracking engine over a multi-byte
00:12:51.800 --> 00:12:53.919
code point. I'll define what that means. It's in
00:12:53.920 --> 00:12:58.439
regex-emacs.c. It's invoked in two ways, which you'll see
00:12:58.440 --> 00:13:01.759
is actually the same way, over a single contiguous string
00:13:01.760 --> 00:13:05.359
input. This is a Lisp string that you pass in. or over the
00:13:05.360 --> 00:13:07.039
two halves of the gap buffer. This is when you match
00:13:07.040 --> 00:13:11.879
against a buffer text. We'll go into that a little bit
00:13:11.880 --> 00:13:13.919
more, but this is one of the really actually interesting and
00:13:13.920 --> 00:13:17.839
specific things about Emacs Regex Engine as it stands. So
00:13:17.840 --> 00:13:21.559
very, very quickly, this is the data layout. This is just, if
00:13:21.560 --> 00:13:24.879
you're interested, this is where the code lies. So
00:13:24.880 --> 00:13:30.159
regex-emacs.h has re-pattern buffer, which is a struct
00:13:30.160 --> 00:13:34.239
Actually, you know, I love, by the way, I love the Emacs C
00:13:34.240 --> 00:13:37.359
source code. It's so nice to read. It made all this so, so
00:13:37.360 --> 00:13:41.119
easy. I really appreciated it. In this particular case,
00:13:41.120 --> 00:13:44.039
I'm just going to focus on re-pattern buffer actually has
00:13:44.040 --> 00:13:47.999
the compiler. It's a C struct. It has every single thing
00:13:48.000 --> 00:13:52.559
that is needed to execute the regular expression against a
00:13:52.560 --> 00:13:56.319
string input or against a buffer input. This buffer,
00:13:56.320 --> 00:13:59.839
It's not an Emacs buffer. It refers to just the instruction
00:13:59.840 --> 00:14:04.039
table and the match loop. Again, this is very, very
00:14:04.040 --> 00:14:07.839
brief, but I want to specifically focus on the first part. So
00:14:07.840 --> 00:14:11.879
this is this inner matching loop, and there's a prologue,
00:14:11.880 --> 00:14:15.679
and then there's a loop body, and there's an epilogue. And
00:14:15.680 --> 00:14:18.279
the prologue is the really, really interesting part. I say
00:14:18.280 --> 00:14:22.839
extract current and next char. What Emacs does here, it
00:14:22.840 --> 00:14:27.159
doesn't just reach for the next byte. It actually will
00:14:27.160 --> 00:14:31.879
perform lazily in some sense, this variable integer size
00:14:31.880 --> 00:14:36.039
VAR decoding for multi-byte, and it'll actually then
00:14:36.040 --> 00:14:43.959
decode the next one to four bytes. Up to 32 bits at once, and
00:14:43.960 --> 00:14:46.799
then it'll actually go into the loop. We'll talk about the
00:14:46.800 --> 00:14:52.519
implications of that later. Next, in the body of the loop, we
00:14:52.520 --> 00:14:54.239
read the instruction from the instruction pointer, which
00:14:54.240 --> 00:14:57.319
is, again, in that buffer field. Then we have this big
00:14:57.320 --> 00:14:59.479
switch statement, which is actually, love a big switch
00:14:59.480 --> 00:15:02.079
statement, super easy to read, super easy to understand
00:15:02.080 --> 00:15:05.399
kind of what's occurring. Then that's the loop body. And
00:15:05.400 --> 00:15:08.279
then at the end of it, we either increment the instruction
00:15:08.280 --> 00:15:11.119
pointer if it was matching a single character or something
00:15:11.120 --> 00:15:14.839
along those lines, or if it was a jump, we don't do that. A
00:15:14.840 --> 00:15:18.199
jump, however, it's not referring to a jump in the sense of a
00:15:18.200 --> 00:15:22.519
go-to, but a jump that's elsewhere within that table, that
00:15:22.520 --> 00:15:25.839
buffer field. If you've included a capture, we write
00:15:25.840 --> 00:15:29.479
that end position there. Of course, well, as you may
00:15:29.480 --> 00:15:34.439
recall, the zeroth capture is, of course, the entire match
00:15:34.440 --> 00:15:36.559
string. If the capture is zero, then we know we've
00:15:36.560 --> 00:15:39.839
actually completed that match. That's really great.
00:15:39.840 --> 00:15:43.599
I would love to receive Q&A about this as well. I've spent a
00:15:43.600 --> 00:15:46.719
lot of time kind of learning and understanding it. But it's
00:15:46.720 --> 00:15:49.879
really interesting that this can be described in a single
00:15:49.880 --> 00:15:52.159
slide because it's really simple. That simplicity is
00:15:52.160 --> 00:15:54.639
actually a really powerful thing. I'll mention that in
00:15:54.640 --> 00:15:58.759
the next section. I say, is that all? And I apologize for
00:15:58.760 --> 00:16:02.239
not doing so. But please, please ask questions in Q&A or
00:16:02.240 --> 00:16:04.999
message me about this, because I think it's really, really,
00:16:05.000 --> 00:16:07.079
again, interesting. Again, I find the code relatively
00:16:07.080 --> 00:16:11.999
easy to read. Now, here's, I think this is actually the
00:16:12.000 --> 00:16:15.519
point of the talk. The rest of it was, you know, I think just me
00:16:15.520 --> 00:16:18.839
posturing. This is the really, really interesting part.
00:16:18.840 --> 00:16:22.039
This is the ways that we can improve, well, not just we can
00:16:22.040 --> 00:16:25.839
improve stuff in Emacs, but why those are the right things to
00:16:25.840 --> 00:16:30.279
improve. Then also how that can be a model for even things
00:16:30.280 --> 00:16:35.079
outside of Emacs. This is gonna be a lot of text. I'm not
00:16:35.080 --> 00:16:38.879
gonna go through all of it. This is the one thing that I tried.
00:16:38.880 --> 00:16:42.239
This is the thing that I thought would be a slam dunk, easy
00:16:42.240 --> 00:16:47.439
solution. My initial thought process was, well, We tried
00:16:47.440 --> 00:16:52.919
very hard to do an LRU cache here. It works. It's actually
00:16:52.920 --> 00:16:57.399
very effective. However, though, we don't actually give
00:16:57.400 --> 00:17:00.479
the user, the list programmer, the ability to then say, I
00:17:00.480 --> 00:17:03.079
know that this regex is something that is going to be used
00:17:03.080 --> 00:17:06.399
again. I made an artificial benchmark. I made an
00:17:06.400 --> 00:17:10.039
artificial benchmark because I wanted to show there is one
00:17:10.040 --> 00:17:13.639
very specific case that it does solve, but it's the same
00:17:13.640 --> 00:17:16.919
issue with the artificial benchmarks. mentioned earlier.
00:17:16.920 --> 00:17:21.559
It's very specifically crafted in order to show that this
00:17:21.560 --> 00:17:25.319
particular solution would produce some speedup. What
00:17:25.320 --> 00:17:29.599
this means is it just creates more than 20 regexps in a row. It
00:17:29.600 --> 00:17:31.959
compiles them. Then, of course, because we just don't
00:17:31.960 --> 00:17:35.159
pay the compile costs, because we don't go through that
00:17:35.160 --> 00:17:39.079
cache eviction process, it ends up being faster. But this
00:17:39.080 --> 00:17:42.079
isn't really mean very much, particularly the goal here,
00:17:42.080 --> 00:17:45.559
you know, the goal would have been to show that the compile
00:17:45.560 --> 00:17:48.359
cache is actually causing the performance issue in
00:17:48.360 --> 00:17:51.359
comparison to pre-compiling it. That's not something
00:17:51.360 --> 00:17:56.039
I've been able to show. Match over bytes, not cars. So
00:17:56.040 --> 00:17:59.079
this is when I said at the beginning, oh, I came in and I think,
00:17:59.080 --> 00:18:02.079
oh, we can just use modern regex engine techniques. This is
00:18:02.080 --> 00:18:05.239
really what I meant. In particular, I mentioned in this
00:18:05.240 --> 00:18:09.279
match loop here that there's this, prolog that does this
00:18:09.280 --> 00:18:13.359
varring decoding. What this means is that every single
00:18:13.360 --> 00:18:18.519
iteration of that loop is going to be interspersed with this
00:18:18.520 --> 00:18:21.919
not being able to read a fixed number of bytes, but a variable
00:18:21.920 --> 00:18:24.359
number of bytes just depending upon the Unicode character
00:18:24.360 --> 00:18:27.039
or the Unicode code point or the multibyte code point. So
00:18:27.040 --> 00:18:29.799
this ends up, again, being relatively difficult to
00:18:29.800 --> 00:18:32.919
optimize because processors operate over bytes and not
00:18:32.920 --> 00:18:38.479
over code points. Yes, we might consider a multi-byte CPU at
00:18:38.480 --> 00:18:41.039
some point. But this is a really, really simple thing. It's
00:18:41.040 --> 00:18:44.999
just generating automata that operate over bytes as
00:18:45.000 --> 00:18:48.839
opposed to code points. This kind of goes into the much more
00:18:48.840 --> 00:18:51.839
abstract one. There's a lot of text here, and we're not
00:18:51.840 --> 00:18:56.159
going to go into it. But the really, really important point
00:18:56.160 --> 00:18:57.999
that I'm specifically mentioning here is this explicit
00:18:58.000 --> 00:19:02.079
control over linguistic complexity. That's the
00:19:02.080 --> 00:19:06.159
abstract kind of point. I want to introduce the inputs and
00:19:06.160 --> 00:19:11.279
the outputs. Basically, when you perform a search, or a
00:19:11.280 --> 00:19:14.799
match, or a parse, those are different tasks. They'll
00:19:14.800 --> 00:19:17.799
have different expected inputs and different desired
00:19:17.800 --> 00:19:21.559
outputs. Right now, Emacs, the API for the regular
00:19:21.560 --> 00:19:24.919
expression engine and for matching, It doesn't allow
00:19:24.920 --> 00:19:27.959
specialization on this. Or rather, if we do specialize on
00:19:27.960 --> 00:19:30.999
particular inputs, if we have a heuristic to check if a regex
00:19:31.000 --> 00:19:33.519
is actually a literal string, that's not something that the
00:19:33.520 --> 00:19:36.959
user actually has control over. For example, you can make
00:19:36.960 --> 00:19:38.999
a mistake escaping something, and then you don't have a
00:19:39.000 --> 00:19:42.039
literal, and then you accidentally have behavior that you
00:19:42.040 --> 00:19:44.279
totally didn't expect. Not just correctness issues, but
00:19:44.280 --> 00:19:48.599
also performance issues. I really like this one. I like
00:19:48.600 --> 00:19:52.239
this a lot, because I didn't think of it at all. I think it's
00:19:52.240 --> 00:19:58.119
better than in all of my ideas. This was proposed, at least
00:19:58.120 --> 00:20:01.839
to me, by Pip Cet, and I really hope that I'm pronouncing your
00:20:01.840 --> 00:20:04.479
name correctly. I'm sorry I didn't ask you beforehand,
00:20:04.480 --> 00:20:08.399
emacs-devel. In particular, this was after a couple of
00:20:08.400 --> 00:20:11.999
responses where I was trying to say, oh, I want to give the
00:20:12.000 --> 00:20:15.879
list programmer, way back in here, I want to give the list
00:20:15.880 --> 00:20:20.559
programmer the ability to control compilation in some
00:20:20.560 --> 00:20:25.759
sense. you know, he mentioned, I think he is correct, you
00:20:25.760 --> 00:20:28.439
know, there's no real introspection. That happens
00:20:28.440 --> 00:20:33.119
because it's written in C. I was thinking, oh, if I turn
00:20:33.120 --> 00:20:35.639
this into a list object that gives the list programmer the
00:20:35.640 --> 00:20:40.039
power and the ability to do more with that, but it doesn't
00:20:40.040 --> 00:20:42.839
actually because it's still in C. At first, I was
00:20:42.840 --> 00:20:46.679
thinking, oh, we can make the C part more flexible. But
00:20:46.680 --> 00:20:50.039
actually, especially if we want to do almost any of the
00:20:50.040 --> 00:20:52.719
things we previously mentioned, I think basically that
00:20:52.720 --> 00:20:56.599
this is... I think that if I'm not going to do it, somebody
00:20:56.600 --> 00:20:58.879
else really should do it, and I think we should maybe even do
00:20:58.880 --> 00:21:01.519
it together, because I think this is really, I think, how we
00:21:01.520 --> 00:21:04.079
can start experimenting, and not just experimenting, but
00:21:04.080 --> 00:21:07.039
because, as mentioned here, we have libgccjit, we have the
00:21:07.040 --> 00:21:09.519
native compiler, we have the ability to opt, like,
00:21:09.520 --> 00:21:12.639
specifically to generate specific code for this, so why not
00:21:12.640 --> 00:21:15.919
implement the or a Redix engine itself in list, And this
00:21:15.920 --> 00:21:18.359
gives us the ability to introspect it. That's one of the
00:21:18.360 --> 00:21:20.759
things I mentioned at the beginning. But it actually gives
00:21:20.760 --> 00:21:23.879
us the ability to then actually look at all the previous
00:21:23.880 --> 00:21:28.159
implementations, to explicitly compile beforehand, to
00:21:28.160 --> 00:21:32.519
match against bytes, to specialize and dispatch based upon
00:21:32.520 --> 00:21:36.799
input and output. This is something that, you know, it's
00:21:36.800 --> 00:21:37.999
super simple.
00:21:38.000 --> 00:21:40.799
It's really smart. I'm really, really glad that Pip
00:21:40.800 --> 00:21:44.839
mentioned this because it is, I think, the right way to solve
00:21:44.840 --> 00:21:49.879
the rest of it. We're at the final section. I talked a
00:21:49.880 --> 00:21:52.679
lot about, you know, kind of abstract, you know, thoughts.
00:21:52.680 --> 00:21:55.679
I talked a little about, you know, specific solutions.
00:21:55.680 --> 00:21:59.999
But I especially talked about, you know, what is Regex and
00:22:00.000 --> 00:22:02.959
Emacs? And I don't know if I had a lot of specific examples of
00:22:02.960 --> 00:22:06.079
it. I'm going to just describe kind of my, I guess,
00:22:06.080 --> 00:22:09.799
motivation, my impetus. Then I think something that's
00:22:09.800 --> 00:22:12.639
really something to chew on for the future. Do I have any
00:22:12.640 --> 00:22:15.799
concrete examples? Yes. Well, you can decide if they're
00:22:15.800 --> 00:22:22.799
concrete. Or am I just posturing? Also, yes. helm, rg. Helm,
00:22:22.800 --> 00:22:27.679
Erg, it's literally just M-x grep, it uses ripgrep, which
00:22:27.680 --> 00:22:31.999
is written by the same author of the Rust regex [??]. It
00:22:32.000 --> 00:22:36.199
happens to be very, very fast. In particular, I use this tool
00:22:36.200 --> 00:22:39.319
with ripgrep on the Twitter monorepo, and I was able to
00:22:39.320 --> 00:22:42.559
search very, very large amounts of code that was on my local
00:22:42.560 --> 00:22:46.399
machine using regular expressions. I think this is one
00:22:46.400 --> 00:22:49.199
thing that I think is really, really important, because
00:22:49.200 --> 00:22:52.919
when you want to scale, People say the word scaling and they
00:22:52.920 --> 00:22:56.719
assume there's a specific kind of answer for that. I've
00:22:56.720 --> 00:23:01.679
just found that text is not only flexible, it's actually
00:23:01.680 --> 00:23:04.359
something that can be more performant than the alternative
00:23:04.360 --> 00:23:07.399
and not only more performant, but more productive. It's
00:23:07.400 --> 00:23:10.359
again, it's just M-x grep using ripgrep. There's a tool
00:23:10.360 --> 00:23:12.719
deadgrep by Wilfred Hughes, which is also fantastic. I
00:23:12.720 --> 00:23:15.759
think it's actually better than this, but this one's mine so
00:23:15.760 --> 00:23:19.199
I can mess around with it. But this tool is kind of why,
00:23:19.200 --> 00:23:21.799
especially I started looking into Emacs and looking into
00:23:21.800 --> 00:23:24.919
changing the way that, or at least diving into how the
00:23:24.920 --> 00:23:27.559
regular expression matching actually kind of works, both
00:23:27.560 --> 00:23:30.359
in Emacs and then in ripgrep. We'll go to the next one.
00:23:30.360 --> 00:23:34.119
This is something that does exist and continues to exist.
00:23:34.120 --> 00:23:36.799
This is something that doesn't quite exist yet. I'm
00:23:36.800 --> 00:23:41.359
calling it telepathy grams. It's, you know, it's the name,
00:23:41.360 --> 00:23:44.719
and it's very, you know, it doesn't work, but it's a code
00:23:44.720 --> 00:23:47.919
search tool that, in this case, precompiles the database to
00:23:47.920 --> 00:23:51.879
execute NFAs against. I was thinking, how can I beat And
00:23:51.880 --> 00:23:55.039
the first thing I thought is, well, as I have worked on build
00:23:55.040 --> 00:23:57.759
tools, especially in monorepos, one of the things that the
00:23:57.760 --> 00:24:00.799
pants build tool from Twitter does is it uses a file watcher
00:24:00.800 --> 00:24:04.239
to ensure that instead of having to constantly read in the
00:24:04.240 --> 00:24:10.079
entire contents of a file, which may be very, very large, it
00:24:10.080 --> 00:24:13.679
only does so when the file has been changed. Finally, I
00:24:13.680 --> 00:24:16.919
want to conclude on this note, which is just that the stuff I
00:24:16.920 --> 00:24:20.839
didn't learn from emacs devel, I learned from Paul
00:24:20.840 --> 00:24:25.319
Wankadia, Jr., who is the RE2 maintainer, and he taught me
00:24:25.320 --> 00:24:32.399
quite a lot from 2023 to 2024. I'm thankful for the time
00:24:32.400 --> 00:24:37.959
that I learned from you, so thank you, Paul. With that, we're
00:24:37.960 --> 00:24:42.759
at point-max. Call me, beat me, if you want to reach me and or
00:24:42.760 --> 00:24:45.839
hire me. These are places that you can reach me at. There are
00:24:45.840 --> 00:24:49.719
probably others. Feel free to suggest other ways to contact
00:24:49.720 --> 00:24:53.199
me. But for now, this is the end. Thank you so much for your
00:24:53.200 --> 00:24:56.080
time. I really appreciate it.