summaryrefslogtreecommitdiffstats
path: root/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expre...
diff options
context:
space:
mode:
authorEmacsConf <emacsconf-org@gnu.org>2024-12-08 09:37:18 -0500
committerEmacsConf <emacsconf-org@gnu.org>2024-12-08 09:37:18 -0500
commit84710fb2fb5ef807c63a48eed67df9a42edbecac (patch)
treecf501c71374d02b23a28a72558fb5232d1fd95d0 /2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt
parente314823f7bf8c9ab928931d22e7284815b889803 (diff)
downloademacsconf-wiki-84710fb2fb5ef807c63a48eed67df9a42edbecac.tar.xz
emacsconf-wiki-84710fb2fb5ef807c63a48eed67df9a42edbecac.zip
update
Diffstat (limited to '2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt')
-rw-r--r--2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt1198
1 files changed, 1198 insertions, 0 deletions
diff --git a/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt b/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt
new file mode 100644
index 00000000..bad03237
--- /dev/null
+++ b/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt
@@ -0,0 +1,1198 @@
+WEBVTT captioned by sachac
+
+00:00:00.000 --> 00:00:13.359
+Hello, I'm Danny McClanahan. This is EmacsConf 2024. And
+
+00:00:13.360 --> 00:00:17.159
+this presentation is ostensibly about Emacs Regex
+
+00:00:17.160 --> 00:00:22.639
+compilation. But it'll lead a lot more in future
+
+00:00:22.640 --> 00:00:30.879
+directions. Thanks for coming on this journey with me.
+
+00:00:30.880 --> 00:00:36.719
+This presentation is 50 slides, 50 footnotes, and that's
+
+00:00:36.720 --> 00:00:40.679
+intended for it to be a resource later on for your perusal. We
+
+00:00:40.680 --> 00:00:44.199
+are unfortunately not going to be able to go into all of it,
+
+00:00:44.200 --> 00:00:49.439
+but I will try to be within 20 minutes so we can make it
+
+00:00:49.440 --> 00:00:56.199
+throughout Q&A. This is the structure of the talk.
+
+00:00:56.200 --> 00:01:03.519
+But enough about me. Who are you? And why are you here?
+
+00:01:03.520 --> 00:01:09.479
+I'm Danny McClanahan.
+
+00:01:09.480 --> 00:01:13.439
+My experience is a lot in build tools, especially in the
+
+00:01:13.440 --> 00:01:19.399
+package managers. That started because I realized I was
+
+00:01:19.400 --> 00:01:23.319
+wasting a lot of time. Then I didn't like that. I
+
+00:01:23.320 --> 00:01:29.439
+started wasting a lot of time, trying to avoid wasting time.
+
+00:01:29.440 --> 00:01:35.479
+Then I ended up... going so far around that I ended up
+
+00:01:35.480 --> 00:01:40.319
+stopping other people from wasting their own time, in this
+
+00:01:40.320 --> 00:01:44.359
+case, regarding failing builds. But this is a kind of
+
+00:01:44.360 --> 00:01:47.479
+pattern that you'll see. I'm talking a lot about patterns in
+
+00:01:47.480 --> 00:01:52.399
+this presentation. Parsing in text is another one of
+
+00:01:52.400 --> 00:01:57.479
+those tendencies that I have. Why am I here? I've got a lot
+
+00:01:57.480 --> 00:02:00.639
+of feelings about text. For the next 20 minutes, I'm
+
+00:02:00.640 --> 00:02:06.079
+making it your problem.
+
+00:02:06.080 --> 00:02:09.639
+First off, a huge shout out to Emacs Devel and the Emacs
+
+00:02:09.640 --> 00:02:12.919
+community in general. I spent a lot of time learning about
+
+00:02:12.920 --> 00:02:15.559
+what I'm about to talk about. I was definitely super
+
+00:02:15.560 --> 00:02:19.439
+confused at first. Then when I became less confused and I
+
+00:02:19.440 --> 00:02:23.919
+decided I was going to look at the regular expressions of the
+
+00:02:23.920 --> 00:02:28.039
+Regex engine, I was like, oh, it's old C code. It's
+
+00:02:28.040 --> 00:02:33.559
+Emacs. We can just use modern techniques. Turns out that's
+
+00:02:33.560 --> 00:02:37.839
+wrong for kind of two reasons. One, because using modern
+
+00:02:37.840 --> 00:02:41.479
+techniques or other engines don't necessarily do what
+
+00:02:41.480 --> 00:02:44.799
+Emacs regex engine currently does. Then secondarily,
+
+00:02:44.800 --> 00:02:48.719
+that's not actually as interesting as the other kind of
+
+00:02:48.720 --> 00:02:52.359
+larger goals that emacs-devel discussed. Thank you, Eli
+
+00:02:52.360 --> 00:02:56.279
+Zaretskii, so, so much, especially Pip Cet and everyone else
+
+00:02:56.280 --> 00:02:59.319
+as well--I believe--Pip Cet, I hope I'm pronouncing that
+
+00:02:59.320 --> 00:03:01.799
+correctly. Thank you so much. I'll be shouting you out
+
+00:03:01.800 --> 00:03:04.319
+later as well. Then these larger goals ended up
+
+00:03:04.320 --> 00:03:07.199
+overlapping a lot with my own research interests. And
+
+00:03:07.200 --> 00:03:09.879
+that's very exciting. I'm hoping it's exciting for you
+
+00:03:09.880 --> 00:03:14.079
+too. What is a regular expression? And when and how does
+
+00:03:14.080 --> 00:03:16.559
+implementation match formal theory? So what does formal
+
+00:03:16.560 --> 00:03:24.079
+theory mean? And we'll talk about that.
+
+00:03:24.080 --> 00:03:27.519
+What is a regular expression? So I might ask you this
+
+00:03:27.520 --> 00:03:30.799
+question, and you might give an answer. Then I might ask
+
+00:03:30.800 --> 00:03:33.519
+someone else, and they might have an answer. Then I might
+
+00:03:33.520 --> 00:03:38.039
+ask myself, and I might try to think of an answer. Our
+
+00:03:38.040 --> 00:03:41.799
+answers would, you know, see, the thing is, they'd all be
+
+00:03:41.800 --> 00:03:45.359
+correct, but they'd probably be slightly different, and
+
+00:03:45.360 --> 00:03:50.319
+they'd be different in kind of important ways. I'm
+
+00:03:50.320 --> 00:03:55.039
+using formal theory to kind of describe what unifies these
+
+00:03:55.040 --> 00:04:00.119
+interpretations and what causes this sort of divergence,
+
+00:04:00.120 --> 00:04:05.439
+both over time and then across code bases. I'm kind of
+
+00:04:05.440 --> 00:04:09.319
+putting a flag in the ground here and saying formal theory is
+
+00:04:09.320 --> 00:04:12.999
+actually a really, really negative influence, I think, but
+
+00:04:13.000 --> 00:04:15.999
+it can be better. That's what I'm going to talk about in
+
+00:04:16.000 --> 00:04:19.519
+this talk, in this presentation. We might ask, how did
+
+00:04:19.520 --> 00:04:26.679
+this happen? and we might try to find a start state. We
+
+00:04:26.680 --> 00:04:30.519
+might put that place at the theories of formal languages
+
+00:04:30.520 --> 00:04:34.679
+that kind of arose, especially post Turing and post
+
+00:04:34.680 --> 00:04:37.519
+Chomsky. Especially there was this really, really
+
+00:04:37.520 --> 00:04:40.119
+interesting and powerful relationship with formal
+
+00:04:40.120 --> 00:04:43.959
+languages between representation and computation. And
+
+00:04:43.960 --> 00:04:48.599
+then on top of that, we have regex as this really powerful
+
+00:04:48.600 --> 00:04:52.159
+union of theory and practice And then, like I mentioned,
+
+00:04:52.160 --> 00:04:55.799
+this is kind of divergence that kind of occurs. This
+
+00:04:55.800 --> 00:04:58.079
+divergence happens for a good reason. This happens because
+
+00:04:58.080 --> 00:05:01.999
+people were adding implementations and people adding
+
+00:05:02.000 --> 00:05:04.639
+features to implementations. While the people adding
+
+00:05:04.640 --> 00:05:06.679
+these features were often academics, they were
+
+00:05:06.680 --> 00:05:09.199
+industries, people that were hobbyists, they were
+
+00:05:09.200 --> 00:05:11.999
+interested in building practical tools. This is a good
+
+00:05:12.000 --> 00:05:14.879
+thing. This is still a good thing, even though it moves a
+
+00:05:14.880 --> 00:05:18.199
+little bit away from formal theory. But we start seeing some
+
+00:05:18.200 --> 00:05:22.639
+cracks developing, and we'll go into that in a second. We're
+
+00:05:22.640 --> 00:05:27.519
+just going to kind of electric slide into the 1980s here, and
+
+00:05:27.520 --> 00:05:31.879
+we're going to be confronted with two occurrences very
+
+00:05:31.880 --> 00:05:35.639
+similarly. We might call it simultaneous discovery. In
+
+00:05:35.640 --> 00:05:38.559
+1983, you have Michael Jackson demonstrating the
+
+00:05:38.560 --> 00:05:41.999
+moonwalk. Three years later, we have backtracking
+
+00:05:42.000 --> 00:05:44.999
+developed to stimulate EGREP-style regular expressions.
+
+00:05:45.000 --> 00:05:48.599
+These would both be incredibly influential in their own
+
+00:05:48.600 --> 00:05:54.039
+kind of branching paths. Here's where the gloves come
+
+00:05:54.040 --> 00:06:00.759
+off. Formal theory, I claim, remains largely concerned
+
+00:06:00.760 --> 00:06:03.359
+with incremental improvements to artificial benchmarks,
+
+00:06:03.360 --> 00:06:07.279
+and much less with expanding models to cover actual user
+
+00:06:07.280 --> 00:06:11.799
+needs. This isn't just about, oh, if you listened to
+
+00:06:11.800 --> 00:06:15.999
+users, that you'd be a nicer person, you'd be a better
+
+00:06:16.000 --> 00:06:19.359
+engineer. What I'm actually saying is that they're missing
+
+00:06:19.360 --> 00:06:23.919
+out. When you don't listen to applications, you miss out on a
+
+00:06:23.920 --> 00:06:26.639
+lot of fantastic opportunities for novel theory. So
+
+00:06:26.640 --> 00:06:30.839
+this is, again, my complaint with formal theory as it
+
+00:06:30.840 --> 00:06:34.599
+stands. But we're gonna do better. Before we get better,
+
+00:06:34.600 --> 00:06:36.959
+we're gonna get, a little bit worse for a bit. We're going to
+
+00:06:36.960 --> 00:06:40.359
+actually get a little bit worse is better. What I mean by
+
+00:06:40.360 --> 00:06:43.239
+that is, by the 1990s, we start looking into these
+
+00:06:43.240 --> 00:06:46.479
+non-backtracking engines. This is a bit of a reaction to
+
+00:06:46.480 --> 00:06:50.399
+backtracking. The current ones include RE2,
+
+00:06:50.400 --> 00:06:53.919
+hyperscan, and the rust regex library. These are all
+
+00:06:53.920 --> 00:06:56.439
+great. I'll talk about them later as well. They make use
+
+00:06:56.440 --> 00:06:58.719
+of these. They kind of call back to the earlier formal
+
+00:06:58.720 --> 00:07:01.479
+theory. They have linear runtimes for well-specified
+
+00:07:01.480 --> 00:07:02.519
+search tasks.
+
+00:07:02.520 --> 00:07:08.079
+What happens if that doesn't fit your needs? We're going to
+
+00:07:08.080 --> 00:07:11.479
+talk about that. We're going to table that for a second,
+
+00:07:11.480 --> 00:07:15.319
+and we're going to focus more on Emacs, the subject of this
+
+00:07:15.320 --> 00:07:19.359
+conference. What are regex used for? And in this
+
+00:07:19.360 --> 00:07:22.439
+particular case, they're used for lots of things, with
+
+00:07:22.440 --> 00:07:25.199
+practically, and I think they should be. But more
+
+00:07:25.200 --> 00:07:29.559
+specifically, how do Emacs users use them? And I'm going to
+
+00:07:29.560 --> 00:07:32.679
+focus in on this text as input and output. I'll be kind of
+
+00:07:32.680 --> 00:07:38.959
+elaborating on this analogy as we continue. Why is text
+
+00:07:38.960 --> 00:07:43.399
+powerful? Text as I/O. The reason text programming
+
+00:07:43.400 --> 00:07:45.759
+languages and not just programming languages, but
+
+00:07:45.760 --> 00:07:49.159
+languages themselves, the reason why they're successful
+
+00:07:49.160 --> 00:07:52.279
+and why they propagate, I claim, is because text is both
+
+00:07:52.280 --> 00:07:56.439
+input readable and output writable. What this means
+
+00:07:56.440 --> 00:08:01.199
+is that if you receive something in text, you can read it, And
+
+00:08:01.200 --> 00:08:04.239
+then you can also write it, you can modify it, and you can
+
+00:08:04.240 --> 00:08:06.959
+produce a new version of it. You're on a kind of level
+
+00:08:06.960 --> 00:08:10.959
+playing field. That's not always the case, though. You
+
+00:08:10.960 --> 00:08:15.959
+recall that I've worked a lot with build systems and package
+
+00:08:15.960 --> 00:08:20.999
+managers. There's a discussion that goes by the name of
+
+00:08:21.000 --> 00:08:25.319
+software supply chain security. I think it's a massive
+
+00:08:25.320 --> 00:08:29.079
+joke. The reason why is because people largely raise it
+
+00:08:29.080 --> 00:08:34.279
+to explain why their for-profit company with their
+
+00:08:34.280 --> 00:08:38.079
+for-profit product is going to solve the problem for you, as
+
+00:08:38.080 --> 00:08:41.959
+opposed to the commons of open source. If you are unable to
+
+00:08:41.960 --> 00:08:44.999
+modify or deploy your code without employing an opaque
+
+00:08:45.000 --> 00:08:48.599
+external system, I think, then you have a hidden
+
+00:08:48.600 --> 00:08:53.879
+dependency. you don't remove a dependency, you just, by,
+
+00:08:53.880 --> 00:08:59.239
+for example, paying into a for-profit product or using a
+
+00:08:59.240 --> 00:09:01.519
+closed-off supply chain, you end up just having a hidden
+
+00:09:01.520 --> 00:09:04.719
+dependency, you end up just displacing that. This can
+
+00:09:04.720 --> 00:09:07.639
+actually exert arbitrary control over your programming
+
+00:09:07.640 --> 00:09:11.279
+output and potentially even your thoughts. This is really
+
+00:09:11.280 --> 00:09:15.839
+important. I'm going to dive in a little bit deeper and I'm
+
+00:09:15.840 --> 00:09:18.999
+going to overload the term locality here. I'm going to
+
+00:09:19.000 --> 00:09:22.239
+say, if you cannot reproduce a system locally, it becomes an
+
+00:09:22.240 --> 00:09:24.999
+opaque external system. I'm going to give examples
+
+00:09:25.000 --> 00:09:27.479
+here, and these are going to be a bit of a hot take. First
+
+00:09:27.480 --> 00:09:30.519
+off, GUI IDEs. I think we might, well, some of us might agree
+
+00:09:30.520 --> 00:09:34.519
+with that here. I say development environments that only
+
+00:09:34.520 --> 00:09:38.119
+allow you to use a graphical interface, do not expose
+
+00:09:38.120 --> 00:09:42.799
+interaction with text, are explicitly trying to kind of
+
+00:09:42.800 --> 00:09:46.239
+place you on a separate kind of plane where you're not an
+
+00:09:46.240 --> 00:09:50.439
+equal contributor to the people who make the development
+
+00:09:50.440 --> 00:09:53.079
+environment, make the development kind of frameworks
+
+00:09:53.080 --> 00:09:57.399
+here. We'll go one further. Cloud services are precisely,
+
+00:09:57.400 --> 00:10:00.039
+you know, they're useful for things that, you know, that
+
+00:10:00.040 --> 00:10:04.399
+require large domain computation, but, you know, Twitter,
+
+00:10:04.400 --> 00:10:08.679
+for example, didn't actually ever use any cloud services,
+
+00:10:08.680 --> 00:10:12.199
+external ones, because it was really important for them to
+
+00:10:12.200 --> 00:10:14.999
+actually own their own hardware, their own computation,
+
+00:10:15.000 --> 00:10:20.199
+their own thinking. Cloud services are a way to ensure
+
+00:10:20.200 --> 00:10:24.919
+that you're unable to reproduce a system without paying an
+
+00:10:24.920 --> 00:10:28.039
+amount per month, an amount per day, an amount per second, an
+
+00:10:28.040 --> 00:10:32.439
+amount per cycle to an external entity. I'm just going to
+
+00:10:32.440 --> 00:10:35.559
+conclude this with, I'd say, the argumentum ad absurdum,
+
+00:10:35.560 --> 00:10:39.239
+here, where large language models are all of these at once.
+
+00:10:39.240 --> 00:10:42.879
+They are a cloud service, specifically, and this is what
+
+00:10:42.880 --> 00:10:48.439
+makes them very evil, to make it so that, similar to GUI IDEs,
+
+00:10:48.440 --> 00:10:52.919
+so that text itself loses that ability to be both readable
+
+00:10:52.920 --> 00:10:56.199
+and writable. Instead, text is both unreadable, because
+
+00:10:56.200 --> 00:10:59.519
+it's produced by a machine, and then also unwritable,
+
+00:10:59.520 --> 00:11:02.999
+because you're subservient and subjugated to the machine,
+
+00:11:03.000 --> 00:11:05.359
+to the large language model to produce the code in the first
+
+00:11:05.360 --> 00:11:08.919
+place. You lose this input, output, readable, writable
+
+00:11:08.920 --> 00:11:13.359
+behavior that I claim text has specifically. To
+
+00:11:13.360 --> 00:11:19.439
+underline this, what is text? Text is local. Finally,
+
+00:11:19.440 --> 00:11:23.639
+we're at the subject of this conference. Emacs, I have
+
+00:11:23.640 --> 00:11:27.479
+double hearts with text. I start off the slide saying Emacs
+
+00:11:27.480 --> 00:11:31.519
+is a text editor. I think that's a good start. Which
+
+00:11:31.520 --> 00:11:34.319
+implements much of its own logic and user interface via
+
+00:11:34.320 --> 00:11:38.399
+text. What this means is that, you know, I say without
+
+00:11:38.400 --> 00:11:42.639
+trying, Emacs tries very hard, but without trying so hard,
+
+00:11:42.640 --> 00:11:47.639
+Emacs, is imbued with all of the capabilities that text has
+
+00:11:47.640 --> 00:11:51.319
+specifically. When you use text like Emacs does, and
+
+00:11:51.320 --> 00:11:55.519
+particularly you then start offering mechanisms to query,
+
+00:11:55.520 --> 00:11:59.999
+to transform, and to generally metaprogram text itself,
+
+00:12:00.000 --> 00:12:03.319
+you don't just have the ability to edit code in new ways. And
+
+00:12:03.320 --> 00:12:06.999
+this is something that I think is often lost, maybe not by
+
+00:12:07.000 --> 00:12:11.239
+participants of this conference, you particularly start
+
+00:12:11.240 --> 00:12:14.319
+being able to not only just edit code differently, but to
+
+00:12:14.320 --> 00:12:16.599
+change the way that you think about code and actually to
+
+00:12:16.600 --> 00:12:20.239
+expand your range of thought, the range of actions that you
+
+00:12:20.240 --> 00:12:22.719
+can perform. You can actually start then editing at the
+
+00:12:22.720 --> 00:12:25.799
+speed of thought. This is where especially Regex kind of
+
+00:12:25.800 --> 00:12:30.319
+comes into play. Finally, we get to the subject of the
+
+00:12:30.320 --> 00:12:33.599
+title of this talk. I'm about to disappoint a lot of
+
+00:12:33.600 --> 00:12:38.759
+people. I claim for good reason. Unfortunately, it's a
+
+00:12:38.760 --> 00:12:41.599
+very brief walkthrough, but I'm going to go over what the
+
+00:12:41.600 --> 00:12:43.799
+current Emacs Redix engine is. This is going to give us
+
+00:12:43.800 --> 00:12:48.119
+enough context for the next section on future directions.
+
+00:12:48.120 --> 00:12:51.799
+Quickly, it's a backtracking engine over a multi-byte
+
+00:12:51.800 --> 00:12:53.919
+code point. I'll define what that means. It's in
+
+00:12:53.920 --> 00:12:58.439
+regex-emacs.c. It's invoked in two ways, which you'll see
+
+00:12:58.440 --> 00:13:01.759
+is actually the same way, over a single contiguous string
+
+00:13:01.760 --> 00:13:05.359
+input. This is a Lisp string that you pass in. or over the
+
+00:13:05.360 --> 00:13:07.039
+two halves of the gap buffer. This is when you match
+
+00:13:07.040 --> 00:13:11.879
+against a buffer text. We'll go into that a little bit
+
+00:13:11.880 --> 00:13:13.919
+more, but this is one of the really actually interesting and
+
+00:13:13.920 --> 00:13:17.839
+specific things about Emacs Regex Engine as it stands. So
+
+00:13:17.840 --> 00:13:21.559
+very, very quickly, this is the data layout. This is just, if
+
+00:13:21.560 --> 00:13:24.879
+you're interested, this is where the code lies. So
+
+00:13:24.880 --> 00:13:30.159
+regex-emacs.h has re-pattern buffer, which is a struct
+
+00:13:30.160 --> 00:13:34.239
+Actually, you know, I love, by the way, I love the Emacs C
+
+00:13:34.240 --> 00:13:37.359
+source code. It's so nice to read. It made all this so, so
+
+00:13:37.360 --> 00:13:41.119
+easy. I really appreciated it. In this particular case,
+
+00:13:41.120 --> 00:13:44.039
+I'm just going to focus on re-pattern buffer actually has
+
+00:13:44.040 --> 00:13:47.999
+the compiler. It's a C struct. It has every single thing
+
+00:13:48.000 --> 00:13:52.559
+that is needed to execute the regular expression against a
+
+00:13:52.560 --> 00:13:56.319
+string input or against a buffer input. This buffer,
+
+00:13:56.320 --> 00:13:59.839
+It's not an Emacs buffer. It refers to just the instruction
+
+00:13:59.840 --> 00:14:04.039
+table and the match loop. Again, this is very, very
+
+00:14:04.040 --> 00:14:07.839
+brief, but I want to specifically focus on the first part. So
+
+00:14:07.840 --> 00:14:11.879
+this is this inner matching loop, and there's a prologue,
+
+00:14:11.880 --> 00:14:15.679
+and then there's a loop body, and there's an epilogue. And
+
+00:14:15.680 --> 00:14:18.279
+the prologue is the really, really interesting part. I say
+
+00:14:18.280 --> 00:14:22.839
+extract current and next char. What Emacs does here, it
+
+00:14:22.840 --> 00:14:27.159
+doesn't just reach for the next byte. It actually will
+
+00:14:27.160 --> 00:14:31.879
+perform lazily in some sense, this variable integer size
+
+00:14:31.880 --> 00:14:36.039
+VAR decoding for multi-byte, and it'll actually then
+
+00:14:36.040 --> 00:14:43.959
+decode the next one to four bytes. Up to 32 bits at once, and
+
+00:14:43.960 --> 00:14:46.799
+then it'll actually go into the loop. We'll talk about the
+
+00:14:46.800 --> 00:14:52.519
+implications of that later. Next, in the body of the loop, we
+
+00:14:52.520 --> 00:14:54.239
+read the instruction from the instruction pointer, which
+
+00:14:54.240 --> 00:14:57.319
+is, again, in that buffer field. Then we have this big
+
+00:14:57.320 --> 00:14:59.479
+switch statement, which is actually, love a big switch
+
+00:14:59.480 --> 00:15:02.079
+statement, super easy to read, super easy to understand
+
+00:15:02.080 --> 00:15:05.399
+kind of what's occurring. Then that's the loop body. And
+
+00:15:05.400 --> 00:15:08.279
+then at the end of it, we either increment the instruction
+
+00:15:08.280 --> 00:15:11.119
+pointer if it was matching a single character or something
+
+00:15:11.120 --> 00:15:14.839
+along those lines, or if it was a jump, we don't do that. A
+
+00:15:14.840 --> 00:15:18.199
+jump, however, it's not referring to a jump in the sense of a
+
+00:15:18.200 --> 00:15:22.519
+go-to, but a jump that's elsewhere within that table, that
+
+00:15:22.520 --> 00:15:25.839
+buffer field. If you've included a capture, we write
+
+00:15:25.840 --> 00:15:29.479
+that end position there. Of course, well, as you may
+
+00:15:29.480 --> 00:15:34.439
+recall, the zeroth capture is, of course, the entire match
+
+00:15:34.440 --> 00:15:36.559
+string. If the capture is zero, then we know we've
+
+00:15:36.560 --> 00:15:39.839
+actually completed that match. That's really great.
+
+00:15:39.840 --> 00:15:43.599
+I would love to receive Q&A about this as well. I've spent a
+
+00:15:43.600 --> 00:15:46.719
+lot of time kind of learning and understanding it. But it's
+
+00:15:46.720 --> 00:15:49.879
+really interesting that this can be described in a single
+
+00:15:49.880 --> 00:15:52.159
+slide because it's really simple. That simplicity is
+
+00:15:52.160 --> 00:15:54.639
+actually a really powerful thing. I'll mention that in
+
+00:15:54.640 --> 00:15:58.759
+the next section. I say, is that all? And I apologize for
+
+00:15:58.760 --> 00:16:02.239
+not doing so. But please, please ask questions in Q&A or
+
+00:16:02.240 --> 00:16:04.999
+message me about this, because I think it's really, really,
+
+00:16:05.000 --> 00:16:07.079
+again, interesting. Again, I find the code relatively
+
+00:16:07.080 --> 00:16:11.999
+easy to read. Now, here's, I think this is actually the
+
+00:16:12.000 --> 00:16:15.519
+point of the talk. The rest of it was, you know, I think just me
+
+00:16:15.520 --> 00:16:18.839
+posturing. This is the really, really interesting part.
+
+00:16:18.840 --> 00:16:22.039
+This is the ways that we can improve, well, not just we can
+
+00:16:22.040 --> 00:16:25.839
+improve stuff in Emacs, but why those are the right things to
+
+00:16:25.840 --> 00:16:30.279
+improve. Then also how that can be a model for even things
+
+00:16:30.280 --> 00:16:35.079
+outside of Emacs. This is gonna be a lot of text. I'm not
+
+00:16:35.080 --> 00:16:38.879
+gonna go through all of it. This is the one thing that I tried.
+
+00:16:38.880 --> 00:16:42.239
+This is the thing that I thought would be a slam dunk, easy
+
+00:16:42.240 --> 00:16:47.439
+solution. My initial thought process was, well, We tried
+
+00:16:47.440 --> 00:16:52.919
+very hard to do an LRU cache here. It works. It's actually
+
+00:16:52.920 --> 00:16:57.399
+very effective. However, though, we don't actually give
+
+00:16:57.400 --> 00:17:00.479
+the user, the list programmer, the ability to then say, I
+
+00:17:00.480 --> 00:17:03.079
+know that this regex is something that is going to be used
+
+00:17:03.080 --> 00:17:06.399
+again. I made an artificial benchmark. I made an
+
+00:17:06.400 --> 00:17:10.039
+artificial benchmark because I wanted to show there is one
+
+00:17:10.040 --> 00:17:13.639
+very specific case that it does solve, but it's the same
+
+00:17:13.640 --> 00:17:16.919
+issue with the artificial benchmarks. mentioned earlier.
+
+00:17:16.920 --> 00:17:21.559
+It's very specifically crafted in order to show that this
+
+00:17:21.560 --> 00:17:25.319
+particular solution would produce some speedup. What
+
+00:17:25.320 --> 00:17:29.599
+this means is it just creates more than 20 regexps in a row. It
+
+00:17:29.600 --> 00:17:31.959
+compiles them. Then, of course, because we just don't
+
+00:17:31.960 --> 00:17:35.159
+pay the compile costs, because we don't go through that
+
+00:17:35.160 --> 00:17:39.079
+cache eviction process, it ends up being faster. But this
+
+00:17:39.080 --> 00:17:42.079
+isn't really mean very much, particularly the goal here,
+
+00:17:42.080 --> 00:17:45.559
+you know, the goal would have been to show that the compile
+
+00:17:45.560 --> 00:17:48.359
+cache is actually causing the performance issue in
+
+00:17:48.360 --> 00:17:51.359
+comparison to pre-compiling it. That's not something
+
+00:17:51.360 --> 00:17:56.039
+I've been able to show. Match over bytes, not cars. So
+
+00:17:56.040 --> 00:17:59.079
+this is when I said at the beginning, oh, I came in and I think,
+
+00:17:59.080 --> 00:18:02.079
+oh, we can just use modern regex engine techniques. This is
+
+00:18:02.080 --> 00:18:05.239
+really what I meant. In particular, I mentioned in this
+
+00:18:05.240 --> 00:18:09.279
+match loop here that there's this, prolog that does this
+
+00:18:09.280 --> 00:18:13.359
+varring decoding. What this means is that every single
+
+00:18:13.360 --> 00:18:18.519
+iteration of that loop is going to be interspersed with this
+
+00:18:18.520 --> 00:18:21.919
+not being able to read a fixed number of bytes, but a variable
+
+00:18:21.920 --> 00:18:24.359
+number of bytes just depending upon the Unicode character
+
+00:18:24.360 --> 00:18:27.039
+or the Unicode code point or the multibyte code point. So
+
+00:18:27.040 --> 00:18:29.799
+this ends up, again, being relatively difficult to
+
+00:18:29.800 --> 00:18:32.919
+optimize because processors operate over bytes and not
+
+00:18:32.920 --> 00:18:38.479
+over code points. Yes, we might consider a multi-byte CPU at
+
+00:18:38.480 --> 00:18:41.039
+some point. But this is a really, really simple thing. It's
+
+00:18:41.040 --> 00:18:44.999
+just generating automata that operate over bytes as
+
+00:18:45.000 --> 00:18:48.839
+opposed to code points. This kind of goes into the much more
+
+00:18:48.840 --> 00:18:51.839
+abstract one. There's a lot of text here, and we're not
+
+00:18:51.840 --> 00:18:56.159
+going to go into it. But the really, really important point
+
+00:18:56.160 --> 00:18:57.999
+that I'm specifically mentioning here is this explicit
+
+00:18:58.000 --> 00:19:02.079
+control over linguistic complexity. That's the
+
+00:19:02.080 --> 00:19:06.159
+abstract kind of point. I want to introduce the inputs and
+
+00:19:06.160 --> 00:19:11.279
+the outputs. Basically, when you perform a search, or a
+
+00:19:11.280 --> 00:19:14.799
+match, or a parse, those are different tasks. They'll
+
+00:19:14.800 --> 00:19:17.799
+have different expected inputs and different desired
+
+00:19:17.800 --> 00:19:21.559
+outputs. Right now, Emacs, the API for the regular
+
+00:19:21.560 --> 00:19:24.919
+expression engine and for matching, It doesn't allow
+
+00:19:24.920 --> 00:19:27.959
+specialization on this. Or rather, if we do specialize on
+
+00:19:27.960 --> 00:19:30.999
+particular inputs, if we have a heuristic to check if a regex
+
+00:19:31.000 --> 00:19:33.519
+is actually a literal string, that's not something that the
+
+00:19:33.520 --> 00:19:36.959
+user actually has control over. For example, you can make
+
+00:19:36.960 --> 00:19:38.999
+a mistake escaping something, and then you don't have a
+
+00:19:39.000 --> 00:19:42.039
+literal, and then you accidentally have behavior that you
+
+00:19:42.040 --> 00:19:44.279
+totally didn't expect. Not just correctness issues, but
+
+00:19:44.280 --> 00:19:48.599
+also performance issues. I really like this one. I like
+
+00:19:48.600 --> 00:19:52.239
+this a lot, because I didn't think of it at all. I think it's
+
+00:19:52.240 --> 00:19:58.119
+better than in all of my ideas. This was proposed, at least
+
+00:19:58.120 --> 00:20:01.839
+to me, by Pip Cet, and I really hope that I'm pronouncing your
+
+00:20:01.840 --> 00:20:04.479
+name correctly. I'm sorry I didn't ask you beforehand,
+
+00:20:04.480 --> 00:20:08.399
+emacs-devel. In particular, this was after a couple of
+
+00:20:08.400 --> 00:20:11.999
+responses where I was trying to say, oh, I want to give the
+
+00:20:12.000 --> 00:20:15.879
+list programmer, way back in here, I want to give the list
+
+00:20:15.880 --> 00:20:20.559
+programmer the ability to control compilation in some
+
+00:20:20.560 --> 00:20:25.759
+sense. you know, he mentioned, I think he is correct, you
+
+00:20:25.760 --> 00:20:28.439
+know, there's no real introspection. That happens
+
+00:20:28.440 --> 00:20:33.119
+because it's written in C. I was thinking, oh, if I turn
+
+00:20:33.120 --> 00:20:35.639
+this into a list object that gives the list programmer the
+
+00:20:35.640 --> 00:20:40.039
+power and the ability to do more with that, but it doesn't
+
+00:20:40.040 --> 00:20:42.839
+actually because it's still in C. At first, I was
+
+00:20:42.840 --> 00:20:46.679
+thinking, oh, we can make the C part more flexible. But
+
+00:20:46.680 --> 00:20:50.039
+actually, especially if we want to do almost any of the
+
+00:20:50.040 --> 00:20:52.719
+things we previously mentioned, I think basically that
+
+00:20:52.720 --> 00:20:56.599
+this is... I think that if I'm not going to do it, somebody
+
+00:20:56.600 --> 00:20:58.879
+else really should do it, and I think we should maybe even do
+
+00:20:58.880 --> 00:21:01.519
+it together, because I think this is really, I think, how we
+
+00:21:01.520 --> 00:21:04.079
+can start experimenting, and not just experimenting, but
+
+00:21:04.080 --> 00:21:07.039
+because, as mentioned here, we have libgccjit, we have the
+
+00:21:07.040 --> 00:21:09.519
+native compiler, we have the ability to opt, like,
+
+00:21:09.520 --> 00:21:12.639
+specifically to generate specific code for this, so why not
+
+00:21:12.640 --> 00:21:15.919
+implement the or a Redix engine itself in list, And this
+
+00:21:15.920 --> 00:21:18.359
+gives us the ability to introspect it. That's one of the
+
+00:21:18.360 --> 00:21:20.759
+things I mentioned at the beginning. But it actually gives
+
+00:21:20.760 --> 00:21:23.879
+us the ability to then actually look at all the previous
+
+00:21:23.880 --> 00:21:28.159
+implementations, to explicitly compile beforehand, to
+
+00:21:28.160 --> 00:21:32.519
+match against bytes, to specialize and dispatch based upon
+
+00:21:32.520 --> 00:21:36.799
+input and output. This is something that, you know, it's
+
+00:21:36.800 --> 00:21:37.999
+super simple.
+
+00:21:38.000 --> 00:21:40.799
+It's really smart. I'm really, really glad that Pip
+
+00:21:40.800 --> 00:21:44.839
+mentioned this because it is, I think, the right way to solve
+
+00:21:44.840 --> 00:21:49.879
+the rest of it. We're at the final section. I talked a
+
+00:21:49.880 --> 00:21:52.679
+lot about, you know, kind of abstract, you know, thoughts.
+
+00:21:52.680 --> 00:21:55.679
+I talked a little about, you know, specific solutions.
+
+00:21:55.680 --> 00:21:59.999
+But I especially talked about, you know, what is Regex and
+
+00:22:00.000 --> 00:22:02.959
+Emacs? And I don't know if I had a lot of specific examples of
+
+00:22:02.960 --> 00:22:06.079
+it. I'm going to just describe kind of my, I guess,
+
+00:22:06.080 --> 00:22:09.799
+motivation, my impetus. Then I think something that's
+
+00:22:09.800 --> 00:22:12.639
+really something to chew on for the future. Do I have any
+
+00:22:12.640 --> 00:22:15.799
+concrete examples? Yes. Well, you can decide if they're
+
+00:22:15.800 --> 00:22:22.799
+concrete. Or am I just posturing? Also, yes. helm, rg. Helm,
+
+00:22:22.800 --> 00:22:27.679
+Erg, it's literally just M-x grep, it uses ripgrep, which
+
+00:22:27.680 --> 00:22:31.999
+is written by the same author of the Rust regex [??]. It
+
+00:22:32.000 --> 00:22:36.199
+happens to be very, very fast. In particular, I use this tool
+
+00:22:36.200 --> 00:22:39.319
+with ripgrep on the Twitter monorepo, and I was able to
+
+00:22:39.320 --> 00:22:42.559
+search very, very large amounts of code that was on my local
+
+00:22:42.560 --> 00:22:46.399
+machine using regular expressions. I think this is one
+
+00:22:46.400 --> 00:22:49.199
+thing that I think is really, really important, because
+
+00:22:49.200 --> 00:22:52.919
+when you want to scale, People say the word scaling and they
+
+00:22:52.920 --> 00:22:56.719
+assume there's a specific kind of answer for that. I've
+
+00:22:56.720 --> 00:23:01.679
+just found that text is not only flexible, it's actually
+
+00:23:01.680 --> 00:23:04.359
+something that can be more performant than the alternative
+
+00:23:04.360 --> 00:23:07.399
+and not only more performant, but more productive. It's
+
+00:23:07.400 --> 00:23:10.359
+again, it's just M-x grep using ripgrep. There's a tool
+
+00:23:10.360 --> 00:23:12.719
+deadgrep by Wilfred Hughes, which is also fantastic. I
+
+00:23:12.720 --> 00:23:15.759
+think it's actually better than this, but this one's mine so
+
+00:23:15.760 --> 00:23:19.199
+I can mess around with it. But this tool is kind of why,
+
+00:23:19.200 --> 00:23:21.799
+especially I started looking into Emacs and looking into
+
+00:23:21.800 --> 00:23:24.919
+changing the way that, or at least diving into how the
+
+00:23:24.920 --> 00:23:27.559
+regular expression matching actually kind of works, both
+
+00:23:27.560 --> 00:23:30.359
+in Emacs and then in ripgrep. We'll go to the next one.
+
+00:23:30.360 --> 00:23:34.119
+This is something that does exist and continues to exist.
+
+00:23:34.120 --> 00:23:36.799
+This is something that doesn't quite exist yet. I'm
+
+00:23:36.800 --> 00:23:41.359
+calling it telepathy grams. It's, you know, it's the name,
+
+00:23:41.360 --> 00:23:44.719
+and it's very, you know, it doesn't work, but it's a code
+
+00:23:44.720 --> 00:23:47.919
+search tool that, in this case, precompiles the database to
+
+00:23:47.920 --> 00:23:51.879
+execute NFAs against. I was thinking, how can I beat And
+
+00:23:51.880 --> 00:23:55.039
+the first thing I thought is, well, as I have worked on build
+
+00:23:55.040 --> 00:23:57.759
+tools, especially in monorepos, one of the things that the
+
+00:23:57.760 --> 00:24:00.799
+pants build tool from Twitter does is it uses a file watcher
+
+00:24:00.800 --> 00:24:04.239
+to ensure that instead of having to constantly read in the
+
+00:24:04.240 --> 00:24:10.079
+entire contents of a file, which may be very, very large, it
+
+00:24:10.080 --> 00:24:13.679
+only does so when the file has been changed. Finally, I
+
+00:24:13.680 --> 00:24:16.919
+want to conclude on this note, which is just that the stuff I
+
+00:24:16.920 --> 00:24:20.839
+didn't learn from emacs devel, I learned from Paul
+
+00:24:20.840 --> 00:24:25.319
+Wankadia, Jr., who is the RE2 maintainer, and he taught me
+
+00:24:25.320 --> 00:24:32.399
+quite a lot from 2023 to 2024. I'm thankful for the time
+
+00:24:32.400 --> 00:24:37.959
+that I learned from you, so thank you, Paul. With that, we're
+
+00:24:37.960 --> 00:24:42.759
+at point-max. Call me, beat me, if you want to reach me and or
+
+00:24:42.760 --> 00:24:45.839
+hire me. These are places that you can reach me at. There are
+
+00:24:45.840 --> 00:24:49.719
+probably others. Feel free to suggest other ways to contact
+
+00:24:49.720 --> 00:24:53.199
+me. But for now, this is the end. Thank you so much for your
+
+00:24:53.200 --> 00:24:56.080
+time. I really appreciate it.