From 84710fb2fb5ef807c63a48eed67df9a42edbecac Mon Sep 17 00:00:00 2001 From: EmacsConf Date: Sun, 8 Dec 2024 09:37:18 -0500 Subject: update --- ...ve-pattern-matching--danny-mcclanahan--main.vtt | 1198 ++++++++++++++++++++ 1 file changed, 1198 insertions(+) create mode 100644 2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt diff --git a/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt b/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt new file mode 100644 index 00000000..bad03237 --- /dev/null +++ b/2024/captions/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--main.vtt @@ -0,0 +1,1198 @@ +WEBVTT captioned by sachac + +00:00:00.000 --> 00:00:13.359 +Hello, I'm Danny McClanahan. This is EmacsConf 2024. And + +00:00:13.360 --> 00:00:17.159 +this presentation is ostensibly about Emacs Regex + +00:00:17.160 --> 00:00:22.639 +compilation. But it'll lead a lot more in future + +00:00:22.640 --> 00:00:30.879 +directions. Thanks for coming on this journey with me. + +00:00:30.880 --> 00:00:36.719 +This presentation is 50 slides, 50 footnotes, and that's + +00:00:36.720 --> 00:00:40.679 +intended for it to be a resource later on for your perusal. We + +00:00:40.680 --> 00:00:44.199 +are unfortunately not going to be able to go into all of it, + +00:00:44.200 --> 00:00:49.439 +but I will try to be within 20 minutes so we can make it + +00:00:49.440 --> 00:00:56.199 +throughout Q&A. This is the structure of the talk. + +00:00:56.200 --> 00:01:03.519 +But enough about me. Who are you? And why are you here? + +00:01:03.520 --> 00:01:09.479 +I'm Danny McClanahan. + +00:01:09.480 --> 00:01:13.439 +My experience is a lot in build tools, especially in the + +00:01:13.440 --> 00:01:19.399 +package managers. That started because I realized I was + +00:01:19.400 --> 00:01:23.319 +wasting a lot of time. Then I didn't like that. I + +00:01:23.320 --> 00:01:29.439 +started wasting a lot of time, trying to avoid wasting time. + +00:01:29.440 --> 00:01:35.479 +Then I ended up... going so far around that I ended up + +00:01:35.480 --> 00:01:40.319 +stopping other people from wasting their own time, in this + +00:01:40.320 --> 00:01:44.359 +case, regarding failing builds. But this is a kind of + +00:01:44.360 --> 00:01:47.479 +pattern that you'll see. I'm talking a lot about patterns in + +00:01:47.480 --> 00:01:52.399 +this presentation. Parsing in text is another one of + +00:01:52.400 --> 00:01:57.479 +those tendencies that I have. Why am I here? I've got a lot + +00:01:57.480 --> 00:02:00.639 +of feelings about text. For the next 20 minutes, I'm + +00:02:00.640 --> 00:02:06.079 +making it your problem. + +00:02:06.080 --> 00:02:09.639 +First off, a huge shout out to Emacs Devel and the Emacs + +00:02:09.640 --> 00:02:12.919 +community in general. I spent a lot of time learning about + +00:02:12.920 --> 00:02:15.559 +what I'm about to talk about. I was definitely super + +00:02:15.560 --> 00:02:19.439 +confused at first. Then when I became less confused and I + +00:02:19.440 --> 00:02:23.919 +decided I was going to look at the regular expressions of the + +00:02:23.920 --> 00:02:28.039 +Regex engine, I was like, oh, it's old C code. It's + +00:02:28.040 --> 00:02:33.559 +Emacs. We can just use modern techniques. Turns out that's + +00:02:33.560 --> 00:02:37.839 +wrong for kind of two reasons. One, because using modern + +00:02:37.840 --> 00:02:41.479 +techniques or other engines don't necessarily do what + +00:02:41.480 --> 00:02:44.799 +Emacs regex engine currently does. Then secondarily, + +00:02:44.800 --> 00:02:48.719 +that's not actually as interesting as the other kind of + +00:02:48.720 --> 00:02:52.359 +larger goals that emacs-devel discussed. Thank you, Eli + +00:02:52.360 --> 00:02:56.279 +Zaretskii, so, so much, especially Pip Cet and everyone else + +00:02:56.280 --> 00:02:59.319 +as well--I believe--Pip Cet, I hope I'm pronouncing that + +00:02:59.320 --> 00:03:01.799 +correctly. Thank you so much. I'll be shouting you out + +00:03:01.800 --> 00:03:04.319 +later as well. Then these larger goals ended up + +00:03:04.320 --> 00:03:07.199 +overlapping a lot with my own research interests. And + +00:03:07.200 --> 00:03:09.879 +that's very exciting. I'm hoping it's exciting for you + +00:03:09.880 --> 00:03:14.079 +too. What is a regular expression? And when and how does + +00:03:14.080 --> 00:03:16.559 +implementation match formal theory? So what does formal + +00:03:16.560 --> 00:03:24.079 +theory mean? And we'll talk about that. + +00:03:24.080 --> 00:03:27.519 +What is a regular expression? So I might ask you this + +00:03:27.520 --> 00:03:30.799 +question, and you might give an answer. Then I might ask + +00:03:30.800 --> 00:03:33.519 +someone else, and they might have an answer. Then I might + +00:03:33.520 --> 00:03:38.039 +ask myself, and I might try to think of an answer. Our + +00:03:38.040 --> 00:03:41.799 +answers would, you know, see, the thing is, they'd all be + +00:03:41.800 --> 00:03:45.359 +correct, but they'd probably be slightly different, and + +00:03:45.360 --> 00:03:50.319 +they'd be different in kind of important ways. I'm + +00:03:50.320 --> 00:03:55.039 +using formal theory to kind of describe what unifies these + +00:03:55.040 --> 00:04:00.119 +interpretations and what causes this sort of divergence, + +00:04:00.120 --> 00:04:05.439 +both over time and then across code bases. I'm kind of + +00:04:05.440 --> 00:04:09.319 +putting a flag in the ground here and saying formal theory is + +00:04:09.320 --> 00:04:12.999 +actually a really, really negative influence, I think, but + +00:04:13.000 --> 00:04:15.999 +it can be better. That's what I'm going to talk about in + +00:04:16.000 --> 00:04:19.519 +this talk, in this presentation. We might ask, how did + +00:04:19.520 --> 00:04:26.679 +this happen? and we might try to find a start state. We + +00:04:26.680 --> 00:04:30.519 +might put that place at the theories of formal languages + +00:04:30.520 --> 00:04:34.679 +that kind of arose, especially post Turing and post + +00:04:34.680 --> 00:04:37.519 +Chomsky. Especially there was this really, really + +00:04:37.520 --> 00:04:40.119 +interesting and powerful relationship with formal + +00:04:40.120 --> 00:04:43.959 +languages between representation and computation. And + +00:04:43.960 --> 00:04:48.599 +then on top of that, we have regex as this really powerful + +00:04:48.600 --> 00:04:52.159 +union of theory and practice And then, like I mentioned, + +00:04:52.160 --> 00:04:55.799 +this is kind of divergence that kind of occurs. This + +00:04:55.800 --> 00:04:58.079 +divergence happens for a good reason. This happens because + +00:04:58.080 --> 00:05:01.999 +people were adding implementations and people adding + +00:05:02.000 --> 00:05:04.639 +features to implementations. While the people adding + +00:05:04.640 --> 00:05:06.679 +these features were often academics, they were + +00:05:06.680 --> 00:05:09.199 +industries, people that were hobbyists, they were + +00:05:09.200 --> 00:05:11.999 +interested in building practical tools. This is a good + +00:05:12.000 --> 00:05:14.879 +thing. This is still a good thing, even though it moves a + +00:05:14.880 --> 00:05:18.199 +little bit away from formal theory. But we start seeing some + +00:05:18.200 --> 00:05:22.639 +cracks developing, and we'll go into that in a second. We're + +00:05:22.640 --> 00:05:27.519 +just going to kind of electric slide into the 1980s here, and + +00:05:27.520 --> 00:05:31.879 +we're going to be confronted with two occurrences very + +00:05:31.880 --> 00:05:35.639 +similarly. We might call it simultaneous discovery. In + +00:05:35.640 --> 00:05:38.559 +1983, you have Michael Jackson demonstrating the + +00:05:38.560 --> 00:05:41.999 +moonwalk. Three years later, we have backtracking + +00:05:42.000 --> 00:05:44.999 +developed to stimulate EGREP-style regular expressions. + +00:05:45.000 --> 00:05:48.599 +These would both be incredibly influential in their own + +00:05:48.600 --> 00:05:54.039 +kind of branching paths. Here's where the gloves come + +00:05:54.040 --> 00:06:00.759 +off. Formal theory, I claim, remains largely concerned + +00:06:00.760 --> 00:06:03.359 +with incremental improvements to artificial benchmarks, + +00:06:03.360 --> 00:06:07.279 +and much less with expanding models to cover actual user + +00:06:07.280 --> 00:06:11.799 +needs. This isn't just about, oh, if you listened to + +00:06:11.800 --> 00:06:15.999 +users, that you'd be a nicer person, you'd be a better + +00:06:16.000 --> 00:06:19.359 +engineer. What I'm actually saying is that they're missing + +00:06:19.360 --> 00:06:23.919 +out. When you don't listen to applications, you miss out on a + +00:06:23.920 --> 00:06:26.639 +lot of fantastic opportunities for novel theory. So + +00:06:26.640 --> 00:06:30.839 +this is, again, my complaint with formal theory as it + +00:06:30.840 --> 00:06:34.599 +stands. But we're gonna do better. Before we get better, + +00:06:34.600 --> 00:06:36.959 +we're gonna get, a little bit worse for a bit. We're going to + +00:06:36.960 --> 00:06:40.359 +actually get a little bit worse is better. What I mean by + +00:06:40.360 --> 00:06:43.239 +that is, by the 1990s, we start looking into these + +00:06:43.240 --> 00:06:46.479 +non-backtracking engines. This is a bit of a reaction to + +00:06:46.480 --> 00:06:50.399 +backtracking. The current ones include RE2, + +00:06:50.400 --> 00:06:53.919 +hyperscan, and the rust regex library. These are all + +00:06:53.920 --> 00:06:56.439 +great. I'll talk about them later as well. They make use + +00:06:56.440 --> 00:06:58.719 +of these. They kind of call back to the earlier formal + +00:06:58.720 --> 00:07:01.479 +theory. They have linear runtimes for well-specified + +00:07:01.480 --> 00:07:02.519 +search tasks. + +00:07:02.520 --> 00:07:08.079 +What happens if that doesn't fit your needs? We're going to + +00:07:08.080 --> 00:07:11.479 +talk about that. We're going to table that for a second, + +00:07:11.480 --> 00:07:15.319 +and we're going to focus more on Emacs, the subject of this + +00:07:15.320 --> 00:07:19.359 +conference. What are regex used for? And in this + +00:07:19.360 --> 00:07:22.439 +particular case, they're used for lots of things, with + +00:07:22.440 --> 00:07:25.199 +practically, and I think they should be. But more + +00:07:25.200 --> 00:07:29.559 +specifically, how do Emacs users use them? And I'm going to + +00:07:29.560 --> 00:07:32.679 +focus in on this text as input and output. I'll be kind of + +00:07:32.680 --> 00:07:38.959 +elaborating on this analogy as we continue. Why is text + +00:07:38.960 --> 00:07:43.399 +powerful? Text as I/O. The reason text programming + +00:07:43.400 --> 00:07:45.759 +languages and not just programming languages, but + +00:07:45.760 --> 00:07:49.159 +languages themselves, the reason why they're successful + +00:07:49.160 --> 00:07:52.279 +and why they propagate, I claim, is because text is both + +00:07:52.280 --> 00:07:56.439 +input readable and output writable. What this means + +00:07:56.440 --> 00:08:01.199 +is that if you receive something in text, you can read it, And + +00:08:01.200 --> 00:08:04.239 +then you can also write it, you can modify it, and you can + +00:08:04.240 --> 00:08:06.959 +produce a new version of it. You're on a kind of level + +00:08:06.960 --> 00:08:10.959 +playing field. That's not always the case, though. You + +00:08:10.960 --> 00:08:15.959 +recall that I've worked a lot with build systems and package + +00:08:15.960 --> 00:08:20.999 +managers. There's a discussion that goes by the name of + +00:08:21.000 --> 00:08:25.319 +software supply chain security. I think it's a massive + +00:08:25.320 --> 00:08:29.079 +joke. The reason why is because people largely raise it + +00:08:29.080 --> 00:08:34.279 +to explain why their for-profit company with their + +00:08:34.280 --> 00:08:38.079 +for-profit product is going to solve the problem for you, as + +00:08:38.080 --> 00:08:41.959 +opposed to the commons of open source. If you are unable to + +00:08:41.960 --> 00:08:44.999 +modify or deploy your code without employing an opaque + +00:08:45.000 --> 00:08:48.599 +external system, I think, then you have a hidden + +00:08:48.600 --> 00:08:53.879 +dependency. you don't remove a dependency, you just, by, + +00:08:53.880 --> 00:08:59.239 +for example, paying into a for-profit product or using a + +00:08:59.240 --> 00:09:01.519 +closed-off supply chain, you end up just having a hidden + +00:09:01.520 --> 00:09:04.719 +dependency, you end up just displacing that. This can + +00:09:04.720 --> 00:09:07.639 +actually exert arbitrary control over your programming + +00:09:07.640 --> 00:09:11.279 +output and potentially even your thoughts. This is really + +00:09:11.280 --> 00:09:15.839 +important. I'm going to dive in a little bit deeper and I'm + +00:09:15.840 --> 00:09:18.999 +going to overload the term locality here. I'm going to + +00:09:19.000 --> 00:09:22.239 +say, if you cannot reproduce a system locally, it becomes an + +00:09:22.240 --> 00:09:24.999 +opaque external system. I'm going to give examples + +00:09:25.000 --> 00:09:27.479 +here, and these are going to be a bit of a hot take. First + +00:09:27.480 --> 00:09:30.519 +off, GUI IDEs. I think we might, well, some of us might agree + +00:09:30.520 --> 00:09:34.519 +with that here. I say development environments that only + +00:09:34.520 --> 00:09:38.119 +allow you to use a graphical interface, do not expose + +00:09:38.120 --> 00:09:42.799 +interaction with text, are explicitly trying to kind of + +00:09:42.800 --> 00:09:46.239 +place you on a separate kind of plane where you're not an + +00:09:46.240 --> 00:09:50.439 +equal contributor to the people who make the development + +00:09:50.440 --> 00:09:53.079 +environment, make the development kind of frameworks + +00:09:53.080 --> 00:09:57.399 +here. We'll go one further. Cloud services are precisely, + +00:09:57.400 --> 00:10:00.039 +you know, they're useful for things that, you know, that + +00:10:00.040 --> 00:10:04.399 +require large domain computation, but, you know, Twitter, + +00:10:04.400 --> 00:10:08.679 +for example, didn't actually ever use any cloud services, + +00:10:08.680 --> 00:10:12.199 +external ones, because it was really important for them to + +00:10:12.200 --> 00:10:14.999 +actually own their own hardware, their own computation, + +00:10:15.000 --> 00:10:20.199 +their own thinking. Cloud services are a way to ensure + +00:10:20.200 --> 00:10:24.919 +that you're unable to reproduce a system without paying an + +00:10:24.920 --> 00:10:28.039 +amount per month, an amount per day, an amount per second, an + +00:10:28.040 --> 00:10:32.439 +amount per cycle to an external entity. I'm just going to + +00:10:32.440 --> 00:10:35.559 +conclude this with, I'd say, the argumentum ad absurdum, + +00:10:35.560 --> 00:10:39.239 +here, where large language models are all of these at once. + +00:10:39.240 --> 00:10:42.879 +They are a cloud service, specifically, and this is what + +00:10:42.880 --> 00:10:48.439 +makes them very evil, to make it so that, similar to GUI IDEs, + +00:10:48.440 --> 00:10:52.919 +so that text itself loses that ability to be both readable + +00:10:52.920 --> 00:10:56.199 +and writable. Instead, text is both unreadable, because + +00:10:56.200 --> 00:10:59.519 +it's produced by a machine, and then also unwritable, + +00:10:59.520 --> 00:11:02.999 +because you're subservient and subjugated to the machine, + +00:11:03.000 --> 00:11:05.359 +to the large language model to produce the code in the first + +00:11:05.360 --> 00:11:08.919 +place. You lose this input, output, readable, writable + +00:11:08.920 --> 00:11:13.359 +behavior that I claim text has specifically. To + +00:11:13.360 --> 00:11:19.439 +underline this, what is text? Text is local. Finally, + +00:11:19.440 --> 00:11:23.639 +we're at the subject of this conference. Emacs, I have + +00:11:23.640 --> 00:11:27.479 +double hearts with text. I start off the slide saying Emacs + +00:11:27.480 --> 00:11:31.519 +is a text editor. I think that's a good start. Which + +00:11:31.520 --> 00:11:34.319 +implements much of its own logic and user interface via + +00:11:34.320 --> 00:11:38.399 +text. What this means is that, you know, I say without + +00:11:38.400 --> 00:11:42.639 +trying, Emacs tries very hard, but without trying so hard, + +00:11:42.640 --> 00:11:47.639 +Emacs, is imbued with all of the capabilities that text has + +00:11:47.640 --> 00:11:51.319 +specifically. When you use text like Emacs does, and + +00:11:51.320 --> 00:11:55.519 +particularly you then start offering mechanisms to query, + +00:11:55.520 --> 00:11:59.999 +to transform, and to generally metaprogram text itself, + +00:12:00.000 --> 00:12:03.319 +you don't just have the ability to edit code in new ways. And + +00:12:03.320 --> 00:12:06.999 +this is something that I think is often lost, maybe not by + +00:12:07.000 --> 00:12:11.239 +participants of this conference, you particularly start + +00:12:11.240 --> 00:12:14.319 +being able to not only just edit code differently, but to + +00:12:14.320 --> 00:12:16.599 +change the way that you think about code and actually to + +00:12:16.600 --> 00:12:20.239 +expand your range of thought, the range of actions that you + +00:12:20.240 --> 00:12:22.719 +can perform. You can actually start then editing at the + +00:12:22.720 --> 00:12:25.799 +speed of thought. This is where especially Regex kind of + +00:12:25.800 --> 00:12:30.319 +comes into play. Finally, we get to the subject of the + +00:12:30.320 --> 00:12:33.599 +title of this talk. I'm about to disappoint a lot of + +00:12:33.600 --> 00:12:38.759 +people. I claim for good reason. Unfortunately, it's a + +00:12:38.760 --> 00:12:41.599 +very brief walkthrough, but I'm going to go over what the + +00:12:41.600 --> 00:12:43.799 +current Emacs Redix engine is. This is going to give us + +00:12:43.800 --> 00:12:48.119 +enough context for the next section on future directions. + +00:12:48.120 --> 00:12:51.799 +Quickly, it's a backtracking engine over a multi-byte + +00:12:51.800 --> 00:12:53.919 +code point. I'll define what that means. It's in + +00:12:53.920 --> 00:12:58.439 +regex-emacs.c. It's invoked in two ways, which you'll see + +00:12:58.440 --> 00:13:01.759 +is actually the same way, over a single contiguous string + +00:13:01.760 --> 00:13:05.359 +input. This is a Lisp string that you pass in. or over the + +00:13:05.360 --> 00:13:07.039 +two halves of the gap buffer. This is when you match + +00:13:07.040 --> 00:13:11.879 +against a buffer text. We'll go into that a little bit + +00:13:11.880 --> 00:13:13.919 +more, but this is one of the really actually interesting and + +00:13:13.920 --> 00:13:17.839 +specific things about Emacs Regex Engine as it stands. So + +00:13:17.840 --> 00:13:21.559 +very, very quickly, this is the data layout. This is just, if + +00:13:21.560 --> 00:13:24.879 +you're interested, this is where the code lies. So + +00:13:24.880 --> 00:13:30.159 +regex-emacs.h has re-pattern buffer, which is a struct + +00:13:30.160 --> 00:13:34.239 +Actually, you know, I love, by the way, I love the Emacs C + +00:13:34.240 --> 00:13:37.359 +source code. It's so nice to read. It made all this so, so + +00:13:37.360 --> 00:13:41.119 +easy. I really appreciated it. In this particular case, + +00:13:41.120 --> 00:13:44.039 +I'm just going to focus on re-pattern buffer actually has + +00:13:44.040 --> 00:13:47.999 +the compiler. It's a C struct. It has every single thing + +00:13:48.000 --> 00:13:52.559 +that is needed to execute the regular expression against a + +00:13:52.560 --> 00:13:56.319 +string input or against a buffer input. This buffer, + +00:13:56.320 --> 00:13:59.839 +It's not an Emacs buffer. It refers to just the instruction + +00:13:59.840 --> 00:14:04.039 +table and the match loop. Again, this is very, very + +00:14:04.040 --> 00:14:07.839 +brief, but I want to specifically focus on the first part. So + +00:14:07.840 --> 00:14:11.879 +this is this inner matching loop, and there's a prologue, + +00:14:11.880 --> 00:14:15.679 +and then there's a loop body, and there's an epilogue. And + +00:14:15.680 --> 00:14:18.279 +the prologue is the really, really interesting part. I say + +00:14:18.280 --> 00:14:22.839 +extract current and next char. What Emacs does here, it + +00:14:22.840 --> 00:14:27.159 +doesn't just reach for the next byte. It actually will + +00:14:27.160 --> 00:14:31.879 +perform lazily in some sense, this variable integer size + +00:14:31.880 --> 00:14:36.039 +VAR decoding for multi-byte, and it'll actually then + +00:14:36.040 --> 00:14:43.959 +decode the next one to four bytes. Up to 32 bits at once, and + +00:14:43.960 --> 00:14:46.799 +then it'll actually go into the loop. We'll talk about the + +00:14:46.800 --> 00:14:52.519 +implications of that later. Next, in the body of the loop, we + +00:14:52.520 --> 00:14:54.239 +read the instruction from the instruction pointer, which + +00:14:54.240 --> 00:14:57.319 +is, again, in that buffer field. Then we have this big + +00:14:57.320 --> 00:14:59.479 +switch statement, which is actually, love a big switch + +00:14:59.480 --> 00:15:02.079 +statement, super easy to read, super easy to understand + +00:15:02.080 --> 00:15:05.399 +kind of what's occurring. Then that's the loop body. And + +00:15:05.400 --> 00:15:08.279 +then at the end of it, we either increment the instruction + +00:15:08.280 --> 00:15:11.119 +pointer if it was matching a single character or something + +00:15:11.120 --> 00:15:14.839 +along those lines, or if it was a jump, we don't do that. A + +00:15:14.840 --> 00:15:18.199 +jump, however, it's not referring to a jump in the sense of a + +00:15:18.200 --> 00:15:22.519 +go-to, but a jump that's elsewhere within that table, that + +00:15:22.520 --> 00:15:25.839 +buffer field. If you've included a capture, we write + +00:15:25.840 --> 00:15:29.479 +that end position there. Of course, well, as you may + +00:15:29.480 --> 00:15:34.439 +recall, the zeroth capture is, of course, the entire match + +00:15:34.440 --> 00:15:36.559 +string. If the capture is zero, then we know we've + +00:15:36.560 --> 00:15:39.839 +actually completed that match. That's really great. + +00:15:39.840 --> 00:15:43.599 +I would love to receive Q&A about this as well. I've spent a + +00:15:43.600 --> 00:15:46.719 +lot of time kind of learning and understanding it. But it's + +00:15:46.720 --> 00:15:49.879 +really interesting that this can be described in a single + +00:15:49.880 --> 00:15:52.159 +slide because it's really simple. That simplicity is + +00:15:52.160 --> 00:15:54.639 +actually a really powerful thing. I'll mention that in + +00:15:54.640 --> 00:15:58.759 +the next section. I say, is that all? And I apologize for + +00:15:58.760 --> 00:16:02.239 +not doing so. But please, please ask questions in Q&A or + +00:16:02.240 --> 00:16:04.999 +message me about this, because I think it's really, really, + +00:16:05.000 --> 00:16:07.079 +again, interesting. Again, I find the code relatively + +00:16:07.080 --> 00:16:11.999 +easy to read. Now, here's, I think this is actually the + +00:16:12.000 --> 00:16:15.519 +point of the talk. The rest of it was, you know, I think just me + +00:16:15.520 --> 00:16:18.839 +posturing. This is the really, really interesting part. + +00:16:18.840 --> 00:16:22.039 +This is the ways that we can improve, well, not just we can + +00:16:22.040 --> 00:16:25.839 +improve stuff in Emacs, but why those are the right things to + +00:16:25.840 --> 00:16:30.279 +improve. Then also how that can be a model for even things + +00:16:30.280 --> 00:16:35.079 +outside of Emacs. This is gonna be a lot of text. I'm not + +00:16:35.080 --> 00:16:38.879 +gonna go through all of it. This is the one thing that I tried. + +00:16:38.880 --> 00:16:42.239 +This is the thing that I thought would be a slam dunk, easy + +00:16:42.240 --> 00:16:47.439 +solution. My initial thought process was, well, We tried + +00:16:47.440 --> 00:16:52.919 +very hard to do an LRU cache here. It works. It's actually + +00:16:52.920 --> 00:16:57.399 +very effective. However, though, we don't actually give + +00:16:57.400 --> 00:17:00.479 +the user, the list programmer, the ability to then say, I + +00:17:00.480 --> 00:17:03.079 +know that this regex is something that is going to be used + +00:17:03.080 --> 00:17:06.399 +again. I made an artificial benchmark. I made an + +00:17:06.400 --> 00:17:10.039 +artificial benchmark because I wanted to show there is one + +00:17:10.040 --> 00:17:13.639 +very specific case that it does solve, but it's the same + +00:17:13.640 --> 00:17:16.919 +issue with the artificial benchmarks. mentioned earlier. + +00:17:16.920 --> 00:17:21.559 +It's very specifically crafted in order to show that this + +00:17:21.560 --> 00:17:25.319 +particular solution would produce some speedup. What + +00:17:25.320 --> 00:17:29.599 +this means is it just creates more than 20 regexps in a row. It + +00:17:29.600 --> 00:17:31.959 +compiles them. Then, of course, because we just don't + +00:17:31.960 --> 00:17:35.159 +pay the compile costs, because we don't go through that + +00:17:35.160 --> 00:17:39.079 +cache eviction process, it ends up being faster. But this + +00:17:39.080 --> 00:17:42.079 +isn't really mean very much, particularly the goal here, + +00:17:42.080 --> 00:17:45.559 +you know, the goal would have been to show that the compile + +00:17:45.560 --> 00:17:48.359 +cache is actually causing the performance issue in + +00:17:48.360 --> 00:17:51.359 +comparison to pre-compiling it. That's not something + +00:17:51.360 --> 00:17:56.039 +I've been able to show. Match over bytes, not cars. So + +00:17:56.040 --> 00:17:59.079 +this is when I said at the beginning, oh, I came in and I think, + +00:17:59.080 --> 00:18:02.079 +oh, we can just use modern regex engine techniques. This is + +00:18:02.080 --> 00:18:05.239 +really what I meant. In particular, I mentioned in this + +00:18:05.240 --> 00:18:09.279 +match loop here that there's this, prolog that does this + +00:18:09.280 --> 00:18:13.359 +varring decoding. What this means is that every single + +00:18:13.360 --> 00:18:18.519 +iteration of that loop is going to be interspersed with this + +00:18:18.520 --> 00:18:21.919 +not being able to read a fixed number of bytes, but a variable + +00:18:21.920 --> 00:18:24.359 +number of bytes just depending upon the Unicode character + +00:18:24.360 --> 00:18:27.039 +or the Unicode code point or the multibyte code point. So + +00:18:27.040 --> 00:18:29.799 +this ends up, again, being relatively difficult to + +00:18:29.800 --> 00:18:32.919 +optimize because processors operate over bytes and not + +00:18:32.920 --> 00:18:38.479 +over code points. Yes, we might consider a multi-byte CPU at + +00:18:38.480 --> 00:18:41.039 +some point. But this is a really, really simple thing. It's + +00:18:41.040 --> 00:18:44.999 +just generating automata that operate over bytes as + +00:18:45.000 --> 00:18:48.839 +opposed to code points. This kind of goes into the much more + +00:18:48.840 --> 00:18:51.839 +abstract one. There's a lot of text here, and we're not + +00:18:51.840 --> 00:18:56.159 +going to go into it. But the really, really important point + +00:18:56.160 --> 00:18:57.999 +that I'm specifically mentioning here is this explicit + +00:18:58.000 --> 00:19:02.079 +control over linguistic complexity. That's the + +00:19:02.080 --> 00:19:06.159 +abstract kind of point. I want to introduce the inputs and + +00:19:06.160 --> 00:19:11.279 +the outputs. Basically, when you perform a search, or a + +00:19:11.280 --> 00:19:14.799 +match, or a parse, those are different tasks. They'll + +00:19:14.800 --> 00:19:17.799 +have different expected inputs and different desired + +00:19:17.800 --> 00:19:21.559 +outputs. Right now, Emacs, the API for the regular + +00:19:21.560 --> 00:19:24.919 +expression engine and for matching, It doesn't allow + +00:19:24.920 --> 00:19:27.959 +specialization on this. Or rather, if we do specialize on + +00:19:27.960 --> 00:19:30.999 +particular inputs, if we have a heuristic to check if a regex + +00:19:31.000 --> 00:19:33.519 +is actually a literal string, that's not something that the + +00:19:33.520 --> 00:19:36.959 +user actually has control over. For example, you can make + +00:19:36.960 --> 00:19:38.999 +a mistake escaping something, and then you don't have a + +00:19:39.000 --> 00:19:42.039 +literal, and then you accidentally have behavior that you + +00:19:42.040 --> 00:19:44.279 +totally didn't expect. Not just correctness issues, but + +00:19:44.280 --> 00:19:48.599 +also performance issues. I really like this one. I like + +00:19:48.600 --> 00:19:52.239 +this a lot, because I didn't think of it at all. I think it's + +00:19:52.240 --> 00:19:58.119 +better than in all of my ideas. This was proposed, at least + +00:19:58.120 --> 00:20:01.839 +to me, by Pip Cet, and I really hope that I'm pronouncing your + +00:20:01.840 --> 00:20:04.479 +name correctly. I'm sorry I didn't ask you beforehand, + +00:20:04.480 --> 00:20:08.399 +emacs-devel. In particular, this was after a couple of + +00:20:08.400 --> 00:20:11.999 +responses where I was trying to say, oh, I want to give the + +00:20:12.000 --> 00:20:15.879 +list programmer, way back in here, I want to give the list + +00:20:15.880 --> 00:20:20.559 +programmer the ability to control compilation in some + +00:20:20.560 --> 00:20:25.759 +sense. you know, he mentioned, I think he is correct, you + +00:20:25.760 --> 00:20:28.439 +know, there's no real introspection. That happens + +00:20:28.440 --> 00:20:33.119 +because it's written in C. I was thinking, oh, if I turn + +00:20:33.120 --> 00:20:35.639 +this into a list object that gives the list programmer the + +00:20:35.640 --> 00:20:40.039 +power and the ability to do more with that, but it doesn't + +00:20:40.040 --> 00:20:42.839 +actually because it's still in C. At first, I was + +00:20:42.840 --> 00:20:46.679 +thinking, oh, we can make the C part more flexible. But + +00:20:46.680 --> 00:20:50.039 +actually, especially if we want to do almost any of the + +00:20:50.040 --> 00:20:52.719 +things we previously mentioned, I think basically that + +00:20:52.720 --> 00:20:56.599 +this is... I think that if I'm not going to do it, somebody + +00:20:56.600 --> 00:20:58.879 +else really should do it, and I think we should maybe even do + +00:20:58.880 --> 00:21:01.519 +it together, because I think this is really, I think, how we + +00:21:01.520 --> 00:21:04.079 +can start experimenting, and not just experimenting, but + +00:21:04.080 --> 00:21:07.039 +because, as mentioned here, we have libgccjit, we have the + +00:21:07.040 --> 00:21:09.519 +native compiler, we have the ability to opt, like, + +00:21:09.520 --> 00:21:12.639 +specifically to generate specific code for this, so why not + +00:21:12.640 --> 00:21:15.919 +implement the or a Redix engine itself in list, And this + +00:21:15.920 --> 00:21:18.359 +gives us the ability to introspect it. That's one of the + +00:21:18.360 --> 00:21:20.759 +things I mentioned at the beginning. But it actually gives + +00:21:20.760 --> 00:21:23.879 +us the ability to then actually look at all the previous + +00:21:23.880 --> 00:21:28.159 +implementations, to explicitly compile beforehand, to + +00:21:28.160 --> 00:21:32.519 +match against bytes, to specialize and dispatch based upon + +00:21:32.520 --> 00:21:36.799 +input and output. This is something that, you know, it's + +00:21:36.800 --> 00:21:37.999 +super simple. + +00:21:38.000 --> 00:21:40.799 +It's really smart. I'm really, really glad that Pip + +00:21:40.800 --> 00:21:44.839 +mentioned this because it is, I think, the right way to solve + +00:21:44.840 --> 00:21:49.879 +the rest of it. We're at the final section. I talked a + +00:21:49.880 --> 00:21:52.679 +lot about, you know, kind of abstract, you know, thoughts. + +00:21:52.680 --> 00:21:55.679 +I talked a little about, you know, specific solutions. + +00:21:55.680 --> 00:21:59.999 +But I especially talked about, you know, what is Regex and + +00:22:00.000 --> 00:22:02.959 +Emacs? And I don't know if I had a lot of specific examples of + +00:22:02.960 --> 00:22:06.079 +it. I'm going to just describe kind of my, I guess, + +00:22:06.080 --> 00:22:09.799 +motivation, my impetus. Then I think something that's + +00:22:09.800 --> 00:22:12.639 +really something to chew on for the future. Do I have any + +00:22:12.640 --> 00:22:15.799 +concrete examples? Yes. Well, you can decide if they're + +00:22:15.800 --> 00:22:22.799 +concrete. Or am I just posturing? Also, yes. helm, rg. Helm, + +00:22:22.800 --> 00:22:27.679 +Erg, it's literally just M-x grep, it uses ripgrep, which + +00:22:27.680 --> 00:22:31.999 +is written by the same author of the Rust regex [??]. It + +00:22:32.000 --> 00:22:36.199 +happens to be very, very fast. In particular, I use this tool + +00:22:36.200 --> 00:22:39.319 +with ripgrep on the Twitter monorepo, and I was able to + +00:22:39.320 --> 00:22:42.559 +search very, very large amounts of code that was on my local + +00:22:42.560 --> 00:22:46.399 +machine using regular expressions. I think this is one + +00:22:46.400 --> 00:22:49.199 +thing that I think is really, really important, because + +00:22:49.200 --> 00:22:52.919 +when you want to scale, People say the word scaling and they + +00:22:52.920 --> 00:22:56.719 +assume there's a specific kind of answer for that. I've + +00:22:56.720 --> 00:23:01.679 +just found that text is not only flexible, it's actually + +00:23:01.680 --> 00:23:04.359 +something that can be more performant than the alternative + +00:23:04.360 --> 00:23:07.399 +and not only more performant, but more productive. It's + +00:23:07.400 --> 00:23:10.359 +again, it's just M-x grep using ripgrep. There's a tool + +00:23:10.360 --> 00:23:12.719 +deadgrep by Wilfred Hughes, which is also fantastic. I + +00:23:12.720 --> 00:23:15.759 +think it's actually better than this, but this one's mine so + +00:23:15.760 --> 00:23:19.199 +I can mess around with it. But this tool is kind of why, + +00:23:19.200 --> 00:23:21.799 +especially I started looking into Emacs and looking into + +00:23:21.800 --> 00:23:24.919 +changing the way that, or at least diving into how the + +00:23:24.920 --> 00:23:27.559 +regular expression matching actually kind of works, both + +00:23:27.560 --> 00:23:30.359 +in Emacs and then in ripgrep. We'll go to the next one. + +00:23:30.360 --> 00:23:34.119 +This is something that does exist and continues to exist. + +00:23:34.120 --> 00:23:36.799 +This is something that doesn't quite exist yet. I'm + +00:23:36.800 --> 00:23:41.359 +calling it telepathy grams. It's, you know, it's the name, + +00:23:41.360 --> 00:23:44.719 +and it's very, you know, it doesn't work, but it's a code + +00:23:44.720 --> 00:23:47.919 +search tool that, in this case, precompiles the database to + +00:23:47.920 --> 00:23:51.879 +execute NFAs against. I was thinking, how can I beat And + +00:23:51.880 --> 00:23:55.039 +the first thing I thought is, well, as I have worked on build + +00:23:55.040 --> 00:23:57.759 +tools, especially in monorepos, one of the things that the + +00:23:57.760 --> 00:24:00.799 +pants build tool from Twitter does is it uses a file watcher + +00:24:00.800 --> 00:24:04.239 +to ensure that instead of having to constantly read in the + +00:24:04.240 --> 00:24:10.079 +entire contents of a file, which may be very, very large, it + +00:24:10.080 --> 00:24:13.679 +only does so when the file has been changed. Finally, I + +00:24:13.680 --> 00:24:16.919 +want to conclude on this note, which is just that the stuff I + +00:24:16.920 --> 00:24:20.839 +didn't learn from emacs devel, I learned from Paul + +00:24:20.840 --> 00:24:25.319 +Wankadia, Jr., who is the RE2 maintainer, and he taught me + +00:24:25.320 --> 00:24:32.399 +quite a lot from 2023 to 2024. I'm thankful for the time + +00:24:32.400 --> 00:24:37.959 +that I learned from you, so thank you, Paul. With that, we're + +00:24:37.960 --> 00:24:42.759 +at point-max. Call me, beat me, if you want to reach me and or + +00:24:42.760 --> 00:24:45.839 +hire me. These are places that you can reach me at. There are + +00:24:45.840 --> 00:24:49.719 +probably others. Feel free to suggest other ways to contact + +00:24:49.720 --> 00:24:53.199 +me. But for now, this is the end. Thank you so much for your + +00:24:53.200 --> 00:24:56.080 +time. I really appreciate it. -- cgit v1.2.3