WEBVTT captioned by sachac 00:00:00.000 --> 00:00:13.359 Hello, I'm Danny McClanahan. This is EmacsConf 2024. And 00:00:13.360 --> 00:00:17.159 this presentation is ostensibly about Emacs Regex 00:00:17.160 --> 00:00:22.639 compilation. But it'll lead a lot more in future 00:00:22.640 --> 00:00:30.879 directions. Thanks for coming on this journey with me. 00:00:30.880 --> 00:00:36.719 This presentation is 50 slides, 50 footnotes, and that's 00:00:36.720 --> 00:00:40.679 intended for it to be a resource later on for your perusal. We 00:00:40.680 --> 00:00:44.199 are unfortunately not going to be able to go into all of it, 00:00:44.200 --> 00:00:49.439 but I will try to be within 20 minutes so we can make it 00:00:49.440 --> 00:00:56.199 throughout Q&A. This is the structure of the talk. 00:00:56.200 --> 00:01:03.519 But enough about me. Who are you? And why are you here? 00:01:03.520 --> 00:01:09.479 I'm Danny McClanahan. 00:01:09.480 --> 00:01:13.439 My experience is a lot in build tools, especially in the 00:01:13.440 --> 00:01:19.399 package managers. That started because I realized I was 00:01:19.400 --> 00:01:23.319 wasting a lot of time. Then I didn't like that. I 00:01:23.320 --> 00:01:29.439 started wasting a lot of time, trying to avoid wasting time. 00:01:29.440 --> 00:01:35.479 Then I ended up... going so far around that I ended up 00:01:35.480 --> 00:01:40.319 stopping other people from wasting their own time, in this 00:01:40.320 --> 00:01:44.359 case, regarding failing builds. But this is a kind of 00:01:44.360 --> 00:01:47.479 pattern that you'll see. I'm talking a lot about patterns in 00:01:47.480 --> 00:01:52.399 this presentation. Parsing in text is another one of 00:01:52.400 --> 00:01:57.479 those tendencies that I have. Why am I here? I've got a lot 00:01:57.480 --> 00:02:00.639 of feelings about text. For the next 20 minutes, I'm 00:02:00.640 --> 00:02:06.079 making it your problem. 00:02:06.080 --> 00:02:09.639 First off, a huge shout out to Emacs Devel and the Emacs 00:02:09.640 --> 00:02:12.919 community in general. I spent a lot of time learning about 00:02:12.920 --> 00:02:15.559 what I'm about to talk about. I was definitely super 00:02:15.560 --> 00:02:19.439 confused at first. Then when I became less confused and I 00:02:19.440 --> 00:02:23.919 decided I was going to look at the regular expressions of the 00:02:23.920 --> 00:02:28.039 Regex engine, I was like, oh, it's old C code. It's 00:02:28.040 --> 00:02:33.559 Emacs. We can just use modern techniques. Turns out that's 00:02:33.560 --> 00:02:37.839 wrong for kind of two reasons. One, because using modern 00:02:37.840 --> 00:02:41.479 techniques or other engines don't necessarily do what 00:02:41.480 --> 00:02:44.799 Emacs regex engine currently does. Then secondarily, 00:02:44.800 --> 00:02:48.719 that's not actually as interesting as the other kind of 00:02:48.720 --> 00:02:52.359 larger goals that emacs-devel discussed. Thank you, Eli 00:02:52.360 --> 00:02:56.279 Zaretskii, so, so much, especially Pip Cet and everyone else 00:02:56.280 --> 00:02:59.319 as well--I believe--Pip Cet, I hope I'm pronouncing that 00:02:59.320 --> 00:03:01.799 correctly. Thank you so much. I'll be shouting you out 00:03:01.800 --> 00:03:04.319 later as well. Then these larger goals ended up 00:03:04.320 --> 00:03:07.199 overlapping a lot with my own research interests. And 00:03:07.200 --> 00:03:09.879 that's very exciting. I'm hoping it's exciting for you 00:03:09.880 --> 00:03:14.079 too. What is a regular expression? And when and how does 00:03:14.080 --> 00:03:16.559 implementation match formal theory? So what does formal 00:03:16.560 --> 00:03:24.079 theory mean? And we'll talk about that. 00:03:24.080 --> 00:03:27.519 What is a regular expression? So I might ask you this 00:03:27.520 --> 00:03:30.799 question, and you might give an answer. Then I might ask 00:03:30.800 --> 00:03:33.519 someone else, and they might have an answer. Then I might 00:03:33.520 --> 00:03:38.039 ask myself, and I might try to think of an answer. Our 00:03:38.040 --> 00:03:41.799 answers would, you know, see, the thing is, they'd all be 00:03:41.800 --> 00:03:45.359 correct, but they'd probably be slightly different, and 00:03:45.360 --> 00:03:50.319 they'd be different in kind of important ways. I'm 00:03:50.320 --> 00:03:55.039 using formal theory to kind of describe what unifies these 00:03:55.040 --> 00:04:00.119 interpretations and what causes this sort of divergence, 00:04:00.120 --> 00:04:05.439 both over time and then across code bases. I'm kind of 00:04:05.440 --> 00:04:09.319 putting a flag in the ground here and saying formal theory is 00:04:09.320 --> 00:04:12.999 actually a really, really negative influence, I think, but 00:04:13.000 --> 00:04:15.999 it can be better. That's what I'm going to talk about in 00:04:16.000 --> 00:04:19.519 this talk, in this presentation. We might ask, how did 00:04:19.520 --> 00:04:26.679 this happen? and we might try to find a start state. We 00:04:26.680 --> 00:04:30.519 might put that place at the theories of formal languages 00:04:30.520 --> 00:04:34.679 that kind of arose, especially post Turing and post 00:04:34.680 --> 00:04:37.519 Chomsky. Especially there was this really, really 00:04:37.520 --> 00:04:40.119 interesting and powerful relationship with formal 00:04:40.120 --> 00:04:43.959 languages between representation and computation. And 00:04:43.960 --> 00:04:48.599 then on top of that, we have regex as this really powerful 00:04:48.600 --> 00:04:52.159 union of theory and practice And then, like I mentioned, 00:04:52.160 --> 00:04:55.799 this is kind of divergence that kind of occurs. This 00:04:55.800 --> 00:04:58.079 divergence happens for a good reason. This happens because 00:04:58.080 --> 00:05:01.999 people were adding implementations and people adding 00:05:02.000 --> 00:05:04.639 features to implementations. While the people adding 00:05:04.640 --> 00:05:06.679 these features were often academics, they were 00:05:06.680 --> 00:05:09.199 industries, people that were hobbyists, they were 00:05:09.200 --> 00:05:11.999 interested in building practical tools. This is a good 00:05:12.000 --> 00:05:14.879 thing. This is still a good thing, even though it moves a 00:05:14.880 --> 00:05:18.199 little bit away from formal theory. But we start seeing some 00:05:18.200 --> 00:05:22.639 cracks developing, and we'll go into that in a second. We're 00:05:22.640 --> 00:05:27.519 just going to kind of electric slide into the 1980s here, and 00:05:27.520 --> 00:05:31.879 we're going to be confronted with two occurrences very 00:05:31.880 --> 00:05:35.639 similarly. We might call it simultaneous discovery. In 00:05:35.640 --> 00:05:38.559 1983, you have Michael Jackson demonstrating the 00:05:38.560 --> 00:05:41.999 moonwalk. Three years later, we have backtracking 00:05:42.000 --> 00:05:44.999 developed to stimulate EGREP-style regular expressions. 00:05:45.000 --> 00:05:48.599 These would both be incredibly influential in their own 00:05:48.600 --> 00:05:54.039 kind of branching paths. Here's where the gloves come 00:05:54.040 --> 00:06:00.759 off. Formal theory, I claim, remains largely concerned 00:06:00.760 --> 00:06:03.359 with incremental improvements to artificial benchmarks, 00:06:03.360 --> 00:06:07.279 and much less with expanding models to cover actual user 00:06:07.280 --> 00:06:11.799 needs. This isn't just about, oh, if you listened to 00:06:11.800 --> 00:06:15.999 users, that you'd be a nicer person, you'd be a better 00:06:16.000 --> 00:06:19.359 engineer. What I'm actually saying is that they're missing 00:06:19.360 --> 00:06:23.919 out. When you don't listen to applications, you miss out on a 00:06:23.920 --> 00:06:26.639 lot of fantastic opportunities for novel theory. So 00:06:26.640 --> 00:06:30.839 this is, again, my complaint with formal theory as it 00:06:30.840 --> 00:06:34.599 stands. But we're gonna do better. Before we get better, 00:06:34.600 --> 00:06:36.959 we're gonna get, a little bit worse for a bit. We're going to 00:06:36.960 --> 00:06:40.359 actually get a little bit worse is better. What I mean by 00:06:40.360 --> 00:06:43.239 that is, by the 1990s, we start looking into these 00:06:43.240 --> 00:06:46.479 non-backtracking engines. This is a bit of a reaction to 00:06:46.480 --> 00:06:50.399 backtracking. The current ones include RE2, 00:06:50.400 --> 00:06:53.919 hyperscan, and the rust regex library. These are all 00:06:53.920 --> 00:06:56.439 great. I'll talk about them later as well. They make use 00:06:56.440 --> 00:06:58.719 of these. They kind of call back to the earlier formal 00:06:58.720 --> 00:07:01.479 theory. They have linear runtimes for well-specified 00:07:01.480 --> 00:07:02.519 search tasks. 00:07:02.520 --> 00:07:08.079 What happens if that doesn't fit your needs? We're going to 00:07:08.080 --> 00:07:11.479 talk about that. We're going to table that for a second, 00:07:11.480 --> 00:07:15.319 and we're going to focus more on Emacs, the subject of this 00:07:15.320 --> 00:07:19.359 conference. What are regex used for? And in this 00:07:19.360 --> 00:07:22.439 particular case, they're used for lots of things, with 00:07:22.440 --> 00:07:25.199 practically, and I think they should be. But more 00:07:25.200 --> 00:07:29.559 specifically, how do Emacs users use them? And I'm going to 00:07:29.560 --> 00:07:32.679 focus in on this text as input and output. I'll be kind of 00:07:32.680 --> 00:07:38.959 elaborating on this analogy as we continue. Why is text 00:07:38.960 --> 00:07:43.399 powerful? Text as I/O. The reason text programming 00:07:43.400 --> 00:07:45.759 languages and not just programming languages, but 00:07:45.760 --> 00:07:49.159 languages themselves, the reason why they're successful 00:07:49.160 --> 00:07:52.279 and why they propagate, I claim, is because text is both 00:07:52.280 --> 00:07:56.439 input readable and output writable. What this means 00:07:56.440 --> 00:08:01.199 is that if you receive something in text, you can read it, And 00:08:01.200 --> 00:08:04.239 then you can also write it, you can modify it, and you can 00:08:04.240 --> 00:08:06.959 produce a new version of it. You're on a kind of level 00:08:06.960 --> 00:08:10.959 playing field. That's not always the case, though. You 00:08:10.960 --> 00:08:15.959 recall that I've worked a lot with build systems and package 00:08:15.960 --> 00:08:20.999 managers. There's a discussion that goes by the name of 00:08:21.000 --> 00:08:25.319 software supply chain security. I think it's a massive 00:08:25.320 --> 00:08:29.079 joke. The reason why is because people largely raise it 00:08:29.080 --> 00:08:34.279 to explain why their for-profit company with their 00:08:34.280 --> 00:08:38.079 for-profit product is going to solve the problem for you, as 00:08:38.080 --> 00:08:41.959 opposed to the commons of open source. If you are unable to 00:08:41.960 --> 00:08:44.999 modify or deploy your code without employing an opaque 00:08:45.000 --> 00:08:48.599 external system, I think, then you have a hidden 00:08:48.600 --> 00:08:53.879 dependency. you don't remove a dependency, you just, by, 00:08:53.880 --> 00:08:59.239 for example, paying into a for-profit product or using a 00:08:59.240 --> 00:09:01.519 closed-off supply chain, you end up just having a hidden 00:09:01.520 --> 00:09:04.719 dependency, you end up just displacing that. This can 00:09:04.720 --> 00:09:07.639 actually exert arbitrary control over your programming 00:09:07.640 --> 00:09:11.279 output and potentially even your thoughts. This is really 00:09:11.280 --> 00:09:15.839 important. I'm going to dive in a little bit deeper and I'm 00:09:15.840 --> 00:09:18.999 going to overload the term locality here. I'm going to 00:09:19.000 --> 00:09:22.239 say, if you cannot reproduce a system locally, it becomes an 00:09:22.240 --> 00:09:24.999 opaque external system. I'm going to give examples 00:09:25.000 --> 00:09:27.479 here, and these are going to be a bit of a hot take. First 00:09:27.480 --> 00:09:30.519 off, GUI IDEs. I think we might, well, some of us might agree 00:09:30.520 --> 00:09:34.519 with that here. I say development environments that only 00:09:34.520 --> 00:09:38.119 allow you to use a graphical interface, do not expose 00:09:38.120 --> 00:09:42.799 interaction with text, are explicitly trying to kind of 00:09:42.800 --> 00:09:46.239 place you on a separate kind of plane where you're not an 00:09:46.240 --> 00:09:50.439 equal contributor to the people who make the development 00:09:50.440 --> 00:09:53.079 environment, make the development kind of frameworks 00:09:53.080 --> 00:09:57.399 here. We'll go one further. Cloud services are precisely, 00:09:57.400 --> 00:10:00.039 you know, they're useful for things that, you know, that 00:10:00.040 --> 00:10:04.399 require large domain computation, but, you know, Twitter, 00:10:04.400 --> 00:10:08.679 for example, didn't actually ever use any cloud services, 00:10:08.680 --> 00:10:12.199 external ones, because it was really important for them to 00:10:12.200 --> 00:10:14.999 actually own their own hardware, their own computation, 00:10:15.000 --> 00:10:20.199 their own thinking. Cloud services are a way to ensure 00:10:20.200 --> 00:10:24.919 that you're unable to reproduce a system without paying an 00:10:24.920 --> 00:10:28.039 amount per month, an amount per day, an amount per second, an 00:10:28.040 --> 00:10:32.439 amount per cycle to an external entity. I'm just going to 00:10:32.440 --> 00:10:35.559 conclude this with, I'd say, the argumentum ad absurdum, 00:10:35.560 --> 00:10:39.239 here, where large language models are all of these at once. 00:10:39.240 --> 00:10:42.879 They are a cloud service, specifically, and this is what 00:10:42.880 --> 00:10:48.439 makes them very evil, to make it so that, similar to GUI IDEs, 00:10:48.440 --> 00:10:52.919 so that text itself loses that ability to be both readable 00:10:52.920 --> 00:10:56.199 and writable. Instead, text is both unreadable, because 00:10:56.200 --> 00:10:59.519 it's produced by a machine, and then also unwritable, 00:10:59.520 --> 00:11:02.999 because you're subservient and subjugated to the machine, 00:11:03.000 --> 00:11:05.359 to the large language model to produce the code in the first 00:11:05.360 --> 00:11:08.919 place. You lose this input, output, readable, writable 00:11:08.920 --> 00:11:13.359 behavior that I claim text has specifically. To 00:11:13.360 --> 00:11:19.439 underline this, what is text? Text is local. Finally, 00:11:19.440 --> 00:11:23.639 we're at the subject of this conference. Emacs, I have 00:11:23.640 --> 00:11:27.479 double hearts with text. I start off the slide saying Emacs 00:11:27.480 --> 00:11:31.519 is a text editor. I think that's a good start. Which 00:11:31.520 --> 00:11:34.319 implements much of its own logic and user interface via 00:11:34.320 --> 00:11:38.399 text. What this means is that, you know, I say without 00:11:38.400 --> 00:11:42.639 trying, Emacs tries very hard, but without trying so hard, 00:11:42.640 --> 00:11:47.639 Emacs, is imbued with all of the capabilities that text has 00:11:47.640 --> 00:11:51.319 specifically. When you use text like Emacs does, and 00:11:51.320 --> 00:11:55.519 particularly you then start offering mechanisms to query, 00:11:55.520 --> 00:11:59.999 to transform, and to generally metaprogram text itself, 00:12:00.000 --> 00:12:03.319 you don't just have the ability to edit code in new ways. And 00:12:03.320 --> 00:12:06.999 this is something that I think is often lost, maybe not by 00:12:07.000 --> 00:12:11.239 participants of this conference, you particularly start 00:12:11.240 --> 00:12:14.319 being able to not only just edit code differently, but to 00:12:14.320 --> 00:12:16.599 change the way that you think about code and actually to 00:12:16.600 --> 00:12:20.239 expand your range of thought, the range of actions that you 00:12:20.240 --> 00:12:22.719 can perform. You can actually start then editing at the 00:12:22.720 --> 00:12:25.799 speed of thought. This is where especially Regex kind of 00:12:25.800 --> 00:12:30.319 comes into play. Finally, we get to the subject of the 00:12:30.320 --> 00:12:33.599 title of this talk. I'm about to disappoint a lot of 00:12:33.600 --> 00:12:38.759 people. I claim for good reason. Unfortunately, it's a 00:12:38.760 --> 00:12:41.599 very brief walkthrough, but I'm going to go over what the 00:12:41.600 --> 00:12:43.799 current Emacs Redix engine is. This is going to give us 00:12:43.800 --> 00:12:48.119 enough context for the next section on future directions. 00:12:48.120 --> 00:12:51.799 Quickly, it's a backtracking engine over a multi-byte 00:12:51.800 --> 00:12:53.919 code point. I'll define what that means. It's in 00:12:53.920 --> 00:12:58.439 regex-emacs.c. It's invoked in two ways, which you'll see 00:12:58.440 --> 00:13:01.759 is actually the same way, over a single contiguous string 00:13:01.760 --> 00:13:05.359 input. This is a Lisp string that you pass in. or over the 00:13:05.360 --> 00:13:07.039 two halves of the gap buffer. This is when you match 00:13:07.040 --> 00:13:11.879 against a buffer text. We'll go into that a little bit 00:13:11.880 --> 00:13:13.919 more, but this is one of the really actually interesting and 00:13:13.920 --> 00:13:17.839 specific things about Emacs Regex Engine as it stands. So 00:13:17.840 --> 00:13:21.559 very, very quickly, this is the data layout. This is just, if 00:13:21.560 --> 00:13:24.879 you're interested, this is where the code lies. So 00:13:24.880 --> 00:13:30.159 regex-emacs.h has re-pattern buffer, which is a struct 00:13:30.160 --> 00:13:34.239 Actually, you know, I love, by the way, I love the Emacs C 00:13:34.240 --> 00:13:37.359 source code. It's so nice to read. It made all this so, so 00:13:37.360 --> 00:13:41.119 easy. I really appreciated it. In this particular case, 00:13:41.120 --> 00:13:44.039 I'm just going to focus on re-pattern buffer actually has 00:13:44.040 --> 00:13:47.999 the compiler. It's a C struct. It has every single thing 00:13:48.000 --> 00:13:52.559 that is needed to execute the regular expression against a 00:13:52.560 --> 00:13:56.319 string input or against a buffer input. This buffer, 00:13:56.320 --> 00:13:59.839 It's not an Emacs buffer. It refers to just the instruction 00:13:59.840 --> 00:14:04.039 table and the match loop. Again, this is very, very 00:14:04.040 --> 00:14:07.839 brief, but I want to specifically focus on the first part. So 00:14:07.840 --> 00:14:11.879 this is this inner matching loop, and there's a prologue, 00:14:11.880 --> 00:14:15.679 and then there's a loop body, and there's an epilogue. And 00:14:15.680 --> 00:14:18.279 the prologue is the really, really interesting part. I say 00:14:18.280 --> 00:14:22.839 extract current and next char. What Emacs does here, it 00:14:22.840 --> 00:14:27.159 doesn't just reach for the next byte. It actually will 00:14:27.160 --> 00:14:31.879 perform lazily in some sense, this variable integer size 00:14:31.880 --> 00:14:36.039 VAR decoding for multi-byte, and it'll actually then 00:14:36.040 --> 00:14:43.959 decode the next one to four bytes. Up to 32 bits at once, and 00:14:43.960 --> 00:14:46.799 then it'll actually go into the loop. We'll talk about the 00:14:46.800 --> 00:14:52.519 implications of that later. Next, in the body of the loop, we 00:14:52.520 --> 00:14:54.239 read the instruction from the instruction pointer, which 00:14:54.240 --> 00:14:57.319 is, again, in that buffer field. Then we have this big 00:14:57.320 --> 00:14:59.479 switch statement, which is actually, love a big switch 00:14:59.480 --> 00:15:02.079 statement, super easy to read, super easy to understand 00:15:02.080 --> 00:15:05.399 kind of what's occurring. Then that's the loop body. And 00:15:05.400 --> 00:15:08.279 then at the end of it, we either increment the instruction 00:15:08.280 --> 00:15:11.119 pointer if it was matching a single character or something 00:15:11.120 --> 00:15:14.839 along those lines, or if it was a jump, we don't do that. A 00:15:14.840 --> 00:15:18.199 jump, however, it's not referring to a jump in the sense of a 00:15:18.200 --> 00:15:22.519 go-to, but a jump that's elsewhere within that table, that 00:15:22.520 --> 00:15:25.839 buffer field. If you've included a capture, we write 00:15:25.840 --> 00:15:29.479 that end position there. Of course, well, as you may 00:15:29.480 --> 00:15:34.439 recall, the zeroth capture is, of course, the entire match 00:15:34.440 --> 00:15:36.559 string. If the capture is zero, then we know we've 00:15:36.560 --> 00:15:39.839 actually completed that match. That's really great. 00:15:39.840 --> 00:15:43.599 I would love to receive Q&A about this as well. I've spent a 00:15:43.600 --> 00:15:46.719 lot of time kind of learning and understanding it. But it's 00:15:46.720 --> 00:15:49.879 really interesting that this can be described in a single 00:15:49.880 --> 00:15:52.159 slide because it's really simple. That simplicity is 00:15:52.160 --> 00:15:54.639 actually a really powerful thing. I'll mention that in 00:15:54.640 --> 00:15:58.759 the next section. I say, is that all? And I apologize for 00:15:58.760 --> 00:16:02.239 not doing so. But please, please ask questions in Q&A or 00:16:02.240 --> 00:16:04.999 message me about this, because I think it's really, really, 00:16:05.000 --> 00:16:07.079 again, interesting. Again, I find the code relatively 00:16:07.080 --> 00:16:11.999 easy to read. Now, here's, I think this is actually the 00:16:12.000 --> 00:16:15.519 point of the talk. The rest of it was, you know, I think just me 00:16:15.520 --> 00:16:18.839 posturing. This is the really, really interesting part. 00:16:18.840 --> 00:16:22.039 This is the ways that we can improve, well, not just we can 00:16:22.040 --> 00:16:25.839 improve stuff in Emacs, but why those are the right things to 00:16:25.840 --> 00:16:30.279 improve. Then also how that can be a model for even things 00:16:30.280 --> 00:16:35.079 outside of Emacs. This is gonna be a lot of text. I'm not 00:16:35.080 --> 00:16:38.879 gonna go through all of it. This is the one thing that I tried. 00:16:38.880 --> 00:16:42.239 This is the thing that I thought would be a slam dunk, easy 00:16:42.240 --> 00:16:47.439 solution. My initial thought process was, well, We tried 00:16:47.440 --> 00:16:52.919 very hard to do an LRU cache here. It works. It's actually 00:16:52.920 --> 00:16:57.399 very effective. However, though, we don't actually give 00:16:57.400 --> 00:17:00.479 the user, the list programmer, the ability to then say, I 00:17:00.480 --> 00:17:03.079 know that this regex is something that is going to be used 00:17:03.080 --> 00:17:06.399 again. I made an artificial benchmark. I made an 00:17:06.400 --> 00:17:10.039 artificial benchmark because I wanted to show there is one 00:17:10.040 --> 00:17:13.639 very specific case that it does solve, but it's the same 00:17:13.640 --> 00:17:16.919 issue with the artificial benchmarks. mentioned earlier. 00:17:16.920 --> 00:17:21.559 It's very specifically crafted in order to show that this 00:17:21.560 --> 00:17:25.319 particular solution would produce some speedup. What 00:17:25.320 --> 00:17:29.599 this means is it just creates more than 20 regexps in a row. It 00:17:29.600 --> 00:17:31.959 compiles them. Then, of course, because we just don't 00:17:31.960 --> 00:17:35.159 pay the compile costs, because we don't go through that 00:17:35.160 --> 00:17:39.079 cache eviction process, it ends up being faster. But this 00:17:39.080 --> 00:17:42.079 isn't really mean very much, particularly the goal here, 00:17:42.080 --> 00:17:45.559 you know, the goal would have been to show that the compile 00:17:45.560 --> 00:17:48.359 cache is actually causing the performance issue in 00:17:48.360 --> 00:17:51.359 comparison to pre-compiling it. That's not something 00:17:51.360 --> 00:17:56.039 I've been able to show. Match over bytes, not cars. So 00:17:56.040 --> 00:17:59.079 this is when I said at the beginning, oh, I came in and I think, 00:17:59.080 --> 00:18:02.079 oh, we can just use modern regex engine techniques. This is 00:18:02.080 --> 00:18:05.239 really what I meant. In particular, I mentioned in this 00:18:05.240 --> 00:18:09.279 match loop here that there's this, prolog that does this 00:18:09.280 --> 00:18:13.359 varring decoding. What this means is that every single 00:18:13.360 --> 00:18:18.519 iteration of that loop is going to be interspersed with this 00:18:18.520 --> 00:18:21.919 not being able to read a fixed number of bytes, but a variable 00:18:21.920 --> 00:18:24.359 number of bytes just depending upon the Unicode character 00:18:24.360 --> 00:18:27.039 or the Unicode code point or the multibyte code point. So 00:18:27.040 --> 00:18:29.799 this ends up, again, being relatively difficult to 00:18:29.800 --> 00:18:32.919 optimize because processors operate over bytes and not 00:18:32.920 --> 00:18:38.479 over code points. Yes, we might consider a multi-byte CPU at 00:18:38.480 --> 00:18:41.039 some point. But this is a really, really simple thing. It's 00:18:41.040 --> 00:18:44.999 just generating automata that operate over bytes as 00:18:45.000 --> 00:18:48.839 opposed to code points. This kind of goes into the much more 00:18:48.840 --> 00:18:51.839 abstract one. There's a lot of text here, and we're not 00:18:51.840 --> 00:18:56.159 going to go into it. But the really, really important point 00:18:56.160 --> 00:18:57.999 that I'm specifically mentioning here is this explicit 00:18:58.000 --> 00:19:02.079 control over linguistic complexity. That's the 00:19:02.080 --> 00:19:06.159 abstract kind of point. I want to introduce the inputs and 00:19:06.160 --> 00:19:11.279 the outputs. Basically, when you perform a search, or a 00:19:11.280 --> 00:19:14.799 match, or a parse, those are different tasks. They'll 00:19:14.800 --> 00:19:17.799 have different expected inputs and different desired 00:19:17.800 --> 00:19:21.559 outputs. Right now, Emacs, the API for the regular 00:19:21.560 --> 00:19:24.919 expression engine and for matching, It doesn't allow 00:19:24.920 --> 00:19:27.959 specialization on this. Or rather, if we do specialize on 00:19:27.960 --> 00:19:30.999 particular inputs, if we have a heuristic to check if a regex 00:19:31.000 --> 00:19:33.519 is actually a literal string, that's not something that the 00:19:33.520 --> 00:19:36.959 user actually has control over. For example, you can make 00:19:36.960 --> 00:19:38.999 a mistake escaping something, and then you don't have a 00:19:39.000 --> 00:19:42.039 literal, and then you accidentally have behavior that you 00:19:42.040 --> 00:19:44.279 totally didn't expect. Not just correctness issues, but 00:19:44.280 --> 00:19:48.599 also performance issues. I really like this one. I like 00:19:48.600 --> 00:19:52.239 this a lot, because I didn't think of it at all. I think it's 00:19:52.240 --> 00:19:58.119 better than in all of my ideas. This was proposed, at least 00:19:58.120 --> 00:20:01.839 to me, by Pip Cet, and I really hope that I'm pronouncing your 00:20:01.840 --> 00:20:04.479 name correctly. I'm sorry I didn't ask you beforehand, 00:20:04.480 --> 00:20:08.399 emacs-devel. In particular, this was after a couple of 00:20:08.400 --> 00:20:11.999 responses where I was trying to say, oh, I want to give the 00:20:12.000 --> 00:20:15.879 list programmer, way back in here, I want to give the list 00:20:15.880 --> 00:20:20.559 programmer the ability to control compilation in some 00:20:20.560 --> 00:20:25.759 sense. you know, he mentioned, I think he is correct, you 00:20:25.760 --> 00:20:28.439 know, there's no real introspection. That happens 00:20:28.440 --> 00:20:33.119 because it's written in C. I was thinking, oh, if I turn 00:20:33.120 --> 00:20:35.639 this into a list object that gives the list programmer the 00:20:35.640 --> 00:20:40.039 power and the ability to do more with that, but it doesn't 00:20:40.040 --> 00:20:42.839 actually because it's still in C. At first, I was 00:20:42.840 --> 00:20:46.679 thinking, oh, we can make the C part more flexible. But 00:20:46.680 --> 00:20:50.039 actually, especially if we want to do almost any of the 00:20:50.040 --> 00:20:52.719 things we previously mentioned, I think basically that 00:20:52.720 --> 00:20:56.599 this is... I think that if I'm not going to do it, somebody 00:20:56.600 --> 00:20:58.879 else really should do it, and I think we should maybe even do 00:20:58.880 --> 00:21:01.519 it together, because I think this is really, I think, how we 00:21:01.520 --> 00:21:04.079 can start experimenting, and not just experimenting, but 00:21:04.080 --> 00:21:07.039 because, as mentioned here, we have libgccjit, we have the 00:21:07.040 --> 00:21:09.519 native compiler, we have the ability to opt, like, 00:21:09.520 --> 00:21:12.639 specifically to generate specific code for this, so why not 00:21:12.640 --> 00:21:15.919 implement the or a Redix engine itself in list, And this 00:21:15.920 --> 00:21:18.359 gives us the ability to introspect it. That's one of the 00:21:18.360 --> 00:21:20.759 things I mentioned at the beginning. But it actually gives 00:21:20.760 --> 00:21:23.879 us the ability to then actually look at all the previous 00:21:23.880 --> 00:21:28.159 implementations, to explicitly compile beforehand, to 00:21:28.160 --> 00:21:32.519 match against bytes, to specialize and dispatch based upon 00:21:32.520 --> 00:21:36.799 input and output. This is something that, you know, it's 00:21:36.800 --> 00:21:37.999 super simple. 00:21:38.000 --> 00:21:40.799 It's really smart. I'm really, really glad that Pip 00:21:40.800 --> 00:21:44.839 mentioned this because it is, I think, the right way to solve 00:21:44.840 --> 00:21:49.879 the rest of it. We're at the final section. I talked a 00:21:49.880 --> 00:21:52.679 lot about, you know, kind of abstract, you know, thoughts. 00:21:52.680 --> 00:21:55.679 I talked a little about, you know, specific solutions. 00:21:55.680 --> 00:21:59.999 But I especially talked about, you know, what is Regex and 00:22:00.000 --> 00:22:02.959 Emacs? And I don't know if I had a lot of specific examples of 00:22:02.960 --> 00:22:06.079 it. I'm going to just describe kind of my, I guess, 00:22:06.080 --> 00:22:09.799 motivation, my impetus. Then I think something that's 00:22:09.800 --> 00:22:12.639 really something to chew on for the future. Do I have any 00:22:12.640 --> 00:22:15.799 concrete examples? Yes. Well, you can decide if they're 00:22:15.800 --> 00:22:22.799 concrete. Or am I just posturing? Also, yes. helm, rg. Helm, 00:22:22.800 --> 00:22:27.679 Erg, it's literally just M-x grep, it uses ripgrep, which 00:22:27.680 --> 00:22:31.999 is written by the same author of the Rust regex [??]. It 00:22:32.000 --> 00:22:36.199 happens to be very, very fast. In particular, I use this tool 00:22:36.200 --> 00:22:39.319 with ripgrep on the Twitter monorepo, and I was able to 00:22:39.320 --> 00:22:42.559 search very, very large amounts of code that was on my local 00:22:42.560 --> 00:22:46.399 machine using regular expressions. I think this is one 00:22:46.400 --> 00:22:49.199 thing that I think is really, really important, because 00:22:49.200 --> 00:22:52.919 when you want to scale, People say the word scaling and they 00:22:52.920 --> 00:22:56.719 assume there's a specific kind of answer for that. I've 00:22:56.720 --> 00:23:01.679 just found that text is not only flexible, it's actually 00:23:01.680 --> 00:23:04.359 something that can be more performant than the alternative 00:23:04.360 --> 00:23:07.399 and not only more performant, but more productive. It's 00:23:07.400 --> 00:23:10.359 again, it's just M-x grep using ripgrep. There's a tool 00:23:10.360 --> 00:23:12.719 deadgrep by Wilfred Hughes, which is also fantastic. I 00:23:12.720 --> 00:23:15.759 think it's actually better than this, but this one's mine so 00:23:15.760 --> 00:23:19.199 I can mess around with it. But this tool is kind of why, 00:23:19.200 --> 00:23:21.799 especially I started looking into Emacs and looking into 00:23:21.800 --> 00:23:24.919 changing the way that, or at least diving into how the 00:23:24.920 --> 00:23:27.559 regular expression matching actually kind of works, both 00:23:27.560 --> 00:23:30.359 in Emacs and then in ripgrep. We'll go to the next one. 00:23:30.360 --> 00:23:34.119 This is something that does exist and continues to exist. 00:23:34.120 --> 00:23:36.799 This is something that doesn't quite exist yet. I'm 00:23:36.800 --> 00:23:41.359 calling it telepathy grams. It's, you know, it's the name, 00:23:41.360 --> 00:23:44.719 and it's very, you know, it doesn't work, but it's a code 00:23:44.720 --> 00:23:47.919 search tool that, in this case, precompiles the database to 00:23:47.920 --> 00:23:51.879 execute NFAs against. I was thinking, how can I beat And 00:23:51.880 --> 00:23:55.039 the first thing I thought is, well, as I have worked on build 00:23:55.040 --> 00:23:57.759 tools, especially in monorepos, one of the things that the 00:23:57.760 --> 00:24:00.799 pants build tool from Twitter does is it uses a file watcher 00:24:00.800 --> 00:24:04.239 to ensure that instead of having to constantly read in the 00:24:04.240 --> 00:24:10.079 entire contents of a file, which may be very, very large, it 00:24:10.080 --> 00:24:13.679 only does so when the file has been changed. Finally, I 00:24:13.680 --> 00:24:16.919 want to conclude on this note, which is just that the stuff I 00:24:16.920 --> 00:24:20.839 didn't learn from emacs devel, I learned from Paul 00:24:20.840 --> 00:24:25.319 Wankadia, Jr., who is the RE2 maintainer, and he taught me 00:24:25.320 --> 00:24:32.399 quite a lot from 2023 to 2024. I'm thankful for the time 00:24:32.400 --> 00:24:37.959 that I learned from you, so thank you, Paul. With that, we're 00:24:37.960 --> 00:24:42.759 at point-max. Call me, beat me, if you want to reach me and or 00:24:42.760 --> 00:24:45.839 hire me. These are places that you can reach me at. There are 00:24:45.840 --> 00:24:49.719 probably others. Feel free to suggest other ways to contact 00:24:49.720 --> 00:24:53.199 me. But for now, this is the end. Thank you so much for your 00:24:53.200 --> 00:24:56.080 time. I really appreciate it.