1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
|
WEBVTT captioned by sachac
00:00:00.000 --> 00:00:13.359
Hello, I'm Danny McClanahan. This is EmacsConf 2024. And
00:00:13.360 --> 00:00:17.159
this presentation is ostensibly about Emacs Regex
00:00:17.160 --> 00:00:22.639
compilation. But it'll lead a lot more in future
00:00:22.640 --> 00:00:30.879
directions. Thanks for coming on this journey with me.
00:00:30.880 --> 00:00:36.719
This presentation is 50 slides, 50 footnotes, and that's
00:00:36.720 --> 00:00:40.679
intended for it to be a resource later on for your perusal. We
00:00:40.680 --> 00:00:44.199
are unfortunately not going to be able to go into all of it,
00:00:44.200 --> 00:00:49.439
but I will try to be within 20 minutes so we can make it
00:00:49.440 --> 00:00:56.199
throughout Q&A. This is the structure of the talk.
00:00:56.200 --> 00:01:03.519
But enough about me. Who are you? And why are you here?
00:01:03.520 --> 00:01:09.479
I'm Danny McClanahan.
00:01:09.480 --> 00:01:13.439
My experience is a lot in build tools, especially in the
00:01:13.440 --> 00:01:19.399
package managers. That started because I realized I was
00:01:19.400 --> 00:01:23.319
wasting a lot of time. Then I didn't like that. I
00:01:23.320 --> 00:01:29.439
started wasting a lot of time, trying to avoid wasting time.
00:01:29.440 --> 00:01:35.479
Then I ended up... going so far around that I ended up
00:01:35.480 --> 00:01:40.319
stopping other people from wasting their own time, in this
00:01:40.320 --> 00:01:44.359
case, regarding failing builds. But this is a kind of
00:01:44.360 --> 00:01:47.479
pattern that you'll see. I'm talking a lot about patterns in
00:01:47.480 --> 00:01:52.399
this presentation. Parsing in text is another one of
00:01:52.400 --> 00:01:57.479
those tendencies that I have. Why am I here? I've got a lot
00:01:57.480 --> 00:02:00.639
of feelings about text. For the next 20 minutes, I'm
00:02:00.640 --> 00:02:06.079
making it your problem.
00:02:06.080 --> 00:02:09.639
First off, a huge shout out to Emacs Devel and the Emacs
00:02:09.640 --> 00:02:12.919
community in general. I spent a lot of time learning about
00:02:12.920 --> 00:02:15.559
what I'm about to talk about. I was definitely super
00:02:15.560 --> 00:02:19.439
confused at first. Then when I became less confused and I
00:02:19.440 --> 00:02:23.919
decided I was going to look at the regular expressions of the
00:02:23.920 --> 00:02:28.039
Regex engine, I was like, oh, it's old C code. It's
00:02:28.040 --> 00:02:33.559
Emacs. We can just use modern techniques. Turns out that's
00:02:33.560 --> 00:02:37.839
wrong for kind of two reasons. One, because using modern
00:02:37.840 --> 00:02:41.479
techniques or other engines don't necessarily do what
00:02:41.480 --> 00:02:44.799
Emacs regex engine currently does. Then secondarily,
00:02:44.800 --> 00:02:48.719
that's not actually as interesting as the other kind of
00:02:48.720 --> 00:02:52.359
larger goals that emacs-devel discussed. Thank you, Eli
00:02:52.360 --> 00:02:56.279
Zaretskii, so, so much, especially Pip Cet and everyone else
00:02:56.280 --> 00:02:59.319
as well--I believe--Pip Cet, I hope I'm pronouncing that
00:02:59.320 --> 00:03:01.799
correctly. Thank you so much. I'll be shouting you out
00:03:01.800 --> 00:03:04.319
later as well. Then these larger goals ended up
00:03:04.320 --> 00:03:07.199
overlapping a lot with my own research interests. And
00:03:07.200 --> 00:03:09.879
that's very exciting. I'm hoping it's exciting for you
00:03:09.880 --> 00:03:14.079
too. What is a regular expression? And when and how does
00:03:14.080 --> 00:03:16.559
implementation match formal theory? So what does formal
00:03:16.560 --> 00:03:24.079
theory mean? And we'll talk about that.
00:03:24.080 --> 00:03:27.519
What is a regular expression? So I might ask you this
00:03:27.520 --> 00:03:30.799
question, and you might give an answer. Then I might ask
00:03:30.800 --> 00:03:33.519
someone else, and they might have an answer. Then I might
00:03:33.520 --> 00:03:38.039
ask myself, and I might try to think of an answer. Our
00:03:38.040 --> 00:03:41.799
answers would, you know, see, the thing is, they'd all be
00:03:41.800 --> 00:03:45.359
correct, but they'd probably be slightly different, and
00:03:45.360 --> 00:03:50.319
they'd be different in kind of important ways. I'm
00:03:50.320 --> 00:03:55.039
using formal theory to kind of describe what unifies these
00:03:55.040 --> 00:04:00.119
interpretations and what causes this sort of divergence,
00:04:00.120 --> 00:04:05.439
both over time and then across code bases. I'm kind of
00:04:05.440 --> 00:04:09.319
putting a flag in the ground here and saying formal theory is
00:04:09.320 --> 00:04:12.999
actually a really, really negative influence, I think, but
00:04:13.000 --> 00:04:15.999
it can be better. That's what I'm going to talk about in
00:04:16.000 --> 00:04:19.519
this talk, in this presentation. We might ask, how did
00:04:19.520 --> 00:04:26.679
this happen? and we might try to find a start state. We
00:04:26.680 --> 00:04:30.519
might put that place at the theories of formal languages
00:04:30.520 --> 00:04:34.679
that kind of arose, especially post Turing and post
00:04:34.680 --> 00:04:37.519
Chomsky. Especially there was this really, really
00:04:37.520 --> 00:04:40.119
interesting and powerful relationship with formal
00:04:40.120 --> 00:04:43.959
languages between representation and computation. And
00:04:43.960 --> 00:04:48.599
then on top of that, we have regex as this really powerful
00:04:48.600 --> 00:04:52.159
union of theory and practice And then, like I mentioned,
00:04:52.160 --> 00:04:55.799
this is kind of divergence that kind of occurs. This
00:04:55.800 --> 00:04:58.079
divergence happens for a good reason. This happens because
00:04:58.080 --> 00:05:01.999
people were adding implementations and people adding
00:05:02.000 --> 00:05:04.639
features to implementations. While the people adding
00:05:04.640 --> 00:05:06.679
these features were often academics, they were
00:05:06.680 --> 00:05:09.199
industries, people that were hobbyists, they were
00:05:09.200 --> 00:05:11.999
interested in building practical tools. This is a good
00:05:12.000 --> 00:05:14.879
thing. This is still a good thing, even though it moves a
00:05:14.880 --> 00:05:18.199
little bit away from formal theory. But we start seeing some
00:05:18.200 --> 00:05:22.639
cracks developing, and we'll go into that in a second. We're
00:05:22.640 --> 00:05:27.519
just going to kind of electric slide into the 1980s here, and
00:05:27.520 --> 00:05:31.879
we're going to be confronted with two occurrences very
00:05:31.880 --> 00:05:35.639
similarly. We might call it simultaneous discovery. In
00:05:35.640 --> 00:05:38.559
1983, you have Michael Jackson demonstrating the
00:05:38.560 --> 00:05:41.999
moonwalk. Three years later, we have backtracking
00:05:42.000 --> 00:05:44.999
developed to stimulate EGREP-style regular expressions.
00:05:45.000 --> 00:05:48.599
These would both be incredibly influential in their own
00:05:48.600 --> 00:05:54.039
kind of branching paths. Here's where the gloves come
00:05:54.040 --> 00:06:00.759
off. Formal theory, I claim, remains largely concerned
00:06:00.760 --> 00:06:03.359
with incremental improvements to artificial benchmarks,
00:06:03.360 --> 00:06:07.279
and much less with expanding models to cover actual user
00:06:07.280 --> 00:06:11.799
needs. This isn't just about, oh, if you listened to
00:06:11.800 --> 00:06:15.999
users, that you'd be a nicer person, you'd be a better
00:06:16.000 --> 00:06:19.359
engineer. What I'm actually saying is that they're missing
00:06:19.360 --> 00:06:23.919
out. When you don't listen to applications, you miss out on a
00:06:23.920 --> 00:06:26.639
lot of fantastic opportunities for novel theory. So
00:06:26.640 --> 00:06:30.839
this is, again, my complaint with formal theory as it
00:06:30.840 --> 00:06:34.599
stands. But we're gonna do better. Before we get better,
00:06:34.600 --> 00:06:36.959
we're gonna get, a little bit worse for a bit. We're going to
00:06:36.960 --> 00:06:40.359
actually get a little bit worse is better. What I mean by
00:06:40.360 --> 00:06:43.239
that is, by the 1990s, we start looking into these
00:06:43.240 --> 00:06:46.479
non-backtracking engines. This is a bit of a reaction to
00:06:46.480 --> 00:06:50.399
backtracking. The current ones include RE2,
00:06:50.400 --> 00:06:53.919
hyperscan, and the rust regex library. These are all
00:06:53.920 --> 00:06:56.439
great. I'll talk about them later as well. They make use
00:06:56.440 --> 00:06:58.719
of these. They kind of call back to the earlier formal
00:06:58.720 --> 00:07:01.479
theory. They have linear runtimes for well-specified
00:07:01.480 --> 00:07:02.519
search tasks.
00:07:02.520 --> 00:07:08.079
What happens if that doesn't fit your needs? We're going to
00:07:08.080 --> 00:07:11.479
talk about that. We're going to table that for a second,
00:07:11.480 --> 00:07:15.319
and we're going to focus more on Emacs, the subject of this
00:07:15.320 --> 00:07:19.359
conference. What are regex used for? And in this
00:07:19.360 --> 00:07:22.439
particular case, they're used for lots of things, with
00:07:22.440 --> 00:07:25.199
practically, and I think they should be. But more
00:07:25.200 --> 00:07:29.559
specifically, how do Emacs users use them? And I'm going to
00:07:29.560 --> 00:07:32.679
focus in on this text as input and output. I'll be kind of
00:07:32.680 --> 00:07:38.959
elaborating on this analogy as we continue. Why is text
00:07:38.960 --> 00:07:43.399
powerful? Text as I/O. The reason text programming
00:07:43.400 --> 00:07:45.759
languages and not just programming languages, but
00:07:45.760 --> 00:07:49.159
languages themselves, the reason why they're successful
00:07:49.160 --> 00:07:52.279
and why they propagate, I claim, is because text is both
00:07:52.280 --> 00:07:56.439
input readable and output writable. What this means
00:07:56.440 --> 00:08:01.199
is that if you receive something in text, you can read it, And
00:08:01.200 --> 00:08:04.239
then you can also write it, you can modify it, and you can
00:08:04.240 --> 00:08:06.959
produce a new version of it. You're on a kind of level
00:08:06.960 --> 00:08:10.959
playing field. That's not always the case, though. You
00:08:10.960 --> 00:08:15.959
recall that I've worked a lot with build systems and package
00:08:15.960 --> 00:08:20.999
managers. There's a discussion that goes by the name of
00:08:21.000 --> 00:08:25.319
software supply chain security. I think it's a massive
00:08:25.320 --> 00:08:29.079
joke. The reason why is because people largely raise it
00:08:29.080 --> 00:08:34.279
to explain why their for-profit company with their
00:08:34.280 --> 00:08:38.079
for-profit product is going to solve the problem for you, as
00:08:38.080 --> 00:08:41.959
opposed to the commons of open source. If you are unable to
00:08:41.960 --> 00:08:44.999
modify or deploy your code without employing an opaque
00:08:45.000 --> 00:08:48.599
external system, I think, then you have a hidden
00:08:48.600 --> 00:08:53.879
dependency. you don't remove a dependency, you just, by,
00:08:53.880 --> 00:08:59.239
for example, paying into a for-profit product or using a
00:08:59.240 --> 00:09:01.519
closed-off supply chain, you end up just having a hidden
00:09:01.520 --> 00:09:04.719
dependency, you end up just displacing that. This can
00:09:04.720 --> 00:09:07.639
actually exert arbitrary control over your programming
00:09:07.640 --> 00:09:11.279
output and potentially even your thoughts. This is really
00:09:11.280 --> 00:09:15.839
important. I'm going to dive in a little bit deeper and I'm
00:09:15.840 --> 00:09:18.999
going to overload the term locality here. I'm going to
00:09:19.000 --> 00:09:22.239
say, if you cannot reproduce a system locally, it becomes an
00:09:22.240 --> 00:09:24.999
opaque external system. I'm going to give examples
00:09:25.000 --> 00:09:27.479
here, and these are going to be a bit of a hot take. First
00:09:27.480 --> 00:09:30.519
off, GUI IDEs. I think we might, well, some of us might agree
00:09:30.520 --> 00:09:34.519
with that here. I say development environments that only
00:09:34.520 --> 00:09:38.119
allow you to use a graphical interface, do not expose
00:09:38.120 --> 00:09:42.799
interaction with text, are explicitly trying to kind of
00:09:42.800 --> 00:09:46.239
place you on a separate kind of plane where you're not an
00:09:46.240 --> 00:09:50.439
equal contributor to the people who make the development
00:09:50.440 --> 00:09:53.079
environment, make the development kind of frameworks
00:09:53.080 --> 00:09:57.399
here. We'll go one further. Cloud services are precisely,
00:09:57.400 --> 00:10:00.039
you know, they're useful for things that, you know, that
00:10:00.040 --> 00:10:04.399
require large domain computation, but, you know, Twitter,
00:10:04.400 --> 00:10:08.679
for example, didn't actually ever use any cloud services,
00:10:08.680 --> 00:10:12.199
external ones, because it was really important for them to
00:10:12.200 --> 00:10:14.999
actually own their own hardware, their own computation,
00:10:15.000 --> 00:10:20.199
their own thinking. Cloud services are a way to ensure
00:10:20.200 --> 00:10:24.919
that you're unable to reproduce a system without paying an
00:10:24.920 --> 00:10:28.039
amount per month, an amount per day, an amount per second, an
00:10:28.040 --> 00:10:32.439
amount per cycle to an external entity. I'm just going to
00:10:32.440 --> 00:10:35.559
conclude this with, I'd say, the argumentum ad absurdum,
00:10:35.560 --> 00:10:39.239
here, where large language models are all of these at once.
00:10:39.240 --> 00:10:42.879
They are a cloud service, specifically, and this is what
00:10:42.880 --> 00:10:48.439
makes them very evil, to make it so that, similar to GUI IDEs,
00:10:48.440 --> 00:10:52.919
so that text itself loses that ability to be both readable
00:10:52.920 --> 00:10:56.199
and writable. Instead, text is both unreadable, because
00:10:56.200 --> 00:10:59.519
it's produced by a machine, and then also unwritable,
00:10:59.520 --> 00:11:02.999
because you're subservient and subjugated to the machine,
00:11:03.000 --> 00:11:05.359
to the large language model to produce the code in the first
00:11:05.360 --> 00:11:08.919
place. You lose this input, output, readable, writable
00:11:08.920 --> 00:11:13.359
behavior that I claim text has specifically. To
00:11:13.360 --> 00:11:19.439
underline this, what is text? Text is local. Finally,
00:11:19.440 --> 00:11:23.639
we're at the subject of this conference. Emacs, I have
00:11:23.640 --> 00:11:27.479
double hearts with text. I start off the slide saying Emacs
00:11:27.480 --> 00:11:31.519
is a text editor. I think that's a good start. Which
00:11:31.520 --> 00:11:34.319
implements much of its own logic and user interface via
00:11:34.320 --> 00:11:38.399
text. What this means is that, you know, I say without
00:11:38.400 --> 00:11:42.639
trying, Emacs tries very hard, but without trying so hard,
00:11:42.640 --> 00:11:47.639
Emacs, is imbued with all of the capabilities that text has
00:11:47.640 --> 00:11:51.319
specifically. When you use text like Emacs does, and
00:11:51.320 --> 00:11:55.519
particularly you then start offering mechanisms to query,
00:11:55.520 --> 00:11:59.999
to transform, and to generally metaprogram text itself,
00:12:00.000 --> 00:12:03.319
you don't just have the ability to edit code in new ways. And
00:12:03.320 --> 00:12:06.999
this is something that I think is often lost, maybe not by
00:12:07.000 --> 00:12:11.239
participants of this conference, you particularly start
00:12:11.240 --> 00:12:14.319
being able to not only just edit code differently, but to
00:12:14.320 --> 00:12:16.599
change the way that you think about code and actually to
00:12:16.600 --> 00:12:20.239
expand your range of thought, the range of actions that you
00:12:20.240 --> 00:12:22.719
can perform. You can actually start then editing at the
00:12:22.720 --> 00:12:25.799
speed of thought. This is where especially Regex kind of
00:12:25.800 --> 00:12:30.319
comes into play. Finally, we get to the subject of the
00:12:30.320 --> 00:12:33.599
title of this talk. I'm about to disappoint a lot of
00:12:33.600 --> 00:12:38.759
people. I claim for good reason. Unfortunately, it's a
00:12:38.760 --> 00:12:41.599
very brief walkthrough, but I'm going to go over what the
00:12:41.600 --> 00:12:43.799
current Emacs Redix engine is. This is going to give us
00:12:43.800 --> 00:12:48.119
enough context for the next section on future directions.
00:12:48.120 --> 00:12:51.799
Quickly, it's a backtracking engine over a multi-byte
00:12:51.800 --> 00:12:53.919
code point. I'll define what that means. It's in
00:12:53.920 --> 00:12:58.439
regex-emacs.c. It's invoked in two ways, which you'll see
00:12:58.440 --> 00:13:01.759
is actually the same way, over a single contiguous string
00:13:01.760 --> 00:13:05.359
input. This is a Lisp string that you pass in. or over the
00:13:05.360 --> 00:13:07.039
two halves of the gap buffer. This is when you match
00:13:07.040 --> 00:13:11.879
against a buffer text. We'll go into that a little bit
00:13:11.880 --> 00:13:13.919
more, but this is one of the really actually interesting and
00:13:13.920 --> 00:13:17.839
specific things about Emacs Regex Engine as it stands. So
00:13:17.840 --> 00:13:21.559
very, very quickly, this is the data layout. This is just, if
00:13:21.560 --> 00:13:24.879
you're interested, this is where the code lies. So
00:13:24.880 --> 00:13:30.159
regex-emacs.h has re-pattern buffer, which is a struct
00:13:30.160 --> 00:13:34.239
Actually, you know, I love, by the way, I love the Emacs C
00:13:34.240 --> 00:13:37.359
source code. It's so nice to read. It made all this so, so
00:13:37.360 --> 00:13:41.119
easy. I really appreciated it. In this particular case,
00:13:41.120 --> 00:13:44.039
I'm just going to focus on re-pattern buffer actually has
00:13:44.040 --> 00:13:47.999
the compiler. It's a C struct. It has every single thing
00:13:48.000 --> 00:13:52.559
that is needed to execute the regular expression against a
00:13:52.560 --> 00:13:56.319
string input or against a buffer input. This buffer,
00:13:56.320 --> 00:13:59.839
It's not an Emacs buffer. It refers to just the instruction
00:13:59.840 --> 00:14:04.039
table and the match loop. Again, this is very, very
00:14:04.040 --> 00:14:07.839
brief, but I want to specifically focus on the first part. So
00:14:07.840 --> 00:14:11.879
this is this inner matching loop, and there's a prologue,
00:14:11.880 --> 00:14:15.679
and then there's a loop body, and there's an epilogue. And
00:14:15.680 --> 00:14:18.279
the prologue is the really, really interesting part. I say
00:14:18.280 --> 00:14:22.839
extract current and next char. What Emacs does here, it
00:14:22.840 --> 00:14:27.159
doesn't just reach for the next byte. It actually will
00:14:27.160 --> 00:14:31.879
perform lazily in some sense, this variable integer size
00:14:31.880 --> 00:14:36.039
VAR decoding for multi-byte, and it'll actually then
00:14:36.040 --> 00:14:43.959
decode the next one to four bytes. Up to 32 bits at once, and
00:14:43.960 --> 00:14:46.799
then it'll actually go into the loop. We'll talk about the
00:14:46.800 --> 00:14:52.519
implications of that later. Next, in the body of the loop, we
00:14:52.520 --> 00:14:54.239
read the instruction from the instruction pointer, which
00:14:54.240 --> 00:14:57.319
is, again, in that buffer field. Then we have this big
00:14:57.320 --> 00:14:59.479
switch statement, which is actually, love a big switch
00:14:59.480 --> 00:15:02.079
statement, super easy to read, super easy to understand
00:15:02.080 --> 00:15:05.399
kind of what's occurring. Then that's the loop body. And
00:15:05.400 --> 00:15:08.279
then at the end of it, we either increment the instruction
00:15:08.280 --> 00:15:11.119
pointer if it was matching a single character or something
00:15:11.120 --> 00:15:14.839
along those lines, or if it was a jump, we don't do that. A
00:15:14.840 --> 00:15:18.199
jump, however, it's not referring to a jump in the sense of a
00:15:18.200 --> 00:15:22.519
go-to, but a jump that's elsewhere within that table, that
00:15:22.520 --> 00:15:25.839
buffer field. If you've included a capture, we write
00:15:25.840 --> 00:15:29.479
that end position there. Of course, well, as you may
00:15:29.480 --> 00:15:34.439
recall, the zeroth capture is, of course, the entire match
00:15:34.440 --> 00:15:36.559
string. If the capture is zero, then we know we've
00:15:36.560 --> 00:15:39.839
actually completed that match. That's really great.
00:15:39.840 --> 00:15:43.599
I would love to receive Q&A about this as well. I've spent a
00:15:43.600 --> 00:15:46.719
lot of time kind of learning and understanding it. But it's
00:15:46.720 --> 00:15:49.879
really interesting that this can be described in a single
00:15:49.880 --> 00:15:52.159
slide because it's really simple. That simplicity is
00:15:52.160 --> 00:15:54.639
actually a really powerful thing. I'll mention that in
00:15:54.640 --> 00:15:58.759
the next section. I say, is that all? And I apologize for
00:15:58.760 --> 00:16:02.239
not doing so. But please, please ask questions in Q&A or
00:16:02.240 --> 00:16:04.999
message me about this, because I think it's really, really,
00:16:05.000 --> 00:16:07.079
again, interesting. Again, I find the code relatively
00:16:07.080 --> 00:16:11.999
easy to read. Now, here's, I think this is actually the
00:16:12.000 --> 00:16:15.519
point of the talk. The rest of it was, you know, I think just me
00:16:15.520 --> 00:16:18.839
posturing. This is the really, really interesting part.
00:16:18.840 --> 00:16:22.039
This is the ways that we can improve, well, not just we can
00:16:22.040 --> 00:16:25.839
improve stuff in Emacs, but why those are the right things to
00:16:25.840 --> 00:16:30.279
improve. Then also how that can be a model for even things
00:16:30.280 --> 00:16:35.079
outside of Emacs. This is gonna be a lot of text. I'm not
00:16:35.080 --> 00:16:38.879
gonna go through all of it. This is the one thing that I tried.
00:16:38.880 --> 00:16:42.239
This is the thing that I thought would be a slam dunk, easy
00:16:42.240 --> 00:16:47.439
solution. My initial thought process was, well, We tried
00:16:47.440 --> 00:16:52.919
very hard to do an LRU cache here. It works. It's actually
00:16:52.920 --> 00:16:57.399
very effective. However, though, we don't actually give
00:16:57.400 --> 00:17:00.479
the user, the list programmer, the ability to then say, I
00:17:00.480 --> 00:17:03.079
know that this regex is something that is going to be used
00:17:03.080 --> 00:17:06.399
again. I made an artificial benchmark. I made an
00:17:06.400 --> 00:17:10.039
artificial benchmark because I wanted to show there is one
00:17:10.040 --> 00:17:13.639
very specific case that it does solve, but it's the same
00:17:13.640 --> 00:17:16.919
issue with the artificial benchmarks. mentioned earlier.
00:17:16.920 --> 00:17:21.559
It's very specifically crafted in order to show that this
00:17:21.560 --> 00:17:25.319
particular solution would produce some speedup. What
00:17:25.320 --> 00:17:29.599
this means is it just creates more than 20 regexps in a row. It
00:17:29.600 --> 00:17:31.959
compiles them. Then, of course, because we just don't
00:17:31.960 --> 00:17:35.159
pay the compile costs, because we don't go through that
00:17:35.160 --> 00:17:39.079
cache eviction process, it ends up being faster. But this
00:17:39.080 --> 00:17:42.079
isn't really mean very much, particularly the goal here,
00:17:42.080 --> 00:17:45.559
you know, the goal would have been to show that the compile
00:17:45.560 --> 00:17:48.359
cache is actually causing the performance issue in
00:17:48.360 --> 00:17:51.359
comparison to pre-compiling it. That's not something
00:17:51.360 --> 00:17:56.039
I've been able to show. Match over bytes, not cars. So
00:17:56.040 --> 00:17:59.079
this is when I said at the beginning, oh, I came in and I think,
00:17:59.080 --> 00:18:02.079
oh, we can just use modern regex engine techniques. This is
00:18:02.080 --> 00:18:05.239
really what I meant. In particular, I mentioned in this
00:18:05.240 --> 00:18:09.279
match loop here that there's this, prolog that does this
00:18:09.280 --> 00:18:13.359
varring decoding. What this means is that every single
00:18:13.360 --> 00:18:18.519
iteration of that loop is going to be interspersed with this
00:18:18.520 --> 00:18:21.919
not being able to read a fixed number of bytes, but a variable
00:18:21.920 --> 00:18:24.359
number of bytes just depending upon the Unicode character
00:18:24.360 --> 00:18:27.039
or the Unicode code point or the multibyte code point. So
00:18:27.040 --> 00:18:29.799
this ends up, again, being relatively difficult to
00:18:29.800 --> 00:18:32.919
optimize because processors operate over bytes and not
00:18:32.920 --> 00:18:38.479
over code points. Yes, we might consider a multi-byte CPU at
00:18:38.480 --> 00:18:41.039
some point. But this is a really, really simple thing. It's
00:18:41.040 --> 00:18:44.999
just generating automata that operate over bytes as
00:18:45.000 --> 00:18:48.839
opposed to code points. This kind of goes into the much more
00:18:48.840 --> 00:18:51.839
abstract one. There's a lot of text here, and we're not
00:18:51.840 --> 00:18:56.159
going to go into it. But the really, really important point
00:18:56.160 --> 00:18:57.999
that I'm specifically mentioning here is this explicit
00:18:58.000 --> 00:19:02.079
control over linguistic complexity. That's the
00:19:02.080 --> 00:19:06.159
abstract kind of point. I want to introduce the inputs and
00:19:06.160 --> 00:19:11.279
the outputs. Basically, when you perform a search, or a
00:19:11.280 --> 00:19:14.799
match, or a parse, those are different tasks. They'll
00:19:14.800 --> 00:19:17.799
have different expected inputs and different desired
00:19:17.800 --> 00:19:21.559
outputs. Right now, Emacs, the API for the regular
00:19:21.560 --> 00:19:24.919
expression engine and for matching, It doesn't allow
00:19:24.920 --> 00:19:27.959
specialization on this. Or rather, if we do specialize on
00:19:27.960 --> 00:19:30.999
particular inputs, if we have a heuristic to check if a regex
00:19:31.000 --> 00:19:33.519
is actually a literal string, that's not something that the
00:19:33.520 --> 00:19:36.959
user actually has control over. For example, you can make
00:19:36.960 --> 00:19:38.999
a mistake escaping something, and then you don't have a
00:19:39.000 --> 00:19:42.039
literal, and then you accidentally have behavior that you
00:19:42.040 --> 00:19:44.279
totally didn't expect. Not just correctness issues, but
00:19:44.280 --> 00:19:48.599
also performance issues. I really like this one. I like
00:19:48.600 --> 00:19:52.239
this a lot, because I didn't think of it at all. I think it's
00:19:52.240 --> 00:19:58.119
better than in all of my ideas. This was proposed, at least
00:19:58.120 --> 00:20:01.839
to me, by Pip Cet, and I really hope that I'm pronouncing your
00:20:01.840 --> 00:20:04.479
name correctly. I'm sorry I didn't ask you beforehand,
00:20:04.480 --> 00:20:08.399
emacs-devel. In particular, this was after a couple of
00:20:08.400 --> 00:20:11.999
responses where I was trying to say, oh, I want to give the
00:20:12.000 --> 00:20:15.879
list programmer, way back in here, I want to give the list
00:20:15.880 --> 00:20:20.559
programmer the ability to control compilation in some
00:20:20.560 --> 00:20:25.759
sense. you know, he mentioned, I think he is correct, you
00:20:25.760 --> 00:20:28.439
know, there's no real introspection. That happens
00:20:28.440 --> 00:20:33.119
because it's written in C. I was thinking, oh, if I turn
00:20:33.120 --> 00:20:35.639
this into a list object that gives the list programmer the
00:20:35.640 --> 00:20:40.039
power and the ability to do more with that, but it doesn't
00:20:40.040 --> 00:20:42.839
actually because it's still in C. At first, I was
00:20:42.840 --> 00:20:46.679
thinking, oh, we can make the C part more flexible. But
00:20:46.680 --> 00:20:50.039
actually, especially if we want to do almost any of the
00:20:50.040 --> 00:20:52.719
things we previously mentioned, I think basically that
00:20:52.720 --> 00:20:56.599
this is... I think that if I'm not going to do it, somebody
00:20:56.600 --> 00:20:58.879
else really should do it, and I think we should maybe even do
00:20:58.880 --> 00:21:01.519
it together, because I think this is really, I think, how we
00:21:01.520 --> 00:21:04.079
can start experimenting, and not just experimenting, but
00:21:04.080 --> 00:21:07.039
because, as mentioned here, we have libgccjit, we have the
00:21:07.040 --> 00:21:09.519
native compiler, we have the ability to opt, like,
00:21:09.520 --> 00:21:12.639
specifically to generate specific code for this, so why not
00:21:12.640 --> 00:21:15.919
implement the or a Redix engine itself in list, And this
00:21:15.920 --> 00:21:18.359
gives us the ability to introspect it. That's one of the
00:21:18.360 --> 00:21:20.759
things I mentioned at the beginning. But it actually gives
00:21:20.760 --> 00:21:23.879
us the ability to then actually look at all the previous
00:21:23.880 --> 00:21:28.159
implementations, to explicitly compile beforehand, to
00:21:28.160 --> 00:21:32.519
match against bytes, to specialize and dispatch based upon
00:21:32.520 --> 00:21:36.799
input and output. This is something that, you know, it's
00:21:36.800 --> 00:21:37.999
super simple.
00:21:38.000 --> 00:21:40.799
It's really smart. I'm really, really glad that Pip
00:21:40.800 --> 00:21:44.839
mentioned this because it is, I think, the right way to solve
00:21:44.840 --> 00:21:49.879
the rest of it. We're at the final section. I talked a
00:21:49.880 --> 00:21:52.679
lot about, you know, kind of abstract, you know, thoughts.
00:21:52.680 --> 00:21:55.679
I talked a little about, you know, specific solutions.
00:21:55.680 --> 00:21:59.999
But I especially talked about, you know, what is Regex and
00:22:00.000 --> 00:22:02.959
Emacs? And I don't know if I had a lot of specific examples of
00:22:02.960 --> 00:22:06.079
it. I'm going to just describe kind of my, I guess,
00:22:06.080 --> 00:22:09.799
motivation, my impetus. Then I think something that's
00:22:09.800 --> 00:22:12.639
really something to chew on for the future. Do I have any
00:22:12.640 --> 00:22:15.799
concrete examples? Yes. Well, you can decide if they're
00:22:15.800 --> 00:22:22.799
concrete. Or am I just posturing? Also, yes. helm, rg. Helm,
00:22:22.800 --> 00:22:27.679
Erg, it's literally just M-x grep, it uses ripgrep, which
00:22:27.680 --> 00:22:31.999
is written by the same author of the Rust regex [??]. It
00:22:32.000 --> 00:22:36.199
happens to be very, very fast. In particular, I use this tool
00:22:36.200 --> 00:22:39.319
with ripgrep on the Twitter monorepo, and I was able to
00:22:39.320 --> 00:22:42.559
search very, very large amounts of code that was on my local
00:22:42.560 --> 00:22:46.399
machine using regular expressions. I think this is one
00:22:46.400 --> 00:22:49.199
thing that I think is really, really important, because
00:22:49.200 --> 00:22:52.919
when you want to scale, People say the word scaling and they
00:22:52.920 --> 00:22:56.719
assume there's a specific kind of answer for that. I've
00:22:56.720 --> 00:23:01.679
just found that text is not only flexible, it's actually
00:23:01.680 --> 00:23:04.359
something that can be more performant than the alternative
00:23:04.360 --> 00:23:07.399
and not only more performant, but more productive. It's
00:23:07.400 --> 00:23:10.359
again, it's just M-x grep using ripgrep. There's a tool
00:23:10.360 --> 00:23:12.719
deadgrep by Wilfred Hughes, which is also fantastic. I
00:23:12.720 --> 00:23:15.759
think it's actually better than this, but this one's mine so
00:23:15.760 --> 00:23:19.199
I can mess around with it. But this tool is kind of why,
00:23:19.200 --> 00:23:21.799
especially I started looking into Emacs and looking into
00:23:21.800 --> 00:23:24.919
changing the way that, or at least diving into how the
00:23:24.920 --> 00:23:27.559
regular expression matching actually kind of works, both
00:23:27.560 --> 00:23:30.359
in Emacs and then in ripgrep. We'll go to the next one.
00:23:30.360 --> 00:23:34.119
This is something that does exist and continues to exist.
00:23:34.120 --> 00:23:36.799
This is something that doesn't quite exist yet. I'm
00:23:36.800 --> 00:23:41.359
calling it telepathy grams. It's, you know, it's the name,
00:23:41.360 --> 00:23:44.719
and it's very, you know, it doesn't work, but it's a code
00:23:44.720 --> 00:23:47.919
search tool that, in this case, precompiles the database to
00:23:47.920 --> 00:23:51.879
execute NFAs against. I was thinking, how can I beat And
00:23:51.880 --> 00:23:55.039
the first thing I thought is, well, as I have worked on build
00:23:55.040 --> 00:23:57.759
tools, especially in monorepos, one of the things that the
00:23:57.760 --> 00:24:00.799
pants build tool from Twitter does is it uses a file watcher
00:24:00.800 --> 00:24:04.239
to ensure that instead of having to constantly read in the
00:24:04.240 --> 00:24:10.079
entire contents of a file, which may be very, very large, it
00:24:10.080 --> 00:24:13.679
only does so when the file has been changed. Finally, I
00:24:13.680 --> 00:24:16.919
want to conclude on this note, which is just that the stuff I
00:24:16.920 --> 00:24:20.839
didn't learn from emacs devel, I learned from Paul
00:24:20.840 --> 00:24:25.319
Wankadia, Jr., who is the RE2 maintainer, and he taught me
00:24:25.320 --> 00:24:32.399
quite a lot from 2023 to 2024. I'm thankful for the time
00:24:32.400 --> 00:24:37.959
that I learned from you, so thank you, Paul. With that, we're
00:24:37.960 --> 00:24:42.759
at point-max. Call me, beat me, if you want to reach me and or
00:24:42.760 --> 00:24:45.839
hire me. These are places that you can reach me at. There are
00:24:45.840 --> 00:24:49.719
probably others. Feel free to suggest other ways to contact
00:24:49.720 --> 00:24:53.199
me. But for now, this is the end. Thank you so much for your
00:24:53.200 --> 00:24:56.080
time. I really appreciate it.
|