2024/talks/regex.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125

[[!meta title="Emacs regex compilation and future directions for expressive pattern matching"]]
[[!meta copyright="Copyright &copy; 2024 Danny McClanahan"]]
[[!inline pages="internal(2024/info/regex-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. --->


# Emacs regex compilation and future directions for expressive pattern matching
Danny McClanahan (they/them) - Pronunciation: məˈklænəˌhæn / mk-CLAN-uh-han, han like "hand", IRC: cosmicexplorer, @hipsterelectron@circumstances.run on mastodon - @hipsterelectron on twitter -  @cosmicexplorer on github - @cosmicexplorer in #emacsconf on irc.libera.chat, <mailto:dmcC2@hypnicjerk.ai>

[[!inline pages="internal(2024/info/regex-before)" raw="yes"]]

Emacs is a delightful case study for the capabilities of regular expressions, because Emacs forms user interfaces via text, which retains the expressivity of a GUI with the user-level interactivity of written language. Because we use text for both input and output, regexps in Emacs form part of a user-level grammar for human thought. As a result, Emacs and Emacs users have a rich intuitive grasp of regular expressions, which provides a unique vantage point to consider how they may be improved in general.

When I began my investigation, I assumed that Emacs would be able to use an existing off-the-shelf regex engine, that this would be more performant than regex-emacs.c, and that the greatest challenge would be providing a sufficiently robust build process (see emacs-devel: <https://lists.gnu.org/archive/html/emacs-devel/2024-04/msg00142.html>). However, I quickly found that Emacs (as usual) is far more configurable than alternatives (see rust regex discussion: <https://github.com/rust-lang/regex/discussions/1167#discussioncomment-8585027>). Now don't get this twisted: emacs-devel was open to deprecating functionality that hampered optimization! But the biggest challenge by far is that regex-emacs.c is categorically more powerful than alternatives: it can match against <span class="underline">non-contiguous input</span> (across both halves of the gap buffer), as well as non-UTF8 text with its fantastic multibyte encoding (see <https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html>).

So a more complex picture begins to emerge: Emacs actually uses regexps far more widely and deeply than anywhere else, and its regex engine requirements aren't "legacy", but the result of caring more deeply about language than anywhere else. While regex engines have historically been known to introduce functionality not backed by formal theory that's later found to be hard to optimize, Emacs instead charts a path for other engines to follow. Formalizing backrefs is state-of-the-art, but possible (see <https://jamiejennings.com/posts/2021-09-23-dont-look-back-2/>), and I believe the same can be achieved for the other affordances Emacs users have come to expect. Subsequently, I have focused on identifying where we can constrain the problem space to improve performance without losing those affordances, such as explicit precompilation in lisp code (see <https://lists.gnu.org/archive/html/emacs-devel/2024-08/msg00108.html>).

There are many branching paths here. With the libgccjit native compiler, we can now implement regex matching in lisp itself. While \`rx' can compose patterns to an extent, we could provide a more powerful primitive than regular expressions alone for complex parsing tasks. And while many regex engines employ complex optimization heuristics, we can instead introduce specific functionality for e.g. SIMD literal search into lisp code, allowing lisp users to intelligently select for themselves how and when to employ less-powerful but more-performant search routines.

We don't need to backtrack! We can try all these paths at once.

About the speaker:

Me: Typing free software to break the shoulders of giants from golden handcuffs.
Work: Composeable build tools, parsing frameworks, and cryptographic messaging. All of these are misused to subjugate and oppress, but I build infrastructure for positive liberties.

This talk will cover my train of thought over the course of this year on how regex engines in general may be improved, and the discussions with emacs-devel that have helped me along. I hope this talk will convince people of the boundless future directions in text search. My PhD research will be inspired by the expressivity and power of Emacs.


# Discussion

## Questions and answers

-   Q: A bit off topic, but how did you get the emoji into your slides?
    I\'m assuming you exported via Beamer to PDF. Thank you very much
    for the swift answer. Great presentation, too🙏🏻
    -   A: \\usepackage{twemojis}!
        [https://ctan.math.washington.edu/tex-archive/macros/latex/contrib/twemojis/twemojis.pdf](https://ctan.math.washington.edu/tex-archive/macros/latex/contrib/twemojis/twemojis.pdf){rel="noreferrer noopener"}
        -   and yes beamer to pdf! i used org-beamer too
        -   had to break out of org a couple times
        -   For LaTeX packages supporting emojis cf.
            [https://www.ctan.org/search?phrase=emoji](https://www.ctan.org/search?phrase=emoji){rel="noreferrer noopener"}
            -   i tried just pasting unicode but had an error and
                couldn\'t figure it out in my mad dash for making this
                in time
                -   In order for this to work you need the same kind of
                    unicode support over the whole toolchain, from
                    editor to tex engine to font.
                -   i\'m a big fan of toolchains so this makes me want
                    to fix it more :) thanks!!

## Notes

- i have a 50-minute version of this talk which i will be posting
  somewhere on my page
  [https://hypnicjerk.ai](https://hypnicjerk.ai){rel="noreferrer noopener"}
  after the conference!
  - oh good! I wish the last talk I attended with this many slides could have done that (Florian Weimer's traditional future directions for glibc talk at the GNU Tools Cauldron: every year he gets through a third of it and puts the rest on the schedule for next year!)
- great, the slides are now available at https://media.emacsconf.org/2024/emacsconf-2024-regex--emacs-regex-compilation-and-future-directions-for-expressive-pattern-matching--danny-mcclanahan--slides.pdf and from the talk page
- i was not able to add subtitles in time for the conference, so
  please please ask questions here or on irc during the talk (even
  just asking for what i just said) and i will do my best to answer
  all of them!
- Something you might be interested in Rak a lesser known grep
  alternative dosent seem to have a emacs frontend though
  - oooh!
    [https://github.com/danlucraft/rak](https://github.com/danlucraft/rak){rel="noreferrer noopener"}
    this ?
  - helm-rg is based on helm-ag which i previously contributed to
    and i think ag and ack have some interesting features which
    avoid doing some online work we don't need to do
    - no emacs frontend? sounds like a challenge\...!
  - [https://github.com/lizmat/App-Rak](https://github.com/lizmat/App-Rak){rel="noreferrer noopener"}
    - thanks so much!!
  - [https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s](https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s){rel="noreferrer noopener"}
- followup on emacs-devel with NullNix's suggestion to make the cache
  buffer-local:
  [https://lists.gnu.org/archive/html/emacs-devel/2024-12/msg00299.html](https://lists.gnu.org/archive/html/emacs-devel/2024-12/msg00299.html){rel="noreferrer noopener"}

- I think having an LLM do this is just perfect! all the people asking for it want is comforting lies anyway, and LLMs are really good at those!
  - LLM's can be run locally. for example using localai
  - cosmicexplorer: yes! but the weights come from somewhere! they come from training in cloud services!
  - Running locally is not the same as reproduce it localy... I guess...
  - It is like having a proprietary binary blob running on the linux kernel.
  - cosmicexplorer: that is true. it does get a bit iffy when running open source models trained on remote services when using localai.io
  - cosmicexplorer: inflicting a hidden dependency on my users :( :(
- on other things you should never do, that AI adjustment of the speaker image is *really annoying*
  - I have literally blanked off that part of my screen with a piece of paper so I don't have to see it, sorry
  - excellent talk though!! wish it was twice as long
  - cosmicexplorer: yes :( thanks for feedback. next time i won't be so embarrassed with my bed
  - cosmicexplorer: i captured this live in obs with the filtering so i don't even have the video stream without it
  - I recently told somebody about Nvidia brodcast studio for the good green screen removel which annoyed me becouse it is 1 not open source. 2 I use amd and can't use it nor is it multiplatform 3 I use linux and don't know if you can run if from linux :( anybody know of a better solution?

- ohhh I never realised the reason the match data isn't reified was so tied up with the implementation. not too surprising in hindsight, thats the emacs way :)
- I would recommend having the regex cache be *in* a buffer-local variable. most of the speedups, works everywhere maybe?
  - cosmicexplorer: SMART!!!!!!
  - cosmicexplorer: could also then explicitly have a cache busting API
- Q: What about tree-sitter?  Is it better?  Does it uses regexps?
  - cosmicexplorer: basically: yes tree-sitter solves this, but no it does not use regexps
  - cosmicexplorer: so it's really much more applicable for well-specified programming language definitions
  - cosmicexplorer: but it means we can let tree-sitter solve problems we don't want to ourselves
  - cosmicexplorer: they depend on the current syntax table which is buffer-local, and on case-folding from the current buffer
    - so only buffer-local, so a buffer-local cache should be the right level then
    - just making sure it wasn't anything finer-grained than that :)
    - cosmicexplorer: yes! and also they very very rarely change
    - hm, you could probably share many buffer-local caches with identical values for syntax tables, case folding etc even :)
    - more complex though, and likely marginal gains
- worth making sure regexes can't depend on things like overlays, but I don't see how they could or a match over a buffer might require recompilation *in the middle of a match*, which is overkill even for emacs :P
- using orderless and consult with consult ripgrep means you have the performant part outside of emacs and powerful emacs facilites don't need to be as performant while still being fast
  - cosmicexplorer: looking up orderless and consult now :)
- I was meaning with the consult-ripgrep command from the consult package
- wonders about combining this with p-search from yesterday... :)

- This is brilliant
- Great talk!
- That was great, thank you, your enthusiasm is infectious!
- That was a great talk, thanks!

[[!inline pages="internal(2024/info/regex-after)" raw="yes"]]

[[!inline pages="internal(2024/info/regex-nav)" raw="yes"]]