From c29b14845a8c5e0e9f530134e6f95a051cf697db Mon Sep 17 00:00:00 2001 From: Sacha Chua Date: Sat, 6 Mar 2021 00:12:31 -0500 Subject: Transcript for #23 main talk --- ...-emacs-tree-sitter--tuan-anh-nguyen-autogen.vtt | 1407 -------------------- ...ing-with-emacs-tree-sitter--tuan-anh-nguyen.vtt | 1235 +++++++++++++++++ 2 files changed, 1235 insertions(+), 1407 deletions(-) delete mode 100644 2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen-autogen.vtt create mode 100644 2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen.vtt (limited to '2020/subtitles') diff --git a/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen-autogen.vtt b/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen-autogen.vtt deleted file mode 100644 index 62ad5f65..00000000 --- a/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen-autogen.vtt +++ /dev/null @@ -1,1407 +0,0 @@ -WEBVTT - -00:00:01.520 --> 00:00:04.400 -Hello, everyone! My name is Tuấn-Anh. - -00:00:04.400 --> 00:00:07.200 -I've been using Emacs for about 10 years. - -00:00:07.200 --> 00:00:09.280 -Today, I'm going to talk about tree-sitter, - -00:00:09.280 --> 00:00:11.351 -a new Emacs package that allows Emacs - -00:00:11.351 --> 00:00:17.840 -to parse multiple programming languages -in real-time. - -00:00:17.840 --> 00:00:21.840 -So what is the problem statement? - -00:00:21.840 --> 00:00:24.131 -In order to support programming -functionalities - -00:00:24.131 --> 00:00:25.760 -for a particular language, - -00:00:25.760 --> 00:00:27.680 -a text editor needs to have some degree - -00:00:27.680 --> 00:00:29.679 -of language understanding. - -00:00:29.679 --> 00:00:31.840 -Traditionally, text editors have relied - -00:00:31.840 --> 00:00:34.960 -very heavily on regular expressions for -this. 
- -00:00:34.960 --> 00:00:37.013 -Emacs is no different. - -00:00:37.013 --> 00:00:40.170 -Most language major modes use regular -expressions - -00:00:40.170 --> 00:00:42.960 -for syntax-highlighting, code navigation, - -00:00:42.960 --> 00:00:46.618 -folding, indexing, and so on. - -00:00:46.618 --> 00:00:50.559 -Regular expressions are problematic for -a couple of reasons. - -00:00:50.559 --> 00:00:53.778 -They're slow and inaccurate. - -00:00:53.778 --> 00:00:56.800 -They also make the code hard to read and -write. - -00:00:56.800 --> 00:01:01.199 -Sometimes it's because the regular -expressions themselves are very hairy, - -00:01:01.199 --> 00:01:05.199 -and sometimes because they are just not -powerful enough. - -00:01:05.199 --> 00:01:08.625 -Some helper code is usually needed - -00:01:08.625 --> 00:01:11.200 -to parse more intricate language -features. - -00:01:11.200 --> 00:01:16.159 -That also illustrates the core problem -with regular expressions, - -00:01:16.159 --> 00:01:21.119 -in that they are not powerful enough to -parse programming languages. - -00:01:21.119 --> 00:01:25.040 -An example feature that regular -expressions cannot handle very well - -00:01:25.040 --> 00:01:28.320 -is string interpolation, which is a very -common feature - -00:01:28.320 --> 00:01:31.680 -in many modern programming languages. - -00:01:31.680 --> 00:01:34.079 -It would be much nicer if Emacs somehow - -00:01:34.079 --> 00:01:39.520 -had structural understanding of source -code, like IDEs do. - -00:01:39.520 --> 00:01:41.981 -There have been multiple efforts - -00:01:41.981 --> 00:01:45.280 -to bring this kind of programming -language understanding into Emacs. 
- -00:01:45.280 --> 00:01:47.119 -There are language-specific parsers - -00:01:47.119 --> 00:01:48.640 -written in Elisp - -00:01:48.640 --> 00:01:50.675 -that can be thought of - -00:01:50.675 --> 00:01:51.989 -as the next logical step -of the glue code - -00:01:51.989 --> 00:01:53.856 -on top of regular expressions, - -00:01:53.856 --> 00:01:57.356 -moving from partial local pattern -recognition - -00:01:57.356 --> 00:01:59.840 -into a full-fledged parser. - -00:01:59.840 --> 00:02:02.023 -The most prominent example of this -approach - -00:02:02.023 --> 00:02:06.479 -is probably the famous js2-mode. - -00:02:06.479 --> 00:02:10.080 -However, this approach has several issues. - -00:02:10.080 --> 00:02:12.606 -Parsing is computationally expensive, - -00:02:12.606 --> 00:02:16.800 -and Emacs Lisp is not good at that kind -of stuff. - -00:02:16.800 --> 00:02:19.156 -Furthermore, maintenance is very -troublesome. - -00:02:19.156 --> 00:02:22.160 -In order to work on these parsers, - -00:02:22.160 --> 00:02:24.239 -first, you have to know Elisp -well enough, - -00:02:24.239 --> 00:02:26.606 -and then you have to be comfortable with - -00:02:26.606 --> 00:02:29.739 -writing a recursive descending parser, - -00:02:29.739 --> 00:02:34.000 -while constantly keeping up with changes -to the language itself, - -00:02:34.000 --> 00:02:36.356 -which can be evolving very quickly, - -00:02:36.356 --> 00:02:39.360 -like Javascript, for example. - -00:02:39.360 --> 00:02:42.373 -Together, these constraints -significantly reduce - -00:02:42.373 --> 00:02:45.680 -the pool of potential maintainers. - -00:02:45.680 --> 00:02:47.760 -The biggest issue, though, in my opinion, - -00:02:47.760 --> 00:02:52.139 -is lack of the set of generic and -reusable APIs. - -00:02:52.139 --> 00:02:54.319 -This makes them very hard to use - -00:02:54.319 --> 00:02:55.920 -for minor modes that want to deal with - -00:02:55.920 --> 00:02:59.920 -cross-cutting concerns across multiple -languages. 
- -00:02:59.920 --> 00:03:01.760 -The other approach which has been - -00:03:01.760 --> 00:03:04.319 -gaining a lot of momentum -in recent years - -00:03:04.319 --> 00:03:06.560 -is externalizing language understanding - -00:03:06.560 --> 00:03:08.159 -to another process, - -00:03:08.159 --> 00:03:12.239 -also known as language server protocol. - -00:03:12.239 --> 00:03:16.560 -This second approach is actually a very -interesting one. - -00:03:16.560 --> 00:03:18.400 -By decoupling language understanding - -00:03:18.400 --> 00:03:21.280 -from the editing facility itself, - -00:03:21.280 --> 00:03:25.120 -the LSP servers can attract a lot more -contributors, - -00:03:25.120 --> 00:03:27.189 -which makes maintenance easier. - -00:03:27.189 --> 00:03:32.400 -However, they also have several issues -of their own. - -00:03:32.400 --> 00:03:34.089 -Being a separate process, - -00:03:34.089 --> 00:03:37.073 -they are usually more -resource-intensive, - -00:03:37.073 --> 00:03:39.920 -and depending on the language, - -00:03:39.920 --> 00:03:42.159 -the LSP server itself can bring with it - -00:03:42.159 --> 00:03:44.640 -a host of additional dependencies - -00:03:44.640 --> 00:03:50.640 -external to Emacs, which may be messy to -install and manage. - -00:03:50.640 --> 00:03:55.120 -Furthermore, JSON over RPC has pretty -high latency. - -00:03:55.120 --> 00:03:57.840 -For one-off tasks like jumping to source - -00:03:57.840 --> 00:04:00.879 -or on-demand completion, it's great. - -00:04:00.879 --> 00:04:03.040 -But for things like code highlighting, - -00:04:03.040 --> 00:04:06.000 -the latency is just too much. - -00:04:06.000 --> 00:04:08.319 -I was using Rust and I was following the - -00:04:08.319 --> 00:04:11.760 -community effort to improve its -IDE support, - -00:04:11.760 --> 00:04:15.760 -hoping to integrate some of that into -Emacs itself. 
- -00:04:15.760 --> 00:04:19.759 -Then I heard someone from the community -mention tree-sitter, - -00:04:19.759 --> 00:04:23.360 -and I decided to check it out. - -00:04:23.360 --> 00:04:28.720 -Basically, tree-sitter is an incremental -parsing library and a parser generator. - -00:04:28.720 --> 00:04:33.040 -It was introduced by the Atom editor in -2018. - -00:04:33.040 --> 00:04:35.923 -Besides Atom, it is also being -integrated - -00:04:35.923 --> 00:04:37.623 -into the NeoVim editor, - -00:04:37.623 --> 00:04:41.040 -and Github is using it to power - -00:04:41.040 --> 00:04:42.423 -their source code analysis - -00:04:42.423 --> 00:04:45.840 -and navigation features. - -00:04:45.840 --> 00:04:48.639 -It is written in C and can be compiled - -00:04:48.639 --> 00:04:50.623 -for all major platforms. - -00:04:50.623 --> 00:04:53.120 -It can even be compiled - -00:04:53.120 --> 00:04:55.323 -to web assembly to run on the web. - -00:04:55.323 --> 00:05:00.800 -That's how Github is using it -on their website. - -00:05:00.800 --> 00:05:05.840 -So why is tree-sitter an interesting -solution to this problem? - -00:05:05.840 --> 00:05:10.000 -There are multiple features that make it -an attractive option. - -00:05:10.000 --> 00:05:11.839 -It is designed to be fast. - -00:05:11.839 --> 00:05:13.680 -By being incremental, - -00:05:13.680 --> 00:05:15.680 -the initial parse of a typical big file - -00:05:15.680 --> 00:05:18.160 -can take tens of milliseconds, - -00:05:18.160 --> 00:05:20.240 -while subsequent incremental processes - -00:05:20.240 --> 00:05:22.560 -are sub-millisecond. - -00:05:22.560 --> 00:05:26.240 -It achieves this by using -structural sharing, - -00:05:26.240 --> 00:05:29.360 -meaning replacing only affected nodes - -00:05:29.360 --> 00:05:32.960 -in the old tree when it needs to. - -00:05:32.960 --> 00:05:37.120 -Also, unlike LSP, being in -the same process, - -00:05:37.120 --> 00:05:40.639 -it has much lower latency. 
- -00:05:40.639 --> 00:05:44.960 -Secondly, it provides a uniform -programming interface. - -00:05:44.960 --> 00:05:47.039 -The same data structures and functions - -00:05:47.039 --> 00:05:50.400 -work on parse trees of different -languages. - -00:05:50.400 --> 00:05:52.160 -Syntax nodes of different languages - -00:05:52.160 --> 00:05:54.160 -differ only by their types - -00:05:54.160 --> 00:05:55.723 -and their possible child nodes. - -00:05:55.723 --> 00:06:02.240 -This is a big advantage over -language-specific parsers. - -00:06:02.240 --> 00:06:06.880 -Thirdly, it's written in self-contained -embeddable C. - -00:06:06.880 --> 00:06:11.723 -As I mentioned previously, it can even -be compiled to webassembly. - -00:06:11.723 --> 00:06:16.106 -This makes integrating it into various -editors quite easy - -00:06:16.106 --> 00:06:22.880 -without having to install any external -dependencies. - -00:06:22.880 --> 00:06:25.503 -One thing that is not mentioned here - -00:06:25.503 --> 00:06:28.000 -is that being a parser generator, - -00:06:28.000 --> 00:06:31.039 -its grammars are declarative. - -00:06:31.039 --> 00:06:34.880 -Together with being editor-independent, - -00:06:34.880 --> 00:06:39.139 -this makes the pool of potential -contributors much larger. - -00:06:39.139 --> 00:06:45.520 -So I was convinced that tree-sitter is a -good fit for Emacs. - -00:06:45.520 --> 00:06:48.000 -Last year, I started writing the bindings - -00:06:48.000 --> 00:06:53.280 -using dynamic module support introduced -in Emacs 25. - -00:06:53.280 --> 00:06:58.479 -Dynamic module means there is -platform-specific native code involved, - -00:06:58.479 --> 00:07:00.560 -but since there are pre-compiled binaries - -00:07:00.560 --> 00:07:02.880 -for the three major platforms, - -00:07:02.880 --> 00:07:04.706 -it should work in most places. - -00:07:04.706 --> 00:07:09.440 -Currently, the core functionalities are -in a pretty good shape. 
- -00:07:09.440 --> 00:07:12.560 -Syntax highlighting is working nicely. - -00:07:12.560 --> 00:07:16.080 -The whole thing is split into three -packages. - -00:07:16.080 --> 00:07:20.319 -tree-sitter is the main package that -other packages should depend on. - -00:07:20.319 --> 00:07:22.800 -tree-sitter-langs is the language bundle - -00:07:22.800 --> 00:07:24.000 -that includes support - -00:07:24.000 --> 00:07:27.199 -for most common languages. - -00:07:27.199 --> 00:07:32.160 -And finally, the core APIs are in the -package tsc, - -00:07:32.160 --> 00:07:36.160 -which stands for tree-sitter-core. - -00:07:36.160 --> 00:07:38.800 -It is the implicit dependency of the - -00:07:38.800 --> 00:07:43.520 -tree-sitter package. - -00:07:43.520 --> 00:07:47.520 -The main package includes the minor mode -tree-sitter-mode. - -00:07:47.520 --> 00:07:52.560 -This provides the base for other major -or minor modes to build on. - -00:07:52.560 --> 00:07:54.839 -Using Emacs's change tracking hooks, - -00:07:54.839 --> 00:07:57.073 -it enables incremental parsing - -00:07:57.073 --> 00:08:00.800 -and provides a syntax tree that is -always up to date - -00:08:00.800 --> 00:08:04.080 -after any edits in a buffer. - -00:08:04.080 --> 00:08:06.223 -There is also a basic debug mode - -00:08:06.223 --> 00:08:10.080 -that shows the parse tree in -another buffer. - -00:08:10.080 --> 00:08:13.360 -Here is a quick demo. - -00:08:13.360 --> 00:08:15.673 -Here I'm in an empty Python buffer - -00:08:15.673 --> 00:08:17.520 -with tree-sitter enabled. - -00:08:17.520 --> 00:08:19.440 -I'm going to turn on the debug mode to - -00:08:19.440 --> 00:08:26.560 -see the parse tree. - -00:08:26.560 --> 00:08:28.106 -Since the buffer is empty, - -00:08:28.106 --> 00:08:30.423 -there is only one node in the -syntax tree: - -00:08:30.423 --> 00:08:33.279 -the top-level module node. - -00:08:33.279 --> 00:09:11.040 -Let's try typing some code. 
- -00:09:11.040 --> 00:09:14.640 -As you can see, as I type into the -Python buffer, - -00:09:14.640 --> 00:09:19.120 -the syntax tree updates in real time. - -00:09:19.120 --> 00:09:22.039 -The other minor mode included in the -main package - -00:09:22.039 --> 00:09:24.389 -is tree-sitter-hl-mode. - -00:09:24.389 --> 00:09:26.349 -It overrides font-lock mode - -00:09:26.349 --> 00:09:28.480 -and provides its own set of phases - -00:09:28.480 --> 00:09:30.139 -and customization options - -00:09:30.139 --> 00:09:32.800 -It is query-driven. - -00:09:32.800 --> 00:09:36.240 -That means instead of regular -expressions, - -00:09:36.240 --> 00:09:39.518 -it uses a Lisp-like query language - -00:09:39.518 --> 00:09:40.320 -to map syntax nodes - -00:09:40.320 --> 00:09:41.923 -to highlighting phrases. - -00:09:41.923 --> 00:09:45.760 -I'm going to open a python file with -small snippets - -00:09:45.760 --> 00:09:54.320 -that showcase syntax highlighting. - -00:09:54.320 --> 00:09:55.920 -So this is the default highlighting - -00:09:55.920 --> 00:10:00.880 -provided by python-mode. - -00:10:00.880 --> 00:10:04.640 -This is the highlighting enabled -by tree-sitter. 
- -00:10:04.640 --> 00:10:07.680 -as you can see string interpolation - -00:10:07.680 --> 00:10:11.680 -and decorators are highlighted correctly - -00:10:11.680 --> 00:10:17.440 -function calls are also highlighted - -00:10:17.440 --> 00:10:20.240 -you can also note that property - -00:10:20.240 --> 00:10:21.839 -assessors - -00:10:21.839 --> 00:10:24.640 -and property assignments are highlighted - -00:10:24.640 --> 00:10:27.440 -differently - -00:10:27.440 --> 00:10:29.360 -what I like the most about this is that - -00:10:29.360 --> 00:10:30.880 -new bindings are consistently - -00:10:30.880 --> 00:10:32.640 -highlighted - -00:10:32.640 --> 00:10:36.320 -this included local variable - -00:10:36.320 --> 00:10:39.760 -function parameters and property - -00:10:39.760 --> 00:10:45.760 -mutations - -00:10:45.760 --> 00:10:48.000 -before going through the three queries - -00:10:48.000 --> 00:10:49.279 -and the syntax highlighting - -00:10:49.279 --> 00:10:51.680 -customization options - -00:10:51.680 --> 00:10:53.760 -let's take a brief look at the core data - -00:10:53.760 --> 00:10:55.040 -structures and functions - -00:10:55.040 --> 00:10:58.079 -that tree sitter provides - -00:10:58.079 --> 00:10:59.839 -so parsing is done with the help of a - -00:10:59.839 --> 00:11:02.240 -generic parser object - -00:11:02.240 --> 00:11:04.160 -a single parser object can be used to - -00:11:04.160 --> 00:11:06.000 -pass different languages - -00:11:06.000 --> 00:11:08.320 -by sending different language objects to - -00:11:08.320 --> 00:11:09.279 -it - -00:11:09.279 --> 00:11:10.880 -the language objects themselves are - -00:11:10.880 --> 00:11:14.079 -loaded from shared libraries - -00:11:14.079 --> 00:11:16.079 -since three seater mode already handles - -00:11:16.079 --> 00:11:17.360 -the parsing part - -00:11:17.360 --> 00:11:19.440 -we will instead focus on the functions - -00:11:19.440 --> 00:11:20.800 -that inspect nodes - -00:11:20.800 --> 00:11:25.279 -and in the resulting 
path tree - -00:11:25.279 --> 00:11:27.200 -we can ask tree sitter what is the - -00:11:27.200 --> 00:11:44.240 -syntax node at point - -00:11:44.240 --> 00:11:47.200 -uh is it an opaque object so this is not - -00:11:47.200 --> 00:11:48.480 -very useful - -00:11:48.480 --> 00:12:03.760 -we can instead ask what is its type - -00:12:03.760 --> 00:12:06.560 -so his type is the symbol comparison - -00:12:06.560 --> 00:12:08.959 -operator - -00:12:08.959 --> 00:12:11.600 -trees there are two kinds of nodes - -00:12:11.600 --> 00:12:13.680 -anonymous nodes and named nodes - -00:12:13.680 --> 00:12:15.519 -anonymous nodes correspond to simple - -00:12:15.519 --> 00:12:17.040 -grammar elements - -00:12:17.040 --> 00:12:19.839 -like keywords operators punctuations and - -00:12:19.839 --> 00:12:21.279 -so on - -00:12:21.279 --> 00:12:24.160 -name nodes on the other hand grammar - -00:12:24.160 --> 00:12:25.920 -elements that are interesting enough for - -00:12:25.920 --> 00:12:26.639 -their own - -00:12:26.639 --> 00:12:30.320 -to have a name like an identifier an - -00:12:30.320 --> 00:12:31.839 -expression - -00:12:31.839 --> 00:12:35.440 -or a function definition - -00:12:35.440 --> 00:12:37.760 -name node types are symbols while - -00:12:37.760 --> 00:12:42.639 -anonymous node types are strings - -00:12:42.639 --> 00:12:46.320 -for example if we are on this - -00:12:46.320 --> 00:12:49.760 -comparison operator - -00:12:49.760 --> 00:12:55.920 -the node type should be a string - -00:12:55.920 --> 00:12:57.920 -we can also get other information about - -00:12:57.920 --> 00:12:58.959 -the node - -00:12:58.959 --> 00:13:09.680 -for example what is this text - -00:13:09.680 --> 00:13:20.800 -or where it is in the buffer - -00:13:20.800 --> 00:13:43.199 -or what is its parent - -00:13:43.199 --> 00:13:46.160 -there are many other apis to query or - -00:13:46.160 --> 00:13:46.839 -not - -00:13:46.839 --> 00:13:52.639 -properties - -00:13:52.639 --> 00:13:54.399 -tree sitter 
allows searching for - -00:13:54.399 --> 00:13:58.240 -structural patterns within a parse tree - -00:13:58.240 --> 00:14:01.440 -it does so through a list like language - -00:14:01.440 --> 00:14:03.519 -this language supports by the matching - -00:14:03.519 --> 00:14:04.639 -by node types - -00:14:04.639 --> 00:14:07.760 -field names and predicates - -00:14:07.760 --> 00:14:10.079 -it also allows capturing nodes for - -00:14:10.079 --> 00:14:12.639 -further processing - -00:14:12.639 --> 00:14:37.680 -let's try to see some examples - -00:14:37.680 --> 00:14:41.040 -so in this very simple query we just - -00:14:41.040 --> 00:14:43.839 -try to highlight all the identifiers in - -00:14:43.839 --> 00:14:49.040 -the buffer - -00:14:49.040 --> 00:14:51.920 -this s side tells trisito to capture a - -00:14:51.920 --> 00:14:53.120 -node - -00:14:53.120 --> 00:14:55.839 -in the context of the query builder it's - -00:14:55.839 --> 00:14:57.360 -not very important - -00:14:57.360 --> 00:15:00.320 -but in normal highlighting query this - -00:15:00.320 --> 00:15:01.760 -will determine - -00:15:01.760 --> 00:15:06.639 -the face used to highlight the note - -00:15:06.639 --> 00:15:08.800 -suppose we want to capture all the - -00:15:08.800 --> 00:15:10.320 -function names - -00:15:10.320 --> 00:15:13.519 -instead of just any identifier - -00:15:13.519 --> 00:15:29.440 -you can improve the query like this - -00:15:29.440 --> 00:15:31.600 -uh this will highlight the whole - -00:15:31.600 --> 00:15:32.639 -definition - -00:15:32.639 --> 00:15:35.519 -but we only want to capture the function - -00:15:35.519 --> 00:15:36.399 -name - -00:15:36.399 --> 00:15:39.600 -which means the identifier - -00:15:39.600 --> 00:15:42.800 -here so we - -00:15:42.800 --> 00:15:46.320 -move the capture to after the identifier - -00:15:46.320 --> 00:15:49.600 -node - -00:15:49.600 --> 00:15:51.759 -if we want to capture the class names as - -00:15:51.759 --> 00:15:52.959 -well - -00:15:52.959 --> 
00:16:10.079 -we just add another pattern - -00:16:10.079 --> 00:16:20.320 -let's look at a more practical example - -00:16:20.320 --> 00:16:22.959 -here we can see that single quotes - -00:16:22.959 --> 00:16:23.759 -strings and - -00:16:23.759 --> 00:16:25.600 -double quotes screens are highlighted - -00:16:25.600 --> 00:16:27.279 -the same - -00:16:27.279 --> 00:16:30.399 -but in some places - -00:16:30.399 --> 00:16:33.440 -because of some coding conventions - -00:16:33.440 --> 00:16:35.440 -it may be desirable to highlight them - -00:16:35.440 --> 00:16:37.279 -differently for example if - -00:16:37.279 --> 00:16:39.680 -the string is single quoted we may want - -00:16:39.680 --> 00:16:40.880 -to highlight it - -00:16:40.880 --> 00:16:44.399 -as a constant - -00:16:44.399 --> 00:16:46.160 -let's try to see whether we can - -00:16:46.160 --> 00:16:47.600 -distinguish these - -00:16:47.600 --> 00:16:56.240 -two cases - -00:16:56.240 --> 00:17:00.639 -so here we get all the strings - -00:17:00.639 --> 00:17:04.079 -if we want to see if it's single quotes - -00:17:04.079 --> 00:17:04.559 -or - -00:17:04.559 --> 00:17:08.799 -double quote strings - -00:17:08.799 --> 00:17:11.039 -we can try looking at the first - -00:17:11.039 --> 00:17:12.480 -character - -00:17:12.480 --> 00:17:15.280 -of the string I mean the first character - -00:17:15.280 --> 00:17:16.720 -of the note - -00:17:16.720 --> 00:17:19.360 -to check whether it's a single quote or - -00:17:19.360 --> 00:17:33.600 -a double quote - -00:17:33.600 --> 00:17:36.080 -yeah so for that we use the three - -00:17:36.080 --> 00:17:36.799 -setters - -00:17:36.799 --> 00:17:40.160 -support for predicate in this case - -00:17:40.160 --> 00:17:43.360 -we use a match predicate - -00:17:43.360 --> 00:17:46.080 -to check whether the string where the - -00:17:46.080 --> 00:17:46.799 -note - -00:17:46.799 --> 00:17:50.320 -starts with a single quote and with this - -00:17:50.320 --> 00:17:51.280 -pattern - 
-00:17:51.280 --> 00:17:58.840 -we only capture the single quotes - -00:17:58.840 --> 00:18:00.400 -strings - -00:18:00.400 --> 00:18:03.760 -let's try to give it a different face - -00:18:03.760 --> 00:18:13.039 -so we copy the pattern - -00:18:13.039 --> 00:18:18.640 -and we add this pattern - -00:18:18.640 --> 00:18:25.120 -pop item only - -00:18:25.120 --> 00:18:28.400 -but we also want to give the - -00:18:28.400 --> 00:18:31.440 -capture a different name - -00:18:31.440 --> 00:18:40.840 -let's say we want to highlight it as a - -00:18:40.840 --> 00:18:46.559 -keyword - -00:18:46.559 --> 00:19:06.320 -and now if we refresh the buffer - -00:19:06.320 --> 00:19:08.799 -we see that single quote strings are - -00:19:08.799 --> 00:19:10.320 -highlighted as - -00:19:10.320 --> 00:19:14.400 -keywords - -00:19:14.400 --> 00:19:16.400 -the highlighting patterns can also be - -00:19:16.400 --> 00:19:19.200 -set for a single project - -00:19:19.200 --> 00:19:23.440 -using directory local variable - -00:19:23.440 --> 00:19:26.880 -for example let's take a look at - -00:19:26.880 --> 00:19:35.760 -ems source code - -00:19:35.760 --> 00:19:40.400 -so in image c source there are a lot of - -00:19:40.400 --> 00:19:43.760 -uses of these different macros - -00:19:43.760 --> 00:19:47.679 -to define functions - -00:19:47.679 --> 00:19:51.200 -and you can see - -00:19:51.200 --> 00:19:53.520 -this is actually the function name but - -00:19:53.520 --> 00:19:55.760 -it's highlighted as the - -00:19:55.760 --> 00:19:59.120 -string so what we want - -00:19:59.120 --> 00:20:03.679 -is to somehow recognize this pattern - -00:20:03.679 --> 00:20:07.600 -and highlight it - -00:20:07.600 --> 00:20:11.280 -as highlight this part - -00:20:11.280 --> 00:20:14.559 -with the function phase instead - -00:20:14.559 --> 00:20:17.679 -in order to do that - -00:20:17.679 --> 00:20:20.240 -we put a pattern in this project - -00:20:20.240 --> 00:20:21.760 -directory local - -00:20:21.760 --> 
00:20:31.760 -settings file - -00:20:31.760 --> 00:20:34.799 -so we can put this button in the c - -00:20:34.799 --> 00:20:40.159 -mode section - -00:20:40.159 --> 00:20:48.000 -and now if we enable tree sitter - -00:20:48.000 --> 00:20:50.480 -you can see that this is the highlighted - -00:20:50.480 --> 00:20:53.200 -uh - -00:20:53.200 --> 00:20:55.520 -as a normal function definition so this - -00:20:55.520 --> 00:20:56.559 -is the function - -00:20:56.559 --> 00:21:01.200 -face like we wanted - -00:21:01.200 --> 00:21:03.760 -the pattern for this is actually pretty - -00:21:03.760 --> 00:21:07.200 -simple - -00:21:07.200 --> 00:21:10.720 -it's only - -00:21:10.720 --> 00:21:14.720 -only this part so - -00:21:14.720 --> 00:21:17.440 -if it's a function call where the name - -00:21:17.440 --> 00:21:19.679 -of the function is different - -00:21:19.679 --> 00:21:21.600 -then we highlight the different as a - -00:21:21.600 --> 00:21:24.240 -keyword - -00:21:24.240 --> 00:21:27.360 -and then the first string element we - -00:21:27.360 --> 00:21:28.159 -highlighted - -00:21:28.159 --> 00:21:35.360 -as a function name - -00:21:35.360 --> 00:21:37.679 -since the language objects are actually - -00:21:37.679 --> 00:21:39.280 -native code - -00:21:39.280 --> 00:21:40.799 -they have to be compiled for each - -00:21:40.799 --> 00:21:43.440 -platform that we want to support - -00:21:43.440 --> 00:21:45.600 -this will become a big obstacle for - -00:21:45.600 --> 00:21:48.159 -3-seater adoption - -00:21:48.159 --> 00:21:50.240 -therefore I've created a language window - -00:21:50.240 --> 00:21:52.960 -package 3-seater length - -00:21:52.960 --> 00:21:54.960 -that takes care of pre-compiling the - -00:21:54.960 --> 00:21:56.320 -grammars the - -00:21:56.320 --> 00:21:59.679 -most common grammars for all three major - -00:21:59.679 --> 00:22:01.600 -platforms - -00:22:01.600 --> 00:22:04.080 -it also takes care of distributing these - -00:22:04.080 --> 00:22:05.360 -binaries - 
-00:22:05.360 --> 00:22:08.080 -and provides some highlighting queries - -00:22:08.080 --> 00:22:11.440 -for some of the languages - -00:22:11.440 --> 00:22:13.760 -it should be noted that this package - -00:22:13.760 --> 00:22:15.919 -should be treated as a temporary - -00:22:15.919 --> 00:22:19.919 -distribution mechanism only - -00:22:19.919 --> 00:22:22.240 -to help with bootstrapping three-seaters - -00:22:22.240 --> 00:22:24.720 -adoption - -00:22:24.720 --> 00:22:27.760 -the plan is that eventually these files - -00:22:27.760 --> 00:22:29.760 -should be provided by the language major - -00:22:29.760 --> 00:22:32.480 -modes themselves - -00:22:32.480 --> 00:22:35.120 -but in order to do that we need better - -00:22:35.120 --> 00:22:36.320 -tooling - -00:22:36.320 --> 00:22:40.240 -so we're not there yet - -00:22:40.240 --> 00:22:42.559 -since the call already works reasonably - -00:22:42.559 --> 00:22:43.280 -well - -00:22:43.280 --> 00:22:44.640 -there are several areas that would - -00:22:44.640 --> 00:22:46.320 -benefit from the community's - -00:22:46.320 --> 00:22:49.120 -contribution - -00:22:49.120 --> 00:22:51.520 -so three seaters upstream language - -00:22:51.520 --> 00:22:52.640 -prepositories - -00:22:52.640 --> 00:22:54.400 -already contain highlighting queries on - -00:22:54.400 --> 00:22:55.679 -their own - -00:22:55.679 --> 00:22:58.480 -however they are pretty basic and they - -00:22:58.480 --> 00:23:00.480 -may not fit well with existing emax - -00:23:00.480 --> 00:23:02.559 -conventions - -00:23:02.559 --> 00:23:04.320 -therefore the language bundle has its - -00:23:04.320 --> 00:23:07.120 -own set of highlighting queries - -00:23:07.120 --> 00:23:10.559 -this requires maintenance until language - -00:23:10.559 --> 00:23:11.600 -measurements adopt - -00:23:11.600 --> 00:23:13.760 -three sitter and maintain the queries on - -00:23:13.760 --> 00:23:16.640 -their own - -00:23:16.640 --> 00:23:18.480 -the queries are actually quite easy to - 
-00:23:18.480 --> 00:23:22.000 -write as you've already seen - -00:23:22.000 --> 00:23:24.240 -you just need to be familiar with the - -00:23:24.240 --> 00:23:25.360 -language - -00:23:25.360 --> 00:23:30.000 -familiar enough to come up with sensible - -00:23:30.000 --> 00:23:35.200 -highlighting patterns - -00:23:35.200 --> 00:23:37.600 -and if you are a maintainer of a - -00:23:37.600 --> 00:23:39.679 -language major mode - -00:23:39.679 --> 00:23:42.320 -you may want to consider integrating - -00:23:42.320 --> 00:23:43.360 -tree sitter into - -00:23:43.360 --> 00:23:46.960 -your mode initially maybe as an - -00:23:46.960 --> 00:23:50.080 -optional feature the integration is - -00:23:50.080 --> 00:23:53.279 -actually pretty straightforward - -00:23:53.279 --> 00:23:56.640 -especially for syntax highlighting - -00:23:56.640 --> 00:24:01.520 -or alternatively - -00:24:01.520 --> 00:24:03.760 -you can also try writing a new major - -00:24:03.760 --> 00:24:04.640 -mode - -00:24:04.640 --> 00:24:08.000 -from scratch that relies on tree sitter - -00:24:08.000 --> 00:24:12.559 -from the very beginning - -00:24:12.559 --> 00:24:16.320 -the code for such a major mode is - -00:24:16.320 --> 00:24:19.679 -quite simple for example - -00:24:19.679 --> 00:24:23.200 -this is the proposed - -00:24:23.200 --> 00:24:26.240 -what mode for web assembly - -00:24:26.240 --> 00:24:31.039 -the code is just - -00:24:31.039 --> 00:24:34.559 -like one page of code not - -00:24:34.559 --> 00:24:39.520 -not a lot - -00:24:39.520 --> 00:24:42.720 -you can also try writing new minor modes - -00:24:42.720 --> 00:24:46.559 -or writing integration packages - -00:24:46.559 --> 00:24:50.080 -for example a lot of package a lot of - -00:24:50.080 --> 00:24:50.880 -packages - -00:24:50.880 --> 00:24:54.559 -may benefit from tree sitter integration - -00:24:54.559 --> 00:24:58.840 -but no one has written the integration - -00:24:58.840 --> 00:25:02.960 -yet - -00:25:02.960 --> 00:25:05.039 -if you are 
interested in 3-seater you - -00:25:05.039 --> 00:25:06.720 -can use these links to - -00:25:06.720 --> 00:25:10.320 -learn more about it I think that's it - -00:25:10.320 --> 00:25:11.440 -for me today - -00:25:11.440 --> 00:25:18.159 -I'm happy to answer any questions diff --git a/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen.vtt b/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen.vtt new file mode 100644 index 00000000..276f3150 --- /dev/null +++ b/2020/subtitles/emacsconf-2020--23-incremental-parsing-with-emacs-tree-sitter--tuan-anh-nguyen.vtt @@ -0,0 +1,1235 @@ +WEBVTT + +00:00:01.520 --> 00:00:04.400 +Hello, everyone! My name is Tuấn-Anh. + +00:00:04.400 --> 00:00:07.200 +I've been using Emacs for about 10 years. + +00:00:07.200 --> 00:00:09.280 +Today, I'm going to talk about tree-sitter, + +00:00:09.280 --> 00:00:11.351 +a new Emacs package that allows Emacs + +00:00:11.351 --> 00:00:17.840 +to parse multiple programming languages +in real-time. + +00:00:17.840 --> 00:00:21.840 +So what is the problem statement? + +00:00:21.840 --> 00:00:24.131 +In order to support programming +functionalities + +00:00:24.131 --> 00:00:25.760 +for a particular language, + +00:00:25.760 --> 00:00:27.680 +a text editor needs to have some degree + +00:00:27.680 --> 00:00:29.679 +of language understanding. + +00:00:29.679 --> 00:00:31.840 +Traditionally, text editors have relied + +00:00:31.840 --> 00:00:34.960 +very heavily on regular expressions for +this. + +00:00:34.960 --> 00:00:37.013 +Emacs is no different. + +00:00:37.013 --> 00:00:40.170 +Most language major modes use regular +expressions + +00:00:40.170 --> 00:00:42.960 +for syntax-highlighting, code navigation, + +00:00:42.960 --> 00:00:46.618 +folding, indexing, and so on. + +00:00:46.618 --> 00:00:50.559 +Regular expressions are problematic for +a couple of reasons. 
+ +00:00:50.559 --> 00:00:53.778 +They're slow and inaccurate. + +00:00:53.778 --> 00:00:56.800 +They also make the code hard to read and +write. + +00:00:56.800 --> 00:01:01.199 +Sometimes it's because the regular +expressions themselves are very hairy, + +00:01:01.199 --> 00:01:05.199 +and sometimes because they are just not +powerful enough. + +00:01:05.199 --> 00:01:08.625 +Some helper code is usually needed + +00:01:08.625 --> 00:01:11.200 +to parse more intricate language +features. + +00:01:11.200 --> 00:01:16.159 +That also illustrates the core problem +with regular expressions, + +00:01:16.159 --> 00:01:21.119 +in that they are not powerful enough to +parse programming languages. + +00:01:21.119 --> 00:01:25.040 +An example feature that regular +expressions cannot handle very well + +00:01:25.040 --> 00:01:28.320 +is string interpolation, which is a very +common feature + +00:01:28.320 --> 00:01:31.680 +in many modern programming languages. + +00:01:31.680 --> 00:01:34.079 +It would be much nicer if Emacs somehow + +00:01:34.079 --> 00:01:39.520 +had structural understanding of source +code, like IDEs do. + +00:01:39.520 --> 00:01:41.981 +There have been multiple efforts + +00:01:41.981 --> 00:01:45.280 +to bring this kind of programming +language understanding into Emacs. + +00:01:45.280 --> 00:01:47.119 +There are language-specific parsers + +00:01:47.119 --> 00:01:48.640 +written in Elisp + +00:01:48.640 --> 00:01:50.675 +that can be thought of + +00:01:50.675 --> 00:01:51.989 +as the next logical step +of the glue code + +00:01:51.989 --> 00:01:53.856 +on top of regular expressions, + +00:01:53.856 --> 00:01:57.356 +moving from partial local pattern +recognition + +00:01:57.356 --> 00:01:59.840 +into a full-fledged parser. + +00:01:59.840 --> 00:02:02.023 +The most prominent example of this +approach + +00:02:02.023 --> 00:02:06.479 +is probably the famous js2-mode. + +00:02:06.479 --> 00:02:10.080 +However, this approach has several issues. 
+ +00:02:10.080 --> 00:02:12.606 +Parsing is computationally expensive, + +00:02:12.606 --> 00:02:16.800 +and Emacs Lisp is not good at that kind +of stuff. + +00:02:16.800 --> 00:02:19.156 +Furthermore, maintenance is very +troublesome. + +00:02:19.156 --> 00:02:22.160 +In order to work on these parsers, + +00:02:22.160 --> 00:02:24.239 +first, you have to know Elisp +well enough, + +00:02:24.239 --> 00:02:26.606 +and then you have to be comfortable with + +00:02:26.606 --> 00:02:29.739 +writing a recursive descent parser, + +00:02:29.739 --> 00:02:34.000 +while constantly keeping up with changes +to the language itself, + +00:02:34.000 --> 00:02:36.356 +which can be evolving very quickly, + +00:02:36.356 --> 00:02:39.360 +like JavaScript, for example. + +00:02:39.360 --> 00:02:42.373 +Together, these constraints +significantly reduce + +00:02:42.373 --> 00:02:45.680 +the pool of potential maintainers. + +00:02:45.680 --> 00:02:47.760 +The biggest issue, though, in my opinion, + +00:02:47.760 --> 00:02:52.139 +is the lack of a set of generic and +reusable APIs. + +00:02:52.139 --> 00:02:54.319 +This makes them very hard to use + +00:02:54.319 --> 00:02:55.920 +for minor modes that want to deal with + +00:02:55.920 --> 00:02:59.920 +cross-cutting concerns across multiple +languages. + +00:02:59.920 --> 00:03:01.760 +The other approach which has been + +00:03:01.760 --> 00:03:04.319 +gaining a lot of momentum +in recent years + +00:03:04.319 --> 00:03:06.560 +is externalizing language understanding + +00:03:06.560 --> 00:03:08.159 +to another process, + +00:03:08.159 --> 00:03:12.239 +also known as the Language Server Protocol. + +00:03:12.239 --> 00:03:16.560 +This second approach is actually a very +interesting one.
+ +00:03:16.560 --> 00:03:18.400 +By decoupling language understanding + +00:03:18.400 --> 00:03:21.280 +from the editing facility itself, + +00:03:21.280 --> 00:03:25.120 +the LSP servers can attract a lot more +contributors, + +00:03:25.120 --> 00:03:27.189 +which makes maintenance easier. + +00:03:27.189 --> 00:03:32.400 +However, they also have several issues +of their own. + +00:03:32.400 --> 00:03:34.089 +Being a separate process, + +00:03:34.089 --> 00:03:37.073 +they are usually more +resource-intensive, + +00:03:37.073 --> 00:03:39.920 +and depending on the language, + +00:03:39.920 --> 00:03:42.159 +the LSP server itself can bring with it + +00:03:42.159 --> 00:03:44.640 +a host of additional dependencies + +00:03:44.640 --> 00:03:50.640 +external to Emacs, which may be messy to +install and manage. + +00:03:50.640 --> 00:03:55.120 +Furthermore, JSON over RPC has pretty +high latency. + +00:03:55.120 --> 00:03:57.840 +For one-off tasks like jumping to source + +00:03:57.840 --> 00:04:00.879 +or on-demand completion, it's great. + +00:04:00.879 --> 00:04:03.040 +But for things like code highlighting, + +00:04:03.040 --> 00:04:06.000 +the latency is just too much. + +00:04:06.000 --> 00:04:08.319 +I was using Rust and I was following the + +00:04:08.319 --> 00:04:11.760 +community effort to improve its +IDE support, + +00:04:11.760 --> 00:04:15.760 +hoping to integrate some of that into +Emacs itself. + +00:04:15.760 --> 00:04:19.759 +Then I heard someone from the community +mention tree-sitter, + +00:04:19.759 --> 00:04:23.360 +and I decided to check it out. + +00:04:23.360 --> 00:04:28.720 +Basically, tree-sitter is an incremental +parsing library and a parser generator. + +00:04:28.720 --> 00:04:33.040 +It was introduced by the Atom editor in +2018. 
+ +00:04:33.040 --> 00:04:35.923 +Besides Atom, it is also being +integrated + +00:04:35.923 --> 00:04:37.623 +into the Neovim editor, + +00:04:37.623 --> 00:04:41.040 +and GitHub is using it to power + +00:04:41.040 --> 00:04:42.423 +their source code analysis + +00:04:42.423 --> 00:04:45.840 +and navigation features. + +00:04:45.840 --> 00:04:48.639 +It is written in C and can be compiled + +00:04:48.639 --> 00:04:50.623 +for all major platforms. + +00:04:50.623 --> 00:04:53.120 +It can even be compiled + +00:04:53.120 --> 00:04:55.323 +to WebAssembly to run on the web. + +00:04:55.323 --> 00:05:00.800 +That's how GitHub is using it +on their website. + +00:05:00.800 --> 00:05:05.840 +So why is tree-sitter an interesting +solution to this problem? + +00:05:05.840 --> 00:05:10.000 +There are multiple features that make it +an attractive option. + +00:05:10.000 --> 00:05:11.839 +It is designed to be fast. + +00:05:11.839 --> 00:05:13.680 +By being incremental, + +00:05:13.680 --> 00:05:15.680 +the initial parse of a typical big file + +00:05:15.680 --> 00:05:18.160 +can take tens of milliseconds, + +00:05:18.160 --> 00:05:20.240 +while subsequent incremental parses + +00:05:20.240 --> 00:05:22.560 +are sub-millisecond. + +00:05:22.560 --> 00:05:26.240 +It achieves this by using +structural sharing, + +00:05:26.240 --> 00:05:29.360 +meaning replacing only affected nodes + +00:05:29.360 --> 00:05:32.960 +in the old tree when it needs to. + +00:05:32.960 --> 00:05:37.120 +Also, unlike LSP, being in +the same process, + +00:05:37.120 --> 00:05:40.639 +it has much lower latency. + +00:05:40.639 --> 00:05:44.960 +Secondly, it provides a uniform +programming interface. + +00:05:44.960 --> 00:05:47.039 +The same data structures and functions + +00:05:47.039 --> 00:05:50.400 +work on parse trees of different +languages.
+ +00:05:50.400 --> 00:05:52.160 +Syntax nodes of different languages + +00:05:52.160 --> 00:05:54.160 +differ only by their types + +00:05:54.160 --> 00:05:55.723 +and their possible child nodes. + +00:05:55.723 --> 00:06:02.240 +This is a big advantage over +language-specific parsers. + +00:06:02.240 --> 00:06:06.880 +Thirdly, it's written in self-contained +embeddable C. + +00:06:06.880 --> 00:06:11.723 +As I mentioned previously, it can even +be compiled to WebAssembly. + +00:06:11.723 --> 00:06:16.106 +This makes integrating it into various +editors quite easy + +00:06:16.106 --> 00:06:22.880 +without having to install any external +dependencies. + +00:06:22.880 --> 00:06:25.503 +One thing that is not mentioned here + +00:06:25.503 --> 00:06:28.000 +is that being a parser generator, + +00:06:28.000 --> 00:06:31.039 +its grammars are declarative. + +00:06:31.039 --> 00:06:34.880 +Together with being editor-independent, + +00:06:34.880 --> 00:06:39.139 +this makes the pool of potential +contributors much larger. + +00:06:39.139 --> 00:06:45.520 +So I was convinced that tree-sitter is a +good fit for Emacs. + +00:06:45.520 --> 00:06:48.000 +Last year, I started writing the bindings + +00:06:48.000 --> 00:06:53.280 +using dynamic module support introduced +in Emacs 25. + +00:06:53.280 --> 00:06:58.479 +Dynamic module means there is +platform-specific native code involved, + +00:06:58.479 --> 00:07:00.560 +but since there are pre-compiled binaries + +00:07:00.560 --> 00:07:02.880 +for the three major platforms, + +00:07:02.880 --> 00:07:04.706 +it should work in most places. + +00:07:04.706 --> 00:07:09.440 +Currently, the core functionalities are +in pretty good shape. + +00:07:09.440 --> 00:07:12.560 +Syntax highlighting is working nicely. + +00:07:12.560 --> 00:07:16.080 +The whole thing is split into three +packages. + +00:07:16.080 --> 00:07:20.319 +tree-sitter is the main package that +other packages should depend on.
+ +00:07:20.319 --> 00:07:22.800 +tree-sitter-langs is the language bundle + +00:07:22.800 --> 00:07:24.000 +that includes support + +00:07:24.000 --> 00:07:27.199 +for most common languages. + +00:07:27.199 --> 00:07:32.160 +And finally, the core APIs are in the +package tsc, + +00:07:32.160 --> 00:07:36.160 +which stands for tree-sitter-core. + +00:07:36.160 --> 00:07:38.800 +It is the implicit dependency of the + +00:07:38.800 --> 00:07:43.520 +tree-sitter package. + +00:07:43.520 --> 00:07:47.520 +The main package includes the minor mode +tree-sitter-mode. + +00:07:47.520 --> 00:07:52.560 +This provides the base for other major +or minor modes to build on. + +00:07:52.560 --> 00:07:54.839 +Using Emacs's change tracking hooks, + +00:07:54.839 --> 00:07:57.073 +it enables incremental parsing + +00:07:57.073 --> 00:08:00.800 +and provides a syntax tree that is +always up to date + +00:08:00.800 --> 00:08:04.080 +after any edits in a buffer. + +00:08:04.080 --> 00:08:06.223 +There is also a basic debug mode + +00:08:06.223 --> 00:08:10.080 +that shows the parse tree in +another buffer. + +00:08:10.080 --> 00:08:13.360 +Here is a quick demo. + +00:08:13.360 --> 00:08:15.673 +Here I'm in an empty Python buffer + +00:08:15.673 --> 00:08:17.520 +with tree-sitter enabled. + +00:08:17.520 --> 00:08:19.440 +I'm going to turn on the debug mode to + +00:08:19.440 --> 00:08:26.560 +see the parse tree. + +00:08:26.560 --> 00:08:28.106 +Since the buffer is empty, + +00:08:28.106 --> 00:08:30.423 +there is only one node in the +syntax tree: + +00:08:30.423 --> 00:08:33.279 +the top-level module node. + +00:08:33.279 --> 00:09:11.040 +Let's try typing some code. + +00:09:11.040 --> 00:09:14.640 +As you can see, as I type into the +Python buffer, + +00:09:14.640 --> 00:09:19.120 +the syntax tree updates in real time. + +00:09:19.120 --> 00:09:22.039 +The other minor mode included in the +main package + +00:09:22.039 --> 00:09:24.389 +is tree-sitter-hl-mode. 
+ +00:09:24.389 --> 00:09:26.349 +It overrides font-lock mode + +00:09:26.349 --> 00:09:28.480 +and provides its own set of faces + +00:09:28.480 --> 00:09:30.139 +and customization options. + +00:09:30.139 --> 00:09:32.800 +It is query-driven. + +00:09:32.800 --> 00:09:36.240 +That means instead of regular +expressions, + +00:09:36.240 --> 00:09:39.518 +it uses a Lisp-like query language + +00:09:39.518 --> 00:09:40.320 +to map syntax nodes + +00:09:40.320 --> 00:09:41.923 +to highlighting faces. + +00:09:41.923 --> 00:09:45.760 +I'm going to open a Python file with +small snippets + +00:09:45.760 --> 00:09:54.320 +that showcase syntax highlighting. + +00:09:54.320 --> 00:09:55.920 +So this is the default highlighting + +00:09:55.920 --> 00:10:00.880 +provided by python-mode. + +00:10:00.880 --> 00:10:04.640 +This is the highlighting enabled +by tree-sitter. + +00:10:04.640 --> 00:10:07.680 +As you can see, string interpolation + +00:10:07.680 --> 00:10:11.680 +and decorators are highlighted correctly. + +00:10:11.680 --> 00:10:17.440 +Function calls are also highlighted. + +00:10:17.440 --> 00:10:21.839 +You can also note that +property accessors + +00:10:21.839 --> 00:10:27.440 +and property assignments are highlighted +differently. + +00:10:27.440 --> 00:10:29.360 +What I like the most about this is that + +00:10:29.360 --> 00:10:32.640 +new bindings are consistently +highlighted. + +00:10:32.640 --> 00:10:36.320 +This includes local variables, + +00:10:36.320 --> 00:10:45.760 +function parameters, and property +mutations. + +00:10:45.760 --> 00:10:48.000 +Before going through the tree queries + +00:10:48.000 --> 00:10:49.279 +and the syntax highlighting + +00:10:49.279 --> 00:10:51.680 +customization options, + +00:10:51.680 --> 00:10:53.339 +let's take a brief look at + +00:10:53.339 --> 00:10:55.040 +the core data structures and functions + +00:10:55.040 --> 00:10:58.079 +that tree-sitter provides.
+ +00:10:58.079 --> 00:11:00.743 +So parsing is done with the help of + +00:11:00.743 --> 00:11:02.240 +a generic parser object. + +00:11:02.240 --> 00:11:04.160 +A single parser object can be used to + +00:11:04.160 --> 00:11:06.000 +parse different languages + +00:11:06.000 --> 00:11:09.279 +by sending different language objects to +it. + +00:11:09.279 --> 00:11:10.880 +The language objects themselves are + +00:11:10.880 --> 00:11:14.079 +loaded from shared libraries. + +00:11:14.079 --> 00:11:16.079 +Since tree-sitter-mode already handles + +00:11:16.079 --> 00:11:17.360 +the parsing part, + +00:11:17.360 --> 00:11:19.440 +we will instead focus on the functions + +00:11:19.440 --> 00:11:20.800 +that inspect nodes + +00:11:20.800 --> 00:11:25.279 +in the resulting parse tree. + +00:11:25.279 --> 00:11:27.030 +We can ask tree-sitter what is + +00:11:27.030 --> 00:11:44.240 +the syntax node at point. + +00:11:44.240 --> 00:11:48.480 +This is an opaque object, so this is not +very useful. + +00:11:48.480 --> 00:12:03.760 +We can instead ask what is its type. + +00:12:03.760 --> 00:12:08.959 +So its type is the symbol comparison +operator. + +00:12:08.959 --> 00:12:11.600 +In tree-sitter, there are two kinds of nodes, + +00:12:11.600 --> 00:12:13.680 +anonymous nodes and named nodes. + +00:12:13.680 --> 00:12:17.040 +Anonymous nodes correspond to simple +grammar elements + +00:12:17.040 --> 00:12:21.279 +like keywords, operators, punctuation, +and so on. + +00:12:21.279 --> 00:12:24.656 +Named nodes, on the other hand, are +grammar elements + +00:12:24.656 --> 00:12:26.639 +that are interesting enough +on their own + +00:12:26.639 --> 00:12:30.029 +to have a name, like an identifier, + +00:12:30.029 --> 00:12:35.440 +an expression, or a function definition. + +00:12:35.440 --> 00:12:37.323 +Named node types are symbols, + +00:12:37.323 --> 00:12:42.639 +while anonymous node types are strings.
+ +00:12:42.639 --> 00:12:49.760 +For example, if we are on this +comparison operator, + +00:12:49.760 --> 00:12:55.920 +the node type should be a string. + +00:12:55.920 --> 00:12:58.959 +We can also get other information about +the node. + +00:12:58.959 --> 00:13:09.680 +For example: what is its text, + +00:13:09.680 --> 00:13:20.800 +or where it is in the buffer, + +00:13:20.800 --> 00:13:43.199 +or what is its parent. + +00:13:43.199 --> 00:13:46.106 +There are many other APIs to query + +00:13:46.106 --> 00:13:52.639 +a node's properties. + +00:13:52.639 --> 00:13:54.234 +tree-sitter allows searching + +00:13:54.234 --> 00:13:58.240 +for structural patterns +within a parse tree. + +00:13:58.240 --> 00:14:01.440 +It does so through a Lisp-like language. + +00:14:01.440 --> 00:14:04.639 +This language supports matching +by node types, + +00:14:04.639 --> 00:14:07.760 +field names, and predicates. + +00:14:07.760 --> 00:14:12.639 +It also allows capturing nodes for +further processing. + +00:14:12.639 --> 00:14:37.680 +Let's try to see some examples. + +00:14:37.680 --> 00:14:40.206 +So in this very simple query, + +00:14:40.206 --> 00:14:49.040 +we just try to highlight all the +identifiers in the buffer. + +00:14:49.040 --> 00:14:53.120 +This s side tells tree-sitter +to capture a node. + +00:14:53.120 --> 00:14:55.507 +In the context of the query builder, + +00:14:55.507 --> 00:14:57.360 +it's not very important, + +00:14:57.360 --> 00:14:59.706 +but in a normal highlighting query, + +00:14:59.706 --> 00:15:01.760 +this will determine + +00:15:01.760 --> 00:15:06.639 +the face used to highlight the node. + +00:15:06.639 --> 00:15:08.256 +Suppose we want to capture + +00:15:08.256 --> 00:15:10.320 +all the function names, + +00:15:10.320 --> 00:15:13.519 +instead of just any identifier. + +00:15:13.519 --> 00:15:29.440 +You can improve the query like this. + +00:15:29.440 --> 00:15:32.639 +This will highlight the whole definition.
+ +00:15:32.639 --> 00:15:36.399 +But we only want to capture +the function name, + +00:15:36.399 --> 00:15:41.054 +which means the identifier here. + +00:15:41.054 --> 00:15:49.600 +So we move the capture to after the +identifier node. + +00:15:49.600 --> 00:15:52.959 +If we want to capture the +class names as well, + +00:15:52.959 --> 00:16:10.079 +we just add another pattern. + +00:16:10.079 --> 00:16:20.320 +Let's look at a more practical example. + +00:16:20.320 --> 00:16:23.468 +Here we can see that +single-quoted strings + +00:16:23.468 --> 00:16:27.279 +and double-quoted strings are +highlighted the same. + +00:16:27.279 --> 00:16:30.399 +But in some places, + +00:16:30.399 --> 00:16:33.440 +because of some coding conventions, + +00:16:33.440 --> 00:16:36.373 +it may be desirable to highlight them +differently. + +00:16:36.373 --> 00:16:39.073 +For example, if the string is +single-quoted, + +00:16:39.073 --> 00:16:44.399 +we may want to highlight it as a +constant. + +00:16:44.399 --> 00:16:46.160 +Let's try to see whether we can + +00:16:46.160 --> 00:16:56.240 +distinguish these two cases. + +00:16:56.240 --> 00:17:00.639 +So here we get all the strings. + +00:17:00.639 --> 00:17:04.079 +If we want to see if it's single quotes + +00:17:04.079 --> 00:17:08.799 +or double quote strings, + +00:17:08.799 --> 00:17:13.436 +we can try looking at the first +character of the string-- + +00:17:13.436 --> 00:17:16.720 +I mean the first character of the node-- + +00:17:16.720 --> 00:17:33.600 +to check whether it's a single quote or +a double quote. + +00:17:33.600 --> 00:17:38.920 +So for that, we use tree-sitter's +support for predicates. + +00:17:38.920 --> 00:17:43.360 +In this case, we use a match predicate + +00:17:43.360 --> 00:17:47.339 +to check whether the string-- +whether the node starts + +00:17:47.339 --> 00:17:49.556 +with a single quote. 
+ +00:17:49.556 --> 00:17:51.280 +And with this pattern, + +00:17:51.280 --> 00:18:00.400 +we only capture the single-quoted +strings. + +00:18:00.400 --> 00:18:03.760 +Let's try to give it a different face. + +00:18:03.760 --> 00:18:13.039 +So we copy the pattern, + +00:18:13.039 --> 00:18:25.120 +and we add this pattern for Python only. + +00:18:25.120 --> 00:18:31.440 +But we also want to give the capture +a different name. + +00:18:31.440 --> 00:18:46.559 +Let's say we want to highlight it +as a keyword. + +00:18:46.559 --> 00:19:06.320 +And now, if we refresh the buffer, + +00:19:06.320 --> 00:19:08.523 +we see that single-quoted strings + +00:19:08.523 --> 00:19:14.400 +are highlighted as keywords. + +00:19:14.400 --> 00:19:15.751 +The highlighting patterns + +00:19:15.751 --> 00:19:19.200 +can also be set for a single project + +00:19:19.200 --> 00:19:23.440 +using directory-local variables. + +00:19:23.440 --> 00:19:35.760 +For example, let's take a look at +Emacs's source code. + +00:19:35.760 --> 00:19:41.123 +So in Emacs's C source, +there are a lot of uses + +00:19:41.123 --> 00:19:43.760 +of these different macros + +00:19:43.760 --> 00:19:47.679 +to define functions, + +00:19:47.679 --> 00:19:53.256 +and you can see this is actually +the function name, + +00:19:53.256 --> 00:19:56.373 +but it's highlighted as a string. + +00:19:56.373 --> 00:20:03.679 +So what we want is to somehow +recognize this pattern + +00:20:03.679 --> 00:20:07.600 +and highlight it. + +00:20:07.600 --> 00:20:11.280 +Highlight this part + +00:20:11.280 --> 00:20:14.559 +with the function face instead. + +00:20:14.559 --> 00:20:17.679 +In order to do that, + +00:20:17.679 --> 00:20:31.760 +we put a pattern in this project's +directory-local settings file. + +00:20:31.760 --> 00:20:40.159 +So we can put this pattern in +the C mode section.
+ +00:20:40.159 --> 00:20:48.000 +And now, if we enable tree-sitter, + +00:20:48.000 --> 00:20:50.480 +you can see that this is highlighted + +00:20:53.200 --> 00:20:55.056 +as a normal function definition. + +00:20:55.056 --> 00:21:01.200 +So this is the function face +like we wanted. + +00:21:01.200 --> 00:21:07.200 +The pattern for this is +actually pretty simple. + +00:21:07.200 --> 00:21:12.373 +It's only this part. + +00:21:12.373 --> 00:21:16.456 +So if it's a function call + +00:21:16.456 --> 00:21:19.679 +where the name of the function is +defun, + +00:21:19.679 --> 00:21:24.240 +then we highlight the defun as a +keyword, + +00:21:24.240 --> 00:21:26.923 +and then the first string element, + +00:21:26.923 --> 00:21:35.360 +we highlight it as a function name. + +00:21:35.360 --> 00:21:39.280 +Since the language objects are actually +native code, + +00:21:39.280 --> 00:21:41.459 +they have to be compiled +for each platform + +00:21:41.459 --> 00:21:43.440 +that we want to support. + +00:21:43.440 --> 00:21:48.159 +This will become a big obstacle for +tree-sitter adoption. + +00:21:48.159 --> 00:21:52.960 +Therefore, I've created a language bundle +package, tree-sitter-langs, + +00:21:52.960 --> 00:21:55.773 +that takes care of pre-compiling the +grammars, + +00:21:55.773 --> 00:22:01.600 +the most common grammars for all three +major platforms. + +00:22:01.600 --> 00:22:05.360 +It also takes care of distributing +these binaries + +00:22:05.360 --> 00:22:08.080 +and provides some highlighting queries + +00:22:08.080 --> 00:22:11.440 +for some of the languages. + +00:22:11.440 --> 00:22:13.760 +It should be noted that this package + +00:22:13.760 --> 00:22:19.919 +should be treated as a temporary +distribution mechanism only, + +00:22:19.919 --> 00:22:24.720 +to help with bootstrapping +tree-sitter adoption. 
+ +00:22:24.720 --> 00:22:27.760 +The plan is that eventually these files + +00:22:27.760 --> 00:22:29.156 +should be provided by + +00:22:29.156 --> 00:22:32.480 +the language major modes themselves. + +00:22:32.480 --> 00:22:36.320 +But in order to do that, we need better +tooling, + +00:22:36.320 --> 00:22:40.240 +so we're not there yet. + +00:22:40.240 --> 00:22:43.280 +Since the core already works +reasonably well, + +00:22:43.280 --> 00:22:45.289 +there are several areas +that would benefit + +00:22:45.289 --> 00:22:49.120 +from the community's contribution. + +00:22:49.120 --> 00:22:52.640 +So tree-sitter's upstream language +repositories + +00:22:52.640 --> 00:22:55.679 +already contain highlighting queries on +their own. + +00:22:55.679 --> 00:22:57.573 +However, they are pretty basic, + +00:22:57.573 --> 00:23:02.559 +and they may not fit well with existing +Emacs conventions. + +00:23:02.559 --> 00:23:07.120 +Therefore, the language bundle has its +own set of highlighting queries. + +00:23:07.120 --> 00:23:12.556 +This requires maintenance until language +major modes adopt tree-sitter + +00:23:12.556 --> 00:23:16.640 +and maintain the queries on their own. + +00:23:16.640 --> 00:23:19.056 +The queries are actually +quite easy to write, + +00:23:19.056 --> 00:23:22.000 +as you've already seen. + +00:23:22.000 --> 00:23:25.360 +You just need to be familiar +with the language, + +00:23:25.360 --> 00:23:35.200 +familiar enough to come up with sensible +highlighting patterns. + +00:23:35.200 --> 00:23:39.679 +And if you are a maintainer of a +language major mode, + +00:23:39.679 --> 00:23:44.189 +you may want to consider integrating +tree-sitter into your mode, + +00:23:44.189 --> 00:23:48.573 +initially maybe as an optional feature. + +00:23:48.573 --> 00:23:53.279 +The integration is actually pretty +straightforward, + +00:23:53.279 --> 00:23:56.640 +especially for syntax highlighting. 
+ +00:23:56.640 --> 00:24:01.520 +Or alternatively, + +00:24:01.520 --> 00:24:05.760 +you can also try writing a new major +mode from scratch + +00:24:05.760 --> 00:24:08.000 +that relies on tree-sitter + +00:24:08.000 --> 00:24:12.559 +from the very beginning. + +00:24:12.559 --> 00:24:17.523 +The code for such a major mode is +quite simple. + +00:24:17.523 --> 00:24:23.200 +For example, this is the proposed + +00:24:23.200 --> 00:24:26.240 +wat-mode for WebAssembly. + +00:24:26.240 --> 00:24:39.520 +The code is just one page of code, +not a lot. + +00:24:39.520 --> 00:24:42.720 +You can also try writing new minor modes + +00:24:42.720 --> 00:24:46.559 +or writing integration packages. + +00:24:46.559 --> 00:24:50.880 +For example, a lot of packages + +00:24:50.880 --> 00:24:54.559 +may benefit from tree-sitter integration, + +00:24:54.559 --> 00:25:02.960 +but no one has written +the integration yet. + +00:25:02.960 --> 00:25:04.836 +If you are interested in tree-sitter, + +00:25:04.836 --> 00:25:08.023 +you can use these links to learn more +about it. + +00:25:08.023 --> 00:25:11.440 +I think that's it for me today. + +00:25:11.440 --> 00:25:18.159 +I'm happy to answer any questions. -- cgit v1.2.3