WEBVTT 00:00:01.520 --> 00:00:04.400 Hello, everyone! My name is Tuấn-Anh. 00:00:04.400 --> 00:00:07.200 I've been using Emacs for about 10 years. 00:00:07.200 --> 00:00:09.280 Today, I'm going to talk about tree-sitter, 00:00:09.280 --> 00:00:11.351 a new Emacs package that allows Emacs 00:00:11.351 --> 00:00:17.840 to parse multiple programming languages in real-time. 00:00:17.840 --> 00:00:21.840 So what is the problem statement? 00:00:21.840 --> 00:00:24.131 In order to support programming functionalities 00:00:24.131 --> 00:00:25.760 for a particular language, 00:00:25.760 --> 00:00:27.680 a text editor needs to have some degree 00:00:27.680 --> 00:00:29.679 of language understanding. 00:00:29.679 --> 00:00:31.840 Traditionally, text editors have relied 00:00:31.840 --> 00:00:34.960 very heavily on regular expressions for this. 00:00:34.960 --> 00:00:37.013 Emacs is no different. 00:00:37.013 --> 00:00:40.170 Most language major modes use regular expressions 00:00:40.170 --> 00:00:42.960 for syntax-highlighting, code navigation, 00:00:42.960 --> 00:00:46.618 folding, indexing, and so on. 00:00:46.618 --> 00:00:50.559 Regular expressions are problematic for a couple of reasons. 00:00:50.559 --> 00:00:53.778 They're slow and inaccurate. 00:00:53.778 --> 00:00:56.800 They also make the code hard to read and write. 00:00:56.800 --> 00:01:01.199 Sometimes it's because the regular expressions themselves are very hairy, 00:01:01.199 --> 00:01:05.199 and sometimes because they are just not powerful enough. 00:01:05.199 --> 00:01:08.625 Some helper code is usually needed 00:01:08.625 --> 00:01:11.200 to parse more intricate language features. 00:01:11.200 --> 00:01:16.159 That also illustrates the core problem with regular expressions, 00:01:16.159 --> 00:01:21.119 in that they are not powerful enough to parse programming languages. 
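NOTE
A minimal sketch of the limitation described above, assuming a toy string
format in which interpolations are written in braces. The names NAIVE,
interpolations_regex and interpolations_scanner are hypothetical and are not
part of Emacs or tree-sitter. The naive pattern stops at the first closing
brace, while a scanner that tracks nesting depth recovers the whole expression.
The scanner is itself simplified and ignores braces inside string literals.
```python
import re
# Naive pattern: an interpolation runs from "{" to the first "}".
NAIVE = re.compile(r"\{[^}]*\}")
# Regex-only approach: no notion of nesting.
def interpolations_regex(text: str) -> list[str]:
    """Find interpolations with the naive regular expression."""
    return NAIVE.findall(text)
# Depth-tracking approach: a tiny step towards a real parser.
def interpolations_scanner(text: str) -> list[str]:
    """Find top-level interpolations by tracking brace depth."""
    spans: list[str] = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i  # remember where the outermost brace opened
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans
sample = 'total: {sum({"a": 1}.values())} items'
print(interpolations_regex(sample))    # cut off at the first closing brace
print(interpolations_scanner(sample))  # the whole nested expression
```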
00:01:21.119 --> 00:01:25.040 An example feature that regular expressions cannot handle very well 00:01:25.040 --> 00:01:28.320 is string interpolation, which is a very common feature 00:01:28.320 --> 00:01:31.680 in many modern programming languages. 00:01:31.680 --> 00:01:34.079 It would be much nicer if Emacs somehow 00:01:34.079 --> 00:01:39.520 had structural understanding of source code, like IDEs do. 00:01:39.520 --> 00:01:41.981 There have been multiple efforts 00:01:41.981 --> 00:01:45.280 to bring this kind of programming language understanding into Emacs. 00:01:45.280 --> 00:01:47.119 There are language-specific parsers 00:01:47.119 --> 00:01:48.640 written in Elisp 00:01:48.640 --> 00:01:50.675 that can be thought of 00:01:50.675 --> 00:01:51.989 as the next logical step of the glue code 00:01:51.989 --> 00:01:53.856 on top of regular expressions, 00:01:53.856 --> 00:01:57.356 moving from partial local pattern recognition 00:01:57.356 --> 00:01:59.840 into a full-fledged parser. 00:01:59.840 --> 00:02:02.023 The most prominent example of this approach 00:02:02.023 --> 00:02:06.479 is probably the famous js2-mode. 00:02:06.479 --> 00:02:10.080 However, this approach has several issues. 00:02:10.080 --> 00:02:12.606 Parsing is computationally expensive, 00:02:12.606 --> 00:02:16.800 and Emacs Lisp is not good at that kind of stuff. 00:02:16.800 --> 00:02:19.156 Furthermore, maintenance is very troublesome. 00:02:19.156 --> 00:02:22.160 In order to work on these parsers, 00:02:22.160 --> 00:02:24.239 first, you have to know Elisp well enough, 00:02:24.239 --> 00:02:26.606 and then you have to be comfortable with 00:02:26.606 --> 00:02:29.739 writing a recursive descent parser, 00:02:29.739 --> 00:02:34.000 while constantly keeping up with changes to the language itself, 00:02:34.000 --> 00:02:36.356 which can be evolving very quickly, 00:02:36.356 --> 00:02:39.360 like JavaScript, for example.
00:02:39.360 --> 00:02:42.373 Together, these constraints significantly reduce 00:02:42.373 --> 00:02:45.680 the pool of potential maintainers. 00:02:45.680 --> 00:02:47.760 The biggest issue, though, in my opinion, 00:02:47.760 --> 00:02:52.139 is the lack of a set of generic and reusable APIs. 00:02:52.139 --> 00:02:54.319 This makes them very hard to use 00:02:54.319 --> 00:02:55.920 for minor modes that want to deal with 00:02:55.920 --> 00:02:59.920 cross-cutting concerns across multiple languages. 00:02:59.920 --> 00:03:01.760 The other approach which has been 00:03:01.760 --> 00:03:04.319 gaining a lot of momentum in recent years 00:03:04.319 --> 00:03:06.560 is externalizing language understanding 00:03:06.560 --> 00:03:08.159 to another process, 00:03:08.159 --> 00:03:12.239 using the Language Server Protocol (LSP). 00:03:12.239 --> 00:03:16.560 This second approach is actually a very interesting one. 00:03:16.560 --> 00:03:18.400 By decoupling language understanding 00:03:18.400 --> 00:03:21.280 from the editing facility itself, 00:03:21.280 --> 00:03:25.120 the LSP servers can attract a lot more contributors, 00:03:25.120 --> 00:03:27.189 which makes maintenance easier. 00:03:27.189 --> 00:03:32.400 However, they also have several issues of their own. 00:03:32.400 --> 00:03:34.089 Being a separate process, 00:03:34.089 --> 00:03:37.073 they are usually more resource-intensive, 00:03:37.073 --> 00:03:39.920 and depending on the language, 00:03:39.920 --> 00:03:42.159 the LSP server itself can bring with it 00:03:42.159 --> 00:03:44.640 a host of additional dependencies 00:03:44.640 --> 00:03:50.640 external to Emacs, which may be messy to install and manage. 00:03:50.640 --> 00:03:55.120 Furthermore, JSON-RPC has pretty high latency. 00:03:55.120 --> 00:03:57.840 For one-off tasks like jumping to source 00:03:57.840 --> 00:04:00.879 or on-demand completion, it's great.
00:04:00.879 --> 00:04:03.040 But for things like code highlighting, 00:04:03.040 --> 00:04:06.000 the latency is just too much. 00:04:06.000 --> 00:04:08.319 I was using Rust and I was following the 00:04:08.319 --> 00:04:11.760 community effort to improve its IDE support, 00:04:11.760 --> 00:04:15.760 hoping to integrate some of that into Emacs itself. 00:04:15.760 --> 00:04:19.759 Then I heard someone from the community mention tree-sitter, 00:04:19.759 --> 00:04:23.360 and I decided to check it out. 00:04:23.360 --> 00:04:28.720 Basically, tree-sitter is an incremental parsing library and a parser generator. 00:04:28.720 --> 00:04:33.040 It was introduced by the Atom editor in 2018. 00:04:33.040 --> 00:04:35.923 Besides Atom, it is also being integrated 00:04:35.923 --> 00:04:37.623 into the Neovim editor, 00:04:37.623 --> 00:04:41.040 and GitHub is using it to power 00:04:41.040 --> 00:04:42.423 their source code analysis 00:04:42.423 --> 00:04:45.840 and navigation features. 00:04:45.840 --> 00:04:48.639 It is written in C and can be compiled 00:04:48.639 --> 00:04:50.623 for all major platforms. 00:04:50.623 --> 00:04:53.120 It can even be compiled 00:04:53.120 --> 00:04:55.323 to WebAssembly to run on the web. 00:04:55.323 --> 00:05:00.800 That's how GitHub is using it on their website. 00:05:00.800 --> 00:05:05.840 So why is tree-sitter an interesting solution to this problem? 00:05:05.840 --> 00:05:10.000 There are multiple features that make it an attractive option. 00:05:10.000 --> 00:05:11.839 It is designed to be fast. 00:05:11.839 --> 00:05:13.680 By being incremental, 00:05:13.680 --> 00:05:15.680 the initial parse of a typical big file 00:05:15.680 --> 00:05:18.160 can take tens of milliseconds, 00:05:18.160 --> 00:05:20.240 while subsequent incremental parses 00:05:20.240 --> 00:05:22.560 are sub-millisecond.
00:05:22.560 --> 00:05:26.240 It achieves this by using structural sharing, 00:05:26.240 --> 00:05:29.360 meaning replacing only affected nodes 00:05:29.360 --> 00:05:32.960 in the old tree when it needs to. 00:05:32.960 --> 00:05:37.120 Also, unlike LSP, being in the same process, 00:05:37.120 --> 00:05:40.639 it has much lower latency. 00:05:40.639 --> 00:05:44.960 Secondly, it provides a uniform programming interface. 00:05:44.960 --> 00:05:47.039 The same data structures and functions 00:05:47.039 --> 00:05:50.400 work on parse trees of different languages. 00:05:50.400 --> 00:05:52.160 Syntax nodes of different languages 00:05:52.160 --> 00:05:54.160 differ only by their types 00:05:54.160 --> 00:05:55.723 and their possible child nodes. 00:05:55.723 --> 00:06:02.240 This is a big advantage over language-specific parsers. 00:06:02.240 --> 00:06:06.880 Thirdly, it's written in self-contained embeddable C. 00:06:06.880 --> 00:06:11.723 As I mentioned previously, it can even be compiled to WebAssembly. 00:06:11.723 --> 00:06:16.106 This makes integrating it into various editors quite easy 00:06:16.106 --> 00:06:22.880 without having to install any external dependencies. 00:06:22.880 --> 00:06:25.503 One thing that is not mentioned here 00:06:25.503 --> 00:06:28.000 is that being a parser generator, 00:06:28.000 --> 00:06:31.039 its grammars are declarative. 00:06:31.039 --> 00:06:34.880 Together with being editor-independent, 00:06:34.880 --> 00:06:39.139 this makes the pool of potential contributors much larger. 00:06:39.139 --> 00:06:45.520 So I was convinced that tree-sitter is a good fit for Emacs. 00:06:45.520 --> 00:06:48.000 Last year, I started writing the bindings 00:06:48.000 --> 00:06:53.280 using dynamic module support introduced in Emacs 25.
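NOTE
The structural sharing mentioned above can be pictured with a small immutable
tree. This is only a simplified model under stated assumptions, not
tree-sitter's actual C implementation: the Node class and the replace_at helper
are hypothetical names. Replacing one leaf rebuilds just the nodes on the path
to it, and every untouched subtree is reused by the new tree.
```python
from dataclasses import dataclass
# An immutable syntax node; children are stored in a tuple.
@dataclass(frozen=True)
class Node:
    kind: str
    children: tuple = ()
    text: str = ""
# Rebuild only the path from the root to the edited node.
def replace_at(node: Node, path: list[int], new_node: Node) -> Node:
    """Return a new tree with the node at `path` replaced; other subtrees are shared."""
    if not path:
        return new_node
    index = path[0]
    kids = list(node.children)
    kids[index] = replace_at(kids[index], path[1:], new_node)
    return Node(node.kind, tuple(kids), node.text)
old = Node("module", (
    Node("function_definition", (Node("identifier", text="f"),)),
    Node("function_definition", (Node("identifier", text="g"),)),
))
# Rename the second function: only `module` and that one definition are rebuilt.
new = replace_at(old, [1, 0], Node("identifier", text="h"))
print(new.children[0] is old.children[0])  # the first definition is the same object
```
Because the old tree is never mutated, it stays valid after the edit, which is
one reason this style of update can be cheap.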
00:06:53.280 --> 00:06:58.479 Dynamic module means there is platform-specific native code involved, 00:06:58.479 --> 00:07:00.560 but since there are pre-compiled binaries 00:07:00.560 --> 00:07:02.880 for the three major platforms, 00:07:02.880 --> 00:07:04.706 it should work in most places. 00:07:04.706 --> 00:07:09.440 Currently, the core functionalities are in pretty good shape. 00:07:09.440 --> 00:07:12.560 Syntax highlighting is working nicely. 00:07:12.560 --> 00:07:16.080 The whole thing is split into three packages. 00:07:16.080 --> 00:07:20.319 tree-sitter is the main package that other packages should depend on. 00:07:20.319 --> 00:07:22.800 tree-sitter-langs is the language bundle 00:07:22.800 --> 00:07:24.000 that includes support 00:07:24.000 --> 00:07:27.199 for most common languages. 00:07:27.199 --> 00:07:32.160 And finally, the core APIs are in the package tsc, 00:07:32.160 --> 00:07:36.160 which stands for tree-sitter-core. 00:07:36.160 --> 00:07:38.800 It is an implicit dependency of the 00:07:38.800 --> 00:07:43.520 tree-sitter package. 00:07:43.520 --> 00:07:47.520 The main package includes the minor mode tree-sitter-mode. 00:07:47.520 --> 00:07:52.560 This provides the base for other major or minor modes to build on. 00:07:52.560 --> 00:07:54.839 Using Emacs's change tracking hooks, 00:07:54.839 --> 00:07:57.073 it enables incremental parsing 00:07:57.073 --> 00:08:00.800 and provides a syntax tree that is always up to date 00:08:00.800 --> 00:08:04.080 after any edits in a buffer. 00:08:04.080 --> 00:08:06.223 There is also a basic debug mode 00:08:06.223 --> 00:08:10.080 that shows the parse tree in another buffer. 00:08:10.080 --> 00:08:13.360 Here is a quick demo. 00:08:13.360 --> 00:08:15.673 Here I'm in an empty Python buffer 00:08:15.673 --> 00:08:17.520 with tree-sitter enabled. 00:08:17.520 --> 00:08:19.440 I'm going to turn on the debug mode to 00:08:19.440 --> 00:08:26.560 see the parse tree.
00:08:26.560 --> 00:08:28.106 Since the buffer is empty, 00:08:28.106 --> 00:08:30.423 there is only one node in the syntax tree: 00:08:30.423 --> 00:08:33.279 the top-level module node. 00:08:33.279 --> 00:09:11.040 Let's try typing some code. 00:09:11.040 --> 00:09:14.640 As you can see, as I type into the Python buffer, 00:09:14.640 --> 00:09:19.120 the syntax tree updates in real time. 00:09:19.120 --> 00:09:22.039 The other minor mode included in the main package 00:09:22.039 --> 00:09:24.389 is tree-sitter-hl-mode. 00:09:24.389 --> 00:09:26.349 It overrides font-lock mode 00:09:26.349 --> 00:09:28.480 and provides its own set of faces 00:09:28.480 --> 00:09:30.139 and customization options. 00:09:30.139 --> 00:09:32.800 It is query-driven. 00:09:32.800 --> 00:09:36.240 That means instead of regular expressions, 00:09:36.240 --> 00:09:39.518 it uses a Lisp-like query language 00:09:39.518 --> 00:09:40.320 to map syntax nodes 00:09:40.320 --> 00:09:41.923 to highlighting faces. 00:09:41.923 --> 00:09:45.760 I'm going to open a Python file with small snippets 00:09:45.760 --> 00:09:54.320 that showcase syntax highlighting. 00:09:54.320 --> 00:09:55.920 So this is the default highlighting 00:09:55.920 --> 00:10:00.880 provided by python-mode. 00:10:00.880 --> 00:10:04.640 This is the highlighting enabled by tree-sitter. 00:10:04.640 --> 00:10:07.680 As you can see, string interpolation 00:10:07.680 --> 00:10:11.680 and decorators are highlighted correctly. 00:10:11.680 --> 00:10:17.440 Function calls are also highlighted. 00:10:17.440 --> 00:10:21.839 You can also note that property accessors 00:10:21.839 --> 00:10:27.440 and property assignments are highlighted differently. 00:10:27.440 --> 00:10:29.360 What I like the most about this is that 00:10:29.360 --> 00:10:32.640 new bindings are consistently highlighted. 00:10:32.640 --> 00:10:36.320 This includes local variables, 00:10:36.320 --> 00:10:45.760 function parameters, and property mutations.
00:10:45.760 --> 00:10:48.000 Before going through the tree queries 00:10:48.000 --> 00:10:49.279 and the syntax highlighting 00:10:49.279 --> 00:10:51.680 customization options, 00:10:51.680 --> 00:10:53.339 let's take a brief look at 00:10:53.339 --> 00:10:55.040 the core data structures and functions 00:10:55.040 --> 00:10:58.079 that tree-sitter provides. 00:10:58.079 --> 00:11:00.743 So parsing is done with the help of 00:11:00.743 --> 00:11:02.240 a generic parser object. 00:11:02.240 --> 00:11:04.160 A single parser object can be used to 00:11:04.160 --> 00:11:06.000 parse different languages 00:11:06.000 --> 00:11:09.279 by sending different language objects to it. 00:11:09.279 --> 00:11:10.880 The language objects themselves are 00:11:10.880 --> 00:11:14.079 loaded from shared libraries. 00:11:14.079 --> 00:11:16.079 Since tree-sitter-mode already handles 00:11:16.079 --> 00:11:17.360 the parsing part, 00:11:17.360 --> 00:11:19.440 we will instead focus on the functions 00:11:19.440 --> 00:11:20.800 that inspect nodes. 00:11:20.800 --> 00:11:25.279 In the resulting parse tree, 00:11:25.279 --> 00:11:27.030 we can ask tree-sitter what is 00:11:27.030 --> 00:11:44.240 the syntax node at point. 00:11:44.240 --> 00:11:48.480 This is an opaque object, so this is not very useful. 00:11:48.480 --> 00:12:03.760 We can instead ask what is its type. 00:12:03.760 --> 00:12:08.959 So its type is the symbol comparison_operator. 00:12:08.959 --> 00:12:11.600 In tree-sitter, there are two kinds of nodes, 00:12:11.600 --> 00:12:13.680 anonymous nodes and named nodes. 00:12:13.680 --> 00:12:17.040 Anonymous nodes correspond to simple grammar elements 00:12:17.040 --> 00:12:21.279 like keywords, operators, punctuation, and so on.
00:12:21.279 --> 00:12:24.656 Named nodes, on the other hand, are grammar elements 00:12:24.656 --> 00:12:26.639 that are interesting enough on their own 00:12:26.639 --> 00:12:30.029 to have a name, like an identifier, 00:12:30.029 --> 00:12:35.440 an expression, or a function definition. 00:12:35.440 --> 00:12:37.323 Named node types are symbols, 00:12:37.323 --> 00:12:42.639 while anonymous node types are strings. 00:12:42.639 --> 00:12:49.760 For example, if we are on this comparison operator, 00:12:49.760 --> 00:12:55.920 the node type should be a string. 00:12:55.920 --> 00:12:58.959 We can also get other information about the node. 00:12:58.959 --> 00:13:09.680 For example: what is its text, 00:13:09.680 --> 00:13:20.800 or where it is in the buffer, 00:13:20.800 --> 00:13:43.199 or what is its parent. 00:13:43.199 --> 00:13:46.106 There are many other APIs to query 00:13:46.106 --> 00:13:52.639 a node's properties. 00:13:52.639 --> 00:13:54.234 tree-sitter allows searching 00:13:54.234 --> 00:13:58.240 for structural patterns within a parse tree. 00:13:58.240 --> 00:14:01.440 It does so through a Lisp-like language. 00:14:01.440 --> 00:14:04.639 This language supports matching by node types, 00:14:04.639 --> 00:14:07.760 field names, and predicates. 00:14:07.760 --> 00:14:12.639 It also allows capturing nodes for further processing. 00:14:12.639 --> 00:14:37.680 Let's try to see some examples. 00:14:37.680 --> 00:14:40.206 So in this very simple query, 00:14:40.206 --> 00:14:49.040 we just try to highlight all the identifiers in the buffer. 00:14:49.040 --> 00:14:53.120 This @ sign tells tree-sitter to capture a node. 00:14:53.120 --> 00:14:55.507 In the context of the query builder, 00:14:55.507 --> 00:14:57.360 it's not very important, 00:14:57.360 --> 00:14:59.706 but in a normal highlighting query, 00:14:59.706 --> 00:15:01.760 this will determine 00:15:01.760 --> 00:15:06.639 the face used to highlight the node.
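NOTE
The node properties shown in the demo above (type, text, position, parent) have
rough counterparts in any syntax tree. As a hedged illustration only, this
sketch uses Python's built-in ast module as a stand-in for tree-sitter. It is
not the tree-sitter or tsc API, and the parents and info names are
hypothetical.
```python
import ast
SOURCE = "if a < b:\n    pass\n"
tree = ast.parse(SOURCE)
# ast nodes carry no parent link, so record one for every child.
parents = {
    child: parent
    for parent in ast.walk(tree)
    for child in ast.iter_child_nodes(parent)
}
# Pick the comparison node, standing in for "the syntax node at point".
node = next(n for n in ast.walk(tree) if isinstance(n, ast.Compare))
info = {
    "type": type(node).__name__,
    "text": ast.get_source_segment(SOURCE, node),
    "start": (node.lineno, node.col_offset),  # 1-based line, 0-based column
    "parent": type(parents[node]).__name__,
}
print(info)
```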
00:15:06.639 --> 00:15:08.256 Suppose we want to capture 00:15:08.256 --> 00:15:10.320 all the function names, 00:15:10.320 --> 00:15:13.519 instead of just any identifier. 00:15:13.519 --> 00:15:29.440 You can improve the query like this. 00:15:29.440 --> 00:15:32.639 This will highlight the whole definition. 00:15:32.639 --> 00:15:36.399 But we only want to capture the function name, 00:15:36.399 --> 00:15:41.054 which means the identifier here. 00:15:41.054 --> 00:15:49.600 So we move the capture to after the identifier node. 00:15:49.600 --> 00:15:52.959 If we want to capture the class names as well, 00:15:52.959 --> 00:16:10.079 we just add another pattern. 00:16:10.079 --> 00:16:20.320 Let's look at a more practical example. 00:16:20.320 --> 00:16:23.468 Here we can see that single-quoted strings 00:16:23.468 --> 00:16:27.279 and double-quoted strings are highlighted the same. 00:16:27.279 --> 00:16:30.399 But in some places, 00:16:30.399 --> 00:16:33.440 because of some coding conventions, 00:16:33.440 --> 00:16:36.373 it may be desirable to highlight them differently. 00:16:36.373 --> 00:16:39.073 For example, if the string is single-quoted, 00:16:39.073 --> 00:16:44.399 we may want to highlight it as a constant. 00:16:44.399 --> 00:16:46.160 Let's try to see whether we can 00:16:46.160 --> 00:16:56.240 distinguish these two cases. 00:16:56.240 --> 00:17:00.639 So here we get all the strings. 00:17:00.639 --> 00:17:04.079 If we want to see whether they are single-quoted 00:17:04.079 --> 00:17:08.799 or double-quoted strings, 00:17:08.799 --> 00:17:13.436 we can try looking at the first character of the string-- 00:17:13.436 --> 00:17:16.720 I mean the first character of the node-- 00:17:16.720 --> 00:17:33.600 to check whether it's a single quote or a double quote. 00:17:33.600 --> 00:17:38.920 So for that, we use tree-sitter's support for predicates.
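NOTE
Capturing function and class names by node type, as in the queries above, can
be approximated with Python's built-in ast module. This is a sketch under
stated assumptions rather than the tree-sitter query language: the
definition_names helper and the capture names "function" and "type" are
hypothetical. The point is that names are selected by their place in the tree,
not by a text pattern.
```python
import ast
SOURCE = (
    "class Greeter:\n"
    "    def greet(self, name):\n"
    "        return name\n"
    "def main():\n"
    "    return Greeter().greet('world')\n"
)
# Map hypothetical capture names to the names found in the tree.
def definition_names(source: str) -> dict[str, list[str]]:
    """Collect function and class names by node type."""
    found: dict[str, list[str]] = {"function": [], "type": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            found["function"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            found["type"].append(node.name)
    return found
names = definition_names(SOURCE)
print(sorted(names["function"]), names["type"])
```
Calls such as Greeter() and greet('world') are not collected, because they are
call nodes rather than definitions.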
00:17:38.920 --> 00:17:43.360 In this case, we use a match predicate 00:17:43.360 --> 00:17:47.339 to check whether the string-- whether the node starts 00:17:47.339 --> 00:17:49.556 with a single quote. 00:17:49.556 --> 00:17:51.280 And with this pattern, 00:17:51.280 --> 00:18:00.400 we only capture the single-quoted strings. 00:18:00.400 --> 00:18:03.760 Let's try to give it a different face. 00:18:03.760 --> 00:18:13.039 So we copy the pattern, 00:18:13.039 --> 00:18:25.120 and we add this pattern for Python only. 00:18:25.120 --> 00:18:31.440 But we also want to give the capture a different name. 00:18:31.440 --> 00:18:46.559 Let's say we want to highlight it as a keyword. 00:18:46.559 --> 00:19:06.320 And now, if we refresh the buffer, 00:19:06.320 --> 00:19:08.523 we see that single-quoted strings 00:19:08.523 --> 00:19:14.400 are highlighted as keywords. 00:19:14.400 --> 00:19:15.751 The highlighting patterns 00:19:15.751 --> 00:19:19.200 can also be set for a single project 00:19:19.200 --> 00:19:23.440 using directory-local variables. 00:19:23.440 --> 00:19:35.760 For example, let's take a look at Emacs's source code. 00:19:35.760 --> 00:19:41.123 So in Emacs's C source, there are a lot of uses 00:19:41.123 --> 00:19:43.760 of these different macros 00:19:43.760 --> 00:19:47.679 to define functions, 00:19:47.679 --> 00:19:53.256 and you can see this is actually the function name, 00:19:53.256 --> 00:19:56.373 but it's highlighted as a string. 00:19:56.373 --> 00:20:03.679 So what we want is to somehow recognize this pattern 00:20:03.679 --> 00:20:07.600 and highlight it. 00:20:07.600 --> 00:20:11.280 Highlight this part 00:20:11.280 --> 00:20:14.559 with the function face instead. 00:20:14.559 --> 00:20:17.679 In order to do that, 00:20:17.679 --> 00:20:31.760 we put a pattern in this project's directory-local settings file. 00:20:31.760 --> 00:20:40.159 So we can put this pattern in the c-mode section.
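NOTE
The match predicate described above first captures every string node and then
keeps only those whose text matches a regular expression. A minimal sketch of
that two-step idea, using Python's built-in tokenize module as a stand-in for
tree-sitter. The single_quoted_strings helper is a hypothetical name, and this
is not how tree-sitter evaluates predicates internally.
```python
import io
import re
import tokenize
# The predicate: the node text must start with a single quote.
STARTS_WITH_SINGLE_QUOTE = re.compile(r"^'")
def single_quoted_strings(source: str) -> list[str]:
    """Capture all string tokens, then filter them with the predicate."""
    matches: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and STARTS_WITH_SINGLE_QUOTE.search(tok.string):
            matches.append(tok.string)
    return matches
SOURCE = "a = 'x'\nb = \"y\"\nc = 'z'\n"
print(single_quoted_strings(SOURCE))
```
The regular expression here only inspects text that the parser has already
classified as a string, so it cannot be fooled by quote characters that appear
in comments or elsewhere in the code.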
00:20:40.159 --> 00:20:48.000 And now, if we enable tree-sitter, 00:20:48.000 --> 00:20:50.480 you can see that this is highlighted 00:20:53.200 --> 00:20:55.056 as a normal function definition. 00:20:55.056 --> 00:21:01.200 So this is the function face like we wanted. 00:21:01.200 --> 00:21:07.200 The pattern for this is actually pretty simple. 00:21:07.200 --> 00:21:12.373 It's only this part. 00:21:12.373 --> 00:21:16.456 So if it's a function call 00:21:16.456 --> 00:21:19.679 where the name of the function is DEFUN, 00:21:19.679 --> 00:21:24.240 then we highlight the DEFUN as a keyword, 00:21:24.240 --> 00:21:26.923 and then the first string element, 00:21:26.923 --> 00:21:35.360 we highlight it as a function name. 00:21:35.360 --> 00:21:39.280 Since the language objects are actually native code, 00:21:39.280 --> 00:21:41.459 they have to be compiled for each platform 00:21:41.459 --> 00:21:43.440 that we want to support. 00:21:43.440 --> 00:21:48.159 This can become a big obstacle for tree-sitter adoption. 00:21:48.159 --> 00:21:52.960 Therefore, I've created a language bundle package, tree-sitter-langs, 00:21:52.960 --> 00:21:55.773 that takes care of pre-compiling the grammars, 00:21:55.773 --> 00:22:01.600 the most common grammars for all three major platforms. 00:22:01.600 --> 00:22:05.360 It also takes care of distributing these binaries 00:22:05.360 --> 00:22:08.080 and provides some highlighting queries 00:22:08.080 --> 00:22:11.440 for some of the languages. 00:22:11.440 --> 00:22:13.760 It should be noted that this package 00:22:13.760 --> 00:22:19.919 should be treated as a temporary distribution mechanism only, 00:22:19.919 --> 00:22:24.720 to help with bootstrapping tree-sitter adoption. 00:22:24.720 --> 00:22:27.760 The plan is that eventually these files 00:22:27.760 --> 00:22:29.156 should be provided by 00:22:29.156 --> 00:22:32.480 the language major modes themselves.
00:22:32.480 --> 00:22:36.320 But in order to do that, we need better tooling, 00:22:36.320 --> 00:22:40.240 so we're not there yet. 00:22:40.240 --> 00:22:43.280 Since the core already works reasonably well, 00:22:43.280 --> 00:22:45.289 there are several areas that would benefit 00:22:45.289 --> 00:22:49.120 from the community's contribution. 00:22:49.120 --> 00:22:52.640 So tree-sitter's upstream language repositories 00:22:52.640 --> 00:22:55.679 already contain highlighting queries of their own. 00:22:55.679 --> 00:22:57.573 However, they are pretty basic, 00:22:57.573 --> 00:23:02.559 and they may not fit well with existing Emacs conventions. 00:23:02.559 --> 00:23:07.120 Therefore, the language bundle has its own set of highlighting queries. 00:23:07.120 --> 00:23:12.556 This requires maintenance until language major modes adopt tree-sitter 00:23:12.556 --> 00:23:16.640 and maintain the queries on their own. 00:23:16.640 --> 00:23:19.056 The queries are actually quite easy to write, 00:23:19.056 --> 00:23:22.000 as you've already seen. 00:23:22.000 --> 00:23:25.360 You just need to be familiar with the language, 00:23:25.360 --> 00:23:35.200 familiar enough to come up with sensible highlighting patterns. 00:23:35.200 --> 00:23:39.679 And if you are a maintainer of a language major mode, 00:23:39.679 --> 00:23:44.189 you may want to consider integrating tree-sitter into your mode, 00:23:44.189 --> 00:23:48.573 initially maybe as an optional feature. 00:23:48.573 --> 00:23:53.279 The integration is actually pretty straightforward, 00:23:53.279 --> 00:23:56.640 especially for syntax highlighting. 00:23:56.640 --> 00:24:01.520 Or alternatively, 00:24:01.520 --> 00:24:05.760 you can also try writing a new major mode from scratch 00:24:05.760 --> 00:24:08.000 that relies on tree-sitter 00:24:08.000 --> 00:24:12.559 from the very beginning. 00:24:12.559 --> 00:24:17.523 The code for such a major mode is quite simple.
00:24:17.523 --> 00:24:23.200 For example, this is the proposed 00:24:23.200 --> 00:24:26.240 wat-mode for WebAssembly. 00:24:26.240 --> 00:24:39.520 The code is just one page of code, not a lot. 00:24:39.520 --> 00:24:42.720 You can also try writing new minor modes 00:24:42.720 --> 00:24:46.559 or writing integration packages. 00:24:46.559 --> 00:24:50.880 For example, a lot of packages 00:24:50.880 --> 00:24:54.559 may benefit from tree-sitter integration, 00:24:54.559 --> 00:25:02.960 but no one has written the integration yet. 00:25:02.960 --> 00:25:04.836 If you are interested in tree-sitter, 00:25:04.836 --> 00:25:08.023 you can use these links to learn more about it. 00:25:08.023 --> 00:25:11.440 I think that's it for me today. 00:25:11.440 --> 00:25:18.159 I'm happy to answer any questions.