Automated commit

author: EmacsConf <emacsconf-org@gnu.org> 2024-12-07 11:20:32 -0500
committer: EmacsConf <emacsconf-org@gnu.org> 2024-12-07 11:20:32 -0500
commit: 01d1cb08bc8c93bef9e55bc8ed9ff9df4e2c4347 (patch)
tree: 867908ce646247f6158b5c2d926aa2ec6fedaa0d /2024
parent: 46dfcc1bf722423af72cd47bb5a58c3c280793b9 (diff)
4 files changed, 1382 insertions, 1 deletions
diff --git a/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main--chapters.vtt b/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main--chapters.vtt
new file mode 100644
index 00000000..3c81657f
--- /dev/null
+++ b/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main--chapters.vtt
@@ -0,0 +1,38 @@
+WEBVTT
+
+
+00:00:00.000 --> 00:01:24.199
+Search in daily workflows
+
+00:01:24.200 --> 00:03:58.232
+Problems with editor search tools
+
+00:03:58.233 --> 00:04:34.295
+Information retrieval
+
+00:04:34.296 --> 00:06:21.756
+Search engine in Emacs: the index
+
+00:06:21.757 --> 00:06:43.552
+Search engine in Emacs: Ranking
+
+00:06:43.553 --> 00:07:41.159
+tf-idf: term-frequency x inverse-document-frequency
+
+00:07:41.160 --> 00:08:41.199
+BM25
+
+00:08:41.200 --> 00:10:41.456
+Searching with p-search
+
+00:10:41.457 --> 00:16:06.770
+Flight AF 447
+
+00:16:06.771 --> 00:20:40.404
+Modifying priors
+
+00:20:40.405 --> 00:21:38.559
+Importance
+
+00:21:38.560 --> 00:22:42.240
+Complement or inverse
diff --git a/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt b/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt
new file mode 100644
index 00000000..111f2728
--- /dev/null
+++ b/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt
@@ -0,0 +1,970 @@
+WEBVTT captioned by sachac
+
+NOTE Search in daily workflows
+
+00:00:00.000 --> 00:00:03.399
+Hello, my name is Zachary Romero, and today I'll be going
+
+00:00:03.400 --> 00:00:08.115
+over p-search, a local search engine in Emacs.
+
+00:00:08.116 --> 00:00:12.398
+Search these days is everywhere in software, from text editors,
+
+00:00:12.399 --> 00:00:18.359
+to IDEs, to most online websites. These tools tend to fall
+
+00:00:18.360 --> 00:00:25.839
+into one of two categories. One are tools that run locally,
+
+00:00:25.840 --> 00:00:31.279
+and work by matching string to text. The most common
+
+00:00:31.280 --> 00:00:35.639
+example of this is grep. In Emacs, there are a lot of
+
+00:00:35.640 --> 00:00:38.959
+extensions which provide functionality on top of these
+
+00:00:38.960 --> 00:00:42.388
+tools, such as projectile-grep, deadgrep,
+
+00:00:42.389 --> 00:00:46.849
+consult-ripgrep. Most editors have some sort of
+
+00:00:46.850 --> 00:00:52.691
+search current project feature. Most of the time,
+
+00:00:52.692 --> 00:00:56.393
+some of these tools have features like regular expressions,
+
+00:00:56.394 --> 00:00:59.215
+or you can specify file extension,
+
+00:00:59.216 --> 00:01:01.636
+or a directory you want to search in,
+
+00:01:01.637 --> 00:01:03.957
+but features are pretty limited.
+
+00:01:03.958 --> 00:01:07.919
+The other kind of search we use are usually hosted online,
+
+00:01:07.920 --> 00:01:12.302
+and they usually search a vast corpus of data.
+
+00:01:12.303 --> 00:01:15.639
+These are usually proprietary
+
+00:01:15.640 --> 00:01:18.765
+online services such as Google, GitHub,
+
+00:01:18.766 --> 00:01:24.199
+Sourcegraph for code.
+
+NOTE Problems with editor search tools
+
+00:01:24.200 --> 00:01:28.839
+The kind of search feature that editors
+
+00:01:28.840 --> 00:01:36.719
+usually have have a lot of downsides to them. For one, a lot
+
+00:01:36.720 --> 00:01:38.839
+of times you don't know the exact search string you're
+
+00:01:38.840 --> 00:01:42.783
+searching for. Some complicated term like this
+
+00:01:42.784 --> 00:01:46.860
+high volume demand partner, you know, do you know if...
+
+00:01:46.861 --> 00:01:49.708
+Are some words abbreviated, is it capitalized,
+
+00:01:49.709 --> 00:01:53.089
+is it in kebab case, camel case, snake case?
+
+00:01:53.090 --> 00:01:57.571
+You often have to search all these variations.
+
+00:01:57.572 --> 00:02:05.434
+Another downside is that the search results returned
+
+00:02:05.435 --> 00:02:07.769
+contain a lot of noise. For example,
+
+00:02:07.770 --> 00:02:10.816
+you may get a lot of test files.
+
+00:02:10.817 --> 00:02:13.537
+If the tool hits your vendor directory,
+
+00:02:13.538 --> 00:02:17.199
+it may get a bunch of results from libraries
+
+00:02:17.200 --> 00:02:22.879
+you're using, which most are not helpful. Another downside
+
+00:02:22.880 --> 00:02:26.679
+is that the order given is, well, there's no meaning to the
+
+00:02:26.680 --> 00:02:30.319
+order. It's usually just the search order that the tool
+
+00:02:30.320 --> 00:02:34.639
+happens to look in first.
+
+00:02:34.640 --> 00:02:38.639
+Another thing is, so when you're searching, you oftentimes
+
+00:02:38.640 --> 00:02:41.639
+have to keep the state of the searches in your head. For
+
+00:02:41.640 --> 00:02:46.639
+example, you try one search, you see the results, find the
+
+00:02:46.640 --> 00:02:49.639
+results you think are relevant, keep them in your head, run
+
+00:02:49.640 --> 00:02:52.519
+search number two, look through the results, kind of
+
+00:02:52.520 --> 00:02:56.119
+combine these different search results in your head until
+
+00:02:56.120 --> 00:02:59.970
+you get an idea of which ones might be relevant.
+
+00:02:59.971 --> 00:03:04.515
+Another thing is that the search primitives are fairly limited.
+
+00:03:04.516 --> 00:03:10.599
+So yeah, you can search regular expressions, but you can't
+
+00:03:10.600 --> 00:03:14.719
+really define complex things like, I want to search files in
+
+00:03:14.720 --> 00:03:18.439
+this directory, and this directory, and this directory,
+
+00:03:18.440 --> 00:03:22.319
+except these subdirectories, and except test files, and I
+
+00:03:22.320 --> 00:03:25.559
+only want files with this file extension. Criteria like
+
+00:03:25.560 --> 00:03:28.919
+that are really hard to... I'm sure they're possible in tools
+
+00:03:28.920 --> 00:03:34.479
+like grep, but they're pretty hard to construct.
+
+00:03:34.480 --> 00:03:38.199
+And lastly, there's no notion of any relevance. All the
+
+00:03:38.200 --> 00:03:42.039
+results you get back, I mean, you don't know, is the search
+
+00:03:42.040 --> 00:03:43.095
+more relevant? Is it twice as relevant? Is it
+
+00:03:43.096 --> 00:03:52.279
+100 times more relevant? These tools usually don't provide
+
+00:03:52.280 --> 00:03:58.232
+such information.
+
+NOTE Information retrieval
+
+00:03:58.233 --> 00:04:00.394
+There's a field called information retrieval,
+
+00:04:00.395 --> 00:04:02.616
+and this deals with this exact problem.
+
+00:04:02.617 --> 00:04:04.718
+You have lots of data you're searching for.
+
+00:04:04.719 --> 00:04:09.261
+How do you construct a search query?
+
+00:04:09.262 --> 00:04:09.839
+How do you get results back fast? How do you
+
+00:04:09.840 --> 00:04:14.519
+rank which ones are most relevant? How do you evaluate
+
+00:04:14.520 --> 00:04:20.079
+your search system to see if it's getting better or worse?
+
+00:04:20.080 --> 00:04:23.119
+There's a lot of work, a lot of books written on the topic of
+
+00:04:23.120 --> 00:04:28.159
+information retrieval. If one wants to improve
+
+00:04:28.160 --> 00:04:31.879
+searching in Emacs, then drawing inspiration from this
+
+00:04:31.880 --> 00:04:34.295
+field is necessary.
+
+NOTE Search engine in Emacs: the index
+
+00:04:34.296 --> 00:04:41.383
+The first aspect of information retrieval is the index.
+
+00:04:41.384 --> 00:04:46.608
+The reverse index is what search engines use to find results really fast.
+
+00:04:46.609 --> 00:04:51.454
+Essentially, it's a map of search term
+
+00:04:51.455 --> 00:04:54.738
+to locations where that term is located.
+
+00:04:54.739 --> 00:04:57.079
+You'll have all the terms or maybe even parts of
+
+00:04:57.080 --> 00:04:59.159
+the terms, and then you'll have all the locations where
+
+00:04:59.160 --> 00:05:02.119
+they're located. Any query could easily look up
+
+00:05:02.120 --> 00:05:05.919
+where things are located, join results together, and
+
+00:05:05.920 --> 00:05:12.879
+that's how they get the results to be really fast. For this
+
+00:05:12.880 --> 00:05:19.159
+project, I decided to forgo creating an index altogether.
+
+00:05:19.160 --> 00:05:23.759
+An index is pretty complicated to maintain because
+
+00:05:23.760 --> 00:05:27.319
+it always has to be in sync. Any time you open a file and save
+
+00:05:27.320 --> 00:05:29.959
+it, you would have to re-index, you would have to make sure
+
+00:05:29.960 --> 00:05:32.559
+that file is re-indexed properly. Then you have the
+
+00:05:32.560 --> 00:05:36.119
+whole issue of, well, if you're searching in Emacs,
+
+00:05:36.120 --> 00:05:38.799
+you have all these projects, this directory,
+
+00:05:38.800 --> 00:05:42.479
+that directory, how do you know which? Do you always have to
+
+00:05:42.480 --> 00:05:47.399
+keep them in sync? It's quite a hard task to handle
+
+00:05:47.400 --> 00:05:53.079
+that. Then on the other end, tools like ripgrep can
+
+00:05:53.080 --> 00:05:59.119
+search very fast. Even though they can't search maybe on the
+
+00:05:59.120 --> 00:06:03.919
+order of tens of thousands of repositories, for a local
+
+00:06:03.920 --> 00:06:06.039
+setting, they should be plenty fast enough.
+
+00:06:06.040 --> 00:06:12.239
+I benchmarked. Ripgrep, for example, is
+
+00:06:12.240 --> 00:06:15.959
+on the order of gigabytes per second.
+
+00:06:15.960 --> 00:06:19.239
+Definitely, it can search a few pretty big size
+
+00:06:19.240 --> 00:06:21.756
+repositories.
+
+NOTE Search engine in Emacs: Ranking
+
+00:06:21.757 --> 00:06:24.799
+Next main task. We decided not to use an
+
+00:06:24.800 --> 00:06:29.959
+index. Next task is how do we rank search results? So there's
+
+00:06:29.960 --> 00:06:33.439
+two main algorithms that are used these days. The first
+
+00:06:33.440 --> 00:06:36.519
+one is tf-idf, which stands for term frequency, inverse
+
+00:06:36.520 --> 00:06:43.039
+document frequency. Then there's BM25, which is sort of a
+
+00:06:43.040 --> 00:06:43.552
+modified tf-idf algorithm.
+
+NOTE tf-idf: term-frequency x inverse-document-frequency
+
+00:06:43.553 --> 00:06:45.679
+tf-idf, without going into
+
+00:06:45.680 --> 00:06:49.159
+too much detail, essentially multiplies two terms. One
+
+00:06:49.160 --> 00:06:51.879
+is the term frequency, and then you multiply it by the
+
+00:06:51.880 --> 00:06:54.559
+inverse document frequency. The term frequency is a
+
+00:06:54.560 --> 00:06:58.519
+measure of how often that search term occurs. The
+
+00:06:58.520 --> 00:07:00.799
+inverse document frequency is a measure of how much
+
+00:07:00.800 --> 00:07:06.199
+information that term provides. If the term occurs a lot,
+
+00:07:06.200 --> 00:07:08.719
+then it gets a higher score in the term frequency section.
+
+00:07:08.720 --> 00:07:12.399
+But if it's a common word that exists in a lot of documents,
+
+00:07:12.400 --> 00:07:13.900
+then its inverse document frequency goes down.
+
+00:07:13.901 --> 00:07:20.879
+It kind of scores it less. You'll find that words like the,
+
+00:07:20.880 --> 00:07:25.959
+in, is, these really common words, since they occur
+
+00:07:25.960 --> 00:07:29.199
+everywhere, their inverse document frequency is
+
+00:07:29.200 --> 00:07:32.479
+essentially zero. They don't really count towards a
+
+00:07:32.480 --> 00:07:35.679
+score. But when you have rare words that only occur in a
+
+00:07:35.680 --> 00:07:37.679
+few documents, they're weighted a lot more. So the more
+
+00:07:37.680 --> 00:07:41.159
+those rare words occur, they boost the score higher.
+
+NOTE BM25
+
+00:07:41.160 --> 00:07:48.839
+BM25 is a modification of this. It's essentially TF, it's
+
+00:07:48.840 --> 00:07:53.119
+essentially the previous one, except it dampens out terms
+
+00:07:53.120 --> 00:07:55.439
+that occur more often. Imagine you have a bunch of
+
+00:07:55.440 --> 00:07:59.359
+documents. One has a term 10 times, one has a term, that same
+
+00:07:59.360 --> 00:08:02.439
+term a hundred times, another has a thousand times.
+
+00:08:02.440 --> 00:08:06.799
+You'll see the score dampens off as the number of
+
+00:08:06.800 --> 00:08:10.639
+occurrences increases. That prevents any one term from
+
+00:08:10.640 --> 00:08:16.559
+overpowering the score. This is the algorithm I ended up
+
+00:08:16.560 --> 00:08:21.039
+choosing for my implementation. So with a plan of using a
+
+00:08:21.040 --> 00:08:29.559
+command line tool like ripgrep to get term occurrences, and
+
+00:08:29.560 --> 00:08:36.799
+then using a scoring algorithm like BM25 to rank the terms,
+
+00:08:36.800 --> 00:08:40.079
+we can combine this together and create a simple search
+
+00:08:40.080 --> 00:08:41.199
+mechanism.
+
+NOTE Searching with p-search
+
+00:08:41.200 --> 00:08:47.439
+Here we're in the directory for the Emacs source code.
+
+00:08:47.440 --> 00:08:53.479
+Let's say we want to search for the display code. We
+
+00:08:53.480 --> 00:08:58.679
+run the p-search command, starting the search engine. It
+
+00:08:58.680 --> 00:09:01.159
+opens up. We notice it has three sections, the candidate
+
+00:09:01.160 --> 00:09:05.199
+generators, the priors, and the search results. The
+
+00:09:05.200 --> 00:09:09.999
+candidate generators generates the search space we're
+
+00:09:10.000 --> 00:09:14.719
+looking on. These are all composable and you can add as
+
+00:09:14.720 --> 00:09:19.719
+many as you want. So with this, it specifies that here
+
+00:09:19.720 --> 00:09:25.239
+we're searching on the file system and we're searching in
+
+00:09:25.240 --> 00:09:30.799
+this directory. We're using the ripgrep tool to search
+
+00:09:30.800 --> 00:09:33.359
+with, and we want to make sure that we're searching only on
+
+00:09:33.360 --> 00:09:40.479
+files committed to Git. Here we see the search results.
+
+00:09:40.480 --> 00:09:45.159
+Notice here is their final probability. Here, notice
+
+00:09:45.160 --> 00:09:47.079
+that they're all the same, and they're the same because we
+
+00:09:47.080 --> 00:09:50.719
+don't have any search criteria specified here. Suppose
+
+00:09:50.720 --> 00:09:55.679
+we want to search for display-related code. We add a
+
+00:09:55.680 --> 00:09:57.359
+query: display.
+
+00:09:57.360 --> 00:10:06.559
+So then it spins off the processes, gets the search term
+
+00:10:06.560 --> 00:10:10.879
+counts and calculates the new scores. Notice here that
+
+00:10:10.880 --> 00:10:15.759
+the results that come on top are just at first glance appear
+
+00:10:15.760 --> 00:10:19.919
+to be relevant to display. Remember, if we compare
+
+00:10:19.920 --> 00:10:25.079
+that to just running a ripgrep raw, notice here we're
+
+00:10:25.080 --> 00:10:31.279
+getting 53,000 results and it's pretty hard to go through
+
+00:10:31.280 --> 00:10:34.319
+these results and make sense of it.
+
+00:10:34.320 --> 00:10:41.456
+So that's p-search in a nutshell.
+
+NOTE Flight AF 447
+
+00:10:41.457 --> 00:10:45.982
+Next, I wanted to talk about the story of Flight 447.
+
+00:10:45.983 --> 00:10:49.326
+Flight 447 going from Rio de Janeiro to Paris
+
+00:10:49.327 --> 00:10:51.509
+crashed somewhere in the Atlantic Ocean
+
+00:10:51.510 --> 00:10:54.713
+on June 1st, 2009, killing everyone on board.
+
+00:10:54.714 --> 00:10:56.894
+Four search attempts were made to find the wreckage.
+
+00:10:56.895 --> 00:11:01.075
+None of them were successful, except the finding of some debris
+
+00:11:01.076 --> 00:11:05.479
+and a dead body. It was decided that they really wanted
+
+00:11:05.480 --> 00:11:09.519
+to find the wreckage to retrieve data as to why the crash
+
+00:11:09.520 --> 00:11:14.639
+occurred. This occurred two years after the
+
+00:11:14.640 --> 00:11:19.959
+initial crash. With this next search attempt, they
+
+00:11:19.960 --> 00:11:23.199
+wanted to create a probability distribution of where the
+
+00:11:23.200 --> 00:11:26.759
+crash could be. The only piece of concrete data they had
+
+00:11:26.760 --> 00:11:35.079
+was a GPS signal from the ship at 2:10 containing the GPS
+
+00:11:35.080 --> 00:11:40.239
+location of the plane was at 2.98 degrees north, 30.59
+
+00:11:40.240 --> 00:11:44.719
+degrees west. That was the only data they had to go off of.
+
+00:11:44.720 --> 00:11:50.079
+So they drew a circle around that point
+
+00:11:50.080 --> 00:11:54.679
+with a radius of 40 nautical miles. They assumed that
+
+00:11:54.680 --> 00:11:57.479
+anything outside the circle would have been impossible for
+
+00:11:57.480 --> 00:12:01.239
+the ship to reach. This was the starting point for
+
+00:12:01.240 --> 00:12:04.799
+creating the probability distribution of where the
+
+00:12:04.800 --> 00:12:08.119
+wreckage occurred. Anything outside the circle, they
+
+00:12:08.120 --> 00:12:09.639
+assumed it was impossible to reach.
+
+00:12:09.640 --> 00:12:16.479
+The only other pieces of data were the four failed search
+
+00:12:16.480 --> 00:12:21.719
+attempts and then some of the debris found. One thing they
+
+00:12:21.720 --> 00:12:26.159
+did decide was to look at similar crashes where control was
+
+00:12:26.160 --> 00:12:30.319
+lost to analyze where the crashes landed, compared to where
+
+00:12:30.320 --> 00:12:37.399
+the loss of control started. This probability
+
+00:12:37.400 --> 00:12:43.479
+distribution, the circular normal distribution was
+
+00:12:43.480 --> 00:12:47.919
+decided upon. Here you can see that the center has a lot
+
+00:12:47.920 --> 00:12:51.879
+higher chance of finding the wreckage. As you go away
+
+00:12:51.880 --> 00:12:55.399
+from the center, the probability of finding the wreckage
+
+00:12:55.400 --> 00:13:02.319
+decreases a lot. The next thing they looked at was, well,
+
+00:13:02.320 --> 00:13:05.959
+they noticed they had retrieved some dead bodies from the
+
+00:13:05.960 --> 00:13:12.959
+wreckage. So they thought that they could calculate the
+
+00:13:12.960 --> 00:13:18.439
+backward drift on that particular day to find where the
+
+00:13:18.440 --> 00:13:21.479
+crash might've occurred. If they found bodies at a
+
+00:13:21.480 --> 00:13:25.119
+particular location, they can kind of work backwards from
+
+00:13:25.120 --> 00:13:30.665
+that in order to find where the initial crash occurred.
+
+00:13:30.666 --> 00:13:34.719
+So here you can see the probability distribution based off of
+
+00:13:34.720 --> 00:13:40.279
+the backward drift model. Here you see the darker colors
+
+00:13:40.280 --> 00:13:46.159
+have a higher probability of finding the location. So
+
+00:13:46.160 --> 00:13:50.679
+with all these pieces of data, so with that circular 40
+
+00:13:50.680 --> 00:13:54.959
+nautical mile uniform distribution, with that circular
+
+00:13:54.960 --> 00:14:02.199
+normal distribution of comparing similar crashes, as well
+
+00:14:02.200 --> 00:14:07.439
+as with the backward drift, they were able to combine all
+
+00:14:07.440 --> 00:14:08.559
+three of these pieces
+
+00:14:08.560 --> 00:14:14.599
+in order to come up with a final prior distribution of where
+
+00:14:14.600 --> 00:14:19.519
+the wreckage occurred. So this is what the final model
+
+00:14:19.520 --> 00:14:24.719
+they came upon. Here you can see it has that 40 nautical
+
+00:14:24.720 --> 00:14:29.679
+mile radius circle. It has that darker center, which
+
+00:14:29.680 --> 00:14:32.039
+indicates a higher probability because of the
+
+00:14:32.040 --> 00:14:38.959
+crash similarity. Then here you also see along this line
+
+00:14:38.960 --> 00:14:50.799
+has a slightly higher probability due to the backward drift
+
+00:14:50.800 --> 00:14:52.119
+distribution.
+
+00:14:52.120 --> 00:14:56.559
+So the next thing is, since they had performed searches,
+
+00:14:56.560 --> 00:15:00.559
+they decided to incorporate the data from those searches
+
+00:15:00.560 --> 00:15:04.759
+into their new distribution. Here you can see places
+
+00:15:04.760 --> 00:15:08.879
+where they searched initially. If you think about it,
+
+00:15:08.880 --> 00:15:11.399
+you can assume that, well, if you search for something,
+
+00:15:11.400 --> 00:15:14.199
+there's a good chance you'll find it, but not necessarily.
+
+00:15:14.200 --> 00:15:18.439
+Anywhere where they searched, the probability of it
+
+00:15:18.440 --> 00:15:22.839
+finding it there is greatly reduced. It's not zero because
+
+00:15:22.840 --> 00:15:26.879
+obviously you can look for something and miss it, but it kind
+
+00:15:26.880 --> 00:15:31.119
+of reduces the probability that we would expect to find it in
+
+00:15:31.120 --> 00:15:36.679
+those already searched locations. This is the
+
+00:15:36.680 --> 00:15:41.919
+posterior distribution or distribution after counting
+
+00:15:41.920 --> 00:15:44.559
+observations made.
+
+00:15:44.560 --> 00:15:48.759
+Here we can see kind of these cutouts of where the
+
+00:15:48.760 --> 00:15:53.959
+previous searches occurred. This is the final
+
+00:15:53.960 --> 00:15:56.999
+distribution they went off of to perform the subsequent
+
+00:15:57.000 --> 00:16:01.999
+search. In the end, the wreckage was found at a point close to
+
+00:16:02.000 --> 00:16:06.770
+the center here, thus validating this methodology.
+
+NOTE Modifying priors
+
+00:16:06.771 --> 00:16:10.332
+We can see the power of this Bayesian search methodology
+
+00:16:10.333 --> 00:16:13.999
+in the way that we could take information from all the sources we had.
+
+00:16:14.000 --> 00:16:19.237
+We could draw analogies to similar situations.
+
+00:16:19.238 --> 00:16:22.479
+We can quantify these, combine them into a model,
+
+00:16:22.480 --> 00:16:27.893
+and then also update our model according to each observation we make.
+
+00:16:27.894 --> 00:16:30.359
+I think there's a lot of similarities to be drawn with
+
+00:16:30.360 --> 00:16:35.159
+searching on a computer in the sense that when we search for
+
+00:16:35.160 --> 00:16:39.399
+something, there's oftentimes a story we kind of have as to
+
+00:16:39.400 --> 00:16:43.959
+what search terms exist, where we expect to find the file.
+
+00:16:43.960 --> 00:16:46.719
+For example, if you're implementing a new feature, you'll
+
+00:16:46.720 --> 00:16:49.919
+often have some search terms in mind that you think will be
+
+00:16:49.920 --> 00:16:54.719
+relevant. Some search terms, you might think they have a
+
+00:16:54.720 --> 00:16:57.599
+possibility of being relevant, but maybe you're not sure.
+
+00:16:57.600 --> 00:17:02.879
+There's some directories where you know that they're not
+
+00:17:02.880 --> 00:17:07.759
+relevant. There's other criteria like, well, you know that
+
+00:17:07.760 --> 00:17:11.399
+maybe somebody in particular worked on this code.
+
+00:17:11.400 --> 00:17:16.319
+What if you could incorporate that information? Like, I know
+
+00:17:16.320 --> 00:17:21.399
+this author, he's always working on this feature. What if
+
+00:17:21.400 --> 00:17:25.519
+I just give the files that this person works on a higher
+
+00:17:25.520 --> 00:17:32.599
+probability than ones he doesn't work on? Or maybe you think
+
+00:17:32.600 --> 00:17:38.599
+that this is a file that's committed too often. You think
+
+00:17:38.600 --> 00:17:43.439
+that maybe the amount of times of commits it receives
+
+00:17:43.440 --> 00:17:47.719
+should change your probability of this file being
+
+00:17:47.720 --> 00:17:52.839
+relevant. That's where p-search comes in.
+
+00:17:52.840 --> 00:17:57.679
+Its aim is to be a framework in order to incorporate all these
+
+00:17:57.680 --> 00:18:01.359
+sorts of different prior information into your searching
+
+00:18:01.360 --> 00:18:05.999
+process. You're able to say things like, I want files
+
+00:18:06.000 --> 00:18:11.119
+authored by this user to be given higher probability. I want
+
+00:18:11.120 --> 00:18:13.919
+this author to be given a lower priority. I know this author
+
+00:18:13.920 --> 00:18:18.759
+never works on this code. If he has a commit, then lower its
+
+00:18:18.760 --> 00:18:24.679
+probability, or you can specify specific paths, or you can
+
+00:18:24.680 --> 00:18:30.199
+specify multiple search terms, weighing different ones
+
+00:18:30.200 --> 00:18:38.919
+according to how you think those terms should be relevant.
+
+00:18:38.920 --> 00:18:42.079
+So with p-search, we're able to incorporate information
+
+00:18:42.080 --> 00:18:46.279
+from multiple sources. Here, for example, we have a prior
+
+00:18:46.280 --> 00:18:52.079
+of type git author, and we're looking for all of the files
+
+00:18:52.080 --> 00:18:56.719
+that are committed to by Lars. So the more commits he has,
+
+00:18:56.720 --> 00:19:01.399
+the higher probability is given to that file. Suppose
+
+00:19:01.400 --> 00:19:04.559
+there's a feature I know he worked on, but I don't know the
+
+00:19:04.560 --> 00:19:09.159
+file or necessarily even key terms of it. Well, with this, I
+
+00:19:09.160 --> 00:19:12.140
+can incorporate that information.
+
+00:19:12.141 --> 00:19:15.999
+So let's search again. Let's add display.
+
+00:19:16.000 --> 00:19:22.959
+Let's see what responses we get back here. We can add
+
+00:19:22.960 --> 00:19:27.199
+as many of these criteria as we want. We can even specify that
+
+00:19:27.200 --> 00:19:31.519
+the title of the file name should be a certain type. Let's
+
+00:19:31.520 --> 00:19:36.599
+say we're only concerned about C files. We add the file
+
+00:19:36.600 --> 00:19:45.399
+name should contain .c in it. With this, now we
+
+00:19:45.400 --> 00:19:51.319
+notice that all of the C files containing display authored
+
+00:19:51.320 --> 00:19:56.279
+by Lars should be given higher probability. We can
+
+00:19:56.280 --> 00:20:02.719
+continue to add these priors as we feel fit. The workflow
+
+00:20:02.720 --> 00:20:07.519
+that I found helps when searching is that you'll add
+
+00:20:07.520 --> 00:20:11.359
+criteria, you'll see some good results come up and some bad
+
+00:20:11.360 --> 00:20:15.319
+results come up. So you'll often find a pattern in those
+
+00:20:15.320 --> 00:20:18.839
+bad results, like, oh, I don't want test files, or this
+
+00:20:18.840 --> 00:20:22.679
+directory isn't relevant, or something like that. Then
+
+00:20:22.680 --> 00:20:27.199
+you can update your prior distribution, adding its
+
+00:20:27.200 --> 00:20:31.119
+criteria, and then rerun it, and then it will get different
+
+00:20:31.120 --> 00:20:35.159
+probabilities for the files. So in the end, you'll have a
+
+00:20:35.160 --> 00:20:37.639
+list of results that's tailor-made to the thing you're
+
+00:20:37.640 --> 00:20:40.404
+searching for.
+
+NOTE Importance
+
+00:20:40.405 --> 00:20:41.639
+There's a couple of other features I
+
+00:20:41.640 --> 00:20:49.079
+want to go through. One thing is that each of these priors,
+
+00:20:49.080 --> 00:20:55.839
+you can specify the importance. In other words, how
+
+00:20:55.840 --> 00:21:01.119
+important is this particular piece of information to your
+
+00:21:01.120 --> 00:21:05.199
+search? So here, everything is of importance medium. But
+
+00:21:05.200 --> 00:21:07.879
+let's say I really care about something having the word
+
+00:21:07.880 --> 00:21:12.679
+display in it. I'm going to change its importance.
+
+00:21:12.680 --> 00:21:18.599
+Instead of medium, I'll change its importance to high.
+
+00:21:18.600 --> 00:21:23.279
+What that does essentially is things that don't have
+
+00:21:23.280 --> 00:21:28.079
+display in it are given a much bigger penalty and things with
+
+00:21:28.080 --> 00:21:28.128
+the word display in it are rated much higher.
+
+00:21:28.129 --> 00:21:38.559
+With this, we're able to fine-tune the results that we get.
+
+NOTE Complement or inverse
+
+00:21:38.560 --> 00:21:45.639
+Another thing you can do is that you can add the complement or
+
+00:21:45.640 --> 00:21:49.759
+the inverse of certain queries. Let's say you want to
+
+00:21:49.760 --> 00:21:53.239
+search for display, but you don't want it to contain the word
+
+00:21:53.240 --> 00:21:58.039
+frame. With the complement option on, when we create this
+
+00:21:58.040 --> 00:22:01.839
+search prior, now it's going to be searching for frame, but
+
+00:22:01.840 --> 00:22:04.959
+instead of increasing the search score, it's going to
+
+00:22:04.960 --> 00:22:06.999
+decrease it if it contains the word frame.
+
+00:22:07.000 --> 00:22:14.319
+So here, things related to frame are kind of
+
+00:22:14.320 --> 00:22:18.079
+deprioritized. We can also say that we really don't want
+
+00:22:18.080 --> 00:22:21.599
+the search to contain the word frame by increasing its
+
+00:22:21.600 --> 00:22:27.199
+importance. So with all these composable pieces, we can
+
+00:22:27.200 --> 00:22:33.412
+create kind of a search that's tailor-made to our needs.
+
+00:22:33.413 --> 00:22:35.759
+That concludes this talk. There's a lot more I could talk
+
+00:22:35.760 --> 00:22:37.799
+about with regards to p-search, so definitely follow the
+
+00:22:37.800 --> 00:22:40.639
+project if you're interested. Thanks for watching, and I
+
+00:22:40.640 --> 00:22:42.240
+hope you enjoy the rest of the conference.
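The captions above describe two ideas that a short sketch may make more concrete: BM25 scoring from per-file term counts, and the Bayesian update applied after an unsuccessful search (as in the Flight AF 447 story). The following is a minimal illustration in Python, not p-search's actual implementation. The function names, the `k1` and `b` values (conventional BM25 defaults), and the `p_detect` detection probability are all assumptions made for this example.

```python
import math


def bm25_scores(term_counts, doc_lengths, k1=1.2, b=0.75):
    """Score every document for one query term with BM25.

    term_counts: {doc: occurrences of the term}, e.g. counts reported
                 by a grep-like tool (hypothetical input format).
    doc_lengths: {doc: length in tokens} for the whole search space.
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    n_containing = sum(1 for c in term_counts.values() if c > 0)
    # Inverse document frequency: terms found in few documents weigh more.
    idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
    scores = {}
    for doc, length in doc_lengths.items():
        tf = term_counts.get(doc, 0)
        # Saturating term frequency: each extra occurrence adds less.
        norm = tf + k1 * (1 - b + b * length / avg_len)
        scores[doc] = idf * tf * (k1 + 1) / norm
    return scores


def update_after_miss(prior, searched, p_detect=0.9):
    """Posterior over locations after searching `searched` and finding nothing.

    p_detect is the assumed chance of finding the target when looking
    in the place where it really is.
    """
    unnormalized = {
        loc: p * ((1 - p_detect) if loc in searched else 1.0)
        for loc, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {loc: p / total for loc, p in unnormalized.items()}


if __name__ == "__main__":
    lengths = {"a": 100, "b": 100, "c": 100, "d": 100}
    counts = {"a": 10, "b": 100, "c": 1000}
    print(bm25_scores(counts, lengths))
    print(update_after_miss({"x": 0.5, "y": 0.3, "z": 0.2}, {"x"}))
```

With equal-length documents containing the term 10, 100, and 1000 times, the scores still increase but by smaller and smaller steps, which is the dampening described in the BM25 chapter. In the second function, a searched location keeps a small non-zero probability, and the remaining probability mass shifts to the unsearched locations.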
diff --git a/2024/info/p-search-after.md b/2024/info/p-search-after.md
index 6850a230..b3932bdd 100644
--- a/2024/info/p-search-after.md
+++ b/2024/info/p-search-after.md
@@ -1,6 +1,365 @@
 <!-- Automatically generated by emacsconf-publish-after-page -->
 
 
+<a name="p-search-mainVideo-transcript"></a>
+# Transcript
+
+
+[[!template new="1" text="""Search in daily workflows""" start="00:00:00.000" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""Hello, my name is Zachary Romero, and today I'll be going""" start="00:00:00.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""over p-search, a local search engine in Emacs.""" start="00:00:03.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Search these days is everywhere in software, from text editors,""" start="00:00:08.116" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""to IDEs, to most online websites. These tools tend to fall""" start="00:00:12.399" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""into one of two categories. One are tools that run locally,""" start="00:00:18.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""and work by matching string to text. The most common""" start="00:00:25.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""example of this is grep. In Emacs, there are a lot of""" start="00:00:31.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""extensions which provide functionality on top of these""" start="00:00:35.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""tools, such as projectile-grep, deadgrep,""" start="00:00:38.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""consult-ripgrep. Most editors have some sort of""" start="00:00:42.389" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search current project feature. Most of the time,""" start="00:00:46.850" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""some of these tools have features like regular expressions,""" start="00:00:52.692" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""or you can specify file extension,""" start="00:00:56.394" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""or a directory you want to search in,""" start="00:00:59.216" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""but features are pretty limited.""" start="00:01:01.637" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""The other kind of search we use are usually hosted online,""" start="00:01:03.958" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""and they usually search a vast corpus of data.""" start="00:01:07.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""These are usually proprietary""" start="00:01:12.303" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""online services such as Google, GitHub,""" start="00:01:15.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""SourceGraph for code.""" start="00:01:18.766" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Problems with editor search tools""" start="00:01:24.200" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""The kinds of search features that editors""" start="00:01:24.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""usually have have a lot of downsides to them. For one, a lot""" start="00:01:28.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""of times you don't know the exact search string you're""" start="00:01:36.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""searching for. Some complicated term like this""" start="00:01:38.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""high volume demand partner, you know, do you know if...""" start="00:01:42.784" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Are some words abbreviated, is it capitalized,""" start="00:01:46.861" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""is it in kebab case, camel case, snake case?""" start="00:01:49.709" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""You often have to search all these variations.""" start="00:01:53.090" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Another downside is that the search results returned""" start="00:01:57.572" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""contain a lot of noise. For example,""" start="00:02:05.435" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you may get a lot of test files.""" start="00:02:07.770" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""If the tool hits your vendor directory,""" start="00:02:10.817" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""it may get a bunch of results from libraries""" start="00:02:13.538" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you're using, most of which are not helpful. Another downside""" start="00:02:17.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""is that the order given is, well, there's no meaning to the""" start="00:02:22.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""order. It's usually just the search order that the tool""" start="00:02:26.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""happens to look in first.""" start="00:02:30.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Another thing is, so when you're searching, you oftentimes""" start="00:02:34.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""have to keep the state of the searches in your head. For""" start="00:02:38.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""example, you try one search, you see the results, find the""" start="00:02:41.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""results you think are relevant, keep them in your head, run""" start="00:02:46.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search number two, look through the results, kind of""" start="00:02:49.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""combine these different search results in your head until""" start="00:02:52.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you get an idea of which ones might be relevant.""" start="00:02:56.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Another thing is that the search primitives are fairly limited.""" start="00:02:59.971" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So yeah, you can search regular expressions, but you can't""" start="00:03:04.516" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""really define complex things like, I want to search files in""" start="00:03:10.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""this directory, and this directory, and this directory,""" start="00:03:14.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""except these subdirectories, and except test files, and I""" start="00:03:18.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""only want files with this file extension. Criteria like""" start="00:03:22.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that are really hard to... I'm sure they're possible in tools""" start="00:03:25.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""like grep, but they're pretty hard to construct.""" start="00:03:28.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""And lastly, there's no notion of any relevance. All the""" start="00:03:34.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""results you get back, I mean, you don't know, is the search""" start="00:03:38.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""more relevant? Is it twice as relevant? Is it""" start="00:03:42.040" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""100 times more relevant? These tools usually don't provide""" start="00:03:43.096" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""such information.""" start="00:03:52.280" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Information retrieval""" start="00:03:58.233" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""There's a field called information retrieval,""" start="00:03:58.233" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""and this deals with this exact problem.""" start="00:04:00.395" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""You have lots of data you're searching for.""" start="00:04:02.617" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""How do you construct a search query?""" start="00:04:04.719" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""How do you get results back fast? How do you""" start="00:04:09.262" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""rank which ones are most relevant? How do you evaluate""" start="00:04:09.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""your search system to see if it's getting better or worse?""" start="00:04:14.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""There's a lot of work, a lot of books written on the topic of""" start="00:04:20.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""information retrieval. If one wants to improve""" start="00:04:23.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""searching in Emacs, then drawing inspiration from this""" start="00:04:28.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""field is necessary.""" start="00:04:31.880" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Search engine in Emacs: the index""" start="00:04:34.296" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""The first aspect of information retrieval is the index.""" start="00:04:34.296" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""The reverse index is what search engines use to find results really fast.""" start="00:04:41.384" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Essentially, it's a map of search term""" start="00:04:46.609" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""to locations where that term is located.""" start="00:04:51.455" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""You'll have all the terms or maybe even parts of""" start="00:04:54.739" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the terms, and then you'll have all the locations where""" start="00:04:57.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""they're located. Any query could easily look up""" start="00:04:59.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""where things are located, join results together, and""" start="00:05:02.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that's how they get the results to be really fast. For this""" start="00:05:05.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""project, I decided to forgo creating an index altogether.""" start="00:05:12.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""An index is pretty complicated to maintain because""" start="00:05:19.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""it always has to be in sync. Any time you open a file and save""" start="00:05:23.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""it, you would have to re-index, you would have to make sure""" start="00:05:27.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that file is re-indexed properly. Then you have the""" start="00:05:29.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""whole issue of, well, if you're searching in Emacs,""" start="00:05:32.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you have all these projects, this directory,""" start="00:05:36.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that directory, how do you know which? Do you always have to""" start="00:05:38.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""keep them in sync? It's quite a hard task to handle""" start="00:05:42.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that. Then on the other end, tools like ripgrep can""" start="00:05:47.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search very fast. Even though they can't search maybe on the""" start="00:05:53.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""order of tens of thousands of repositories, for a local""" start="00:05:59.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""setting, they should be plenty fast enough.""" start="00:06:03.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""I benchmarked. Ripgrep, for example, is""" start="00:06:06.040" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""on the order of gigabytes per second.""" start="00:06:12.240" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Definitely, it can search a few pretty big""" start="00:06:15.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""repositories.""" start="00:06:19.240" video="mainVideo-p-search" id="subtitle"]]
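The term-to-locations map described in this section can be sketched in a few lines of Python. This is only an illustration of the idea: p-search itself skips the index, and the names `build_index` and `lookup` and the sample file contents are made up for this example.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each lowercase term to {document name: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term][name].append(position)
    return index


def lookup(index, query):
    """Return the sorted names of documents that contain every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    # Look up each term separately, then join the results by intersection.
    matches = [set(index[t]) if t in index else set() for t in terms]
    return sorted(set.intersection(*matches))


# Hypothetical file contents, for illustration only.
docs = {
    "xdisp.c": "redisplay the window and update the display",
    "search.c": "search the buffer for a regexp",
    "window.c": "select the window to display a buffer",
}
index = build_index(docs)
print(lookup(index, "display window"))
```

Every lookup is a dictionary access followed by a set intersection, which is why an index answers queries quickly. The cost is that `build_index` has to be rerun, or patched, whenever a file changes.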
+
+[[!template new="1" text="""Search engine in Emacs: Ranking""" start="00:06:21.757" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""Next main task. We decided not to use an""" start="00:06:21.757" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""index. Next task is how do we rank search results? So there's""" start="00:06:24.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""two main algorithms that are used these days. The first""" start="00:06:29.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""one is tf-idf, which stands for term frequency, inverse""" start="00:06:33.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""document frequency. Then there's BM25, which is sort of a""" start="00:06:36.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""modified tf-idf algorithm.""" start="00:06:43.040" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""tf-idf: term-frequency x inverse-document-frequency""" start="00:06:43.553" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""tf-idf, without going into""" start="00:06:43.553" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""too much detail, essentially multiplies two terms. One""" start="00:06:45.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""is the term frequency, and then you multiply it by the""" start="00:06:49.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""inverse document frequency. The term frequency is a""" start="00:06:51.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""measure of how often that search term occurs. The""" start="00:06:54.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""inverse document frequency is a measure of how much""" start="00:06:58.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""information that term provides. If the term occurs a lot,""" start="00:07:00.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""then it gets a higher score in the term frequency section.""" start="00:07:06.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""But if it's a common word that exists in a lot of documents,""" start="00:07:08.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""then its inverse document frequency goes down.""" start="00:07:12.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""It kind of scores it less. You'll find that words like the,""" start="00:07:13.901" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""in, is, these really common words, since they occur""" start="00:07:20.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""everywhere, their inverse document frequency is""" start="00:07:25.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""essentially zero. They don't really count towards a""" start="00:07:29.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""score. But when you have rare words that only occur in a""" start="00:07:32.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""few documents, they're weighted a lot more. So the more""" start="00:07:35.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""those rare words occur, they boost the score higher.""" start="00:07:37.680" video="mainVideo-p-search" id="subtitle"]]
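As a rough sketch of the multiplication described above, here is one common tf-idf variant in Python. The talk does not say which variant is meant, so the exact formulas are an assumption: term frequency is the count divided by the document length, and inverse document frequency is the log of the total number of documents over the number containing the term. The toy corpus is invented for the example.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document: term frequency times inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Toy corpus: each document is a list of words.
corpus = [
    "the display code redraws the window".split(),
    "the search code scans the buffer".split(),
    "the window is part of the frame".split(),
]

for word in ("the", "window", "display"):
    print(word, tf_idf(word, corpus[0], corpus))
```

Because "the" appears in every document, its inverse document frequency is log(1) = 0 and it contributes nothing, while "display", which appears in only one document, scores highest.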
+
+[[!template new="1" text="""BM25""" start="00:07:41.160" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""BM25 is a modification of this. It's essentially TF, it's""" start="00:07:41.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""essentially the previous one, except it dampens out terms""" start="00:07:48.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that occur more often. Imagine you have a bunch of""" start="00:07:53.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""documents. One has a term 10 times, one has a term, that same""" start="00:07:55.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""term a hundred times, another has a thousand times.""" start="00:07:59.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""You'll see the score dampens off as the number of""" start="00:08:02.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""occurrences increases. That prevents any one term from""" start="00:08:06.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""overpowering the score. This is the algorithm I ended up""" start="00:08:10.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""choosing for my implementation. So with a plan of using a""" start="00:08:16.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""command line tool like ripgrep to get term occurrences, and""" start="00:08:21.040" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""then using a scoring algorithm like BM25 to rank the terms,""" start="00:08:29.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""we can combine this together and create a simple search""" start="00:08:36.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""mechanism.""" start="00:08:40.080" video="mainVideo-p-search" id="subtitle"]]
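The dampening effect can be seen in a small Python sketch of the usual Okapi BM25 per-term score. The parameter values `k1=1.2` and `b=0.75` are common defaults, not values taken from p-search, and the document counts below are invented for the example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term,
                    k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single query term to one document's score."""
    # Inverse document frequency, in the form that never goes negative.
    idf = math.log((n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5) + 1)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)


# Same term occurring 10, 100, and 1000 times in documents of average length.
scores = [bm25_term_score(tf, 5000, 5000, 1000, 50) for tf in (10, 100, 1000)]
for tf, score in zip((10, 100, 1000), scores):
    print(tf, round(score, 3))
```

Going from 10 to 1000 occurrences raises the score by only a small fraction, because the term-frequency factor can never exceed `k1 + 1`.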
+
+[[!template new="1" text="""Searching with p-search""" start="00:08:41.200" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""Here we're in the directory for the Emacs source code.""" start="00:08:41.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Let's say we want to search for the display code. We""" start="00:08:47.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""run the p-search command, starting the search engine. It""" start="00:08:53.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""opens up. We notice it has three sections, the candidate""" start="00:08:58.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""generators, the priors, and the search results. The""" start="00:09:01.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""candidate generators generate the search space we're""" start="00:09:05.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""looking in. These are all composable and you can add as""" start="00:09:10.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""many as you want. So with this, it specifies that here""" start="00:09:14.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""we're searching on the file system and we're searching in""" start="00:09:19.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""this directory. We're using the ripgrep tool to search""" start="00:09:25.240" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""with, and we want to make sure that we're searching only on""" start="00:09:30.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""files committed to Git. Here we see the search results.""" start="00:09:33.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Notice here is their final probability. Here, notice""" start="00:09:40.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that they're all the same, and they're the same because we""" start="00:09:45.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""don't have any search criteria specified here. Suppose""" start="00:09:47.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""we want to search for display-related code. We add a""" start="00:09:50.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""query: display.""" start="00:09:55.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So then it spins off the processes, gets the search term""" start="00:09:57.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""counts and calculates the new scores. Notice here that""" start="00:10:06.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the results that come out on top, at first glance, appear""" start="00:10:10.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""to be relevant to display. Remember, if we compare""" start="00:10:15.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that to just running raw ripgrep, notice here we're""" start="00:10:19.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""getting 53,000 results and it's pretty hard to go through""" start="00:10:25.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""these results and make sense of it.""" start="00:10:31.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So that's p-search in a nutshell.""" start="00:10:34.320" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Flight AF 447""" start="00:10:41.457" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""Next, I wanted to talk about the story of Flight 447.""" start="00:10:41.457" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Flight 447 going from Rio de Janeiro to Paris""" start="00:10:45.983" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""crashed somewhere in the Atlantic Ocean""" start="00:10:49.327" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""on June 1st, 2009, killing everyone on board.""" start="00:10:51.510" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Four search attempts were made to find the wreckage.""" start="00:10:54.714" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""None of them were successful, except the finding of some debris""" start="00:10:56.895" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""and a dead body. It was decided that they really wanted""" start="00:11:01.076" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""to find the wreckage to retrieve data as to why the crash""" start="00:11:05.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""occurred. This occurred two years after the""" start="00:11:09.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""initial crash. With this next search attempt, they""" start="00:11:14.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""wanted to create a probability distribution of where the""" start="00:11:19.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""crash could be. The only piece of concrete data they had""" start="00:11:23.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""was a GPS signal from the plane at 02:10 containing the GPS""" start="00:11:26.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""location of the plane: 2.98 degrees north, 30.59""" start="00:11:35.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""degrees west. That was the only data they had to go off of.""" start="00:11:40.240" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So they drew a circle around that point""" start="00:11:44.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""with a radius of 40 nautical miles. They assumed that""" start="00:11:50.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""anything outside the circle would have been impossible for""" start="00:11:54.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the plane to reach. This was the starting point for""" start="00:11:57.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""creating the probability distribution of where the""" start="00:12:01.240" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""wreckage was. Anything outside the circle, they""" start="00:12:04.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""assumed it was impossible to reach.""" start="00:12:08.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""The only other pieces of data were the four failed search""" start="00:12:09.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""attempts and then some of the debris found. One thing they""" start="00:12:16.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""did decide was to look at similar crashes where control was""" start="00:12:21.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""lost to analyze where the crashes landed, compared to where""" start="00:12:26.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the loss of control started. This probability""" start="00:12:30.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""distribution, the circular normal distribution was""" start="00:12:37.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""decided upon. Here you can see that the center has a lot""" start="00:12:43.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""higher chance of finding the wreckage. As you go away""" start="00:12:47.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""from the center, the probability of finding the wreckage""" start="00:12:51.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""decreases a lot. The next thing they looked at was, well,""" start="00:12:55.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""they noticed they had retrieved some dead bodies from the""" start="00:13:02.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""wreckage. So they thought that they could calculate the""" start="00:13:05.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""backward drift on that particular day to find where the""" start="00:13:12.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""crash might've occurred. If they found bodies at a""" start="00:13:18.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""particular location, they can kind of work backwards from""" start="00:13:21.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that in order to find where the initial crash occurred.""" start="00:13:25.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So here you can see the probability distribution based off of""" start="00:13:30.666" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the backward drift model. Here you see the darker colors""" start="00:13:34.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""have a higher probability of finding the location. So""" start="00:13:40.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""with all these pieces of data, so with that circular 40""" start="00:13:46.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""nautical mile uniform distribution, with that circular""" start="00:13:50.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""normal distribution of comparing similar crashes, as well""" start="00:13:54.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""as with the backward drift, they were able to combine all""" start="00:14:02.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""three of these pieces""" start="00:14:07.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""in order to come up with a final prior distribution of where""" start="00:14:08.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the wreckage was. So this is the final model""" start="00:14:14.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""they came up with. Here you can see it has that 40 nautical""" start="00:14:19.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""mile radius circle. It has that darker center, which""" start="00:14:24.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""indicates a higher probability because of the""" start="00:14:29.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""crash similarity. Then here you also see along this line""" start="00:14:32.040" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""has a slightly higher probability due to the backward drift""" start="00:14:38.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""distribution.""" start="00:14:50.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So the next thing is, since they had performed searches,""" start="00:14:52.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""they decided to incorporate the data from those searches""" start="00:14:56.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""into their new distribution. Here you can see places""" start="00:15:00.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""where they searched initially. If you think about it,""" start="00:15:04.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you can assume that, well, if you search for something,""" start="00:15:08.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""there's a good chance you'll find it, but not necessarily.""" start="00:15:11.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Anywhere they searched, the probability of""" start="00:15:14.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""finding it there is greatly reduced. It's not zero because""" start="00:15:18.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""obviously you can look for something and miss it, but it kind""" start="00:15:22.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""of reduces the probability that we would expect to find it in""" start="00:15:26.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""those already searched locations. This is the""" start="00:15:31.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""posterior distribution, or the distribution after accounting for""" start="00:15:36.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""observations made.""" start="00:15:41.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Here we can see kind of these cutouts of where the""" start="00:15:44.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""previous searches occurred. This is the final""" start="00:15:48.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""distribution they went off of to perform the subsequent""" start="00:15:53.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search. In the end, the wreckage was found at a point close to""" start="00:15:57.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the center here, thus validating this methodology.""" start="00:16:02.000" video="mainVideo-p-search" id="subtitle"]]
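The two steps in this story, combining several sources of evidence into a prior and then lowering the probability of areas already searched, can be sketched in Python. This is a simplified illustration, not the method the investigators used. The four cells, their weights, the 0.9 detection probability, and the choice to combine sources by multiplying them are all assumptions made for the example.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def combine(*distributions):
    """Multiply sources of evidence cell by cell, then renormalize."""
    product = {}
    for cell in distributions[0]:
        value = 1.0
        for dist in distributions:
            value *= dist[cell]
        product[cell] = value
    return normalize(product)


def update_after_failed_search(dist, searched, p_detect):
    """Bayes update: a searched cell that yielded nothing keeps (1 - p_detect) of its mass."""
    return normalize({
        cell: p * ((1 - p_detect) if cell in searched else 1.0)
        for cell, p in dist.items()
    })


# Hypothetical grid of four cells; D lies outside the 40 nautical mile circle.
reachable = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.0}
similar_crashes = {"A": 0.5, "B": 0.3, "C": 0.1, "D": 0.1}
backward_drift = {"A": 0.3, "B": 0.4, "C": 0.2, "D": 0.1}

prior = combine(reachable, similar_crashes, backward_drift)
# Cell A was searched without success; assume a 90% chance of detection.
posterior = update_after_failed_search(prior, {"A"}, p_detect=0.9)
print(prior)
print(posterior)
```

After the failed search, cell A keeps a small but non-zero probability, and the remaining mass shifts to the unsearched cells, so B becomes the most promising place to look next.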
+
+[[!template new="1" text="""Modifying priors""" start="00:16:06.771" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""We can see the power of this Bayesian search methodology""" start="00:16:06.771" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""in the way that we could take information from all the sources we had.""" start="00:16:10.333" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""We could draw analogies to similar situations.""" start="00:16:14.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""We can quantify these, combine them into a model,""" start="00:16:19.238" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""and then also update our model according to each observation we make.""" start="00:16:22.480" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""I think there's a lot of similarities to be drawn with""" start="00:16:27.894" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""searching on a computer in the sense that when we search for""" start="00:16:30.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""something, there's oftentimes a story we kind of have as to""" start="00:16:35.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""what search terms exist, where we expect to find the file.""" start="00:16:39.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""For example, if you're implementing a new feature, you'll""" start="00:16:43.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""often have some search terms in mind that you think will be""" start="00:16:46.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""relevant. Some search terms, you might think they have a""" start="00:16:49.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""possibility of being relevant, but maybe you're not sure.""" start="00:16:54.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""There's some directories where you know that they're not""" start="00:16:57.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""relevant. There's other criteria like, well, you know that""" start="00:17:02.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""maybe somebody in particular worked on this code.""" start="00:17:07.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""What if you could incorporate that information? Like, I know""" start="00:17:11.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""this author, he's always working on this feature. What if""" start="00:17:16.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""I just give the files that this person works on a higher""" start="00:17:21.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""probability than ones he doesn't work on? Or maybe you think""" start="00:17:25.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that this is a file that's committed too often. You think""" start="00:17:32.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that maybe the number of commits it receives""" start="00:17:38.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""should change your probability of this file being""" start="00:17:43.440" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""relevant. That's where p-search comes in.""" start="00:17:47.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Its aim is to be a framework in order to incorporate all these""" start="00:17:52.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""sorts of different prior information into your searching""" start="00:17:57.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""process. You're able to say things like, I want files""" start="00:18:01.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""authored by this user to be given higher probability. I want""" start="00:18:06.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""this author to be given a lower priority. I know this author""" start="00:18:11.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""never works on this code. If he has a commit, then lower its""" start="00:18:13.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""probability, or you can specify specific paths, or you can""" start="00:18:18.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""specify multiple search terms, weighing different ones""" start="00:18:24.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""according to how you think those terms should be relevant.""" start="00:18:30.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So with p-search, we're able to incorporate information""" start="00:18:38.920" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""from multiple sources. Here, for example, we have a prior""" start="00:18:42.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""of type git author, and we're looking for all of the files""" start="00:18:46.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that are committed to by Lars. So the more commits he has,""" start="00:18:52.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the higher probability is given to that file. Suppose""" start="00:18:56.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""there's a feature I know he worked on, but I don't know the""" start="00:19:01.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""file or necessarily even key terms of it. Well, with this, I""" start="00:19:04.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""can incorporate that information.""" start="00:19:09.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So let's search again. Let's add display.""" start="00:19:12.141" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Let's see what responses we get back here. We can add""" start="00:19:16.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""as many of these criteria as we want. We can even specify that""" start="00:19:22.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the title of the file name should be a certain type. Let's""" start="00:19:27.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""say we're only concerned about C files. We add the file""" start="00:19:31.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""name should contain .c in it. With this, now we""" start="00:19:36.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""notice that all of the C files containing display authored""" start="00:19:45.400" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""by Lars should be given higher probability. We can""" start="00:19:51.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""continue to add these priors as we see fit. The workflow""" start="00:19:56.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""that I found helps when searching is that you'll add""" start="00:20:02.720" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""criteria, you'll see some good results come up and some bad""" start="00:20:07.520" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""results come up. So you'll often find a pattern in those""" start="00:20:11.360" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""bad results, like, oh, I don't want test files, or this""" start="00:20:15.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""directory isn't relevant, or something like that. Then""" start="00:20:18.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you can update your prior distribution, adding its""" start="00:20:22.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""criteria, and then rerun it, and then it will get different""" start="00:20:27.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""probabilities for the files. So in the end, you'll have a""" start="00:20:31.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""list of results that's tailor-made to the thing you're""" start="00:20:35.160" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""searching for.""" start="00:20:37.640" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Importance""" start="00:20:40.405" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""There's a couple of other features I""" start="00:20:40.405" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""want to go through. One thing is that each of these priors,""" start="00:20:41.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""you can specify the importance. In other words, how""" start="00:20:49.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""important is this particular piece of information to your""" start="00:20:55.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search? So here, everything is of importance medium. But""" start="00:21:01.120" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""let's say I really care about something having the word""" start="00:21:05.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""display in it. I'm going to change its importance.""" start="00:21:07.880" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""Instead of medium, I'll change its importance to high.""" start="00:21:12.680" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""What that does essentially is things that don't have""" start="00:21:18.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""display in them are given a much bigger penalty and things with""" start="00:21:23.280" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the word display in them are rated much higher.""" start="00:21:28.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""With this, we're able to fine-tune the results that we get.""" start="00:21:28.129" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template new="1" text="""Complement or inverse""" start="00:21:38.560" video="mainVideo-p-search" id="subtitle"]]
+
+[[!template text="""Another thing you can do is that you can add the complement or""" start="00:21:38.560" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the inverse of certain queries. Let's say you want to""" start="00:21:45.640" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search for display, but you don't want it to contain the word""" start="00:21:49.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""frame. With the complement option on, when we create this""" start="00:21:53.240" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""search prior, now it's going to be searching for frame, but""" start="00:21:58.040" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""instead of increasing the search score, it's going to""" start="00:22:01.840" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""decrease it if it contains the word frame.""" start="00:22:04.960" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""So here, things related to frame are kind of""" start="00:22:07.000" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""deprioritized. We can also say that we really don't want""" start="00:22:14.320" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""the search to contain the word frame by increasing its""" start="00:22:18.080" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""importance. So with all these composable pieces, we can""" start="00:22:21.600" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""create kind of a search that's tailor-made to our needs.""" start="00:22:27.200" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""That concludes this talk. There's a lot more I could talk""" start="00:22:33.413" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""about with regards to p-search, so definitely follow the""" start="00:22:35.760" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""project if you're interested. Thanks for watching, and I""" start="00:22:37.800" video="mainVideo-p-search" id="subtitle"]]
+[[!template text="""hope you enjoy the rest of the conference.""" start="00:22:40.640" video="mainVideo-p-search" id="subtitle"]]
+
+
+
+Captioner: sachac
+
 Questions or comments? Please e-mail [zacromero@posteo.com](mailto:zacromero@posteo.com?subject=Comment%20for%20EmacsConf%202023%20p-search%3A%20p-search%3A%20a%20local%20search%20engine%20in%20Emacs)
 
 
diff --git a/2024/info/p-search-before.md b/2024/info/p-search-before.md
index b6ceaec3..902d9f62 100644
--- a/2024/info/p-search-before.md
+++ b/2024/info/p-search-before.md
@@ -8,12 +8,26 @@ The following image shows where the talk is in the schedule for Sat 2024-12-07.
 Format: 23-min talk			; Q&A: BigBlueButton conference room <https://media.emacsconf.org/2024/current/bbb-p-search.html> Etherpad: <https://pad.emacsconf.org/2024-p-search>  
 Etherpad: <https://pad.emacsconf.org/2024-p-search>  
 Discuss on IRC: [#emacsconf-dev](https://chat.emacsconf.org/?join=emacsconf,emacsconf-dev)  
-Status: Ready to stream  
+Status: Now playing on the conference livestream  
 <div>Times in different time zones:</div><div class="times" start="2024-12-07T16:20:00Z" end="2024-12-07T16:45:00Z"><div class="conf-time">Saturday, Dec 7 2024, ~11:20 AM - 11:45 AM EST (US/Eastern)</div><div class="others"><div>which is the same as:</div>Saturday, Dec 7 2024, ~10:20 AM - 10:45 AM CST (US/Central)<br />Saturday, Dec 7 2024, ~9:20 AM - 9:45 AM MST (US/Mountain)<br />Saturday, Dec 7 2024, ~8:20 AM - 8:45 AM PST (US/Pacific)<br />Saturday, Dec 7 2024, ~4:20 PM - 4:45 PM UTC <br />Saturday, Dec 7 2024, ~5:20 PM - 5:45 PM CET (Europe/Paris)<br />Saturday, Dec 7 2024, ~6:20 PM - 6:45 PM EET (Europe/Athens)<br />Saturday, Dec 7 2024, ~9:50 PM - 10:15 PM IST (Asia/Kolkata)<br />Sunday, Dec 8 2024, ~12:20 AM - 12:45 AM +08 (Asia/Singapore)<br />Sunday, Dec 8 2024, ~1:20 AM - 1:45 AM JST (Asia/Tokyo)</div></div><div><strong><a href="/2024/watch/dev/">Find out how to watch and participate</a></strong></div>
 
 
 
 
+<div class="vid"><video controls preload="none" id="p-search-mainVideo"><source src="https://media.emacsconf.org/2024/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.webm" />captions="""<track label="English" kind="captions" srclang="en" src="/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt" default />"""<track kind="chapters" label="Chapters" src="/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main--chapters.vtt" /><p><em>Your browser does not support the video tag. Please download the video instead.</em></p></video>[[!template id="chapters" vidid="p-search-mainVideo" data="""
+00:00.000 Search in daily workflows
+01:24.200 Problems with editor search tools
+03:58.233 Information retrieval
+04:34.296 Search engine in Emacs: the index
+06:21.757 Search engine in Emacs: Ranking
+06:43.553 tf-idf: term-frequency x inverse-document-frequency
+07:41.160 BM25
+08:41.200 Searching with p-search
+10:41.457 Flight AF 447
+16:06.771 Modifying priors
+20:40.405 Importance
+21:38.560 Complement or inverse
 
+"""]]<div></div>Duration: 22:42 minutes<div class="files resources"><ul><li><a href="https://pad.emacsconf.org/2024-p-search">Open Etherpad</a></li><li><a href="https://media.emacsconf.org/2024/current/bbb-p-search.html">Open public Q&A</a></li><li><a href="https://media.emacsconf.org/2024/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--intro.webm">Download --intro.webm</a></li><li><a href="https://media.emacsconf.org/2024/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main--chapters.vtt">Download --main--chapters.vtt</a></li><li><a href="https://media.emacsconf.org/2024/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt">Download --main.vtt</a></li><li><a href="https://media.emacsconf.org/2024/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.webm">Download --main.webm (63MB)</a></li></ul></div></div>
 # Description
 <!-- End of emacsconf-publish-before-page -->
 \ No newline at end of file
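
The "Importance" and "Complement or inverse" chapters of the transcript in this patch describe priors that can be weighted and inverted before they are combined into one ranking. The following is only a rough sketch of that idea, not p-search's actual implementation: the function names, the importance weights, the power-based scaling, and the example probabilities are all assumptions made for illustration.

```python
# Hypothetical sketch: none of these names or numbers come from p-search itself.
IMPORTANCE_WEIGHT = {"low": 0.5, "medium": 1.0, "high": 2.0}  # assumed scale


def apply_prior(p, importance="medium", complement=False):
    """Turn one prior's raw match probability p (0 < p < 1) into a factor."""
    if complement:
        p = 1.0 - p  # a match now counts against the file
    # An exponent above 1 widens the ratio between matching and
    # non-matching files; an exponent below 1 narrows it.
    return p ** IMPORTANCE_WEIGHT[importance]


def rank(files, priors):
    """Multiply the per-prior factors for each file, then normalize to sum to 1."""
    scores = {}
    for name, raw in files.items():
        score = 1.0
        for key, importance, complement in priors:
            score *= apply_prior(raw[key], importance, complement)
        scores[name] = score
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Assumed raw probabilities that each file matches "display" and "frame".
files = {
    "xdisp.c": {"display": 0.9, "frame": 0.3},
    "frame.c": {"display": 0.6, "frame": 0.9},
    "README": {"display": 0.2, "frame": 0.2},
}
# "display" at high importance, plus the complement of "frame" at medium.
priors = [("display", "high", False), ("frame", "medium", True)]
posterior = rank(files, priors)
```

Under these assumptions, a file that matches "display" strongly and "frame" weakly ends up with most of the probability mass, which mirrors the behavior the speaker demonstrates. Other scaling schemes would serve equally well for a sketch like this.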