path: root/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt



WEBVTT captioned by sachac

NOTE Search in daily workflows

00:00:00.000 --> 00:00:03.399
Hello, my name is Zachary Romero, and today I'll be going

00:00:03.400 --> 00:00:08.115
over p-search, a local search engine in Emacs.

00:00:08.116 --> 00:00:12.398
Search these days is everywhere in software, from text editors,

00:00:12.399 --> 00:00:18.359
to IDEs, to most online websites. These tools tend to fall

00:00:18.360 --> 00:00:25.839
into one of two categories. One are tools that run locally,

00:00:25.840 --> 00:00:31.279
and work by matching string to text. The most common

00:00:31.280 --> 00:00:35.639
example of this is grep. In Emacs, there are a lot of

00:00:35.640 --> 00:00:38.959
extensions which provide functionality on top of these

00:00:38.960 --> 00:00:42.388
tools, such as projectile-grep, deadgrep,

00:00:42.389 --> 00:00:46.849
consult-ripgrep. Most editors have some sort of

00:00:46.850 --> 00:00:52.691
search current project feature. Most of the time,

00:00:52.692 --> 00:00:56.393
some of these tools have features like regular expressions,

00:00:56.394 --> 00:00:59.215
or you can specify file extension,

00:00:59.216 --> 00:01:01.636
or a directory you want to search in,

00:01:01.637 --> 00:01:03.957
but features are pretty limited.

00:01:03.958 --> 00:01:07.919
The other kind of search we use are usually hosted online,

00:01:07.920 --> 00:01:12.302
and they usually search a vast corpus of data.

00:01:12.303 --> 00:01:15.639
These are usually proprietary

00:01:15.640 --> 00:01:18.765
online services such as Google, GitHub,

00:01:18.766 --> 00:01:24.199
SourceGraph for code.

NOTE Problems with editor search tools

00:01:24.200 --> 00:01:28.839
The kind of search feature that editors

00:01:28.840 --> 00:01:36.719
usually have have a lot of downsides to them. For one, a lot

00:01:36.720 --> 00:01:38.839
of times you don't know the exact search string you're

00:01:38.840 --> 00:01:42.783
searching for. Some complicated term like this

00:01:42.784 --> 00:01:46.860
high volume demand partner, you know, do you know if...

00:01:46.861 --> 00:01:49.708
Are some words abbreviated, is it capitalized,

00:01:49.709 --> 00:01:53.089
is it in kebab case, camel case, snake case?

00:01:53.090 --> 00:01:57.571
You often have to search all these variations.

00:01:57.572 --> 00:02:05.434
Another downside is that the search results returned

00:02:05.435 --> 00:02:07.769
contain a lot of noise. For example,

00:02:07.770 --> 00:02:10.816
you may get a lot of test files.

00:02:10.817 --> 00:02:13.537
If the tool hits your vendor directory,

00:02:13.538 --> 00:02:17.199
it may get a bunch of results from libraries

00:02:17.200 --> 00:02:22.879
you're using, which most are not helpful. Another downside

00:02:22.880 --> 00:02:26.679
is that the order given is, well, there's no meaning to the

00:02:26.680 --> 00:02:30.319
order. It's usually just the search order that the tool

00:02:30.320 --> 00:02:34.639
happens to look in first.

00:02:34.640 --> 00:02:38.639
Another thing is, so when you're searching, you oftentimes

00:02:38.640 --> 00:02:41.639
have to keep the state of the searches in your head. For

00:02:41.640 --> 00:02:46.639
example, you try one search, you see the results, find the

00:02:46.640 --> 00:02:49.639
results you think are relevant, keep them in your head, run

00:02:49.640 --> 00:02:52.519
search number two, look through the results, kind of

00:02:52.520 --> 00:02:56.119
combine these different search results in your head until

00:02:56.120 --> 00:02:59.970
you get an idea of which ones might be relevant.

00:02:59.971 --> 00:03:04.515
Another thing is that the search primitives are fairly limited.

00:03:04.516 --> 00:03:10.599
So yeah, you can search regular expressions, but you can't

00:03:10.600 --> 00:03:14.719
really define complex things like, I want to search files in

00:03:14.720 --> 00:03:18.439
this directory, and this directory, and this directory,

00:03:18.440 --> 00:03:22.319
except these subdirectories, and accept test files, and I

00:03:22.320 --> 00:03:25.559
only want files with this file extension. Criteria like

00:03:25.560 --> 00:03:28.919
that are really hard to... I'm sure they're possible in tools

00:03:28.920 --> 00:03:34.479
like grep, but they're pretty hard to construct.

00:03:34.480 --> 00:03:38.199
And lastly, there's no notion of any relevance. All the

00:03:38.200 --> 00:03:42.039
results you get back, I mean, you don't know, is the search

00:03:42.040 --> 00:03:43.095
more relevant? Is it twice as relevant? Is it

00:03:43.096 --> 00:03:52.279
100 times more relevant? These tools usually don't provide

00:03:52.280 --> 00:03:58.232
such information.

NOTE Information retrieval

00:03:58.233 --> 00:04:00.394
There's a field called information retrieval,

00:04:00.395 --> 00:04:02.616
and this deals with this exact problem.

00:04:02.617 --> 00:04:04.718
You have lots of data you're searching for.

00:04:04.719 --> 00:04:09.261
How do you construct a search query?

00:04:09.262 --> 00:04:09.839
How do you get results back fast? How do you

00:04:09.840 --> 00:04:14.519
rank which ones are most relevant? How do you evaluate

00:04:14.520 --> 00:04:20.079
your search system to see if it's getting better or worse?

00:04:20.080 --> 00:04:23.119
There's a lot of work, a lot of books written on the topic of

00:04:23.120 --> 00:04:28.159
information retrieval. If one wants to improve

00:04:28.160 --> 00:04:31.879
searching in Emacs, then drawing inspiration from this

00:04:31.880 --> 00:04:34.295
field is necessary.

NOTE Search engine in Emacs: the index

00:04:34.296 --> 00:04:41.383
The first aspect of information retrieval is the index.

00:04:41.384 --> 00:04:46.608
The reverse index is what search engines use to find results really fast.

00:04:46.609 --> 00:04:51.454
Essentially, it's a map of search term

00:04:51.455 --> 00:04:54.738
to locations where that term is located.

00:04:54.739 --> 00:04:57.079
You'll have all the terms or maybe even parts of

00:04:57.080 --> 00:04:59.159
the terms, and then you'll have all the locations where

00:04:59.160 --> 00:05:02.119
they're located. Any query could easily look up

00:05:02.120 --> 00:05:05.919
where things are located, join results together, and

00:05:05.920 --> 00:05:12.879
that's how they get the results to be really fast. For this

00:05:12.880 --> 00:05:19.159
project, I decided to forgo creating an index altogether.

00:05:19.160 --> 00:05:23.759
An index is pretty complicated to maintain because

00:05:23.760 --> 00:05:27.319
it always has to be in sync. Any time you open a file and save

00:05:27.320 --> 00:05:29.959
it, you would have to re-index, you would have to make sure

00:05:29.960 --> 00:05:32.559
that file is re-indexed properly. Then you have the

00:05:32.560 --> 00:05:36.119
whole issue of, well, if you're searching in Emacs,

00:05:36.120 --> 00:05:38.799
you have all these projects, this directory,

00:05:38.800 --> 00:05:42.479
that directory, how do you know which? Do you always have to

00:05:42.480 --> 00:05:47.399
keep them in sync? It's quite a hard task to handle

00:05:47.400 --> 00:05:53.079
that. Then on the other end, tools like ripgrep can

00:05:53.080 --> 00:05:59.119
search very fast. Even though they can't search maybe on the

00:05:59.120 --> 00:06:03.919
order of tens of thousands of repositories, for a local

00:06:03.920 --> 00:06:06.039
setting, they should be plenty fast enough.

00:06:06.040 --> 00:06:12.239
I benchmarked. Ripgrep, for example, is

00:06:12.240 --> 00:06:15.959
on the order of gigabytes per second.

00:06:15.960 --> 00:06:19.239
Definitely, it can search a few pretty big size

00:06:19.240 --> 00:06:21.756
repositories.

NOTE Search engine in Emacs: Ranking

00:06:21.757 --> 00:06:24.799
Next main task. We decided not to use an

00:06:24.800 --> 00:06:29.959
index. Next task is how do we rank search results? So there's

00:06:29.960 --> 00:06:33.439
two main algorithms that are used these days. The first

00:06:33.440 --> 00:06:36.519
one is tf-idf, which stands for term frequency, inverse

00:06:36.520 --> 00:06:43.039
target frequency. Then there's BM25, which is sort of a

00:06:43.040 --> 00:06:43.552
modified tf-idf algorithm.

NOTE tf-idf: term-frequency x inverse-document-frequency

00:06:43.553 --> 00:06:45.679
tf-idf, without going into

00:06:45.680 --> 00:06:49.159
too much detail, essentially multiplies two terms. One

00:06:49.160 --> 00:06:51.879
is the term frequency, and then you multiply it by the

00:06:51.880 --> 00:06:54.559
inverse document frequency. The term frequency is a

00:06:54.560 --> 00:06:58.519
measure of how often that search term occurs. The

00:06:58.520 --> 00:07:00.799
inverse document frequency is a measure of how much

00:07:00.800 --> 00:07:06.199
information that term provides. If the term occurs a lot,

00:07:06.200 --> 00:07:08.719
then it gets a higher score in the term frequency section.

00:07:08.720 --> 00:07:12.399
But if it's a common word that exists in a lot of documents,

00:07:12.400 --> 00:07:13.900
then its inverse document frequency goes down.

00:07:13.901 --> 00:07:20.879
It kind of scores it less. You'll find that words like the,

00:07:20.880 --> 00:07:25.959
in, is, these really common words, since they occur

00:07:25.960 --> 00:07:29.199
everywhere, their inverse document frequency is

00:07:29.200 --> 00:07:32.479
essentially zero. They don't really count towards a

00:07:32.480 --> 00:07:35.679
score. But when you have rare words that only occur in a

00:07:35.680 --> 00:07:37.679
few documents, they're weighted a lot more. So the more

00:07:37.680 --> 00:07:41.159
those rare words occur, they boost the score higher.

NOTE BM25

00:07:41.160 --> 00:07:48.839
BM25 is a modification of this. It's essentially TF, it's

00:07:48.840 --> 00:07:53.119
essentially the previous one, except it dampens out terms

00:07:53.120 --> 00:07:55.439
that occur more often. Imagine you have a bunch of

00:07:55.440 --> 00:07:59.359
documents. One has a term 10 times, one has a term, that same

00:07:59.360 --> 00:08:02.439
term a hundred times, another has a thousand times.

00:08:02.440 --> 00:08:06.799
You'll see the score dampens off as the number of

00:08:06.800 --> 00:08:10.639
occurrences increases. That prevents any one term from

00:08:10.640 --> 00:08:16.559
overpowering the score. This is the algorithm I ended up

00:08:16.560 --> 00:08:21.039
choosing for my implementation. So with a plan of using a

00:08:21.040 --> 00:08:29.559
command line tool like ripgrep to get term occurrences, and

00:08:29.560 --> 00:08:36.799
then using a scoring algorithm like BM25 to rank the terms,

00:08:36.800 --> 00:08:40.079
we can combine this together and create a simple search

00:08:40.080 --> 00:08:41.199
mechanism.

NOTE Searching with p-search

00:08:41.200 --> 00:08:47.439
Here we're in the directory for the Emacs source code.

00:08:47.440 --> 00:08:53.479
Let's say we want to search for the display code. We

00:08:53.480 --> 00:08:58.679
run the p-search command, starting the search engine. It

00:08:58.680 --> 00:09:01.159
opens up. We notice it has three sections, the candidate

00:09:01.160 --> 00:09:05.199
generators, the priors, and the search results. The

00:09:05.200 --> 00:09:09.999
candidate generators generates the search space we're

00:09:10.000 --> 00:09:14.719
looking on. These are all composable and you can add as

00:09:14.720 --> 00:09:19.719
many as you want. So with this, it specifies that here

00:09:19.720 --> 00:09:25.239
we're searching on the file system and we're searching in

00:09:25.240 --> 00:09:30.799
this directory. We're using the ripgrep tool to search

00:09:30.800 --> 00:09:33.359
with, and we want to make sure that we're searching only on

00:09:33.360 --> 00:09:40.479
files committed to Git. Here we see the search results.

00:09:40.480 --> 00:09:45.159
Notice here is their final probability. Here, notice

00:09:45.160 --> 00:09:47.079
that they're all the same, and they're the same because we

00:09:47.080 --> 00:09:50.719
don't have any search criteria specified here. Suppose

00:09:50.720 --> 00:09:55.679
we want to search for display-related code. We add a

00:09:55.680 --> 00:09:57.359
query: display.

00:09:57.360 --> 00:10:06.559
So then it spins off the processes, gets the search term

00:10:06.560 --> 00:10:10.879
counts and calculates the new scores. Notice here that

00:10:10.880 --> 00:10:15.759
the results that come on top are just at first glance appear

00:10:15.760 --> 00:10:19.919
to be relevant to display. Remember, if we compare

00:10:19.920 --> 00:10:25.079
that to just running a ripgrep raw, notice here we're

00:10:25.080 --> 00:10:31.279
getting 53,000 results and it's pretty hard to go through

00:10:31.280 --> 00:10:34.319
these results and make sense of it.

00:10:34.320 --> 00:10:41.456
So that's p-search in a nutshell.

NOTE Flight AF 447

00:10:41.457 --> 00:10:45.982
Next, I wanted to talk about the story of Flight 447.

00:10:45.983 --> 00:10:49.326
Flight 447 going from Rio de Janeiro to Paris

00:10:49.327 --> 00:10:51.509
crashed somewhere in the Atlantic Ocean

00:10:51.510 --> 00:10:54.713
on June 1st, 2009, killing everyone on board.

00:10:54.714 --> 00:10:56.894
Four search attempts were made to find the wreckage.

00:10:56.895 --> 00:11:01.075
None of them were successful, except the finding of some debris

00:11:01.076 --> 00:11:05.479
and a dead body. It was decided that they really wanted

00:11:05.480 --> 00:11:09.519
to find the wreckage to retrieve data as to why the search

00:11:09.520 --> 00:11:14.639
occurred. This occurred two years after the

00:11:14.640 --> 00:11:19.959
initial crash. With this next search attempt, they

00:11:19.960 --> 00:11:23.199
wanted to create a probability distribution of where the

00:11:23.200 --> 00:11:26.759
crash could be. The only piece of concrete data they had

00:11:26.760 --> 00:11:35.079
was a GPS signal from the ship at 210 containing the GPS

00:11:35.080 --> 00:11:40.239
location of the plane was at 2.98 degrees north, 30.59

00:11:40.240 --> 00:11:44.719
degrees west. That was the only data they had to go off of.

00:11:44.720 --> 00:11:50.079
So they drew a circle around that point

00:11:50.080 --> 00:11:54.679
with a radius of 40 nautical miles. They assumed that

00:11:54.680 --> 00:11:57.479
anything outside the circle would have been impossible for

00:11:57.480 --> 00:12:01.239
the ship to reach. This was the starting point for

00:12:01.240 --> 00:12:04.799
creating the probability distribution of where the

00:12:04.800 --> 00:12:08.119
wreckage occurred. Anything outside the circle, they

00:12:08.120 --> 00:12:09.639
assumed it was impossible to reach.

00:12:09.640 --> 00:12:16.479
The only other pieces of data were the four failed search

00:12:16.480 --> 00:12:21.719
attempts and then some of the debris found. One thing they

00:12:21.720 --> 00:12:26.159
did decide was to look at similar crashes where control was

00:12:26.160 --> 00:12:30.319
lost to analyze where the crashes landed, compared to where

00:12:30.320 --> 00:12:37.399
the loss of control started. This probability

00:12:37.400 --> 00:12:43.479
distribution, the circular normal distribution was

00:12:43.480 --> 00:12:47.919
decided upon. Here you can see that the center has a lot

00:12:47.920 --> 00:12:51.879
higher chance of finding the wreckage. As you go away

00:12:51.880 --> 00:12:55.399
from the center, the probability of finding the wreckage

00:12:55.400 --> 00:13:02.319
decreases a lot. The next thing they looked at was, well,

00:13:02.320 --> 00:13:05.959
they noticed they had retrieved some dead bodies from the

00:13:05.960 --> 00:13:12.959
wreckage. So they thought that they could calculate the

00:13:12.960 --> 00:13:18.439
backward drift on that particular day to find where the

00:13:18.440 --> 00:13:21.479
crash might've occurred. If they found bodies at a

00:13:21.480 --> 00:13:25.119
particular location, they can kind of work backwards from

00:13:25.120 --> 00:13:30.665
that in order to find where the initial crash occurred.

00:13:30.666 --> 00:13:34.719
So here you can see the probability distribution based off of

00:13:34.720 --> 00:13:40.279
the backward drift model. Here you see the darker colors

00:13:40.280 --> 00:13:46.159
have a higher probability of finding the location. So

00:13:46.160 --> 00:13:50.679
with all these pieces of data, so with that circular 40

00:13:50.680 --> 00:13:54.959
nautical mile uniform distribution, with that circular

00:13:54.960 --> 00:14:02.199
normal distribution of comparing similar crashes, as well

00:14:02.200 --> 00:14:07.439
as with the backward drift, they were able to combine all

00:14:07.440 --> 00:14:08.559
three of these pieces

00:14:08.560 --> 00:14:14.599
in order to come up with a final prior distribution of where

00:14:14.600 --> 00:14:19.519
the wreckage occurred. So this is what the final model

00:14:19.520 --> 00:14:24.719
they came upon. Here you can see it has that 40 nautical

00:14:24.720 --> 00:14:29.679
mile radius circle. It has that darker center, which

00:14:29.680 --> 00:14:32.039
indicates a higher probability because of the

00:14:32.040 --> 00:14:38.959
crash similarity. Then here you also see along this line

00:14:38.960 --> 00:14:50.799
has a slightly higher probability due to the backward drift

00:14:50.800 --> 00:14:52.119
distribution.

00:14:52.120 --> 00:14:56.559
So the next thing is, since they had performed searches,

00:14:56.560 --> 00:15:00.559
they decided to incorporate the data from those searches

00:15:00.560 --> 00:15:04.759
into their new distribution. Here you can see places

00:15:04.760 --> 00:15:08.879
where they searched initially. If you think about it,

00:15:08.880 --> 00:15:11.399
you can assume that, well, if you search for something,

00:15:11.400 --> 00:15:14.199
there's a good chance you'll find it, but not necessarily.

00:15:14.200 --> 00:15:18.439
Anywhere where they searched, the probability of it

00:15:18.440 --> 00:15:22.839
finding it there is greatly reduced. It's not zero because

00:15:22.840 --> 00:15:26.879
obviously you can look for something and miss it, but it kind

00:15:26.880 --> 00:15:31.119
of reduces the probability that we would expect to find it in

00:15:31.120 --> 00:15:36.679
those already searched locations. This is the

00:15:36.680 --> 00:15:41.919
posterior distribution or distribution after counting

00:15:41.920 --> 00:15:44.559
observations made.

00:15:44.560 --> 00:15:48.759
Here we can see kind of these cutouts of where the

00:15:48.760 --> 00:15:53.959
previous searches occurred. This is the final

00:15:53.960 --> 00:15:56.999
distribution they went off of to perform the subsequent

00:15:57.000 --> 00:16:01.999
search. In the end, the wreckage was found at a point close to

00:16:02.000 --> 00:16:06.770
the center here, thus validating this methodology.

NOTE Modifying priors

00:16:06.771 --> 00:16:10.332
We can see the power of this Bayesian search methodology

00:16:10.333 --> 00:16:13.999
in the way that we could take information from all the sources we had.

00:16:14.000 --> 00:16:19.237
We could draw analogies to similar situations.

00:16:19.238 --> 00:16:22.479
We can quantify these, combine them into a model,

00:16:22.480 --> 00:16:27.893
and then also update our model according to each observation we make.

00:16:27.894 --> 00:16:30.359
I think there's a lot of similarities to be drawn with

00:16:30.360 --> 00:16:35.159
searching on a computer in the sense that when we search for

00:16:35.160 --> 00:16:39.399
something, there's oftentimes a story we kind of have as to

00:16:39.400 --> 00:16:43.959
what search terms exist, where we expect to find the file.

00:16:43.960 --> 00:16:46.719
For example, if you're implementing a new feature, you'll

00:16:46.720 --> 00:16:49.919
often have some search terms in mind that you think will be

00:16:49.920 --> 00:16:54.719
relevant. Some search terms, you might think they have a

00:16:54.720 --> 00:16:57.599
possibility of being relevant, but maybe you're not sure.

00:16:57.600 --> 00:17:02.879
There's some directories where you know that they're not

00:17:02.880 --> 00:17:07.759
relevant. There's other criteria like, well, you know that

00:17:07.760 --> 00:17:11.399
maybe somebody in particular worked on this code.

00:17:11.400 --> 00:17:16.319
What if you could incorporate that information? Like, I know

00:17:16.320 --> 00:17:21.399
this author, he's always working on this feature. What if

00:17:21.400 --> 00:17:25.519
I just give the files that this person works on a higher

00:17:25.520 --> 00:17:32.599
probability than ones he doesn't work on? Or maybe you think

00:17:32.600 --> 00:17:38.599
that this is a file that's committed too often. You think

00:17:38.600 --> 00:17:43.439
that maybe the amount of times of commits it receives

00:17:43.440 --> 00:17:47.719
should change your probability of this file being

00:17:47.720 --> 00:17:52.839
relevant. That's where p-search comes in.

00:17:52.840 --> 00:17:57.679
Its aim is to be a framework in order to incorporate all these

00:17:57.680 --> 00:18:01.359
sorts of different prior information into your searching

00:18:01.360 --> 00:18:05.999
process. You're able to say things like, I want files

00:18:06.000 --> 00:18:11.119
authored by this user to be given higher probability. I want

00:18:11.120 --> 00:18:13.919
this author to be given a lower priority. I know this author

00:18:13.920 --> 00:18:18.759
never works on this code. If he has a commit, then lower its

00:18:18.760 --> 00:18:24.679
probability, or you can specify specific paths, or you can

00:18:24.680 --> 00:18:30.199
specify multiple search terms, weighing different ones

00:18:30.200 --> 00:18:38.919
according to how you think those terms should be relevant.

00:18:38.920 --> 00:18:42.079
So with p-search, we're able to incorporate information

00:18:42.080 --> 00:18:46.279
from multiple sources. Here, for example, we have a prior

00:18:46.280 --> 00:18:52.079
of type git author, and we're looking for all of the files

00:18:52.080 --> 00:18:56.719
that are committed to by Lars. So the more commits he has,

00:18:56.720 --> 00:19:01.399
the higher probability is given to that file. Suppose

00:19:01.400 --> 00:19:04.559
there's a feature I know he worked on, but I don't know the

00:19:04.560 --> 00:19:09.159
file or necessarily even key terms of it. Well, with this, I

00:19:09.160 --> 00:19:12.140
can incorporate that information.

00:19:12.141 --> 00:19:15.999
So let's search again. Let's add display.

00:19:16.000 --> 00:19:22.959
Let's see what responses we get back here. We can add

00:19:22.960 --> 00:19:27.199
as many of these criteria as we want. We can even specify that

00:19:27.200 --> 00:19:31.519
the title of the file name should be a certain type. Let's

00:19:31.520 --> 00:19:36.599
say we're only concerned about C files. We add the file

00:19:36.600 --> 00:19:45.399
name should contain .c in it. With this, now we

00:19:45.400 --> 00:19:51.319
notice that all of the C files containing display authored

00:19:51.320 --> 00:19:56.279
by Lars should be given higher probability. We can

00:19:56.280 --> 00:20:02.719
continue to add these priors as we feel fit. The workflow

00:20:02.720 --> 00:20:07.519
that I found helps when searching is that you'll add

00:20:07.520 --> 00:20:11.359
criteria, you'll see some good results come up and some bad

00:20:11.360 --> 00:20:15.319
results come up. So you'll often find a pattern in those

00:20:15.320 --> 00:20:18.839
bad results, like, oh, I don't want test files, or this

00:20:18.840 --> 00:20:22.679
directory isn't relevant, or something like that. Then

00:20:22.680 --> 00:20:27.199
you can update your prior distribution, adding its

00:20:27.200 --> 00:20:31.119
criteria, and then rerun it, and then it will get different

00:20:31.120 --> 00:20:35.159
probabilities for the files. So in the end, you'll have a

00:20:35.160 --> 00:20:37.639
list of results that's tailor-made to the thing you're

00:20:37.640 --> 00:20:40.404
searching for.

NOTE Importance

00:20:40.405 --> 00:20:41.639
There's a couple of other features I

00:20:41.640 --> 00:20:49.079
want to go through. One thing is that each of these priors,

00:20:49.080 --> 00:20:55.839
you can specify the importance. In other words, how

00:20:55.840 --> 00:21:01.119
important is this particular piece of information to your

00:21:01.120 --> 00:21:05.199
search? So here, everything is of importance medium. But

00:21:05.200 --> 00:21:07.879
let's say I really care about something having the word

00:21:07.880 --> 00:21:12.679
display in it. I'm going to change its importance.

00:21:12.680 --> 00:21:18.599
Instead of medium, I'll change its importance to high.

00:21:18.600 --> 00:21:23.279
What that does essentially is things that don't have

00:21:23.280 --> 00:21:28.079
display in it are given a much bigger penalty and things with

00:21:28.080 --> 00:21:28.128
the word display in it are rated much higher.

00:21:28.129 --> 00:21:38.559
With this, we're able to fine-tune the results that we get.

NOTE Complement or inverse

00:21:38.560 --> 00:21:45.639
Another thing you can do is that you can add the complement or

00:21:45.640 --> 00:21:49.759
the inverse of certain queries. Let's say you want to

00:21:49.760 --> 00:21:53.239
search for display, but you don't want it to contain the word

00:21:53.240 --> 00:21:58.039
frame. With the complement option on, when we create this

00:21:58.040 --> 00:22:01.839
search prior, now it's going to be searching for frame, but

00:22:01.840 --> 00:22:04.959
instead of increasing the search score, it's going to

00:22:04.960 --> 00:22:06.999
decrease it if it contains the word frame.

00:22:07.000 --> 00:22:14.319
So here, things related to frame are kind of

00:22:14.320 --> 00:22:18.079
deprioritized. We can also say that we really don't want

00:22:18.080 --> 00:22:21.599
the search to contain the word frame by increasing its

00:22:21.600 --> 00:22:27.199
importance. So with all these composable pieces, we can

00:22:27.200 --> 00:22:33.412
create kind of a search that's tailor-made to our needs.

00:22:33.413 --> 00:22:35.759
That concludes this talk. There's a lot more I could talk

00:22:35.760 --> 00:22:37.799
about with regards to research, so definitely follow the

00:22:37.800 --> 00:22:40.639
project if you're interested. Thanks for watching, and I

00:22:40.640 --> 00:22:42.240
hope you enjoy the rest of the conference.