summaryrefslogblamecommitdiffstats
path: root/2024/captions/emacsconf-2024-p-search--psearch-a-local-search-engine-in-emacs--zac-romero--main.vtt
blob: 111f27286beb8088621d2b1dcde6339972752935 (plain) (tree)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970









































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































                                                                         
WEBVTT captioned by sachac

NOTE Search in daily workflows

00:00:00.000 --> 00:00:03.399
Hello, my name is Zachary Romero, and today I'll be going

00:00:03.400 --> 00:00:08.115
over p-search, a local search engine in Emacs.

00:00:08.116 --> 00:00:12.398
Search these days is everywhere in software, from text editors,

00:00:12.399 --> 00:00:18.359
to IDEs, to most online websites. These tools tend to fall

00:00:18.360 --> 00:00:25.839
into one of two categories. One are tools that run locally,

00:00:25.840 --> 00:00:31.279
and work by matching string to text. The most common

00:00:31.280 --> 00:00:35.639
example of this is grep. In Emacs, there are a lot of

00:00:35.640 --> 00:00:38.959
extensions which provide functionality on top of these

00:00:38.960 --> 00:00:42.388
tools, such as projectile-grep, deadgrep,

00:00:42.389 --> 00:00:46.849
consult-ripgrep. Most editors have some sort of

00:00:46.850 --> 00:00:52.691
search current project feature. Most of the time,

00:00:52.692 --> 00:00:56.393
some of these tools have features like regular expressions,

00:00:56.394 --> 00:00:59.215
or you can specify file extension,

00:00:59.216 --> 00:01:01.636
or a directory you want to search in,

00:01:01.637 --> 00:01:03.957
but features are pretty limited.

00:01:03.958 --> 00:01:07.919
The other kind of search we use are usually hosted online,

00:01:07.920 --> 00:01:12.302
and they usually search a vast corpus of data.

00:01:12.303 --> 00:01:15.639
These are usually proprietary

00:01:15.640 --> 00:01:18.765
online services such as Google, GitHub,

00:01:18.766 --> 00:01:24.199
SourceGraph for code.

NOTE Problems with editor search tools

00:01:24.200 --> 00:01:28.839
The kind of search feature that editors

00:01:28.840 --> 00:01:36.719
usually have have a lot of downsides to them. For one, a lot

00:01:36.720 --> 00:01:38.839
of times you don't know the exact search string you're

00:01:38.840 --> 00:01:42.783
searching for. Some complicated term like this

00:01:42.784 --> 00:01:46.860
high volume demand partner, you know, do you know if...

00:01:46.861 --> 00:01:49.708
Are some words abbreviated, is it capitalized,

00:01:49.709 --> 00:01:53.089
is it in kebab case, camel case, snake case?

00:01:53.090 --> 00:01:57.571
You often have to search all these variations.

00:01:57.572 --> 00:02:05.434
Another downside is that the search results returned

00:02:05.435 --> 00:02:07.769
contain a lot of noise. For example,

00:02:07.770 --> 00:02:10.816
you may get a lot of test files.

00:02:10.817 --> 00:02:13.537
If the tool hits your vendor directory,

00:02:13.538 --> 00:02:17.199
it may get a bunch of results from libraries

00:02:17.200 --> 00:02:22.879
you're using, which most are not helpful. Another downside

00:02:22.880 --> 00:02:26.679
is that the order given is, well, there's no meaning to the

00:02:26.680 --> 00:02:30.319
order. It's usually just the search order that the tool

00:02:30.320 --> 00:02:34.639
happens to look in first.

00:02:34.640 --> 00:02:38.639
Another thing is, so when you're searching, you oftentimes

00:02:38.640 --> 00:02:41.639
have to keep the state of the searches in your head. For

00:02:41.640 --> 00:02:46.639
example, you try one search, you see the results, find the

00:02:46.640 --> 00:02:49.639
results you think are relevant, keep them in your head, run

00:02:49.640 --> 00:02:52.519
search number two, look through the results, kind of

00:02:52.520 --> 00:02:56.119
combine these different search results in your head until

00:02:56.120 --> 00:02:59.970
you get an idea of which ones might be relevant.

00:02:59.971 --> 00:03:04.515
Another thing is that the search primitives are fairly limited.

00:03:04.516 --> 00:03:10.599
So yeah, you can search regular expressions, but you can't

00:03:10.600 --> 00:03:14.719
really define complex things like, I want to search files in

00:03:14.720 --> 00:03:18.439
this directory, and this directory, and this directory,

00:03:18.440 --> 00:03:22.319
except these subdirectories, and accept test files, and I

00:03:22.320 --> 00:03:25.559
only want files with this file extension. Criteria like

00:03:25.560 --> 00:03:28.919
that are really hard to... I'm sure they're possible in tools

00:03:28.920 --> 00:03:34.479
like grep, but they're pretty hard to construct.

00:03:34.480 --> 00:03:38.199
And lastly, there's no notion of any relevance. All the

00:03:38.200 --> 00:03:42.039
results you get back, I mean, you don't know, is the search

00:03:42.040 --> 00:03:43.095
more relevant? Is it twice as relevant? Is it

00:03:43.096 --> 00:03:52.279
100 times more relevant? These tools usually don't provide

00:03:52.280 --> 00:03:58.232
such information.

NOTE Information retrieval

00:03:58.233 --> 00:04:00.394
There's a field called information retrieval,

00:04:00.395 --> 00:04:02.616
and this deals with this exact problem.

00:04:02.617 --> 00:04:04.718
You have lots of data you're searching for.

00:04:04.719 --> 00:04:09.261
How do you construct a search query?

00:04:09.262 --> 00:04:09.839
How do you get results back fast? How do you

00:04:09.840 --> 00:04:14.519
rank which ones are most relevant? How do you evaluate

00:04:14.520 --> 00:04:20.079
your search system to see if it's getting better or worse?

00:04:20.080 --> 00:04:23.119
There's a lot of work, a lot of books written on the topic of

00:04:23.120 --> 00:04:28.159
information retrieval. If one wants to improve

00:04:28.160 --> 00:04:31.879
searching in Emacs, then drawing inspiration from this

00:04:31.880 --> 00:04:34.295
field is necessary.

NOTE Search engine in Emacs: the index

00:04:34.296 --> 00:04:41.383
The first aspect of information retrieval is the index.

00:04:41.384 --> 00:04:46.608
The reverse index is what search engines use to find results really fast.

00:04:46.609 --> 00:04:51.454
Essentially, it's a map of search term

00:04:51.455 --> 00:04:54.738
to locations where that term is located.

00:04:54.739 --> 00:04:57.079
You'll have all the terms or maybe even parts of

00:04:57.080 --> 00:04:59.159
the terms, and then you'll have all the locations where

00:04:59.160 --> 00:05:02.119
they're located. Any query could easily look up

00:05:02.120 --> 00:05:05.919
where things are located, join results together, and

00:05:05.920 --> 00:05:12.879
that's how they get the results to be really fast. For this

00:05:12.880 --> 00:05:19.159
project, I decided to forgo creating an index altogether.

00:05:19.160 --> 00:05:23.759
An index is pretty complicated to maintain because

00:05:23.760 --> 00:05:27.319
it always has to be in sync. Any time you open a file and save

00:05:27.320 --> 00:05:29.959
it, you would have to re-index, you would have to make sure

00:05:29.960 --> 00:05:32.559
that file is re-indexed properly. Then you have the

00:05:32.560 --> 00:05:36.119
whole issue of, well, if you're searching in Emacs,

00:05:36.120 --> 00:05:38.799
you have all these projects, this directory,

00:05:38.800 --> 00:05:42.479
that directory, how do you know which? Do you always have to

00:05:42.480 --> 00:05:47.399
keep them in sync? It's quite a hard task to handle

00:05:47.400 --> 00:05:53.079
that. Then on the other end, tools like ripgrep can

00:05:53.080 --> 00:05:59.119
search very fast. Even though they can't search maybe on the

00:05:59.120 --> 00:06:03.919
order of tens of thousands of repositories, for a local

00:06:03.920 --> 00:06:06.039
setting, they should be plenty fast enough.

00:06:06.040 --> 00:06:12.239
I benchmarked. Ripgrep, for example, is

00:06:12.240 --> 00:06:15.959
on the order of gigabytes per second.

00:06:15.960 --> 00:06:19.239
Definitely, it can search a few pretty big size

00:06:19.240 --> 00:06:21.756
repositories.

NOTE Search engine in Emacs: Ranking

00:06:21.757 --> 00:06:24.799
Next main task. We decided not to use an

00:06:24.800 --> 00:06:29.959
index. Next task is how do we rank search results? So there's

00:06:29.960 --> 00:06:33.439
two main algorithms that are used these days. The first

00:06:33.440 --> 00:06:36.519
one is tf-idf, which stands for term frequency, inverse

00:06:36.520 --> 00:06:43.039
target frequency. Then there's BM25, which is sort of a

00:06:43.040 --> 00:06:43.552
modified tf-idf algorithm.

NOTE tf-idf: term-frequency x inverse-document-frequency

00:06:43.553 --> 00:06:45.679
tf-idf, without going into

00:06:45.680 --> 00:06:49.159
too much detail, essentially multiplies two terms. One

00:06:49.160 --> 00:06:51.879
is the term frequency, and then you multiply it by the

00:06:51.880 --> 00:06:54.559
inverse document frequency. The term frequency is a

00:06:54.560 --> 00:06:58.519
measure of how often that search term occurs. The

00:06:58.520 --> 00:07:00.799
inverse document frequency is a measure of how much

00:07:00.800 --> 00:07:06.199
information that term provides. If the term occurs a lot,

00:07:06.200 --> 00:07:08.719
then it gets a higher score in the term frequency section.

00:07:08.720 --> 00:07:12.399
But if it's a common word that exists in a lot of documents,

00:07:12.400 --> 00:07:13.900
then its inverse document frequency goes down.

00:07:13.901 --> 00:07:20.879
It kind of scores it less. You'll find that words like the,

00:07:20.880 --> 00:07:25.959
in, is, these really common words, since they occur

00:07:25.960 --> 00:07:29.199
everywhere, their inverse document frequency is

00:07:29.200 --> 00:07:32.479
essentially zero. They don't really count towards a

00:07:32.480 --> 00:07:35.679
score. But when you have rare words that only occur in a

00:07:35.680 --> 00:07:37.679
few documents, they're weighted a lot more. So the more

00:07:37.680 --> 00:07:41.159
those rare words occur, they boost the score higher.

NOTE BM25

00:07:41.160 --> 00:07:48.839
BM25 is a modification of this. It's essentially TF, it's

00:07:48.840 --> 00:07:53.119
essentially the previous one, except it dampens out terms

00:07:53.120 --> 00:07:55.439
that occur more often. Imagine you have a bunch of

00:07:55.440 --> 00:07:59.359
documents. One has a term 10 times, one has a term, that same

00:07:59.360 --> 00:08:02.439
term a hundred times, another has a thousand times.

00:08:02.440 --> 00:08:06.799
You'll see the score dampens off as the number of

00:08:06.800 --> 00:08:10.639
occurrences increases. That prevents any one term from

00:08:10.640 --> 00:08:16.559
overpowering the score. This is the algorithm I ended up

00:08:16.560 --> 00:08:21.039
choosing for my implementation. So with a plan of using a

00:08:21.040 --> 00:08:29.559
command line tool like ripgrep to get term occurrences, and

00:08:29.560 --> 00:08:36.799
then using a scoring algorithm like BM25 to rank the terms,

00:08:36.800 --> 00:08:40.079
we can combine this together and create a simple search

00:08:40.080 --> 00:08:41.199
mechanism.

NOTE Searching with p-search

00:08:41.200 --> 00:08:47.439
Here we're in the directory for the Emacs source code.

00:08:47.440 --> 00:08:53.479
Let's say we want to search for the display code. We

00:08:53.480 --> 00:08:58.679
run the p-search command, starting the search engine. It

00:08:58.680 --> 00:09:01.159
opens up. We notice it has three sections, the candidate

00:09:01.160 --> 00:09:05.199
generators, the priors, and the search results. The

00:09:05.200 --> 00:09:09.999
candidate generators generates the search space we're

00:09:10.000 --> 00:09:14.719
looking on. These are all composable and you can add as

00:09:14.720 --> 00:09:19.719
many as you want. So with this, it specifies that here

00:09:19.720 --> 00:09:25.239
we're searching on the file system and we're searching in

00:09:25.240 --> 00:09:30.799
this directory. We're using the ripgrep tool to search

00:09:30.800 --> 00:09:33.359
with, and we want to make sure that we're searching only on

00:09:33.360 --> 00:09:40.479
files committed to Git. Here we see the search results.

00:09:40.480 --> 00:09:45.159
Notice here is their final probability. Here, notice

00:09:45.160 --> 00:09:47.079
that they're all the same, and they're the same because we

00:09:47.080 --> 00:09:50.719
don't have any search criteria specified here. Suppose

00:09:50.720 --> 00:09:55.679
we want to search for display-related code. We add a

00:09:55.680 --> 00:09:57.359
query: display.

00:09:57.360 --> 00:10:06.559
So then it spins off the processes, gets the search term

00:10:06.560 --> 00:10:10.879
counts and calculates the new scores. Notice here that

00:10:10.880 --> 00:10:15.759
the results that come on top are just at first glance appear

00:10:15.760 --> 00:10:19.919
to be relevant to display. Remember, if we compare

00:10:19.920 --> 00:10:25.079
that to just running a ripgrep raw, notice here we're

00:10:25.080 --> 00:10:31.279
getting 53,000 results and it's pretty hard to go through

00:10:31.280 --> 00:10:34.319
these results and make sense of it.

00:10:34.320 --> 00:10:41.456
So that's p-search in a nutshell.

NOTE Flight AF 447

00:10:41.457 --> 00:10:45.982
Next, I wanted to talk about the story of Flight 447.

00:10:45.983 --> 00:10:49.326
Flight 447 going from Rio de Janeiro to Paris

00:10:49.327 --> 00:10:51.509
crashed somewhere in the Atlantic Ocean

00:10:51.510 --> 00:10:54.713
on June 1st, 2009, killing everyone on board.

00:10:54.714 --> 00:10:56.894
Four search attempts were made to find the wreckage.

00:10:56.895 --> 00:11:01.075
None of them were successful, except the finding of some debris

00:11:01.076 --> 00:11:05.479
and a dead body. It was decided that they really wanted

00:11:05.480 --> 00:11:09.519
to find the wreckage to retrieve data as to why the search

00:11:09.520 --> 00:11:14.639
occurred. This occurred two years after the

00:11:14.640 --> 00:11:19.959
initial crash. With this next search attempt, they

00:11:19.960 --> 00:11:23.199
wanted to create a probability distribution of where the

00:11:23.200 --> 00:11:26.759
crash could be. The only piece of concrete data they had

00:11:26.760 --> 00:11:35.079
was a GPS signal from the ship at 210 containing the GPS

00:11:35.080 --> 00:11:40.239
location of the plane was at 2.98 degrees north, 30.59

00:11:40.240 --> 00:11:44.719
degrees west. That was the only data they had to go off of.

00:11:44.720 --> 00:11:50.079
So they drew a circle around that point

00:11:50.080 --> 00:11:54.679
with a radius of 40 nautical miles. They assumed that

00:11:54.680 --> 00:11:57.479
anything outside the circle would have been impossible for

00:11:57.480 --> 00:12:01.239
the ship to reach. This was the starting point for

00:12:01.240 --> 00:12:04.799
creating the probability distribution of where the

00:12:04.800 --> 00:12:08.119
wreckage occurred. Anything outside the circle, they

00:12:08.120 --> 00:12:09.639
assumed it was impossible to reach.

00:12:09.640 --> 00:12:16.479
The only other pieces of data were the four failed search

00:12:16.480 --> 00:12:21.719
attempts and then some of the debris found. One thing they

00:12:21.720 --> 00:12:26.159
did decide was to look at similar crashes where control was

00:12:26.160 --> 00:12:30.319
lost to analyze where the crashes landed, compared to where

00:12:30.320 --> 00:12:37.399
the loss of control started. This probability

00:12:37.400 --> 00:12:43.479
distribution, the circular normal distribution was

00:12:43.480 --> 00:12:47.919
decided upon. Here you can see that the center has a lot

00:12:47.920 --> 00:12:51.879
higher chance of finding the wreckage. As you go away

00:12:51.880 --> 00:12:55.399
from the center, the probability of finding the wreckage

00:12:55.400 --> 00:13:02.319
decreases a lot. The next thing they looked at was, well,

00:13:02.320 --> 00:13:05.959
they noticed they had retrieved some dead bodies from the

00:13:05.960 --> 00:13:12.959
wreckage. So they thought that they could calculate the

00:13:12.960 --> 00:13:18.439
backward drift on that particular day to find where the

00:13:18.440 --> 00:13:21.479
crash might've occurred. If they found bodies at a

00:13:21.480 --> 00:13:25.119
particular location, they can kind of work backwards from

00:13:25.120 --> 00:13:30.665
that in order to find where the initial crash occurred.

00:13:30.666 --> 00:13:34.719
So here you can see the probability distribution based off of

00:13:34.720 --> 00:13:40.279
the backward drift model. Here you see the darker colors

00:13:40.280 --> 00:13:46.159
have a higher probability of finding the location. So

00:13:46.160 --> 00:13:50.679
with all these pieces of data, so with that circular 40

00:13:50.680 --> 00:13:54.959
nautical mile uniform distribution, with that circular

00:13:54.960 --> 00:14:02.199
normal distribution of comparing similar crashes, as well

00:14:02.200 --> 00:14:07.439
as with the backward drift, they were able to combine all

00:14:07.440 --> 00:14:08.559
three of these pieces

00:14:08.560 --> 00:14:14.599
in order to come up with a final prior distribution of where

00:14:14.600 --> 00:14:19.519
the wreckage occurred. So this is what the final model

00:14:19.520 --> 00:14:24.719
they came upon. Here you can see it has that 40 nautical

00:14:24.720 --> 00:14:29.679
mile radius circle. It has that darker center, which

00:14:29.680 --> 00:14:32.039
indicates a higher probability because of the

00:14:32.040 --> 00:14:38.959
crash similarity. Then here you also see along this line

00:14:38.960 --> 00:14:50.799
has a slightly higher probability due to the backward drift

00:14:50.800 --> 00:14:52.119
distribution.

00:14:52.120 --> 00:14:56.559
So the next thing is, since they had performed searches,

00:14:56.560 --> 00:15:00.559
they decided to incorporate the data from those searches

00:15:00.560 --> 00:15:04.759
into their new distribution. Here you can see places

00:15:04.760 --> 00:15:08.879
where they searched initially. If you think about it,

00:15:08.880 --> 00:15:11.399
you can assume that, well, if you search for something,

00:15:11.400 --> 00:15:14.199
there's a good chance you'll find it, but not necessarily.

00:15:14.200 --> 00:15:18.439
Anywhere where they searched, the probability of it

00:15:18.440 --> 00:15:22.839
finding it there is greatly reduced. It's not zero because

00:15:22.840 --> 00:15:26.879
obviously you can look for something and miss it, but it kind

00:15:26.880 --> 00:15:31.119
of reduces the probability that we would expect to find it in

00:15:31.120 --> 00:15:36.679
those already searched locations. This is the

00:15:36.680 --> 00:15:41.919
posterior distribution or distribution after counting

00:15:41.920 --> 00:15:44.559
observations made.

00:15:44.560 --> 00:15:48.759
Here we can see kind of these cutouts of where the

00:15:48.760 --> 00:15:53.959
previous searches occurred. This is the final

00:15:53.960 --> 00:15:56.999
distribution they went off of to perform the subsequent

00:15:57.000 --> 00:16:01.999
search. In the end, the wreckage was found at a point close to

00:16:02.000 --> 00:16:06.770
the center here, thus validating this methodology.

NOTE Modifying priors

00:16:06.771 --> 00:16:10.332
We can see the power of this Bayesian search methodology

00:16:10.333 --> 00:16:13.999
in the way that we could take information from all the sources we had.

00:16:14.000 --> 00:16:19.237
We could draw analogies to similar situations.

00:16:19.238 --> 00:16:22.479
We can quantify these, combine them into a model,

00:16:22.480 --> 00:16:27.893
and then also update our model according to each observation we make.

00:16:27.894 --> 00:16:30.359
I think there's a lot of similarities to be drawn with

00:16:30.360 --> 00:16:35.159
searching on a computer in the sense that when we search for

00:16:35.160 --> 00:16:39.399
something, there's oftentimes a story we kind of have as to

00:16:39.400 --> 00:16:43.959
what search terms exist, where we expect to find the file.

00:16:43.960 --> 00:16:46.719
For example, if you're implementing a new feature, you'll

00:16:46.720 --> 00:16:49.919
often have some search terms in mind that you think will be

00:16:49.920 --> 00:16:54.719
relevant. Some search terms, you might think they have a

00:16:54.720 --> 00:16:57.599
possibility of being relevant, but maybe you're not sure.

00:16:57.600 --> 00:17:02.879
There's some directories where you know that they're not

00:17:02.880 --> 00:17:07.759
relevant. There's other criteria like, well, you know that

00:17:07.760 --> 00:17:11.399
maybe somebody in particular worked on this code.

00:17:11.400 --> 00:17:16.319
What if you could incorporate that information? Like, I know

00:17:16.320 --> 00:17:21.399
this author, he's always working on this feature. What if

00:17:21.400 --> 00:17:25.519
I just give the files that this person works on a higher

00:17:25.520 --> 00:17:32.599
probability than ones he doesn't work on? Or maybe you think

00:17:32.600 --> 00:17:38.599
that this is a file that's committed too often. You think

00:17:38.600 --> 00:17:43.439
that maybe the amount of times of commits it receives

00:17:43.440 --> 00:17:47.719
should change your probability of this file being

00:17:47.720 --> 00:17:52.839
relevant. That's where p-search comes in.

00:17:52.840 --> 00:17:57.679
Its aim is to be a framework in order to incorporate all these

00:17:57.680 --> 00:18:01.359
sorts of different prior information into your searching

00:18:01.360 --> 00:18:05.999
process. You're able to say things like, I want files

00:18:06.000 --> 00:18:11.119
authored by this user to be given higher probability. I want

00:18:11.120 --> 00:18:13.919
this author to be given a lower priority. I know this author

00:18:13.920 --> 00:18:18.759
never works on this code. If he has a commit, then lower its

00:18:18.760 --> 00:18:24.679
probability, or you can specify specific paths, or you can

00:18:24.680 --> 00:18:30.199
specify multiple search terms, weighing different ones

00:18:30.200 --> 00:18:38.919
according to how you think those terms should be relevant.

00:18:38.920 --> 00:18:42.079
So with p-search, we're able to incorporate information

00:18:42.080 --> 00:18:46.279
from multiple sources. Here, for example, we have a prior

00:18:46.280 --> 00:18:52.079
of type git author, and we're looking for all of the files

00:18:52.080 --> 00:18:56.719
that are committed to by Lars. So the more commits he has,

00:18:56.720 --> 00:19:01.399
the higher probability is given to that file. Suppose

00:19:01.400 --> 00:19:04.559
there's a feature I know he worked on, but I don't know the

00:19:04.560 --> 00:19:09.159
file or necessarily even key terms of it. Well, with this, I

00:19:09.160 --> 00:19:12.140
can incorporate that information.

00:19:12.141 --> 00:19:15.999
So let's search again. Let's add display.

00:19:16.000 --> 00:19:22.959
Let's see what responses we get back here. We can add

00:19:22.960 --> 00:19:27.199
as many of these criteria as we want. We can even specify that

00:19:27.200 --> 00:19:31.519
the title of the file name should be a certain type. Let's

00:19:31.520 --> 00:19:36.599
say we're only concerned about C files. We add the file

00:19:36.600 --> 00:19:45.399
name should contain .c in it. With this, now we

00:19:45.400 --> 00:19:51.319
notice that all of the C files containing display authored

00:19:51.320 --> 00:19:56.279
by Lars should be given higher probability. We can

00:19:56.280 --> 00:20:02.719
continue to add these priors as we feel fit. The workflow

00:20:02.720 --> 00:20:07.519
that I found helps when searching is that you'll add

00:20:07.520 --> 00:20:11.359
criteria, you'll see some good results come up and some bad

00:20:11.360 --> 00:20:15.319
results come up. So you'll often find a pattern in those

00:20:15.320 --> 00:20:18.839
bad results, like, oh, I don't want test files, or this

00:20:18.840 --> 00:20:22.679
directory isn't relevant, or something like that. Then

00:20:22.680 --> 00:20:27.199
you can update your prior distribution, adding its

00:20:27.200 --> 00:20:31.119
criteria, and then rerun it, and then it will get different

00:20:31.120 --> 00:20:35.159
probabilities for the files. So in the end, you'll have a

00:20:35.160 --> 00:20:37.639
list of results that's tailor-made to the thing you're

00:20:37.640 --> 00:20:40.404
searching for.

NOTE Importance

00:20:40.405 --> 00:20:41.639
There's a couple of other features I

00:20:41.640 --> 00:20:49.079
want to go through. One thing is that each of these priors,

00:20:49.080 --> 00:20:55.839
you can specify the importance. In other words, how

00:20:55.840 --> 00:21:01.119
important is this particular piece of information to your

00:21:01.120 --> 00:21:05.199
search? So here, everything is of importance medium. But

00:21:05.200 --> 00:21:07.879
let's say I really care about something having the word

00:21:07.880 --> 00:21:12.679
display in it. I'm going to change its importance.

00:21:12.680 --> 00:21:18.599
Instead of medium, I'll change its importance to high.

00:21:18.600 --> 00:21:23.279
What that does essentially is things that don't have

00:21:23.280 --> 00:21:28.079
display in it are given a much bigger penalty and things with

00:21:28.080 --> 00:21:28.128
the word display in it are rated much higher.

00:21:28.129 --> 00:21:38.559
With this, we're able to fine-tune the results that we get.

NOTE Complement or inverse

00:21:38.560 --> 00:21:45.639
Another thing you can do is that you can add the complement or

00:21:45.640 --> 00:21:49.759
the inverse of certain queries. Let's say you want to

00:21:49.760 --> 00:21:53.239
search for display, but you don't want it to contain the word

00:21:53.240 --> 00:21:58.039
frame. With the complement option on, when we create this

00:21:58.040 --> 00:22:01.839
search prior, now it's going to be searching for frame, but

00:22:01.840 --> 00:22:04.959
instead of increasing the search score, it's going to

00:22:04.960 --> 00:22:06.999
decrease it if it contains the word frame.

00:22:07.000 --> 00:22:14.319
So here, things related to frame are kind of

00:22:14.320 --> 00:22:18.079
deprioritized. We can also say that we really don't want

00:22:18.080 --> 00:22:21.599
the search to contain the word frame by increasing its

00:22:21.600 --> 00:22:27.199
importance. So with all these composable pieces, we can

00:22:27.200 --> 00:22:33.412
create kind of a search that's tailor-made to our needs.

00:22:33.413 --> 00:22:35.759
That concludes this talk. There's a lot more I could talk

00:22:35.760 --> 00:22:37.799
about with regards to research, so definitely follow the

00:22:37.800 --> 00:22:40.639
project if you're interested. Thanks for watching, and I

00:22:40.640 --> 00:22:42.240
hope you enjoy the rest of the conference.