[[!meta title="p-search: a local search engine in Emacs"]]
[[!meta copyright="Copyright © 2024 Zac Romero"]]
[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. -->


# p-search: a local search engine in Emacs
Zac Romero - <mailto:zacromero@posteo.com>

[[!inline pages="internal(2024/info/p-search-before)" raw="yes"]]

Search is an essential part of any digital work.  Despite this 
importance, most tools don't go beyond simple string/regex matching. 
Oftentimes, a user knows more about what they're looking for: who 
authored the file, how often it's modified, as well as search terms that 
the user is only slightly confident exist.

p-search is a search engine designed to combine these various kinds of 
prior knowledge about the search target and present the results to the 
user in a systematic way.  In this talk, I will present this package as 
well as go over the fundamentals of information retrieval.

Details:

In this talk, I will go over p-search.  p-search is a search engine
that assists users in finding things, with a focus on flexibility and
customizability.

The talk will begin by going over concepts from the field of information
retrieval such as indexing, querying, ranking, and evaluating.  This
will provide the necessary background to describe the workings of
p-search.

Next, an overview of the p-search package and its features will be
given.  p-search utilizes a probabilistic framework to rank documents
according to prior beliefs about which file is being sought.  So, for
example, a user might know for sure that the file contains a particular
string, might have a strong feeling that it should contain another
word, and might suspect that it contains some other words.  The user
may also know the file's extension and subdirectory, and know that a
particular person works on this file a lot.  p-search allows the user
to express all of these predicates at once, and ranks documents
accordingly.
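
To make this concrete, here is a minimal, purely illustrative sketch in
plain Elisp (not the actual p-search API): each prior is modeled as a
function mapping a document to a probability, and a document score is
simply the product of its priors.  The `my/`-prefixed names and the
example directory are hypothetical.

```elisp
;; Illustrative only, not the actual p-search API.  Each "prior" is a
;; function from a document (here, a file name) to a probability that
;; the document is the one being looked for.
(defun my/score-document (file priors)
  "Multiply the probabilities that each prior in PRIORS assigns to FILE."
  (apply #'* (mapcar (lambda (prior) (funcall prior file)) priors)))

(defun my/rank-documents (files priors)
  "Return FILES paired with their scores, best match first."
  (sort (mapcar (lambda (f) (cons f (my/score-document f priors))) files)
        (lambda (a b) (> (cdr a) (cdr b)))))

;; Example: strong belief that the file is Elisp, weaker belief that
;; its name mentions "search"; the directory is a placeholder.
(my/rank-documents
 (directory-files-recursively "~/my-project" ".*")
 (list (lambda (f) (if (string-suffix-p ".el" f) 0.9 0.1))
       (lambda (f) (if (string-match-p "search" f) 0.7 0.3))))
```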

The talk will then progress to discuss assorted topics concerning the
project, such as design considerations and future directions.

The aim of the talk is to expand the listeners' understanding of search
as well as inspire creativity concerning the possibilities of search
tools.

Code: <https://github.com/zkry/p-search>


# Discussion

## Questions and answers

-   Q: Do you think a reduced version of this functionality could be
    integrated into isearch?  Right now you can turn on various flags
    when using isearch with M-s \<key\>, like M-s SPC to match spaces
    literally.  Is it possible to add a flag to \"search the buffer
    semantically\"? (Ditto with M-x occur, which is more similar to your
    buffer-oriented results interface)
    -   A: It\'s essentially a framework, so you would create a
        generator for that; but it does not exist yet.
-   Q: Any idea how this would work with personal information like
    Zettelkastens?
    -   A: Usable as is, because all the files are in a directory, so
        you only have to set which files to search. You can then add
        information to ignore some files (like daily notes).
        Documentation is coming.
-   Q: How well does the search work for synonyms, especially if you
    use different languages?
    -   A: There is an entire field of search concerned with normalizing
        the words that are input (like plural -\> singular
        transformations). Currently p-search does not address this.
    -   A: For different languages it gets complicated (vector search is
        possible, but might be too slow in Elisp).
-   Q: When searching by author, I know authors may set up a new machine
    and not put in exactly the same information. Is this doing anything
    to combine those into one author?
    -   A: Currently it uses the git command, so if you know the emails
        the author has used, you can add different priors for them.
-   Q: \"Rak\" is a cool, more powerful grep written in Raku, and it may
    have some good ideas for increasing the value of searches, for
    example using Raku code while searching. Have you seen it?
    -   [https://github.com/lizmat/App-Rak](https://github.com/lizmat/App-Rak){rel="noreferrer noopener"}
    -   [https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D](https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D){rel="noreferrer noopener"} 
    -   A: I have to look into that. Tree-sitter AST would also be cool
        to include to have a better search.
-   Q: Have you thought about integrating results from using cosine
    similarity with a deep-learning based vector embedding?  This will
    let us search for \"fruit\" and get back results that have \"apple\"
    or \"grapes\" in them \-- that kind of thing.  It will probably also
    handle the case of terms that could be abbreviated/formatted
    differently like in your initial example.
    -   A: Goes back to semantic search. Probably can be implemented,
        but also probably too slow. And it is hard to get the embeddings
        and the system running on the machine.
-   Q:  I missed the start of the talk, so apologies if this has been
    covered - is it possible to save/bookmark searches or search
    templates so they can be used again and again?
    -   A: Exactly.  I just recently added bookmarking capabilities, so
        we can bookmark and rerun our searches from where we left off.
        I tried to create a one-to-one mapping between the search
        session and a data representation of it - there is a command to
        get this custom plist - so you can resume the search where you
        left off, and it can be used to create a command that triggers a
        prior search.
-   Q: You mentioned candidate generators. Could you explain what the
    score is assigned to? Is it a line, or whatever the candidate
    generator produces? How does it work with rg in your demo?

    FOLLOW-UP: How does the git scoring hook into this?

    -   A: A candidate generator produces documents. Documents have
        properties (like an id and a path), and from those you get
        subproperties such as the content of the document. Each
        candidate generator knows how to search its kind of documents
        (emails, buffers, files, URLs, \...). There is only the notion
        of a score attached to a document.
    -   Another method is then used to extract the matching lines from
        the document (to show precisely which lines match).
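
As a purely illustrative sketch of the document model described in the
answer above (assumed `my/` names, not the actual p-search data
structures): a candidate generator returns documents, each document is
a small property list, content is a subproperty derived from the path,
and scoring only ever attaches a number to a document.

```elisp
;; Illustrative only; assumed names, not the actual p-search API.
;; A candidate generator returns a list of documents.  A document is a
;; plist with an :id and a :path; its content is derived from the path.
(defun my/file-candidate-generator (directory)
  "Return a document plist for every file under DIRECTORY."
  (mapcar (lambda (file) (list :id file :path file))
          (directory-files-recursively directory ".*")))

(defun my/document-content (document)
  "Fetch the content subproperty of DOCUMENT from its :path."
  (with-temp-buffer
    (insert-file-contents (plist-get document :path))
    (buffer-string)))

;; Scoring attaches a number to a whole document; extracting the
;; matching lines for display is a separate, later step.
(defun my/term-count-score (document term)
  "Count occurrences of TERM in DOCUMENT as a crude score."
  (with-temp-buffer
    (insert (my/document-content document))
    (count-matches (regexp-quote term) (point-min) (point-max))))
```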

-   Q: Hearing about this makes me think of how nice the emergent
    workflow with denote is, using easy filtering with orderless. It is
    really easy to search for file tags, titles, etc. and do things with
    them. Did this or something like it help or influence the design of
    p-search?
    -   A: You can search for whatever you want. Nothing is hardcoded
        for anything (files, directories, tags, titles, \...).

-   Q: \[comments from IRC\] \<NullNix\> git covers the \"multiple
    names\" thing itself: see .mailmap  10:51:19 
    -   \<NullNix\> this is a git feature, p-search shouldn\'t need to
        implement it  10:51:34 
    -   \<NullNix\> To me this seems to have similarities to notmuch \--
        honestly I want notmuch with the p-search UI :) (of course,
        notmuch uses a xapian index, because repeatedly grepping all
        traffic on huge mailing lists would be insane.)  10:55:30 
    -   \<NullNix\> (notmuch also has bookmark-like things as a core
        feature, but no real weighting like p-search does.)  10:56:07 
        -   A: I have not used notmuch, but many extensions are
            possible. mu4e uses a full index for the search. This could
            be adapted here too, with the SQL database as a source.

-   Q: You can search a buffer using ripgrep by feeding it in as stdin
    to the ripgrep process, can\'t you?
    -   A: Yes, you can. But the aim is to be able to search many
        different kinds of things from Elisp, so there is a mechanism in
        p-search anyway to represent anything, including buffers. This
        is working pretty well.
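
As a hedged sketch of that point, using standard Emacs primitives and
ripgrep behavior rather than anything p-search-specific (the function
name is made up): a buffer can be handed to ripgrep on stdin with
`call-process-region`, since rg searches its standard input when it is
given no path arguments.

```elisp
;; Feed the current buffer to ripgrep via stdin and show the output.
;; Standard Emacs/rg usage; the name is hypothetical, not from p-search.
(defun my/rg-current-buffer (pattern)
  "Search the current buffer for PATTERN using ripgrep on stdin."
  (interactive "sPattern: ")
  (let ((output (get-buffer-create "*rg-buffer*")))
    (with-current-buffer output
      (erase-buffer))
    ;; rg reads from stdin when no path arguments are given.
    (call-process-region (point-min) (point-max) "rg" nil output nil
                         "--line-number" "--color=never" "-e" pattern)
    (display-buffer output)))
```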

-   Q: Thanks for making this lovely thing, I\'m looking forward to
    trying it out.  It seems modular and well thought out. Questions
    about integration and about the interface.
    -   A: project.el is used to search only in the local files of the
        project (as is done by default).

-   Q: How happy are you with the interface?
    -   A: p-search goes over all the files trying to find the best
        matches. Many features could be added, e.g., to improve
        debuggability (is this result highly ranked due to a bug? due to
        a high weight? due to many matching documents?).
    -   A: Hopefully it will be on ELPA at some point, with proper
        documentation.

-   Q: Remembering searches is not available everywhere (rg.el? though
    AI packages like gptel already have it). It is also useful for using
    the document in the future.
    -   A: Retrieval-augmented generation: p-search could be used for
        the search part, combined with an AI to fine-tune the search in
        a question-and-answer workflow, although there is currently no
        API for this.
    -   (gptel author here: I\'m looking forward to seeing if I can use
        gptel with p-search)
    -   A: Since the results are surprisingly good, why is that not used
        anywhere else? There is a lot of setup to get it right: you need
        something like Emacs with a lot of configuration (transient
        helps with that), without scaring the users away.
    -   Everyone uses Emacs differently, so it is unclear how people
        will really use it. (PlasmaStrike) For example, consult-omni
        (elfeed-tube, \...) can search multiple webpages at the same
        time, with orderless; however, no webpage offers this option,
        and somehow those tools stay Emacs-only. (Corwin Brust) This is
        the strength of Emacs: people invest a lot of time today to
        improve their workflow tomorrow. \[see the text-editor learning
        curves graphic: Emacs vs. nano vs. vim\]
    -   [https://github.com/armindarvish/consult-omni](https://github.com/armindarvish/consult-omni){rel="noreferrer noopener"}
    -   [https://github.com/karthink/elfeed-tube](https://github.com/karthink/elfeed-tube){rel="noreferrer noopener"}
    -   [https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/](https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/){rel="noreferrer noopener"}
    -   A: Emacs is not the most beginner-friendly, but the solution
        space is very large.
    -   (Corwin Brust) Emacs supports all approaches and is extensible.
        (PlasmaStrike) YouTube is much larger, but somehow does not have
        this nice, sane interface.

-   Q: Do you think Emacs being kind of slow will get in the way of
    being able to run a lot of scoring algorithms?
    -   A: The code is currently dumb in a lot of places (like going
        over all the files to calculate a score), but surprisingly that
        is not very slow. Enumerating all the files in the Emacs
        repository and multiplying numbers in Elisp isn\'t really slow.
        But if you have to search inside the files, that will be slow
        without relying on ripgrep or a faster tool. Take for example
        searching the Elisp info files: the search in Elisp is almost
        instant. For human-size document collections it is probably fast
        enough \-- and if not, there is room for optimization. For
        company-size collections (like huge repos), it could be too
        slow.
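
As a rough, hedged illustration of that point (toy factors and a
placeholder directory, nothing from p-search): enumerating every file
in a large checkout and multiplying a couple of per-file numbers can be
timed with `benchmark-run`, and the arithmetic is negligible next to
any pass that has to read file contents.

```elisp
;; Time a toy scoring pass: list all files under a checkout and multiply
;; two cheap per-file factors.  Directory and factors are placeholders.
(let ((files (directory-files-recursively "~/src/emacs" ".*")))
  (benchmark-run 1
    (dolist (file files)
      (* (if (string-suffix-p ".el" file) 0.9 0.1)
         (if (string-match-p "search" file) 0.7 0.3)))))
;; => (ELAPSED-SECONDS GC-RUNS GC-SECONDS)
```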

-   Q: When do you have to make something more complicated to scale
    better?
    -   A: I do not really know yet. I try to automate tasks as much as
        possible, as in the Emacs configuration meme about not doing the
        work because you have to do the configuration. Usually I do not
        add web-based things into Emacs.

## Notes

-   I like the dedicated-buffer interface (I\'m assuming using
    magit-section and transient).
-   \<meain\> Very interesting ideas. I was very happy when I was able
    to do simple filters with orderless, but this is great \[11:46\]
-   \<NullNix\> I dunno about you, but I want to start using p-search
    yesterday. (possibly integrating lsp-based tokens somehow\...)
    \[11:44\]
-   \<codeasone\> Awesome job Ryota, thank you for sharing! 


[[!inline pages="internal(2024/info/p-search-after)" raw="yes"]]

[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]