[[!meta title="p-search: a local search engine in Emacs"]]
[[!meta copyright="Copyright © 2024 Zac Romero"]]
[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]
<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. -->
# p-search: a local search engine in Emacs
Zac Romero - <mailto:zacromero@posteo.com>
[[!inline pages="internal(2024/info/p-search-before)" raw="yes"]]
Search is an essential part of any digital work. Despite this
importance, most tools don't go beyond simple string/regex matching.
Oftentimes, a user knows more about what they're looking for: who
authored the file, how often it's modified, as well as search terms
they are only slightly confident exist.
p-search is a search engine designed to combine these various kinds of
prior knowledge about the search target, presenting the results to the
user in a systematic way. In this talk, I will present this package as
well as go over the fundamentals of information retrieval.
Details:
In this talk, I will go over p-search. p-search is a search engine
to assist users in finding things, with a focus on flexibility and
customizability.
The talk will begin by going over concepts from the field of information
retrieval such as indexing, querying, ranking, and evaluating. This
will provide the necessary background to describe the workings of
p-search.
Next, an overview of the p-search package and its features will be
given. p-search utilizes a probabilistic framework to rank documents
according to prior beliefs about what the target file is. So, for
example, a user might know for sure that the file contains a particular
string, might have a strong feeling that it should contain another
word, and might suspect it contains some other words. The user may also
know the file extension and the subdirectory, and may know that a
particular person works on this file a lot. p-search allows the user to
express all of these predicates at once, and ranks documents
accordingly.
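To make the ranking idea concrete, here is a minimal sketch in Emacs
Lisp (not p-search's actual API) of how independent priors can be
combined into a single score by summing log-probabilities, which is
equivalent to multiplying the priors:

```elisp
;; A minimal sketch, not p-search's real machinery: each prior maps a
;; document to a probability, and documents are ranked by the sum of
;; the log-probabilities of all priors.
(defun my/combined-score (doc priors)
  "Score DOC with PRIORS, a list of functions returning probabilities."
  (apply #'+ (mapcar (lambda (prior) (log (funcall prior doc))) priors)))

;; Hypothetical usage: favor .el files and paths mentioning "search".
(my/combined-score
 "notes/search.el"
 (list (lambda (doc) (if (string-suffix-p ".el" doc) 0.9 0.1))
       (lambda (doc) (if (string-match-p "search" doc) 0.7 0.3))))
;; => (+ (log 0.9) (log 0.7)) ≈ -0.462
```

The actual package handles prior weighting and calibration in a more
principled way; this only illustrates the combination step.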
The talk will then progress to assorted topics concerning the
project, such as design considerations and future directions.
The aim of the talk is to expand the listeners' understanding of search
as well as inspire creativity concerning the possibilities of search
tools.
Code: <https://github.com/zkry/p-search>
# Discussion
## Questions and answers
- Q: Do you think a reduced version of this functionality could be
  integrated into isearch? Right now you can turn on various flags
  when using isearch with `M-s <key>`, like `M-s SPC` to match spaces
  literally. Is it possible to add a flag to "search the buffer
  semantically"? (Ditto with `M-x occur`, which is more similar to your
  buffer-oriented results interface.)
    - A: It's essentially a framework, so you would create a candidate
      generator for that; but it does not exist yet.
- Q: Any idea how this would work with personal information like a
  Zettelkasten?
    - A: Usable as is, because all the files are in a directory, so
      you only have to restrict the search to those files. You can
      then add priors to ignore some files (like daily notes).
      Documentation is coming.
- Q: How well does the search work for synonyms, especially if you use
  different languages?
    - A: There is an entire subfield of search concerned with
      normalizing input terms (like plural -> singular
      transformations). Currently p-search does not address this.
    - A: For different languages it gets complicated (vector search is
      possible, but might be too slow in Elisp).
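For illustration, a toy version of such a normalization step might look
like this in Emacs Lisp; real stemmers such as the Porter stemmer are
far more involved, and, as noted above, p-search does not currently do
this:

```elisp
;; A toy plural -> singular step: drop a trailing "s" from long-enough
;; terms.  Purely illustrative; real stemming is much more involved.
(defun my/naive-singularize (term)
  "Crudely map a plural TERM to its singular by dropping a final s."
  (if (and (> (length term) 3) (string-suffix-p "s" term))
      (substring term 0 -1)
    term))

(my/naive-singularize "engines") ;; => "engine"
```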
- Q: When searching by author, I know authors may set up a new machine
  and not put in exactly the same information. Is this doing anything
  to combine those into one author?
    - A: It currently uses the git command, so if you know the emails
      the author has used, you can add priors for each of them.
- Q: Rak is a cool, more powerful grep written in Raku, and it may
  have some good ideas for increasing the value of searches, for
  example using Raku code while searching. Have you seen it?
- [https://github.com/lizmat/App-Rak](https://github.com/lizmat/App-Rak)
- [https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D](https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D)
- A: I will have to look into that. Including tree-sitter ASTs
  would also be cool, to enable better searches.
- Q: Have you thought about integrating results from using cosine
  similarity with a deep-learning-based vector embedding? This would
  let us search for "fruit" and get back results that have "apple"
  or "grapes" in them -- that kind of thing. It would probably also
  handle the case of terms that could be abbreviated/formatted
  differently, as in your initial example.
    - A: This goes back to semantic search. It probably can be
      implemented, but it would also probably be too slow. And it is
      hard to get the embeddings and the whole system running on the
      machine.
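The similarity computation itself is simple; the sketch below assumes
the embedding vectors already exist, which (as the answer notes) is the
hard part:

```elisp
;; Cosine similarity between two embedding vectors: the dot product
;; divided by the product of the vector norms.
(defun my/cosine-similarity (a b)
  "Return the cosine similarity of equal-length float vectors A and B."
  (let ((dot 0.0) (na 0.0) (nb 0.0))
    (dotimes (i (length a))
      (let ((x (aref a i)) (y (aref b i)))
        (setq dot (+ dot (* x y))
              na  (+ na (* x x))
              nb  (+ nb (* y y)))))
    (/ dot (* (sqrt na) (sqrt nb)))))

(my/cosine-similarity [1.0 0.0 1.0] [0.5 0.5 0.5]) ;; ≈ 0.816
```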
- Q: I missed the start of the talk, so apologies if this has been
  covered: is it possible to save/bookmark searches or search
  templates so they can be used again and again?
    - A: Exactly. I just recently added bookmarking capabilities, so
      we can bookmark and rerun our searches from where we left off.
      I tried to create a one-to-one mapping between a search session
      and a data representation of it (there is a command to get this
      custom plist), so that the search can be resumed where we left
      off, and so the plist can be used to create a command that
      triggers a prior search.
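As a rough illustration of that idea, a saved search might be captured
as a plist like the following; the keys and values here are
hypothetical, not p-search's actual serialization format:

```elisp
;; Hypothetical shape of a saved search; field names are illustrative,
;; not p-search's real format.
(defvar my/saved-search
  '(:candidate-generator filesystem
    :priors ((:type query  :value "parser")
             (:type author :value "someone@example.com")))
  "A data representation of a search session, suitable for bookmarking.")
```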
- Q: You mentioned candidate generators. Could you explain what the
  score is assigned to? Is it a line, or whatever the candidate
  generator produces? How does it work with rg in your demo?
  FOLLOW-UP: How does the git scoring hook into this?
    - A: A candidate generator produces documents. Documents have
      properties (like an id and a path), and from those you get
      subproperties like the content of the document. Each candidate
      generator knows how to search its kind of documents (emails,
      buffers, files, URLs, ...). There is only the notion of score +
      document.
    - Then another method is used to extract the matching lines in
      the document (to show precisely where it matches).
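A rough sketch of that document model, with illustrative names rather
than p-search's actual types: a generator yields documents with
properties, and a per-kind method knows how to fetch a document's
content.

```elisp
;; Illustrative sketch of the described document model.  A generator
;; produces document plists; a dispatch function fetches content
;; according to the document's kind.
(defun my/buffer-candidate-generator ()
  "Produce one document per live buffer."
  (mapcar (lambda (buf) (list :id (buffer-name buf) :kind 'buffer))
          (buffer-list)))

(defun my/document-content (doc)
  "Fetch the text of DOC according to its :kind."
  (pcase (plist-get doc :kind)
    ('buffer (with-current-buffer (plist-get doc :id) (buffer-string)))
    ('file   (with-temp-buffer
               (insert-file-contents (plist-get doc :id))
               (buffer-string)))))
```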
- Q: Hearing about this makes me think of how nice the emergent
  workflow with denote is, using easy filtering with orderless. It is
  really easy to search for file tags, titles, etc. and do things with
  them. Did this or something like it help or influence the design of
  p-search?
    - A: You can search for whatever you want. Nothing is hardcoded
      (files, directories, tags, titles, ...).
- Q: \[comments from IRC\] \<NullNix\> git covers the "multiple
  names" thing itself: see .mailmap (example below) 10:51:19
    - \<NullNix\> this is a git feature, p-search shouldn't need to
      implement it 10:51:34
    - \<NullNix\> To me this seems to have similarities to notmuch --
      honestly I want notmuch with the p-search UI :) (of course,
      notmuch uses a Xapian index, because repeatedly grepping all
      traffic on huge mailing lists would be insane.) 10:55:30
    - \<NullNix\> (notmuch also has bookmark-like things as a core
      feature, but no real weighting like p-search does.) 10:56:07
    - A: I have not used notmuch, but many extensions are possible.
      mu4e uses a full index for the search. This could be adapted
      here too, with the SQL database as a source.
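For reference, a `.mailmap` file at the repository root maps stray
commit identities to one canonical author; the entries below are made
up:

```
# Both commit identities below resolve to one canonical author.
Jane Doe <jane@example.com> <jane@old-laptop.local>
Jane Doe <jane@example.com> Jane D <jdoe@work.example>
```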
- Q: You can search a buffer using ripgrep by feeding it in as stdin
  to the ripgrep process, can't you?
    - A: Yes, you can. But the aim is to be able to search many
      different kinds of things from Elisp, so p-search has a
      mechanism to represent anything, including buffers. This is
      working pretty well.
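For example, the stdin approach from the question can be sketched with
`call-process-region`, which pipes a region of the current buffer to an
external process; the results-buffer name here is arbitrary:

```elisp
;; Pipe the current buffer's text to ripgrep on stdin and collect the
;; matches in a results buffer.
(defun my/rg-this-buffer (pattern)
  "Run ripgrep with PATTERN over the current buffer's contents."
  (interactive "sPattern: ")
  (let ((out (get-buffer-create "*rg-buffer*")))
    (with-current-buffer out (erase-buffer))
    (call-process-region (point-min) (point-max)
                         "rg" nil out nil "--line-number" pattern)
    (display-buffer out)))
```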
- Q: Thanks for making this lovely thing, I'm looking forward to
  trying it out. It seems modular and well thought out. Questions
  about integration and about the interface:
    - A: project.el is used to search only in the local files of the
      project (as is done by default).
- Q: How happy are you with the interface?
    - A: p-search goes over all the files trying to find the best
      matches. Many features could be added, e.g., to improve
      debuggability (is this document ranked highly due to a bug? Due
      to a high weight? Due to many matching documents?).
    - A: Hopefully it will be on ELPA at some point, with proper
      documentation.
- Q: Remembering searches is not available everywhere (rg.el?, though
  AI packages like gptel already have it). It is also useful for using
  the documents in the future.
    - A: Retrieval-augmented generation: p-search could be used for
      the search, combining it with an AI to fine-tune the search with
      a Q&A workflow. Although currently there is no API for this.
    - (gptel author here: I'm looking forward to seeing if I can use
      gptel with p-search)
    - A: Since the results are surprisingly good, why is this not used
      anywhere else? There is a lot of setup to get it right. You need
      something like Emacs with a lot of configuration options
      (transient helps with that) without scaring the users.
    - Everyone uses Emacs differently, so it is unclear how people
      will really use it. (PlasmaStrike) For example, consult-omni
      (elfeed-tube, ...) searches multiple webpages at the same time,
      with orderless. However, no webpage offers this option; somehow
      those tools stay Emacs-only. (Corwin Brust) This is the strength
      of Emacs: people invest a lot of time to improve their future
      workflow. \[see the comic on Emacs vs. nano vs. vim learning
      curves\]
- [https://github.com/armindarvish/consult-omni](https://github.com/armindarvish/consult-omni)
- [https://github.com/karthink/elfeed-tube](https://github.com/karthink/elfeed-tube)
- [https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/](https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/)
    - A: Emacs is not the most beginner-friendly, but the solution
      space is very large.
    - (Corwin Brust) Emacs supports all approaches and is extensible.
      (PlasmaStrike) YouTube is much larger, but somehow does not have
      this nice, sane interface.
- Q: Do you think Emacs being kind of slow will get in the way of
  being able to run a lot of scoring algorithms?
    - A: The code currently is dumb in a lot of places (like going
      over all files to calculate a score), but surprisingly that is
      not that slow. Elisp enumerating all the files in the Emacs repo
      and multiplying numbers isn't really slow. But if you have to
      search inside files, this will be slow without relying on
      ripgrep or a faster tool. Take for example the search in the
      info files / elisp info files: the search in elisp is almost
      instant. For human-sized document sets it is probably fast
      enough, and if not, there is room for optimization. For
      company-sized document sets (like big repos), it could be too
      slow.
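One way to check this claim on your own machine is to time a
deliberately dumb pass over a repo with `benchmark-run`; the path below
is an assumption, so point it at any large checkout:

```elisp
;; Time enumerating every .el file in a checkout and doing some
;; arithmetic per file.
(require 'benchmark)
(benchmark-run 1
  (dolist (f (directory-files-recursively "~/src/emacs" "\\.el\\'"))
    (* (length f) 1.5)))
;; => (seconds-elapsed garbage-collections gc-seconds)
```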
- Q: When do you have to make something more complicated so that it
  scales better?
    - A: I do not really know yet. I try to automate tasks as much as
      possible, as in the Emacs configuration meme: instead of doing
      the work I have to do, I work on the configuration. Usually I do
      not add web-based things to Emacs.
## Notes
- I like the dedicated-buffer interface (I'm assuming it uses
  magit-section and transient).
- \<meain\> Very interesting ideas. I was very happy when I was able
  to do simple filters with orderless, but this is great \[11:46\]
- \<NullNix\> I dunno about you, but I want to start using p-search
  yesterday. (possibly integrating lsp-based tokens somehow...)
  \[11:44\]
- \<codeasone\> Awesome job Ryota, thank you for sharing!
[[!inline pages="internal(2024/info/p-search-after)" raw="yes"]]
[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]