[[!meta title="p-search: a local search engine in Emacs"]]
[[!meta copyright="Copyright © 2024 Zac Romero"]]
[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]

# p-search: a local search engine in Emacs
Zac Romero -

[[!inline pages="internal(2024/info/p-search-before)" raw="yes"]]

Search is an essential part of any digital work. Despite this importance, most tools don't go beyond simple string/regex matching. Oftentimes, a user knows more about what they're looking for: who authored the file, how often it's modified, as well as search terms that the user is only slightly confident exist.

p-search is a search engine designed to combine the various pieces of prior knowledge about the search target, presenting them to the user in a systematic way. In this talk, I will present this package as well as go over the fundamentals of information retrieval.

Details:

In this talk, I will go over p-search. p-search is a search engine to assist users in finding things, with a focus on flexibility and customizability.

The talk will begin by going over concepts from the field of information retrieval such as indexing, querying, ranking, and evaluating. This will provide the necessary background to describe the workings of p-search.

Next, an overview of the p-search package and its features will be given. p-search utilizes a probabilistic framework to rank documents according to prior beliefs as to what the file is. So for example, a user might know for sure that the file contains a particular string, might have a strong feeling that it should contain another word, and might think that it may contain some other words. The user knows the file extension and the subdirectory, and knows that a particular person works on this file a lot. p-search allows the user to express all of these predicates at once, and ranks documents accordingly.

The talk will then progress to discuss assorted topics concerning the project, such as design considerations and future directions.
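
The idea of ranking documents by several prior beliefs at once can be illustrated with a minimal Python sketch. This is not p-search's implementation (p-search is written in Emacs Lisp, and its scoring model is not spelled out in this abstract). The sketch only assumes that each piece of prior knowledge gives every document a probability, and that these probabilities are multiplied and normalized. All names (`Document`, `contains`, `has_extension`, `authored_by`, `rank`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Document:
    path: str
    author: str
    text: str


# A prior maps a document to a probability in [0, 1], judged by one
# piece of evidence alone. (Assumption made for this sketch.)
Prior = Callable[[Document], float]


def _weighted(matches: Callable[[Document], bool], confidence: float) -> Prior:
    """Matching documents get `confidence`; the rest get 1 - confidence."""
    def prior(doc: Document) -> float:
        return confidence if matches(doc) else 1.0 - confidence
    return prior


def contains(term: str, confidence: float) -> Prior:
    return _weighted(lambda doc: term in doc.text, confidence)


def has_extension(ext: str, confidence: float) -> Prior:
    return _weighted(lambda doc: doc.path.endswith(ext), confidence)


def authored_by(name: str, confidence: float) -> Prior:
    return _weighted(lambda doc: doc.author == name, confidence)


def rank(docs: List[Document], priors: List[Prior]) -> List[Tuple[Document, float]]:
    """Multiply all priors per document, normalize, and sort best-first."""
    scores = []
    for doc in docs:
        score = 1.0
        for prior in priors:
            score *= prior(doc)
        scores.append(score)
    total = sum(scores) or 1.0  # avoid dividing by zero if every score is 0
    ranked = [(doc, score / total) for doc, score in zip(docs, scores)]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical sample data.
docs = [
    Document("src/search.el", "alice", "index query rank"),
    Document("src/ui.el", "bob", "render query"),
    Document("README.md", "alice", "search index"),
]
priors = [
    contains("query", 0.95),      # almost sure the file contains this string
    contains("rank", 0.70),       # strong feeling about this word
    has_extension(".el", 0.90),   # fairly sure about the file extension
    authored_by("alice", 0.60),   # this person works on the file a lot
]
ranked = rank(docs, priors)
for doc, probability in ranked:
    print(f"{probability:.3f}  {doc.path}")
```

A confidence of 0.5 makes a prior neutral, while values close to 1.0 behave almost like a hard filter, which is one simple way to express "know for sure" versus "have a feeling".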

The aim of the talk is to expand the listeners' understanding of search as well as inspire creativity concerning the possibilities of search tools.

Code:

# Discussion

## Questions and answers

- Q: Do you think a reduced version of this functionality could be integrated into isearch? Right now you can turn on various flags when using isearch with M-s key bindings, like M-s SPC to match spaces literally. Is it possible to add a flag to "search the buffer semantically"? (Ditto with M-x occur, which is more similar to your buffer-oriented results interface.)
    - A: It's essentially a framework, so you would create a generator for that, but it does not exist yet.
- Q: Any idea how this would work with personal information like Zettelkastens?
    - A: Usable as is, because all the files are in a directory, so you only have to set the files to search in. You can then add information to ignore some files (like daily notes). Documentation is coming.
- Q: How well does the search work for synonyms, especially if you use different languages?
    - A: There is an entire field of search devoted to translating the word that is input in order to normalize it (like plural -> singular transformation). Currently p-search does not address this.
    - A: For different languages it gets complicated (vector search is possible, but might be too slow in Elisp).
- Q: When searching by author, I know authors may set up a new machine and not put the exact same information. Is this doing anything to combine those into one author?
    - A: Currently it uses the git command. So if you know the emails the author has used, you can add different priors.
- Q: "Rak" is a cool, more powerful grep, written in Raku, that may have some good ideas for increasing the value of searches, for example using Raku code while searching. Have you seen it?
    - [https://github.com/lizmat/App-Rak](https://github.com/lizmat/App-Rak)
    - [https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D](https://www.youtube.com/watch?v=YkjGNV4dVio&t=167s&pp=ygURYXBwIHJhayByYWt1IGdyZXA%3D)
    - A: I have to look into that. Including the tree-sitter AST would also be cool, to get a better search.
- Q: Have you thought about integrating results from using cosine similarity with a deep-learning based vector embedding? This will let us search for "fruit" and get back results that have "apple" or "grapes" in them -- that kind of thing. It will probably also handle the case of terms that could be abbreviated/formatted differently, like in your initial example.
    - A: This goes back to semantic search. It can probably be implemented, but it is also probably too slow. And it is hard to get the embeddings and the system running on the machine.
- Q: I missed the start of the talk, so apologies if this has been covered - is it possible to save/bookmark searches or search templates so they can be used again and again?
    - A: Exactly. I just recently added bookmarking capabilities, so we can bookmark and rerun our searches from where we left off. I tried to create a one-to-one mapping from the search session to a data representation of the search (there is a command to do this). It produces a custom plist from which the search can be resumed where we left off, and which can be used to create a command that triggers a prior search.
- Q: You mentioned candidate generators. Could you explain what the score is assigned to? Is it to a line or to whatever the candidate generator produces? How does it work with rg in your demo? FOLLOW-UP: How does the git scoring thingy hook into this?
    - A: A candidate generator produces documents. Documents have properties (like an id and a path). From those you get subproperties like the content of the document.
      Each candidate generator knows how to search in its own kind of source (emails, buffers, files, urls, ...). There is only the notion of score + document.
    - Then another method is used to extract the lines that match in the document (to show precisely the lines that match).
- Q: Hearing about this makes me think about how nice the emergent workflow is with denote using easy filtering with orderless. It is really easy to search for file tags, titles etc. and do things with them. Did this or something like this help or influence the design of p-search?
    - A: You can search for whatever you want. Nothing is hardcoded for any particular kind of thing (files, directories, tags, titles, ...).
- Q: [comments from IRC] git covers the "multiple names" thing itself: see .mailmap 10:51:19
    - this is a git feature, p-search shouldn't need to implement it 10:51:34
    - To me this seems to have similarities to notmuch -- honestly I want notmuch with the p-search UI :) (of course, notmuch uses a xapian index, because repeatedly grepping all traffic on huge mailing lists would be insane.) 10:55:30
    - (notmuch also has bookmark-like things as a core feature, but no real weighting like p-search does.) 10:56:07
    - A: I have not used notmuch, but many extensions are possible. mu4e uses a full index for the search. This could be adapted here too, with the SQL database as a source.
- Q: You can search a buffer using ripgrep by feeding it in as stdin to the ripgrep process, can't you?
    - A: Yes, you can. But the aim is to search many different things in Elisp, so there is a mechanism in p-search anyway to be able to represent anything, including buffers. This is working pretty well.
- Q: Thanks for making this lovely thing, I'm looking forward to trying it out. Seems modular and well thought out. Questions about integration and about the interface.
    - A: project.el is used to search only in the local files of the project (as done by default).
- Q: How happy are you with the interface?
    - A: p-search goes over all the files trying to find the best match. Many features can be added, e.g., to improve debuggability (is this highly ranked due to a bug? due to a high weight? many matching documents?)
    - A: Hopefully it will be on ELPA at some point, with proper documentation.
- Q: Remembering searches is not available everywhere (rg.el? but AI packages like gptel already have it). It is also useful for using the document in the future.
    - A: Retrieval-augmented generation: p-search could be used for the search, combining it with an AI to fine-tune the search with a Q-A workflow. Currently there is no API for this, though.
    - (gptel author here: I'm looking forward to seeing if I can use gptel with p-search)
    - A: As the results are surprisingly good, why is that not used anywhere else? But there is a lot of setup to get it right. You need something like Emacs with a lot of configuration (transient is helping to do that) without scaring the users.
    - Everyone uses Emacs differently, so it is unclear how people will really use it. (PlasmaStrike) For example, consult-omni (elfeed-tube, ...) searches multiple webpages at the same time, with orderless. However, no webpage offers this option. Somehow those tools stay in Emacs only. (Corwin Brust) This is the strength of Emacs: people invest a lot of time to improve their workflow for tomorrow. [see xkcd on emacs learning curve vs nano vs vim]
        - [https://github.com/armindarvish/consult-omni](https://github.com/armindarvish/consult-omni)
        - [https://github.com/karthink/elfeed-tube](https://github.com/karthink/elfeed-tube)
        - [https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/](https://www.reddit.com/r/ProgrammerHumor/comments/9d6f19/text_editor_learning_curves_fixed/)
    - A: Emacs is not the most beginner friendly, but the solution space is very large.
    - (Corwin Brust) Emacs supports all approaches and is extensible. (PlasmaStrike) YouTube is much larger, but somehow does not have this nice sane interface.
- Q: Do you think Emacs being kinda slow will get in the way of being able to run a lot of scoring algorithms?
    - A: The code currently is dumb in a lot of places (like going over all files to calculate a score), but surprisingly that is not that slow. Enumerating all files and multiplying numbers in Elisp for the Emacs repo isn't really slow. But if you have to search inside the files, this will be slow without relying on ripgrep or a faster tool. Take for example the search in info files / Elisp info files: the search in Elisp is almost instant. For human-size document collections it is probably fast enough -- and if not, there is room for optimizations. For company-size document collections (like repos), it could be too slow.
- Q: When do you have to make something more complicated to scale better?
    - A: I do not really know yet. I try to automate tasks as much as possible, like in the Emacs configuration meme "not doing work I have to do the configuration". Usually I do not add web-based things into Emacs.

## Notes

- I like the dedicated-buffer interface (I'm assuming using magit-section and transient).
- Very interesting ideas. I was very happy when I was able to do simple filters with orderless, but this is great [11:46]
- I dunno about you, but I want to start using p-search yesterday. (possibly integrating lsp-based tokens somehow...) [11:44]
- Awesome job Ryota, thank you for sharing!

[[!inline pages="internal(2024/info/p-search-after)" raw="yes"]]

[[!inline pages="internal(2024/info/p-search-nav)" raw="yes"]]