[[!meta title="Collaborative data processing and documenting using org-babel"]]
[[!meta copyright="Copyright © 2023 Jonathan Hartman, Lukas C. Bossert"]]
[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]

# Collaborative data processing and documenting using org-babel
Jonathan Hartman (he/him), Lukas C. Bossert (he/him)

[[!inline pages="internal(2023/info/collab-before)" raw="yes"]]

In our presentation we will show an efficient way of combining information and enriching it by retrieving data, processing it, and finally exporting it, all with org-mode. We will demonstrate not only org-mode itself but also a few companion libraries that add knowledge graph visualization, literate programming, and collaborative editing, so that a deeply informative reference page can be created quickly.

The starting point of our best-practice example is the National Research Data Infrastructure Germany (NFDI), about which we intend to retrieve and process information gathered from Wikidata. For this, we additionally use the "org-roam" Emacs package, which makes it quick and simple to link notes and ideas into a custom knowledge graph.

Initially, we will write a short abstract about the NFDI and embed it into our existing knowledge graph by linking it to other nodes. The visualized graph (using the “org-roam-ui” package) then reveals links and secondary connections to other existing nodes.

Next, we would like to enrich the text about the NFDI with data retrieved from the Wikidata API. A convenient way of creating self-documenting code is the approach called “literate programming”, which presents program logic embedded within human-language text. In Emacs we achieve this with the “org-babel” package.

Perhaps now we find it helpful to collaborate with a colleague in the document: while one is writing the code, the other can explain its use and interpret the results.
We will do this simultaneously in the same document using a method called “crdt” (conflict-free replicated data type) and, of course, there is also an implementation of this in Emacs.

The results of the code blocks can be used for further analysis and shared throughout the same document.

Finally, for the sake of proper and barrier-free documentation, we show how to export the document to various formats such as PDF, HTML, and plain text, using either the built-in export feature of org-mode or pandoc.

About the speakers:

**Jonathan Hartman** is a trained data scientist and works at the IT Center of the RWTH Aachen University, Germany.

**Lukas C. Bossert** is a trained classical archaeologist and is deputy head of the department "research process and data management" at the IT Center of the RWTH.

Lukas, an intermediate Emacs user, is currently exploring how to optimize his daily workflow by leveraging various Emacs packages. Jonathan, on the other hand, is a relative newcomer to this environment and is encountering the common pitfalls faced by beginners. Together, they explore the capabilities of org-mode, discovering how it can enhance data management and presentation in their research processes.

[[!img /i/emacsconf-2023-collab-sponsorship.png alt="Lukas and Jonathan are financed by the DKZ.2R Datenkompetenzkolleg Rhein-Ruhr (16DKZ2030E), www.dks2r.de"]]

# Discussion

## Questions and answers

- Q: How reliably does it resolve conflicts? I mean, in my personal use case with, for example, Syncthing, it sometimes doesn't work perfectly and I have had to edit things manually. How robust is it compared to Syncthing?
    - A (Lukas): We also sometimes faced issues where letters got mixed up. We couldn't figure out what caused it, and it was not reproducible. I cannot compare it to Syncthing; I have never used that with Emacs/org-mode.
- Q: How's the security for this kind of thing?
  I mean, if we adopt these things in our PAD, can this execute arbitrary (Elisp) code on other people's computers? (Think like an adversary!)
    - A: (Lukas) As far as we saw, the code is executed on the local computer; see the part with the R code in our video.
        - (zaeph) We had plans with qhong (maintainer of crdt.el) to tunnel the connection via SSL, but we were blocked by the SSL library that shipped with Emacs, sadly. However, we did create a security policy that allowed restrictions on the execution of Elisp code. (great!)
- Q: Really nice talk and demo! You guys clearly rehearsed :). I always wonder, with sequential data processing like this, to what degree do the intermediate outputs need to appear inline in the text? Suppose you had 50,000 or one million rows from your initial Wikidata (or similar) call. How would you handle that size of data using a collaborative, literate approach like this?
    - A: (Lukas) Good question. In your local buffer there is no difference; for the collaborating partner I cannot tell. We tested it with 50 items because that was enough to demonstrate our purpose.
    - noweb allows getting the results of evaluation without having to put the actual data into the Org buffer - just arrange for the original block generating the data to have `:results silent`. Basically, `:var foo=block-name` does not require "block-name" to be evaluated in advance - it will be evaluated as necessary. AFAIU, in the talk, it is re-evaluated every time (to avoid that, one would need `:cache yes`).
        - This has tremendous utility.
        - So it would be stored on disk and referenced by name in a subsequent block? Sounds useful.
            - Not on disk - just cached within a single session. To store it on disk, you need to save it to an actual file.
- Q: How do you handle viewing larger, or really any, tabular data in Emacs/Org when you want to inspect it, like the nice way tabular data is displayed inline in Rmarkdown/RStudio?
    - A: (Lukas) I have no particular way of doing this.
    - What about pandas data summary functionality? It can be a simple Python block.
        - Lukas: Jonathan is our Python expert; he might answer this question.
        - A: (Jonathan) If I follow, you can certainly just use `DataFrame.describe()` or `Series.describe()` to get summary statistics for a dataset. The return value would be a Series or a DataFrame, which would be displayed similarly to how we show things here. Alternatively, `DataFrame.head(n)` or `DataFrame.sample(n)` would return a DataFrame of the first n rows or n random rows of a dataset, and might be a way of providing the gist of a very large dataset without printing the entire table in the document.
    - It would be nice to have a "summarized table" functionality in Org that includes an abridged copy of a long table inline, but lets you open it in another buffer to browse/edit the full table (à la block edit).
        - Feel free to post a feature request - see
- Q: I'm thinking about an application for a single user, but on different platforms. In a simple case, for example, you have a buffer on your local computer, and you also want to have some files on your pad or on your phone, and you can use this CRDT concept to make sure that there's not too much conflict between different editing sections. Do you think this is a good idea? I mean, compared to relying purely on Syncthing, which I sometimes feel is unreliable for resolving those conflicts.
    - A: (Lukas) This sounds very interesting and could be beneficial for continuously working on things.

## Notes

- I like the way you highlight the point you are talking about in real time.
- Conflict-free Replicated Data Types (CRDT) ::
- This is the future of PAD for our conference!
- Just came here to say watching two users editing the same buffer simultaneously is BLOWING MY MIND
    - BLOWING MY MIND +2
    - blowing my mind, too ...
    - WOW
- Gitlab custom-export.setup
    - What about it?
        - I am looking for that setup file and want to try it :) -->
            - Thank you!
- Truly one of the most impressive talks of the day. Congrats! Very inspiring.
    - Yes, indeed.
    - (Lukas) Wow! Thank you. We weren't sure if this was worth showing at EmacsConf because there have already been plenty of talks about literate programming and org-babel...
- Great collaborative conversation and step-wise example creates a different (and impactful) framing. Thank you!
- crdt is fantastic; pity that most (all but one) of my collaborators use Word & VS Code. 🙁
- that's really cool. One of the parts that's a bit hidden from the user is seeing the format that the data is in inside the shell script
    - it is whatever constitutes the closest equivalent of a table in sh (array)
    - yeah, you have to keep the representation in mind when filtering it as text through sed
- this demo is so cool :D
- Really, really impressive, I have to admit
- HA. You cannot evaluate in place so seamlessly in that way with Rmarkdown :). And you cannot combine named blocks in this way either. Wish more folks used Emacs.
- wow, so `#+CALL` can be embedded in text via `call_()`? TIL
- such a slick presentation, I like the CRDT collaboration angle, looks like an end-game UX
- Impressive workflow!
- great presentation!
- For those of you who remember the bad old days before "reproducible research," that talk is even more impressive. Great job!
    - i was prolly not there in the bad old days, but imho reproducible research is a pressing, current problem.
- I feel like that talk video should be shared on Hacker News

[[!inline pages="internal(2023/info/collab-after)" raw="yes"]]

[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]