[[!meta title="Collaborative data processing and documenting using org-babel"]]
[[!meta copyright="Copyright &copy; 2023 Jonathan Hartman, Lukas C. Bossert"]]
[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. --->


# Collaborative data processing and documenting using org-babel
Jonathan Hartman (he/him), Lukas C. Bossert (he/him) - <https://mastodon.social/@lukascbossert>, <mailto:hartman@itc.rwth-aachen.de>, <mailto:bossert@itc.rwth-aachen.de>

[[!inline pages="internal(2023/info/collab-before)" raw="yes"]]

In our presentation we will show an efficient way of combining
information and enriching it by retrieving data, processing it, and
finally exporting it, all with org-mode. We will
demonstrate not only org-mode, but also a few companion libraries that
add functionality such as knowledge graph visualizations, literate
programming, and collaborative editing to quickly create a deeply
informative reference page.

The starting point of our best practice is the National Research Data
Infrastructure Germany (NFDI), about which we intend to retrieve and
process information gathered from Wikidata. For this, we
are additionally leveraging the "org-roam" Emacs package, which
provides functionality for quickly and simply linking together notes
and ideas into a custom knowledge graph. Initially, we will write a
short abstract about the NFDI and embed it into our existing knowledge
graph by linking it to other existing nodes. In the visualized graph
(using the “org-roam-ui” package), links and secondary connections to
other existing nodes can now be revealed.

Next, we would like to enrich the text about the NFDI with data
retrieved from the Wikidata API. A convenient way of creating
self-documenting code is the approach called “literate programming”,
which presents program logic embedded within human language text. In
Emacs we achieve this by using the “org-babel” package. Perhaps now we
find it is helpful to collaborate with a colleague in the document:
while one is writing the code, the other can explain its use and
interpret the results. We will do this simultaneously in the same
document using a method called “crdt” (conflict-free replicated data
type) and – of course – there is also an implementation of this in
Emacs. The results of the code blocks can be used for further analysis
and shared throughout the same document.
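
To make the idea behind a conflict-free replicated data type concrete, here is a minimal Python sketch of a replicated text buffer. It is a toy illustration under stated assumptions, not the algorithm used by the Emacs implementation: the `Replica` class, its methods, and the fractional position scheme are all hypothetical. Each character gets a unique, sortable identifier, deletions are recorded as tombstones, and merging is a plain union of states, so replicas that have seen the same edits show the same text regardless of the order in which they merged.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Replica:
    """Toy state-based CRDT for text (hypothetical, for illustration only)."""

    site: str                                     # unique name of this replica
    counter: int = 0                              # local edit counter, keeps ids unique
    chars: dict = field(default_factory=dict)     # id -> character
    tombstones: set = field(default_factory=set)  # ids of deleted characters

    def _visible_ids(self):
        # Ids sort by (position, site, counter); deleted characters are skipped.
        return sorted(i for i in self.chars if i not in self.tombstones)

    def insert(self, index, ch):
        ids = self._visible_ids()
        left = ids[index - 1][0] if index > 0 else Fraction(0)
        right = ids[index][0] if index < len(ids) else Fraction(1)
        self.counter += 1
        self.chars[((left + right) / 2, self.site, self.counter)] = ch

    def delete(self, index):
        # Never remove the character itself; only remember that it is deleted.
        self.tombstones.add(self._visible_ids()[index])

    def merge(self, other):
        # Union of both states: commutative, associative, and idempotent.
        self.chars.update(other.chars)
        self.tombstones |= other.tombstones

    def text(self):
        return "".join(self.chars[i] for i in self._visible_ids())


alice, bob = Replica("alice"), Replica("bob")
for position, ch in enumerate("org"):
    alice.insert(position, ch)
bob.merge(alice)              # both replicas now show "org"

alice.insert(3, "!")          # concurrent edit on alice's side
bob.insert(3, "?")            # concurrent edits on bob's side
bob.delete(0)

alice.merge(bob)
bob.merge(alice)
print(alice.text(), bob.text())  # both replicas print the same text
```

The sketch only guarantees convergence. When two sites insert at the same spot concurrently, their characters are ordered by site name rather than by intent, which real implementations handle with more careful identifiers.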

Finally, for the sake of proper and barrier-free documentation, we
show how to export the document to various formats like PDF, HTML, TXT
etc. using either the built-in feature of org-mode or the
implementation of pandoc.

About the speakers:

**Jonathan Hartman** is a trained data scientist and works at the IT 
Center of the RWTH Aachen University, Germany.

**Lukas C. Bossert** is a trained classical archaeologist and is deputy
head of the department "research process and data management" at the
IT Center of the RWTH.

Lukas, an intermediate Emacs user, is currently exploring how to
optimize his daily workflow by leveraging various Emacs packages. On
the other hand, Jonathan is a relative newcomer to this environment,
encountering common pitfalls faced by beginners. Together, they
explore the capabilities and functionalities of org-mode, discovering
how it can enhance data management and presentation in their research
processes.

[[!img /i/emacsconf-2023-collab-sponsorship.png alt="Lukas and Jonathan are financed by the DKZ.2R Datenkompetenzkolleg Rhein-Ruhr (16DKZ2030E), www.dks2r.de"]]

# Discussion

## Questions and answers

-   Q: How reliably does it resolve conflicts? In my personal use
    case with Syncthing, for example, it sometimes does not work
    perfectly and I have to edit manually. How robust is it compared
    to Syncthing?
    -   A (Lukas): We also sometimes faced issues where letters got
        mixed up. We couldn't figure out what caused it, and it was not
        reproducible. I cannot compare it to Syncthing; I have never
        used that with Emacs/org-mode.
-   Q: How's the security for this kind of thing? I mean, if we adopt
    this for our PAD, can it execute arbitrary (Elisp) code on other
    people's computers? (Think like an adversary!)
    -   A: (Lukas) As far as we saw, the code is executed on the local
        computer; see the part with the R code in our video.
    -   (zaeph) We had plans with qhong (maintainer of crdt.el) to
        tunnel the connection via SSL, but we were blocked by the SSL
        library that shipped with Emacs, sadly.  However, we did create
        a security policy that allowed restrictions on the execution of
        Elisp code. (great!)
-   Q: Really nice talk and demo!  You guys clearly rehearsed :).  I
    always wonder with serial data processing sequencing like this, to
    what degree do the intermediate outputs need to appear inline in the
    text?  Suppose you had 50,000 or one million rows from your initial
    wikidata (or similar) call.  How would you handle that size of data
    using a collaborative, literate approach like this?
    -   A: (Lukas) Good question. In your local buffer there is no
        difference, and for the collaborative partner I cannot tell. We
        tested it with 50 items because that was enough to
        demonstrate our purpose.
    -   noweb allows getting the results of an evaluation without having
        to put the actual data into the Org buffer - just give the
        original block that generates the data :results silent. Basically,
        :var foo=block-name does not require "block-name" to be
        evaluated in advance - it will be evaluated as necessary. AFAIU,
        in the talk, it is re-evaluated every time (to avoid that, one
        would need :cache yes).
        -   This has tremendous utility
    -   So it would be stored on disk and referenced by name in a
        subsequent block?  Sounds useful.  
        -   Not on disk - just cached within a single session. To store
            on disk, need to save to actual file on disk.
-   Q: How do you handle the viewing of larger or really any tabular
    data in Emacs/Org when you want to inspect it, like the nice way
    tabular data is displayed inline in Rmarkdown/RStudio?
    -   A: (Lukas) I have no particular way of doing this. 
    -   What about pandas data summary functionality? Can be a simple
        Python block.
    -   Lukas: Jonathan is our Python expert, he might answer this
        question.
    -   A: (Jonathan) If I follow, you can certainly just use
        DataFrame.describe() or Series.describe() to get summary
        statistics for a dataset - the return value would be a Series or
        a DataFrame, which would be displayed similarly to how we show
        things here. Alternatively, DataFrame.head(n) or
        DataFrame.sample(n) would return a dataframe of the first n / n
        random lines of a dataset, and might be a way of providing the
        gist of a very large dataset without printing the entire table
        in the document.
    -   Would be nice to have a "summarized table" functionality in
        Org, that includes an abridged copy of a long table inline, but
        you can open it in another buffer to browse/edit the full table
        (ala block edit).  
        -   Feel free to post a feature request - see
            <https://orgmode.org/manual/Feedback.html#Feedback>
-   Q: I'm thinking about an application for a single user, but on
    different platforms. In a simple case, for example, you have a
    buffer on your local computer, and you also want to have some files
    on your pad or on your phone, and you can use this CRDT concept to
    make sure that there's not too much conflict between different
    editing sections. Do you think this is a good idea? I mean, compared
    to purely relying on Syncthing, which sometimes I feel is unreliable
    for resolving those conflicts.
    -   A: (Lukas) This sounds very interesting and could be beneficial for
        continuously working on things.
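
The pandas calls mentioned in the answers above can be sketched as follows. This is a minimal example under stated assumptions: the data frame is made-up stand-in data with invented column names (`item`, `members`), not the Wikidata result from the talk, and it is written as plain Python rather than as an org-babel block.

```python
import pandas as pd

# Hypothetical stand-in for a large query result; names and values are invented.
df = pd.DataFrame({
    "item": [f"item-{i}" for i in range(1, 1001)],
    "members": [i % 50 for i in range(1, 1001)],
})

summary = df.describe()                     # summary statistics for the numeric column
first_rows = df.head(5)                     # the first 5 rows
random_rows = df.sample(5, random_state=0)  # 5 random rows, seeded so reruns match

print(summary)
print(first_rows)
print(random_rows)
```

Each of the three results stays small enough to show inline, even when the full table has many thousands of rows.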

## Notes

-   I like the way you highlight the point you are talking about in real
    time.
-   Conflict-free Replicated Data Types (CRDT) ::
    <https://github.com/emacs-straight/crdt>
-   This is the future of PAD for our conference.
-   Just came here to say watching two users editing the same buffer
    simultaneously is BLOWING MY MIND 
    -   BLOWING MY MIND  +2
    -   blowing my mind, too ...
    -   WOW
-   Gitlab custom-export.setup
    -   What about it?
        -   I am looking for that setup file and want to try it :) 
            -->
            <https://git.rwth-aachen.de/dl/workshops/collaborative-coding-with-emacs/-/blob/main/emacs/custom-export.setup>
            -   Thank you!
-   Truly one of the most impressive talks of the day. Congrats! Very
    inspiring
    -   Yes, indeed. 
    -   (Lukas) Wow! Thank you. We weren't sure if this is worth showing
        at EmacsConf because there have already been plenty of talks
        about literate programming and org-babel....
        -   Great collaborative conversation and step-wise example
            creates a different (and impactful) framing.  Thank you!
- crdt is fantastic; pity that most (all but one) of my collaborators use Word & VS Code. 🙁
- that's really cool.  One of the parts that's a bit hidden from the user is seeing the format that the data is in inside the shell script
- it is whatever constitutes the closest equivalent of table in sh (array)
  - yeah, you have to keep the representation in mind when filtering it as text through sed
- this demo is so cool :D
- Really, really impressive I have to admit
- HA. you cannot evaluate in place so seamlessly in that way with Rmarkdown :). And you cannot combine named blocks in this way either. Wish more folks used emacs.
- wow, so `#+CALL` can be embedded in text via `call_()?` TIL
- such a slick presentation, I like the CRDT collaboration angle, looks like an end-game UX
- Impressive workflow!
- great presentation!
- For those of you who remember the bad old days before "reproducible research," that talk is even more impressive. Great job!
  - i was prolly not there in the bad old days, but imho reproducible research is a pressing, current problem.
- I feel like that talk video should be shared on Hacker News


[[!inline pages="internal(2023/info/collab-after)" raw="yes"]]

[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]