[[!meta title="Collaborative data processing and documenting using org-babel"]]
[[!meta copyright="Copyright © 2023 Jonathan Hartman, Lukas C. Bossert"]]
[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]
<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. --->
# Collaborative data processing and documenting using org-babel
Jonathan Hartman (he/him), Lukas C. Bossert (he/him) - <https://mastodon.social/@lukascbossert>, <mailto:hartman@itc.rwth-aachen.de>, <mailto:bossert@itc.rwth-aachen.de>
[[!inline pages="internal(2023/info/collab-before)" raw="yes"]]
In this presentation we show an efficient way of combining and
enriching information with org-mode: retrieving data, processing it,
and finally exporting it. We demonstrate not only org-mode itself but
also a few companion libraries that add knowledge graph
visualization, literate programming, and collaborative editing, so
that a deeply informative reference page can be created quickly.

The starting point of our example is the National Research Data
Infrastructure Germany (NFDI), about which we retrieve and process
information from Wikidata. For this, we additionally use the
"org-roam" Emacs package, which provides functionality for quickly
and simply linking notes and ideas into a custom knowledge graph.
First, we write a short abstract about the NFDI and embed it into our
existing knowledge graph by linking it to other nodes. The visualized
graph (using the “org-roam-ui” package) then reveals links and
secondary connections to other existing nodes.

Next, we enrich the text about the NFDI with data retrieved from the
Wikidata API. A convenient way of creating self-documenting code is
“literate programming”, which presents program logic embedded within
human-language text. In Emacs we achieve this with the “org-babel”
package. At this point it is helpful to collaborate with a colleague
in the document: while one writes the code, the other can explain its
use and interpret the results. We do this simultaneously in the same
document using a “crdt” (conflict-free replicated data type) and, of
course, there is also an implementation of this in Emacs. The results
of the code blocks can be used for further analysis and shared
throughout the same document.
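
To give a feel for why concurrent edits need not conflict, here is a minimal insert-only sketch in Python. It is not the algorithm used by the Emacs implementation; the `Replica` class, the fractional positions, and the site names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


class Replica:
    """One participant's copy of an insert-only shared text (illustrative only)."""

    def __init__(self, site):
        self.site = site   # unique name of this participant (assumed unique)
        self.chars = {}    # (position, site) -> character

    def insert_between(self, left, right, ch):
        """Insert ch locally between two positions; return the operation to send."""
        key = ((left + right) / 2, self.site)
        self.chars[key] = ch
        return key, ch

    def apply(self, op):
        """Apply an operation received from another replica."""
        key, ch = op
        self.chars[key] = ch

    def text(self):
        # Sorting by (position, site) gives every replica the same order.
        return "".join(self.chars[key] for key in sorted(self.chars))


alice, bob = Replica("alice"), Replica("bob")

# Both type at the same spot at the same time, without coordination.
op_a = alice.insert_between(Fraction(0), Fraction(1), "A")
op_b = bob.insert_between(Fraction(0), Fraction(1), "B")

# The operations are exchanged afterwards and arrive in different orders.
alice.apply(op_b)
bob.apply(op_a)

print(alice.text(), bob.text())
```

Both replicas end up with the same text even though each saw the two insertions in a different order; the site name only breaks the tie between equal positions. Deletion, and repeated inserts by one site into the same gap, are deliberately left out of this sketch.
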

Finally, for the sake of proper and barrier-free documentation, we
show how to export the document to various formats such as PDF, HTML,
and plain text, using either the built-in export features of org-mode
or Pandoc.

About the speakers:

**Jonathan Hartman** is a trained data scientist and works at the IT
Center of the RWTH Aachen University, Germany.

**Lukas C. Bossert** is a trained classical archaeologist and is deputy
head of the department "research process and data management" at the
IT Center of the RWTH.

Lukas, an intermediate Emacs user, is currently exploring how to
optimize his daily workflow by leveraging various Emacs packages. On
the other hand, Jonathan is a relative newcomer to this environment,
encountering common pitfalls faced by beginners. Together, they
explore the capabilities and functionalities of org-mode, discovering
how it can enhance data management and presentation in their research
processes.
[[!img /i/emacsconf-2023-collab-sponsorship.png alt="Lukas and Jonathan are financed by the DKZ.2R Datenkompetenzkolleg Rhein-Ruhr (16DKZ2030E), www.dks2r.de"]]
# Discussion
## Questions and answers
- Q: How reliably does it resolve conflicts? In my personal use
case, for example with Syncthing, it sometimes does not work
perfectly and I have to edit manually. How robust is it compared
to Syncthing?
- A (Lukas): We also sometimes faced issues where letters got
mixed up. We couldn't figure out what caused it, and it was not
reproducible. I cannot compare it to Syncthing; I have never used
that with Emacs/org-mode.
- Q: How is the security of this kind of thing? I mean, if we adopt
it for our pad, can it execute arbitrary (Elisp) code on other
people's computers? (Think like an adversary!)
- A: (Lukas) As far as we saw, the code is executed on the local
computer; see the part with the R code in our video.
- (zaeph) We had plans with qhong (maintainer of crdt.el) to
tunnel the connection via SSL, but we were blocked by the SSL
library that shipped with Emacs, sadly. However, we did create
a security policy that allowed restrictions on the execution of
Elisp code. (great!)
- Q: Really nice talk and demo! You guys clearly rehearsed :). I
always wonder with serial data processing sequencing like this, to
what degree do the intermediate outputs need to appear inline in the
text? Suppose you had 50,000 or one million rows from your initial
wikidata (or similar) call. How would you handle that size of data
using a collaborative, literate approach like this?
- A: (Lukas) Good question. In your local buffer there is no
difference, and for the collaborating partner I cannot tell. We
tested it with 50 items because that was enough to demonstrate
our purpose.
- noweb allows getting results of evaluation without having to put
the actual data into the Org buffer - just arrange the original
block generating the data to have `:results silent`. Basically,
`:var foo=block-name` does not require "block-name" to be
evaluated in advance - it will be evaluated as necessary. AFAIU,
in the talk, it is re-evaluated every time (to avoid that, one
would need `:cache yes`).
- This has tremendous utility
- So it would be stored on disk and referenced by name in a
subsequent block? Sounds useful.
- Not on disk - just cached within a single session. To store
on disk, need to save to actual file on disk.
- Q: How do you handle the viewing of larger or really any tabular
data in Emacs/Org when you want to inspect it, like the nice way
tabular data is displayed inline in Rmarkdown/RStudio?
- A: (Lukas) I have no particular way of doing this.
- What about pandas data summary functionality? Can be a simple
python block.
- Lukas: Jonathan is our python expert, he might answer this
question.
- A: (Jonathan) If I follow, you can certainly just use
DataFrame.describe() or Series.describe() to get summary
statistics for a dataset - the return value would be a Series or
a DataFrame, which would be displayed similarly to how we show
things here. Alternatively, DataFrame.head(n) or
DataFrame.sample(n) would return a DataFrame of the first n rows
or n random rows of a dataset, and might be a way of providing
the gist of a very large dataset without printing the entire
table in the document.
- Would be nice to have a "summarized table" functionality in
Org, that includes an abridged copy of a long table inline, but
you can open it in another buffer to browse/edit the full table
(ala block edit).
- Feel free to post a feature request - see
<https://orgmode.org/manual/Feedback.html#Feedback>
- Q: I'm thinking about an application for a single user, but on
different platforms. In a simple case, for example, you have a
buffer on your local computer, and you also want to have some files
on your pad or on your phone, and you can use this CRDT concept to
make sure that there's not too much conflict between different
editing sections. Do you think this is a good idea? I mean, compared
to purely relying on Syncthing, which sometimes I feel is unreliable
for resolving those conflicts.
- A: (Lukas) This sounds very interesting and could be beneficial for
continuously working on things.
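
Jonathan's suggestion above for getting the gist of a large table can be sketched roughly as follows. The DataFrame here is a made-up stand-in for a large query result (the column names `item` and `members` are assumptions, not data from the talk); in an Org document this would live in a Python source block.

```python
import pandas as pd

# Hypothetical stand-in for a large result set; names and values are invented.
df = pd.DataFrame({
    "item": [f"Q{i}" for i in range(1, 1001)],
    "members": [i % 50 for i in range(1, 1001)],
})

summary = df.describe()                      # summary statistics of numeric columns
first_rows = df.head(5)                      # the first 5 rows
random_rows = df.sample(5, random_state=0)   # 5 random rows, seeded for repeatability

print(summary)
print(first_rows)
print(random_rows)
```

Each of the three results is itself a small DataFrame, so only a handful of rows would appear inline instead of the full table.
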
## Notes
- I like the way you highlight the point you are talking about in real
time.
- Conflict-free Replicated Data Types (CRDT) ::
<https://github.com/emacs-straight/crdt>
- !This is the future of PAD for our conference.
- Just came here to say watching two users editing the same buffer
simultaneously is BLOWING MY MIND
- BLOWING MY MIND +2
- blowing my mind, too ...
- WOW
- Gitlab custom-export.setup
- What about it?
- I am looking for that setup file and want to try it :)
-->
<https://git.rwth-aachen.de/dl/workshops/collaborative-coding-with-emacs/-/blob/main/emacs/custom-export.setup>
- Thank you!
- Truly one of the most impressive talks of the day. Congrats! Very
inspiring
- Yes, indeed.
- (Lukas) Wow! Thank you. We weren't sure whether this was worth
showing at EmacsConf because there have already been plenty of
talks about literate programming and org-babel...
- Great collaborative conversation and step-wise example
creates a different (and impactful) framing. Thank you!
- crdt is fantastic; pity that most (all but one) of my collaborators use Word & VS Code. 🙁
- that's really cool. One of the parts that's a bit hidden from the user is seeing the format that the data is in inside the shell script
- it is whatever constitutes the closest equivalent of table in sh (array)
- yeah, you have to keep the representation in mind when filtering it as text through sed
- this demo is so cool :D
- Really, really impressive I have to admit
- HA. you cannot evaluate in place so seamlessly in that way with Rmarkdown :). And you cannot combine named blocks in this way either. Wish more folks used emacs.
- wow, so `#+CALL` can be embedded in text via `call_()`? TIL
- such a slick presentation, I like the CRDT collaboration angle, looks like an end-game UX
- Impressive workflow!
- great presentation!
- For those of you who remember the bad old days before "reproducible research," that talk is even more impressive. Great job!
- i was prolly not there in the bad old days, but imho reproducible research is a pressing, current problem.
- I feel like that talk video should be shared on Hacker News
[[!inline pages="internal(2023/info/collab-after)" raw="yes"]]
[[!inline pages="internal(2023/info/collab-nav)" raw="yes"]]