WEBVTT captioned by amine, checked by sachac
NOTE Introduction
00:00.000 --> 00:00:01.874
[Lukas]: Welcome to our presentation,
00:00:01.875 --> 00:00:03.599
Collaborative Data Processing
00:03.600 --> 00:06.039
and Documenting using org-babel.
00:06.040 --> 00:07.759
My name is Lukas Bossert, and I'm
00:07.760 --> 00:00:09.740
from the RWTH Aachen University
00:00:09.741 --> 00:00:12.519
in the city of Aachen, Germany.
00:12.520 --> 00:14.839
[Jonathan]: And my name is Jonathan Hartmann.
00:14.840 --> 00:18.719
I'm also from the IT Center here at RWTH Aachen.
00:18.720 --> 00:19.239
[Lukas]: Great.
00:19.240 --> 00:21.679
And we will show you today how you
00:21.680 --> 00:25.399
can use Org Mode for data processing.
00:25.400 --> 00:27.999
So you see a little workflow of what we are going to do.
00:28.000 --> 00:31.199
First, we will give you a slight introduction to Org Mode.
00:31.200 --> 00:34.639
Then we will dive into the part of data preparing.
00:34.640 --> 00:38.679
First, we're going to query the data using the language SPARQL.
00:38.680 --> 00:41.759
Then we're going to clean it using a different language.
00:41.760 --> 00:44.279
And in the main part of our presentation,
00:44.280 --> 00:48.119
we're going to do the data processing, first aggregating
00:48.120 --> 00:52.519
using Python, later on counting items using Org,
00:52.520 --> 00:56.360
and even visualizing it using R. At the end,
00:56.400 --> 00:58.959
we're going to show you how to preserve
00:58.960 --> 01:01.759
the data and the document and its documentation,
01:01.760 --> 01:06.599
first doing a plain export, then adding some metadata,
01:06.600 --> 01:09.759
and showing you two different ways, first a manual export,
01:09.760 --> 01:13.359
and also then a batch-processed export.
01:13.360 --> 01:14.239
All right.
01:14.240 --> 01:16.079
Let's dive in to that.
NOTE Org Mode
01:16.080 --> 01:19.919
Jonathan, can you give us an introduction about Org Mode?
01:19.920 --> 01:20.439
[Jonathan]: Of course.
01:20.440 --> 01:23.079
So in case anyone isn't familiar with it,
01:23.080 --> 01:25.879
Org Mode, in the words of Carsten Dominik,
01:25.880 --> 01:28.559
is back to the future for plain text.
01:28.560 --> 01:31.439
So this is just a module available for Emacs,
01:31.440 --> 01:32.519
plain-text based.
01:32.520 --> 01:34.919
It's been around since 2003, which
01:34.920 --> 01:36.799
makes it about 20 years old.
01:36.800 --> 01:40.159
And it's extensible and fully customizable.
01:40.160 --> 01:43.999
And especially, it's very convenient, very good
01:44.000 --> 01:46.719
for scientific text production and organization.
01:46.720 --> 01:49.439
So for example, you can do project management, agenda,
01:49.440 --> 01:52.559
diary, journaling, personal knowledge management,
01:52.560 --> 01:53.359
presentation.
01:53.360 --> 01:55.520
Even this is written in Org Mode.
01:55.560 --> 01:57.439
It's an Org Mode presentation.
01:57.440 --> 01:59.199
You can do single source publishing,
01:59.200 --> 02:01.679
which we will do later on, and also
02:01.680 --> 02:06.479
literate programming, which is the core of our talk.
02:06.480 --> 02:06.999
OK.
02:07.000 --> 02:10.799
[Lukas]: So let me stop this presentation here.
02:10.800 --> 02:14.719
So what you see here is the plain text underneath it.
02:14.720 --> 02:18.959
So this is Org Mode.
NOTE Working together
02:18.960 --> 02:21.919
And Jonathan, since we kind of already
02:21.920 --> 02:25.320
did the introduction together, should we
02:26.120 --> 00:02:28.760
also do the working part together?
00:02:28.761 --> 00:02:29.700
[Jonathan]: Of course.
00:02:29.701 --> 00:02:33.119
So you see on the screen there on the right,
00:02:33.120 --> 00:02:35.060
that's my screen in Emacs.
00:02:35.061 --> 00:02:39.520
And Lukas, why don't you host a session using CRDT,
00:02:39.521 --> 00:02:41.200
and I'll connect to your buffer.
00:02:41.201 --> 00:02:42.560
[Lukas]: OK. Great.
00:02:42.561 --> 00:02:43.280
I do that.
00:02:43.281 --> 00:02:46.180
So what I do, I'm using Doom Emacs.
00:02:46.181 --> 00:02:49.307
And I can use the `SPC` and then the `l`
00:02:49.308 --> 00:02:52.140
for the live share/collab part.
00:02:52.141 --> 02:57.999
I can use the `s` for share current buffer.
02:58.000 --> 00:03:01.559
So when I do this, I'm getting asked for some settings.
00:03:01.560 --> 00:03:04.439
I'm going with the default settings here.
00:03:04.440 --> 00:03:08.340
So default port, no password, and my display name.
00:03:08.341 --> 00:03:11.940
And now Emacs is connecting.
00:03:11.941 --> 00:03:15.179
And once it's connected, which just takes a couple of seconds,
00:03:15.180 --> 00:03:17.239
I can get the URL.
00:03:17.240 --> 03:20.800
So I'm going back to this menu and using `y`
03:21.160 --> 03:23.999
for copying the URL of the current session.
03:24.000 --> 03:27.799
And this is the URL I'm going to send over to you, Jonathan,
03:27.800 --> 03:29.079
to pick that up.
03:29.080 --> 03:29.599
[Jonathan]: Right.
03:29.600 --> 03:30.079
OK.
03:30.080 --> 00:03:36.999
And now on my screen, I'm going to do a `SPC l c` for connect.
00:03:37.000 --> 00:03:38.740
And I'm going to paste the URL
00:03:38.741 --> 00:03:40.040
that Lukas just sent me in here.
00:03:40.980 --> 03:43.719
Default port, no password.
03:43.720 --> 00:03:45.440
And we're connecting now.
00:03:45.700 --> 03:48.600
So this takes a second just to get us synced up.
03:51.600 --> 00:03:54.160
So we can work on the same document at the same time.
00:03:54.161 --> 03:56.639
We can follow each other's cursors around.
03:56.640 --> 03:58.839
We can have multiple buffers open and work on them
03:58.840 --> 04:00.999
at the same time.
04:01.000 --> 04:04.719
And so here you see that we are both in the same document.
04:04.720 --> 04:06.280
You can see my cursor popping around.
04:09.040 --> 04:13.279
And you can see we're both editing the same item.
04:13.280 --> 04:14.039
Great.
04:14.040 --> 04:18.039
[Lukas]: So we also see who else is currently in our buffer
04:18.040 --> 04:20.199
with the user overview.
04:20.200 --> 04:23.559
So let me just delete that window.
04:23.560 --> 04:26.079
And let's go to work in our main one.
04:26.080 --> 04:29.599
So we said first part is about data retrieval.
04:29.600 --> 04:32.720
So we should give it a headline.
04:37.080 --> 04:39.239
We said prepare stage.
04:39.240 --> 04:42.319
So what are we going to do first, Jonathan?
04:42.320 --> 00:04:43.940
[Jonathan]: So what we're going to do,
00:04:43.941 --> 00:04:45.399
what this whole document is based upon,
04:45.400 --> 04:50.119
is we're going to pull data from Wikidata using a SPARQL query.
04:50.120 --> 04:53.519
The data we're going to pull is related to the NFDIs,
04:53.520 --> 04:55.639
which here in Germany is the Nationale Forschungsdateninfrastruktur,
04:55.640 --> 05:00.679
which is a sort of collection of universities
05:00.680 --> 05:03.399
that work together on various research projects.
05:03.400 --> 05:05.599
And this is emblematic of the kind of data
05:05.600 --> 05:09.239
that we would be interested in working with here.
05:09.240 --> 05:13.359
So I'm going to paste a--forgive the pre-written code--
05:13.360 --> 05:19.840
I'm going to paste some text in here.
05:20.040 --> 00:05:21.407
[Lukas]: And while you are talking, I just
00:05:21.408 --> 00:05:23.359
keep on documenting what we do
00:05:23.360 --> 00:05:25.880
so we can split the work.
05:27.360 --> 05:29.679
[Jonathan]: In here, after a minor technical upset,
05:29.680 --> 05:32.559
is the raw dataset cell.
05:32.560 --> 00:05:34.740
And it's going to use SPARQL,
00:05:34.741 --> 00:05:37.174
which is how we have the syntax highlighting
00:05:37.175 --> 00:05:37.940
in our code here.
00:05:37.941 --> 05:40.639
It's going to go to the URL endpoint
05:40.640 --> 05:43.639
query.wikidata.org/sparql,
05:43.640 --> 05:46.799
and it's going to return the data as a text CSV,
05:46.800 --> 05:49.279
and it's going to cache that data
05:49.280 --> 05:51.439
so that we don't constantly hammer the API every time
05:51.440 --> 05:54.239
we run this notebook.
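NOTE
A minimal sketch in Python, not the SPARQL cell from the demo. It shows the two
ideas just described: asking query.wikidata.org/sparql for CSV, and caching the
answer so that repeated runs do not hit the API again. The names fetch_csv and
cached_query, the User-Agent string, and the cache layout are assumptions made
for illustration.
```python
import hashlib
import urllib.parse
import urllib.request
from pathlib import Path
#
ENDPOINT = "https://query.wikidata.org/sparql"  # endpoint named in the talk
#
def fetch_csv(query):
    """Send a SPARQL query to the endpoint and return the CSV response text."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; the Accept header asks for CSV output.
        headers={"Accept": "text/csv", "User-Agent": "org-babel-talk-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
#
def cached_query(query, cache_dir, fetch=fetch_csv):
    """Return cached CSV for this exact query text, fetching only on a miss."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / (key + ".csv")
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(query)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```
Keying the cache on a hash of the query text means that editing the query,
for example adding a LIMIT, triggers a fresh request.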
05:54.240 --> 00:05:57.360
So I'm going to run that there.
00:05:57.361 --> 05:58.799
You can see down at the bottom of my screen,
05:58.800 --> 06:00.840
we're contacting the host query.wikidata.org.
06:05.720 --> 06:07.319
[Lukas]: And there's the result.
06:07.320 --> 06:11.799
[Jonathan]: Yeah, except I think that for our purposes here,
06:11.800 --> 06:15.279
we're just going to limit this to 50 results.
06:15.280 --> 06:16.279
[Lukas]: Oh, yeah.
06:16.280 --> 06:18.679
[Jonathan]: Just so it's a little easier for us to manage.
06:18.680 --> 06:20.719
I'm going to run that again.
06:20.720 --> 06:21.519
There we go.
06:21.520 --> 00:06:22.319
That looks a little better.
00:06:22.320 --> 00:06:23.159
[Lukas]: I think that's fine.
00:06:23.160 --> 00:06:25.359
50 items is fine.
00:06:25.360 --> 06:27.839
So what do we see here, Jonathan?
NOTE Data cleaning
06:27.840 --> 06:28.319
[Jonathan]: Right.
06:28.320 --> 06:31.239
So the first thing we see when we look at this
06:31.240 --> 00:06:33.307
is a couple of Q codes at the top,
00:06:33.308 --> 00:06:36.079
which are an artifact of Wikidata.
06:36.080 --> 06:39.519
So these are pages which don't have
06:39.520 --> 06:42.519
the label for whichever institution they happen to be.
06:42.520 --> 06:45.919
For our purposes here, we're just going to exclude them.
06:45.920 --> 06:48.199
We could just go on Wikidata and edit them ourselves.
06:48.200 --> 06:50.399
But for now, it's a little more interesting
06:50.400 --> 06:52.519
if we go and remove them.
06:52.520 --> 06:55.159
So I'm going to create a new cell.
06:55.160 --> 06:58.279
Lukas, if you don't mind starting one for data cleaning.
06:58.280 --> 06:58.879
[Lukas]: Oh, yeah.
06:58.880 --> 06:59.479
Good point.
06:59.480 --> 07:02.039
Yeah, data cleaning.
07:02.040 --> 07:03.439
OK.
07:03.440 --> 00:07:05.499
How do you want to do that, Jonathan?
00:07:05.500 --> 07:09.759
[Jonathan]: I'm going to use a shell command.
07:09.760 --> 07:11.119
So let's see.
07:11.120 --> 07:12.999
There we go.
07:13.000 --> 07:15.159
And so you can see, here is another cell,
07:15.160 --> 07:20.039
that the cell is now using a shell,
07:20.040 --> 00:07:23.799
and that we have this thing `:var input=raw-dataset`,
00:07:23.800 --> 00:07:25.840
which is the name of the cell above
00:07:25.841 --> 00:07:28.439
where we got our data from Wikidata.
07:28.440 --> 07:31.679
This is going to run just a simple shell command.
07:31.680 --> 07:33.959
It's going to take the input and then run `sed` on it
07:33.960 --> 00:07:37.039
and exclude any records which have a Q
00:07:37.040 --> 00:07:41.279
followed by one or more digits afterwards.
07:41.280 --> 07:43.960
That should remove those from our data set.
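NOTE
The filter described here can be sketched in Python as well. This is a
stand-in for the sed command in the demo, not a copy of it. The sample rows
and the second column name are hypothetical.
```python
import re
#
Q_CODE = re.compile(r"Q[0-9]+")  # a Q followed by one or more digits
#
def drop_unlabeled(lines):
    """Keep only the lines that contain no Wikidata Q code."""
    return [line for line in lines if not Q_CODE.search(line)]
#
# Hypothetical sample; the real input is the CSV returned by the query cell.
sample = [
    "wLabel,consortium",
    "Q1234567,NFDI4Memory",
    "Example University A,NFDI4Earth",
]
cleaned = drop_unlabeled(sample)
```
Like a line-based sed delete, this drops a row if the pattern appears anywhere
in it, so a label that happens to contain a Q and digits would also be removed.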
07:44.000 --> 07:45.400
So I'm going to run that.
07:48.640 --> 07:51.039
That seems to have done the trick.
07:51.040 --> 07:51.879
[Lukas]: Great, yeah.
07:51.880 --> 07:52.919
That's really good.
07:52.920 --> 07:55.399
We got rid of all the Q items.
07:55.400 --> 07:55.919
Very good.
07:55.920 --> 07:59.959
So we just have a two-column table: institutions
07:59.960 --> 08:02.759
and consortia.
08:02.760 --> 08:04.039
Very nice.
NOTE Processing
08:04.040 --> 08:08.719
So let's come to our main part, doing some processing.
08:08.720 --> 08:13.560
Let me give you a headline here, process the data.
08:13.640 --> 08:15.519
What do you want to do first?
08:15.520 --> 08:17.599
[Jonathan]: This is not a very complicated data set,
08:17.600 --> 08:19.439
but let's just do some simple counts first.
08:19.440 --> 08:22.199
I'm going to start with Python,
08:22.200 --> 08:25.239
and we're just going to do some aggregation with Python.
08:25.240 --> 08:30.039
Again, I've got some pre-written code here.
08:30.040 --> 08:34.999
You can see that we've started a cell using Python.
08:35.000 --> 08:37.879
The variable `clean_df` now is equal to `clean-dataset`.
08:37.880 --> 00:08:39.707
So we're going to take that data
00:08:39.708 --> 00:08:41.039
that we retrieved from the SPARQL query,
08:41.040 --> 08:42.680
we're going to run it through the cleaning cell,
08:42.720 --> 08:45.239
and then we're going to import it into this cell.
08:45.240 --> 08:47.839
This is just going to do some simple Python aggregation.
08:47.840 --> 00:08:49.007
We're going to import `pandas`,
00:08:49.008 --> 00:08:51.307
which is the Python data science library,
00:08:51.308 --> 00:08:54.839
create a data frame out of our input,
08:54.840 --> 08:57.479
and then aggregate it, grouping on `wLabel`,
08:57.480 --> 08:59.959
and getting a count from that and returning it.
08:59.960 --> 09:01.640
So if we execute that cell...
09:05.040 --> 09:08.879
[Lukas]: Nice, we get institutions and a count.
09:08.880 --> 09:14.119
But what about not ordering it by the alphabet,
09:14.120 --> 09:17.079
but more like ordering by counts?
09:17.080 --> 09:18.439
[Jonathan]: Sure.
09:18.440 --> 09:22.839
So let's do this... `sort_values()`, I think, as the Python.
09:22.840 --> 09:24.919
How does that look?
09:24.920 --> 00:09:27.640
[Lukas]: Better, but I would like to
00:09:27.641 --> 00:09:29.239
have the highest number first
09:29.240 --> 09:32.239
and then ascending.
09:32.240 --> 09:34.719
Well, not ascending, descending.
09:34.720 --> 09:37.600
[Jonathan]: Right, so we can do `ascending=False`.
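NOTE
A sketch of the same aggregation using only the Python standard library,
assuming the rows are (institution label, consortium) pairs. In pandas the
equivalent would be grouping on wLabel, taking the group size, and calling
sort_values(ascending=False). The sample rows are hypothetical.
```python
from collections import Counter
#
def count_by_label(rows):
    """Count rows per institution label, highest count first."""
    counts = Counter(label for label, _consortium in rows)
    # Sort by descending count; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
#
# Hypothetical sample rows standing in for the cleaned data set.
rows = [
    ("University A", "NFDI4Earth"),
    ("University B", "NFDI4Memory"),
    ("University A", "NFDI4Objects"),
    ("University C", "NFDI4Earth"),
    ("University A", "NFDI4Memory"),
    ("University B", "NFDI4Earth"),
]
ranking = count_by_label(rows)
```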
09:39.880 --> 09:42.559
[Lukas]: This is perfect, I'd say.
09:42.560 --> 09:43.079
[Jonathan]: Great.
09:43.080 --> 09:44.079
[Lukas]: Very good.
09:44.080 --> 00:09:46.799
OK, that's nice.
00:09:46.800 --> 09:47.999
We get a good overview here.
09:48.000 --> 09:50.079
But can we also do something else,
09:50.080 --> 09:56.079
like counting how many institutions are
09:56.080 --> 09:57.799
involved in one consortium?
09:57.800 --> 10:00.879
And also using this later on in the text?
10:00.880 --> 00:10:05.040
[Jonathan]: Sure, so I'm going to put a new...
If you give me another heading down here
00:10:05.041 --> 00:10:08.320
for institutions per consortium...
10:12.080 --> 10:16.799
And here we're going to use awk code just to spice things up
10:16.800 --> 10:18.959
and add yet another language in here.
10:18.960 --> 10:22.439
So you can see this is awk.
10:22.440 --> 10:26.279
We're using standard in instead of defining a variable.
10:26.280 --> 10:28.359
But the really interesting thing about this cell
10:28.360 --> 00:10:33.399
is that we have this `:var consortium="NFDI4Memory"`.
10:33.400 --> 00:10:35.640
And what this code is doing is
00:10:35.641 --> 00:10:38.040
it's counting any time it sees
00:10:38.041 --> 00:10:40.279
that particular consortium name
10:40.280 --> 10:41.759
and keeping track of that.
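NOTE
The counting step can be sketched in Python with a default argument playing
the role of the default value of the consortium variable. This is not the awk
code from the demo, and the sample rows are hypothetical.
```python
def inst_count(rows, consortium="NFDI4Memory"):
    """Count the rows whose consortium column equals the given name."""
    return sum(1 for _institution, name in rows if name == consortium)
#
# Hypothetical (institution, consortium) pairs.
rows = [
    ("University A", "NFDI4Memory"),
    ("University B", "NFDI4Earth"),
    ("University C", "NFDI4Memory"),
    ("University A", "NFDI4Earth"),
    ("University D", "NFDI4Earth"),
]
default_count = inst_count(rows)
earth_count = inst_count(rows, "NFDI4Earth")
```
Calling the function without a second argument uses the default name, and
passing another name reuses the same code for a different consortium.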
10:41.760 --> 00:10:43.907
So if we execute this,
00:10:43.908 --> 00:10:45.919
Lukas, why don't you execute this one?
10:45.920 --> 10:49.399
[Lukas]: OK, I'm going to enter it.
10:49.400 --> 10:52.439
And I get a result, NFDI4Memory,
10:52.440 --> 10:58.239
because this is our default value for this variable.
10:58.240 --> 10:59.439
And we get the count.
10:59.440 --> 00:11:01.640
So it's five institutions are involved
00:11:01.641 --> 00:11:04.639
in the NFDI4Memory consortium.
11:04.640 --> 11:07.839
Great, but the very nice thing, what I think,
11:07.840 --> 11:12.519
is here that we can use this code snippet within our text.
11:12.520 --> 11:14.279
So, blended in seamlessly.
11:14.280 --> 11:16.199
Let me give you an example.
11:16.200 --> 11:18.919
I'm writing out the text.
11:18.920 --> 11:27.599
Now we know how many institutions are in...
11:27.600 --> 11:29.239
Give me an example.
11:29.240 --> 11:31.480
I would like to know how many institutions are
11:31.560 --> 11:35.079
involved in NFDI4Objects, which is a consortium.
11:35.080 --> 11:39.239
So I'm writing `call_` and using
11:39.240 --> 00:11:42.607
the name of this snippet here, of this cell,
00:11:42.608 --> 00:11:46.607
which is `inst-count(`,
00:11:46.608 --> 00:11:51.719
and writing my value, `NFDI4Objects`.
11:51.720 --> 11:57.999
As soon as I evaluate this using `C-c C-c`,
11:58.000 --> 12:00.279
I get the result back here.
12:00.280 --> 12:05.159
I can do this even for more.
12:05.160 --> 12:14.039
Or in writing, `call_inst-count`, go with `NFDI4Earth`,
12:14.040 --> 12:16.799
which is another consortium.
12:16.800 --> 12:20.559
`C-c C-c`, it's three institutions.
12:20.560 --> 12:23.439
This can be used throughout your text,
12:23.440 --> 12:26.639
and as soon as the data set from the beginning changes,
12:26.640 --> 12:30.399
maybe with different results when querying Wikidata,
12:30.400 --> 12:35.079
this also will be updated once it's exported.
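NOTE
To illustrate why the inline numbers stay current, here is a small Python
sketch that replaces call tokens in a text with counts computed from the data
each time the text is rendered. The regular expression is a simplification
written for this example and is not how Org parses inline calls. The sample
rows are hypothetical.
```python
import re
#
# Simplified token pattern, assumed for this sketch only.
CALL = re.compile(r'call_inst-count\(consortium="([^"]+)"\)')
#
def inst_count(rows, consortium):
    """Count the rows whose consortium column equals the given name."""
    return sum(1 for _institution, name in rows if name == consortium)
#
def render(text, rows):
    """Replace every inline call token with the count for its consortium."""
    return CALL.sub(lambda match: str(inst_count(rows, match.group(1))), text)
#
rows = [
    ("University A", "NFDI4Earth"),
    ("University B", "NFDI4Earth"),
    ("University C", "NFDI4Earth"),
]
text = 'NFDI4Earth has call_inst-count(consortium="NFDI4Earth") institutions.'
before = render(text, rows)
# A new row in the data changes the rendered sentence without editing the text.
after = render(text, rows + [("University D", "NFDI4Earth")])
```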
12:35.080 --> 12:36.039
Very nice, Jonathan.
NOTE Visualization
12:36.040 --> 00:12:38.974
But I think we did a lot of analysis
00:12:38.975 --> 00:12:41.079
on text and counting things.
12:41.080 --> 12:43.679
Can we also do something more visual?
12:43.680 --> 12:45.199
Show me something.
12:45.200 --> 12:45.759
[Jonathan]: Sure.
12:45.760 --> 12:48.639
So what we can do with this, because we just
12:48.640 --> 12:51.399
have two columns here that are sort of related,
12:51.400 --> 12:53.759
we can build a little network plot out of it.
12:53.760 --> 12:56.999
So let's make a network visualization.
12:57.000 --> 12:59.599
We're going to use the `igraph` library from R
12:59.600 --> 13:02.559
and just plot the edges that we see here.
13:02.560 --> 13:04.239
There we go.
13:04.240 --> 13:11.879
There's my little heading and space.
13:11.880 --> 13:13.479
Here is our code.
13:13.480 --> 13:16.039
Again, just to be fancy and keep using
13:16.040 --> 13:19.719
different languages in here, we set a variable called
13:19.720 --> 13:21.560
`NFDI_edges` equal to `clean-dataset`.
13:21.600 --> 13:23.399
So this, again, is sort of cascading
13:23.400 --> 00:13:25.740
through the original data
00:13:25.741 --> 00:13:28.807
that we pulled from the Wikidata endpoint,
00:13:28.808 --> 00:13:30.959
cleaning that data, and now it's being inserted
13:30.960 --> 13:32.959
into this cell as well.
13:32.960 --> 13:34.239
But you see the difference here.
13:34.240 --> 13:36.839
Instead of exporting a table, what we're saying
13:36.840 --> 13:39.239
is that there will be a graphics file,
13:39.240 --> 13:44.639
and it will be called network-plot.png.
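NOTE
The demo plots the network with igraph in R. As a different, swapped-in
technique, the same two-column edge list can be written out as Graphviz DOT
text from Python. This sketch assumes labels contain no double quotes, and the
graph name nfdi is hypothetical.
```python
def to_dot(rows):
    """Render institution/consortium pairs as an undirected Graphviz graph."""
    lines = ["graph nfdi {"]
    for institution, consortium in rows:
        # One undirected edge per row of the cleaned table.
        lines.append('  "{}" -- "{}";'.format(institution, consortium))
    lines.append("}")
    return "\n".join(lines)
#
# Hypothetical sample rows.
dot_text = to_dot([
    ("University A", "NFDI4Earth"),
    ("University B", "NFDI4Earth"),
])
```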
13:44.640 --> 13:45.119
All right.
13:45.120 --> 13:47.959
And so Lukas, why don't you execute this one?
13:47.960 --> 13:48.759
[Lukas]: There you go.
13:48.760 --> 13:52.919
I can press `C-c C-c`
13:52.920 --> 13:59.159
and I get a nice plot of the network below our cell.
13:59.160 --> 14:01.759
So this is very nice indeed.
NOTE Preserve
14:01.760 --> 14:05.199
So I think it's about time to wrap it up and to export
14:05.200 --> 14:07.959
and to preserve the data and the documentation
14:07.960 --> 14:13.079
that we have in our very last step, calling preserve.
14:13.080 --> 14:16.239
So I would like to do it in two steps.
14:16.240 --> 14:18.600
First, maybe manually exporting it,
14:18.800 --> 14:22.239
but then also doing it in a batch process.
14:22.240 --> 14:27.119
Let me give you some insight into how to do that manual export.
14:27.120 --> 14:30.559
For example, you can do a LaTeX export.
14:30.560 --> 14:34.279
Let me write down the key combination to do that here.
14:34.280 --> 14:44.560
So you press `SPC m e l o`.
14:44.600 --> 14:49.159
Let me show you how this is done.
14:49.160 --> 14:51.439
So I'm pressing `SPC`.
14:51.440 --> 14:55.679
I'm pressing `m`, which is my local leader.
14:55.680 --> 15:01.279
I'm pressing `e`, which is now the `org-export-dispatch`.
15:01.280 --> 15:03.519
And now I have different options I can choose from.
15:03.520 --> 15:07.119
I want to do a LaTeX export because I want to get a PDF.
15:07.120 --> 00:15:08.674
So I'm pressing `l`.
00:15:08.675 --> 00:15:11.479
Now I've got different options available.
15:11.480 --> 15:17.399
So I'm pressing `o` for a PDF file and open that.
15:17.400 --> 15:21.119
Let's see now the code.
15:21.120 --> 15:25.639
Now this is exporting the document.
15:25.640 --> 00:15:29.674
And what we have here is a PDF,
00:15:29.675 --> 00:15:31.974
which contains our workflow in the beginning,
00:15:31.975 --> 00:15:35.707
our bullet points we have here,
00:15:35.708 --> 00:15:37.919
and also the code snippet
15:37.920 --> 15:41.120
that we use for querying the data.
15:41.280 --> 15:43.599
And we have the result below that.
15:43.600 --> 15:46.999
So this is our table with all the data sets.
15:47.000 --> 15:51.879
But as you can see, this is running out of the page.
15:51.880 --> 15:55.679
So this is not very nice using the default settings.
15:55.680 --> 16:00.239
But everything is in this PDF.
16:00.240 --> 16:02.759
I guess we can now show you a way
16:02.760 --> 16:06.519
how to improve this result.
16:06.520 --> 16:07.039
[Jonathan]: Right.
16:07.040 --> 16:09.399
So we have, of course, a version of this
16:09.400 --> 00:16:10.774
that we prepared ahead of time,
00:16:10.775 --> 00:16:14.279
which is more or less identical to the one we just made,
16:14.280 --> 16:17.839
but it has a little more text, a little more explanation,
16:17.840 --> 16:20.559
a little more documentation along with the code.
16:20.560 --> 16:23.879
You can see we have some metadata up at the top,
16:23.880 --> 16:26.879
the title, the authors, a bibliography,
16:26.880 --> 16:31.679
and most importantly, the `custom-export.setup` file,
16:31.680 --> 16:36.879
which lists specifically the sort of LaTeX commands
16:36.880 --> 16:43.599
that we're using and the HTML styles that we're going to use.
16:43.600 --> 16:45.919
And then down at the bottom of this file,
16:45.920 --> 16:49.119
we have our automatic batch process.
16:49.120 --> 16:51.719
Here is one more language we're including in here.
16:51.720 --> 16:53.439
So this is Lisp.
16:53.440 --> 16:57.359
And you can see here we are exporting to HTML, ASCII,
16:57.360 --> 16:58.079
and PDF.
16:58.080 --> 17:01.359
The nice thing about this is that this is a document.
17:01.360 --> 00:17:03.307
It's a sort of document that we have a couple of
00:17:03.308 --> 00:17:08.639
that we can have running automatically and building.
17:08.640 --> 17:12.919
It will export an HTML, an ASCII file, and a PDF file
17:12.920 --> 00:17:14.674
every time it's run based off of
00:17:14.675 --> 00:17:17.319
the most recent data available on Wikidata.
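NOTE
The demo runs its batch export from a Lisp block inside the document. A
different route, sketched here in Python, is to build one emacs --batch
command line per format using Org's export commands. The file name report.org
is hypothetical, and a real run would also need to load the Babel languages
that the document uses.
```python
EXPORTERS = {
    "html": "org-html-export-to-html",
    "ascii": "org-ascii-export-to-ascii",
    "pdf": "org-latex-export-to-pdf",
}
#
def export_commands(org_file, formats=("html", "ascii", "pdf")):
    """Build one emacs --batch command line per requested export format."""
    commands = []
    for fmt in formats:
        commands.append([
            "emacs", "--batch",
            # Skip the per-block confirmation prompt, which batch mode cannot answer.
            "--eval", "(setq org-confirm-babel-evaluate nil)",
            org_file,
            "-f", EXPORTERS[fmt],
        ])
    return commands
#
commands = export_commands("report.org")
```
Each list can be passed to subprocess.run(command, check=True), for example
from a scheduled job, so that all three outputs are rebuilt from fresh data.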
17:17.320 --> 17:19.719
So it's self-documenting.
17:19.720 --> 00:17:22.440
We have, of course, our data retrieval steps,
00:17:22.441 --> 00:17:25.159
our data cleaning steps, our data preparation steps,
17:25.160 --> 17:28.359
and our preservation steps all listed at the same time.
17:28.360 --> 17:30.239
And then you can see over on the right,
17:30.240 --> 17:34.320
there's an example of the HTML file that we get out of this.
17:34.360 --> 17:37.639
We also get a very nicely formatted PDF file,
17:37.640 --> 17:39.239
which doesn't have that little issue
17:39.240 --> 17:41.719
with the overflow of the table.
17:41.720 --> 17:43.559
It's very nicely put together.
17:43.560 --> 17:46.199
And we even have an ASCII file.
17:46.200 --> 17:47.879
And I should also point out very quickly,
17:47.880 --> 17:51.799
while you have this one up, Lukas, after the awk code,
17:51.800 --> 17:56.079
you can see the text for the number of consortia,
17:56.080 --> 17:57.839
or the number of institutions per consortium,
17:57.840 --> 18:00.519
is actually printed inline.
18:00.520 --> 18:01.799
[Lukas]: Yeah, you're very right.
18:01.800 --> 18:06.119
So this is what we had as code,
18:06.120 --> 18:10.719
and now this is nicely integrated into our text.
18:10.720 --> 18:15.279
So we got the consortium and number of institutions.
18:15.280 --> 18:19.199
You can't tell a difference between code and text.
18:19.200 --> 18:20.719
[Jonathan]: And those are automatically updated.
18:20.720 --> 18:23.879
So if another institution joins NFDI4Earth,
18:23.880 --> 18:26.319
then the next time this runs, we update the text right here.
18:26.320 --> 18:28.519
It's nothing we have to worry about.
18:28.520 --> 18:30.400
We just pull it directly out of Wikidata.
18:31.840 --> 18:34.679
[Lukas]: And for the sake of completeness,
18:34.680 --> 18:37.879
this is the ASCII file.
18:37.880 --> 18:39.320
That's in the export format.
18:42.760 --> 18:46.440
It contains also everything, code and data.
18:48.360 --> 18:51.680
Yeah, so this is what we wanted to show you,
18:53.240 --> 18:56.639
how to do some data processing,
18:56.640 --> 18:58.679
some collaborative work,
18:58.680 --> 19:01.119
documenting using org-babel.
19:01.120 --> 19:03.960
Thanks for listening.
19:05.720 --> 19:07.280
[Jonathan]: Thank you all, have a good day.