WEBVTT captioned by amine, checked by sachac NOTE Introduction 00:00.000 --> 00:00:01.874 [Lukas]: Welcome to our presentation, 00:00:01.875 --> 00:00:03.599 Collaborative Data Processing 00:03.600 --> 00:06.039 and Documenting using org-babel. 00:06.040 --> 00:07.759 My name is Lukas Bossert, and I'm 00:07.760 --> 00:00:09.740 from the RWTH Aachen University 00:00:09.741 --> 00:00:12.519 in the city of Aachen, Germany. 00:12.520 --> 00:14.839 [Jonathan]: And my name is Jonathan Hartmann. 00:14.840 --> 00:18.719 I'm also from the IT Center here at RWTH Aachen. 00:18.720 --> 00:19.239 [Lukas]: Great. 00:19.240 --> 00:21.679 And we will show you today how you 00:21.680 --> 00:25.399 can use Org Mode for data processing. 00:25.400 --> 00:27.999 So you see a little workflow what we are going to do. 00:28.000 --> 00:31.199 First, we will give you a slight introduction to Org Mode. 00:31.200 --> 00:34.639 Then we will dive into the part of data preparing. 00:34.640 --> 00:38.679 First, you're going to query the data using the language SPARQL. 00:38.680 --> 00:41.759 Then we're going to clean it using a different language. 00:41.760 --> 00:44.279 And in the main part of our presentation, 00:44.280 --> 00:48.119 we're going to do the data processing, first aggregating 00:48.120 --> 00:52.519 using Python, later on counting items using Org, 00:52.520 --> 00:56.360 and even visualizing it using R. At the end, 00:56.400 --> 00:58.959 we're going to show you how to preserve 00:58.960 --> 01:01.759 the data and the document and its documentation, 01:01.760 --> 01:06.599 first doing in plain exporting, then adding some metadata, 01:06.600 --> 01:09.759 and showing you two different ways, first a manual export, 01:09.760 --> 01:13.359 and also then a batch-processed export. 01:13.360 --> 01:14.239 All right. 01:14.240 --> 01:16.079 Let's dive in to that. NOTE Org Mode 01:16.080 --> 01:19.919 Jonathan, can you give us an introduction about Org Mode? 
01:19.920 --> 01:20.439 [Jonathan]: Of course. 01:20.440 --> 01:23.079 So in case anyone isn't familiar with it, 01:23.080 --> 01:25.879 Org Mode, in the words of Carsten Dominik, 01:25.880 --> 01:28.559 is back to the future for plain text. 01:28.560 --> 01:31.439 So this is just a module available for Emacs, 01:31.440 --> 01:32.519 plain-text based. 01:32.520 --> 01:34.919 It's been around since 2003, which 01:34.920 --> 01:36.799 makes it about 20 years old. 01:36.800 --> 01:40.159 And it's extensible and fully customizable. 01:40.160 --> 01:43.999 And especially, it's very convenient, very good 01:44.000 --> 01:46.719 for scientific text production and organization. 01:46.720 --> 01:49.439 So for example, you can do project management, agenda, 01:49.440 --> 01:52.559 diary, journaling, personal knowledge management, 01:52.560 --> 01:53.359 presentation. 01:53.360 --> 01:55.520 Even this is written in Org Mode. 01:55.560 --> 01:57.439 It's an Org Mode presentation. 01:57.440 --> 01:59.199 You can do single source publishing, 01:59.200 --> 02:01.679 which we will do later on, and also 02:01.680 --> 02:06.479 literate programming, which is the core of our talk. 02:06.480 --> 02:06.999 OK. 02:07.000 --> 02:10.799 [Lukas]: So let me stop this presentation here. 02:10.800 --> 02:14.719 So what you see here is the plain text underneath it. 02:14.720 --> 02:18.959 So this is Org Mode. NOTE Working together 02:18.960 --> 02:21.919 And Jonathan, since we kind of already 02:21.920 --> 02:25.320 did the introduction together, should we 02:26.120 --> 00:02:28.760 also do the working part together? 00:02:28.761 --> 00:02:29.700 [Jonathan]: Of course. 00:02:29.701 --> 00:02:33.119 So you see on the screen there on the right, 00:02:33.120 --> 00:02:35.060 that's my screen in Emacs. 00:02:35.061 --> 00:02:39.520 And Lukas, why don't you host a session using CRDT, 00:02:39.521 --> 00:02:41.200 and I'll connect to your buffer. 00:02:41.201 --> 00:02:42.560 [Lukas]: OK. Great. 
00:02:42.561 --> 00:02:43.280 I do that. 00:02:43.281 --> 00:02:46.180 So what I do, I'm using Doom Emacs. 00:02:46.181 --> 00:02:49.307 And I can use the `SPC` and then the `l` 00:02:49.308 --> 00:02:52.140 for the live share/collab part. 00:02:52.141 --> 02:57.999 I can use the `s` for share current buffer. 02:58.000 --> 00:03:01.559 So when I do this, I'm getting asked for some settings. 00:03:01.560 --> 00:03:04.439 I'm going with the default settings here. 00:03:04.440 --> 00:03:08.340 So default port, no password, and my display name. 00:03:08.341 --> 00:03:11.940 And now Emacs is connecting. 00:03:11.941 --> 00:03:15.179 And once it's connected, which just takes a couple of seconds, 00:03:15.180 --> 00:03:17.239 I can get the URL. 00:03:17.240 --> 03:20.800 So I'm going back to this menu and using `y` 03:21.160 --> 03:23.999 for copying the URL of the current session. 03:24.000 --> 03:27.799 And this is the URL I'm going to send over to you, Jonathan, 03:27.800 --> 03:29.079 to pick that up. 03:29.080 --> 03:29.599 [Jonathan]: Right. 03:29.600 --> 03:30.079 OK. 03:30.080 --> 00:03:36.999 And now on my screen, I'm going to do a `SPC l c` for connect. 00:03:37.000 --> 00:03:38.740 And I'm going to paste the URL 00:03:38.741 --> 00:03:40.040 that Lukas just sent me in here. 00:03:40.980 --> 03:43.719 Default port, no password. 03:43.720 --> 00:03:45.440 And we're connecting now. 00:03:45.700 --> 03:48.600 So this takes a second just to get us synced up. 03:51.600 --> 00:03:54.160 So we can work on the same document at the same time. 00:03:54.161 --> 03:56.639 We can follow each other's cursors around. 03:56.640 --> 03:58.839 We can have multiple buffers open and work on them 03:58.840 --> 04:00.999 at the same time. 04:01.000 --> 04:04.719 And so here you see that we are both in the same document. 04:04.720 --> 04:06.280 You can see my cursor popping around. 04:09.040 --> 04:13.279 And you can see we're both editing the same item. 04:13.280 --> 04:14.039 Great. 
04:14.040 --> 04:18.039 [Lukas]: So we also see who else is currently in our buffer 04:18.040 --> 04:20.199 with the user overview. 04:20.200 --> 04:23.559 So let me just delete that window. 04:23.560 --> 04:26.079 And that's going to work in our main one. 04:26.080 --> 04:29.599 So we said first part is about data retrieval. 04:29.600 --> 04:32.720 So we should give it a headline. 04:37.080 --> 04:39.239 We said prepare stage. 04:39.240 --> 04:42.319 So what are we going to do first, Jonathan? 04:42.320 --> 00:04:43.940 [Jonathan]: So what we're going to do, 00:04:43.941 --> 00:04:45.399 what this whole document is based upon, 04:45.400 --> 04:50.119 is we're going to pull data from Wikidata using a SPARQL query. 04:50.120 --> 04:53.519 The data we're going to pull is related to the NFDI, 04:53.520 --> 04:55.639 which here in Germany is the Nationale 04:55.640 --> 05:00.679 Forschungsdateninfrastruktur, which is a sort of collection of universities 05:00.680 --> 05:03.399 that work together on various research projects. 05:03.400 --> 05:05.599 And this is emblematic of the kind of data 05:05.600 --> 05:09.239 that we would be interested in working with here. 05:09.240 --> 05:13.359 So I'm going to paste a--forgive the pre-written code-- 05:13.360 --> 05:19.840 I'm going to paste some text in here. 05:20.040 --> 00:05:21.407 [Lukas]: And while you are talking, I just 00:05:21.408 --> 00:05:23.359 keep on documenting what we do 00:05:23.360 --> 00:05:25.880 so we can split the work. 05:27.360 --> 05:29.679 [Jonathan]: In here, after a minor technical upset, 05:29.680 --> 05:32.559 is the raw dataset cell. 05:32.560 --> 00:05:34.740 And it's going to use SPARQL, 00:05:34.741 --> 00:05:37.174 which is how we have the syntax highlighting 00:05:37.175 --> 00:05:37.940 in our code here. 
00:05:37.941 --> 05:40.639 It's going to go to the URL endpoint 05:40.640 --> 05:43.639 query.wikidata.org/sparql , 05:43.640 --> 05:46.799 and it's going to return the data as a text CSV, 05:46.800 --> 05:49.279 and it's going to cache that data 05:49.280 --> 05:51.439 so that we don't constantly hammer the API every time 05:51.440 --> 05:54.239 we run this notebook. 05:54.240 --> 00:05:57.360 So I'm going to run that there. 00:05:57.361 --> 05:58.799 You can see down at the bottom of my screen, 05:58.800 --> 06:00.840 we're contacting the host query.wikidata.org . 06:05.720 --> 06:07.319 [Lukas]: And there's the result. 06:07.320 --> 06:11.799 [Jonathan]: Yeah, except I think that for our purposes here, 06:11.800 --> 06:15.279 we're just going to limit this to 50 results. 06:15.280 --> 06:16.279 [Lukas]: Oh, yeah. 06:16.280 --> 06:18.679 [Jonathan]: Just so it's a little easier for us to manage. 06:18.680 --> 06:20.719 I'm going to run that again. 06:20.720 --> 06:21.519 There we go. 06:21.520 --> 00:06:22.319 That looks a little better. 00:06:22.320 --> 00:06:23.159 [Lukas]: I think that's fine. 00:06:23.160 --> 00:06:25.359 50 items is fine. 00:06:25.360 --> 06:27.839 So what do we see here, Jonathan? NOTE Data cleaning 06:27.840 --> 06:28.319 [Jonathan]: Right. 06:28.320 --> 06:31.239 So the first thing we see when we look at this 06:31.240 --> 00:06:33.307 is a couple of Q codes at the top, 00:06:33.308 --> 00:06:36.079 which are an artifact of Wikidata. 06:36.080 --> 06:39.519 So these are pages which don't have 06:39.520 --> 06:42.519 the label for whichever institution they happen to be. 06:42.520 --> 06:45.919 For our purposes here, we're just going to exclude them. 06:45.920 --> 06:48.199 We could just go on Wikidata and edit them ourselves. 06:48.200 --> 06:50.399 But for now, it's a little more interesting 06:50.400 --> 06:52.519 if we go and remove them. 06:52.520 --> 06:55.159 So I'm going to create a new cell. 
06:55.160 --> 06:58.279 Lukas, if you don't mind starting one for data cleaning. 06:58.280 --> 06:58.879 [Lukas]: Oh, yeah. 06:58.880 --> 06:59.479 Good point. 06:59.480 --> 07:02.039 Yeah, data cleaning. 07:02.040 --> 07:03.439 OK. 07:03.440 --> 00:07:05.499 How do you want to do that, Jonathan? 00:07:05.500 --> 07:09.759 [Jonathan]: I'm going to use a shell command. 07:09.760 --> 07:11.119 So let's see. 07:11.120 --> 07:12.999 There we go. 07:13.000 --> 07:15.159 And so you can see, here is another cell, 07:15.160 --> 07:20.039 that the cell is now using a shell, 07:20.040 --> 00:07:23.799 and that we have this thing `:var input=raw-dataset`, 00:07:23.800 --> 00:07:25.840 which is the name of the cell above 00:07:25.841 --> 00:07:28.439 where we got our data from Wikidata. 07:28.440 --> 07:31.679 This is going to run just a simple shell command. 07:31.680 --> 07:33.959 It's going to take the input and then run `sed` on it 07:33.960 --> 00:07:37.039 and exclude any records which have a Q 00:07:37.040 --> 00:07:41.279 followed by one or more digits afterwards. 07:41.280 --> 07:43.960 That should remove those from our data set. 07:44.000 --> 07:45.400 So I'm going to run that. 07:48.640 --> 07:51.039 That seems to have done the trick. 07:51.040 --> 07:51.879 [Lukas]: Great, yeah. 07:51.880 --> 07:52.919 That's really good. 07:52.920 --> 07:55.399 We got rid of all the Q items. 07:55.400 --> 07:55.919 Very good. 07:55.920 --> 07:59.959 So we just have two-column table: institutions 07:59.960 --> 08:02.759 and consortia. 08:02.760 --> 08:04.039 Very nice. NOTE Processing 08:04.040 --> 08:08.719 So let's come to our main part, doing some processing. 08:08.720 --> 08:13.560 Let me give you a headline here, process the data. 08:13.640 --> 08:15.519 What do you want to do first? 08:15.520 --> 08:17.599 [Jonathan]: This is not a very complicated data set, 08:17.600 --> 08:19.439 but let's just do some simple counts first. 
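The cleaning step described above can also be sketched in Python. This is only an illustration, not the cell from the talk, which pipes the raw CSV through `sed`: it drops every line that contains a Q followed by one or more digits. The sample rows, the second column name `cLabel`, and the helper name `drop_unlabeled_rows` are assumptions made for the example.

```python
import re

# Matches a bare Wikidata identifier such as Q12345.
Q_CODE = re.compile(r"Q[0-9]+")


def drop_unlabeled_rows(csv_text: str) -> str:
    """Return csv_text without the lines that contain a Q-identifier."""
    kept = [line for line in csv_text.splitlines() if not Q_CODE.search(line)]
    return "\n".join(kept)


# Made-up sample input; only the column name wLabel comes from the talk.
raw = "\n".join([
    "wLabel,cLabel",
    "Q1234567,Consortium A",
    "University X,Consortium A",
    "University Y,Q7654321",
])

print(drop_unlabeled_rows(raw))
```

Like the rule stated in the talk, this filter is deliberately blunt: a legitimate label that happens to contain a Q followed by digits would be removed as well.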
08:19.440 --> 08:22.199 I'm going to start with Python, 08:22.200 --> 08:25.239 and we're just going to do some aggregation with Python. 08:25.240 --> 08:30.039 Again, I've got some pre-written code here. 08:30.040 --> 08:34.999 You can see that we've started a cell using Python. 08:35.000 --> 08:37.879 The variable `clean_df` now is equal to `clean-dataset`. 08:37.880 --> 00:08:39.707 So we're going to take that data 00:08:39.708 --> 00:08:41.039 that we retrieved from the SPARQL query, 08:41.040 --> 08:42.680 we're going to run it through the cleaning cell, 08:42.720 --> 08:45.239 and then we're going to import it into this cell. 08:45.240 --> 08:47.839 This is just going to do some simple Python aggregation. 08:47.840 --> 00:08:49.007 We're going to import `pandas`, 00:08:49.008 --> 00:08:51.307 which is the Python data science library, 00:08:51.308 --> 00:08:54.839 create a data frame out of our input, 08:54.840 --> 08:57.479 and then aggregate it, grouping on `wLabel`, 08:57.480 --> 08:59.959 and getting a count from that and returning it. 08:59.960 --> 09:01.640 So if we execute that cell... 09:05.040 --> 09:08.879 [Lukas]: Nice, we get institutions and a count. 09:08.880 --> 09:14.119 But what about not ordering it by the alphabet, 09:14.120 --> 09:17.079 but more like ordering by counts? 09:17.080 --> 09:18.439 [Jonathan]: Sure. 09:18.440 --> 09:22.839 So let's do this... `sort_values()`, I think, as the Python. 09:22.840 --> 09:24.919 How does that look? 09:24.920 --> 00:09:27.640 [Lukas]: Better, but I would like to 00:09:27.641 --> 00:09:29.239 have the highest number first 09:29.240 --> 09:32.239 and then ascending. 09:32.240 --> 09:34.719 Well, not ascending, descending. 09:34.720 --> 09:37.600 [Jonathan]: Right, so we can do `ascending=False`. 09:39.880 --> 09:42.559 [Lukas]: This is perfect, I'd say. 09:42.560 --> 09:43.079 [Jonathan]: Great. 09:43.080 --> 09:44.079 [Lukas]: Very good. 09:44.080 --> 00:09:46.799 OK, that's nice. 
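The aggregation just shown can be sketched without pandas. The talk groups a data frame on `wLabel` and sorts the counts with `ascending=False`; the version below swaps in `collections.Counter` from the standard library to produce the same kind of count table, highest count first. The rows and the `cLabel` column name are invented for illustration.

```python
import csv
import io
from collections import Counter

# Made-up cleaned data; only the column name wLabel comes from the talk.
clean_csv = """wLabel,cLabel
University X,Consortium A
University Y,Consortium A
University X,Consortium B
University Z,Consortium B
University X,Consortium C
"""


def count_by_institution(csv_text: str) -> list[tuple[str, int]]:
    """Count rows per wLabel and return them sorted by count, descending."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["wLabel"] for row in rows)
    # most_common() orders from the highest count to the lowest.
    return counts.most_common()


for label, n in count_by_institution(clean_csv):
    print(f"{label}\t{n}")
```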
00:09:46.800 --> 09:47.999 We get a good overview here. 09:48.000 --> 09:50.079 But can we also do something else, 09:50.080 --> 09:56.079 like counting how many institutions are 09:56.080 --> 09:57.799 involved in one consortium? 09:57.800 --> 10:00.879 And also using this later on in the text? 10:00.880 --> 00:10:00.880 [Jonathan]: Sure, so I'm going to put a new... 00:10:00.881 --> 00:10:05.040 If you give me another heading down here 00:10:05.041 --> 00:10:08.320 for institutions per consortium... 10:12.080 --> 10:16.799 And here we're going to use awk code just to spice things up 10:16.800 --> 10:18.959 and add yet another language in here. 10:18.960 --> 10:22.439 So you can see this is awk. 10:22.440 --> 10:26.279 We're using standard in instead of defining a variable. 10:26.280 --> 10:28.359 But the really interesting thing about this cell 10:28.360 --> 00:10:33.399 is that we have this `:var consortium="NFDI4Memory"`. 10:33.400 --> 00:10:35.640 And what this code is doing is 00:10:35.641 --> 00:10:38.040 it's counting any time it sees 00:10:38.041 --> 00:10:40.279 that particular consortium name 10:40.280 --> 10:41.759 and keeping track of that. 10:41.760 --> 00:10:43.907 So if we execute this, 00:10:43.908 --> 00:10:45.919 Lukas, why don't you execute this one? 10:45.920 --> 10:49.399 [Lukas]: OK, I'm going to enter it. 10:49.400 --> 10:52.439 And I get a result, NFDI4Memory, 10:52.440 --> 10:58.239 because this is our default value for this variable. 10:58.240 --> 10:59.439 And we get the count. 10:59.440 --> 00:11:01.640 So five institutions are involved 00:11:01.641 --> 00:11:04.639 in the NFDI4Memory consortium. 11:04.640 --> 11:07.839 Great, but the very nice thing, what I think, 11:07.840 --> 11:12.519 is here that we can use this code snippet within our text. 11:12.520 --> 11:14.279 So, blended in seamlessly. 11:14.280 --> 11:16.199 Let me give you an example. 11:16.200 --> 11:18.919 I'm writing out the text. 
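The counting cell can be sketched as a Python function with a default argument, mirroring the `:var consortium="NFDI4Memory"` header described above. The talk does this in awk reading standard input; the function name `inst_count` is borrowed from the talk's `inst-count` cell, but the data rows are made up and the column order (institution, then consortium) is an assumption.

```python
import csv
import io

# Made-up rows: institution, consortium. Not the real Wikidata result.
clean_csv = """University X,NFDI4Memory
University Y,NFDI4Memory
University X,NFDI4Earth
"""


def inst_count(csv_text: str, consortium: str = "NFDI4Memory") -> tuple[str, int]:
    """Count the rows whose second column equals the given consortium."""
    reader = csv.reader(io.StringIO(csv_text))
    count = sum(1 for row in reader if len(row) >= 2 and row[1] == consortium)
    return consortium, count


print(inst_count(clean_csv))                # default consortium
print(inst_count(clean_csv, "NFDI4Earth"))  # overriding the default
```

Calling the function with a different argument plays the same role as passing a different value to the named cell: the counting logic is written once and reused with whichever consortium the text needs.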
11:18.920 --> 11:27.599 Now we know how many institutions are in... 11:27.600 --> 11:29.239 Give me an example. 11:29.240 --> 11:31.480 I would like to know how many institutions are 11:31.560 --> 11:35.079 involved in NFDI4Objects, which is a consortium. 11:35.080 --> 11:39.239 So I'm writing `call_` and using 11:39.240 --> 00:11:42.607 the name of this snippet here, of this cell, 00:11:42.608 --> 00:11:46.607 which is `inst-count(`, 00:11:46.608 --> 00:11:51.719 and writing my value, `NFDI4Objects`. 11:51.720 --> 11:57.999 As soon as I evaluate this using `C-c C-c`, 11:58.000 --> 12:00.279 I get the result back here. 12:00.280 --> 12:05.159 I can do this even for more. 12:05.160 --> 12:14.039 Or in writing, `call_inst-count`, go with `NFDI4Earth`, 12:14.040 --> 12:16.799 which is another consortium. 12:16.800 --> 12:20.559 `C-c C-c`, it's three institutions. 12:20.560 --> 12:23.439 This can be used throughout your text, 12:23.440 --> 12:26.639 and as soon as the data set changes from in the beginning, 12:26.640 --> 12:30.399 maybe different results requerying Wikidata, 12:30.400 --> 12:35.079 this also will be updated once it's exported. 12:35.080 --> 12:36.039 Very nice, Jonathan. NOTE Visualization 12:36.040 --> 00:12:38.974 But I think we did a lot of analysis 00:12:38.975 --> 00:12:41.079 on text and counting things. 12:41.080 --> 12:43.679 Can we also do something more visual? 12:43.680 --> 12:45.199 Show me something. 12:45.200 --> 12:45.759 [Jonathan]: Sure. 12:45.760 --> 12:48.639 So what we can do with this, because we just 12:48.640 --> 12:51.399 have two columns here that are sort of related, 12:51.400 --> 12:53.759 we can build a little network plot out of it. 12:53.760 --> 12:56.999 So let's make a network visualization. 12:57.000 --> 12:59.599 We're going to use the `igraph` library from R 12:59.600 --> 13:02.559 and just plot the edges that we see here. 13:02.560 --> 13:04.239 There we go. 13:04.240 --> 13:11.879 There's my little heading and space. 
13:11.880 --> 13:13.479 Here is our code. 13:13.480 --> 13:16.039 Again, just to be fancy and keep using 13:16.040 --> 13:19.719 different languages in here, we set a variable called 13:19.720 --> 13:21.560 `NFDI_edges` equal to `clean-dataset`. 13:21.600 --> 13:23.399 So this, again, is sort of cascading 13:23.400 --> 00:13:25.740 through the original data 00:13:25.741 --> 00:13:28.807 that we pulled from the Wikidata endpoint, 00:13:28.808 --> 00:13:30.959 cleaning that data, and now it's being inserted 13:30.960 --> 13:32.959 into this cell as well. 13:32.960 --> 13:34.239 But you see the difference here. 13:34.240 --> 13:36.839 Instead of exporting a table, what we're saying 13:36.840 --> 13:39.239 is that there will be a graphics file, 13:39.240 --> 13:44.639 and it will be called network-plot.png. 13:44.640 --> 13:45.119 All right. 13:45.120 --> 13:47.959 And so Lukas, why don't you execute this one? 13:47.960 --> 13:48.759 [Lukas]: There you go. 13:48.760 --> 13:52.919 I can click `C-c C-c` 13:52.920 --> 13:59.159 and I get a nice plot of the network below our cell. 13:59.160 --> 14:01.759 So this is very nice indeed. NOTE Preserve 14:01.760 --> 14:05.199 So I think it's about time to wrap it up and to export 14:05.200 --> 14:07.959 and to preserve the data and the documentation 14:07.960 --> 14:13.079 that we have in our very last step, calling preserve. 14:13.080 --> 14:16.239 So I would like to do it in two steps. 14:16.240 --> 14:18.600 First, maybe manually exporting it, 14:18.800 --> 14:22.239 but then also doing it in a batch process. 14:22.240 --> 14:27.119 Giving you some insights how to do that manual export. 14:27.120 --> 14:30.559 For example, you can do a LaTeX export. 14:30.560 --> 14:34.279 Let me write down the key combination to do that here. 14:34.280 --> 14:44.560 So you press `SPC m e l o`. 14:44.600 --> 14:49.159 Let me show you how this is done. 14:49.160 --> 14:51.439 So I'm pressing `SPC`. 
14:51.440 --> 14:55.679 I'm pressing `m`, which is my local leader. 14:55.680 --> 15:01.279 I'm pressing `e`, which is now the `org-export-dispatch`. 15:01.280 --> 15:03.519 And now I have different options I can choose from. 15:03.520 --> 15:07.119 I want to do a LaTeX export because I want to get a PDF. 15:07.120 --> 00:15:08.674 So I'm pressing `l`. 00:15:08.675 --> 00:15:11.479 Now I've got different options available. 15:11.480 --> 15:17.399 So I'm pressing `o` for a PDF file and open that. 15:17.400 --> 15:21.119 Let's see now the code. 15:21.120 --> 15:25.639 Now this is exporting the document. 15:25.640 --> 00:15:29.674 And what we have here is a PDF, 00:15:29.675 --> 00:15:31.974 which contains our workflow in the beginning, 00:15:31.975 --> 00:15:35.707 our bullet points we have here, 00:15:35.708 --> 00:15:37.919 and also the code snippet 15:37.920 --> 15:41.120 that we use for querying the data. 15:41.280 --> 15:43.599 And we have the result below that. 15:43.600 --> 15:46.999 So this is our table with all the data sets. 15:47.000 --> 15:51.879 But as you can see, this is running out of the page. 15:51.880 --> 15:55.679 So this is not very nice using the default settings. 15:55.680 --> 16:00.239 But everything is in this PDF. 16:00.240 --> 16:02.759 I guess we can now show you a way 16:02.760 --> 16:06.519 how to improve this result. 16:06.520 --> 16:07.039 [Jonathan]: Right. 16:07.040 --> 16:09.399 So we have, of course, a version of this 16:09.400 --> 00:16:10.774 that we prepared ahead of time, 00:16:10.775 --> 00:16:14.279 which is more or less identical to the one we just made, 16:14.280 --> 16:17.839 but it has a little more text, a little more explanation, 16:17.840 --> 16:20.559 a little more documentation along with the code. 
16:20.560 --> 16:23.879 You can see we have some metadata up at the top, 16:23.880 --> 16:26.879 the title, the authors, a bibliography, 16:26.880 --> 16:31.679 and most importantly, the `custom-export.setup` file, 16:31.680 --> 16:36.879 which lists specifically the sort of LaTeX commands 16:36.880 --> 16:43.599 that we're using and the HTML styles that we're going to use. 16:43.600 --> 16:45.919 And then down at the bottom of this file, 16:45.920 --> 16:49.119 we have our automatic batch process. 16:49.120 --> 16:51.719 Here is one more language we're including in here. 16:51.720 --> 16:53.439 So this is Lisp. 16:53.440 --> 16:57.359 And you can see here we are exporting to HTML, ASCII, 16:57.360 --> 16:58.079 and PDF. 16:58.080 --> 17:01.359 The nice thing about this is that this is a document. 17:01.360 --> 00:17:03.307 It's a sort of document that we have a couple of 00:17:03.308 --> 00:17:08.639 that we can have running automatically and building. 17:08.640 --> 17:12.919 It will export a HTML, an ASCII file, and a PDF file 17:12.920 --> 00:17:14.674 every time it's run based off of 00:17:14.675 --> 00:17:17.319 the most recent data available on Wikidata. 17:17.320 --> 17:19.719 So it's self-documenting. 17:19.720 --> 00:17:22.440 We have, of course, our data retrieval steps, 00:17:22.441 --> 00:17:25.159 our data cleaning steps, our data preparation steps, 17:25.160 --> 17:28.359 and our preservation steps all listed at the same time. 17:28.360 --> 17:30.239 And then you can see over on the right, 17:30.240 --> 17:34.320 there's an example of the HTML file that we get out of this. 17:34.360 --> 17:37.639 We also get a very nicely formatted PDF file, 17:37.640 --> 17:39.239 which doesn't have that little issue 17:39.240 --> 17:41.719 with the overflow of the table. 17:41.720 --> 17:43.559 It's very nicely put together. 17:43.560 --> 17:46.199 And we even have an ASCII file. 
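A batch export like the one described above can also be driven from outside Emacs. The sketch below only builds the command lines and prints them; it assumes an `emacs` binary on the PATH and uses `workflow.org` as a placeholder file name. The three functions are the standard Org exporters for HTML, ASCII and PDF. The talk's own batch cell is written in Lisp inside the document and may well differ, and evaluating source blocks during a batch export usually needs extra configuration that is not shown here.

```python
import subprocess

# Standard Org export commands for the three formats named in the talk.
EXPORTERS = {
    "html": "org-html-export-to-html",
    "ascii": "org-ascii-export-to-ascii",
    "pdf": "org-latex-export-to-pdf",
}


def export_command(org_file: str, fmt: str) -> list[str]:
    """Build an emacs --batch command line that exports org_file to fmt."""
    return ["emacs", "--batch", org_file, "-f", EXPORTERS[fmt]]


def export_all(org_file: str) -> None:
    """Run every export in turn; raises if Emacs exits with an error."""
    for fmt in EXPORTERS:
        subprocess.run(export_command(org_file, fmt), check=True)


# Placeholder file name; the commands are printed here, not executed.
for fmt in EXPORTERS:
    print(" ".join(export_command("workflow.org", fmt)))
```

Calling `export_all("workflow.org")` from a scheduled job would rebuild all three outputs on each run, which is the effect the speakers describe.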
17:46.200 --> 17:47.879 And I should also point out very quickly, 17:47.880 --> 17:51.799 while you have this one up, Lukas, after the awk code, 17:51.800 --> 17:56.079 you can see the text for the number of consortia, 17:56.080 --> 17:57.839 or the number of institutions per consortium 17:57.840 --> 18:00.519 is actually printed inline. 18:00.520 --> 18:01.799 [Lukas]: Yeah, you're very right. 18:01.800 --> 18:06.119 So this is what we had as code, 18:06.120 --> 18:10.719 and now this is nicely integrated into our text. 18:10.720 --> 18:15.279 So we got the consortium and number of institutions. 18:15.280 --> 18:19.199 You can't tell a difference between code and text. 18:19.200 --> 18:20.719 [Jonathan]: And those are automatically updated. 18:20.720 --> 18:23.879 So if another institution joins NFDI4Earth, 18:23.880 --> 18:26.319 then the next time this runs, we update the text right here. 18:26.320 --> 18:28.519 It's nothing we have to worry about. 18:28.520 --> 18:30.400 We just pull it directly out of Wikidata. 18:31.840 --> 18:34.679 [Lukas]: And for the sake of completeness, 18:34.680 --> 18:37.879 this is the ASCII file. 18:37.880 --> 18:39.320 That's in the export format. 18:42.760 --> 18:46.440 It contains also everything, code and data. 18:48.360 --> 18:51.680 Yeah, so this is what we wanted to show you, 18:53.240 --> 18:56.639 how to do some data processing, 18:56.640 --> 18:58.679 some collaborative work, 18:58.680 --> 19:01.119 documenting using org-babel. 19:01.120 --> 19:03.960 Thanks for listening. 19:05.720 --> 19:07.280 [Jonathan]: Thank you all, have a good day.