# One Big-ass Org File or multiple tiny ones?  Finally, the End of the debate!
Leo Vivier

[[!template id=vid src="https://mirror.csclub.uwaterloo.ca/emacsconf/2020/emacsconf-2020--12-one-big-ass-org-file-or-multiple-tiny-ones-finally-the-end-of-the-debate--leo-vivier.webm" subtitles="/2020/subtitles/emacsconf-2020--12-one-big-ass-org-file-or-multiple-tiny-ones-finally-the-end-of-the-debate--leo-vivier.vtt"]]  
[Download compressed .webm video (22.3M)](https://mirror.csclub.uwaterloo.ca/emacsconf/2020/smaller/emacsconf-2020--12-one-big-ass-org-file-or-multiple-tiny-ones-finally-the-end-of-the-debate--leo-vivier--vp9-q56-video-original-audio.webm)  
[View transcript](#transcript)

Many discussions have been had over the years on the debate between
using few big files versus many small files.  However, more often than
not, those discussions devolve into a collection of anecdotes with
barely any science to them.

Once and for all (or, at least until org-element.el gets overhauled), I
would like to settle the debate by explaining why the way we parse
Org-mode files becomes slower as our files grow in size or numbers,
and how that affects their browsing and the building of custom-agenda
views.

I feel qualified to talk about this topic for two reasons:

-   I went through the trouble of optimising my agenda-views by
    implementing clever regex-based skips, so I know the ceiling that
    can be reached with the current tech.
-   My work on Org-roam has led me to consider the use of an external
    parser for Org-mode files, and whilst we are only at the prototyping
    stage, we know what is at stake.
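
As a rough illustration of what a regex-based skip means in general, the sketch below filters out finished headlines with a cheap per-line regular expression instead of a full parse. It is not the speaker's code (which this page does not show), and it is written in Python rather than Emacs Lisp; the keyword set `DONE`/`CANCELLED` and the function name `open_headlines` are assumptions made for the example.

```python
import re

# Assumed keyword set for finished tasks; real Org setups define their own.
DONE_HEADLINE = re.compile(r"^\*+ (?:DONE|CANCELLED)\b")
# Any Org headline: one or more stars followed by a space.
ANY_HEADLINE = re.compile(r"^\*+ ")


def open_headlines(org_text):
    """Yield headlines that are not marked done, testing each line with a regex only."""
    for line in org_text.splitlines():
        if ANY_HEADLINE.match(line) and not DONE_HEADLINE.match(line):
            yield line


if __name__ == "__main__":
    sample = "* TODO Write slides\n** DONE Pick a title\nSome body text\n* Notes"
    for headline in open_headlines(sample):
        print(headline)
```

The point of the sketch is only that a line-level test can discard entries early; it still has to read every line once.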

I intend the talk to be fairly light-hearted and humorous, which is the
only way we can do true justice to the topic.

<!-- from the pad --->

- Actual start and end time (EST): Start 2020-11-28T13.43.24; Q&A
  2020-11-28T13.51; End: 2020-11-28T14.00.07

# Questions

## What's better: one big file or many small ones? :>
For knowledge management: many files (see also org-roam).

Otherwise: one big file to have everything (todos, projects, notes,
etc&#x2026;) in one single place.

- Possible workaround via some hacks?

## Do you switch between British and French accents?

## What's the Emacs icon in the Firefox address bar?
Browser extension for org-protocol made by vifon: <https://github.com/vifon/org-protocol-for-firefox>


## How do you feel about archive files in Org mode, and how can they fit in?

## Could you post links?

## How big are your org files?
Main file: 38,000 lines for all GTD tasks, and he does archive.

Karl does use archiving although Karl does use Org tasks even in
knowledge management and those don't get archived most of the time.

## Does it not consume more resources and time to load multiple files than a large file of the same contents?
Dealing with hiding contents is computationally expensive.

- I doubt that is correct. The Emacs display engine is quite efficient
  at dealing with invisible text. Moving the cursor around is affected,
  but I have never heard of (or experienced) issues with scrolling in
  large (2 MB) Org files.
  - Actually, Org currently uses overlays to hide text, and the
    overhead of the overlays does eventually add up.  There's a
    working branch that uses text-properties instead, and it may be
    merged to Org someday.
    - It is on the way ;) I need more feedback (see help request in
      <https://updates.orgmode.org/>).
      - If I ever have time to even get my Org upgraded to the latest
        version, maybe I can think about trying to test that ;)
        - Would it help to share the branch on GitHub?
          - It would probably make it easier to use and more visible,
            so&#x2026;maybe?  :)
            - Noted (or rather captured) (using org-mode right? :)
              Indeed.
  - Karl: whenever I had severe performance issues and somebody was
    nice and helped to analyze the issue, "overlays" were the root
    cause in probably 90% of the cases. However, an average user
    (including me) does not know if a specific feature is implemented
    using overlays or not. My Org life is basically trial and error ;-)
    - alphapapa: FYI, if you use org-indent-mode (or whatever the name
      is of the mode that uses overlays to indent contents), you could
      disable that to reduce the number of overlays in a
      buffer.
      - Karl: thanks a bunch. However, some of those modes deliver
        features that are important to me, so I do have to accept the
        performance overhead to a certain level. That's a difficult
        trade-off I do have to make from time to time ;-)

## Doesn't using many small Org files clutter up your buffer list when generating agenda etc?
Personally, I limit org agenda to just a few files while keeping notes
in many more.

# Notes
- Speaker's emacs.d: <https://github.com/zaeph/.emacs.d>
- Mentioned: <https://karl-voit.at/2020/05/03/current-org-files/> ->
  Karl's big Org files.
- org-element.el: <https://orgmode.org/worg/dev/org-element-api.html>.
  - single-threaded lisp function that parses the whole file.
- "the problem is to let org-element to make sense of the item (?)
  &#x2026;".
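
To make the talk's point about tag inheritance concrete, here is a minimal Python sketch (not Org's actual implementation, and not org-element's API) that computes the effective tags of each headline by tracking its open ancestors. The function name `effective_tags` and the simplified headline regex are assumptions for illustration; real Org syntax has many more cases (TODO keywords, priorities, file-level tags, excluded tags).

```python
import re

# Simplified headline pattern: stars, title, optional trailing :tag1:tag2:
HEADLINE = re.compile(r"^(\*+)\s+(.*?)(?:\s+(:[\w@#%:]+:))?\s*$")


def effective_tags(org_text):
    """Return (title, tags) pairs where tags include those inherited from ancestors."""
    stack = []   # (level, own_tags) for every currently open ancestor
    result = []
    for line in org_text.splitlines():
        match = HEADLINE.match(line)
        if not match:
            continue  # body lines still have to be read to find the next headline
        level = len(match.group(1))
        title = match.group(2)
        own = [t for t in (match.group(3) or "").split(":") if t]
        # Drop headings that are not ancestors of the current one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        inherited = [t for _, tags in stack for t in tags]
        # Keep first occurrence of each tag, preserving order.
        tags = list(dict.fromkeys(inherited + own))
        result.append((title, tags))
        stack.append((level, own))
    return result


if __name__ == "__main__":
    sample = "* Projects :foo:\n** Write talk\n*** Slides :bar:\n* Inbox"
    for title, tags in effective_tags(sample):
        print(title, tags)
```

Even in this toy version, answering "which tags apply to this entry?" depends on everything above it in the outline, which is why a single sequential pass over the whole file is hard to avoid.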

<a name="transcript"></a>
# Transcript

00:00:24.160 --> 00:00:58.434
Hello again, everyone! I hope you had,
well, quite a lot of talks ever since
the last one I did, and all more
interesting one after the other. You
know, I'm a bit in a bit of a weird spot
right now, because I'm supposed to be
presenting to you (as you can see on my
screen) "One big-ass Org file or
multiple tiny ones: finally, the end of
the debate," and it sounds about as
clickbaity as you can possibly get with
those topics. By the way, credit where
credit is due, the title is not mine.
It's actually from Bastien Guerry, the
current Org maintainer.

00:00:58.434 --> 00:01:22.823
Yeah, I wanted to talk to you a little
bit today about this question because if
you are used to going on
reddit.com/r/emacs, you know the
subreddit that we have, if you go on
Hacker News often, you know it's a
question that you see pop up every once
in a while. "Should I be using one big
file, or should I be using a lot of tiny
files?"

00:01:22.823 --> 00:01:58.575
I believe you know we've got defenders
on both sides. If I just show you one
example... We have Karl Voit. He's one
of the organizers for the conference. He
is the guy who probably has the biggest
Org Mode files right now in all the
people I know, and god knows I know
plenty of people use Org Mode.
But if you just look at this line--I hope
it's not too small; you just
make it a little larger--but
Karl basically has a file with
126,000 lines.

00:01:58.575 --> 00:02:57.040
I'm just going to pause and try to have
you imagine how large a file it actually
is. Just think about all of these lines
being tasks in your days. Think about
all those lines being about little
thoughts you know that you've had
throughout the day or project that you
were working on. It's massive. You know
one of the problems that Karl Voit
actually approaches on this topic is
that it takes him roughly 20 seconds to
get his Org agenda going, which is a
massive amount of time. I mean, we have
very fast computers now. You know, ever
since Emacs was created in 1976,
computers... I have no idea how much
faster they've gotten. And yet, you
know, for 100,000 lines, Emacs seems to
be choking. It's certainly not
reasonable, in a way, to have to wait 20
seconds just for your entire file to be
parsed. So basically what I want to do--

00:02:57.040 --> 00:03:50.720
By the way, I forgot to introduce the
presentation, but I'm Leo Vivier. I did
this before, for those who were around.
I help maintain a software which is
called org-roam, and that's the
expertise that I have on the topic.
Actually, if you go online, I do have a
GitHub page. I will make sure that you
have all the links available afterwards.
But I do publish my init files, and you
can see, if you scroll at the bottom, I
have a little demonstration which shows
you the fancy things that I can do with
my Org Mode setup. That might be even
interesting in light of the talk you've
just had about GTD stuff, because the
first one is about how I handle my
projects, the second one is about the
flow from a task as I work on it... So I
won't spend too much time on this, but
basically that's my expertise. I have
spent eight years working with Org Mode,
three of them actually thinking about
writing packages.

00:03:50.720 --> 00:04:32.880
The thing is, if I go into a little bit
of detail (and obviously it's only a
lightning talk, so I won't have time to
actually go really in depth about it),
but there is something in the Org Mode
library which is called org-element. You
have the name right there,
org-element.el, .el being for Elisp
file. As you can see, the page is on the
Worg wiki, so it's accessible by
everyone. It's basically the API that
Org Mode uses to parse Org Mode files.
For those who don't know, parsing means
basically checking a file, checking all
the contents of the file, and extracting
all the information that we need from
that file.

00:04:32.880 --> 00:04:58.960
As you can imagine, you all have Org
Mode files in your mind, well you know
they can be fairly complex. You can have
properties, you can have contextual
information, like if you write a line
which starts at column zero (which means
at the left), it doesn't have the same
meaning, whether or not it is before the
beginning of a headline or if it is
after the beginning of a headline. It's
going to be relatively different,
hierarchically speaking.

00:04:58.960 --> 00:05:39.280
So the problem, when it comes to the
question of many files versus one big
file or few big files, is that we always
have to keep in mind what org-element
wants you to do. The thing is, there are
plenty of problems when it comes to
parsing files, the first one being
obviously that Emacs is a single-thread
process (or has some threading
capabilities; we're not going to go into
the details right now, that's not my
goal). It makes it incredibly hard to
parallelize parsing processes with the
current technology.

00:05:39.280 --> 00:07:03.759
So you'd have to imagine that if you
have a very large file--if you go back
to the example of Karl Voit from before:
100,000 lines--that means that you have
to scan through every single line,
basically. Because sometimes... Let's
just say that you have a property
drawer, for instance, which tells you,
oh okay, this tree has the tag :foo:. So
the problem is, there are multiple ways
for you to define a tag. You can use the
usual way, which is about wrapping in
colons the :tag: at the end of a
heading. For instance, if I... (I'm not
going to switch to Emacs, that's going
to waste too much time) That's one way
to say your tag. But say, you have tag
inheritance, which means that when you
have a parent with a tag, you also want
the child to inherit the tag. If you
have first heading with the tag :foo:,
you have the first subheading, and the
tag :foo: is implied. Now imagine having
to do that with a file that is
completely nested, a file that has maybe
9, 10, 11 levels of depth to it. It's
mind-bogglingly complicated for the
software to do that, knowing that...
I've told you about tags, but any
property can be inheritable. Anything
like priorities, even. Though why would
you do this? You can have groups. You
can have all this.

00:07:03.759 --> 00:07:21.957
And as someone who went through the
trouble of optimizing his Org agenda...
So basically, if we go back to the
GIFs--oh god we've already had this
discussion between the "git" and "magit"
and now I've started "gif" and "gif" and
I only have one more minute left to do
so, so let's just
say I'm going to say "gif"
just to spite people...

00:07:21.957 --> 00:07:41.360
So if you go on the way I organize my
agenda, what I did in order to keep my
agenda build time under two seconds, is
that I've rewritten a whole lot of codes
to be able to parse my Org agenda files.
So the thing is, I'm going to be talking
more about this later.

00:07:41.360 --> 00:07:44.479
I only have, let's say, one minute to
conclude.

00:07:44.479 --> 00:08:15.199
So as you've gathered, I'm not going to
be giving you the answer right now. I'm
going to be talking about org-roam a
little later, which is about following
the principle of having many small
files. But as someone who has been using
one large file to manage my life, you
know, I'm sitting on the fence. I do not
know which one is the best, but I hope
that my presentation has given you a
little idea of what goes on behind the
principles.

00:08:15.520 --> 00:08:52.000
You also need to think about the
philosophy behind the organization of
your notes. I hope to be approaching
this topic with you in about two hours
or so (maybe one hour actually). I'm
actually finished. I've decided to leave
you two minutes of questions. If someone
could feed me the questions, that might
be best, because I don't want... oh
actually I can just open the pad. I can
just open it. Give me a second, okay.
Just loading up. I might stop showing my
screen. That might make it easier. So I
mean if you can make myself big now on
the screen, that would be splendid.
([Amin:] yeah sure)

00:08:52.000 --> 00:09:13.920
Thank you. Where are we... Question 12.
Okay, so what's better, one big file
or...? Is it a jab to tell me that I
haven't answered the question because
someone just
asked me the question? Well, personally, if
I were to give you a quick answer in
20 seconds, personally, I think it's a
question that is contextually based.

00:09:13.920 --> 00:09:45.890
Do you want something that is efficient
as far as optimization is concerned?
Then you need to think about this.
Personally, for all the organization
that I do, all this stuff, all the TODOs
that I handle, I like to do this in one
simple big file because you benefit from
all the refiling capabilities of Org
Mode, so I would do that. But for
knowledge management, for note-taking
and all this, well I'd much rather
follow the org-roam way of doing things,
which is about having many small files.

00:09:45.890 --> 00:09:57.040
I'm not getting any more questions. I'm
not sure if there is one on IRC that
could be fed to me. Otherwise, I'm happy
to pass over to the next speaker.

00:09:57.040 --> 00:10:06.520
By the way, just before I finish, your
world is a lie. It's not a three-piece
suit. I'm wearing jeans below, so I hope
that satisfies your curiosity.

00:10:10.640 --> 00:10:35.680
Okay, there's one more question
appearing. "but otherwise one big file
to have everything..." So I'm putting
you on the spot, I believe. It was such
a short talk. You know the problem is, I
just wanted to give you a little answer.
A little, you know, path of thinking on
this topic. Obviously it's a topic I
could be spending 40 minutes on, but I'm
going to be drained, you're going to be
drained, nobody's going to be happy if I
do this.

00:10:39.440 --> 00:11:08.240
Someone asked me if I switch between
British and French accents. A little
secret for you: when I'm stressed, I
tend to revert to a French accent, so
you can measure the amount of stress
that I'm feeling during this talk with
the amount of h's that I drop and the
amount of sheer fright that you can see
sometimes in my eyes, when I'm thinking
about what to say next.

00:11:08.240 --> 00:11:17.040
All right sir. So, Amin, do you believe
we can leave it at that? I'll be...
People will see plenty more of me later
on, anyway.

00:11:17.040 --> 00:11:27.120
([Amin:] So, looking at the schedule, I
think your talk has until like 2:02,
meaning like five or six minutes from
now.)

00:11:27.120 --> 00:11:28.000
Oh, right.

00:11:28.000 --> 00:11:33.920
([Amin:] So if you do like to take one
or two questions, to add two more
questions, by all means.)

00:11:33.920 --> 00:12:20.555
So someone has asked me what is the
Emacs icon (sorry, see, another French
accent) here in my status bar... Oh
sorry, I'm not sharing any more. I might
just share again just so that everyone
can catch a glimpse of that. There we
go. Allow... So it should be... So if
you could make me small again, Amin, I'm
not sure if it's going to do it by
itself, but I do have a little icon here
in my status bar which is basically a
way to interact with org-protocol. I'm
not going to look for it right now, but
it's a browser extension that is
developed by one of my friends over at
Ranger whose name is vifon and
it's very useful. I'm someone who uses a
lot of Org protocols.

00:12:20.555 --> 00:12:53.600
And by the way, I used to teach English
to high schoolers, and they were
supremely worried when I showed them my
status line and they saw "kill" and
"explore" in my status line. As fellow
Emacs users, you know that obviously
kill means to kill a selection of text
and keep it inside your clipboard, but
for my students, they were very worried
about what their professor was up to
during his nights.

00:12:53.600 --> 00:13:01.920
So let's see if we've got more
questions. I'm showing you the questions
on the rainbow. Let's see if we've got
more. People are posting a lot of
questions now.

00:13:01.920 --> 00:13:06.399
So how do you feel about archiving files
in Org Mode and how can that work?

00:13:06.399 --> 00:13:59.519
So one of the things when we think about
optimization is: yes, archiving done
trees is a good idea because it means
that if we go back to the org-element,
the way it works (and we'll get into
technical details afterwards; I'm giving
a presentation about org-roam technical
aspects, sorry, so I'll have a chance to
expand a little more on this) but
basically, org-element needs to... Every
time it sees a TODO, it has to consider
it, even though it is a done TODO. Why?
Because let's say, for instance, that in
your agenda you want to activate log
mode, which is going to show the tasks
which are done... Now you could be
clever and say, oh okay, the Org agenda
does not need to show done items, so
it's not going to look for them, but the
problem is that org-element is always
called. It always needs to parse the
buffer.

00:13:59.519 --> 00:14:22.079
You know, Nicolas Goaziou, who is the
French developer who's worked a whole
lot on org-element has gone through a
lot of trouble to optimize org-element,
but the problem is there's just so much
that we can do with a concurrent
process. Right now it leaves somewhat
things to be desired, but we're working
on it.

00:14:22.079 --> 00:14:32.639
One more time... I feel like I spent
half of this talk teasing my next talks,
but I'll be talking more about this in
my future talks in about one to two
hours.

00:14:32.639 --> 00:14:36.079
So, continuing with questions, how big
are my Org files?

00:14:36.079 --> 00:15:04.880
So in the background, I'm just going to
check how many lines I have in my main
file.
In my own file, so the one I told you
about where I keep all
my TODO GTD stuff, I have
38,000 lines, which is...
It's sizable, definitely.
But I do archive a lot of stuff,
so that might be a slight difference
between myself and Karl Voit,
even though I don't remember if they
actually archive stuff.

00:15:04.880 --> 00:15:12.560
So does it not consume more resources
and time to load multiple files
than a large file of the same contents?

00:15:12.560 --> 00:16:00.560
Theoretically, yes, having many files
open concurrently is slightly slower
than having one main file opened. Now
the problem is for those of you who have
large files, you may have noticed that
when you are scrolling in a very large
file, it starts taking quite a bit of
time. Why? It's because in Org Mode, you
have a lot of content that is hidden, so
when you have the view mode which hides
as much stuff as possible, meaning that
you only see the top heading--and I'm
checking the time, Amin, don't worry,
I'm finished on this one-- when you're
hiding a whole lot of stuff, Org Mode
needs to keep track, or I should say,
Emacs needs to keep track of which areas
of text to show and which areas of text
to hide.

00:16:00.560 --> 00:16:21.199
The problem is that when you're hiding
stuff-- let's say you're moving from the
first heading to the second heading, but
you've got like 10,000 lines between
those two headings-- well, Emacs needs
to compute the difference between the
two passages, and that takes quite a lot
of time. That's why you might realize
that it's a little choppy when you start
scrolling in large files.

00:16:21.199 --> 00:16:30.719
Anyway I could be answering questions
about Org Mode for literally two hours
straight,
so I'm gonna hand it over to the next
speakers. I'll be seeing
you guys a little later.

00:16:30.719 --> 00:16:33.440
([Amin:] Thank you very much, Leo.)

00:16:33.440 --> 00:16:34.889
Oh, thank you.

00:16:34.889 --> 00:16:36.959
([Amin:] Yes. Bye.)

00:16:36.959 --> 00:16:39.839
Bye.