From c6c2d25cb561946e993e5dc5919afed8017cd087 Mon Sep 17 00:00:00 2001
From: Sacha Chua
Date: Thu, 8 Dec 2022 20:18:23 -0500
Subject: add etherpads to wiki pages

---
 2022/talks/grail.md | 240 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
(limited to '2022/talks/grail.md')

diff --git a/2022/talks/grail.md b/2022/talks/grail.md
index 824f76b3..b53efc44 100644
--- a/2022/talks/grail.md
+++ b/2022/talks/grail.md
@@ -67,6 +67,246 @@ a systematic fashion. Our approach is not tied to Emacs, but uses its many
 built-in capabilities for creating and evaluating solution prototypes.

# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more systematic
  fashion and make the video available at the emacsconf/grail URL. My
  sense is that this URL will be active for the foreseeable future.
- I am going to try to revise some of the answers, which I typed quite
  quickly and which may not have provided useful context or may have
  contained errors.
- Please feel free to email me at pradhan@cemantix.org with any further
  questions or discussions you may want to have with me, or to be part
  of the GRAIL community (which doesn't exist yet :-), or is a
  community of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A: Yes. The '92 corpus only annotated syntactic structure. It was
    probably the first time that the details captured in syntax were
    selected not purely based on linguistic accuracy, but on the
    consistency of such annotations across multiple annotators. This is
    often referred to as Inter-Annotator Agreement (IAA). The high IAA
    for this corpus was probably one of the reasons that parsers
    trained on it reached accuracies in the mid 80s or so. Over the
    next 30 years (and still continuing), academics improved on
    parsers, and today the performance on the test set from this corpus
    is somewhere around an F-score of 95. But that has to be taken with
    a big grain of salt, given overfitting and how many times people
    have seen the test set.
  - One thing that might be worth mentioning is that over the past 30
    years, many different phenomena have been annotated on parts of
    this corpus. However, as I mentioned, current tools and
    representations make it difficult to integrate such disparate
    layers of annotation; some of the issues relate to the complexity
    of the phenomena and others to the brittleness of the
    representations. For example, I remember that when we were building
    the OntoNotes corpus, there was a point where the guidelines were
    changed to split all words at a hyphen. That simple change caused a
    lot of heartache, because the interdependencies were not captured
    at a level that could be programmatically manipulated. That was
    around 2007, when I decided to use a relational database
    architecture to represent the layers (a small sketch of that idea
    appears after this question). The great thing is that it was an
    almost perfect representation, but for some reason it never caught
    on, because using a database to prepare data for training was kind
    of unthinkable 15 years ago. Maybe? Anyway, the flat format that is
    easiest to use is very rigid: you can quickly make use of it, but
    if something changes somewhere, you have no idea whether the whole
    is still consistent. When I came across Org mode sometime around
    2011/12 (if I remember correctly), I thought it would be a great
    tool. And indeed, about a decade later, I am trying to stand on its
    and Emacs' shoulders.
  - This corpus was one of the first large-scale manually annotated
    corpora that bootstrapped the statistical natural language
    processing era. That can be considered the first wave. Since then,
    more corpora have been built on the same philosophy. In fact, I
    spent about 8 years, starting about a decade ago, building a much
    larger corpus with more layers of information, called OntoNotes. It
    covers Chinese and Arabic as well (DARPA funding!) and is freely
    available for research to anyone anywhere. That was quite a feat.
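To make the relational-database idea above concrete, here is a minimal
sketch. It is not the actual OntoNotes schema; the table layout, column
names, and the toy consistency query are assumptions invented for
illustration. The point is only that when annotation layers live in one
store, a cross-layer dependency (such as spans left stale by a
tokenization guideline change) can be checked programmatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token      (doc TEXT, idx INTEGER, form TEXT,
                         PRIMARY KEY (doc, idx));
CREATE TABLE annotation (layer TEXT, doc TEXT,
                         first_tok INTEGER, last_tok INTEGER, label TEXT);
""")

# Tokens after a hypothetical guideline change that splits words at hyphens.
conn.executemany("INSERT INTO token VALUES (?, ?, ?)",
                 [("d1", 0, "state"), ("d1", 1, "-"),
                  ("d1", 2, "owned"), ("d1", 3, "banks")])

# Two annotation layers; the second still points at a pre-split token index.
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
                 [("syntax", "d1", 0, 3, "NP"),
                  ("ner",    "d1", 0, 9, "ORG")])

# Consistency check: flag annotations whose spans no longer resolve to tokens.
stale = conn.execute("""
    SELECT a.layer, a.first_tok, a.last_tok
    FROM annotation AS a
    LEFT JOIN token AS t ON t.doc = a.doc AND t.idx = a.last_tok
    WHERE t.idx IS NULL
""").fetchall()
print("inconsistent annotations:", stale)   # -> [('ner', 0, 9)]
```

This kind of check is exactly what a flat, file-based exchange format
makes hard to express, which is the brittleness the answer refers to.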
- Q: Is this only for natural languages like English, or is it more
  general? Could it be used for programming languages?
  - A: I am using English as a use case, but the idea is for it to be
    completely multilingual.
  - I cannot think why you would want to use it for programming
    languages. In fact, the concept of an AST in programming languages
    was what I thought would be worth exploring in this area of
    research. Org mode, the way I sometimes view it, is a somewhat
    crude incarnation of that: it can be sort of manually built, but
    the idea is to identify patterns and build upon them to create a
    larger collection of transformations that could be generally
    useful. That could help capture the abstract representation of
    "meaning" and help the models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data we are sitting in today. So, not surprisingly, these models
    tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same as
    the one the training data came from, the models do really poorly.
    It is sometimes compared to learning a sine function using the
    points on the sine wave, as opposed to deriving the function
    itself. You can get close, but then you cannot really do a lot
    better with that model :-)
  - I did a brief stint at Harvard Medical School/Boston Children's
    Hospital to see if we could use the same underlying philosophy to
    build better models for understanding clinical notes. It would be
    an extremely useful and socially beneficial use case, but after a
    few years, realizing that the legal and policy issues related to
    making such data available on a larger scale might need a few more
    decades, I decided to step off that wagon (if I am using the figure
    of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests taken by older
    people, from which neurologists can predict a potential early onset
    of some neurological disorder. The idea is to see if we can use
    speech and language signals to predict such cases early on. Given
    that we don't have cures for those conditions yet, the best we can
    do is identify them earlier, with the hope that the progression can
    be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in the
    technologies, but to say that the models can "learn" is an extreme
    overstatement.

- Q: This reminds me of the advantages of pre-computer copy and paste:
  cut up paper and rearrange it, but having more stuff with your
  pieces.
  - A: Right! Kind of like that, but more "intelligent" than
    copy/paste, because you could have various local constraints that
    would ensure that the information is consistent with the whole. I
    am also envisioning this as a use case for hooks. And if you can
    have rich local dependencies, then you can be sure (as much as you
    can) that the information signal is not too corrupted.
  - I had not read about the "cut up paper" approach you mentioned.
    That is an interesting thought. In fact, the kind of thing I was/am
    envisioning is that you can cut the paper a million ways, but you
    can still join the pieces back to form the original piece of paper.

- Q: Have you used it in some real-life situation?
  - A: No. I am probably the only person doing this crazy thing. It
    would be nice, or rather I have a feeling, that something like
    this, if worked on for a while by many, might lead to a really
    potent tool for the masses. I feel strongly about giving such power
    to the users, and about being able to edit and share the data
    openly so that it is not stuck in some corporate vault somewhere
    :-) One thing at a time.
  - I am in the process of creating a minimally viable package and will
    see where that goes.
  - The idea is to start within Emacs and Org mode, but not necessarily
    be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or as something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence for positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data.
  - Absolutely. In fact, the OntoNotes project I mentioned has multiple
    layers of annotation, one of them being the propositional
    structure, which uses a large lexicon covering about 15K verbs and
    nouns and all their argument structures seen so far in the corpora.
    About a million "propositions" have been released recently (we just
    celebrated the 20th birthday of the corpus). It is called the
    PropBank.

- There is an interesting history of the "Banks". It started with
  Treebank, and then there was PropBank (with a capital B), and then we
  developed OntoNotes, which contains:
  - Syntax
  - Named Entities
  - Coreference Resolution
  - Propositions
  - Word Senses

- All in the same whole and across various genres... (can add more
  information here later...)

- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the syntax,
  out of works? (Granted, this may be out of your area; ignore as
  desired.)
  - A: :-) Nothing that relates to "meaning" falls too far away from
    where I would like to be. It is a very large landscape and it is
    growing very fast, so it is hard to be everywhere at the same time
    :-)
  - Many people are working on trying to analyze literature. Analyzing
    news stories has been happening since the beginning of the
    statistical NLP revolution, sort of linked to the fact that the
    first million "trees" were curated using WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona.
    I would love to know more about them before saying more, but I
    cannot imagine any language not being able to use this framework.
  - Conlangs are "constructed languages" such as Esperanto: languages
    designed with intent, rather than evolved over centuries. Toki Pona
    is a minimal conlang created in 2001, with a uniform syntax and a
    small (under 200 words) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now actually using real-world annotations on large
    corpora, both text and speech, and am validating the concept
    further. I am sure there will be some bumps along the way, and I am
    not saying that this is going to be a cure-all, but (after spending
    most of my professional life building and using corpora) this
    approach does seem very appealing to me. The speed of its
    development will depend on how many people buy into the idea and
    pitch in, I guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at this very
    moment, partly because most "common" people have never heard of
    Emacs or Org mode. But if we can validate the concept and if it
    does "grow legs" and walk out of the Emacs room into the larger
    universe, then absolutely, anyone who has any say about language
    could use it. And the contributions would be as useful as the
    consistency with which one can capture a certain phenomenon.
  - Every time you use a captcha these days, the algorithms used by the
    company storing the data get slightly better. What if we could
    democratize this concept? That could lead to fascinating things,
    like Wikipedia did for the sum total of human knowledge.

[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]

-- cgit v1.2.3