[[!sidebar content=""]]
[[!meta title="GRAIL---A Generalized Representation and Aggregation of Information Layers"]]
[[!meta copyright="Copyright © 2022 Sameer Pradhan"]]
[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-generate-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. -->


# GRAIL---A Generalized Representation and Aggregation of Information Layers
Sameer Pradhan (he/him)

[[!inline pages="internal(2022/info/grail-before)" raw="yes"]]
[[!template id="help"
volunteer=""
summary="Q&A could be indexed with chapter markers"
tags="help_with_chapter_markers"
message="""The Q&A session for this talk does not have chapter markers yet.
Would you like to help? See [[help_with_chapter_markers]] for more details. You can use the vidid="grail-qanda" if adding the markers to this wiki page, or e-mail your chapter notes to <emacsconf-submit@gnu.org>."""]]


The human brain receives various signals that it assimilates (filters,
splices, corrects, etc.) to build a syntactic structure and its semantic
interpretation. This complex process is what enables human communication.
The field of artificial intelligence (AI) is devoted to studying how we
generate symbols and derive meaning from such signals, and to building
predictive models that allow effective human-computer interaction.

For the purpose of this talk, we will limit the scope of signals to the
domain of language: text and speech. Computational Linguistics (CL),
a.k.a. Natural Language Processing (NLP), is the sub-area of AI that tries
to interpret such signals. It involves modeling and predicting complex
linguistic structures from them.
These models tend to rely heavily on a large
amount of "raw" (naturally occurring) data and a varying amount of
(manually) enriched data, commonly known as "annotations". The models are
only as good as the quality of the annotations. Owing to the complex and
numerous nature of linguistic phenomena, a divide-and-conquer approach is
common. The upside is that it allows one to focus on one, or a few, related
linguistic phenomena. The downside is that the universe of these phenomena
keeps expanding, because language is context sensitive and evolves over
time. For example, depending on the context, the word "bank" can refer to
a financial institution, the rising ground surrounding a lake, or something
else. The verb "google" did not exist before the company came into being.

Manually annotating data can be a very task-specific, labor-intensive
endeavor. Owing to this, advances in the different modalities have happened
in silos until recently. Recent advances in computer hardware and machine
learning algorithms have opened doors to the interpretation of multimodal
data. However, the need to piece together such related but disjoint
predictions poses a huge challenge.

This brings us to the two questions that we will try to address in this
talk:

1. How can we come up with a unified representation of data and annotations that encompasses arbitrary levels of linguistic information? and,

2. What role might Emacs play in this process?

Emacs provides a rich environment for editing and manipulating the
recursive, embedded structures found in programming languages. Its view of
text, however, is more or less linear: strings broken into words, strings
ended by periods, strings identified using delimiters, etc. It does not
assume embedded or recursive structure in text. However, the process of
interpreting natural language involves operating on such structures. What
if we could adapt Emacs to manipulate rich structures derived from text?
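As a toy sketch of the first question (this is not GRAIL itself; all names
here are hypothetical illustrations), one way to picture a unified
representation is standoff annotation: each layer records labeled character
spans over a shared base text, and aggregation merges independently
produced layers without ever mutating the text:

```python
# Hypothetical sketch of layered standoff annotation (not the GRAIL format).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str

@dataclass
class Layer:
    name: str
    spans: list = field(default_factory=list)

def aggregate(text, layers):
    """Merge layers into {(start, end): {layer_name: label}}.

    Independently produced layers (tokens, senses, entities, ...) can then
    be viewed together, keyed by the spans of the shared base text."""
    merged = {}
    for layer in layers:
        for s in layer.spans:
            merged.setdefault((s.start, s.end), {})[layer.name] = s.label
    return merged

text = "I walked along the bank."
tokens = Layer("token", [Span(19, 23, "NN")])
senses = Layer("sense", [Span(19, 23, "bank%river_edge")])
print(aggregate(text, [tokens, senses])[(19, 23)])
# → {'token': 'NN', 'sense': 'bank%river_edge'}
```

Because layers only reference offsets into a shared text, they can be
added, removed, or regenerated independently; the hard part, which a system
like GRAIL targets, is keeping such layers mutually consistent as the text,
tokenization, and annotation guidelines evolve.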
Unlike
programming languages, which are designed to be parsed and interpreted
deterministically, the interpretation of statements in natural languages
frequently has to deal with phenomena such as ambiguity, inconsistency,
and incompleteness, and can get quite complex.

We present an architecture (GRAIL) which utilizes the capabilities of Emacs
to allow the representation and aggregation of such rich structures in
a systematic fashion. Our approach is not tied to Emacs, but uses its many
built-in capabilities for creating and evaluating solution prototypes.


# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more
  systematic fashion and make the video available at the
  emacsconf/grail URL. My sense is that this URL will be active for
  the foreseeable future.
- I am going to try to revise some of the answers, which I typed quite
  quickly and in which I may not have provided useful context or might
  have made errors.
- Please feel free to email me at pradhan@cemantix.org for any further
  questions or discussions you may want to have with me, or to be part
  of the GRAIL community (which doesn't exist yet :-), or is a community
  of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A: Yes. The '92 corpus only annotated syntactic structure. It was
    probably the first time that the details captured in syntax were
    selected not purely based on linguistic accuracy, but on the
    consistency of such annotations across multiple annotators. This
    is often referred to as Inter-Annotator Agreement (IAA). The high
    IAA for this corpus was probably one of the reasons that parsers
    trained on it got accuracies in the mid 80s or so. Over the
    next 30 years (and still continuing...) academics improved on
    parsers, and today the performance on the test set from this
    corpus is somewhere around an F-score of 95.
    But this has to be
    taken with a big grain of salt, given overfitting and how many
    times people have seen the test set.
  - One thing that might be worth mentioning is that over the past 30
    years, many different phenomena have been annotated on parts of
    this corpus. However, as I mentioned, current tools and
    representations have had difficulty integrating such disparate
    layers of annotation. Some of the issues are related to the
    complexity of the phenomena and others to the brittleness of the
    representations. For example, I remember when we were building the
    OntoNotes corpus, there was a point where the guidelines were
    changed to split all words at a hyphen. That simple change caused
    a lot of heartache, because the interdependencies were not captured
    at a level that could be programmatically manipulated. That was
    around 2007, when I decided to use a relational database
    architecture to represent the layers. The great thing is that it
    was an almost perfect representation, but for some reason it never
    caught on, because using a database to prepare data for training
    was kind of unthinkable 15 years ago. Maybe? Anyway, the usual
    format is the easiest to use but very rigid, in the sense that you
    can quickly make use of it, but if something changes somewhere you
    have no idea whether the whole is still consistent. When I came
    across Org mode sometime around 2011/12 (if I remember correctly)
    I thought it would be a great tool. And indeed, about a decade
    later, I am trying to stand on its and Emacs's shoulders.
  - This corpus was one of the first large-scale manually annotated
    corpora that bootstrapped the statistical natural language
    processing era. That can be considered the first wave...
    Since then, more corpora have been built on the same philosophy.
    In fact, about a decade ago I spent about 8 years building a much
    larger corpus, with more layers of information, called OntoNotes.
    It covers Chinese and Arabic as well (DARPA funding!). It is freely
    available for research to anyone, anywhere. That was quite a feat.
- Q: Is this only for natural languages like English, or is it more
  general? Could it be used for programming languages?
  - A: I am using English as a use case, but the idea is for it to be
    completely multilingual.
  - I cannot think why you would want to use it for programming
    languages. In fact, the concept of an AST in programming
    languages was what I thought would be worth exploring in this
    area of research. Org mode, the way I sometimes view it, is a
    somewhat crude incarnation of that and can be sort of manually
    built, but the idea is to identify patterns and build upon them
    to create a larger collection of transformations that could be
    generally useful. That could help capture an abstract
    representation of "meaning" and help the models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data that we are sitting in today. So, not surprisingly, these
    models tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same
    as the one the training data came from, then the models do really
    poorly. It is sometimes compared to learning a sine function by
    memorizing points on the sine wave, as opposed to deriving the
    function itself. You can get close, but then you cannot really do
    a lot better with that model :-)
  - I did a brief stint at Harvard Medical School/Boston Children's
    Hospital to see if we could use the same underlying philosophy to
    build better models for understanding clinical notes.
    It would be an extremely useful and socially beneficial use case,
    but after a few years, realizing that the legal and policy issues
    related to making such data available on a larger scale might need
    a few more decades, I decided to step off that wagon (if I am
    using the figure of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests that are taken by
    older people, from which neurologists can predict the potential
    early onset of some neurological disorders. The idea is to see if
    we can use speech and language signals to predict such cases early
    on. Given that we don't have cures for those conditions yet, the
    best we can do is identify them earlier, with the hope that the
    progression can be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in
    the technologies, but saying that the models can "learn" is an
    extreme overstatement.

- Q: Reminds me of the advantages of pre-computer copy and paste. Cut
  up paper and rearrange, but having more stuff with your pieces.
  - A: Right! Kind of like that, but more "intelligent" than
    copy/paste, because you could have various local constraints that
    would ensure that the information stays consistent with the whole.
    I am also envisioning this as a use case for hooks. And if you
    have rich local dependencies, then you can be sure (as much as you
    can) that the information signal is not too corrupted.
  - I had not heard of the "cut up paper" idea you mentioned. That is
    an interesting thought. In fact, the kind of thing I was/am
    envisioning is that you can cut the paper a million ways and still
    join the pieces back to form the original piece of paper.

- Q: Have you used it in some real-life situation? Where have you experimented with this?
  - A: No.
  - I am probably the only person doing this crazy thing. It would be
    nice, or rather I have a feeling, that something like this, if
    worked on for a while by many people, might lead to a really
    potent tool for the masses. I feel strongly about giving such
    power to the users, so that they can edit and share the data
    openly and it is not stuck in some corporate vault somewhere :-)
    One thing at a time.
  - I am in the process of creating a minimal viable package and will
    see where that goes.
  - The idea is to start within Emacs and Org mode, but not
    necessarily be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence on positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data. Absolutely. In fact, the project I mentioned,
    OntoNotes, has multiple layers of annotation. One of them is the
    propositional structure, which uses a large lexicon that covers
    about 15K verbs and nouns and all of their argument structures
    that we have seen so far in the corpora. There are about a million
    "propositions" that have been released recently (we just
    celebrated the 20th birthday of the corpus). It is called the
    PropBank.
  - There is an interesting history of the "Banks". It started with
    Treebank, and then there was PropBank (with a capital B), and then
    we developed OntoNotes, which contains:
    - Syntax
    - Named Entities
    - Coreference Resolution
    - Propositions
    - Word Sense
  - All in the same whole, and across various genres... (can add more
    information here later...)

- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the syntax,
  out of works?
  (Granted this may be out of your area; ignore as desired)
  - A: :-) Nothing that relates to "meaning" falls too far away from
    where I would like to be. It is a very large landscape, growing
    very fast, so it is hard to be everywhere at the same time :-)
  - Many people are working on trying to analyze literature. Analyzing
    news stories has been happening since the beginning of the
    statistical NLP revolution---sort of linked to the fact that the
    first million "trees" were curated from WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona.
    I would need to know more about them to say more, but I cannot
    imagine any language not being able to use this framework.
  - Conlangs are "constructed languages", such as Esperanto:
    languages designed with intent, rather than evolved over
    centuries. Toki Pona is a minimal conlang created in 2001, with a
    uniform syntax and a small (<200 word) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now using real-world annotations on large
    corpora---both text and speech---and am validating the concept
    further. I am sure there will be some bumps along the way, and I
    am not saying that this is going to be a cure-all, but I feel
    (after spending most of my professional life building/using
    corpora) that this approach is very appealing. The speed of its
    development will depend on how many people buy into the idea and
    pitch in, I guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at the very
    moment, partly because most people have never heard of Emacs or
    Org mode. But if we can validate the concept, and if it
But if we can valide the concept and if it + does "grow legs" and walk out of the emacs room into the + larger universe, then absolutely, anyone who can have any say + about langauge could use it. And the contributions would be as + useful as the consistency with which one can capture a certain + phenomena. + - . + - Everytime you use a capta these days, the algorithms used by the + company storing the data get slightly better. What if we could + democratize this concept. That could lead to fascinating things. + Like Wikipedia did for the sum total of human knowledge. + +- Q: + - A: + + + + +[[!inline pages="internal(2022/info/grail-after)" raw="yes"]] + +[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]] + +[[!taglink CategoryLinguistics]] |