diff --git a/2022/talks/grail.md b/2022/talks/grail.md
new file mode 100644
index 00000000..20e7f6d5
--- /dev/null
+++ b/2022/talks/grail.md
@@ -0,0 +1,322 @@
+[[!sidebar content=""]]
+[[!meta title="GRAIL---A Generalized Representation and Aggregation of Information Layers"]]
+[[!meta copyright="Copyright © 2022 Sameer Pradhan"]]
+[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]
+
+<!-- Initially generated with emacsconf-generate-talk-page and then left alone for manual editing -->
+<!-- You can manually edit this file to update the abstract, add links, etc. -->
+
+
+# GRAIL---A Generalized Representation and Aggregation of Information Layers
+Sameer Pradhan (he/him)
+
+[[!inline pages="internal(2022/info/grail-before)" raw="yes"]]
+[[!template id="help"
+volunteer=""
+summary="Q&A could be indexed with chapter markers"
+tags="help_with_chapter_markers"
+message="""The Q&A session for this talk does not have chapter markers yet.
+Would you like to help? See [[help_with_chapter_markers]] for more details. You can use the vidid="grail-qanda" if adding the markers to this wiki page, or e-mail your chapter notes to <emacsconf-submit@gnu.org>."""]]
+
+
+The human brain receives various signals that it assimilates (filters,
+splices, corrects, etc.) to build a syntactic structure and its semantic
+interpretation. This is a complex process that enables human communication.
+The field of artificial intelligence (AI) is devoted to studying how we
+generate symbols and derive meaning from such signals and to building
+predictive models that allow effective human-computer interaction.
+
+For the purpose of this talk we will limit the scope of signals to the
+domain of language&#x2014;text and speech. Computational Linguistics (CL),
+a.k.a. Natural Language Processing (NLP), is a sub-area of AI that tries to
+interpret such signals. It involves modeling and predicting complex
+linguistic structures from them. These models tend to rely heavily on a
+large amount of "raw" (naturally occurring) data and a varying amount of
+(manually) enriched data, commonly known as "annotations". The models are
+only as good as the quality of the annotations. Owing to the complex and
+numerous nature of linguistic phenomena, a divide-and-conquer approach is
+common. The upside is that it allows one to focus on one or a few related
+linguistic phenomena. The downside is that the universe of these phenomena
+keeps expanding, as language is context-sensitive and evolves over time. For
+example, depending on the context, the word "bank" can refer to a financial
+institution, the rising ground surrounding a lake, or something else. The
+verb "google" did not exist before the company came into being.
+
+Manually annotating data can be a very task-specific, labor-intensive
+endeavor. Owing to this, advances in multiple modalities have happened in
+silos until recently. Recent advances in computer hardware and machine
+learning algorithms have opened doors to the interpretation of multimodal
+data. However, the need to piece together such related but disjoint
+predictions poses a huge challenge.
+
+This brings us to the two questions that we will try to address in this
+talk:
+
+1. How can we come up with a unified representation of data and annotations that encompasses arbitrary levels of linguistic information?
+
+2. What role might Emacs play in this process?
+
+Emacs provides a rich environment for editing and manipulating the recursive
+embedded structures found in programming languages. Its view of text,
+however, is more or less linear&#x2013;strings broken into words, strings
+ended by periods, strings identified using delimiters, etc. It does not
+assume embedded or recursive structure in text. However, the process of
+interpreting natural language involves operating on such structures. What if
+we could adapt Emacs to manipulate rich structures derived from text? Unlike
+programming languages, which are designed to be parsed and interpreted
+deterministically, the interpretation of statements in natural languages
+frequently has to deal with phenomena such as ambiguity, inconsistency, and
+incompleteness, and can get quite complex.
+
+We present an architecture (GRAIL) which utilizes the capabilities of Emacs
+to allow the representation and aggregation of such rich structures in
+a systematic fashion. Our approach is not tied to Emacs, but uses its many
+built-in capabilities for creating and evaluating solution prototypes.
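+
+As a minimal sketch of one way such layers could be represented and
+aggregated in Emacs Lisp: standoff spans over a shared text, with a
+function that collects every layer's label at a given position. The layer
+names, spans, and labels below are invented for illustration and are not
+GRAIL's actual format.
+
+```elisp
+;; Hypothetical sketch: annotation layers as standoff spans over one
+;; shared text.  Layer names, spans, and labels are invented for
+;; illustration; this is not the actual GRAIL format.
+(defvar grail-demo-text "The bank rose.")
+
+(defvar grail-demo-layers
+  ;; Each layer is (NAME . ((START END LABEL) ...)), with START/END
+  ;; as zero-based character offsets into `grail-demo-text'.
+  '((token  . ((0 3 "DT") (4 8 "NN") (9 13 "VBD")))
+    (sense  . ((4 8 "bank.n.2")))   ; invented ID for the riverbank reading
+    (syntax . ((0 8 "NP") (9 13 "VP") (0 14 "S")))))
+
+(defun grail-demo-layers-at (pos)
+  "Aggregate the label of every span, in every layer, covering POS."
+  (let (result)
+    (dolist (layer grail-demo-layers (nreverse result))
+      (dolist (span (cdr layer))
+        (when (and (<= (nth 0 span) pos) (< pos (nth 1 span)))
+          (push (list (car layer) (nth 2 span)) result))))))
+
+;; (grail-demo-layers-at 5)
+;; => ((token "NN") (sense "bank.n.2") (syntax "NP") (syntax "S"))
+```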
+
+
+# Discussion
+
+## Notes
+
+- I plan to fix the issues with the subtitles in a more systematic
+  fashion and make the video available on the emacsconf/grail URL. My
+  sense is that this URL will be active for the foreseeable future.
+- I am going to try to revise some of the answers, which I typed quite
+  quickly and in which I may not have provided useful context or might
+  have made errors.
+- Please feel free to email me at pradhan@cemantix.org with any further
+  questions or discussions you may want to have with me, or to be part
+  of the GRAIL community (which doesn't exist yet :-), or is a
+  community of 1).
+
+## Questions and answers
+
+- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
+  over and over again using these tools?
+  - A:
+    - Yes. The '92 corpus only annotated syntactic structure. It was
+      probably the first time that the details captured in syntax were
+      selected not purely based on linguistic accuracy, but on the
+      consistency of such annotations across multiple annotators. This
+      is often referred to as Inter-Annotator Agreement. The high IAA
+      for this corpus was probably one of the reasons that parsers
+      trained on it got accuracies in the mid 80s or so. Then over the
+      next 30 years (and still continuing...) academics improved on
+      parsers, and today the performance on the test set from this
+      corpus is somewhere around an F-score of 95. But this has to be
+      taken with a big grain of salt, given overfitting and how many
+      times people have seen the test set. (A toy illustration of the
+      bracketed parse format appears right after this answer.)
+    - One thing that might be worth mentioning is that over the past
+      30 years, many different phenomena have been annotated on parts
+      of this corpus. However, as I mentioned, current tools and
+      representations have difficulty integrating such disparate
+      layers of annotation. Some of these issues relate to the
+      complexity of the phenomena and others to the brittleness of
+      the representations. For example, I remember when we were
+      building the OntoNotes corpus, there was a point where the
+      guidelines were changed to split all words at a hyphen. That
+      simple change caused a lot of heartache because the
+      interdependencies were not captured at a level that could be
+      programmatically manipulated. That was around 2007, when I
+      decided to use a relational database architecture to represent
+      the layers. It was an almost perfect representation, but for
+      some reason it never caught on, because using a database to
+      prepare training data was kind of unthinkable 15 years ago.
+      Maybe? Anyway, the format that is the easiest to use is also
+      very rigid: you can quickly make use of it, but if something
+      changes somewhere you have no idea whether the whole is still
+      consistent. When I came across Org mode sometime around 2011/12
+      (if I remember correctly) I thought it would be a great tool,
+      and indeed, about a decade later, I am trying to stand on its
+      and Emacs's shoulders.
+    - This corpus was one of the first large-scale manually annotated
+      corpora that bootstrapped the statistical natural language
+      processing era. That can be considered the first wave...
+      Since then, there have been more corpora built on the same
+      philosophy. In fact, I spent about 8 years, starting about a
+      decade ago, building a much larger corpus with more layers of
+      information, called OntoNotes. It covers Chinese and Arabic as
+      well (DARPA funding!). It is freely available for research to
+      anyone anywhere. That was quite a feat.
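+
+As a toy illustration: a simplified Treebank-style bracketing is
+essentially an s-expression, so Emacs Lisp can read and walk one
+directly. The tree below is invented; real Treebank files carry
+punctuation tags and other details that would need preprocessing before
+`read` could handle them.
+
+```elisp
+;; Toy illustration: a simplified Treebank-style bracketing is almost
+;; an s-expression, so `read-from-string' can parse it directly.  The
+;; tree is invented; punctuation tags in real Treebank files would
+;; need preprocessing before `read' could handle them.
+(defun grail-demo-count-label (tree label)
+  "Count subtrees of TREE whose first element is the symbol LABEL."
+  (if (not (consp tree))
+      0
+    (apply #'+
+           (if (eq (car tree) label) 1 0)
+           (mapcar (lambda (sub) (grail-demo-count-label sub label))
+                   (cdr tree)))))
+
+(let ((parse (car (read-from-string
+                   "(S (NP (DT The) (NN bank)) (VP (VBD rose)))"))))
+  (grail-demo-count-label parse 'NP))
+;; => 1
+```
+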
+- Q: Is this only for natural languages like English, or more general?
+  Would this be used for programming languages?
+  - A: I am using English as a use case, but the idea is for it to be
+    completely multilingual.
+  - I cannot think why you would want to use it for programming
+    languages. In fact, the concept of an AST in programming
+    languages was what I thought would be worth exploring in this
+    area of research. Org mode, the way I sometimes view it, is a
+    somewhat crude incarnation of that, and can be sort of manually
+    built; the idea is to identify patterns and build upon them to
+    create a larger collection of transformations that could be
+    generally useful. That could help capture the abstract
+    representation of "meaning" and help the models learn better.
+    (A toy sketch of this appears right after this answer.)
+  - These days most models are trained on a boatload of data, and no
+    matter how much data you use to train your largest model, it is
+    still going to be a small speck in the universe of ever-growing
+    data that we are sitting in today. So, not surprisingly, these
+    models tend to overfit the data they are trained on.
+  - So, if you have a smaller data set which is not quite the same
+    as the one that you had the training data for, then the models
+    really do poorly. It is sometimes compared to learning a sine
+    function using the points on the sine wave as opposed to
+    deriving the function itself. You can get close, but then you
+    cannot really do a lot better with that model :-)
+  - I did a brief stint at the Harvard Medical School/Boston
+    Children's Hospital to see if we could use the same underlying
+    philosophy to build better models for understanding clinical
+    notes. It would be an extremely useful and socially beneficial
+    use case, but after a few years, realizing that the legal and
+    policy issues related to making such data available on a larger
+    scale might need a few more decades, I decided to step off that
+    wagon (if I am using the figure of speech correctly).
+  - More recently, since I joined the Linguistic Data Consortium, we
+    have been looking at spoken neurological tests that are taken by
+    older people, from which neurologists can predict a potential
+    early onset of some neurological disorder. The idea is to see if
+    we can use speech and language signals to predict such cases
+    early on. Since we don't have cures for those conditions yet,
+    the best we can do is identify them earlier, with the hope that
+    the progression can be slowed down.
+  - This is sort of what is happening with the deep learning hype.
+    It is not to say that there hasn't been a significant advancement
+    in the technologies, but to say that the models can "learn" is
+    an extreme overstatement.
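+
+Here is a minimal sketch of that Org-as-AST idea, assuming one Org
+headline per sentence and one Org property per annotation layer. The
+function and layer names are invented for illustration and are not a
+published GRAIL API.
+
+```elisp
+;; Minimal sketch: one org headline per sentence, one property per
+;; annotation layer.  Function and layer names are invented; this is
+;; not a published GRAIL API.
+(require 'org)
+
+(defun grail-demo-insert-sentence (text layers)
+  "Insert TEXT as an org headline, storing LAYERS as properties.
+LAYERS is an alist of (LAYER-NAME . VALUE) strings."
+  (goto-char (point-max))
+  (insert "* " text "\n")
+  (forward-line -1)
+  (dolist (layer layers)
+    (org-entry-put (point) (car layer) (cdr layer))))
+
+(with-current-buffer (get-buffer-create "*grail-demo*")
+  (org-mode)
+  (grail-demo-insert-sentence
+   "The bank rose."
+   '(("SYNTAX"    . "(S (NP (DT The) (NN bank)) (VP (VBD rose)))")
+     ("WORDSENSE" . "bank: the riverbank, not the institution")))
+  ;; Each layer can be read back independently:
+  (org-entry-get (point-min) "WORDSENSE"))
+```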
+
+
+
+- Q: Reminds me of the advantages of pre-computer copy and paste: cut
+  up paper and rearrange it, but having more stuff with your pieces.
+  - A: Right!
+  - Kind of like that, but more "intelligent" than copy/paste,
+    because you could have various local constraints that would
+    ensure that the information stays consistent with the whole. I am
+    also envisioning this as a use case for hooks; a toy sketch
+    follows this answer. And if you can have rich local dependencies,
+    then you can be sure (as much as you can) that the information
+    signal is not too corrupted.
+  - I had not come across the "cut up paper" idea you mentioned.
+    That is an interesting thought. In fact, the kind of thing I
+    was/am envisioning is that you can cut the paper a million ways
+    and still join the pieces back to form the original piece of
+    paper.
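+
+Here is a toy sketch of the hook idea, assuming annotations are kept as
+simple (start end expected-text) triples over buffer positions; all
+names and the span format are invented for illustration.
+
+```elisp
+;; Toy sketch of "hooks as local constraints": after every edit,
+;; re-check that each annotated span still matches the text it was
+;; anchored to.  The span table and its format are invented.
+(defvar-local grail-demo-spans nil
+  "List of (START END EXPECTED-TEXT) annotations, as buffer positions.")
+
+(defun grail-demo-check-spans (_beg _end _len)
+  "Warn when an edit makes an annotation inconsistent with the text."
+  (dolist (span grail-demo-spans)
+    (pcase-let ((`(,start ,end ,expected) span))
+      (when (or (> end (point-max))
+                (not (string= expected
+                              (buffer-substring-no-properties start end))))
+        (message "Annotation out of sync: expected %S at %d..%d"
+                 expected start end)))))
+
+;; Enable buffer-locally:
+;; (add-hook 'after-change-functions #'grail-demo-check-spans nil t)
+```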
+
+
+
+
+- Q: Have you used it in some real-life situation? Where have you
+  experimented with this?
+  - A: No.
+  - I am probably the only person who is doing this crazy thing. I
+    have a feeling that something like this, if worked upon for a
+    while by many, might lead to a really potent tool for the masses.
+    I feel strongly about giving such power to the users, letting
+    them edit and share the data openly so that it is not stuck in
+    some corporate vault somewhere :-) One thing at a time.
+  - I am in the process of creating a minimally viable package and
+    will see where that goes.
+  - The idea is to start within Emacs and Org mode but not
+    necessarily be limited to them.
+
+- Q: Do you see this as a format for this type of annotation
+  specifically, or something more general that can be used for
+  interlinear glosses, lexicons, etc.? Does word sense include a
+  valence on positive or negative words (mood)?
+  - A: Interesting question. There are sub-corpora that have some of
+    this data.
+  - Absolutely. In fact, the project I mentioned, OntoNotes, has
+    multiple layers of annotation. One of them is the propositional
+    structure, which uses a large lexicon that covers about 15K verbs
+    and nouns and all their argument structures that we have seen so
+    far in the corpora. There are about a million "propositions"
+    that have been released (we just recently celebrated the 20th
+    birthday of the corpus). It is called PropBank. (A toy
+    proposition is sketched after this answer.)
+  - There is an interesting history of the "Banks". It started with
+    Treebank, and then there was PropBank (with a capital B), and
+    then we developed OntoNotes, which contains:
+    - Syntax
+    - Named Entities
+    - Coreference Resolution
+    - Propositions
+    - Word Sense
+  - All in the same whole, and across various genres... (can add
+    more information here later...)
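+
+For a rough idea of what a single proposition looks like, here is an
+invented, highly simplified predicate-argument structure in the same
+s-expression spirit; real PropBank annotations are richer and point
+back into Treebank parses rather than holding raw strings.
+
+```elisp
+;; Invented, highly simplified predicate-argument structure for one
+;; sentence; real PropBank annotations are richer and point back into
+;; Treebank parses rather than holding raw strings.
+(defvar grail-demo-proposition
+  '(rise.01                   ; frameset label for the "go up" sense
+    (:arg1 "The bank")        ; the thing rising
+    (:rel  "rose")))          ; the predicate itself
+
+(defun grail-demo-prop-arg (prop role)
+  "Return the filler of ROLE (e.g. :arg1) in proposition PROP."
+  (cadr (assq role (cdr prop))))
+
+;; (grail-demo-prop-arg grail-demo-proposition :arg1)  => "The bank"
+```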
+
+- Q: Are there parallel efforts to analyze literary texts or news
+  articles? Pulling the ambiguity of meaning and not just the syntax
+  out of works? (Granted this may be out of your area; ignore as
+  desired)
+  - A: :-) Nothing that relates to "meaning" falls too far away
+    from where I would like to be. It is a very large landscape,
+    growing very fast, so it is hard to be everywhere at the same
+    time :-)
+  - Many people are working on trying to analyze literature.
+    Analyzing news stories has been happening since the beginning of
+    the statistical NLP revolution---sort of linked to the fact that
+    the first million "trees" were curated from WSJ articles :-)
+
+- Q: Have you considered support for conlangs, such as Toki Pona?  The
+ simplicity of Toki Pona seems like it would lend itself well to
+ machine processing.
+  - A: This is the first time I am hearing of conlangs and Toki
+    Pona. I would love to know more about them before saying more,
+    but I cannot imagine any language not being able to use this
+    framework.
+ - conlangs are "constructed languages" such as Esperanto ---
+ languages designed with intent, rather than evolved over
+ centuries.  Toki Pona is a minimal conlang created in 2001, with
+ a uniform syntax and small (<200 word) vocabulary.
+ - Thanks for the information! I would love to look into it.
+
+- Q: Is there a roadmap of sorts for GRAIL?
+ - A: 
+  - Yes. I am now actually using real-world annotations on large
+    corpora (both text and speech) and am validating the concept
+    further. I am sure there will be some bumps along the way, and I
+    am not saying that this is going to be a cure-all, but I feel
+    (after spending most of my professional life building/using
+    corpora) that this approach does seem very appealing to me. The
+    speed of its development will depend on how many buy into the
+    idea and pitch in, I guess.
+
+- Q: How can GRAIL be used by common people?
+  - A: I don't think it can be used by common people at the moment,
+    partly because most people have never heard of Emacs or Org
+    mode. But if we can validate the concept and it does "grow
+    legs" and walk out of the Emacs room into the larger universe,
+    then absolutely, anyone who has any say about language could use
+    it. And the contributions would be as useful as the consistency
+    with which one can capture a certain phenomenon.
+  - Every time you solve a captcha these days, the algorithms used
+    by the company storing the data get slightly better. What if we
+    could democratize this concept? That could lead to fascinating
+    things, like Wikipedia did for the sum total of human knowledge.
+
+
+
+
+
+[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]
+
+[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]
+
+[[!taglink CategoryLinguistics]]