Sameer Pradhan (he/him)
[[!inline pages="internal(2022/info/grail-before)" raw="yes"]]
[[!template id="help"
volunteer=""
summary="Q&A could be indexed with chapter markers"
tags="help_with_chapter_markers"
message="""The Q&A session for this talk does not have chapter markers yet.
Would you like to help? See [[help_with_chapter_markers]] for more details. You can use the vidid="grail-qanda" if adding the markers to this wiki page, or e-mail your chapter notes to <emacsconf-submit@gnu.org>."""]]

The human brain receives various signals that it assimilates (filters,
splices, corrects, etc.) to build a syntactic structure and its semantic
[…]

domain to language&#x2014;text and speech. Computational Linguistics (CL),
a.k.a. Natural Language Processing (NLP), is a sub-area of AI that tries to
interpret them. It involves modeling and predicting complex linguistic
structures from these signals. These models tend to rely heavily on a large
amount of ``raw'' (naturally occurring) data and a varying amount of
(manually) enriched data, commonly known as ``annotations''. The models are
only as good as the quality of the annotations. Owing to the complex and
numerous nature of linguistic phenomena, a divide-and-conquer approach is
common. The upside is that it allows one to focus on one, or a few, related
linguistic phenomena. The downside is that the universe of these phenomena
keeps expanding as language is context sensitive and evolves over time. For
example, depending on the context, the word ``bank'' can refer to a financial
institution, or the rising ground surrounding a lake, or something else. The
verb ``google'' did not exist before the company came into being.
Manually annotating data can be a very task-specific, labor-intensive
endeavor. Owing to this, advances in multiple modalities have happened in
[…]

a systematic fashion. Our approach is not tied to Emacs, but uses its many
built-in capabilities for creating and evaluating solution prototypes.

# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more systematic
  fashion and make the video available at the emacsconf/grail URL. My
  sense is that this URL will be active for the foreseeable future.
- I am going to try to revise some of the answers, which I typed
  quite quickly and which may not have provided useful context or may
  contain errors.
- Please feel free to email me at pradhan@cemantix.org with any
  further questions or discussions you may want to have with me, or
  to be part of the GRAIL community (which doesn't exist yet :-), or
  is a community of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A: Yes. The '92 corpus only annotated syntactic structure. It was
    probably the first time that the details captured in syntax were
    selected not purely based on linguistic accuracy, but on the
    consistency of such annotations across multiple annotators. This
    is often referred to as Inter-Annotator Agreement (IAA). The high
    IAA for this corpus was probably one of the reasons that parsers
    trained on it got accuracies in the mid 80s or so. Then over the
    next 30 years (and still continuing...) academics improved on
    parsers, and today the performance on the test set from this
    corpus is somewhere around an F-score of 95 (the sketch after
    this answer shows the arithmetic behind such scores). But this
    has to be taken with a big grain of salt, given overfitting and
    how many times people have seen the test set.
  - One thing that might be worth mentioning is that over the past 30
    years, many different phenomena have been annotated on parts of
    this corpus. However, as I mentioned, current tools and
    representations have difficulty integrating such disparate layers
    of annotation. Some of the issues relate to the complexity of the
    phenomena, and others to the brittleness of the representations.
    For example, I remember when we were building the OntoNotes
    corpus, there was a point where the guidelines were changed to
    split all words at a hyphen. That simple change caused a lot of
    heartache, because the interdependencies were not captured at a
    level that could be programmatically manipulated. That was around
    2007, when I decided to use a relational database architecture to
    represent the layers. The great thing is that it was an almost
    perfect representation, but for some reason it never caught on,
    because using a database to prepare data for training was
    something that was kind of unthinkable 15 years ago. Maybe?
    Anyway, the formats that are easiest to use are very rigid: you
    can quickly make use of them, but if something changes somewhere
    you have no idea whether the whole is still consistent. And when
    I came across Org mode sometime around 2011/12 (if I remember
    correctly), I thought it would be a great tool. And indeed, about
    a decade later, I am trying to stand on its and Emacs's
    shoulders.
  - This corpus was one of the first large-scale manually annotated
    corpora that bootstrapped the statistical natural language
    processing era. That can be considered the first wave... Since
    then, there have been more corpora built on the same philosophy.
    In fact, I spent about 8 years, starting about a decade ago,
    building a much larger corpus with more layers of information; it
    is called OntoNotes. It covers Chinese and Arabic as well (DARPA
    funding!), and it is freely available for research to anyone,
    anywhere. That was quite a feat.
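
For readers unfamiliar with how parser accuracy is scored, here is a
minimal Emacs Lisp sketch of the precision/recall/F-score arithmetic
referred to above. The function name and the constituent counts are
invented for illustration.

```elisp
;; Minimal sketch of the F-score arithmetic used to evaluate parsers.
;; The counts are invented for illustration only.
(defun grail-demo-f-score (correct predicted gold)
  "Return precision, recall, and F1 given CORRECT matches, the
number of PREDICTED constituents, and the number of GOLD constituents."
  (let* ((precision (/ (float correct) predicted))
         (recall (/ (float correct) gold))
         (f1 (/ (* 2 precision recall) (+ precision recall))))
    (list :precision precision :recall recall :f1 f1)))

;; Example: 95 of 100 predicted constituents match the gold standard,
;; which also has 100 constituents, giving an F-score of 0.95.
(grail-demo-f-score 95 100 100)
;; => (:precision 0.95 :recall 0.95 :f1 0.95)
```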

- Q: Is this only for natural languages like English, or is it more
  general? Would this be used for programming languages?
  - A: I am using English as a use case, but the idea is to have it
    completely multilingual.
  - I cannot think why you would want to use it for programming
    languages. In fact, the concept of an AST in programming
    languages was what I thought would be worth exploring in this
    area of research. Org mode, the way I sometimes view it, is a
    somewhat crude incarnation of that, and can be sort of manually
    built (see the org-element sketch after this answer), but the
    idea is to identify patterns and build upon them to create a
    larger collection of transformations that could be generally
    useful. That could help capture the abstract representation of
    "meaning" and help the models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data we are sitting in today. So, not surprisingly, these models
    tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same as
    the one the training data came from, then the models do really
    poorly. It is sometimes compared to learning a sine function
    using the points on the sine wave, as opposed to deriving the
    function itself. You can get close, but then you cannot really do
    a lot better with that model :-)
  - I did a brief stint at the Harvard Medical School/Boston
    Children's Hospital to see if we could use the same underlying
    philosophy to build better models for understanding clinical
    notes. It would be an extremely useful and socially beneficial
    use case, but after a few years, realizing that the legal and
    policy issues related to making such data available on a larger
    scale might need a few more decades, I decided to step off that
    wagon (if I am using the figure of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests that are taken by
    older people, from which neurologists can predict a potential
    early onset of some neurological disorder. The idea is to see if
    we can use speech and language signals to predict such cases
    early on. Since we don't have cures for those conditions yet, the
    best we can do is identify them earlier, with the hope that the
    progression can be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in
    the technologies, but to say that the models can "learn" is an
    extreme overstatement.
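
As a concrete illustration of the point above that Org can be viewed
as a crude AST: Org ships with a parser, org-element, that exposes a
document as a syntax tree you can walk programmatically. A minimal
sketch, with an invented example document:

```elisp
;; Sketch: Org's built-in parser exposes a document as a syntax tree,
;; loosely analogous to an AST in a programming language.
(require 'org)
(require 'org-element)

(with-temp-buffer
  (insert "* Sentence\n** Clause\nThe bank raised rates.\n")
  (org-mode)
  ;; Parse the buffer into a tree, then collect each headline's title
  ;; together with its nesting depth.
  (org-element-map (org-element-parse-buffer) 'headline
    (lambda (hl)
      (cons (org-element-property :raw-value hl)
            (org-element-property :level hl)))))
;; => (("Sentence" . 1) ("Clause" . 2))
```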

- Q: This reminds me of the advantages of pre-computer copy and
  paste: cutting up paper and rearranging it, but having more stuff
  travel along with your pieces.
  - A: Right! Kind of like that, but more "intelligent" than
    copy/paste, because you could have various local constraints that
    would ensure that the information stays consistent with the
    whole. I am also envisioning this as a use case for hooks. And if
    you can have rich local dependencies, then you can be sure (as
    much as you can) that the information signal is not too
    corrupted.
  - I had not read about the "cut up paper" approach you mentioned.
    That is an interesting thought. In fact, the kind of thing I
    was/am envisioning is that you can cut the paper a million ways,
    but still join the pieces back to form the original piece of
    paper (a small sketch of this invariant follows this answer).
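
The "cut the paper a million ways and join it back" idea can be
stated as a simple invariant: however a text is segmented, the
segments, stored with their character offsets, must concatenate back
to the original. A minimal sketch, with an invented function name and
segmentation:

```elisp
;; Sketch of the "cut and rejoin" invariant: any segmentation of a
;; text, stored with character offsets, must reassemble losslessly.
(defun grail-demo-segments (text offsets)
  "Split TEXT at the positions in OFFSETS.
Return a list of (BEG END STRING) triples."
  (let ((points (append '(0) offsets (list (length text))))
        (segments '()))
    (while (cdr points)
      (push (list (car points) (cadr points)
                  (substring text (car points) (cadr points)))
            segments)
      (setq points (cdr points)))
    (nreverse segments)))

(let* ((text "The bank raised rates.")
       (segments (grail-demo-segments text '(4 9))))
  ;; Rejoining the segments must reproduce the original text exactly.
  (equal text (mapconcat (lambda (s) (nth 2 s)) segments "")))
;; => t
```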

- Q: Have you used it in some real-life situation? Where have you
  experimented with this?
  - A: No. I am probably the only person who is doing this crazy
    thing. It would be nice, or rather I have a feeling, that
    something like this, if worked on for a while by many, might lead
    to a really potent tool for the masses. I feel strongly about
    giving such power to the users, so that they can edit and share
    the data openly and it is not stuck in some corporate vault
    somewhere :-) One thing at a time.
  - I am in the process of creating a minimal viable package and
    seeing where that goes.
  - The idea is to start within Emacs and Org mode but not
    necessarily be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or as something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence on positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data.
  - Absolutely. In fact, the project I mentioned, OntoNotes, has
    multiple layers of annotation, one of them being the
    propositional structure, which uses a large lexicon covering
    about 15K verbs and nouns and all of their argument structures
    that we have seen so far in the corpora. About a million
    "propositions" have been released (we just recently celebrated
    the 20th birthday of the corpus). It is called the PropBank (a
    sketch of what a proposition looks like follows this answer).
  - There is an interesting history of the "Banks". It started with
    Treebank, and then there was PropBank (with a capital B), and
    then we developed OntoNotes, which contains:
    - Syntax
    - Named Entities
    - Coreference Resolution
    - Propositions
    - Word Sense
  - All in the same whole, and across various genres... (can add more
    information here later...)
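
To make "propositional structure" concrete, here is what a single
PropBank-style proposition could look like if captured as an Emacs
Lisp plist. The variable and field names are invented for
illustration; this is not PropBank's actual release format.

```elisp
;; Illustrative sketch only: one PropBank-style proposition for the
;; sentence "The bank raised rates."  The field names are invented;
;; PropBank's real releases use their own file formats.
(setq grail-demo-proposition
      '(:predicate "raise"
        :sense "raise.01"                        ; lexicon (roleset) id
        :args ((:role "ARG0" :text "The bank")   ; the one raising
               (:role "ARG1" :text "rates"))))   ; the thing raised

;; Such a structure can then be queried programmatically, e.g. to
;; list the argument roles attached to the predicate:
(mapcar (lambda (arg) (plist-get arg :role))
        (plist-get grail-demo-proposition :args))
;; => ("ARG0" "ARG1")
```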

- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the
  syntax, out of works? (Granted, this may be out of your area --
  ignore as desired.)
  - A: :-) Nothing that relates to "meaning" falls too far away from
    where I would like to be. It is a very large landscape, and
    growing very fast, so it is hard to be everywhere at the same
    time :-)
  - Many people are working on trying to analyze literature.
    Analyzing news stories has been happening since the beginning of
    the statistical NLP revolution---sort of linked to the fact that
    the first million "trees" were curated using WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona.
    I would love to know more about them before saying more, but I
    cannot imagine any language not being able to use this framework.
  - Conlangs are "constructed languages" such as Esperanto ---
    languages designed with intent, rather than evolved over
    centuries. Toki Pona is a minimal conlang created in 2001, with a
    uniform syntax and a small (<200 word) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now using real-world annotations on large corpora,
    both text and speech, and am validating the concept further. I am
    sure there will be some bumps along the way, and I am not saying
    that this is going to be a cure-all, but (after spending most of
    my professional life building and using corpora) this approach
    does seem very appealing to me. The speed of its development will
    depend on how many people buy into the idea and pitch in, I
    guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at the moment,
    partly because most people have never heard of Emacs or Org mode.
    But if we can validate the concept, and if it does "grow legs"
    and walk out of the Emacs room into the larger universe, then
    absolutely, anyone who has anything to say about language could
    use it. And the contributions would be as useful as the
    consistency with which one can capture a given phenomenon.
  - Every time you solve a captcha these days, the algorithms used by
    the company storing the data get slightly better. What if we
    could democratize this concept? That could lead to fascinating
    things, like Wikipedia did for the sum total of human knowledge.
[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]