a systematic fashion. Our approach is not tied to Emacs, but uses its many
built-in capabilities for creating and evaluating solution prototypes.

# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more systematic
  fashion and make the video available at the emacsconf/grail URL. My
  sense is that this URL will be active for the foreseeable future.
- I am going to try to revise some of the answers, which I typed quite
  quickly and which may lack useful context or contain errors.
- Please feel free to email me at pradhan@cemantix.org with any further
  questions or discussions you may want to have with me, or to be part
  of the grail community (which doesn't exist yet :-), or is a
  community of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A:
    - Yes. The '92 corpus only annotated syntactic structure. It was
      probably the first time that the details captured in syntax were
      selected not purely on the basis of linguistic accuracy, but on
      the consistency of such annotations across multiple annotators.
      This is often referred to as Inter-Annotator Agreement (IAA). The
      high IAA for this corpus was probably one of the reasons that
      parsers trained on it reached accuracies in the mid 80s or so.
      Over the next 30 years (and still continuing) academics improved
      on parsers, and today the performance on the test set from this
      corpus is somewhere around an F-score of 95 (see the first sketch
      after this answer). But this has to be taken with a big grain of
      salt, given overfitting and how many times people have seen the
      test set.
    - One thing that might be worth mentioning is that over the past 30
      years, many different phenomena have been annotated on parts of
      this corpus. However, as I mentioned, current tools and
      representations have difficulty integrating disparate layers of
      annotation: some issues relate to the complexity of the
      phenomena, others to the brittleness of the representations. For
      example, I remember that when we were building the OntoNotes
      corpus, there was a point where the guidelines were changed to
      split all words at a hyphen. That simple change caused a lot of
      heartache, because the interdependencies were not captured at a
      level that could be programmatically manipulated. That was around
      2007, when I decided to use a relational database architecture to
      represent the layers (see the second sketch after this answer).
      It was an almost perfect representation, but for some reason it
      never caught on; maybe using a database to prepare data for
      training was kind of unthinkable 15 years ago. The format that is
      easiest to use is very rigid: you can quickly make use of it, but
      if something changes somewhere, you have no idea whether the
      whole is still consistent. When I came across org-mode sometime
      around 2011/12 (if I remember correctly), I thought it would be a
      great tool, and indeed, about a decade later, I am trying to
      stand on its and Emacs's shoulders.
    - This corpus was one of the first large-scale manually annotated
      corpora, and it bootstrapped the statistical natural language
      processing era. That can be considered the first wave. Since
      then, more corpora have been built on the same philosophy. In
      fact, about a decade ago I spent roughly 8 years building a much
      larger corpus, called OntoNotes, with more layers of information.
      It covers Chinese and Arabic as well (DARPA funding!), and it is
      freely available for research to anyone anywhere. That was quite
      a feat.
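
To make the numbers above concrete, here is a minimal sketch, in
Python, of the two measurements mentioned: a raw agreement rate between
two annotators, and the F-score used to report parser performance. The
labels and counts are invented for illustration; this is not the actual
Penn Treebank evaluation.

```python
# Illustrative only: toy inter-annotator agreement and F-score numbers.

def observed_agreement(ann_a, ann_b):
    """Fraction of items on which two annotators chose the same label."""
    assert len(ann_a) == len(ann_b)
    return sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)

def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    return 2 * precision * recall / (precision + recall)

# Two annotators labeling the same six constituents (made-up data).
ann_a = ["NP", "VP", "NP", "PP", "NP", "VP"]
ann_b = ["NP", "VP", "NP", "NP", "NP", "VP"]
print(observed_agreement(ann_a, ann_b))  # 0.833...

# A parser with 95% precision and 95% recall scores an F1 of 0.95.
print(f_score(0.95, 0.95))               # 0.95
```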
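
And to ground the hyphen-splitting anecdote, here is a minimal sketch
of the relational idea, using Python's built-in sqlite3. The schema is
invented for illustration (it is not the actual OntoNotes design); the
point is that when every layer references token ids rather than raw
text, a retokenization can be located and propagated programmatically.

```python
import sqlite3

# Toy standoff schema: annotation layers point at token ids, not text.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE token (id INTEGER PRIMARY KEY, sentence INTEGER, form TEXT);
CREATE TABLE span_annotation (
    layer       TEXT,     -- e.g. 'syntax', 'ner', 'coref'
    label       TEXT,
    first_token INTEGER REFERENCES token(id),
    last_token  INTEGER REFERENCES token(id)
);
""")
db.executemany("INSERT INTO token VALUES (?, ?, ?)",
               [(1, 1, "The"), (2, 1, "well-known"), (3, 1, "author")])
db.execute("INSERT INTO span_annotation VALUES ('syntax', 'NP', 1, 3)")

# Before splitting token 2 at its hyphen, find every annotation that
# covers it; these are exactly the layers that must be updated in step.
hits = db.execute("""SELECT layer, label FROM span_annotation
                     WHERE first_token <= ? AND last_token >= ?""",
                  (2, 2)).fetchall()
print(hits)  # [('syntax', 'NP')]
```
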
- Q: Is this only for natural languages like English, or is it more
  general? Could it be used for programming languages?
  - A: I am using English as a use case, but the idea is for it to be
    completely multilingual.
  - I cannot think of why you would want to use it for programming
    languages. In fact, the concept of an AST in programming languages
    was what I thought would be worth exploring in this area of
    research. Org Mode, the way I sometimes view it, is a somewhat
    crude incarnation of that idea which can be built manually, but the
    aim is to identify patterns and build upon them to create a larger
    collection of generally useful transformations. That could help
    capture the abstract representation of "meaning" and help the
    models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data that we are sitting in today. So, not surprisingly, these
    models tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same as
    the one the training data came from, the models do really poorly.
    It is sometimes compared to learning a sine function from points
    sampled on the sine wave, as opposed to deriving the function
    itself: you can get close, but you cannot really do a lot better
    with that model (see the sketch after this answer) :-)
  - I did a brief stint at Harvard Medical School / Boston Children's
    Hospital to see if we could use the same underlying philosophy to
    build better models for understanding clinical notes. It would be
    an extremely useful and socially beneficial use case, but after a
    few years, realizing that the legal and policy issues related to
    making such data available on a larger scale might need a few more
    decades to resolve, I decided to step off that wagon (if I am using
    the figure of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests taken by older
    people, from which neurologists can predict the potential early
    onset of some neurological disorders. The idea is to see if we can
    use speech and language signals to predict such cases early on.
    Since we don't have cures for those conditions yet, the best we can
    do is identify them earlier, in the hope that the progression can
    be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in the
    technologies, but to say that the models can "learn" is an extreme
    overstatement.

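The sine analogy above can be made concrete in a few lines of numpy: a
high-degree polynomial fitted to points sampled from sin(x) is nearly
perfect on the interval it was trained on, and falls apart as soon as
you leave it. A minimal, illustrative sketch (the degree and sample
count are arbitrary):

```python
import numpy as np

# Fit a degree-9 polynomial to points sampled from one period of sine.
x_train = np.linspace(0, 2 * np.pi, 20)
coeffs = np.polyfit(x_train, np.sin(x_train), deg=9)

# Near-perfect where it saw data, wildly wrong outside that interval:
# the fit memorized the points without "deriving" the function.
for x in (np.pi, 4 * np.pi):
    print(x, np.polyval(coeffs, x) - np.sin(x))
```
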
- Q: Reminds me of the advantages of pre-computer copy and paste: cut
  up paper and rearrange it, but having more stuff with your pieces.
  - A: Right!
  - Kind of like that, but more "intelligent" than copy/paste, because
    you could have various local constraints that ensure the
    information stays consistent with the whole. I am also envisioning
    this as a use case for hooks (see the sketch after this answer).
    And if you can have rich local dependencies, then you can be sure
    (as much as you can) that the information signal is not too
    corrupted.
  - I had not read about the "cut up paper" technique you mentioned.
    That is an interesting thought. In fact, the kind of thing I was/am
    envisioning is that you can cut the paper a million ways and still
    join the pieces back to form the original piece of paper.
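
One way to read the hooks idea above, as a minimal sketch: each
fragment carries local validity checks, and the editor refuses a paste
that would break any of them. The constraint functions and fragment
format here are invented for illustration; they are not part of GRAIL.

```python
# Sketch of "intelligent paste": each fragment carries local
# constraints (hooks) that must still hold after recombination.

def tokens_match_text(fragment):
    """The token list must reassemble into the fragment's raw text."""
    return " ".join(fragment["tokens"]) == fragment["text"]

def spans_cover_real_tokens(fragment):
    """Every annotated span must point at existing token positions."""
    n = len(fragment["tokens"])
    return all(0 <= start <= end < n for start, end in fragment["spans"])

HOOKS = [tokens_match_text, spans_cover_real_tokens]

def paste(document, fragment):
    """Merge a fragment only if all of its local constraints hold."""
    if not all(hook(fragment) for hook in HOOKS):
        raise ValueError("fragment is inconsistent; refusing to paste")
    document.append(fragment)

doc = []
paste(doc, {"text": "a small step",
            "tokens": ["a", "small", "step"],
            "spans": [(1, 2)]})
print(len(doc))  # 1
```
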
- Q: Have you used it in some real-life situation?
  - A: No.
  - I am probably the only person doing this crazy thing. I have a
    feeling that something like this, if worked on for a while by
    many, might lead to a really potent tool for the masses. I feel
    strongly about giving such power to the users, so that they can
    edit and share the data openly and it is not stuck in some
    corporate vault somewhere :-) One thing at a time.
  - I am in the process of creating a minimally viable package and
    will see where that goes.
  - The idea is to start within Emacs and Org mode but not necessarily
    be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or as something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence for positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data.
  - Absolutely. In fact, the OntoNotes project I mentioned has multiple
    layers of annotation, one of them being the propositional
    structure, which uses a large lexicon covering about 15K verbs and
    nouns and all of their argument structures that we have seen so far
    in the corpora (see the sketch after this answer). About a million
    "propositions" have been released; we just recently celebrated the
    20th birthday of the corpus. It is called the PropBank.
  - There is an interesting history of the "Banks". It started with
    Treebank, and then there was PropBank (with a capital B), and then
    we developed OntoNotes, which contains:
    - Syntax
    - Named Entities
    - Coreference Resolution
    - Propositions
    - Word Sense
  - All in the same whole and across various genres... (can add more
    information here later...)

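For readers unfamiliar with the propositional layer: each predicate
instance is paired with a frame from the lexicon, and its arguments are
labeled with roles such as ARG0 and ARG1. A hand-made sketch of the
general shape of such a record (the sentence and frame are invented for
illustration, not actual PropBank data):

```python
# A hand-made, PropBank-style proposition for the sentence
# "The committee awarded the prize to the author."
proposition = {
    "predicate": "award.01",      # frame id drawn from the lexicon
    "args": {
        "ARG0": "The committee",  # the giver
        "ARG1": "the prize",      # the thing given
        "ARG2": "to the author",  # the recipient
    },
}
for role, text in proposition["args"].items():
    print(role, "->", text)
```
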
- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the syntax,
  out of works? (Granted, this may be out of your area; ignore as
  desired.)
  - A: :-) Nothing that relates to "meaning" falls too far from where I
    would like to be. It is a very large landscape, and it is growing
    very fast, so it is hard to be everywhere at the same time :-)
  - Many people are working on analyzing literature. Analyzing news
    stories has been happening since the beginning of the statistical
    NLP revolution, sort of linked to the fact that the first million
    "trees" were curated from WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona. I
    would love to know more about them before saying more, but I cannot
    imagine any language being unable to use this framework.
  - Conlangs are "constructed languages" such as Esperanto: languages
    designed with intent, rather than evolved over centuries. Toki Pona
    is a minimal conlang created in 2001, with a uniform syntax and a
    small (<200 word) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now using real-world annotations on large corpora,
    both text and speech, and am validating the concept further. I am
    sure there will be some bumps along the way, and I am not saying
    that this is going to be a cure-all, but (after spending most of my
    professional life building and using corpora) this approach does
    seem very appealing to me. The speed of its development will depend
    on how many people buy into the idea and pitch in, I guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at the moment,
    partly because most people have never heard of Emacs or org-mode.
    But if we can validate the concept and it does "grow legs" and walk
    out of the Emacs room into the larger universe, then absolutely,
    anyone who has anything to say about language could use it. And the
    contributions would be as useful as the consistency with which one
    can capture a certain phenomenon.
  - Every time you use a captcha these days, the algorithms used by the
    company storing the data get slightly better. What if we could
    democratize this concept? That could lead to fascinating things,
    like Wikipedia did for the sum total of human knowledge.

[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]