[[!sidebar content=""]]
[[!meta title="GRAIL---A Generalized Representation and Aggregation of Information Layers"]]
[[!meta copyright="Copyright © 2022 Sameer Pradhan"]]
[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-generate-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. -->


# GRAIL---A Generalized Representation and Aggregation of Information Layers
Sameer Pradhan (he/him)

[[!inline pages="internal(2022/info/grail-before)" raw="yes"]]
[[!template id="help"
volunteer=""
summary="Q&A could be indexed with chapter markers"
tags="help_with_chapter_markers"
message="""The Q&A session for this talk does not have chapter markers yet.
Would you like to help? See [[help_with_chapter_markers]] for more details. You can use the vidid="grail-qanda" if adding the markers to this wiki page, or e-mail your chapter notes to <emacsconf-submit@gnu.org>."""]]


The human brain receives various signals that it assimilates (filters,
splices, corrects, etc.) to build a syntactic structure and its semantic
interpretation. This complex process is what enables human communication.
The field of artificial intelligence (AI) is devoted to studying how we
generate symbols and derive meaning from such signals, and to building
predictive models that allow effective human-computer interaction.

For the purpose of this talk, we will limit the scope of signals to the
domain of language: text and speech. Computational Linguistics (CL),
a.k.a. Natural Language Processing (NLP), is the sub-area of AI that tries
to interpret such signals. It involves modeling and predicting complex
linguistic structures from them.
These models tend to rely heavily on a large
amount of "raw" (naturally occurring) data and a varying amount of
(manually) enriched data, commonly known as "annotations". The models are
only as good as the quality of the annotations. Owing to the complex and
numerous nature of linguistic phenomena, a divide-and-conquer approach is
common. The upside is that it allows one to focus on one, or a few, related
linguistic phenomena. The downside is that the universe of these phenomena
keeps expanding, because language is context sensitive and evolves over
time. For example, depending on the context, the word "bank" can refer to
a financial institution, the rising ground surrounding a lake, or something
else. The verb "google" did not exist before the company came into being.

Manually annotating data can be a very task-specific, labor-intensive
endeavor. Owing to this, advances in the different modalities have happened
in silos until recently. Recent advances in computer hardware and machine
learning algorithms have opened doors to the interpretation of multimodal
data. However, the need to piece together such related but disjoint
predictions poses a huge challenge.

This brings us to the two questions that we will try to address in this
talk:

1. How can we come up with a unified representation of data and annotations that encompasses arbitrary levels of linguistic information? and,

2. What role might Emacs play in this process?

Emacs provides a rich environment for editing and manipulating the
recursive, embedded structures found in programming languages. Its view of
text, however, is more or less linear: strings broken into words, strings
ended by periods, strings identified using delimiters, etc. It does not
assume embedded or recursive structure in text. However, the process of
interpreting natural language involves operating on such structures. What
if we could adapt Emacs to manipulate rich structures derived from text?
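As a toy sketch of the first question (this is not GRAIL itself; all names
here are hypothetical illustrations), one way to picture a unified
representation is standoff annotation: each layer records labeled character
spans over a shared base text, and aggregation merges independently
produced layers without ever mutating the text:

```python
# Hypothetical sketch of layered standoff annotation (not the GRAIL format).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str

@dataclass
class Layer:
    name: str
    spans: list = field(default_factory=list)

def aggregate(text, layers):
    """Merge layers into {(start, end): {layer_name: label}}.

    Independently produced layers (tokens, senses, entities, ...) can then
    be viewed together, keyed by the spans of the shared base text."""
    merged = {}
    for layer in layers:
        for s in layer.spans:
            merged.setdefault((s.start, s.end), {})[layer.name] = s.label
    return merged

text = "I walked along the bank."
tokens = Layer("token", [Span(19, 23, "NN")])
senses = Layer("sense", [Span(19, 23, "bank%river_edge")])
print(aggregate(text, [tokens, senses])[(19, 23)])
# → {'token': 'NN', 'sense': 'bank%river_edge'}
```

Because layers only reference offsets into a shared text, they can be
added, removed, or regenerated independently; the hard part, which a system
like GRAIL targets, is keeping such layers mutually consistent as the text,
tokenization, and annotation guidelines evolve.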
Unlike
programming languages, which are designed to be parsed and interpreted
deterministically, the interpretation of statements in natural languages
frequently has to deal with phenomena such as ambiguity, inconsistency,
and incompleteness, and can get quite complex.

We present an architecture (GRAIL) which utilizes the capabilities of Emacs
to allow the representation and aggregation of such rich structures in
a systematic fashion. Our approach is not tied to Emacs, but uses its many
built-in capabilities for creating and evaluating solution prototypes.


# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more
  systematic fashion and make the video available at the
  emacsconf/grail URL. My sense is that this URL will be active for
  the foreseeable future.
- I am going to try to revise some of the answers, which I typed quite
  quickly and in which I may not have provided useful context or might
  have made errors.
- Please feel free to email me at pradhan@cemantix.org for any further
  questions or discussions you may want to have with me, or to be part
  of the GRAIL community (which doesn't exist yet :-), or is a community
  of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A: Yes. The '92 corpus only annotated syntactic structure. It was
    probably the first time that the details captured in syntax were
    selected not purely based on linguistic accuracy, but on the
    consistency of such annotations across multiple annotators. This
    is often referred to as Inter-Annotator Agreement (IAA). The high
    IAA for this corpus was probably one of the reasons that parsers
    trained on it got accuracies in the mid 80s or so. Over the
    next 30 years (and still continuing...) academics improved on
    parsers, and today the performance on the test set from this
    corpus is somewhere around an F-score of 95.
    But this has to be
    taken with a big grain of salt, given overfitting and how many
    times people have seen the test set.
  - One thing that might be worth mentioning is that over the past 30
    years, many different phenomena have been annotated on parts of
    this corpus. However, as I mentioned, current tools and
    representations have had difficulty integrating such disparate
    layers of annotation. Some of the issues are related to the
    complexity of the phenomena and others to the brittleness of the
    representations. For example, I remember when we were building the
    OntoNotes corpus, there was a point where the guidelines were
    changed to split all words at a hyphen. That simple change caused
    a lot of heartache, because the interdependencies were not captured
    at a level that could be programmatically manipulated. That was
    around 2007, when I decided to use a relational database
    architecture to represent the layers. The great thing is that it
    was an almost perfect representation, but for some reason it never
    caught on, because using a database to prepare data for training
    was kind of unthinkable 15 years ago. Maybe? Anyway, the usual
    format is the easiest to use but very rigid, in the sense that you
    can quickly make use of it, but if something changes somewhere you
    have no idea whether the whole is still consistent. When I came
    across Org mode sometime around 2011/12 (if I remember correctly)
    I thought it would be a great tool. And indeed, about a decade
    later, I am trying to stand on its and Emacs's shoulders.
  - This corpus was one of the first large-scale manually annotated
    corpora that bootstrapped the statistical natural language
    processing era. That can be considered the first wave...
    Since then, more corpora have been built on the same philosophy.
    In fact, about a decade ago I spent about 8 years building a much
    larger corpus, with more layers of information, called OntoNotes.
    It covers Chinese and Arabic as well (DARPA funding!). It is freely
    available for research to anyone, anywhere. That was quite a feat.
- Q: Is this only for natural languages like English, or is it more
  general? Could it be used for programming languages?
  - A: I am using English as a use case, but the idea is for it to be
    completely multilingual.
  - I cannot think why you would want to use it for programming
    languages. In fact, the concept of an AST in programming
    languages was what I thought would be worth exploring in this
    area of research. Org mode, the way I sometimes view it, is a
    somewhat crude incarnation of that and can be sort of manually
    built, but the idea is to identify patterns and build upon them
    to create a larger collection of transformations that could be
    generally useful. That could help capture an abstract
    representation of "meaning" and help the models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data that we are sitting in today. So, not surprisingly, these
    models tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same
    as the one the training data came from, then the models do really
    poorly. It is sometimes compared to learning a sine function by
    memorizing points on the sine wave, as opposed to deriving the
    function itself. You can get close, but then you cannot really do
    a lot better with that model :-)
  - I did a brief stint at Harvard Medical School/Boston Children's
    Hospital to see if we could use the same underlying philosophy to
    build better models for understanding clinical notes.
    It would be an extremely useful and socially beneficial use case,
    but after a few years, realizing that the legal and policy issues
    related to making such data available on a larger scale might need
    a few more decades, I decided to step off that wagon (if I am
    using the figure of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests that are taken by
    older people, from which neurologists can predict the potential
    early onset of some neurological disorders. The idea is to see if
    we can use speech and language signals to predict such cases early
    on. Given that we don't have cures for those conditions yet, the
    best we can do is identify them earlier, with the hope that the
    progression can be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in
    the technologies, but saying that the models can "learn" is an
    extreme overstatement.

- Q: Reminds me of the advantages of pre-computer copy and paste. Cut
  up paper and rearrange, but having more stuff with your pieces.
  - A: Right! Kind of like that, but more "intelligent" than
    copy/paste, because you could have various local constraints that
    would ensure that the information stays consistent with the whole.
    I am also envisioning this as a use case for hooks. And if you
    have rich local dependencies, then you can be sure (as much as you
    can) that the information signal is not too corrupted.
  - I had not heard of the "cut up paper" idea you mentioned. That is
    an interesting thought. In fact, the kind of thing I was/am
    envisioning is that you can cut the paper a million ways and still
    join the pieces back to form the original piece of paper.

- Q: Have you used it in some real-life situation? Where have you experimented with this?
  - A: No.
  - I am probably the only person doing this crazy thing. It would be
    nice, or rather I have a feeling, that something like this, if
    worked on for a while by many people, might lead to a really
    potent tool for the masses. I feel strongly about giving such
    power to the users, so that they can edit and share the data
    openly and it is not stuck in some corporate vault somewhere :-)
    One thing at a time.
  - I am in the process of creating a minimal viable package and will
    see where that goes.
  - The idea is to start within Emacs and Org mode, but not
    necessarily be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence on positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data. Absolutely. In fact, the project I mentioned,
    OntoNotes, has multiple layers of annotation. One of them is the
    propositional structure, which uses a large lexicon that covers
    about 15K verbs and nouns and all of their argument structures
    that we have seen so far in the corpora. There are about a million
    "propositions" that have been released recently (we just
    celebrated the 20th birthday of the corpus). It is called the
    PropBank.
  - There is an interesting history of the "Banks". It started with
    Treebank, and then there was PropBank (with a capital B), and then
    we developed OntoNotes, which contains:
    - Syntax
    - Named Entities
    - Coreference Resolution
    - Propositions
    - Word Sense
  - All in the same whole, and across various genres... (can add more
    information here later...)

- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the syntax,
  out of works?
  (Granted this may be out of your area; ignore as desired)
  - A: :-) Nothing that relates to "meaning" falls too far away from
    where I would like to be. It is a very large landscape, growing
    very fast, so it is hard to be everywhere at the same time :-)
  - Many people are working on trying to analyze literature. Analyzing
    news stories has been happening since the beginning of the
    statistical NLP revolution---sort of linked to the fact that the
    first million "trees" were curated from WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona.
    I would need to know more about them to say more, but I cannot
    imagine any language not being able to use this framework.
  - Conlangs are "constructed languages", such as Esperanto:
    languages designed with intent, rather than evolved over
    centuries. Toki Pona is a minimal conlang created in 2001, with a
    uniform syntax and a small (<200 word) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now using real-world annotations on large
    corpora---both text and speech---and am validating the concept
    further. I am sure there will be some bumps along the way, and I
    am not saying that this is going to be a cure-all, but I feel
    (after spending most of my professional life building/using
    corpora) that this approach is very appealing. The speed of its
    development will depend on how many people buy into the idea and
    pitch in, I guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at the very
    moment, partly because most people have never heard of Emacs or
    Org mode. But if we can validate the concept, and if it
But if we can valide the concept and if it + does "grow legs" and walk out of the emacs room into the + larger universe, then absolutely, anyone who can have any say + about langauge could use it. And the contributions would be as + useful as the consistency with which one can capture a certain + phenomena. + - . + - Everytime you use a capta these days, the algorithms used by the + company storing the data get slightly better. What if we could + democratize this concept. That could lead to fascinating things. + Like Wikipedia did for the sum total of human knowledge. + +- Q: + - A: + + + + +[[!inline pages="internal(2022/info/grail-after)" raw="yes"]] + +[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]] + +[[!taglink CategoryLinguistics]] |