From c6c2d25cb561946e993e5dc5919afed8017cd087 Mon Sep 17 00:00:00 2001
From: Sacha Chua
Date: Thu, 8 Dec 2022 20:18:23 -0500
Subject: add etherpads to wiki pages

---
 2022/talks/grail.md | 240 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
(limited to '2022/talks/grail.md')

diff --git a/2022/talks/grail.md b/2022/talks/grail.md
index 824f76b3..b53efc44 100644
--- a/2022/talks/grail.md
+++ b/2022/talks/grail.md
@@ -67,6 +67,246 @@ a systematic fashion. Our approach is not tied to Emacs, but uses its many
 built-in capabilities for creating and evaluating solution prototypes.

# Discussion

## Notes

- I plan to fix the issues with the subtitles in a more systematic
  fashion and make the video available at the emacsconf/grail URL. My
  sense is that this URL will be active for the foreseeable future.
- I am going to try to revise some of the answers, which I typed quite
  quickly and which may not have provided useful context or may have
  contained errors.
- Please feel free to email me at pradhan@cemantix.org with any further
  questions or discussions you may want to have with me, or to be part
  of the GRAIL community (which doesn't exist yet :-), or is a
  community of 1).

## Questions and answers

- Q: Has the feat of the '92 UPenn corpus of articles been reproduced
  over and over again using these tools?
  - A: Yes. The '92 corpus only annotated syntactic structure. It was
    probably the first time that the details captured in syntax were
    selected not purely based on linguistic accuracy, but on the
    consistency of such annotations across multiple annotators. This is
    often referred to as Inter-Annotator Agreement (IAA). The high IAA
    for this corpus was probably one of the reasons that parsers
    trained on it reached accuracies in the mid 80s or so. Over the
    next 30 years (and still continuing), academics improved on
    parsers, and today the performance on the test set from this corpus
    is somewhere around an F-score of 95. But that has to be taken with
    a big grain of salt, given overfitting and how many times people
    have seen the test set.
  - One thing that might be worth mentioning is that over the past 30
    years, many different phenomena have been annotated on parts of
    this corpus. However, as I mentioned, current tools and
    representations make it difficult to integrate such disparate
    layers of annotation; some of the issues relate to the complexity
    of the phenomena and others to the brittleness of the
    representations. For example, I remember that when we were building
    the OntoNotes corpus, there was a point where the guidelines were
    changed to split all words at a hyphen. That simple change caused a
    lot of heartache, because the interdependencies were not captured
    at a level that could be programmatically manipulated. That was
    around 2007, when I decided to use a relational database
    architecture to represent the layers (a small sketch of that idea
    appears after this question). The great thing is that it was an
    almost perfect representation, but for some reason it never caught
    on, because using a database to prepare data for training was kind
    of unthinkable 15 years ago. Maybe? Anyway, the flat format that is
    easiest to use is very rigid: you can quickly make use of it, but
    if something changes somewhere, you have no idea whether the whole
    is still consistent. When I came across Org mode sometime around
    2011/12 (if I remember correctly), I thought it would be a great
    tool. And indeed, about a decade later, I am trying to stand on its
    and Emacs' shoulders.
  - This corpus was one of the first large-scale manually annotated
    corpora that bootstrapped the statistical natural language
    processing era. That can be considered the first wave. Since then,
    more corpora have been built on the same philosophy. In fact, I
    spent about 8 years, starting about a decade ago, building a much
    larger corpus with more layers of information, called OntoNotes. It
    covers Chinese and Arabic as well (DARPA funding!) and is freely
    available for research to anyone anywhere. That was quite a feat.
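To make the relational-database idea above concrete, here is a minimal
sketch. It is not the actual OntoNotes schema; the table layout, column
names, and the toy consistency query are assumptions invented for
illustration. The point is only that when annotation layers live in one
store, a cross-layer dependency (such as spans left stale by a
tokenization guideline change) can be checked programmatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token      (doc TEXT, idx INTEGER, form TEXT,
                         PRIMARY KEY (doc, idx));
CREATE TABLE annotation (layer TEXT, doc TEXT,
                         first_tok INTEGER, last_tok INTEGER, label TEXT);
""")

# Tokens after a hypothetical guideline change that splits words at hyphens.
conn.executemany("INSERT INTO token VALUES (?, ?, ?)",
                 [("d1", 0, "state"), ("d1", 1, "-"),
                  ("d1", 2, "owned"), ("d1", 3, "banks")])

# Two annotation layers; the second still points at a pre-split token index.
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
                 [("syntax", "d1", 0, 3, "NP"),
                  ("ner",    "d1", 0, 9, "ORG")])

# Consistency check: flag annotations whose spans no longer resolve to tokens.
stale = conn.execute("""
    SELECT a.layer, a.first_tok, a.last_tok
    FROM annotation AS a
    LEFT JOIN token AS t ON t.doc = a.doc AND t.idx = a.last_tok
    WHERE t.idx IS NULL
""").fetchall()
print("inconsistent annotations:", stale)   # -> [('ner', 0, 9)]
```

This kind of check is exactly what a flat, file-based exchange format
makes hard to express, which is the brittleness the answer refers to.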
- Q: Is this only for natural languages like English, or is it more
  general? Could it be used for programming languages?
  - A: I am using English as a use case, but the idea is for it to be
    completely multilingual.
  - I cannot think why you would want to use it for programming
    languages. In fact, the concept of an AST in programming languages
    was what I thought would be worth exploring in this area of
    research. Org mode, the way I sometimes view it, is a somewhat
    crude incarnation of that: it can be sort of manually built, but
    the idea is to identify patterns and build upon them to create a
    larger collection of transformations that could be generally
    useful. That could help capture the abstract representation of
    "meaning" and help the models learn better.
  - These days most models are trained on a boatload of data, and no
    matter how much data you use to train your largest model, it is
    still going to be a small speck in the universe of ever-growing
    data we are sitting in today. So, not surprisingly, these models
    tend to overfit the data they are trained on.
  - So, if you have a smaller data set which is not quite the same as
    the one the training data came from, the models do really poorly.
    It is sometimes compared to learning a sine function using the
    points on the sine wave, as opposed to deriving the function
    itself. You can get close, but then you cannot really do a lot
    better with that model :-)
  - I did a brief stint at Harvard Medical School/Boston Children's
    Hospital to see if we could use the same underlying philosophy to
    build better models for understanding clinical notes. It would be
    an extremely useful and socially beneficial use case, but after a
    few years, realizing that the legal and policy issues related to
    making such data available on a larger scale might need a few more
    decades, I decided to step off that wagon (if I am using the figure
    of speech correctly).
  - More recently, since I joined the Linguistic Data Consortium, we
    have been looking at spoken neurological tests taken by older
    people, from which neurologists can predict a potential early onset
    of some neurological disorder. The idea is to see if we can use
    speech and language signals to predict such cases early on. Given
    that we don't have cures for those conditions yet, the best we can
    do is identify them earlier, with the hope that the progression can
    be slowed down.
  - This is sort of what is happening with the deep learning hype. It
    is not to say that there hasn't been significant advancement in the
    technologies, but to say that the models can "learn" is an extreme
    overstatement.

- Q: This reminds me of the advantages of pre-computer copy and paste:
  cut up paper and rearrange it, but having more stuff with your
  pieces.
  - A: Right! Kind of like that, but more "intelligent" than
    copy/paste, because you could have various local constraints that
    would ensure that the information is consistent with the whole. I
    am also envisioning this as a use case for hooks. And if you can
    have rich local dependencies, then you can be sure (as much as you
    can) that the information signal is not too corrupted.
  - I had not read about the "cut up paper" approach you mentioned.
    That is an interesting thought. In fact, the kind of thing I was/am
    envisioning is that you can cut the paper a million ways, but you
    can still join the pieces back to form the original piece of paper.

- Q: Have you used it in some real-life situation?
  - A: No. I am probably the only person doing this crazy thing. It
    would be nice, or rather I have a feeling, that something like
    this, if worked on for a while by many, might lead to a really
    potent tool for the masses. I feel strongly about giving such power
    to the users, and about being able to edit and share the data
    openly so that it is not stuck in some corporate vault somewhere
    :-) One thing at a time.
  - I am in the process of creating a minimally viable package and will
    see where that goes.
  - The idea is to start within Emacs and Org mode, but not necessarily
    be limited to them.

- Q: Do you see this as a format for this type of annotation
  specifically, or as something more general that can be used for
  interlinear glosses, lexicons, etc.? Does word sense include a
  valence for positive or negative words (mood)?
  - A: Interesting question. There are sub-corpora that have some of
    this data.
  - Absolutely. In fact, the OntoNotes project I mentioned has multiple
    layers of annotation, one of them being the propositional
    structure, which uses a large lexicon covering about 15K verbs and
    nouns and all their argument structures seen so far in the corpora.
    About a million "propositions" have been released recently (we just
    celebrated the 20th birthday of the corpus). It is called the
    PropBank.

- There is an interesting history of the "Banks". It started with
  Treebank, and then there was PropBank (with a capital B), and then we
  developed OntoNotes, which contains:
  - Syntax
  - Named Entities
  - Coreference Resolution
  - Propositions
  - Word Senses

- All in the same whole and across various genres... (can add more
  information here later...)

- Q: Are there parallel efforts to analyze literary texts or news
  articles? Pulling the ambiguity of meaning, and not just the syntax,
  out of works? (Granted, this may be out of your area; ignore as
  desired.)
  - A: :-) Nothing that relates to "meaning" falls too far away from
    where I would like to be. It is a very large landscape and it is
    growing very fast, so it is hard to be everywhere at the same time
    :-)
  - Many people are working on trying to analyze literature. Analyzing
    news stories has been happening since the beginning of the
    statistical NLP revolution, sort of linked to the fact that the
    first million "trees" were curated using WSJ articles :-)

- Q: Have you considered support for conlangs, such as Toki Pona? The
  simplicity of Toki Pona seems like it would lend itself well to
  machine processing.
  - A: This is the first time I am hearing of conlangs and Toki Pona.
    I would love to know more about them before saying more, but I
    cannot imagine any language not being able to use this framework.
  - Conlangs are "constructed languages" such as Esperanto: languages
    designed with intent, rather than evolved over centuries. Toki Pona
    is a minimal conlang created in 2001, with a uniform syntax and a
    small (under 200 words) vocabulary.
  - Thanks for the information! I would love to look into it.

- Q: Is there a roadmap of sorts for GRAIL?
  - A: Yes. I am now actually using real-world annotations on large
    corpora, both text and speech, and am validating the concept
    further. I am sure there will be some bumps along the way, and I am
    not saying that this is going to be a cure-all, but (after spending
    most of my professional life building and using corpora) this
    approach does seem very appealing to me. The speed of its
    development will depend on how many people buy into the idea and
    pitch in, I guess.

- Q: How can GRAIL be used by common people?
  - A: I don't think it can be used by common people at this very
    moment, partly because most "common" people have never heard of
    Emacs or Org mode. But if we can validate the concept and if it
    does "grow legs" and walk out of the Emacs room into the larger
    universe, then absolutely, anyone who has any say about language
    could use it. And the contributions would be as useful as the
    consistency with which one can capture a certain phenomenon.
  - Every time you use a captcha these days, the algorithms used by the
    company storing the data get slightly better. What if we could
    democratize this concept? That could lead to fascinating things,
    like Wikipedia did for the sum total of human knowledge.

[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]

-- cgit v1.2.3