[[!sidebar content=""]]
[[!meta title="GRAIL---A Generalized Representation and Aggregation of Information Layers"]]
[[!meta copyright="Copyright © 2022 Sameer Pradhan"]]
[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-generate-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. -->


# GRAIL---A Generalized Representation and Aggregation of Information Layers
Sameer Pradhan (he/him)

[[!inline pages="internal(2022/info/grail-before)" raw="yes"]]

The human brain receives various signals that it assimilates (filters,
splices, corrects, etc.) to build a syntactic structure and its semantic
interpretation.  This is a complex process that enables human communication.
The field of artificial intelligence (AI) is devoted to studying how we
generate symbols and derive meaning from such signals and to building
predictive models that allow effective human-computer interaction.

For the purposes of this talk we will limit the scope of signals to the
domain of language: text and speech.  Computational Linguistics (CL),
a.k.a. Natural Language Processing (NLP), is a sub-area of AI that tries to
interpret them.  It involves modeling and predicting complex linguistic
structures from these signals.  These models tend to rely heavily on a large
amount of "raw" (naturally occurring) data and a varying amount of
(manually) enriched data, commonly known as "annotations".  The models are
only as good as the quality of the annotations.  Owing to the complex and
numerous nature of linguistic phenomena, a divide-and-conquer approach is
common.  The upside is that it allows one to focus on one, or a few, related
linguistic phenomena.  The downside is that the universe of these phenomena
keeps expanding, as language is context-sensitive and evolves over time.  For
example, depending on the context, the word "bank" can refer to a financial
institution, the rising ground surrounding a lake, or something else.  The
verb "google" did not exist before the company came into being.
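
The "bank" ambiguity above is exactly what word-sense annotation captures:
each occurrence of a token records which sense was intended in its context.
A toy sketch (the sense labels and sentences below are invented for
illustration, not drawn from any real corpus or annotation scheme):

```python
# Toy word-sense annotation: the same surface form ("bank") maps to
# different sense labels depending on the sentence it occurs in.
# Sense inventory and sentences are made up for illustration.
SENSE_ANNOTATIONS = [
    {"sentence": "She deposited the check at the bank.",
     "token": "bank", "sense": "bank.n.financial-institution"},
    {"sentence": "They picnicked on the bank of the lake.",
     "token": "bank", "sense": "bank.n.sloping-land"},
]

def senses_of(token, annotations):
    """Collect the distinct sense labels attested for a token."""
    return sorted({a["sense"] for a in annotations if a["token"] == token})

print(senses_of("bank", SENSE_ANNOTATIONS))
# → ['bank.n.financial-institution', 'bank.n.sloping-land']
```

A model trained on such annotations must pick the right label from context,
which is why annotation consistency matters so much.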

Manually annotating data can be a very task-specific, labor-intensive
endeavor.  Owing to this, advances in the different modalities happened in
silos until recently.  Recent advances in computer hardware and machine
learning algorithms have opened the door to the interpretation of multimodal
data.  However, the need to piece together such related but disjoint
predictions poses a huge challenge.

This brings us to the two questions that we will try to address in this
talk:

1.  How can we come up with a unified representation of data and annotations that encompasses arbitrary levels of linguistic information? and,

2.  What role might Emacs play in this process?

Emacs provides a rich environment for editing and manipulating the recursive,
embedded structures found in programming languages.  Its view of text,
however, is more or less linear: strings broken into words, strings ended by
periods, strings identified using delimiters, and so on.  It does not assume
embedded or recursive structure in text.  However, the process of interpreting
natural language involves operating on such structures.  What if we could
adapt Emacs to manipulate rich structures derived from text?  Unlike
programming languages, which are designed to be parsed and interpreted
deterministically, the interpretation of statements in natural languages
frequently has to deal with phenomena such as ambiguity, inconsistency, and
incompleteness, and can get quite complex.

We present an architecture (GRAIL) which utilizes the capabilities of Emacs
to allow the representation and aggregation of such rich structures in
a systematic fashion.  Our approach is not tied to Emacs, but uses its many
built-in capabilities for creating and evaluating solution prototypes.
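
As a rough sketch of the kind of aggregation this aims at (the layer names
and the span-based data model below are hypothetical illustrations, not
GRAIL's actual format), one can picture each annotation layer as a set of
labeled character spans over one shared text, which can then be queried
together:

```python
# Hypothetical sketch: annotation layers as labeled (start, end, label)
# character spans over a single shared text, aggregated into one
# queryable structure.  Layer names and labels are invented.
TEXT = "Alice googled the bank."

LAYERS = {
    "tokens":   [(0, 5, "NNP"), (6, 13, "VBD"), (14, 17, "DT"), (18, 22, "NN")],
    "entities": [(0, 5, "PERSON")],
    "senses":   [(18, 22, "bank.n.financial-institution")],
}

def annotations_at(pos):
    """Return every (layer, label) whose span covers character position pos."""
    return [(layer, label)
            for layer, spans in LAYERS.items()
            for start, end, label in spans
            if start <= pos < end]

# Character 18 falls inside "bank": the token and word-sense layers
# both cover it, so their labels can be read off together.
print(annotations_at(18))
```

Because every layer anchors to the same underlying text, adding a new layer
does not disturb the existing ones, which is the property that brittle
per-task formats tend to lose.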


# Discussion

## Notes

-   I plan to fix the issues with the subtitles in a more
    systematic fashion and make the video available at the
    emacsconf/grail URL. My sense is that this URL will be active for
    the foreseeable future.
-   I am going to try to revise some of the answers, which I typed quite
    quickly and may not have provided useful context for, or might have
    made errors in.
-   Please feel free to email me at pradhan@cemantix.org with any further
    questions or discussions you may want to have with me, or to be part of
    the GRAIL community (which doesn't exist yet :-), or is a community of 1).

## Questions and answers

-   Q: Has the '92 UPenn corpus of articles feat been reproduced over
    and over again using these tools?
    -   A: Yes. The '92 corpus only annotated syntactic structure. It was
        probably the first time that the details captured in syntax were
        selected not purely based on linguistic accuracy, but on the
        consistency of such annotations across multiple annotators. This
        is often referred to as Inter-Annotator Agreement (IAA). The high
        IAA for this corpus was probably one of the reasons that parsers
        trained on it got accuracies in the mid-80s or so. Over the
        next 30 years (and still continuing), academics improved on
        parsers, and today the performance on the test set from this
        corpus is somewhere around an F-score of 95. But this has to be
        taken with a big grain of salt, given overfitting and how many
        times people have seen the test set.
    -   One thing that might be worth mentioning is that over the past 30
        years, many different phenomena have been annotated on parts of
        this corpus. However, as I mentioned, current tools and
        representations have difficulty integrating such disparate layers
        of annotation. Some of the issues are related to the complexity
        of the phenomena, and others to the brittleness of the
        representations. For example, I remember when we were building
        the OntoNotes corpus, there was a point where the guidelines were
        changed to split all words at a hyphen. That simple change caused
        a lot of heartache, because the interdependencies were not
        captured at a level that could be programmatically manipulated.
        That was around 2007, when I decided to use a relational database
        architecture to represent the layers. The great thing is that it
        was an almost perfect representation, but for some reason it
        never caught on, because using a database to prepare data for
        training was kind of unthinkable 15 years ago. Maybe? Anyway, the
        format that is easiest to use is also very rigid, in the sense
        that you can quickly make use of it, but if something changes
        somewhere you have no idea whether the whole is consistent. When
        I came across org-mode sometime around 2011/12 (if I remember
        correctly), I thought it would be a great tool. And indeed, about
        a decade later, I am trying to stand on its and Emacs's
        shoulders.
    -   This corpus was one of the first large-scale manually annotated
        corpora that bootstrapped the statistical natural language
        processing era. That can be considered the first wave.
        Since then, there have been more corpora built on the same
        philosophy. In fact, I spent about 8 years, about a decade ago,
        building a much larger corpus with more layers of information,
        called OntoNotes. It covers Chinese and Arabic as well (DARPA
        funding!), and is freely available for research to anyone
        anywhere. That was quite a feat.
-   Q: Is this only for natural languages like English, or more general?
    Could this be used for programming languages?
    -   A: I am using English as a use case, but the idea is for it to
        be completely multilingual.
    -   I cannot think why you would want to use it for programming
        languages. In fact, the concept of an AST in programming
        languages was what I thought would be worth exploring in this
        area of research. Org Mode, the way I sometimes view it, is a
        somewhat crude incarnation of that, and can be sort of manually
        built; the idea is to identify patterns and build upon them
        to create a larger collection of transformations that could be
        generally useful. That could help capture the abstract
        representation of "meaning" and help the models learn better.
    -   These days most models are trained on a boatload of data, and no
        matter how much data you use to train your largest model, it is
        still going to be a small speck in the ever-growing universe of
        data we are sitting in today. So, not surprisingly, these
        models tend to overfit the data they are trained on.
    -   So, if you have a smaller data set which is not quite the same
        as the one the training data came from, the models do really
        poorly. It is sometimes compared to learning a sine
        function using the points on the sine wave, as opposed to
        deriving the function itself. You can get close, but then
        you cannot really do a lot better with that model :-)
    -   I did a brief stint at Harvard Medical School/Boston
        Children's Hospital to see if we could use the same underlying
        philosophy to build better models for understanding clinical
        notes. It would be an extremely useful and socially beneficial
        use case, but after a few years, realizing that the
        legal and policy issues related to making such data available on
        a larger scale might need a few more decades, I decided to step
        off that wagon (if I am using the figure of speech correctly).
    -   More recently, since I joined the Linguistic Data Consortium, we
        have been looking at spoken neurological tests that are taken by
        older people, from which neurologists can predict the
        potential early onset of some neurological disorders. The idea
        is to see if we can use speech and language signals to predict
        such cases early on. Since we don't have cures for those
        conditions yet, the best we can do is identify them earlier,
        with the hope that the progression can be slowed down.
    -   This is sort of what is happening with the deep-learning hype.
        It is not to say that there hasn't been a significant
        advancement in the technologies, but to say that the models can
        "learn" is an extreme overstatement.



-   Q: Reminds me of the advantages of pre-computer copy and paste: cut
    up paper and rearrange, but having more stuff with your pieces.
    -   A: Right! Kind of like that, but more "intelligent" than
        copy/paste, because you could have various local constraints
        that would ensure that the information is consistent with the
        whole. I am also envisioning this as a use case for hooks. And
        if you have rich local dependencies, then you can be sure (as
        much as you can) that the information signal is not too
        corrupted.
    -   I had not come across the "cut up paper" idea you mentioned.
        That is an interesting thought. In fact, the kind of thing I
        was/am envisioning is that you can cut the paper a million ways
        and still join the pieces back to form the original piece of
        paper.




-   Q: Have you used it in some real-life situation? Where have you
    experimented with this?
    -   A: No. I am probably the only person doing this crazy thing.
        But I have a feeling that something like this, if worked on for
        a while by many people, might lead to a really potent tool for
        the masses. I feel strongly about giving such power to the
        users, and about being able to edit and share the data openly so
        that they are not stuck in some corporate vault somewhere :-)
        One thing at a time.
    -   I am in the process of creating a minimum viable package, and
        will see where that goes.
    -   The idea is to start within Emacs and Org mode, but not
        necessarily be limited to them.

-   Q: Do you see this as a format for this type of annotation
    specifically, or something more general that can be used for
    interlinear glosses, lexicons, etc.? Does word sense include a
    valence on positive or negative words (mood)?
    -   A: Interesting question. There are sub-corpora that have some
        of this data.
    -   Absolutely. In fact, the project I mentioned, OntoNotes, has
        multiple layers of annotation, one of them being the
        propositional structure, which uses a large lexicon that covers
        about 15K verbs and nouns and all their argument structures that
        we have seen so far in the corpora. About a million
        "propositions" have been released recently (we just celebrated
        the 20th birthday of the corpus). It is called PropBank.

-   There is an interesting history of the "Banks". It started with
    Treebank, and then there was PropBank (with a capital B), and then
    we developed OntoNotes, which contains:
    -   Syntax
    -   Named Entities
    -   Coreference Resolution
    -   Propositions
    -   Word Sense

-   All in the same whole and across various genres... (can add more
    information here later...)

-   Q: Are there parallel efforts to analyze literary texts or news
    articles? Pulling the ambiguity of meaning and not just the syntax
    out of works? (Granted this may be out of your area-- ignore as
    desired)
    -   A: :-) Nothing that relates to "meaning" falls too far away
        from where I would like to be. It is a very large landscape,
        growing very fast, so it is hard to be everywhere at the same
        time :-)
    -   Many people are working on trying to analyze literature.
        Analyzing news stories has been happening since the beginning
        of the statistical NLP revolution, sort of linked to the fact
        that the first million "trees" were curated from WSJ articles
        :-)

-   Q: Have you considered support for conlangs, such as Toki Pona? The
    simplicity of Toki Pona seems like it would lend itself well to
    machine processing.
    -   A: This is the first time I am hearing of conlangs and Toki
        Pona. I would love to know more about them before saying more,
        but I cannot imagine any language not being able to use this
        framework.
    -   Conlangs are "constructed languages" such as Esperanto:
        languages designed with intent, rather than evolved over
        centuries. Toki Pona is a minimal conlang created in 2001, with
        a uniform syntax and a small (<200 word) vocabulary.
    -   Thanks for the information! I would love to look into it.

-   Q: Is there a roadmap of sorts for GRAIL?
    -   A: Yes. I am now actually using real-world annotations on large
        corpora, both text and speech, and am validating the concept
        further. I am sure there will be some bumps along the way, and
        I am not saying that this is going to be a cure-all, but (after
        spending most of my professional life building/using corpora)
        this approach does seem very appealing to me. The speed of its
        development will depend on how many people buy into the idea
        and pitch in, I guess.

-   Q: How can GRAIL be used by common people?
    -   A: I don't think it can be used by common people at the moment,
        partly because most "common" people have never heard of Emacs
        or org-mode. But if we can validate the concept, and if it does
        "grow legs" and walk out of the Emacs room into the larger
        universe, then absolutely: anyone who has any say about
        language could use it. And the contributions would be as useful
        as the consistency with which one can capture a certain
        phenomenon.
    -   Every time you solve a captcha these days, the algorithms used
        by the company storing the data get slightly better. What if we
        could democratize this concept? That could lead to fascinating
        things, like Wikipedia did for the sum total of human knowledge.





[[!inline pages="internal(2022/info/grail-after)" raw="yes"]]

[[!inline pages="internal(2022/info/grail-nav)" raw="yes"]]

[[!taglink CategoryLinguistics]]