WEBVTT captioned by Sameer NOTE Introduction 00:00:00.000 --> 00:00:05.839 Thank you for joining me today. I'm Sameer Pradhan 00:00:05.840 --> 00:00:07.799 from the Linguistic Data Consortium 00:00:07.800 --> 00:00:10.079 at the University of Pennsylvania 00:00:10.080 --> 00:00:14.519 and founder of cemantix.org. 00:00:14.520 --> 00:00:16.879 Today we'll be addressing research 00:00:16.880 --> 00:00:18.719 in computational linguistics, 00:00:18.720 --> 00:00:22.039 also known as natural language processing, 00:00:22.040 --> 00:00:24.719 a subarea of artificial intelligence 00:00:24.720 --> 00:00:27.759 with a focus on modeling and predicting 00:00:27.760 --> 00:00:31.919 complex linguistic structures from various signals. 00:00:31.920 --> 00:00:35.799 The work we present is limited to text and speech signals, 00:00:35.800 --> 00:00:38.639 but it can be extended to other signals. 00:00:38.640 --> 00:00:40.799 We propose an architecture, 00:00:40.800 --> 00:00:42.959 and we call it GRAIL, which allows 00:00:42.960 --> 00:00:44.639 the representation and aggregation 00:00:44.640 --> 00:00:50.199 of such rich structures in a systematic fashion. 00:00:50.200 --> 00:00:52.679 I'll demonstrate a proof of concept 00:00:52.680 --> 00:00:56.559 for representing and manipulating data and annotations 00:00:56.560 --> 00:00:58.519 for the specific purpose of building 00:00:58.520 --> 00:01:02.879 machine learning models that simulate understanding. 00:01:02.880 --> 00:01:05.679 These technologies have the potential for impact 00:01:05.680 --> 00:01:09.119 in almost every conceivable field 00:01:09.120 --> 00:01:13.399 that generates and uses data. NOTE Processing language 00:01:13.400 --> 00:01:15.039 We process human language 00:01:15.040 --> 00:01:16.719 when our brains receive and assimilate 00:01:16.720 --> 00:01:20.079 various signals, which are then manipulated 00:01:20.080 --> 00:01:23.879 and interpreted within a syntactic structure. 00:01:23.880 --> 00:01:27.319 It's a complex process that I have simplified here 00:01:27.320 --> 00:01:30.759 for the purpose of comparison to machine learning. 00:01:30.760 --> 00:01:33.959 Recent machine learning models tend to require 00:01:33.960 --> 00:01:37.039 a large amount of raw, naturally occurring data 00:01:37.040 --> 00:01:40.199 and a varying amount of manually enriched data, 00:01:40.200 --> 00:01:43.199 commonly known as "annotations". 00:01:43.200 --> 00:01:45.959 Owing to the complexity and sheer number 00:01:45.960 --> 00:01:49.959 of linguistic phenomena, we have most often used 00:01:49.960 --> 00:01:52.999 a divide-and-conquer approach. 00:01:53.000 --> 00:01:55.399 The strength of this approach is that it allows us 00:01:55.400 --> 00:01:58.159 to focus on a single, or perhaps a few related, 00:01:58.160 --> 00:02:00.439 linguistic phenomena. 00:02:00.440 --> 00:02:03.879 The weaknesses are, first, that the universe of these phenomena 00:02:03.880 --> 00:02:07.239 keeps expanding as language itself 00:02:07.240 --> 00:02:09.359 evolves and changes over time, 00:02:09.360 --> 00:02:13.119 and second, that this approach requires the additional task 00:02:13.120 --> 00:02:14.839 of aggregating the interpretations, 00:02:14.840 --> 00:02:18.359 creating more opportunities for computer error.
00:02:18.360 --> 00:02:21.519 Our challenge, then, is to find the sweet spot 00:02:21.520 --> 00:02:25.239 that allows us to encode complex information 00:02:25.240 --> 00:02:27.719 without the use of manual annotation, 00:02:27.720 --> 00:02:34.559 or without the additional task of aggregation by computers. NOTE Annotation 00:02:34.560 --> 00:02:37.119 So what do I mean by "annotation"? 00:02:37.120 --> 00:02:39.759 In this talk, the word annotation refers to 00:02:39.760 --> 00:02:43.519 the manual assignment of certain attributes 00:02:43.520 --> 00:02:48.639 to portions of a signal, which is necessary 00:02:48.640 --> 00:02:51.639 to perform the end task. 00:02:51.640 --> 00:02:54.439 For example, in order for the algorithm 00:02:54.440 --> 00:02:57.439 to accurately interpret a pronoun, 00:02:57.440 --> 00:03:00.279 it needs to know 00:03:00.280 --> 00:03:03.799 what that pronoun refers back to. 00:03:03.800 --> 00:03:06.719 We may find this task trivial; however, 00:03:06.720 --> 00:03:10.599 current algorithms repeatedly fail at this task. 00:03:10.600 --> 00:03:13.319 So the complexities of understanding 00:03:13.320 --> 00:03:16.639 in computational linguistics require annotation. 00:03:16.640 --> 00:03:20.799 The word annotation itself is a useful example, 00:03:20.800 --> 00:03:22.679 because it also reminds us 00:03:22.680 --> 00:03:25.119 that words have multiple meanings, 00:03:25.120 --> 00:03:27.519 as annotation itself does— 00:03:27.520 --> 00:03:30.559 just as I needed to define it in this context, 00:03:30.560 --> 00:03:33.799 so that my message won't be misinterpreted. 00:03:33.800 --> 00:03:39.039 So, too, must annotators do this for algorithms 00:03:39.040 --> 00:03:43.239 through manual intervention. NOTE Learning from data 00:03:43.240 --> 00:03:44.759 Learning from raw data 00:03:44.760 --> 00:03:47.039 (commonly known as unsupervised learning) 00:03:47.040 --> 00:03:50.079 poses limitations for machine learning. 00:03:50.080 --> 00:03:53.039 As I described, modeling complex phenomena 00:03:53.040 --> 00:03:55.559 requires manual annotations. 00:03:55.560 --> 00:03:58.559 The learning algorithm uses these annotations 00:03:58.560 --> 00:04:01.319 as examples to build statistical models. 00:04:01.320 --> 00:04:04.879 This is called supervised learning. 00:04:04.880 --> 00:04:06.319 Without going into too much detail, 00:04:06.320 --> 00:04:10.039 I'll simply note that the recent popularity 00:04:10.040 --> 00:04:12.519 of the concept of deep learning 00:04:12.520 --> 00:04:14.679 is that evolutionary step 00:04:14.680 --> 00:04:17.319 where we have learned to train models 00:04:17.320 --> 00:04:20.799 using trillions of parameters in ways that they can 00:04:20.800 --> 00:04:25.079 learn richer hierarchical structures 00:04:25.080 --> 00:04:29.399 from very large amounts of unannotated data. 00:04:29.400 --> 00:04:32.319 These models can then be fine-tuned, 00:04:32.320 --> 00:04:35.599 using varying amounts of annotated examples 00:04:35.600 --> 00:04:37.639 depending on the complexity of the task 00:04:37.640 --> 00:04:39.679 to generate better predictions. NOTE Manual annotation 00:04:39.680 --> 00:04:44.919 As you might imagine, manually annotating 00:04:44.920 --> 00:04:47.359 complex linguistic phenomena 00:04:47.360 --> 00:04:51.719 can be a very specific, labor-intensive task.
00:04:51.720 --> 00:04:54.279 For example, imagine if we were 00:04:54.280 --> 00:04:56.399 to go back through this presentation 00:04:56.400 --> 00:04:58.399 and connect all the pronouns 00:04:58.400 --> 00:04:59.919 with the nouns to which they refer. 00:04:59.920 --> 00:05:03.239 Even for a short 18-minute presentation, 00:05:03.240 --> 00:05:05.239 this would require hundreds of annotations. 00:05:05.240 --> 00:05:08.519 The models we build are only as good 00:05:08.520 --> 00:05:11.119 as the quality of the annotations we make. 00:05:11.120 --> 00:05:12.679 We need guidelines 00:05:12.680 --> 00:05:15.759 that ensure that the annotations are done 00:05:15.760 --> 00:05:19.719 by at least two humans who have substantial agreement 00:05:19.720 --> 00:05:22.119 with each other in their interpretations. 00:05:22.120 --> 00:05:25.599 We know that if we try to train a model using annotations 00:05:25.600 --> 00:05:28.519 that are very subjective, or have too much noise, 00:05:28.520 --> 00:05:30.919 we will receive poor predictions. 00:05:30.920 --> 00:05:33.679 Additionally, there is the concern of introducing 00:05:33.680 --> 00:05:37.079 various unexpected biases into one's models. 00:05:37.080 --> 00:05:44.399 So annotation is really both an art and a science. NOTE How can we develop a unified representation? 00:05:44.400 --> 00:05:47.439 In the remaining time, 00:05:47.440 --> 00:05:49.999 we will turn to two fundamental questions. 00:05:50.000 --> 00:05:54.239 First, how can we develop a unified representation 00:05:54.240 --> 00:05:55.599 of data and annotations 00:05:55.600 --> 00:05:59.759 that encompasses arbitrary levels of linguistic information? 00:05:59.760 --> 00:06:03.839 There is a long history of attempting to answer 00:06:03.840 --> 00:06:04.839 this first question. 00:06:04.840 --> 00:06:08.839 This history is documented in our recent article, 00:06:08.840 --> 00:06:11.519 and you can refer to that article. 00:06:11.520 --> 00:06:16.719 It will be on the website. 00:06:16.720 --> 00:06:18.999 It is as if we, as a community, 00:06:19.000 --> 00:06:22.519 have been searching for our own Holy Grail. NOTE What role might Emacs and Org mode play? 00:06:22.520 --> 00:06:26.519 The second question we will pose is 00:06:26.520 --> 00:06:30.159 what role might Emacs, along with Org mode, 00:06:30.160 --> 00:06:31.919 play in this process? 00:06:31.920 --> 00:06:35.359 Well, the solution itself may not be tied to Emacs, 00:06:35.360 --> 00:06:38.359 but Emacs has built-in capabilities 00:06:38.360 --> 00:06:42.599 that could be useful for evaluating potential solutions. 00:06:42.600 --> 00:06:45.759 It's also one of the most extensively documented 00:06:45.760 --> 00:06:48.519 and most customizable 00:06:48.520 --> 00:06:51.599 pieces of software that I have ever come across, 00:06:51.600 --> 00:06:55.279 and many would agree with that. NOTE The complex structure of language 00:06:55.280 --> 00:07:00.639 In order to approach this second question, 00:07:00.640 --> 00:07:03.919 we turn to the complex structure of language itself. 00:07:03.920 --> 00:07:07.679 At first glance, language appears to us 00:07:07.680 --> 00:07:09.879 as a series of words. 00:07:09.880 --> 00:07:13.439 Words form sentences, sentences form paragraphs, 00:07:13.440 --> 00:07:16.239 and paragraphs form a complete text.
00:07:16.240 --> 00:07:19.039 If this were a sufficient description 00:07:19.040 --> 00:07:21.159 of the complexity of language, 00:07:21.160 --> 00:07:24.199 all of us would be able to speak and read 00:07:24.200 --> 00:07:26.559 at least ten different languages. 00:07:26.560 --> 00:07:29.279 We know it is much more complex than this. 00:07:29.280 --> 00:07:33.199 There is a rich, underlying recursive tree structure-- 00:07:33.200 --> 00:07:36.439 in fact, many possible tree structures-- 00:07:36.440 --> 00:07:39.439 which make a particular sequence meaningful 00:07:39.440 --> 00:07:42.079 and many others meaningless. 00:07:42.080 --> 00:07:45.239 One of the better-understood tree structures 00:07:45.240 --> 00:07:47.119 is the syntactic structure. 00:07:47.120 --> 00:07:49.439 While natural language 00:07:49.440 --> 00:07:51.679 has rich ambiguities and complexities, 00:07:51.680 --> 00:07:55.119 programming languages are designed to be parsed 00:07:55.120 --> 00:07:56.999 and interpreted deterministically. 00:07:57.000 --> 00:08:02.159 Emacs has been used for programming very effectively. 00:08:02.160 --> 00:08:05.359 So there is a potential for using Emacs 00:08:05.360 --> 00:08:06.559 as a tool for annotation. 00:08:06.560 --> 00:08:10.799 This would significantly improve our current set of tools. NOTE Annotation tools 00:08:10.800 --> 00:08:16.559 It is important to note that most of the annotation tools 00:08:16.560 --> 00:08:19.639 that have been developed over the past few decades 00:08:19.640 --> 00:08:22.879 have relied on graphical interfaces, 00:08:22.880 --> 00:08:26.919 even those used for enriching textual information. 00:08:26.920 --> 00:08:30.399 Most of the tools in current use 00:08:30.400 --> 00:08:36.159 are designed for an end user to add very specific, 00:08:36.160 --> 00:08:38.639 very restricted information. 00:08:38.640 --> 00:08:42.799 We have not really made use of the potential 00:08:42.800 --> 00:08:45.639 that an editor or a rich editing environment like Emacs 00:08:45.640 --> 00:08:47.239 can add to the mix. 00:08:47.240 --> 00:08:52.479 Emacs has long enabled the editing and manipulation of 00:08:52.480 --> 00:08:56.359 complex embedded tree structures abundant in source code. 00:08:56.360 --> 00:08:58.599 So it's not difficult to imagine that it would have 00:08:58.600 --> 00:09:00.359 many capabilities that we need 00:09:00.360 --> 00:09:02.599 to represent actual language. 00:09:02.600 --> 00:09:04.759 In fact, it already does that with features 00:09:04.760 --> 00:09:06.399 that allow us to quickly navigate 00:09:06.400 --> 00:09:07.919 through sentences and paragraphs, 00:09:07.920 --> 00:09:09.799 needing only a few keystrokes, 00:09:09.800 --> 00:09:13.599 or to add various text properties to text spans 00:09:13.600 --> 00:09:17.039 and to create overlays, to name but a few. 00:09:17.040 --> 00:09:22.719 Emacs figured out how to handle Unicode, 00:09:22.720 --> 00:09:26.799 so you don't even have to worry about the complexity 00:09:26.800 --> 00:09:29.439 of managing multiple languages. 00:09:29.440 --> 00:09:34.039 It's built into Emacs. In fact, this is not the first time 00:09:34.040 --> 00:09:37.399 Emacs has been used for linguistic analysis.
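NOTE Editor's aside: The text-property and overlay features mentioned above can be sketched in a few lines of Emacs Lisp. This is a minimal illustration added for readers of the transcript, not code from the talk; the 'pos property name and the buffer positions are hypothetical placeholders.
    ;; Assume the current buffer contains "I saw the moon with a telescope."
    ;; and that positions 7-15 cover the span "the moon" (placeholder positions).
    (put-text-property 7 15 'pos 'noun-phrase)    ; attach a hypothetical annotation to the span
    (let ((ov (make-overlay 7 15)))               ; create an overlay over the same span
      (overlay-put ov 'face 'highlight)           ; make the annotated span visually distinct
      (overlay-put ov 'help-echo "NP: the moon")) ; show a tooltip describing the annotation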
00:09:37.400 --> 00:09:41.159 One of the breakthrough moments in 00:09:41.160 --> 00:09:44.439 natural language processing was the creation 00:09:44.440 --> 00:09:48.639 of manually created syntactic trees 00:09:48.640 --> 00:09:50.439 for a 1-million-word collection 00:09:50.440 --> 00:09:52.399 of Wall Street Journal articles. 00:09:52.400 --> 00:09:54.879 This was around 1992, 00:09:54.880 --> 00:09:59.279 before Java or graphical interfaces were common. 00:09:59.280 --> 00:10:03.279 The tool that was used to create that corpus was Emacs. 00:10:03.280 --> 00:10:08.959 It was created at UPenn, and is famously known as 00:10:08.960 --> 00:10:12.719 the Penn Treebank. '92 was about when 00:10:12.720 --> 00:10:16.439 the Linguistic Data Consortium was also established, 00:10:16.440 --> 00:10:18.039 and it's been about 30 years 00:10:18.040 --> 00:10:20.719 that it has been creating various 00:10:20.720 --> 00:10:22.359 language-related resources. NOTE Org mode 00:10:22.360 --> 00:10:28.519 Org mode--in particular, the outlining mode, 00:10:28.520 --> 00:10:32.399 or rather the enhanced form of outlining mode-- 00:10:32.400 --> 00:10:35.599 allows us to create rich outlines, 00:10:35.600 --> 00:10:37.799 attaching properties to nodes, 00:10:37.800 --> 00:10:41.119 and provides commands for easily customizing the 00:10:41.120 --> 00:10:43.879 sorting of various pieces of information 00:10:43.880 --> 00:10:45.639 as per one's requirements. 00:10:45.640 --> 00:10:50.239 This can also be a very useful tool. 00:10:50.240 --> 00:10:59.159 This enhanced form of outline-mode adds more power to Emacs. 00:10:59.160 --> 00:11:03.359 It provides commands for easily customizing 00:11:03.360 --> 00:11:05.159 and filtering information, 00:11:05.160 --> 00:11:08.999 while at the same time hiding unnecessary context. 00:11:09.000 --> 00:11:11.919 It also allows structural editing. 00:11:11.920 --> 00:11:16.039 This can be a very useful tool to enrich corpora 00:11:16.040 --> 00:11:20.919 where we are focusing on a limited set of phenomena. 00:11:20.920 --> 00:11:24.519 The two together allow us to create 00:11:24.520 --> 00:11:27.199 a rich representation 00:11:27.200 --> 00:11:32.999 that can simultaneously capture multiple possible sequences, 00:11:33.000 --> 00:11:38.759 capture details necessary to recreate the original source, 00:11:38.760 --> 00:11:42.079 allow the creation of hierarchical representations, 00:11:42.080 --> 00:11:44.679 and provide structural editing capabilities 00:11:44.680 --> 00:11:47.439 that can take advantage of the concept of inheritance 00:11:47.440 --> 00:11:48.999 within the tree structure. 00:11:49.000 --> 00:11:54.279 Together they allow local manipulations of structures, 00:11:54.280 --> 00:11:56.199 thereby minimizing data coupling. 00:11:56.200 --> 00:11:59.119 The concept of tags in Org mode 00:11:59.120 --> 00:12:01.599 complements the hierarchy. 00:12:01.600 --> 00:12:03.839 Hierarchies can be very rigid, 00:12:03.840 --> 00:12:06.039 but by adding tags to hierarchies, 00:12:06.040 --> 00:12:08.839 we can have multifaceted representations. 00:12:08.840 --> 00:12:12.759 As a matter of fact, Org mode has the ability for the tags 00:12:12.760 --> 00:12:15.039 to have their own hierarchical structure, 00:12:15.040 --> 00:12:18.639 which further enhances the representational power.
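NOTE Editor's aside: As a rough sketch of the outline, property, and tag features just described, one annotated node in an Org buffer might look something like the following. The headline names, property keys, and tags are hypothetical, chosen only to illustrate the idea; they are not taken from the GRAIL files shown in the talk.
    * Sentence 1: "I saw the moon with a telescope."        :syntax:example:
    ** NP "the moon"
       :PROPERTIES:
       :START_CHAR: 7
       :END_CHAR:   15
       :HEAD_WORD:  moon
       :END:
Tags such as :syntax: are inherited by child headlines, and the property drawer stays folded until it is needed, which is part of what allows the "hiding unnecessary context" described above.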
00:12:18.640 --> 00:12:22.639 All of this can be done as a sequence 00:12:22.640 --> 00:12:25.679 of mostly functional data transformations, 00:12:25.680 --> 00:12:27.439 because most of the capabilities 00:12:27.440 --> 00:12:29.759 can be configured and customized. 00:12:29.760 --> 00:12:32.799 It is not necessary to do everything at once. 00:12:32.800 --> 00:12:36.199 Instead, it allows us to incrementally increase 00:12:36.200 --> 00:12:37.919 the complexity of the representation. 00:12:37.920 --> 00:12:39.799 Finally, all of this can be done 00:12:39.800 --> 00:12:42.359 in a plain-text representation, 00:12:42.360 --> 00:12:45.479 which comes with its own advantages. NOTE Example 00:12:45.480 --> 00:12:50.679 Now let's take a simple example. 00:12:50.680 --> 00:12:55.999 This is a short video that I'll play. 00:12:56.000 --> 00:12:59.679 The sentence is "I saw the moon with a telescope," 00:12:59.680 --> 00:13:03.999 and let's just make a copy of the sentence. 00:13:04.000 --> 00:13:09.199 What we can do now is to see: 00:13:09.200 --> 00:13:11.879 what does this sentence comprise? 00:13:11.880 --> 00:13:13.679 It has a noun phrase "I," 00:13:13.680 --> 00:13:17.479 followed by the word "saw." 00:13:17.480 --> 00:13:21.359 Then "the moon" is another noun phrase, 00:13:21.360 --> 00:13:24.839 and "with a telescope" is a prepositional phrase. 00:13:24.840 --> 00:13:30.759 Now one thing that you might remember 00:13:30.760 --> 00:13:36.119 from grammar school or syntax is that 00:13:36.120 --> 00:13:41.279 there is a syntactic structure. 00:13:41.280 --> 00:13:44.359 And in this particular case-- 00:13:44.360 --> 00:13:47.919 because we know that the moon is not typically 00:13:47.920 --> 00:13:51.679 something that can hold a telescope-- 00:13:51.680 --> 00:13:56.239 the seeing must be done by me, or "I," 00:13:56.240 --> 00:14:01.039 and the telescope must be in my hand; 00:14:01.040 --> 00:14:04.479 that is, "I" am viewing the moon with a telescope. 00:14:04.480 --> 00:14:13.519 However, it is possible that in a different context 00:14:13.520 --> 00:14:17.159 the moon could be referring to an animated character 00:14:17.160 --> 00:14:22.319 in an animated series, and could actually hold the telescope. 00:14:22.320 --> 00:14:23.479 And in that case 00:14:23.480 --> 00:14:24.839 the situation might be 00:14:24.840 --> 00:14:26.319 that I'm actually seeing 00:14:26.320 --> 00:14:30.959 the moon holding a telescope... 00:14:30.960 --> 00:14:36.079 I mean, the moon is holding the telescope, 00:14:36.080 --> 00:14:40.959 and I'm just seeing the moon holding the telescope. 00:14:40.960 --> 00:14:47.999 This is a complex linguistic ambiguity, a linguistic 00:14:48.000 --> 00:14:53.599 phenomenon that requires world knowledge, 00:14:53.600 --> 00:14:55.719 and it's called the PP attachment problem, 00:14:55.720 --> 00:14:59.239 where the prepositional phrase attachment 00:14:59.240 --> 00:15:04.599 can be ambiguous, and various different contextual cues 00:15:04.600 --> 00:15:06.879 have to be used to resolve the ambiguity. 00:15:06.880 --> 00:15:09.079 So in this case, as you saw, 00:15:09.080 --> 00:15:11.199 both readings are technically true, 00:15:11.200 --> 00:15:13.959 depending on the context.
00:15:13.960 --> 00:15:16.599 So one thing we could do is just 00:15:16.600 --> 00:15:19.919 to cut the tree and duplicate it, 00:15:19.920 --> 00:15:21.599 and then let's create another node 00:15:21.600 --> 00:15:24.479 and call it an "OR" node. 00:15:24.480 --> 00:15:26.119 That's because we are saying 00:15:26.120 --> 00:15:28.359 this is one of the two interpretations. 00:15:28.360 --> 00:15:32.159 Now let's call one interpretation "a", 00:15:32.160 --> 00:15:36.159 and that interpretation essentially 00:15:36.160 --> 00:15:39.319 is this child of the node "a", 00:15:39.320 --> 00:15:41.799 and it says that the moon 00:15:41.800 --> 00:15:43.999 is holding the telescope. 00:15:44.000 --> 00:15:46.359 Now we can create another representation "b" 00:15:46.360 --> 00:15:53.919 where we capture the other interpretation, 00:15:53.920 --> 00:15:59.959 in which I am actually 00:15:59.960 --> 00:16:00.519 holding the telescope 00:16:00.520 --> 00:16:06.799 and watching the moon using it. 00:16:06.800 --> 00:16:09.199 So now we have two separate interpretations 00:16:09.200 --> 00:16:11.679 in the same structure, 00:16:11.680 --> 00:16:15.519 and we were able to do all of this 00:16:15.520 --> 00:16:18.159 with a few very quick keystrokes. 00:16:18.160 --> 00:16:22.439 While we are at it, let's add another interesting thing 00:16:22.440 --> 00:16:25.159 to this node that represents "I": 00:16:25.160 --> 00:16:28.919 it can be "he," it can be "she," 00:16:28.920 --> 00:16:35.759 it can be "the children," or it can be "the people." 00:16:35.760 --> 00:16:45.039 Basically, any entity that has the capability to "see" 00:16:45.040 --> 00:16:53.359 can be substituted in this particular node. 00:16:53.360 --> 00:16:57.399 Let's see what we have here now. 00:16:57.400 --> 00:17:01.239 We're just getting sort of a zoomed-out view 00:17:01.240 --> 00:17:04.599 of the entire structure that we created, 00:17:04.600 --> 00:17:08.039 and essentially you can see that 00:17:08.040 --> 00:17:11.879 by just, you know, using a few keystrokes, 00:17:11.880 --> 00:17:17.839 we were able to capture two different interpretations 00:17:17.840 --> 00:17:20.879 of a simple sentence, 00:17:20.880 --> 00:17:23.759 and we were also able to add 00:17:23.760 --> 00:17:27.799 these alternate pieces of information 00:17:27.800 --> 00:17:30.559 that could help machine learning algorithms 00:17:30.560 --> 00:17:32.439 generalize better. 00:17:32.440 --> 00:17:36.239 All right. NOTE Different readings 00:17:36.240 --> 00:17:40.359 Now, let's look at the next thing. So in a sense, 00:17:40.360 --> 00:17:46.679 we can use this power of functional data structures 00:17:46.680 --> 00:17:50.239 to represent various potentially conflicting 00:17:50.240 --> 00:17:55.559 structural readings of that piece of text. 00:17:55.560 --> 00:17:58.079 In addition to that, we can also create more texts, 00:17:58.080 --> 00:17:59.799 each with a different structure, 00:17:59.800 --> 00:18:01.559 and have them all in the same place. 00:18:01.560 --> 00:18:04.239 This allows us to address the interpretation 00:18:04.240 --> 00:18:06.879 of a static sentence that might be occurring in the world, 00:18:06.880 --> 00:18:09.639 while simultaneously inserting information 00:18:09.640 --> 00:18:11.519 that would add more value to it. 00:18:11.520 --> 00:18:14.999 This makes the enrichment process also very efficient.
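NOTE Editor's aside: The OR-node manipulation just demonstrated might be captured in an Org outline roughly like this. This is an illustrative reconstruction by the transcript editor, not the actual buffer from the demo; the bracketed pseudo-parses and tag names are hypothetical.
    * Sentence: "I saw the moon with a telescope."
    ** OR
    *** a                                       :moon_holds_telescope:
        I [saw [the moon [with a telescope]]]
    *** b                                       :I_hold_telescope:
        I [saw [the moon] [with a telescope]]
Structural editing commands for cutting, duplicating, and demoting subtrees are what make splitting one reading into two alternatives like this a matter of a few keystrokes.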
00:18:15.000 --> 00:18:19.519 Additionally, we can envision 00:18:19.520 --> 00:18:23.999 a power user of the future, or the present, 00:18:24.000 --> 00:18:27.479 who can not only annotate a span, 00:18:27.480 --> 00:18:31.279 but also edit the information in situ 00:18:31.280 --> 00:18:34.639 in a way that would help machine learning algorithms 00:18:34.640 --> 00:18:36.879 generalize better by making more efficient use 00:18:36.880 --> 00:18:37.719 of the annotations. 00:18:37.720 --> 00:18:41.519 So together, Emacs and Org mode can speed up 00:18:41.520 --> 00:18:42.959 the enrichment of the signals 00:18:42.960 --> 00:18:44.519 in a way that allows us 00:18:44.520 --> 00:18:47.719 to focus on certain aspects and ignore others. 00:18:47.720 --> 00:18:50.839 An extremely complex landscape of rich structures 00:18:50.840 --> 00:18:53.039 can be captured consistently, 00:18:53.040 --> 00:18:55.639 in a fashion that allows computers 00:18:55.640 --> 00:18:56.759 to understand language. 00:18:56.760 --> 00:19:00.879 We can then build tools to enhance the tasks 00:19:00.880 --> 00:19:03.319 that we do in our everyday life. 00:19:03.320 --> 00:19:10.759 YAMR is the acronym, or the file type or specification, 00:19:10.760 --> 00:19:15.239 that we are creating to capture this new 00:19:15.240 --> 00:19:17.679 rich representation. NOTE Spontaneous speech 00:19:17.680 --> 00:19:21.959 We'll now look at an example of spontaneous speech 00:19:21.960 --> 00:19:24.799 that occurs in spoken conversations. 00:19:24.800 --> 00:19:28.599 Conversations frequently contain errors in speech: 00:19:28.600 --> 00:19:30.799 interruptions, disfluencies, 00:19:30.800 --> 00:19:33.959 verbal sounds such as a cough or a laugh, 00:19:33.960 --> 00:19:35.039 and other noises. 00:19:35.040 --> 00:19:38.199 In this sense, spontaneous speech is similar 00:19:38.200 --> 00:19:39.799 to a functional data stream. 00:19:39.800 --> 00:19:42.759 We cannot take back words that come out of our mouths, 00:19:42.760 --> 00:19:47.239 but we tend to make mistakes, and we correct ourselves 00:19:47.240 --> 00:19:49.039 as soon as we realize that 00:19:49.040 --> 00:19:50.679 we have misspoken. 00:19:50.680 --> 00:19:53.159 This process manifests through a combination 00:19:53.160 --> 00:19:56.279 of a handful of mechanisms, including immediate correction 00:19:56.280 --> 00:20:00.959 after an error, and we do this unconsciously. 00:20:00.960 --> 00:20:02.719 Computers, on the other hand, 00:20:02.720 --> 00:20:06.639 must be taught to understand these cases. 00:20:06.640 --> 00:20:12.799 What we see here is an example document or outline, 00:20:12.800 --> 00:20:18.119 or part of a document that illustrates 00:20:18.120 --> 00:20:22.919 various different aspects of the representation. 00:20:22.920 --> 00:20:25.919 We don't have a lot of time to go through 00:20:25.920 --> 00:20:28.239 many of the details. 00:20:28.240 --> 00:20:31.759 I would highly encourage you to... 00:20:31.760 --> 00:20:39.159 I'm planning on making some videos, or asciinema recordings, 00:20:39.160 --> 00:20:42.559 that I'll be posting, and, 00:20:42.560 --> 00:20:46.759 if you're interested, you can go through those. 00:20:46.760 --> 00:20:50.359 The idea here is to try to do 00:20:50.360 --> 00:20:54.599 a slightly more complex use case.
00:20:54.600 --> 00:20:57.639 But again, given the time constraint 00:20:57.640 --> 00:21:00.279 and the amount of information 00:21:00.280 --> 00:21:01.519 that needs to fit on the screen, 00:21:01.520 --> 00:21:05.559 this may not be very informative, 00:21:05.560 --> 00:21:08.399 but at least it will give you some idea 00:21:08.400 --> 00:21:10.439 of what is possible. 00:21:10.440 --> 00:21:13.279 And in this particular case, what you're seeing is that 00:21:13.280 --> 00:21:18.319 there is a sentence which is "What I'm I'm tr- telling now." 00:21:18.320 --> 00:21:21.159 Essentially, there is a repetition of the word "I'm", 00:21:21.160 --> 00:21:23.279 and then there is a partial word 00:21:23.280 --> 00:21:25.159 where somebody tried to say "telling", 00:21:25.160 --> 00:21:29.599 but started saying "tr-", and then corrected themselves 00:21:29.600 --> 00:21:30.959 and said, "telling now." 00:21:30.960 --> 00:21:39.239 So in this case, you see, we can capture words 00:21:39.240 --> 00:21:44.919 or a sequence of words, or a sequence of tokens. 00:21:44.920 --> 00:21:52.279 An interesting thing to note is that in NLP, 00:21:52.280 --> 00:21:55.319 we sometimes have to break 00:21:55.320 --> 00:22:01.199 words that don't have spaces into two separate words, 00:22:01.200 --> 00:22:04.119 especially contractions like "I'm", 00:22:04.120 --> 00:22:08.199 so the syntactic parser needs two separate nodes. 00:22:08.200 --> 00:22:11.199 But anyway, you can see that here. 00:22:11.200 --> 00:22:15.759 What this view shows is that 00:22:15.760 --> 00:22:19.759 with each of the nodes in the sentence 00:22:19.760 --> 00:22:23.079 or in the representation, 00:22:23.080 --> 00:22:26.079 you can have a lot of different properties 00:22:26.080 --> 00:22:27.559 that you can attach to them, 00:22:27.560 --> 00:22:30.119 and these properties are typically hidden, 00:22:30.120 --> 00:22:32.719 like you saw in the earlier slide. 00:22:32.720 --> 00:22:35.599 But you can make use of all these properties 00:22:35.600 --> 00:22:39.439 to do various kinds of searches and filtering. 00:22:39.440 --> 00:22:43.519 And on the right-hand side here-- 00:22:43.520 --> 00:22:48.799 this is actually not legitimate syntax-- 00:22:48.800 --> 00:22:51.279 but on the right are descriptions 00:22:51.280 --> 00:22:53.479 of what each of these represents. 00:22:53.480 --> 00:22:57.319 All the information is also available in the article. 00:22:57.320 --> 00:23:04.279 You can see there how much rich context 00:23:04.280 --> 00:23:05.879 you can capture. 00:23:05.880 --> 00:23:08.799 This is just a closer snapshot 00:23:08.800 --> 00:23:10.159 of the properties on the node, 00:23:10.160 --> 00:23:13.119 and you can see we can have things like 00:23:13.120 --> 00:23:14.799 whether the word is a token or not, 00:23:14.800 --> 00:23:17.359 or whether it's incomplete, whether some words 00:23:17.360 --> 00:23:19.959 should be filtered out for parsing, 00:23:19.960 --> 00:23:23.039 which we can mark with PARSE_IGNORE, 00:23:23.040 --> 00:23:25.519 or whether some words are restart markers, 00:23:25.520 --> 00:23:29.239 which we can mark with a RESTART_MARKER, or sometimes 00:23:29.240 --> 00:23:31.999 some of these might have durations. Things like that. NOTE Editing properties in column view 00:23:32.000 --> 00:23:38.799 The other fascinating thing about this representation 00:23:38.800 --> 00:23:42.599 is that you can edit properties in the column view.
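NOTE Editor's aside: Token-level properties of the kind listed above (PARSE_IGNORE, RESTART_MARKER, durations) and the column view described next can be sketched roughly as follows. Apart from the two property names mentioned in the talk, the keys, values, and the #+COLUMNS specification are hypothetical.
    #+COLUMNS: %20ITEM %TOKEN %INCOMPLETE %PARSE_IGNORE %RESTART_MARKER %DURATION
    ** tr-
       :PROPERTIES:
       :TOKEN:          t
       :INCOMPLETE:     t
       :PARSE_IGNORE:   t
       :RESTART_MARKER: t
       :DURATION:       0.21
       :END:
Running org-columns (C-c C-x C-c) on such a subtree is what produces the editable tabular view discussed below, with the property values laid out as columns.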
00:23:42.600 --> 00:23:45.399 And suddenly, you have this tabular data structure 00:23:45.400 --> 00:23:48.879 combined with the hierarchical data structure. 00:23:48.880 --> 00:23:53.119 And you may not be able to see it here, 00:23:53.120 --> 00:23:56.879 but what has also happened here is that 00:23:56.880 --> 00:24:01.159 some of the tags have been inherited 00:24:01.160 --> 00:24:02.479 from the earlier nodes. 00:24:02.480 --> 00:24:07.919 And so you get a much fuller picture of things. 00:24:07.920 --> 00:24:13.919 Essentially, you can filter out the things 00:24:13.920 --> 00:24:15.319 that you want to process, 00:24:15.320 --> 00:24:20.279 process them, and then reintegrate them into the whole. NOTE Conclusion 00:24:20.280 --> 00:24:25.479 So, in conclusion, today we have proposed and demonstrated 00:24:25.480 --> 00:24:27.559 the use of an architecture (GRAIL), 00:24:27.560 --> 00:24:31.319 which allows the representation, manipulation, 00:24:31.320 --> 00:24:34.759 and aggregation of rich linguistic structures 00:24:34.760 --> 00:24:36.519 in a systematic fashion. 00:24:36.520 --> 00:24:41.359 We have shown how GRAIL advances the tools 00:24:41.360 --> 00:24:44.599 available for building machine learning models 00:24:44.600 --> 00:24:46.879 that simulate understanding. 00:24:46.880 --> 00:24:51.679 Thank you very much for your time and attention today. 00:24:51.680 --> 00:24:54.639 My contact information is on this slide. 00:24:54.640 --> 00:25:02.599 If you are interested in an additional example 00:25:02.600 --> 00:25:05.439 that demonstrates the representation 00:25:05.440 --> 00:25:08.039 of speech and written text together, 00:25:08.040 --> 00:25:10.719 please continue watching. 00:25:10.720 --> 00:25:12.199 Otherwise, you can stop here 00:25:12.200 --> 00:25:15.279 and enjoy the rest of the conference. NOTE Bonus material 00:25:15.280 --> 00:25:39.079 Welcome to the bonus material. 00:25:39.080 --> 00:25:43.959 I'm glad that some of you stuck around. 00:25:43.960 --> 00:25:46.559 We are now going to examine an instance 00:25:46.560 --> 00:25:49.159 of speech and text signals together 00:25:49.160 --> 00:25:51.479 that produce multiple layers. 00:25:51.480 --> 00:25:54.839 When we take a spoken conversation 00:25:54.840 --> 00:25:58.719 and use the best language processing models available, 00:25:58.720 --> 00:26:00.679 we suddenly hit a hard spot 00:26:00.680 --> 00:26:03.239 because the tools are typically not trained 00:26:03.240 --> 00:26:05.359 to filter out the unnecessary cruft 00:26:05.360 --> 00:26:07.559 in order to automatically interpret 00:26:07.560 --> 00:26:09.559 the part of what is being said 00:26:09.560 --> 00:26:11.799 that is actually relevant. 00:26:11.800 --> 00:26:14.639 Over time, language researchers 00:26:14.640 --> 00:26:17.719 have created many interdependent layers of annotations, 00:26:17.720 --> 00:26:21.039 yet the assumptions underlying them are seldom the same. 00:26:21.040 --> 00:26:25.039 Piecing together such related but disjointed annotations 00:26:25.040 --> 00:26:28.039 and their predictions poses a huge challenge. 00:26:28.040 --> 00:26:30.719 This is another place where we can leverage 00:26:30.720 --> 00:26:33.119 the data model underlying the Emacs editor, 00:26:33.120 --> 00:26:35.359 along with the structural editing capabilities 00:26:35.360 --> 00:26:38.519 of Org mode, to improve current tools. 00:26:38.520 --> 00:26:42.839 Let's take this very simple-looking utterance.
00:26:42.840 --> 00:26:48.039 "Um {lipsmack} and that's it. ({laugh})" 00:26:48.040 --> 00:26:50.319 So this is-- 00:26:50.320 --> 00:26:54.519 what you are seeing here is a transcript of an audio signal 00:26:54.520 --> 00:27:00.759 that has a lip smack and a laugh as part of it, 00:27:00.760 --> 00:27:04.199 and there is also an interjection, "um." 00:27:04.200 --> 00:27:08.199 So this has a few interesting noises 00:27:08.200 --> 00:27:13.999 and specific things that will be illustrative 00:27:14.000 --> 00:27:20.479 of how we are going to represent it. NOTE Syntactic analysis 00:27:20.480 --> 00:27:25.839 Okay. So let's say you want to have 00:27:25.840 --> 00:27:28.879 a syntactic analysis of this sentence or utterance. 00:27:28.880 --> 00:27:30.959 One common technique people use 00:27:30.960 --> 00:27:32.879 is just to remove the cruft, and, you know, 00:27:32.880 --> 00:27:35.079 write some rules, clean up the utterance, 00:27:35.080 --> 00:27:36.719 make it look like it's proper English, 00:27:36.720 --> 00:27:40.239 and then, you know, tokenize it, 00:27:40.240 --> 00:27:43.079 and basically just use standard tools to process it. 00:27:43.080 --> 00:27:47.279 But in that process, they end up eliminating 00:27:47.280 --> 00:27:51.119 valid pieces of signal that have meaning to others 00:27:51.120 --> 00:27:52.799 studying different phenomena of language. 00:27:52.800 --> 00:27:56.479 Here you have the rich transcript, 00:27:56.480 --> 00:28:00.119 the input to the syntactic parser. 00:28:00.120 --> 00:28:05.919 As you can see, there is a little tokenization happening 00:28:05.920 --> 00:28:07.199 where you'll be inserting a space 00:28:07.200 --> 00:28:12.119 between "that" and the contracted is ('s), 00:28:12.120 --> 00:28:15.599 and between the "it" and the period, 00:28:15.600 --> 00:28:18.199 and the output of the syntactic parser is shown below, 00:28:18.200 --> 00:28:21.639 which (surprise) is an S-expression. 00:28:21.640 --> 00:28:24.919 Like I said, the parse trees, when they were created, 00:28:24.920 --> 00:28:29.799 and still largely when they are used, are S-expressions, 00:28:29.800 --> 00:28:32.999 and most of the viewers here 00:28:33.000 --> 00:28:35.119 should not have much problem reading them. 00:28:35.120 --> 00:28:37.279 You can see the tree structure 00:28:37.280 --> 00:28:39.279 of the syntactic parse here. NOTE Forced alignment 00:28:39.280 --> 00:28:40.919 Now let's say you want to integrate 00:28:40.920 --> 00:28:44.479 phonetic information, or the phonetic layer, 00:28:44.480 --> 00:28:49.119 that's in the audio signal, and do some analysis. 00:28:49.120 --> 00:28:57.519 Now, that would require a few steps. 00:28:57.520 --> 00:29:01.679 First, you would need to align the transcript 00:29:01.680 --> 00:29:06.479 with the audio. This process is called forced alignment, 00:29:06.480 --> 00:29:10.399 where you already know what the transcript is, 00:29:10.400 --> 00:29:14.599 and you have the audio, and you can get a good alignment 00:29:14.600 --> 00:29:17.599 using both pieces of information. 00:29:17.600 --> 00:29:20.119 And this is typically a technique that is used to 00:29:20.120 --> 00:29:23.079 create training data for training 00:29:23.080 --> 00:29:25.839 automatic speech recognizers.
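NOTE Editor's aside: The parser output referred to above appears on a slide rather than in the captions. For readers following the text alone, a Penn-Treebank-style S-expression for the cleaned-up utterance "and that's it." might look roughly like the following; this is an illustrative parse supplied by the transcript editor, not the one on the slide.
    (S (CC and)
       (NP (DT that))
       (VP (VBZ 's)
           (NP (PRP it)))
       (. .))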
00:29:25.840 --> 00:29:29.639 One interesting thing is that in order to do 00:29:29.640 --> 00:29:32.879 this forced alignment, you have to keep 00:29:32.880 --> 00:29:35.799 the non-speech events in the transcript, 00:29:35.800 --> 00:29:39.079 because they consume some audio signal, 00:29:39.080 --> 00:29:41.399 and if you don't have that signal, 00:29:41.400 --> 00:29:44.399 the alignment process doesn't know exactly... 00:29:44.400 --> 00:29:45.759 you know, it doesn't do a good job, 00:29:45.760 --> 00:29:50.039 because it needs to align all parts of the signal 00:29:50.040 --> 00:29:54.999 with something: either a pause, silence, noise, or words. 00:29:55.000 --> 00:29:59.719 Interestingly, punctuation really doesn't factor in, 00:29:59.720 --> 00:30:01.559 because we don't speak in punctuation. 00:30:01.560 --> 00:30:04.239 So one of the things that you need to do 00:30:04.240 --> 00:30:05.679 is remove most of the punctuation, 00:30:05.680 --> 00:30:08.039 although you'll see there are some punctuation marks 00:30:08.040 --> 00:30:12.599 that can be kept, or that have to be kept. NOTE Alignment before tokenization 00:30:12.600 --> 00:30:15.319 And the other thing is that the alignment has to be done 00:30:15.320 --> 00:30:20.159 before tokenization, as tokenization impacts pronunciation. 00:30:20.160 --> 00:30:24.399 To show an example: here you see "that's". 00:30:24.400 --> 00:30:26.919 When it's one word, 00:30:26.920 --> 00:30:31.959 it has a slightly different pronunciation 00:30:31.960 --> 00:30:35.679 than when it is two words, "that" and "is", 00:30:35.680 --> 00:30:38.399 as you can see with "is." And so, 00:30:38.400 --> 00:30:44.279 if you split the tokens or split the words 00:30:44.280 --> 00:30:48.119 in order for the syntactic parser to process them, 00:30:48.120 --> 00:30:51.599 you would end up getting the wrong phonetic analysis. 00:30:51.600 --> 00:30:54.239 And if you process it 00:30:54.240 --> 00:30:55.319 through the phonetic analysis, 00:30:55.320 --> 00:30:59.159 and you don't know how to integrate it 00:30:59.160 --> 00:31:02.719 with the tokenized syntax, 00:31:02.720 --> 00:31:07.519 that can be pretty tricky. And a lot of the time, 00:31:07.520 --> 00:31:10.759 people write one-off pieces of code that handle these cases, 00:31:10.760 --> 00:31:14.279 but the idea here is to try to have a general architecture 00:31:14.280 --> 00:31:17.239 that seamlessly integrates all these pieces. 00:31:17.240 --> 00:31:21.319 Then you do the syntactic parsing of the remaining tokens. 00:31:21.320 --> 00:31:24.799 Then you align the data and the two annotations, 00:31:24.800 --> 00:31:27.959 and then integrate the two layers. 00:31:27.960 --> 00:31:31.359 Once that is done, then you can do all kinds of 00:31:31.360 --> 00:31:33.919 interesting analyses, test various hypotheses, 00:31:33.920 --> 00:31:35.279 and generate statistics, 00:31:35.280 --> 00:31:39.359 but without that you are only dealing 00:31:39.360 --> 00:31:42.879 with one or the other part. NOTE Layers 00:31:42.880 --> 00:31:48.319 Let's just take a quick look at what each of the layers 00:31:48.320 --> 00:31:51.159 involved looks like. 00:31:51.160 --> 00:31:56.719 So this is "Um {lipsmack}, and that's it. {laugh}" 00:31:56.720 --> 00:32:00.159 This is the transcript, and on the right-hand side, 00:32:00.160 --> 00:32:04.199 you see the same transcript 00:32:04.200 --> 00:32:06.239 listed vertically in a column.
00:32:06.240 --> 00:32:08.199 You'll see why in just a second. 00:32:08.200 --> 00:32:09.879 And there are 00:32:09.880 --> 00:32:11.279 some rows that are empty, 00:32:11.280 --> 00:32:15.079 some rows that are wider than the others, and we'll see why. 00:32:15.080 --> 00:32:19.319 The next is the tokenized sentence 00:32:19.320 --> 00:32:20.959 where you have spaces added, 00:32:20.960 --> 00:32:23.599 you know, a space between these two tokens: 00:32:23.600 --> 00:32:26.599 "that" and the apostrophe "s" ('s), 00:32:26.600 --> 00:32:28.079 and between the "it" and the period. 00:32:28.080 --> 00:32:30.679 And you see on the right-hand side 00:32:30.680 --> 00:32:33.559 that the tokens have attributes. 00:32:33.560 --> 00:32:36.439 So there is a token index, and there are, 00:32:36.440 --> 00:32:38.839 you know, 0, 1, 2, 3, 4, 5 tokens, 00:32:38.840 --> 00:32:41.479 and each token has a start and end character, 00:32:41.480 --> 00:32:45.799 and a space also has a start and end character, 00:32:45.800 --> 00:32:50.399 and is represented by "sp". And there are 00:32:50.400 --> 00:32:54.319 these other things that we removed, 00:32:54.320 --> 00:32:56.239 like the "{LS}" which is for "{lipsmack}" 00:32:56.240 --> 00:32:59.399 and "{LG}" which is for "{laugh}", shown grayed out, 00:32:59.400 --> 00:33:02.439 and you'll see why some of these things are grayed out 00:33:02.440 --> 00:33:03.399 in a little bit. 00:33:03.400 --> 00:33:11.919 This is what the forced alignment tool produces. 00:33:11.920 --> 00:33:17.159 Basically, it takes the transcript, 00:33:17.160 --> 00:33:19.159 and this is the transcript 00:33:19.160 --> 00:33:24.119 with slightly different symbols, 00:33:24.120 --> 00:33:26.239 because different tools use different symbols 00:33:26.240 --> 00:33:28.159 and have their various configuration details. 00:33:28.160 --> 00:33:33.679 But this is what is used to get an alignment 00:33:33.680 --> 00:33:36.039 or time alignment with phones. 00:33:36.040 --> 00:33:40.079 So this column shows the phones for each word. 00:33:40.080 --> 00:33:43.879 So, for example, "and" has been aligned with these phones, 00:33:43.880 --> 00:33:46.879 and the start and end here 00:33:46.880 --> 00:33:52.959 are essentially temporal timestamps 00:33:52.960 --> 00:33:54.279 that have been aligned to it. 00:33:54.280 --> 00:34:00.759 Interestingly, sometimes we don't really have any pause 00:34:00.760 --> 00:34:05.159 or any time duration between some words, 00:34:05.160 --> 00:34:08.199 and those are highlighted in gray here. 00:34:08.200 --> 00:34:12.759 See, there's this space... Actually 00:34:12.760 --> 00:34:17.799 it does not have any temporal content, 00:34:17.800 --> 00:34:21.319 whereas this other space has some duration. 00:34:21.320 --> 00:34:24.839 So the ones that have some duration are captured, 00:34:24.840 --> 00:34:29.519 while the others are the ones that in the earlier diagram 00:34:29.520 --> 00:34:31.319 we saw were left out. NOTE Variations 00:34:31.320 --> 00:34:37.639 And the aligner actually produces multiple files.
00:34:37.640 --> 00:34:44.399 One of the files has a slightly different 00:34:44.400 --> 00:34:46.679 variation on the same information, 00:34:46.680 --> 00:34:49.999 and in this case, you can see 00:34:50.000 --> 00:34:52.399 that the punctuation is missing, 00:34:52.400 --> 00:34:57.599 and the punctuation is, you know, deliberately missing, 00:34:57.600 --> 00:35:02.279 because there is no time associated with it, 00:35:02.280 --> 00:35:06.439 and you see that it's not the tokenized sentence-- 00:35:06.440 --> 00:35:17.119 not the tokenized words. This now gives you a full table, 00:35:17.120 --> 00:35:21.239 and you can't really look at it very carefully here. 00:35:21.240 --> 00:35:25.879 But we can focus on the part that seems legible, 00:35:25.880 --> 00:35:28.559 or, you know, like a properly written sentence, 00:35:28.560 --> 00:35:32.879 process it, and reincorporate it back into the whole. 00:35:32.880 --> 00:35:35.879 So if somebody wants to look at, for example, 00:35:35.880 --> 00:35:39.679 how many pauses the person made while they were talking, 00:35:39.680 --> 00:35:42.919 they can actually measure the pauses--the number, 00:35:42.920 --> 00:35:46.279 the duration--and make connections between that 00:35:46.280 --> 00:35:49.639 and the rich syntactic structure that is being produced. 00:35:49.640 --> 00:35:57.279 And in order to do that, you have to get these layers 00:35:57.280 --> 00:35:59.039 to align with each other, 00:35:59.040 --> 00:36:04.359 and this table is just a tabular representation 00:36:04.360 --> 00:36:08.679 of the information that we'll be storing in the YAMR file. 00:36:08.680 --> 00:36:11.719 Congratulations! You have reached 00:36:11.720 --> 00:36:13.479 the end of this demonstration. 00:36:13.480 --> 00:36:17.000 Thank you for your time and attention.