WEBVTT captioned by sameer
NOTE Introduction
00:00:00.000 --> 00:00:05.839
Thank you for joining me today. I'm Sameer Pradhan
00:00:05.840 --> 00:00:07.799
from the Linguistic Data Consortium
00:00:07.800 --> 00:00:10.079
at the University of Pennsylvania
00:00:10.080 --> 00:00:14.519
and founder of cemantix.org.
00:00:14.520 --> 00:00:16.879
Today we'll be addressing research
00:00:16.880 --> 00:00:18.719
in computational linguistics,
00:00:18.720 --> 00:00:22.039
also known as natural language processing,
00:00:22.040 --> 00:00:24.719
a subarea of artificial intelligence
00:00:24.720 --> 00:00:27.759
with a focus on modeling and predicting
00:00:27.760 --> 00:00:31.919
complex linguistic structures from various signals.
00:00:31.920 --> 00:00:35.799
The work we present is limited to text and speech signals,
00:00:35.800 --> 00:00:38.639
but it can be extended to other signals.
00:00:38.640 --> 00:00:40.799
We propose an architecture,
00:00:40.800 --> 00:00:42.959
called GRAIL, which allows
00:00:42.960 --> 00:00:44.639
the representation and aggregation
00:00:44.640 --> 00:00:50.199
of such rich structures in a systematic fashion.
00:00:50.200 --> 00:00:52.679
I'll demonstrate a proof of concept
00:00:52.680 --> 00:00:56.559
for representing and manipulating data and annotations
00:00:56.560 --> 00:00:58.519
for the specific purpose of building
00:00:58.520 --> 00:01:02.879
machine learning models that simulate understanding.
00:01:02.880 --> 00:01:05.679
These technologies have the potential for impact
00:01:05.680 --> 00:01:09.119
in almost every conceivable field
00:01:09.120 --> 00:01:13.399
that generates and uses data.
NOTE Processing language
00:01:13.400 --> 00:01:15.039
We process human language
00:01:15.040 --> 00:01:16.719
when our brains receive and assimilate
00:01:16.720 --> 00:01:20.079
various signals which are then manipulated
00:01:20.080 --> 00:01:23.879
and interpreted within a syntactic structure.
00:01:23.880 --> 00:01:27.319
It's a complex process that I have simplified here
00:01:27.320 --> 00:01:30.759
for the purpose of comparison to machine learning.
00:01:30.760 --> 00:01:33.959
Recent machine learning models tend to require
00:01:33.960 --> 00:01:37.039
a large amount of raw, naturally occurring data
00:01:37.040 --> 00:01:40.199
and a varying amount of manually enriched data,
00:01:40.200 --> 00:01:43.199
commonly known as "annotations".
00:01:43.200 --> 00:01:45.959
Owing to the complexity and sheer number
00:01:45.960 --> 00:01:49.959
of linguistic phenomena, we have most often used
00:01:49.960 --> 00:01:52.999
a divide and conquer approach.
00:01:53.000 --> 00:01:55.399
The strength of this approach is that it allows us
00:01:55.400 --> 00:01:58.159
to focus on a single, or perhaps a few related
00:01:58.160 --> 00:02:00.439
linguistic phenomena.
00:02:00.440 --> 00:02:03.879
The weaknesses are, first, that the universe of these phenomena
00:02:03.880 --> 00:02:07.239
keeps expanding, as language itself
00:02:07.240 --> 00:02:09.359
evolves and changes over time,
00:02:09.360 --> 00:02:13.119
and second, this approach requires an additional task
00:02:13.120 --> 00:02:14.839
of aggregating the interpretations,
00:02:14.840 --> 00:02:18.359
creating more opportunities for computer error.
00:02:18.360 --> 00:02:21.519
Our challenge, then, is to find the sweet spot
00:02:21.520 --> 00:02:25.239
that allows us to encode complex information
00:02:25.240 --> 00:02:27.719
without exhaustive manual annotation,
00:02:27.720 --> 00:02:34.559
and without the additional task of aggregation by computers.
NOTE Annotation
00:02:34.560 --> 00:02:37.119
So what do I mean by "annotation"?
00:02:37.120 --> 00:02:39.759
In this talk the word annotation refers to
00:02:39.760 --> 00:02:43.519
the manual assignment of certain attributes
00:02:43.520 --> 00:02:48.639
to portions of a signal, which is necessary
00:02:48.640 --> 00:02:51.639
to perform the end task.
00:02:51.640 --> 00:02:54.439
For example, in order for the algorithm
00:02:54.440 --> 00:02:57.439
to accurately interpret a pronoun,
00:02:57.440 --> 00:03:00.279
it needs to know
00:03:00.280 --> 00:03:03.799
what that pronoun refers back to.
00:03:03.800 --> 00:03:06.719
We may find this task trivial, however,
00:03:06.720 --> 00:03:10.599
current algorithms repeatedly fail at this task.
00:03:10.600 --> 00:03:13.319
So the complexities of understanding
00:03:13.320 --> 00:03:16.639
in computational linguistics require annotation.
00:03:16.640 --> 00:03:20.799
The word "annotation" itself is a useful example,
00:03:20.800 --> 00:03:22.679
because it also reminds us
00:03:22.680 --> 00:03:25.119
that words have multiple meanings
00:03:25.120 --> 00:03:27.519
as annotation itself does—
00:03:27.520 --> 00:03:30.559
just as I needed to define it in this context,
00:03:30.560 --> 00:03:33.799
so that my message won't be misinterpreted.
00:03:33.800 --> 00:03:39.039
So, too, must annotators do this for algorithms
00:03:39.040 --> 00:03:43.239
through manual intervention.
NOTE Learning from data
00:03:43.240 --> 00:03:44.759
Learning from raw data
00:03:44.760 --> 00:03:47.039
(commonly known as unsupervised learning)
00:03:47.040 --> 00:03:50.079
poses limitations for machine learning.
00:03:50.080 --> 00:03:53.039
As I described, modeling complex phenomena
00:03:53.040 --> 00:03:55.559
requires manual annotations.
00:03:55.560 --> 00:03:58.559
The learning algorithm uses these annotations
00:03:58.560 --> 00:04:01.319
as examples to build statistical models.
00:04:01.320 --> 00:04:04.879
This is called supervised learning.
00:04:04.880 --> 00:04:06.319
Without going into too much detail,
00:04:06.320 --> 00:04:10.039
I'll simply note that the recent popularity
00:04:10.040 --> 00:04:12.519
of the concept of deep learning
00:04:12.520 --> 00:04:14.679
is that evolutionary step
00:04:14.680 --> 00:04:17.319
where we have learned to train models
00:04:17.320 --> 00:04:20.799
using trillions of parameters in ways that they can
00:04:20.800 --> 00:04:25.079
learn richer hierarchical structures
00:04:25.080 --> 00:04:29.399
from very large amounts of unannotated data.
00:04:29.400 --> 00:04:32.319
These models can then be fine-tuned,
00:04:32.320 --> 00:04:35.599
using varying amounts of annotated examples
00:04:35.600 --> 00:04:37.639
depending on the complexity of the task
00:04:37.640 --> 00:04:39.679
to generate better predictions.
NOTE Manual annotation
00:04:39.680 --> 00:04:44.919
As you might imagine, manually annotating
00:04:44.920 --> 00:04:47.359
complex linguistic phenomena
00:04:47.360 --> 00:04:51.719
can be a very specific, labor-intensive task.
00:04:51.720 --> 00:04:54.279
For example, imagine if we were
00:04:54.280 --> 00:04:56.399
to go back through this presentation
00:04:56.400 --> 00:04:58.399
and connect all the pronouns
00:04:58.400 --> 00:04:59.919
with the nouns to which they refer.
00:04:59.920 --> 00:05:03.239
Even for a short 18-minute presentation,
00:05:03.240 --> 00:05:05.239
this would require hundreds of annotations.
00:05:05.240 --> 00:05:08.519
The models we build are only as good
00:05:08.520 --> 00:05:11.119
as the quality of the annotations we make.
00:05:11.120 --> 00:05:12.679
We need guidelines
00:05:12.680 --> 00:05:15.759
that ensure that the annotations are done
00:05:15.760 --> 00:05:19.719
by at least two humans who have substantial agreement
00:05:19.720 --> 00:05:22.119
with each other in their interpretations.
00:05:22.120 --> 00:05:25.599
We know that if we try to train a model using annotations
00:05:25.600 --> 00:05:28.519
that are very subjective, or noisy,
00:05:28.520 --> 00:05:30.919
we will receive poor predictions.
00:05:30.920 --> 00:05:33.679
Additionally, there is the concern of introducing
00:05:33.680 --> 00:05:37.079
various unexpected biases into one's models.
00:05:37.080 --> 00:05:44.399
So annotation is really both an art and a science.
NOTE How can we develop a unified representation?
00:05:44.400 --> 00:05:47.439
In the remaining time,
00:05:47.440 --> 00:05:49.999
we will turn to two fundamental questions.
00:05:50.000 --> 00:05:54.239
First, how can we develop a unified representation
00:05:54.240 --> 00:05:55.599
of data and annotations
00:05:55.600 --> 00:05:59.759
that encompasses arbitrary levels of linguistic information?
00:05:59.760 --> 00:06:03.839
There is a long history of attempting to answer
00:06:03.840 --> 00:06:04.839
this first question.
00:06:04.840 --> 00:06:08.839
This history is documented in our recent article,
00:06:08.840 --> 00:06:11.519
and you can refer to that article.
00:06:11.520 --> 00:06:16.719
It will be on the website.
00:06:16.720 --> 00:06:18.999
It is as if we, as a community,
00:06:19.000 --> 00:06:22.519
have been searching for our own Holy Grail.
NOTE What role might Emacs and Org mode play?
00:06:22.520 --> 00:06:26.519
The second question we will pose is
00:06:26.520 --> 00:06:30.159
what role might Emacs, along with Org mode,
00:06:30.160 --> 00:06:31.919
play in this process?
00:06:31.920 --> 00:06:35.359
Well, the solution itself may not be tied to Emacs.
00:06:35.360 --> 00:06:38.359
But Emacs has built-in capabilities
00:06:38.360 --> 00:06:42.599
that could be useful for evaluating potential solutions.
00:06:42.600 --> 00:06:45.759
It's also one of the most extensively documented
00:06:45.760 --> 00:06:48.519
pieces of software and the most customizable
00:06:48.520 --> 00:06:51.599
piece of software that I have ever come across,
00:06:51.600 --> 00:06:55.279
and many would agree with that.
NOTE The complex structure of language
00:06:55.280 --> 00:07:00.639
In order to approach this second question,
00:07:00.640 --> 00:07:03.919
we turn to the complex structure of language itself.
00:07:03.920 --> 00:07:07.679
At first glance, language appears to us
00:07:07.680 --> 00:07:09.879
as a series of words.
00:07:09.880 --> 00:07:13.439
Words form sentences, sentences form paragraphs,
00:07:13.440 --> 00:07:16.239
and paragraphs form a complete text.
00:07:16.240 --> 00:07:19.039
If this were a sufficient description
00:07:19.040 --> 00:07:21.159
of the complexity of language,
00:07:21.160 --> 00:07:24.199
all of us would be able to speak and read
00:07:24.200 --> 00:07:26.559
at least ten different languages.
00:07:26.560 --> 00:07:29.279
We know it is much more complex than this.
00:07:29.280 --> 00:07:33.199
There is a rich, underlying recursive tree structure--
00:07:33.200 --> 00:07:36.439
in fact, many possible tree structures
00:07:36.440 --> 00:07:39.439
which make a particular sequence meaningful
00:07:39.440 --> 00:07:42.079
and many others meaningless.
00:07:42.080 --> 00:07:45.239
One of the better understood tree structures
00:07:45.240 --> 00:07:47.119
is the syntactic structure.
00:07:47.120 --> 00:07:49.439
While natural language
00:07:49.440 --> 00:07:51.679
has rich ambiguities and complexities,
00:07:51.680 --> 00:07:55.119
programming languages are designed to be parsed
00:07:55.120 --> 00:07:56.999
and interpreted deterministically.
00:07:57.000 --> 00:08:02.159
Emacs has been used for programming very effectively.
00:08:02.160 --> 00:08:05.359
So there is a potential for using Emacs
00:08:05.360 --> 00:08:06.559
as a tool for annotation.
00:08:06.560 --> 00:08:10.799
This would significantly improve our current set of tools.
NOTE Annotation tools
00:08:10.800 --> 00:08:16.559
It is important to note that most of the annotation tools
00:08:16.560 --> 00:08:19.639
that have been developed over the past few decades
00:08:19.640 --> 00:08:22.879
have relied on graphical interfaces,
00:08:22.880 --> 00:08:26.919
even those used for enriching textual information.
00:08:26.920 --> 00:08:30.399
Most of the tools in current use
00:08:30.400 --> 00:08:36.159
are designed for an end user to add very specific,
00:08:36.160 --> 00:08:38.639
very restricted information.
00:08:38.640 --> 00:08:42.799
We have not really made use of the potential
00:08:42.800 --> 00:08:45.639
that an editor or a rich editing environment like Emacs
00:08:45.640 --> 00:08:47.239
can add to the mix.
00:08:47.240 --> 00:08:52.479
Emacs has long enabled the editing and manipulation of
00:08:52.480 --> 00:08:56.359
complex embedded tree structures abundant in source code.
00:08:56.360 --> 00:08:58.599
So it's not difficult to imagine that it would have
00:08:58.600 --> 00:09:00.359
many capabilities that we need
00:09:00.360 --> 00:09:02.599
to represent actual language.
00:09:02.600 --> 00:09:04.759
In fact, it already does that with features
00:09:04.760 --> 00:09:06.399
that allow us to quickly navigate
00:09:06.400 --> 00:09:07.919
through sentences and paragraphs,
00:09:07.920 --> 00:09:09.799
with just a few keystrokes,
00:09:09.800 --> 00:09:13.599
or to add various text properties to text spans,
00:09:13.600 --> 00:09:17.039
or to create overlays, to name but a few.
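NOTE Aside (not in the talk's audio): a minimal Emacs Lisp sketch of what marking a span with a text property and an overlay looks like; the phrase-type property name is invented for illustration.

```elisp
;; Minimal sketch: annotate the span "the moon" in a scratch buffer.
;; The 'phrase-type property is hypothetical, not from any package.
(with-current-buffer (get-buffer-create "*annotation-demo*")
  (erase-buffer)
  (insert "I saw the moon with a telescope.")
  ;; Attach an annotation as a text property on the span "the moon".
  (put-text-property 7 15 'phrase-type 'noun-phrase)
  ;; Highlight the same span with an overlay.
  (overlay-put (make-overlay 7 15) 'face 'highlight)
  (get-text-property 8 'phrase-type)) ;; => noun-phrase
```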
00:09:17.040 --> 00:09:22.719
Emacs figured out how to handle Unicode long ago,
00:09:22.720 --> 00:09:26.799
so you don't even have to worry about the complexity
00:09:26.800 --> 00:09:29.439
of managing multiple languages.
00:09:29.440 --> 00:09:34.039
It's built into Emacs. In fact, this is not the first time
00:09:34.040 --> 00:09:37.399
Emacs has been used for linguistic analysis.
00:09:37.400 --> 00:09:41.159
One of the breakthrough moments
00:09:41.160 --> 00:09:44.439
in natural language processing was the creation
00:09:44.440 --> 00:09:48.639
of manually annotated syntactic trees
00:09:48.640 --> 00:09:50.439
for a 1 million word collection
00:09:50.440 --> 00:09:52.399
of Wall Street Journal articles.
00:09:52.400 --> 00:09:54.879
This was around 1992,
00:09:54.880 --> 00:09:59.279
before Java or graphical interfaces were common.
00:09:59.280 --> 00:10:03.279
The tool that was used to create that corpus was Emacs.
00:10:03.280 --> 00:10:08.959
It was created at UPenn, and is famously known as
00:10:08.960 --> 00:10:12.719
the Penn Treebank. '92 was about when
00:10:12.720 --> 00:10:16.439
the Linguistic Data Consortium was also established,
00:10:16.440 --> 00:10:18.039
and for about 30 years
00:10:18.040 --> 00:10:20.719
it has been creating various
00:10:20.720 --> 00:10:22.359
language-related resources.
NOTE Org mode
00:10:22.360 --> 00:10:28.519
Org mode--in particular, the outlining mode,
00:10:28.520 --> 00:10:32.399
or rather the enhanced form of outlining mode--
00:10:32.400 --> 00:10:35.599
allows us to create rich outlines,
00:10:35.600 --> 00:10:37.799
attaching properties to nodes,
00:10:37.800 --> 00:10:41.119
and provides commands for easily customizing
00:10:41.120 --> 00:10:43.879
sorting of various pieces of information
00:10:43.880 --> 00:10:45.639
as per one's requirement.
00:10:45.640 --> 00:10:50.239
This can also be a very useful tool.
00:10:50.240 --> 00:10:59.159
This enhanced form of outline-mode adds more power to Emacs.
00:10:59.160 --> 00:11:03.359
It provides commands for easily customizing
00:11:03.360 --> 00:11:05.159
and filtering information,
00:11:05.160 --> 00:11:08.999
while at the same time hiding unnecessary context.
00:11:09.000 --> 00:11:11.919
It also allows structural editing.
00:11:11.920 --> 00:11:16.039
This can be a very useful tool to enrich corpora
00:11:16.040 --> 00:11:20.919
where we are focusing on a limited set of phenomena.
00:11:20.920 --> 00:11:24.519
The two together allow us to create
00:11:24.520 --> 00:11:27.199
a rich representation
00:11:27.200 --> 00:11:32.999
that can simultaneously capture multiple possible sequences,
00:11:33.000 --> 00:11:38.759
capture details necessary to recreate the original source,
00:11:38.760 --> 00:11:42.079
allow the creation of hierarchical representations,
00:11:42.080 --> 00:11:44.679
provide structural editing capabilities
00:11:44.680 --> 00:11:47.439
that can take advantage of the concept of inheritance
00:11:47.440 --> 00:11:48.999
within the tree structure.
00:11:49.000 --> 00:11:54.279
Together they allow local manipulations of structures,
00:11:54.280 --> 00:11:56.199
thereby minimizing data coupling.
00:11:56.200 --> 00:11:59.119
The concept of tags in Org mode
00:11:59.120 --> 00:12:01.599
complements the hierarchy.
00:12:01.600 --> 00:12:03.839
Hierarchies can be very rigid,
00:12:03.840 --> 00:12:06.039
but by adding tags to hierarchies,
00:12:06.040 --> 00:12:08.839
we can have multifaceted representations.
00:12:08.840 --> 00:12:12.759
As a matter of fact, Org mode has the ability for the tags
00:12:12.760 --> 00:12:15.039
to have their own hierarchical structure
00:12:15.040 --> 00:12:18.639
which further enhances the representational power.
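NOTE Aside (not shown in the talk): in Org, "group tags" give tags their own hierarchy. In this invented sketch, a search for the entity tag also matches headlines tagged person, location, or artifact.

```org
#+TAGS: [ entity : person location artifact ]

* I saw the moon with a telescope.   :sentence:
** I                                 :person:
** the moon                          :location:
** a telescope                       :artifact:
```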
00:12:18.640 --> 00:12:22.639
All of this can be done as a sequence
00:12:22.640 --> 00:12:25.679
of mostly functional data transformations,
00:12:25.680 --> 00:12:27.439
because most of the capabilities
00:12:27.440 --> 00:12:29.759
can be configured and customized.
00:12:29.760 --> 00:12:32.799
It is not necessary to do everything at once.
00:12:32.800 --> 00:12:36.199
Instead, it allows us to incrementally increase
00:12:36.200 --> 00:12:37.919
the complexity of the representation.
00:12:37.920 --> 00:12:39.799
Finally, all of this can be done
00:12:39.800 --> 00:12:42.359
in plain-text representation
00:12:42.360 --> 00:12:45.479
which comes with its own advantages.
NOTE Example
00:12:45.480 --> 00:12:50.679
Now let's take a simple example.
00:12:50.680 --> 00:12:55.999
This is a short video that I'll play.
00:12:56.000 --> 00:12:59.679
The sentence is "I saw the moon with a telescope,"
00:12:59.680 --> 00:13:03.999
and let's just make a copy of the sentence.
00:13:04.000 --> 00:13:09.199
What we can do now is to see:
00:13:09.200 --> 00:13:11.879
what does this sentence comprise?
00:13:11.880 --> 00:13:13.679
It has a noun phrase "I,"
00:13:13.680 --> 00:13:17.479
followed by a word "saw."
00:13:17.480 --> 00:13:21.359
Then "the moon" is another noun phrase,
00:13:21.360 --> 00:13:24.839
and "with the telescope" is a prepositional phrase.
00:13:24.840 --> 00:13:30.759
Now one thing that you might remember,
00:13:30.760 --> 00:13:36.119
from grammar school or syntax is that
00:13:36.120 --> 00:13:41.279
there is a syntactic structure.
00:13:41.280 --> 00:13:44.359
And in this particular case--
00:13:44.360 --> 00:13:47.919
because we know that the moon is not typically
00:13:47.920 --> 00:13:51.679
something that can hold the telescope,
00:13:51.680 --> 00:13:56.239
that the seeing must be done by me or "I,"
00:13:56.240 --> 00:14:01.039
and the telescope must be in my hand,
00:14:01.040 --> 00:14:04.479
or "I" am viewing the moon with a telescope.
00:14:04.480 --> 00:14:13.519
However, it is possible that in a different context
00:14:13.520 --> 00:14:17.159
the moon could be referring to an animated character
00:14:17.160 --> 00:14:22.319
in an animated series, and could actually hold the telescope.
00:14:22.320 --> 00:14:23.479
And this is one of the most--
00:14:23.480 --> 00:14:24.839
the oldest and one of the most--
00:14:24.840 --> 00:14:26.319
and in that case the situation might be
00:14:26.320 --> 00:14:30.959
that I'm actually seeing the moon holding a telescope...
00:14:30.960 --> 00:14:36.079
I mean. The moon is holding the telescope,
00:14:36.080 --> 00:14:40.959
and I'm just seeing the moon holding the telescope.
00:14:40.960 --> 00:14:47.999
This is a complex linguistic ambiguity,
00:14:48.000 --> 00:14:53.599
a phenomenon that requires world knowledge,
00:14:53.600 --> 00:14:55.719
and it's called the PP attachment problem
00:14:55.720 --> 00:14:59.239
where the prepositional phrase attachment
00:14:59.240 --> 00:15:04.599
can be ambiguous, and various different contextual cues
00:15:04.600 --> 00:15:06.879
have to be used to resolve the ambiguity.
00:15:06.880 --> 00:15:09.079
So in this case, as you saw,
00:15:09.080 --> 00:15:11.199
both the readings are technically true,
00:15:11.200 --> 00:15:13.959
depending on different contexts.
00:15:13.960 --> 00:15:16.599
So one thing we could do is just
00:15:16.600 --> 00:15:19.919
to cut the tree and duplicate it,
00:15:19.920 --> 00:15:21.599
and then let's create another node
00:15:21.600 --> 00:15:24.479
and call it an "OR" node.
00:15:24.480 --> 00:15:26.119
And because we are saying,
00:15:26.120 --> 00:15:28.359
this is one of the two interpretations.
00:15:28.360 --> 00:15:32.159
Now let's call one interpretation "a",
00:15:32.160 --> 00:15:36.159
and that interpretation essentially
00:15:36.160 --> 00:15:39.319
is this child of that node "a"
00:15:39.320 --> 00:15:41.799
and that says that the moon
00:15:41.800 --> 00:15:43.999
is holding the telescope.
00:15:44.000 --> 00:15:46.359
Now we can create another representation "b"
00:15:46.360 --> 00:15:53.919
where we capture the other interpretation,
00:15:53.920 --> 00:15:59.959
where I am actually
00:15:59.960 --> 00:16:00.519
holding the telescope,
00:16:00.520 --> 00:16:06.799
and watching the moon using it.
00:16:06.800 --> 00:16:09.199
So now we have two separate interpretations
00:16:09.200 --> 00:16:11.679
in the same structure,
00:16:11.680 --> 00:16:15.519
and we were able to do all of this
00:16:15.520 --> 00:16:18.159
with just a few quick keystrokes.
00:16:18.160 --> 00:16:22.439
While we are at it, let's add another interesting thing,
00:16:22.440 --> 00:16:25.159
this node that represents "I":
00:16:25.160 --> 00:16:28.919
it can be "He" or "She."
00:16:28.920 --> 00:16:35.759
It can be "the children," or it can be "The people".
00:16:35.760 --> 00:16:45.039
Basically, any entity that has the capability to "see"
00:16:45.040 --> 00:16:53.359
can be substituted in this particular node.
00:16:53.360 --> 00:16:57.399
Let's see what we have here now.
00:16:57.400 --> 00:17:01.239
We're just getting a sort of zoomed-out view
00:17:01.240 --> 00:17:04.599
of the entire structure that we created,
00:17:04.600 --> 00:17:08.039
and essentially you can see that
00:17:08.040 --> 00:17:11.879
by just, you know, using a few keystrokes,
00:17:11.880 --> 00:17:17.839
we were able to capture two different interpretations
00:17:17.840 --> 00:17:20.879
of a simple sentence,
00:17:20.880 --> 00:17:23.759
and we were also able to add
00:17:23.760 --> 00:17:27.799
these alternate pieces of information
00:17:27.800 --> 00:17:30.559
that could help machine learning algorithms
00:17:30.560 --> 00:17:32.439
generalize better.
00:17:32.440 --> 00:17:36.239
All right.
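NOTE Aside (reconstructed, not a literal screenshot): the outline built in the video might look roughly like this; the node labels are illustrative and GRAIL's actual conventions may differ.

```org
* S: I saw the moon with a telescope.
** OR
*** a: PP attaches to "the moon" (the moon holds the telescope)
**** NP: I
**** VP
***** V: saw
***** NP
****** NP: the moon
****** PP: with a telescope
*** b: PP attaches to "saw" (I view the moon through the telescope)
**** NP: I
**** VP
***** V: saw
***** NP: the moon
***** PP: with a telescope
```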
NOTE Different readings
00:17:36.240 --> 00:17:40.359
Now, let's look at the next thing. So in a sense,
00:17:40.360 --> 00:17:46.679
we can use this power of functional data structures
00:17:46.680 --> 00:17:50.239
to represent various potentially conflicting
00:17:50.240 --> 00:17:55.559
structural readings of that piece of text.
00:17:55.560 --> 00:17:58.079
In addition to that, we can also create more texts,
00:17:58.080 --> 00:17:59.799
each with different structure,
00:17:59.800 --> 00:18:01.559
and have them all in the same place.
00:18:01.560 --> 00:18:04.239
This allows us to address the interpretation
00:18:04.240 --> 00:18:06.879
of a static sentence that might be occurring in the world,
00:18:06.880 --> 00:18:09.639
while simultaneously inserting information
00:18:09.640 --> 00:18:11.519
that would add more value to it.
00:18:11.520 --> 00:18:14.999
This makes the enrichment process also very efficient.
00:18:15.000 --> 00:18:19.519
Additionally, we can envision
00:18:19.520 --> 00:18:23.999
a power user of the future, or present,
00:18:24.000 --> 00:18:27.479
who can not only annotate a span,
00:18:27.480 --> 00:18:31.279
but also edit the information in situ
00:18:31.280 --> 00:18:34.639
in a way that would help machine algorithms
00:18:34.640 --> 00:18:36.879
generalize better by making more efficient use
00:18:36.880 --> 00:18:37.719
of the annotations.
00:18:37.720 --> 00:18:41.519
So together, Emacs and Org mode can speed up
00:18:41.520 --> 00:18:42.959
the enrichment of the signals
00:18:42.960 --> 00:18:44.519
in a way that allows us
00:18:44.520 --> 00:18:47.719
to focus on certain aspects and ignore others.
00:18:47.720 --> 00:18:50.839
An extremely complex landscape of rich structures
00:18:50.840 --> 00:18:53.039
can be captured consistently,
00:18:53.040 --> 00:18:55.639
in a fashion that allows computers
00:18:55.640 --> 00:18:56.759
to understand language.
00:18:56.760 --> 00:19:00.879
We can then build tools to enhance the tasks
00:19:00.880 --> 00:19:03.319
that we do in our everyday life.
00:19:03.320 --> 00:19:10.759
YAMR is the acronym for the file type, or specification,
00:19:10.760 --> 00:19:15.239
that we are creating to capture this new
00:19:15.240 --> 00:19:17.679
rich representation.
NOTE Spontaneous speech
00:19:17.680 --> 00:19:21.959
We'll now look at an example of spontaneous speech
00:19:21.960 --> 00:19:24.799
that occurs in spoken conversations.
00:19:24.800 --> 00:19:28.599
Conversations frequently contain errors in speech:
00:19:28.600 --> 00:19:30.799
interruptions, disfluencies,
00:19:30.800 --> 00:19:33.959
verbal sounds such as cough or laugh,
00:19:33.960 --> 00:19:35.039
and other noises.
00:19:35.040 --> 00:19:38.199
In this sense, spontaneous speech is similar
00:19:38.200 --> 00:19:39.799
to a functional data stream.
00:19:39.800 --> 00:19:42.759
We cannot take back words that come out of our mouth,
00:19:42.760 --> 00:19:47.239
but we tend to make mistakes, and we correct ourselves
00:19:47.240 --> 00:19:49.039
as soon as we realize that we have made--
00:19:49.040 --> 00:19:50.679
we have misspoken.
00:19:50.680 --> 00:19:53.159
This process manifests through a combination
00:19:53.160 --> 00:19:56.279
of a handful of mechanisms, including immediate correction
00:19:56.280 --> 00:20:00.959
after an error, and we do this unconsciously.
00:20:00.960 --> 00:20:02.719
Computers, on the other hand,
00:20:02.720 --> 00:20:06.639
must be taught to understand these cases.
00:20:06.640 --> 00:20:12.799
What we see here is an example document or outline,
00:20:12.800 --> 00:20:18.119
or part of a document that illustrates
00:20:18.120 --> 00:20:22.919
various aspects of the representation.
00:20:22.920 --> 00:20:25.919
We don't have a lot of time to go through
00:20:25.920 --> 00:20:28.239
many of the details.
00:20:28.240 --> 00:20:31.759
I would highly encourage you to play a...
00:20:31.760 --> 00:20:39.159
I'm planning on making some videos, or asciinema recordings,
00:20:39.160 --> 00:20:42.559
that I'll be posting, and,
00:20:42.560 --> 00:20:46.759
if you're interested, you can go through those.
00:20:46.760 --> 00:20:50.359
The idea here is to try to do
00:20:50.360 --> 00:20:54.599
a slightly more complex use case.
00:20:54.600 --> 00:20:57.639
But again, given the time constraint
00:20:57.640 --> 00:21:00.279
and the amount of information
00:21:00.280 --> 00:21:01.519
that needs to fit in the screen,
00:21:01.520 --> 00:21:05.559
this may not be very informative,
00:21:05.560 --> 00:21:08.399
but at least it will give you some idea
00:21:08.400 --> 00:21:10.439
of what can be possible.
00:21:10.440 --> 00:21:13.279
And in this particular case, what you're seeing is that
00:21:13.280 --> 00:21:18.319
there is a sentence which is "What I'm I'm tr- telling now."
00:21:18.320 --> 00:21:21.159
Essentially, there is a repetition of the word "I'm",
00:21:21.160 --> 00:21:23.279
and then there is a partial word
00:21:23.280 --> 00:21:25.159
that somebody tried to say "telling",
00:21:25.160 --> 00:21:29.599
but started saying "tr-", and then corrected themselves
00:21:29.600 --> 00:21:30.959
and said, "telling now."
00:21:30.960 --> 00:21:39.239
So in this case, you see, we can capture words
00:21:39.240 --> 00:21:44.919
or a sequence of words, or a sequence of tokens.
00:21:44.920 --> 00:21:52.279
One thing to... An interesting thing to note is that in NLP,
00:21:52.280 --> 00:21:55.319
sometimes we have to break
00:21:55.320 --> 00:22:01.199
words that don't have spaces into two separate words,
00:22:01.200 --> 00:22:04.119
especially contractions like "I'm",
00:22:04.120 --> 00:22:08.199
so the syntactic parser needs two separate nodes.
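NOTE Aside (not from the talk): a toy Emacs Lisp sketch of splitting a contraction into two tokens; real tokenizers, such as the Penn Treebank tokenizer, use much richer rule sets.

```elisp
;; Toy contraction splitter, for illustration only.
(defun demo-split-contraction (word)
  "Split WORD like \"I'm\" into (\"I\" \"'m\"); otherwise return (WORD)."
  (if (string-match "\\`\\(.+?\\)\\('\\(?:m\\|s\\|re\\|ve\\|ll\\|d\\)\\)\\'" word)
      (list (match-string 1 word) (match-string 2 word))
    (list word)))
(demo-split-contraction "I'm") ;; => ("I" "'m")
```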
00:22:08.200 --> 00:22:11.199
But anyway, so I'll... You can see that here.
00:22:11.200 --> 00:22:15.759
What this view shows is that
00:22:15.760 --> 00:22:19.759
with each of the nodes in the sentence
00:22:19.760 --> 00:22:23.079
or in the representation,
00:22:23.080 --> 00:22:26.079
you can have a lot of different properties
00:22:26.080 --> 00:22:27.559
that you can attach to them,
00:22:27.560 --> 00:22:30.119
and these properties are typically hidden,
00:22:30.120 --> 00:22:32.719
like you saw in the earlier slide.
00:22:32.720 --> 00:22:35.599
But you can make use of all these properties
00:22:35.600 --> 00:22:39.439
to do various kind of searches and filtering.
00:22:39.440 --> 00:22:43.519
And on the right hand side here--
00:22:43.520 --> 00:22:48.799
this is actually not a legitimate syntax--
00:22:48.800 --> 00:22:51.279
but on the right are descriptions
00:22:51.280 --> 00:22:53.479
of what each of these represent.
00:22:53.480 --> 00:22:57.319
All the information is also available in the article.
00:22:57.320 --> 00:23:04.279
You can see there... It shows how much rich context
00:23:04.280 --> 00:23:05.879
you can capture.
00:23:05.880 --> 00:23:08.799
This is just a closer snapshot
00:23:08.800 --> 00:23:10.159
of the properties on the node,
00:23:10.160 --> 00:23:13.119
and you can see we can have things like,
00:23:13.120 --> 00:23:14.799
whether the word is a token or not,
00:23:14.800 --> 00:23:17.359
or that it's incomplete, or whether some words
00:23:17.360 --> 00:23:19.959
should be filtered out for parsing,
00:23:19.960 --> 00:23:23.039
which we can mark with PARSE_IGNORE,
00:23:23.040 --> 00:23:25.519
or mark restarts:
00:23:25.520 --> 00:23:29.239
we can add a RESTART_MARKER. Or sometimes,
00:23:29.240 --> 00:23:31.999
some of these might have durations. Things like that.
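NOTE Aside (reconstructed sketch): the partial word "tr-" from the example might carry a property drawer like the one below. The property names follow the talk; the duration value is invented.

```org
** tr-
   :PROPERTIES:
   :TOKEN:        t
   :INCOMPLETE:   t
   :PARSE_IGNORE: t
   :DURATION:     0.21
   :END:
```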
NOTE Editing properties in column view
00:23:32.000 --> 00:23:38.799
The other fascinating thing about this representation
00:23:38.800 --> 00:23:42.599
is that you can edit properties in the column view.
00:23:42.600 --> 00:23:45.399
And suddenly, you have this tabular data structure
00:23:45.400 --> 00:23:48.879
combined with the hierarchical data structure.
00:23:48.880 --> 00:23:53.119
You may not be able to see it here,
00:23:53.120 --> 00:23:56.879
but what has also happened here is that
00:23:56.880 --> 00:24:01.159
some of the tags have been inherited
00:24:01.160 --> 00:24:02.479
from the earlier nodes.
00:24:02.480 --> 00:24:07.919
And so you get a much fuller picture of things.
00:24:07.920 --> 00:24:13.919
Essentially, you can filter out things
00:24:13.920 --> 00:24:15.319
that you want to process,
00:24:15.320 --> 00:24:20.279
process them, and then reintegrate them into the whole.
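NOTE Aside (not shown on screen): column view is driven by a format specification. Assuming the property names used earlier, a sketch might look like the following, toggled with C-c C-x C-c (org-columns).

```org
#+COLUMNS: %25ITEM %TOKEN %INCOMPLETE %PARSE_IGNORE %TAGS
```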
NOTE Conclusion
00:24:20.280 --> 00:24:25.479
So, in conclusion, today we have proposed and demonstrated
00:24:25.480 --> 00:24:27.559
the use of an architecture (GRAIL),
00:24:27.560 --> 00:24:31.319
which allows the representation, manipulation,
00:24:31.320 --> 00:24:34.759
and aggregation of rich linguistic structures
00:24:34.760 --> 00:24:36.519
in a systematic fashion.
00:24:36.520 --> 00:24:41.359
We have shown how GRAIL advances the tools
00:24:41.360 --> 00:24:44.599
available for building machine learning models
00:24:44.600 --> 00:24:46.879
that simulate understanding.
00:24:46.880 --> 00:24:51.679
Thank you very much for your time and attention today.
00:24:51.680 --> 00:24:54.639
My contact information is on this slide.
00:24:54.640 --> 00:25:02.599
If you are interested in an additional example
00:25:02.600 --> 00:25:05.439
that demonstrates the representation
00:25:05.440 --> 00:25:08.039
of speech and written text together,
00:25:08.040 --> 00:25:10.719
please continue watching.
00:25:10.720 --> 00:25:12.199
Otherwise, you can stop here
00:25:12.200 --> 00:25:15.279
and enjoy the rest of the conference.
NOTE Bonus material
00:25:15.280 --> 00:25:39.079
Welcome to the bonus material.
00:25:39.080 --> 00:25:43.959
I'm glad that some of you stuck around.
00:25:43.960 --> 00:25:46.559
We are now going to examine an instance
00:25:46.560 --> 00:25:49.159
of speech and text signals together
00:25:49.160 --> 00:25:51.479
that produce multiple layers.
00:25:51.480 --> 00:25:54.839
When we take a spoken conversation
00:25:54.840 --> 00:25:58.719
and use the best language processing models available,
00:25:58.720 --> 00:26:00.679
we suddenly hit a hard spot
00:26:00.680 --> 00:26:03.239
because the tools are typically not trained
00:26:03.240 --> 00:26:05.359
to filter out the unnecessary cruft
00:26:05.360 --> 00:26:07.559
in order to automatically interpret
00:26:07.560 --> 00:26:09.559
the part of what is being said
00:26:09.560 --> 00:26:11.799
that is actually relevant.
00:26:11.800 --> 00:26:14.639
Over time, language researchers
00:26:14.640 --> 00:26:17.719
have created many interdependent layers of annotations,
00:26:17.720 --> 00:26:21.039
yet the assumptions underlying them are seldom the same.
00:26:21.040 --> 00:26:25.039
Piecing together such related but disjointed annotations
00:26:25.040 --> 00:26:28.039
and their predictions poses a huge challenge.
00:26:28.040 --> 00:26:30.719
This is another place where we can leverage
00:26:30.720 --> 00:26:33.119
the data model underlying the Emacs editor,
00:26:33.120 --> 00:26:35.359
along with the structural editing capabilities
00:26:35.360 --> 00:26:38.519
of Org mode to improve current tools.
00:26:38.520 --> 00:26:42.839
Let's take this very simple-looking utterance.
00:26:42.840 --> 00:26:48.039
"Um {lipsmack} and that's it. ({laugh})"
00:26:48.040 --> 00:26:50.319
Looks like the person-- so this is--
00:26:50.320 --> 00:26:54.519
what you are seeing here is a transcript of an audio signal
00:26:54.520 --> 00:27:00.759
that has a lip smack and a laugh as part of it,
00:27:00.760 --> 00:27:04.199
and there is also a "Um" like interjection.
00:27:04.200 --> 00:27:08.199
So this has a few interesting noises
00:27:08.200 --> 00:27:13.999
and specific things that would be illustrative
00:27:14.000 --> 00:27:20.479
of how we are going to represent it.
NOTE Syntactic analysis
00:27:20.480 --> 00:27:25.839
Okay. So let's say you want to have
00:27:25.840 --> 00:27:28.879
a syntactic analysis of this sentence or utterance.
00:27:28.880 --> 00:27:30.959
One common technique people use
00:27:30.960 --> 00:27:32.879
is just to remove the cruft, and, you know,
00:27:32.880 --> 00:27:35.079
write some rules, clean up the utterance,
00:27:35.080 --> 00:27:36.719
make it look like it's proper English,
00:27:36.720 --> 00:27:40.239
and then, you know, tokenize it,
00:27:40.240 --> 00:27:43.079
and basically just use standard tools to process it.
00:27:43.080 --> 00:27:47.279
But in that process, they end up eliminating
00:27:47.280 --> 00:27:51.119
valid pieces of signal that have meaning to others
00:27:51.120 --> 00:27:52.799
studying different phenomena of language.
00:27:52.800 --> 00:27:56.479
Here you have the rich transcript,
00:27:56.480 --> 00:28:00.119
the input to the syntactic parser.
00:28:00.120 --> 00:28:05.919
As you can see, there is a little tokenization happening
00:28:05.920 --> 00:28:07.199
where you'll be inserting a space
00:28:07.200 --> 00:28:12.119
between "that" and the contracted is ('s),
00:28:12.120 --> 00:28:15.599
and between the period and the "it,"
00:28:15.600 --> 00:28:18.199
and the output of the syntactic parser is shown below,
00:28:18.200 --> 00:28:21.639
which (surprise) is an S-expression.
00:28:21.640 --> 00:28:24.919
Like I said, the parse trees, when they were created,
00:28:24.920 --> 00:28:29.799
and still largely when they are used, are S-expressions,
00:28:29.800 --> 00:28:32.999
and most of the viewers here
00:28:33.000 --> 00:28:35.119
should not have much problem reading it.
00:28:35.120 --> 00:28:37.279
You can see this tree structure
00:28:37.280 --> 00:28:39.279
of the syntactic parse here.
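NOTE The two steps just described — stripping the cruft and reading the parser's S-expression output — can be sketched in Python. This is a minimal illustration, not any particular parser's pipeline; the regex and the tiny parse tree are invented for the example.

```python
import re

def strip_cruft(utterance):
    """Remove non-speech markers like {lipsmack} -- the common,
    lossy cleanup the talk cautions against."""
    return re.sub(r"\s*\{[^}]*\}\s*", " ", utterance).strip()

def parse_sexp(s):
    """Read a Penn-Treebank-style S-expression into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(i):
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = read(i)
                node.append(child)
            return node, i + 1        # skip the closing paren
        return tokens[i], i + 1
    tree, _ = read(0)
    return tree

clean = strip_cruft("Um {lipsmack} and that's it. {laugh}")
tree = parse_sexp("(INTJ (UH Um))")
```

Note that `strip_cruft` throws away exactly the signal (the lip smack, the laugh) that other researchers may care about — which is the problem the architecture is meant to avoid.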
NOTE Forced alignment
00:28:39.280 --> 00:28:40.919
Now let's say you want to integrate
00:28:40.920 --> 00:28:44.479
phonetic information or phonetic layer
00:28:44.480 --> 00:28:49.119
that's in the audio signal, and do some analysis.
00:28:49.120 --> 00:28:57.519
Now, that would require you to take a few steps.
00:28:57.520 --> 00:29:01.679
First, you would need to align the transcript
00:29:01.680 --> 00:29:06.479
with the audio. This process is called forced alignment,
00:29:06.480 --> 00:29:10.399
where you already know what the transcript is,
00:29:10.400 --> 00:29:14.599
and you have the audio, and you can get a good alignment
00:29:14.600 --> 00:29:17.599
using both pieces of information.
00:29:17.600 --> 00:29:20.119
And this is typically a technique that is used to
00:29:20.120 --> 00:29:23.079
create training data for training
00:29:23.080 --> 00:29:25.839
automatic speech recognizers.
00:29:25.840 --> 00:29:29.639
One interesting thing is that in order to do
00:29:29.640 --> 00:29:32.879
this forced alignment, you have to keep
00:29:32.880 --> 00:29:35.799
the non-speech events in the transcript,
00:29:35.800 --> 00:29:39.079
because they consume some audio signal,
00:29:39.080 --> 00:29:41.399
and if you don't have that signal,
00:29:41.400 --> 00:29:44.399
the alignment process doesn't know exactly...
00:29:44.400 --> 00:29:45.759
you know, it doesn't do a good job,
00:29:45.760 --> 00:29:50.039
because it needs to align all parts of the signal
00:29:50.040 --> 00:29:54.999
with something, either pause or silence or noise or words.
00:29:55.000 --> 00:29:59.719
Interestingly, punctuation really doesn't factor in,
00:29:59.720 --> 00:30:01.559
because we don't speak in punctuation.
00:30:01.560 --> 00:30:04.239
So one of the things that you need to do
00:30:04.240 --> 00:30:05.679
is remove most of the punctuation,
00:30:05.680 --> 00:30:08.039
although you'll see there is some punctuation
00:30:08.040 --> 00:30:12.599
that can, or should, be kept.
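NOTE Preparing a transcript for forced alignment, as just described, might look like this in Python. The event symbols ({LS}, {LG}) and the punctuation set are assumptions for the sake of the example; real aligners each have their own conventions.

```python
# Keep non-speech events (they consume audio signal), drop punctuation
# (we don't speak in punctuation), and map events to aligner symbols.

NONSPEECH = {"{lipsmack}": "{LS}", "{laugh}": "{LG}"}

def prepare_for_alignment(transcript):
    """Return the token sequence handed to the forced aligner."""
    out = []
    for tok in transcript.split():
        if tok in NONSPEECH:
            out.append(NONSPEECH[tok])   # keep: it consumes audio
        else:
            word = tok.strip(".,!?;:")   # drop: no acoustic content
            if word:
                out.append(word)
    return out
```

Note that "that's" is deliberately left as one word here — the next section explains why alignment has to happen before tokenization.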
NOTE Alignment before tokenization
00:30:12.600 --> 00:30:15.319
And the other thing is that the alignment has to be done
00:30:15.320 --> 00:30:20.159
before tokenization, as it impacts pronunciation.
00:30:20.160 --> 00:30:24.399
To show an example: Here you see "that's".
00:30:24.400 --> 00:30:26.919
When it's one word,
00:30:26.920 --> 00:30:31.959
it has a slightly different pronunciation
00:30:31.960 --> 00:30:35.679
than when it is two words, which is "that is",
00:30:35.680 --> 00:30:38.399
as you can see with "is." And so,
00:30:38.400 --> 00:30:44.279
if you split the tokens or split the words
00:30:44.280 --> 00:30:48.119
in order for the syntactic parser to process it,
00:30:48.120 --> 00:30:51.599
you would end up getting the wrong phonetic analysis.
00:30:51.600 --> 00:30:54.239
And if you have--if you process it
00:30:54.240 --> 00:30:55.319
through the phonetic analysis,
00:30:55.320 --> 00:30:59.159
and you don't know how to integrate it
00:30:59.160 --> 00:31:02.719
with the tokenized syntax,
00:31:02.720 --> 00:31:07.519
that can be pretty tricky. And a lot of the time,
00:31:07.520 --> 00:31:10.759
people write one-off pieces of code that handle these,
00:31:10.760 --> 00:31:14.279
but the idea here is to try to have a general architecture
00:31:14.280 --> 00:31:17.239
that seamlessly integrates all these pieces.
00:31:17.240 --> 00:31:21.319
Then you do the syntactic parsing of the remaining tokens.
00:31:21.320 --> 00:31:24.799
Then you align the data and the two annotations,
00:31:24.800 --> 00:31:27.959
and then integrate the two layers.
00:31:27.960 --> 00:31:31.359
Once that is done, then you can do all kinds of
00:31:31.360 --> 00:31:33.919
interesting analysis, and test various hypotheses
00:31:33.920 --> 00:31:35.279
and generate the statistics,
00:31:35.280 --> 00:31:39.359
but without that you only are dealing
00:31:39.360 --> 00:31:42.879
with one or the other part.
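NOTE The integration step just described — reconciling word-level alignment times with the finer-grained tokens the parser consumed — can be sketched in Python. The time values and mapping are hypothetical, purely to show the shape of the problem.

```python
# "that's" is aligned to the audio as one word, but the parser sees
# two tokens ("that", "'s"). One simple reconciliation: each token
# inherits the time interval of the source word it came from.

word_times = {"and": (0.50, 0.62), "that's": (0.62, 0.90), "it": (0.90, 1.05)}
token_to_word = {"and": "and", "that": "that's", "'s": "that's", "it": "it"}

def token_times(token):
    """Project word-level alignment onto a parser token."""
    return word_times[token_to_word[token]]
```

One-off scripts tend to hard-code mappings like `token_to_word`; the point of a general architecture is to derive them systematically from shared character offsets instead.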
NOTE Layers
00:31:42.880 --> 00:31:48.319
Let's just take a quick look at what each of the layers
00:31:48.320 --> 00:31:51.159
involved looks like.
00:31:51.160 --> 00:31:56.719
So this is "Um {lipsmack}, and that's it. {laugh}"
00:31:56.720 --> 00:32:00.159
This is the transcript, and on the right hand side,
00:32:00.160 --> 00:32:04.199
you see the same thing as a transcript
00:32:04.200 --> 00:32:06.239
listed vertically in a column.
00:32:06.240 --> 00:32:08.199
You'll see why, in just a second.
00:32:08.200 --> 00:32:09.879
And you'll see that
00:32:09.880 --> 00:32:11.279
there are some rows that are empty,
00:32:11.280 --> 00:32:15.079
some rows that are wider than the others, and we'll see why.
00:32:15.080 --> 00:32:19.319
The next is the tokenized sentence
00:32:19.320 --> 00:32:20.959
where you have space added,
00:32:20.960 --> 00:32:23.599
you know, a space between these tokens:
00:32:23.600 --> 00:32:26.599
"that" and the apostrophe "s" ('s),
00:32:26.600 --> 00:32:28.079
and between "it" and the period.
00:32:28.080 --> 00:32:30.679
And you see on the right hand side
00:32:30.680 --> 00:32:33.559
that the tokens have attributes.
00:32:33.560 --> 00:32:36.439
So there is a token index, and there are
00:32:36.440 --> 00:32:38.839
six tokens, numbered 0 through 5,
00:32:38.840 --> 00:32:41.479
and each token has a start and end character,
00:32:41.480 --> 00:32:45.799
and space (sp) also has a start and end character,
00:32:45.800 --> 00:32:50.399
and space is represented by a "sp". And there are
00:32:50.400 --> 00:32:54.319
these other things that we removed,
00:32:54.320 --> 00:32:56.239
like the "{LS}" which stands for "{lipsmack}"
00:32:56.240 --> 00:32:59.399
and "{LG}" which stands for "{laugh}", shown grayed out,
00:32:59.400 --> 00:33:02.439
and you'll see why some of these things are grayed out
00:33:02.440 --> 00:33:03.399
in a little bit.
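NOTE The token table just described — each token with an index plus start and end character offsets into the raw transcript — might be built like this. A minimal sketch; the field names are invented, not GRAIL's actual schema.

```python
def tokenize_with_offsets(text, tokens):
    """Locate each token in the original text, recording character
    offsets, so every annotation layer can point back to the same
    underlying signal."""
    records, pos = [], 0
    for i, tok in enumerate(tokens):
        start = text.index(tok, pos)   # find next occurrence in order
        end = start + len(tok)
        records.append({"index": i, "token": tok, "start": start, "end": end})
        pos = end
    return records

text = "and that's it."
recs = tokenize_with_offsets(text, ["and", "that", "'s", "it", "."])
```

Keeping character offsets rather than copies of the text is what later lets the tokenized syntax layer and the time-aligned speech layer be stitched back together.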
00:33:03.400 --> 00:33:11.919
This is what the forced alignment tool produces.
00:33:11.920 --> 00:33:17.159
Basically, it takes the transcript,
00:33:17.160 --> 00:33:19.159
and this is the transcript
00:33:19.160 --> 00:33:24.119
that has slightly different symbols,
00:33:24.120 --> 00:33:26.239
because different tools use different symbols
00:33:26.240 --> 00:33:28.159
and have various configuration conventions.
00:33:28.160 --> 00:33:33.679
But this is what is used to get an alignment
00:33:33.680 --> 00:33:36.039
or time alignment with phones.
00:33:36.040 --> 00:33:40.079
So this column shows the phones, and so each word...
00:33:40.080 --> 00:33:43.879
So, for example, "and" has been aligned with these phones,
00:33:43.880 --> 00:33:46.879
and these on the start and end
00:33:46.880 --> 00:33:52.959
are essentially timestamps
00:33:52.960 --> 00:33:54.279
that the word has been aligned to.
00:33:54.280 --> 00:34:00.759
Interestingly, sometimes we don't really have any pause
00:34:00.760 --> 00:34:05.159
or any time duration between some words
00:34:05.160 --> 00:34:08.199
and those are highlighted as gray here.
00:34:08.200 --> 00:34:12.759
See, there's this space... Actually
00:34:12.760 --> 00:34:17.799
it does not have any temporal content,
00:34:17.800 --> 00:34:21.319
whereas this other space has some duration.
00:34:21.320 --> 00:34:24.839
So the ones that have some duration are captured,
00:34:24.840 --> 00:34:29.519
while the others are the ones that in the earlier diagram
00:34:29.520 --> 00:34:31.319
we saw were left out.
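NOTE The zero-duration spaces just mentioned can be filtered mechanically. The (label, start, end) triples below are a simplified stand-in for the aligner's real output format, and the times are invented.

```python
# Hypothetical aligner output: one (label, start, end) entry per unit,
# where "sp" marks a space between words.

alignment = [
    ("Um", 0.10, 0.35),
    ("sp", 0.35, 0.35),    # zero duration: no audible pause
    ("{LS}", 0.35, 0.50),
    ("and", 0.50, 0.62),
    ("sp", 0.62, 0.71),    # an actual pause, with duration
    ("that's", 0.71, 0.98),
]

def with_duration(entries):
    """Keep only entries that actually consume audio signal."""
    return [(w, s, e) for (w, s, e) in entries if e > s]
```

The entries dropped here are exactly the ones shown grayed out in the earlier diagram.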
NOTE Variations
00:34:31.320 --> 00:34:37.639
And the aligner actually produces multiple files.
00:34:37.640 --> 00:34:44.399
One of the files has a slightly different
00:34:44.400 --> 00:34:46.679
variation on the same information,
00:34:46.680 --> 00:34:49.999
and in this case, you can see
00:34:50.000 --> 00:34:52.399
that the punctuation is missing,
00:34:52.400 --> 00:34:57.599
and the punctuation is, you know, deliberately missing,
00:34:57.600 --> 00:35:02.279
because there is no time associated with it,
00:35:02.280 --> 00:35:06.439
and you see that it's not the tokenized sentence--
00:35:06.440 --> 00:35:17.119
a tokenized word. Now, this gives you a full table,
00:35:17.120 --> 00:35:21.239
and you can't really look into it very carefully.
00:35:21.240 --> 00:35:25.879
But we can focus on the part that seems legible,
00:35:25.880 --> 00:35:28.559
or, you know, a properly written sentence,
00:35:28.560 --> 00:35:32.879
process it and reincorporate it back into the whole.
00:35:32.880 --> 00:35:35.879
So if somebody wants to look at, for example,
00:35:35.880 --> 00:35:39.679
how many pauses the person made while they were talking,
00:35:39.680 --> 00:35:42.919
and they can actually measure the pauses: the number,
00:35:42.920 --> 00:35:46.279
the duration, and make connections between that
00:35:46.280 --> 00:35:49.639
and the rich syntactic structure that is being produced.
00:35:49.640 --> 00:35:57.279
And in order to do that, you have to get these layers
00:35:57.280 --> 00:35:59.039
to align with each other,
00:35:59.040 --> 00:36:04.359
and this table is just a tabular representation
00:36:04.360 --> 00:36:08.679
of the information that we'll be storing in the YAMR file.
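NOTE The pause analysis just mentioned falls out directly once the layers are aligned. A toy sketch with invented numbers, only to show the kind of query the integrated representation enables:

```python
def pauses(word_alignment):
    """Return (count, durations) of the silences between
    consecutive time-aligned words."""
    durations = [b_start - a_end
                 for (_, _, a_end), (_, b_start, _)
                 in zip(word_alignment, word_alignment[1:])
                 if b_start > a_end]
    return len(durations), durations

words = [("Um", 0.10, 0.35), ("and", 0.50, 0.62), ("that's", 0.62, 0.98)]
```

With the layers integrated, each pause found here can also be related back to its position in the syntactic structure.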
00:36:08.680 --> 00:36:11.719
Congratulations! You have reached
00:36:11.720 --> 00:36:13.479
the end of this demonstration.
00:36:13.480 --> 00:36:17.000
Thank you for your time and attention.