add captions

author: Sacha Chua <sacha@sachachua.com> 2022-12-13 00:51:18 -0500
committer: Sacha Chua <sacha@sachachua.com> 2022-12-13 00:51:18 -0500
commit: a223dda5a9e14cd51d960533604cd9e284d7624f (patch)
tree: 9767eee6758b5d9b249e778aa97eaac801588243
parent: 2fd19da21925447affd446a7aa3fee2b12d94c0e (diff)
1 files changed, 1945 insertions, 0 deletions
diff --git a/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt b/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt
new file mode 100644
index 00000000..a642f94a
--- /dev/null
+++ b/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt
@@ -0,0 +1,1945 @@
+WEBVTT captioned by sameer
+
+NOTE Introduction
+
+00:00:00.000 --> 00:00:05.839
+Thank you for joining me today. I'm Sameer Pradhan
+
+00:00:05.840 --> 00:00:07.799
+from the Linguistic Data Consortium
+
+00:00:07.800 --> 00:00:10.079
+at the University of Pennsylvania
+
+00:00:10.080 --> 00:00:14.519
+and founder of cemantix.org.
+
+00:00:14.520 --> 00:00:16.879
+Today we'll be addressing research
+
+00:00:16.880 --> 00:00:18.719
+in computational linguistics,
+
+00:00:18.720 --> 00:00:22.039
+also known as natural language processing,

+00:00:22.040 --> 00:00:24.719
+a subarea of artificial intelligence
+
+00:00:24.720 --> 00:00:27.759
+with a focus on modeling and predicting
+
+00:00:27.760 --> 00:00:31.919
+complex linguistic structures from various signals.
+
+00:00:31.920 --> 00:00:35.799
+The work we present is limited to text and speech signals,
+
+00:00:35.800 --> 00:00:38.639
+but it can be extended to other signals.
+
+00:00:38.640 --> 00:00:40.799
+We propose an architecture,
+
+00:00:40.800 --> 00:00:42.959
+and we call it GRAIL, which allows
+
+00:00:42.960 --> 00:00:44.639
+the representation and aggregation
+
+00:00:44.640 --> 00:00:50.199
+of such rich structures in a systematic fashion.
+
+00:00:50.200 --> 00:00:52.679
+I'll demonstrate a proof of concept
+
+00:00:52.680 --> 00:00:56.559
+for representing and manipulating data and annotations
+
+00:00:56.560 --> 00:00:58.519
+for the specific purpose of building
+
+00:00:58.520 --> 00:01:02.879
+machine learning models that simulate understanding.
+
+00:01:02.880 --> 00:01:05.679
+These technologies have the potential for impact
+
+00:01:05.680 --> 00:01:09.119
+in almost every conceivable field
+
+00:01:09.120 --> 00:01:13.399
+that generates and uses data.
+
+NOTE Processing language
+
+00:01:13.400 --> 00:01:15.039
+We process human language
+
+00:01:15.040 --> 00:01:16.719
+when our brains receive and assimilate
+
+00:01:16.720 --> 00:01:20.079
+various signals which are then manipulated
+
+00:01:20.080 --> 00:01:23.879
+and interpreted within a syntactic structure.
+
+00:01:23.880 --> 00:01:27.319
+It's a complex process that I have simplified here
+
+00:01:27.320 --> 00:01:30.759
+for the purpose of comparison to machine learning.
+
+00:01:30.760 --> 00:01:33.959
+Recent machine learning models tend to require
+
+00:01:33.960 --> 00:01:37.039
+a large amount of raw, naturally occurring data
+
+00:01:37.040 --> 00:01:40.199
+and a varying amount of manually enriched data,
+
+00:01:40.200 --> 00:01:43.199
+commonly known as "annotations".
+
+00:01:43.200 --> 00:01:45.959
+Owing to the complex and numerous nature
+
+00:01:45.960 --> 00:01:49.959
+of linguistic phenomena, we have most often used
+
+00:01:49.960 --> 00:01:52.999
+a divide and conquer approach.
+
+00:01:53.000 --> 00:01:55.399
+The strength of this approach is that it allows us
+
+00:01:55.400 --> 00:01:58.159
+to focus on a single, or perhaps a few related
+
+00:01:58.160 --> 00:02:00.439
+linguistic phenomena.
+
+00:02:00.440 --> 00:02:03.879
+The weaknesses are, first, that the universe of these phenomena

+00:02:03.880 --> 00:02:07.239
+keeps expanding, as language itself
+
+00:02:07.240 --> 00:02:09.359
+evolves and changes over time,
+
+00:02:09.360 --> 00:02:13.119
+and second, this approach requires an additional task
+
+00:02:13.120 --> 00:02:14.839
+of aggregating the interpretations,
+
+00:02:14.840 --> 00:02:18.359
+creating more opportunities for computer error.
+
+00:02:18.360 --> 00:02:21.519
+Our challenge, then, is to find the sweet spot
+
+00:02:21.520 --> 00:02:25.239
+that allows us to encode complex information
+
+00:02:25.240 --> 00:02:27.719
+without the use of manual annotation,
+
+00:02:27.720 --> 00:02:34.559
+or without the additional task of aggregation by computers.
+
+NOTE Annotation
+
+00:02:34.560 --> 00:02:37.119
+So what do I mean by "annotation"?
+
+00:02:37.120 --> 00:02:39.759
+In this talk, the word "annotation" refers to
+
+00:02:39.760 --> 00:02:43.519
+the manual assignment of certain attributes
+
+00:02:43.520 --> 00:02:48.639
+to portions of a signal which is necessary
+
+00:02:48.640 --> 00:02:51.639
+to perform the end task.
+
+00:02:51.640 --> 00:02:54.439
+For example, in order for the algorithm
+
+00:02:54.440 --> 00:02:57.439
+to accurately interpret a pronoun,
+
+00:02:57.440 --> 00:03:00.279
+it needs to know that pronoun,
+
+00:03:00.280 --> 00:03:03.799
+what that pronoun refers back to.
+
+00:03:03.800 --> 00:03:06.719
+We may find this task trivial; however,
+
+00:03:06.720 --> 00:03:10.599
+current algorithms repeatedly fail in this task.
+
+00:03:10.600 --> 00:03:13.319
+So the complexities of understanding
+
+00:03:13.320 --> 00:03:16.639
+in computational linguistics require annotation.
+
+00:03:16.640 --> 00:03:20.799
+The word "annotation" itself is a useful example,
+
+00:03:20.800 --> 00:03:22.679
+because it also reminds us
+
+00:03:22.680 --> 00:03:25.119
+that words have multiple meanings
+
+00:03:25.120 --> 00:03:27.519
+as annotation itself does—
+
+00:03:27.520 --> 00:03:30.559
+just as I needed to define it in this context,
+
+00:03:30.560 --> 00:03:33.799
+so that my message won't be misinterpreted.
+
+00:03:33.800 --> 00:03:39.039
+So, too, must annotators do this for algorithms
+
+00:03:39.040 --> 00:03:43.239
+through manual intervention.
+
+NOTE Learning from data
+
+00:03:43.240 --> 00:03:44.759
+Learning from raw data
+
+00:03:44.760 --> 00:03:47.039
+(commonly known as unsupervised learning)
+
+00:03:47.040 --> 00:03:50.079
+poses limitations for machine learning.
+
+00:03:50.080 --> 00:03:53.039
+As I described, modeling complex phenomena
+
+00:03:53.040 --> 00:03:55.559
+needs manual annotations.
+
+00:03:55.560 --> 00:03:58.559
+The learning algorithm uses these annotations
+
+00:03:58.560 --> 00:04:01.319
+as examples to build statistical models.
+
+00:04:01.320 --> 00:04:04.879
+This is called supervised learning.
+
+00:04:04.880 --> 00:04:06.319
+Without going into too much detail,
+
+00:04:06.320 --> 00:04:10.039
+I'll simply note that the recent popularity
+
+00:04:10.040 --> 00:04:12.519
+of the concept of deep learning
+
+00:04:12.520 --> 00:04:14.679
+is that evolutionary step
+
+00:04:14.680 --> 00:04:17.319
+where we have learned to train models
+
+00:04:17.320 --> 00:04:20.799
+using trillions of parameters in ways that they can
+
+00:04:20.800 --> 00:04:25.079
+learn richer hierarchical structures
+
+00:04:25.080 --> 00:04:29.399
+from very large amounts of unannotated data.
+
+00:04:29.400 --> 00:04:32.319
+These models can then be fine-tuned,
+
+00:04:32.320 --> 00:04:35.599
+using varying amounts of annotated examples
+
+00:04:35.600 --> 00:04:37.639
+depending on the complexity of the task
+
+00:04:37.640 --> 00:04:39.679
+to generate better predictions.
+
+NOTE Manual annotation
+
+00:04:39.680 --> 00:04:44.919
+As you might imagine, manually annotating
+
+00:04:44.920 --> 00:04:47.359
+complex linguistic phenomena

+00:04:47.360 --> 00:04:51.719
+can be a very specific, labor-intensive task.
+
+00:04:51.720 --> 00:04:54.279
+For example, imagine if we were
+
+00:04:54.280 --> 00:04:56.399
+to go back through this presentation
+
+00:04:56.400 --> 00:04:58.399
+and connect all the pronouns
+
+00:04:58.400 --> 00:04:59.919
+with the nouns to which they refer.
+
+00:04:59.920 --> 00:05:03.239
+Even for a short 18-minute presentation,
+
+00:05:03.240 --> 00:05:05.239
+this would require hundreds of annotations.
+
+00:05:05.240 --> 00:05:08.519
+The models we build are only as good
+
+00:05:08.520 --> 00:05:11.119
+as the quality of the annotations we make.
+
+00:05:11.120 --> 00:05:12.679
+We need guidelines
+
+00:05:12.680 --> 00:05:15.759
+that ensure that the annotations are done
+
+00:05:15.760 --> 00:05:19.719
+by at least two humans who have substantial agreement
+
+00:05:19.720 --> 00:05:22.119
+with each other in their interpretations.
+
+00:05:22.120 --> 00:05:25.599
+We know that if we try to train a model using annotations
+
+00:05:25.600 --> 00:05:28.519
+that are very subjective, or have more noise,
+
+00:05:28.520 --> 00:05:30.919
+we will receive poor predictions.
+
+00:05:30.920 --> 00:05:33.679
+Additionally, there is the concern of introducing
+
+00:05:33.680 --> 00:05:37.079
+various unexpected biases into one's models.
+
+00:05:37.080 --> 00:05:44.399
+So annotation is really both an art and a science.
+
+NOTE How can we develop a unified representation?
+
+00:05:44.400 --> 00:05:47.439
+In the remaining time,
+
+00:05:47.440 --> 00:05:49.999
+we will turn to two fundamental questions.
+
+00:05:50.000 --> 00:05:54.239
+First, how can we develop a unified representation
+
+00:05:54.240 --> 00:05:55.599
+of data and annotations
+
+00:05:55.600 --> 00:05:59.759
+that encompasses arbitrary levels of linguistic information?
+
+00:05:59.760 --> 00:06:03.839
+There is a long history of attempting to answer
+
+00:06:03.840 --> 00:06:04.839
+this first question.
+
+00:06:04.840 --> 00:06:08.839
+This history is documented in our recent article,
+
+00:06:08.840 --> 00:06:11.519
+and you can refer to that article.
+
+00:06:11.520 --> 00:06:16.719
+It will be on the website.
+
+00:06:16.720 --> 00:06:18.999
+It is as if we, as a community,
+
+00:06:19.000 --> 00:06:22.519
+have been searching for our own Holy Grail.
+
+NOTE What role might Emacs and Org mode play?
+
+00:06:22.520 --> 00:06:26.519
+The second question we will pose is
+
+00:06:26.520 --> 00:06:30.159
+what role might Emacs, along with Org mode,
+
+00:06:30.160 --> 00:06:31.919
+play in this process?
+
+00:06:31.920 --> 00:06:35.359
+While the solution itself may not be tied to Emacs,

+00:06:35.360 --> 00:06:38.359
+Emacs has built-in capabilities
+
+00:06:38.360 --> 00:06:42.599
+that could be useful for evaluating potential solutions.
+
+00:06:42.600 --> 00:06:45.759
+It's also one of the most extensively documented
+
+00:06:45.760 --> 00:06:48.519
+pieces of software and the most customizable
+
+00:06:48.520 --> 00:06:51.599
+piece of software that I have ever come across,
+
+00:06:51.600 --> 00:06:55.279
+and many would agree with that.
+
+NOTE The complex structure of language
+
+00:06:55.280 --> 00:07:00.639
+In order to approach this second question,
+
+00:07:00.640 --> 00:07:03.919
+we turn to the complex structure of language itself.
+
+00:07:03.920 --> 00:07:07.679
+At first glance, language appears to us
+
+00:07:07.680 --> 00:07:09.879
+as a series of words.
+
+00:07:09.880 --> 00:07:13.439
+Words form sentences, sentences form paragraphs,
+
+00:07:13.440 --> 00:07:16.239
+and paragraphs form completed text.
+
+00:07:16.240 --> 00:07:19.039
+If this were a sufficient description
+
+00:07:19.040 --> 00:07:21.159
+of the complexity of language,
+
+00:07:21.160 --> 00:07:24.199
+all of us would be able to speak and read
+
+00:07:24.200 --> 00:07:26.559
+at least ten different languages.
+
+00:07:26.560 --> 00:07:29.279
+We know it is much more complex than this.
+
+00:07:29.280 --> 00:07:33.199
+There is a rich, underlying recursive tree structure--
+
+00:07:33.200 --> 00:07:36.439
+in fact, many possible tree structures--
+
+00:07:36.440 --> 00:07:39.439
+which makes a particular sequence meaningful
+
+00:07:39.440 --> 00:07:42.079
+and many others meaningless.
+
+00:07:42.080 --> 00:07:45.239
+One of the better understood tree structures
+
+00:07:45.240 --> 00:07:47.119
+is the syntactic structure.
+
+00:07:47.120 --> 00:07:49.439
+While natural language
+
+00:07:49.440 --> 00:07:51.679
+has rich ambiguities and complexities,
+
+00:07:51.680 --> 00:07:55.119
+programming languages are designed to be parsed
+
+00:07:55.120 --> 00:07:56.999
+and interpreted deterministically.
+
+00:07:57.000 --> 00:08:02.159
+Emacs has been used for programming very effectively.
+
+00:08:02.160 --> 00:08:05.359
+So there is a potential for using Emacs
+
+00:08:05.360 --> 00:08:06.559
+as a tool for annotation.
+
+00:08:06.560 --> 00:08:10.799
+This would significantly improve our current set of tools.
+
+NOTE Annotation tools
+
+00:08:10.800 --> 00:08:16.559
+It is important to note that most of the annotation tools
+
+00:08:16.560 --> 00:08:19.639
+that have been developed over the past few decades
+
+00:08:19.640 --> 00:08:22.879
+have relied on graphical interfaces,
+
+00:08:22.880 --> 00:08:26.919
+even those used for enriching textual information.
+
+00:08:26.920 --> 00:08:30.399
+Most of the tools in current use
+
+00:08:30.400 --> 00:08:36.159
+are designed for an end user to add very specific,
+
+00:08:36.160 --> 00:08:38.639
+very restricted information.
+
+00:08:38.640 --> 00:08:42.799
+We have not really made use of the potential
+
+00:08:42.800 --> 00:08:45.639
+that an editor or a rich editing environment like Emacs
+
+00:08:45.640 --> 00:08:47.239
+can add to the mix.
+
+00:08:47.240 --> 00:08:52.479
+Emacs has long enabled the editing and manipulation of
+
+00:08:52.480 --> 00:08:56.359
+complex embedded tree structures abundant in source code.
+
+00:08:56.360 --> 00:08:58.599
+So it's not difficult to imagine that it would have
+
+00:08:58.600 --> 00:09:00.359
+many capabilities that we need
+
+00:09:00.360 --> 00:09:02.599
+to represent actual language.
+
+00:09:02.600 --> 00:09:04.759
+In fact, it already does that with features
+
+00:09:04.760 --> 00:09:06.399
+that allow us to quickly navigate
+
+00:09:06.400 --> 00:09:07.919
+through sentences and paragraphs

+00:09:07.920 --> 00:09:09.799
+with just a few keystrokes,

+00:09:09.800 --> 00:09:13.599
+or to add various text properties to text spans

+00:09:13.600 --> 00:09:17.039
+or to create overlays, to name but a few.
+
+00:09:17.040 --> 00:09:22.719
+Emacs figured out a way to handle Unicode,
+
+00:09:22.720 --> 00:09:26.799
+so you don't even have to worry about the complexity
+
+00:09:26.800 --> 00:09:29.439
+of managing multiple languages.
+
+00:09:29.440 --> 00:09:34.039
+It's built into Emacs. In fact, this is not the first time
+
+00:09:34.040 --> 00:09:37.399
+Emacs has been used for linguistic analysis.
+
+00:09:37.400 --> 00:09:41.159
+One of the breakthrough moments in

+00:09:41.160 --> 00:09:44.439
+natural language processing was the creation

+00:09:44.440 --> 00:09:48.639
+of manually annotated syntactic trees

+00:09:48.640 --> 00:09:50.439
+for a one-million-word collection
+
+00:09:50.440 --> 00:09:52.399
+of Wall Street Journal articles.
+
+00:09:52.400 --> 00:09:54.879
+This was around 1992,
+
+00:09:54.880 --> 00:09:59.279
+before Java or graphical interfaces were common.
+
+00:09:59.280 --> 00:10:03.279
+The tool that was used to create that corpus was Emacs.
+
+00:10:03.280 --> 00:10:08.959
+It was created at UPenn, and is famously known as
+
+00:10:08.960 --> 00:10:12.719
+the Penn Treebank. '92 was about when
+
+00:10:12.720 --> 00:10:16.439
+the Linguistic Data Consortium was also established,
+
+00:10:16.440 --> 00:10:18.039
+and it's been about 30 years
+
+00:10:18.040 --> 00:10:20.719
+that it has been creating various
+
+00:10:20.720 --> 00:10:22.359
+language-related resources.
+
+NOTE Org mode
+
+00:10:22.360 --> 00:10:28.519
+Org mode--in particular, the outlining mode,
+
+00:10:28.520 --> 00:10:32.399
+or rather the enhanced form of outlining mode--
+
+00:10:32.400 --> 00:10:35.599
+allows us to create rich outlines,
+
+00:10:35.600 --> 00:10:37.799
+attaching properties to nodes,
+
+00:10:37.800 --> 00:10:41.119
+and provides commands for easily customizing
+
+00:10:41.120 --> 00:10:43.879
+the sorting of various pieces of information

+00:10:43.880 --> 00:10:45.639
+as per one's requirements.
+
+00:10:45.640 --> 00:10:50.239
+This can also be a very useful tool.
+
+00:10:50.240 --> 00:10:59.159
+This enhanced form of outline-mode adds more power to Emacs.
+
+00:10:59.160 --> 00:11:03.359
+It provides commands for easily customizing
+
+00:11:03.360 --> 00:11:05.159
+and filtering information,
+
+00:11:05.160 --> 00:11:08.999
+while at the same time hiding unnecessary context.
+
+00:11:09.000 --> 00:11:11.919
+It also allows structural editing.
+
+00:11:11.920 --> 00:11:16.039
+This can be a very useful tool to enrich corpora
+
+00:11:16.040 --> 00:11:20.919
+where we are focusing on a limited number of phenomena.
+
+00:11:20.920 --> 00:11:24.519
+The two together allow us to create
+
+00:11:24.520 --> 00:11:27.199
+a rich representation
+
+00:11:27.200 --> 00:11:32.999
+that can simultaneously capture multiple possible sequences,
+
+00:11:33.000 --> 00:11:38.759
+capture details necessary to recreate the original source,
+
+00:11:38.760 --> 00:11:42.079
+allow the creation of hierarchical representations,

+00:11:42.080 --> 00:11:44.679
+and provide structural editing capabilities
+
+00:11:44.680 --> 00:11:47.439
+that can take advantage of the concept of inheritance
+
+00:11:47.440 --> 00:11:48.999
+within the tree structure.
+
+00:11:49.000 --> 00:11:54.279
+Together they allow local manipulations of structures,
+
+00:11:54.280 --> 00:11:56.199
+thereby minimizing data coupling.
+
+00:11:56.200 --> 00:11:59.119
+The concept of tags in Org mode
+
+00:11:59.120 --> 00:12:01.599
+complements the hierarchy part.
+
+00:12:01.600 --> 00:12:03.839
+Hierarchies can be very rigid,
+
+00:12:03.840 --> 00:12:06.039
+but with tags on hierarchies,

+00:12:06.040 --> 00:12:08.839
+we can have multifaceted representations.
+
+00:12:08.840 --> 00:12:12.759
+As a matter of fact, Org mode has the ability for the tags
+
+00:12:12.760 --> 00:12:15.039
+to have their own hierarchical structure
+
+00:12:15.040 --> 00:12:18.639
+which further enhances the representational power.
+
+00:12:18.640 --> 00:12:22.639
+All of this can be done as a sequence
+
+00:12:22.640 --> 00:12:25.679
+of mostly functional data transformations,
+
+00:12:25.680 --> 00:12:27.439
+because most of the capabilities
+
+00:12:27.440 --> 00:12:29.759
+can be configured and customized.
+
+00:12:29.760 --> 00:12:32.799
+It is not necessary to do everything at once.
+
+00:12:32.800 --> 00:12:36.199
+Instead, it allows us to incrementally increase
+
+00:12:36.200 --> 00:12:37.919
+the complexity of the representation.
+
+00:12:37.920 --> 00:12:39.799
+Finally, all of this can be done
+
+00:12:39.800 --> 00:12:42.359
+in plain-text representation
+
+00:12:42.360 --> 00:12:45.479
+which comes with its own advantages.
+
+NOTE Example
+
+00:12:45.480 --> 00:12:50.679
+Now let's take a simple example.
+
+00:12:50.680 --> 00:12:55.999
+This is a short video that I'll play.
+
+00:12:56.000 --> 00:12:59.679
+The sentence is "I saw the moon with a telescope,"
+
+00:12:59.680 --> 00:13:03.999
+and let's just make a copy of the sentence.
+
+00:13:04.000 --> 00:13:09.199
+What we can do now is to see:
+
+00:13:09.200 --> 00:13:11.879
+what does this sentence comprise?
+
+00:13:11.880 --> 00:13:13.679
+It has a noun phrase "I,"
+
+00:13:13.680 --> 00:13:17.479
+followed by a word "saw."
+
+00:13:17.480 --> 00:13:21.359
+Then "the moon" is another noun phrase,
+
+00:13:21.360 --> 00:13:24.839
+and "with a telescope" is a prepositional phrase.
+
+00:13:24.840 --> 00:13:30.759
+Now one thing that you might remember,
+
+00:13:30.760 --> 00:13:36.119
+from grammar school or syntax is that
+
+00:13:36.120 --> 00:13:41.279
+there is a syntactic structure.
+
+00:13:41.280 --> 00:13:44.359
+And in this particular case--
+
+00:13:44.360 --> 00:13:47.919
+because we know that the moon is not typically
+
+00:13:47.920 --> 00:13:51.679
+something that can hold the telescope,
+
+00:13:51.680 --> 00:13:56.239
+that the seeing must be done by me or "I,"
+
+00:13:56.240 --> 00:14:01.039
+and the telescope must be in my hand,
+
+00:14:01.040 --> 00:14:04.479
+or "I" am viewing the moon with a telescope.
+
+00:14:04.480 --> 00:14:13.519
+However, it is possible that in a different context
+
+00:14:13.520 --> 00:14:17.159
+the moon could be referring to an animated character
+
+00:14:17.160 --> 00:14:22.319
+in an animated series, and could actually hold the telescope.
+
+00:14:22.320 --> 00:14:23.479
+And this is one of the most--
+
+00:14:23.480 --> 00:14:24.839
+the oldest and one of the most--
+
+00:14:24.840 --> 00:14:26.319
+and in that case the situation might be
+
+00:14:26.320 --> 00:14:30.959
+that I'm actually seeing the moon holding a telescope...
+
+00:14:30.960 --> 00:14:36.079
+I mean, the moon is holding the telescope,
+
+00:14:36.080 --> 00:14:40.959
+and I'm just seeing the moon holding the telescope.
+
+00:14:40.960 --> 00:14:47.999
+This is a complex linguistic ambiguity, or linguistic

+00:14:48.000 --> 00:14:53.599
+phenomenon, that requires world knowledge,
+
+00:14:53.600 --> 00:14:55.719
+and it's called the PP attachment problem
+
+00:14:55.720 --> 00:14:59.239
+where the prepositional phrase attachment
+
+00:14:59.240 --> 00:15:04.599
+can be ambiguous, and various different contextual cues
+
+00:15:04.600 --> 00:15:06.879
+have to be used to resolve the ambiguity.
+
+00:15:06.880 --> 00:15:09.079
+So in this case, as you saw,
+
+00:15:09.080 --> 00:15:11.199
+both the readings are technically true,
+
+00:15:11.200 --> 00:15:13.959
+depending on different contexts.
+
+00:15:13.960 --> 00:15:16.599
+So one thing we could do is just
+
+00:15:16.600 --> 00:15:19.919
+to cut the tree and duplicate it,
+
+00:15:19.920 --> 00:15:21.599
+and then let's create another node
+
+00:15:21.600 --> 00:15:24.479
+and call it an "OR" node.
+
+00:15:24.480 --> 00:15:26.119
+And because we are saying,
+
+00:15:26.120 --> 00:15:28.359
+this is one of the two interpretations.
+
+00:15:28.360 --> 00:15:32.159
+Now let's call one interpretation "a",
+
+00:15:32.160 --> 00:15:36.159
+and that interpretation essentially
+
+00:15:36.160 --> 00:15:39.319
+is this child of that node "a"
+
+00:15:39.320 --> 00:15:41.799
+and that says that the moon
+
+00:15:41.800 --> 00:15:43.999
+is holding the telescope.
+
+00:15:44.000 --> 00:15:46.359
+Now we can create another representation "b"
+
+00:15:46.360 --> 00:15:53.919
+where we capture the other interpretation,
+
+00:15:53.920 --> 00:15:59.959
+where this, the act, the moon or--I am actually
+
+00:15:59.960 --> 00:16:00.519
+holding the telescope,
+
+00:16:00.520 --> 00:16:06.799
+and watching the moon using it.
+
+00:16:06.800 --> 00:16:09.199
+So now we have two separate interpretations
+
+00:16:09.200 --> 00:16:11.679
+in the same structure,
+
+00:16:11.680 --> 00:16:15.519
+and all we do--we're able to do is with this,
+
+00:16:15.520 --> 00:16:18.159
+with very quick keystrokes now...
+
+00:16:18.160 --> 00:16:22.439
+While we are at it, let's add another interesting thing,
+
+00:16:22.440 --> 00:16:25.159
+this node that represents "I":
+
+00:16:25.160 --> 00:16:28.919
+"He." It can be "She".
+
+00:16:28.920 --> 00:16:35.759
+It can be "the children," or it can be "The people".
+
+00:16:35.760 --> 00:16:45.039
+Basically, any entity that has the capability to "see"
+
+00:16:45.040 --> 00:16:53.359
+can be substituted in this particular node.
+
+00:16:53.360 --> 00:16:57.399
+Let's see what we have here now.
+
+00:16:57.400 --> 00:17:01.239
+We just are getting sort of a zoom view
+
+00:17:01.240 --> 00:17:04.599
+of the entire structure, what we created,
+
+00:17:04.600 --> 00:17:08.039
+and essentially you can see that
+
+00:17:08.040 --> 00:17:11.879
+by just, you know, using a few keystrokes,
+
+00:17:11.880 --> 00:17:17.839
+we were able to capture two different interpretations
+
+00:17:17.840 --> 00:17:20.879
+of a simple sentence,

+00:17:20.880 --> 00:17:23.759
+and we are also able to add
+
+00:17:23.760 --> 00:17:27.799
+these alternate pieces of information
+
+00:17:27.800 --> 00:17:30.559
+that could help machine learning algorithms
+
+00:17:30.560 --> 00:17:32.439
+generalize better.
+
+00:17:32.440 --> 00:17:36.239
+All right.
+
+NOTE Different readings
+
+00:17:36.240 --> 00:17:40.359
+Now, let's look at the next thing. So in a sense,
+
+00:17:40.360 --> 00:17:46.679
+we can use this power of functional data structures
+
+00:17:46.680 --> 00:17:50.239
+to represent various potentially conflicting
+
+00:17:50.240 --> 00:17:55.559
+and structural readings of that piece of text.
+
+00:17:55.560 --> 00:17:58.079
+In addition to that, we can also create more texts,
+
+00:17:58.080 --> 00:17:59.799
+each with different structure,
+
+00:17:59.800 --> 00:18:01.559
+and have them all in the same place.
+
+00:18:01.560 --> 00:18:04.239
+This allows us to address the interpretation
+
+00:18:04.240 --> 00:18:06.879
+of a static sentence that might be occurring in the world,
+
+00:18:06.880 --> 00:18:09.639
+while simultaneously inserting information
+
+00:18:09.640 --> 00:18:11.519
+that would add more value to it.
+
+00:18:11.520 --> 00:18:14.999
+This makes the enrichment process also very efficient.
+
+00:18:15.000 --> 00:18:19.519
+Additionally, we can envision
+
+00:18:19.520 --> 00:18:23.999
+a power user of the future, or present,
+
+00:18:24.000 --> 00:18:27.479
+who can not only annotate a span,
+
+00:18:27.480 --> 00:18:31.279
+but also edit the information in situ
+
+00:18:31.280 --> 00:18:34.639
+in a way that would help machine learning algorithms
+
+00:18:34.640 --> 00:18:36.879
+generalize better by making more efficient use
+
+00:18:36.880 --> 00:18:37.719
+of the annotations.
+
+00:18:37.720 --> 00:18:41.519
+So together, Emacs and Org mode can speed up
+
+00:18:41.520 --> 00:18:42.959
+the enrichment of the signals
+
+00:18:42.960 --> 00:18:44.519
+in a way that allows us
+
+00:18:44.520 --> 00:18:47.719
+to focus on certain aspects and ignore others.
+
+00:18:47.720 --> 00:18:50.839
+An extremely complex landscape of rich structures
+
+00:18:50.840 --> 00:18:53.039
+can be captured consistently,
+
+00:18:53.040 --> 00:18:55.639
+in a fashion that allows computers
+
+00:18:55.640 --> 00:18:56.759
+to understand language.
+
+00:18:56.760 --> 00:19:00.879
+We can then build tools to enhance the tasks
+
+00:19:00.880 --> 00:19:03.319
+that we do in our everyday life.
+
+00:19:03.320 --> 00:19:10.759
+YAMR is the acronym for the file type, or specification,
+
+00:19:10.760 --> 00:19:15.239
+that we are creating to capture this new
+
+00:19:15.240 --> 00:19:17.679
+rich representation.
+
+NOTE Spontaneous speech
+
+00:19:17.680 --> 00:19:21.959
+We'll now look at an example of spontaneous speech
+
+00:19:21.960 --> 00:19:24.799
+that occurs in spoken conversations.
+
+00:19:24.800 --> 00:19:28.599
+Conversations frequently contain errors in speech:
+
+00:19:28.600 --> 00:19:30.799
+interruptions, disfluencies,
+
+00:19:30.800 --> 00:19:33.959
+verbal sounds such as cough or laugh,
+
+00:19:33.960 --> 00:19:35.039
+and other noises.
+
+00:19:35.040 --> 00:19:38.199
+In this sense, spontaneous speech is similar
+
+00:19:38.200 --> 00:19:39.799
+to a functional data stream.
+
+00:19:39.800 --> 00:19:42.759
+We cannot take back words that come out of our mouth,
+
+00:19:42.760 --> 00:19:47.239
+but we tend to make mistakes, and we correct ourselves
+
+00:19:47.240 --> 00:19:49.039
+as soon as we realize that we have made--
+
+00:19:49.040 --> 00:19:50.679
+we have misspoken.
+
+00:19:50.680 --> 00:19:53.159
+This process manifests through a combination
+
+00:19:53.160 --> 00:19:56.279
+of a handful of mechanisms, including immediate correction
+
+00:19:56.280 --> 00:20:00.959
+after an error, and we do this unconsciously.
+
+00:20:00.960 --> 00:20:02.719
+Computers, on the other hand,
+
+00:20:02.720 --> 00:20:06.639
+must be taught to understand these cases.
+
+00:20:06.640 --> 00:20:12.799
+What we see here is an example document or outline,
+
+00:20:12.800 --> 00:20:18.119
+or part of a document that illustrates
+
+00:20:18.120 --> 00:20:22.919
+various different aspects of the representation.
+
+00:20:22.920 --> 00:20:25.919
+We don't have a lot of time to go through
+
+00:20:25.920 --> 00:20:28.239
+many of the details.
+
+00:20:28.240 --> 00:20:31.759
+I would highly encourage you to play a...
+
+00:20:31.760 --> 00:20:39.159
+I'm planning on making some videos, or asciinema recordings,
+
+00:20:39.160 --> 00:20:42.559
+that I'll be posting, and you can,
+
+00:20:42.560 --> 00:20:46.759
+if you're interested, you can go through those.
+
+00:20:46.760 --> 00:20:50.359
+The idea here is to try to do
+
+00:20:50.360 --> 00:20:54.599
+a slightly more complex use case.
+
+00:20:54.600 --> 00:20:57.639
+But again, given the time constraint
+
+00:20:57.640 --> 00:21:00.279
+and the amount of information
+
+00:21:00.280 --> 00:21:01.519
+that needs to fit in the screen,
+
+00:21:01.520 --> 00:21:05.559
+this may not be very informative,
+
+00:21:05.560 --> 00:21:08.399
+but at least it will give you some idea
+
+00:21:08.400 --> 00:21:10.439
+of what can be possible.
+
+00:21:10.440 --> 00:21:13.279
+And in this particular case, what you're seeing is that
+
+00:21:13.280 --> 00:21:18.319
+there is a sentence which is "What I'm I'm tr- telling now."
+
+00:21:18.320 --> 00:21:21.159
+Essentially, there is a repetition of the word "I'm",
+
+00:21:21.160 --> 00:21:23.279
+and then there is a partial word
+
+00:21:23.280 --> 00:21:25.159
+where somebody tried to say "telling",
+
+00:21:25.160 --> 00:21:29.599
+but started saying "tr-", and then corrected themselves
+
+00:21:29.600 --> 00:21:30.959
+and said, "telling now."
+
+00:21:30.960 --> 00:21:39.239
+So in this case, you see, we can capture words
+
+00:21:39.240 --> 00:21:44.919
+or a sequence of words, or a sequence of tokens.
+
+00:21:44.920 --> 00:21:52.279
+One thing to... An interesting thing to note is that in NLP,
+
+00:21:52.280 --> 00:21:55.319
+sometimes we have to break

+00:21:55.320 --> 00:22:01.199
+words that typically don't have spaces into two separate words,
+
+00:22:01.200 --> 00:22:04.119
+especially contractions like "I'm",
+
+00:22:04.120 --> 00:22:08.199
+so the syntactic parser needs two separate nodes.
+
+00:22:08.200 --> 00:22:11.199
+But anyway, so I'll... You can see that here.
+
+00:22:11.200 --> 00:22:15.759
+The other... This view. What this view shows is that
+
+00:22:15.760 --> 00:22:19.759
+with each of the nodes in the sentence
+
+00:22:19.760 --> 00:22:23.079
+or in the representation,
+
+00:22:23.080 --> 00:22:26.079
+you can have a lot of different properties
+
+00:22:26.080 --> 00:22:27.559
+that you can attach to them,
+
+00:22:27.560 --> 00:22:30.119
+and these properties are typically hidden,
+
+00:22:30.120 --> 00:22:32.719
+like you saw in the earlier slide.
+
+00:22:32.720 --> 00:22:35.599
+But you can make use of all these properties
+
+00:22:35.600 --> 00:22:39.439
+to do various kinds of searches and filtering.
+
+00:22:39.440 --> 00:22:43.519
+And on the right hand side here--
+
+00:22:43.520 --> 00:22:48.799
+this is actually not a legitimate syntax--
+
+00:22:48.800 --> 00:22:51.279
+but on the right are descriptions
+
+00:22:51.280 --> 00:22:53.479
+of what each of these represents.
+
+00:22:53.480 --> 00:22:57.319
+All the information is also available in the article.
+
+00:22:57.320 --> 00:23:04.279
+You can see there... It shows how much rich context
+
+00:23:04.280 --> 00:23:05.879
+you can capture.
+
+00:23:05.880 --> 00:23:08.799
+This is just a closer snapshot
+
+00:23:08.800 --> 00:23:10.159
+of the properties on the node,
+
+00:23:10.160 --> 00:23:13.119
+and you can see we can have things like,
+
+00:23:13.120 --> 00:23:14.799
+whether the word is a token or not,
+
+00:23:14.800 --> 00:23:17.359
+or that it's incomplete, whether some words
+
+00:23:17.360 --> 00:23:19.959
+might want to be filtered out for parsing,
+
+00:23:19.960 --> 00:23:23.039
+and we can say this: PARSE_IGNORE,
+
+00:23:23.040 --> 00:23:25.519
+or some words are restart markers...
+
+00:23:25.520 --> 00:23:29.239
+We can mark, add a RESTART_MARKER, or sometimes,
+
+00:23:29.240 --> 00:23:31.999
+some of these might have durations. Things like that.
+
+NOTE Editing properties in column view
+
+00:23:32.000 --> 00:23:38.799
+The other fascinating thing of this representation
+
+00:23:38.800 --> 00:23:42.599
+is that you can edit properties in the column view.
+
+00:23:42.600 --> 00:23:45.399
+And suddenly, you have this tabular data structure
+
+00:23:45.400 --> 00:23:48.879
+combined with the hierarchical data structure.
+
+00:23:48.880 --> 00:23:53.119
+And as you can--you may not be able to see it here,
+
+00:23:53.120 --> 00:23:56.879
+but what has also happened here is that
+
+00:23:56.880 --> 00:24:01.159
+some of the tags have been inherited
+
+00:24:01.160 --> 00:24:02.479
+from the earlier nodes.
+
+00:24:02.480 --> 00:24:07.919
+And so you get a much fuller picture of things.
+
+00:24:07.920 --> 00:24:13.919
+Essentially, you can filter out things
+
+00:24:13.920 --> 00:24:15.319
+that you want to process,
+
+00:24:15.320 --> 00:24:20.279
+process them, and then reintegrate it into the whole.
+
+NOTE Conclusion
+
+00:24:20.280 --> 00:24:25.479
+So, in conclusion, today we have proposed and demonstrated
+
+00:24:25.480 --> 00:24:27.559
+the use of an architecture (GRAIL),
+
+00:24:27.560 --> 00:24:31.319
+which allows the representation, manipulation,
+
+00:24:31.320 --> 00:24:34.759
+and aggregation of rich linguistic structures
+
+00:24:34.760 --> 00:24:36.519
+in a systematic fashion.
+
+00:24:36.520 --> 00:24:41.359
+We have shown how GRAIL advances the tools
+
+00:24:41.360 --> 00:24:44.599
+available for building machine learning models
+
+00:24:44.600 --> 00:24:46.879
+that simulate understanding.
+
+00:24:46.880 --> 00:24:51.679
+Thank you very much for your time and attention today.
+
+00:24:51.680 --> 00:24:54.639
+My contact information is on this slide.
+
+00:24:54.640 --> 00:25:02.599
+If you are interested in an additional example
+
+00:25:02.600 --> 00:25:05.439
+that demonstrates the representation
+
+00:25:05.440 --> 00:25:08.039
+of speech and written text together,
+
+00:25:08.040 --> 00:25:10.719
+please continue watching.
+
+00:25:10.720 --> 00:25:12.199
+Otherwise, you can stop here
+
+00:25:12.200 --> 00:25:15.279
+and enjoy the rest of the conference.
+
+NOTE Bonus material
+
+00:25:15.280 --> 00:25:39.079
+Welcome to the bonus material.
+
+00:25:39.080 --> 00:25:43.959
+I'm glad for those of you who have stuck around.
+
+00:25:43.960 --> 00:25:46.559
+We are now going to examine an instance
+
+00:25:46.560 --> 00:25:49.159
+of speech and text signals together
+
+00:25:49.160 --> 00:25:51.479
+that produce multiple layers.
+
+00:25:51.480 --> 00:25:54.839
+When we have--when we take a spoken conversation
+
+00:25:54.840 --> 00:25:58.719
+and use the best language processing models available,
+
+00:25:58.720 --> 00:26:00.679
+we suddenly hit a hard spot
+
+00:26:00.680 --> 00:26:03.239
+because the tools are typically not trained
+
+00:26:03.240 --> 00:26:05.359
+to filter out the unnecessary cruft
+
+00:26:05.360 --> 00:26:07.559
+in order to automatically interpret
+
+00:26:07.560 --> 00:26:09.559
+the part of what is being said
+
+00:26:09.560 --> 00:26:11.799
+that is actually relevant.
+
+00:26:11.800 --> 00:26:14.639
+Over time, language researchers
+
+00:26:14.640 --> 00:26:17.719
+have created many interdependent layers of annotations,
+
+00:26:17.720 --> 00:26:21.039
+yet the assumptions underlying them are seldom the same.
+
+00:26:21.040 --> 00:26:25.039
+Piecing together such related but disjointed annotations
+
+00:26:25.040 --> 00:26:28.039
+and their predictions poses a huge challenge.
+
+00:26:28.040 --> 00:26:30.719
+This is another place where we can leverage
+
+00:26:30.720 --> 00:26:33.119
+the data model underlying the Emacs editor,
+
+00:26:33.120 --> 00:26:35.359
+along with the structural editing capabilities
+
+00:26:35.360 --> 00:26:38.519
+of Org mode to improve current tools.
+
+00:26:38.520 --> 00:26:42.839
+Let's take this very simple looking utterance.
+
+00:26:42.840 --> 00:26:48.039
+"Um \{lipsmack\} and that's it. (\{laugh\})"
+
+00:26:48.040 --> 00:26:50.319
+Looks like the person-- so this is--
+
+00:26:50.320 --> 00:26:54.519
+what you are seeing here is a transcript of an audio signal
+
+00:26:54.520 --> 00:27:00.759
+that has a lip smack and a laugh as part of it,
+
+00:27:00.760 --> 00:27:04.199
+and there is also a "Um" like interjection.
+
+00:27:04.200 --> 00:27:08.199
+So this has a few interesting noises
+
+00:27:08.200 --> 00:27:13.999
+and specific things that would be illustrative
+
+00:27:14.000 --> 00:27:20.479
+of what we are going to, how we are going to represent it.
+
+NOTE Syntactic analysis
+
+00:27:20.480 --> 00:27:25.839
+Okay. So let's say you want to have
+
+00:27:25.840 --> 00:27:28.879
+a syntactic analysis of this sentence or utterance.
+
+00:27:28.880 --> 00:27:30.959
+One common technique people use
+
+00:27:30.960 --> 00:27:32.879
+is just to remove the cruft, and, you know,
+
+00:27:32.880 --> 00:27:35.079
+write some rules, clean up the utterance,
+
+00:27:35.080 --> 00:27:36.719
+make it look like it's proper English,
+
+00:27:36.720 --> 00:27:40.239
+and then, you know, tokenize it,
+
+00:27:40.240 --> 00:27:43.079
+and basically just use standard tools to process it.
+
+00:27:43.080 --> 00:27:47.279
+But in that process, they end up eliminating
+
+00:27:47.280 --> 00:27:51.119
+valid pieces of signal that have meaning to others
+
+00:27:51.120 --> 00:27:52.799
+studying different phenomena of language.
+
+00:27:52.800 --> 00:27:56.479
+Here you have the rich transcript,
+
+00:27:56.480 --> 00:28:00.119
+the input to the syntactic parser.
+
+00:28:00.120 --> 00:28:05.919
+As you can see, there is a little tokenization happening
+
+00:28:05.920 --> 00:28:07.199
+where you'll be inserting space
+
+00:28:07.200 --> 00:28:12.119
+between "that" and the contracted is ('s),
+
+00:28:12.120 --> 00:28:15.599
+and between the period and the "it,"
+
+00:28:15.600 --> 00:28:18.199
+and the output of the syntactic parser is shown below,
+
+00:28:18.200 --> 00:28:21.639
+which (surprise) is an S-expression.
+
+00:28:21.640 --> 00:28:24.919
+Like I said, the parse trees, when they were created,
+
+00:28:24.920 --> 00:28:29.799
+and still largely when they are used, are S-expressions,
+
+00:28:29.800 --> 00:28:32.999
+and most of the viewers here
+
+00:28:33.000 --> 00:28:35.119
+should not have much problem reading it.
+
+00:28:35.120 --> 00:28:37.279
+You can see this tree structure
+
+00:28:37.280 --> 00:28:39.279
+of this syntactic parser here.
+
+NOTE Forced alignment
+
+00:28:39.280 --> 00:28:40.919
+Now let's say you want to integrate
+
+00:28:40.920 --> 00:28:44.479
+phonetic information or phonetic layer
+
+00:28:44.480 --> 00:28:49.119
+that's in the audio signal, and do some analysis.
+
+00:28:49.120 --> 00:28:57.519
+Now, it would need you to do a few-- take a few steps.
+
+00:28:57.520 --> 00:29:01.679
+First, you would need to align the transcript
+
+00:29:01.680 --> 00:29:06.479
+with the audio. This process is called forced alignment,
+
+00:29:06.480 --> 00:29:10.399
+where you already know what the transcript is,
+
+00:29:10.400 --> 00:29:14.599
+and you have the audio, and you can get a good alignment
+
+00:29:14.600 --> 00:29:17.599
+using both pieces of information.
+
+00:29:17.600 --> 00:29:20.119
+And this is typically a technique that is used to
+
+00:29:20.120 --> 00:29:23.079
+create training data for training
+
+00:29:23.080 --> 00:29:25.839
+automatic speech recognizers.
+
+00:29:25.840 --> 00:29:29.639
+One interesting thing is that in order to do
+
+00:29:29.640 --> 00:29:32.879
+this forced alignment, you have to keep
+
+00:29:32.880 --> 00:29:35.799
+the non-speech events in transcript,
+
+00:29:35.800 --> 00:29:39.079
+because they consume some audio signal,
+
+00:29:39.080 --> 00:29:41.399
+and if you don't have that signal,
+
+00:29:41.400 --> 00:29:44.399
+the alignment process doesn't know exactly...
+
+00:29:44.400 --> 00:29:45.759
+you know, it doesn't do a good job,
+
+00:29:45.760 --> 00:29:50.039
+because it needs to align all parts of the signal
+
+00:29:50.040 --> 00:29:54.999
+with something, either pause or silence or noise or words.
+
+00:29:55.000 --> 00:29:59.719
+Interestingly, punctuations really don't factor in,
+
+00:29:59.720 --> 00:30:01.559
+because we don't speak in punctuations.
+
+00:30:01.560 --> 00:30:04.239
+So one of the things that you need to do
+
+00:30:04.240 --> 00:30:05.679
+is remove most of the punctuations,
+
+00:30:05.680 --> 00:30:08.039
+although you'll see there are some punctuations
+
+00:30:08.040 --> 00:30:12.599
+that can be kept, or that are to be kept.
+
+NOTE Alignment before tokenization
+
+00:30:12.600 --> 00:30:15.319
+And the other thing is that the alignment has to be done
+
+00:30:15.320 --> 00:30:20.159
+before tokenization, as it impacts pronunciation.
+
+00:30:20.160 --> 00:30:24.399
+To show an example: Here you see "that's".
+
+00:30:24.400 --> 00:30:26.919
+When it's one word,
+
+00:30:26.920 --> 00:30:31.959
+it has a slightly different pronunciation
+
+00:30:31.960 --> 00:30:35.679
+than when it is two words, which is "that is",
+
+00:30:35.680 --> 00:30:38.399
+like you can see "is." And so,
+
+00:30:38.400 --> 00:30:44.279
+if you split the tokens or split the words
+
+00:30:44.280 --> 00:30:48.119
+in order for syntactic parser to process it,
+
+00:30:48.120 --> 00:30:51.599
+you would end up getting the wrong phonetic analysis.
+
+00:30:51.600 --> 00:30:54.239
+And if you have--if you process it
+
+00:30:54.240 --> 00:30:55.319
+through the phonetic analysis,
+
+00:30:55.320 --> 00:30:59.159
+and you don't know how to integrate it
+
+00:30:59.160 --> 00:31:02.719
+with the tokenized syntax, you can, you know,
+
+00:31:02.720 --> 00:31:07.519
+that can be pretty tricky. And a lot of time,
+
+00:31:07.520 --> 00:31:10.759
+people write one-off pieces of code that handle these,
+
+00:31:10.760 --> 00:31:14.279
+but the idea here is to try to have a general architecture
+
+00:31:14.280 --> 00:31:17.239
+that seamlessly integrates all these pieces.
+
+00:31:17.240 --> 00:31:21.319
+Then you do the syntactic parsing of the remaining tokens.
+
+00:31:21.320 --> 00:31:24.799
+Then you align the data and the two annotations,
+
+00:31:24.800 --> 00:31:27.959
+and then integrate the two layers.
+
+00:31:27.960 --> 00:31:31.359
+Once that is done, then you can do all kinds of
+
+00:31:31.360 --> 00:31:33.919
+interesting analysis, and test various hypotheses
+
+00:31:33.920 --> 00:31:35.279
+and generate the statistics,
+
+00:31:35.280 --> 00:31:39.359
+but without that you only are dealing
+
+00:31:39.360 --> 00:31:42.879
+with one or the other part.
+
+NOTE Layers
+
+00:31:42.880 --> 00:31:48.319
+Let's just take a quick look at how each of the layers
+
+00:31:48.320 --> 00:31:51.159
+that are involved look like.
+
+00:31:51.160 --> 00:31:56.719
+So this is "Um \{lipsmack\}, and that's it. \{laugh\}"
+
+00:31:56.720 --> 00:32:00.159
+This is the transcript, and on the right hand side,
+
+00:32:00.160 --> 00:32:04.199
+you see the same thing as a transcript
+
+00:32:04.200 --> 00:32:06.239
+listed in a vertical in a column.
+
+00:32:06.240 --> 00:32:08.199
+You'll see why, in just a second.
+
+00:32:08.200 --> 00:32:09.879
+And there are some place--
+
+00:32:09.880 --> 00:32:11.279
+there are some rows that are empty,
+
+00:32:11.280 --> 00:32:15.079
+some rows that are wider than the others, and we'll see why.
+
+00:32:15.080 --> 00:32:19.319
+The next is the tokenized sentence
+
+00:32:19.320 --> 00:32:20.959
+where you have space added,
+
+00:32:20.960 --> 00:32:23.599
+you know, space between these two tokens:
+
+00:32:23.600 --> 00:32:26.599
+"that" and the apostrophe "s" ('s),
+
+00:32:26.600 --> 00:32:28.079
+and the "it" and the "period".
+
+00:32:28.080 --> 00:32:30.679
+And you see on the right hand side
+
+00:32:30.680 --> 00:32:33.559
+that the tokens have attributes.
+
+00:32:33.560 --> 00:32:36.439
+So there is a token index, and there are 1, 2,
+
+00:32:36.440 --> 00:32:38.839
+you know, 0, 1, 2, 3, 4, 5 tokens,
+
+00:32:38.840 --> 00:32:41.479
+and each token has a start and end character,
+
+00:32:41.480 --> 00:32:45.799
+and space (sp) also has a start and end character,
+
+00:32:45.800 --> 00:32:50.399
+and space is represented by a "sp".  And there are
+
+00:32:50.400 --> 00:32:54.319
+these other things that we removed,
+
+00:32:54.320 --> 00:32:56.239
+like the "\{LS\}" which is for "\{lipsmack\}"
+
+00:32:56.240 --> 00:32:59.399
+and "\{LG\}" which is "\{laugh\}" are showing grayed out,
+
+00:32:59.400 --> 00:33:02.439
+and you'll see why some of these things are grayed out
+
+00:33:02.440 --> 00:33:03.399
+in a little bit.
+
+00:33:03.400 --> 00:33:11.919
+This is what the forced alignment tool produces.
+
+00:33:11.920 --> 00:33:17.159
+Basically, it takes the transcript,
+
+00:33:17.160 --> 00:33:19.159
+and this is the transcript
+
+00:33:19.160 --> 00:33:24.119
+that has slightly different symbols,
+
+00:33:24.120 --> 00:33:26.239
+because different tools use different symbols
+
+00:33:26.240 --> 00:33:28.159
+and there are various configurational things.
+
+00:33:28.160 --> 00:33:33.679
+But this is what is used to get an alignment
+
+00:33:33.680 --> 00:33:36.039
+or time alignment with phones.
+
+00:33:36.040 --> 00:33:40.079
+So this column shows the phones, and so each word...
+
+00:33:40.080 --> 00:33:43.879
+So, for example, "and" has been aligned with these phones,
+
+00:33:43.880 --> 00:33:46.879
+and these on the start and end
+
+00:33:46.880 --> 00:33:52.959
+are essentially temporal or time stamps that it aligned--
+
+00:33:52.960 --> 00:33:54.279
+that has been aligned to it.
+
+00:33:54.280 --> 00:34:00.759
+Interestingly, sometimes we don't really have any pause
+
+00:34:00.760 --> 00:34:05.159
+or any time duration between some words
+
+00:34:05.160 --> 00:34:08.199
+and those are highlighted as gray here.
+
+00:34:08.200 --> 00:34:12.759
+See, there's this space... Actually
+
+00:34:12.760 --> 00:34:17.799
+it does not have any temporal content,
+
+00:34:17.800 --> 00:34:21.319
+whereas this other space has some duration.
+
+00:34:21.320 --> 00:34:24.839
+So the ones that have some duration are captured,
+
+00:34:24.840 --> 00:34:29.519
+while the others are the ones that in the earlier diagram
+
+00:34:29.520 --> 00:34:31.319
+we saw were left out.
+
+NOTE Variations
+
+00:34:31.320 --> 00:34:37.639
+And the aligner actually produces multiple files.
+
+00:34:37.640 --> 00:34:44.399
+One of the files has a different, slightly different
+
+00:34:44.400 --> 00:34:46.679
+variation on the same information,
+
+00:34:46.680 --> 00:34:49.999
+and in this case, you can see
+
+00:34:50.000 --> 00:34:52.399
+that the punctuation is missing,
+
+00:34:52.400 --> 00:34:57.599
+and the punctuation is, you know, deliberately missing,
+
+00:34:57.600 --> 00:35:02.279
+because there is no time associated with it,
+
+00:35:02.280 --> 00:35:06.439
+and you see that it's not the tokenized sentence--
+
+00:35:06.440 --> 00:35:17.119
+a tokenized word. This... Now it gives you a full table,
+
+00:35:17.120 --> 00:35:21.239
+and you can't really look into it very carefully.
+
+00:35:21.240 --> 00:35:25.879
+But we can focus on the part that seems legible,
+
+00:35:25.880 --> 00:35:28.559
+or, you know, properly written sentence,
+
+00:35:28.560 --> 00:35:32.879
+process it and reincorporate it back into the whole.
+
+00:35:32.880 --> 00:35:35.879
+So if somebody wants to look at, for example,
+
+00:35:35.880 --> 00:35:39.679
+how many pauses the person made while they were talking,
+
+00:35:39.680 --> 00:35:42.919
+they can actually measure the pause, the number,
+
+00:35:42.920 --> 00:35:46.279
+the duration, and make connections between that
+
+00:35:46.280 --> 00:35:49.639
+and the rich syntactic structure that is being produced.
+
+00:35:49.640 --> 00:35:57.279
+And in order to do that, you have to get these layers
+
+00:35:57.280 --> 00:35:59.039
+to align with each other,
+
+00:35:59.040 --> 00:36:04.359
+and this table is just a tabular representation
+
+00:36:04.360 --> 00:36:08.679
+of the information that we'll be storing in the YAMR file.
+
+00:36:08.680 --> 00:36:11.719
+Congratulations! You have reached
+
+00:36:11.720 --> 00:36:13.479
+the end of this demonstration.
+
+00:36:13.480 --> 00:36:17.000
+Thank you for your time and attention.