-rw-r--r-- | 2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt | 1945 |
1 files changed, 1945 insertions, 0 deletions
diff --git a/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt b/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt new file mode 100644 index 00000000..a642f94a --- /dev/null +++ b/2022/captions/emacsconf-2022-grail--graila-generalized-representation-and-aggregation-of-information-layers--sameer-pradhan--main.vtt @@ -0,0 +1,1945 @@ +WEBVTT captioned by sameer + +NOTE Introduction + +00:00:00.000 --> 00:00:05.839 +Thank you for joining me today. I'm Sameer Pradhan + +00:00:05.840 --> 00:00:07.799 +from the Linguistic Data Consortium + +00:00:07.800 --> 00:00:10.079 +at the University of Pennsylvania + +00:00:10.080 --> 00:00:14.519 +and founder of cemantix.org. + +00:00:14.520 --> 00:00:16.879 +Today we'll be addressing research + +00:00:16.880 --> 00:00:18.719 +in computational linguistics, + +00:00:18.720 --> 00:00:22.039 +also known as natural language processing, + +00:00:22.040 --> 00:00:24.719 +a subarea of artificial intelligence + +00:00:24.720 --> 00:00:27.759 +with a focus on modeling and predicting + +00:00:27.760 --> 00:00:31.919 +complex linguistic structures from various signals. + +00:00:31.920 --> 00:00:35.799 +The work we present is limited to text and speech signals, + +00:00:35.800 --> 00:00:38.639 +but it can be extended to other signals. + +00:00:38.640 --> 00:00:40.799 +We propose an architecture, + +00:00:40.800 --> 00:00:42.959 +and we call it GRAIL, which allows + +00:00:42.960 --> 00:00:44.639 +the representation and aggregation + +00:00:44.640 --> 00:00:50.199 +of such rich structures in a systematic fashion.
+ +00:00:50.200 --> 00:00:52.679 +I'll demonstrate a proof of concept + +00:00:52.680 --> 00:00:56.559 +for representing and manipulating data and annotations + +00:00:56.560 --> 00:00:58.519 +for the specific purpose of building + +00:00:58.520 --> 00:01:02.879 +machine learning models that simulate understanding. + +00:01:02.880 --> 00:01:05.679 +These technologies have the potential for impact + +00:01:05.680 --> 00:01:09.119 +in almost every conceivable field + +00:01:09.120 --> 00:01:13.399 +that generates and uses data. + +NOTE Processing language + +00:01:13.400 --> 00:01:15.039 +We process human language + +00:01:15.040 --> 00:01:16.719 +when our brains receive and assimilate + +00:01:16.720 --> 00:01:20.079 +various signals, which are then manipulated + +00:01:20.080 --> 00:01:23.879 +and interpreted within a syntactic structure. + +00:01:23.880 --> 00:01:27.319 +It's a complex process that I have simplified here + +00:01:27.320 --> 00:01:30.759 +for the purpose of comparison to machine learning. + +00:01:30.760 --> 00:01:33.959 +Recent machine learning models tend to require + +00:01:33.960 --> 00:01:37.039 +a large amount of raw, naturally occurring data + +00:01:37.040 --> 00:01:40.199 +and a varying amount of manually enriched data, + +00:01:40.200 --> 00:01:43.199 +commonly known as "annotations". + +00:01:43.200 --> 00:01:45.959 +Owing to the complex and numerous nature + +00:01:45.960 --> 00:01:49.959 +of linguistic phenomena, we have most often used + +00:01:49.960 --> 00:01:52.999 +a divide-and-conquer approach. + +00:01:53.000 --> 00:01:55.399 +The strength of this approach is that it allows us + +00:01:55.400 --> 00:01:58.159 +to focus on a single, or perhaps a few related + +00:01:58.160 --> 00:02:00.439 +linguistic phenomena.
+ +00:02:00.440 --> 00:02:03.879 +The weaknesses are, first, that the universe of these phenomena + +00:02:03.880 --> 00:02:07.239 +keeps expanding, as language itself + +00:02:07.240 --> 00:02:09.359 +evolves and changes over time, + +00:02:09.360 --> 00:02:13.119 +and second, this approach requires an additional task + +00:02:13.120 --> 00:02:14.839 +of aggregating the interpretations, + +00:02:14.840 --> 00:02:18.359 +creating more opportunities for computer error. + +00:02:18.360 --> 00:02:21.519 +Our challenge, then, is to find the sweet spot + +00:02:21.520 --> 00:02:25.239 +that allows us to encode complex information + +00:02:25.240 --> 00:02:27.719 +without the use of manual annotation, + +00:02:27.720 --> 00:02:34.559 +or without the additional task of aggregation by computers. + +NOTE Annotation + +00:02:34.560 --> 00:02:37.119 +So what do I mean by "annotation"? + +00:02:37.120 --> 00:02:39.759 +In this talk, the word annotation refers to + +00:02:39.760 --> 00:02:43.519 +the manual assignment of certain attributes + +00:02:43.520 --> 00:02:48.639 +to portions of a signal, which is necessary + +00:02:48.640 --> 00:02:51.639 +to perform the end task. + +00:02:51.640 --> 00:02:54.439 +For example, in order for the algorithm + +00:02:54.440 --> 00:02:57.439 +to accurately interpret a pronoun, + +00:02:57.440 --> 00:03:00.279 +it needs to know that pronoun-- + +00:03:00.280 --> 00:03:03.799 +what that pronoun refers back to. + +00:03:03.800 --> 00:03:06.719 +We may find this task trivial; however, + +00:03:06.720 --> 00:03:10.599 +current algorithms repeatedly fail in this task. + +00:03:10.600 --> 00:03:13.319 +So the complexities of understanding + +00:03:13.320 --> 00:03:16.639 +in computational linguistics require annotation.
+ +00:03:16.640 --> 00:03:20.799 +The word annotation itself is a useful example, + +00:03:20.800 --> 00:03:22.679 +because it also reminds us + +00:03:22.680 --> 00:03:25.119 +that words have multiple meanings + +00:03:25.120 --> 00:03:27.519 +as annotation itself does— + +00:03:27.520 --> 00:03:30.559 +just as I needed to define it in this context, + +00:03:30.560 --> 00:03:33.799 +so that my message won't be misinterpreted. + +00:03:33.800 --> 00:03:39.039 +So, too, must annotators do this for algorithms + +00:03:39.040 --> 00:03:43.239 +through manual intervention. + +NOTE Learning from data + +00:03:43.240 --> 00:03:44.759 +Learning from raw data + +00:03:44.760 --> 00:03:47.039 +(commonly known as unsupervised learning) + +00:03:47.040 --> 00:03:50.079 +poses limitations for machine learning. + +00:03:50.080 --> 00:03:53.039 +As I described, modeling complex phenomena + +00:03:53.040 --> 00:03:55.559 +needs manual annotations. + +00:03:55.560 --> 00:03:58.559 +The learning algorithm uses these annotations + +00:03:58.560 --> 00:04:01.319 +as examples to build statistical models. + +00:04:01.320 --> 00:04:04.879 +This is called supervised learning. + +00:04:04.880 --> 00:04:06.319 +Without going into too much detail, + +00:04:06.320 --> 00:04:10.039 +I'll simply note that the recent popularity + +00:04:10.040 --> 00:04:12.519 +of the concept of deep learning + +00:04:12.520 --> 00:04:14.679 +is that evolutionary step + +00:04:14.680 --> 00:04:17.319 +where we have learned to train models + +00:04:17.320 --> 00:04:20.799 +using trillions of parameters in ways that they can + +00:04:20.800 --> 00:04:25.079 +learn richer hierarchical structures + +00:04:25.080 --> 00:04:29.399 +from very large amounts of unannotated data.
+ +00:04:29.400 --> 00:04:32.319 +These models can then be fine-tuned, + +00:04:32.320 --> 00:04:35.599 +using varying amounts of annotated examples + +00:04:35.600 --> 00:04:37.639 +depending on the complexity of the task + +00:04:37.640 --> 00:04:39.679 +to generate better predictions. + +NOTE Manual annotation + +00:04:39.680 --> 00:04:44.919 +As you might imagine, manually annotating + +00:04:44.920 --> 00:04:47.359 +complex linguistic phenomena + +00:04:47.360 --> 00:04:51.719 +can be a very specific, labor-intensive task. + +00:04:51.720 --> 00:04:54.279 +For example, imagine if we were + +00:04:54.280 --> 00:04:56.399 +to go back through this presentation + +00:04:56.400 --> 00:04:58.399 +and connect all the pronouns + +00:04:58.400 --> 00:04:59.919 +with the nouns to which they refer. + +00:04:59.920 --> 00:05:03.239 +Even for a short 18-minute presentation, + +00:05:03.240 --> 00:05:05.239 +this would require hundreds of annotations. + +00:05:05.240 --> 00:05:08.519 +The models we build are only as good + +00:05:08.520 --> 00:05:11.119 +as the quality of the annotations we make. + +00:05:11.120 --> 00:05:12.679 +We need guidelines + +00:05:12.680 --> 00:05:15.759 +that ensure that the annotations are done + +00:05:15.760 --> 00:05:19.719 +by at least two humans who have substantial agreement + +00:05:19.720 --> 00:05:22.119 +with each other in their interpretations. + +00:05:22.120 --> 00:05:25.599 +We know that if we try to train a model using annotations + +00:05:25.600 --> 00:05:28.519 +that are very subjective, or have more noise, + +00:05:28.520 --> 00:05:30.919 +we will receive poor predictions. + +00:05:30.920 --> 00:05:33.679 +Additionally, there is the concern of introducing + +00:05:33.680 --> 00:05:37.079 +various unexpected biases into one's models. + +00:05:37.080 --> 00:05:44.399 +So annotation is really both an art and a science. + +NOTE How can we develop a unified representation?
+ +00:05:44.400 --> 00:05:47.439 +In the remaining time, + +00:05:47.440 --> 00:05:49.999 +we will turn to two fundamental questions. + +00:05:50.000 --> 00:05:54.239 +First, how can we develop a unified representation + +00:05:54.240 --> 00:05:55.599 +of data and annotations + +00:05:55.600 --> 00:05:59.759 +that encompasses arbitrary levels of linguistic information? + +00:05:59.760 --> 00:06:03.839 +There is a long history of attempting to answer + +00:06:03.840 --> 00:06:04.839 +this first question. + +00:06:04.840 --> 00:06:08.839 +This history is documented in our recent article, + +00:06:08.840 --> 00:06:11.519 +and you can refer to that article. + +00:06:11.520 --> 00:06:16.719 +It will be on the website. + +00:06:16.720 --> 00:06:18.999 +It is as if we, as a community, + +00:06:19.000 --> 00:06:22.519 +have been searching for our own Holy Grail. + +NOTE What role might Emacs and Org mode play? + +00:06:22.520 --> 00:06:26.519 +The second question we will pose is: + +00:06:26.520 --> 00:06:30.159 +what role might Emacs, along with Org mode, + +00:06:30.160 --> 00:06:31.919 +play in this process? + +00:06:31.920 --> 00:06:35.359 +Well, the solution itself may not be tied to Emacs. + +00:06:35.360 --> 00:06:38.359 +But Emacs has built-in capabilities + +00:06:38.360 --> 00:06:42.599 +that could be useful for evaluating potential solutions. + +00:06:42.600 --> 00:06:45.759 +It's also one of the most extensively documented + +00:06:45.760 --> 00:06:48.519 +pieces of software and the most customizable + +00:06:48.520 --> 00:06:51.599 +piece of software that I have ever come across, + +00:06:51.600 --> 00:06:55.279 +and many would agree with that. + +NOTE The complex structure of language + +00:06:55.280 --> 00:07:00.639 +In order to approach this second question, + +00:07:00.640 --> 00:07:03.919 +we turn to the complex structure of language itself.
+ +00:07:03.920 --> 00:07:07.679 +At first glance, language appears to us + +00:07:07.680 --> 00:07:09.879 +as a series of words. + +00:07:09.880 --> 00:07:13.439 +Words form sentences, sentences form paragraphs, + +00:07:13.440 --> 00:07:16.239 +and paragraphs form completed text. + +00:07:16.240 --> 00:07:19.039 +If this was a sufficient description + +00:07:19.040 --> 00:07:21.159 +of the complexity of language, + +00:07:21.160 --> 00:07:24.199 +all of us would be able to speak and read + +00:07:24.200 --> 00:07:26.559 +at least ten different languages. + +00:07:26.560 --> 00:07:29.279 +We know it is much more complex than this. + +00:07:29.280 --> 00:07:33.199 +There is a rich, underlying recursive tree structure-- + +00:07:33.200 --> 00:07:36.439 +in fact, many possible tree structures + +00:07:36.440 --> 00:07:39.439 +which makes a particular sequence meaningful + +00:07:39.440 --> 00:07:42.079 +and many others meaningless. + +00:07:42.080 --> 00:07:45.239 +One of the better understood tree structures + +00:07:45.240 --> 00:07:47.119 +is the syntactic structure. + +00:07:47.120 --> 00:07:49.439 +While natural language + +00:07:49.440 --> 00:07:51.679 +has rich ambiguities and complexities, + +00:07:51.680 --> 00:07:55.119 +programming languages are designed to be parsed + +00:07:55.120 --> 00:07:56.999 +and interpreted deterministically. + +00:07:57.000 --> 00:08:02.159 +Emacs has been used for programming very effectively. + +00:08:02.160 --> 00:08:05.359 +So there is a potential for using Emacs + +00:08:05.360 --> 00:08:06.559 +as a tool for annotation. + +00:08:06.560 --> 00:08:10.799 +This would significantly improve our current set of tools. 
+ NOTE Annotation tools + +00:08:10.800 --> 00:08:16.559 +It is important to note that most of the annotation tools + +00:08:16.560 --> 00:08:19.639 +that have been developed over the past few decades + +00:08:19.640 --> 00:08:22.879 +have relied on graphical interfaces, + +00:08:22.880 --> 00:08:26.919 +even those used for enriching textual information. + +00:08:26.920 --> 00:08:30.399 +Most of the tools in current use + +00:08:30.400 --> 00:08:36.159 +are designed for an end user to add very specific, + +00:08:36.160 --> 00:08:38.639 +very restricted information. + +00:08:38.640 --> 00:08:42.799 +We have not really made use of the potential + +00:08:42.800 --> 00:08:45.639 +that an editor or a rich editing environment like Emacs + +00:08:45.640 --> 00:08:47.239 +can add to the mix. + +00:08:47.240 --> 00:08:52.479 +Emacs has long enabled the editing and manipulation of + +00:08:52.480 --> 00:08:56.359 +complex embedded tree structures abundant in source code. + +00:08:56.360 --> 00:08:58.599 +So it's not difficult to imagine that it would have + +00:08:58.600 --> 00:09:00.359 +many capabilities that we need + +00:09:00.360 --> 00:09:02.599 +to represent actual language. + +00:09:02.600 --> 00:09:04.759 +In fact, it already does that with features + +00:09:04.760 --> 00:09:06.399 +that allow us to quickly navigate + +00:09:06.400 --> 00:09:07.919 +through sentences and paragraphs, + +00:09:07.920 --> 00:09:09.799 +and we don't need more than a few keystrokes. + +00:09:09.800 --> 00:09:13.599 +Or to add various text properties to text spans, + +00:09:13.600 --> 00:09:17.039 +to create overlays, to name but a few. + +00:09:17.040 --> 00:09:22.719 +Emacs has figured out how to handle Unicode, + +00:09:22.720 --> 00:09:26.799 +so you don't even have to worry about the complexity + +00:09:26.800 --> 00:09:29.439 +of managing multiple languages. + +00:09:29.440 --> 00:09:34.039 +It's built into Emacs.
In fact, this is not the first time + +00:09:34.040 --> 00:09:37.399 +Emacs has been used for linguistic analysis. + +00:09:37.400 --> 00:09:41.159 +One of the breakthrough moments in + +00:09:41.160 --> 00:09:44.439 +natural language processing was the creation + +00:09:44.440 --> 00:09:48.639 +of manually created syntactic trees + +00:09:48.640 --> 00:09:50.439 +for a 1-million-word collection + +00:09:50.440 --> 00:09:52.399 +of Wall Street Journal articles. + +00:09:52.400 --> 00:09:54.879 +This was back around 1992, + +00:09:54.880 --> 00:09:59.279 +before Java or graphical interfaces were common. + +00:09:59.280 --> 00:10:03.279 +The tool that was used to create that corpus was Emacs. + +00:10:03.280 --> 00:10:08.959 +It was created at UPenn, and is famously known as + +00:10:08.960 --> 00:10:12.719 +the Penn Treebank. '92 was about when + +00:10:12.720 --> 00:10:16.439 +the Linguistic Data Consortium was also established, + +00:10:16.440 --> 00:10:18.039 +and it's been about 30 years + +00:10:18.040 --> 00:10:20.719 +that it has been creating various + +00:10:20.720 --> 00:10:22.359 +language-related resources. + +NOTE Org mode + +00:10:22.360 --> 00:10:28.519 +Org mode--in particular, the outlining mode, + +00:10:28.520 --> 00:10:32.399 +or rather the enhanced form of outlining mode-- + +00:10:32.400 --> 00:10:35.599 +allows us to create rich outlines, + +00:10:35.600 --> 00:10:37.799 +attaching properties to nodes, + +00:10:37.800 --> 00:10:41.119 +and provides commands for easily customizing the + +00:10:41.120 --> 00:10:43.879 +sorting of various pieces of information + +00:10:43.880 --> 00:10:45.639 +as per one's requirement. + +00:10:45.640 --> 00:10:50.239 +This can also be a very useful tool. + +00:10:50.240 --> 00:10:59.159 +This enhanced form of outline-mode adds more power to Emacs.
+ +00:10:59.160 --> 00:11:03.359 +It provides commands for easily customizing + +00:11:03.360 --> 00:11:05.159 +and filtering information, + +00:11:05.160 --> 00:11:08.999 +while at the same time hiding unnecessary context. + +00:11:09.000 --> 00:11:11.919 +It also allows structural editing. + +00:11:11.920 --> 00:11:16.039 +This can be a very useful tool to enrich corpora + +00:11:16.040 --> 00:11:20.919 +where we are focusing on a limited set of phenomena. + +00:11:20.920 --> 00:11:24.519 +The two together allow us to create + +00:11:24.520 --> 00:11:27.199 +a rich representation + +00:11:27.200 --> 00:11:32.999 +that can simultaneously capture multiple possible sequences, + +00:11:33.000 --> 00:11:38.759 +capture details necessary to recreate the original source, + +00:11:38.760 --> 00:11:42.079 +allow the creation of hierarchical representation, + +00:11:42.080 --> 00:11:44.679 +provide structural editing capabilities + +00:11:44.680 --> 00:11:47.439 +that can take advantage of the concept of inheritance + +00:11:47.440 --> 00:11:48.999 +within the tree structure. + +00:11:49.000 --> 00:11:54.279 +Together they allow local manipulations of structures, + +00:11:54.280 --> 00:11:56.199 +thereby minimizing data coupling. + +00:11:56.200 --> 00:11:59.119 +The concept of tags in Org mode + +00:11:59.120 --> 00:12:01.599 +complements the hierarchy part. + +00:12:01.600 --> 00:12:03.839 +Hierarchies can be very rigid, + +00:12:03.840 --> 00:12:06.039 +but by adding tags to hierarchies, + +00:12:06.040 --> 00:12:08.839 +we can have multifaceted representations. + +00:12:08.840 --> 00:12:12.759 +As a matter of fact, Org mode has the ability for the tags + +00:12:12.760 --> 00:12:15.039 +to have their own hierarchical structure, + +00:12:15.040 --> 00:12:18.639 +which further enhances the representational power.
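The tag hierarchy just described can be sketched in Org itself. This is a hypothetical illustration using Org's group tags, with made-up tag names rather than the actual GRAIL tag set:

```org
# Hypothetical group tags: a parent tag groups its members,
# so a search for "syntax" also matches "np", "vp", or "pp".
#+TAGS: [ syntax : np vp pp ]
#+TAGS: [ semantics : coref sense ]

* I saw the moon with a telescope
** the moon                          :np:
** with a telescope                  :pp:coref:
```

A tags match such as `C-c / m` with the query `syntax` would then produce a sparse tree of every node annotated with any syntactic phenomenon, which is one way the tag hierarchy complements the outline hierarchy.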
+ +00:12:18.640 --> 00:12:22.639 +All of this can be done as a sequence + +00:12:22.640 --> 00:12:25.679 +of mostly functional data transformations, + +00:12:25.680 --> 00:12:27.439 +because most of the capabilities + +00:12:27.440 --> 00:12:29.759 +can be configured and customized. + +00:12:29.760 --> 00:12:32.799 +It is not necessary to do everything at once. + +00:12:32.800 --> 00:12:36.199 +Instead, it allows us to incrementally increase + +00:12:36.200 --> 00:12:37.919 +the complexity of the representation. + +00:12:37.920 --> 00:12:39.799 +Finally, all of this can be done + +00:12:39.800 --> 00:12:42.359 +in a plain-text representation, + +00:12:42.360 --> 00:12:45.479 +which comes with its own advantages. + +NOTE Example + +00:12:45.480 --> 00:12:50.679 +Now let's take a simple example. + +00:12:50.680 --> 00:12:55.999 +This is a short video that I'll play. + +00:12:56.000 --> 00:12:59.679 +The sentence is "I saw the moon with a telescope," + +00:12:59.680 --> 00:13:03.999 +and let's just make a copy of the sentence. + +00:13:04.000 --> 00:13:09.199 +What we can do now is to see: + +00:13:09.200 --> 00:13:11.879 +what does this sentence comprise? + +00:13:11.880 --> 00:13:13.679 +It has a noun phrase "I," + +00:13:13.680 --> 00:13:17.479 +followed by a word "saw." + +00:13:17.480 --> 00:13:21.359 +Then "the moon" is another noun phrase, + +00:13:21.360 --> 00:13:24.839 +and "with a telescope" is a prepositional phrase. + +00:13:24.840 --> 00:13:30.759 +Now one thing that you might remember + +00:13:30.760 --> 00:13:36.119 +from grammar school or syntax is that + +00:13:36.120 --> 00:13:41.279 +there is a syntactic structure.
+ +00:13:41.280 --> 00:13:44.359 +And in this particular case-- + +00:13:44.360 --> 00:13:47.919 +because we know that the moon is not typically + +00:13:47.920 --> 00:13:51.679 +something that can hold the telescope, + +00:13:51.680 --> 00:13:56.239 +that the seeing must be done by me or "I," + +00:13:56.240 --> 00:14:01.039 +and the telescope must be in my hand, + +00:14:01.040 --> 00:14:04.479 +or "I" am viewing the moon with a telescope. + +00:14:04.480 --> 00:14:13.519 +However, it is possible that in a different context + +00:14:13.520 --> 00:14:17.159 +the moon could be referring to an animated character + +00:14:17.160 --> 00:14:22.319 +in an animated series, and could actually hold the telescope. + +00:14:22.320 --> 00:14:23.479 +And this is one of the most-- + +00:14:23.480 --> 00:14:24.839 +the oldest and one of the most-- + +00:14:24.840 --> 00:14:26.319 +and in that case the situation might be + +00:14:26.320 --> 00:14:30.959 +that I'm actually seeing the moon holding a telescope... + +00:14:30.960 --> 00:14:36.079 +I mean, the moon is holding the telescope, + +00:14:36.080 --> 00:14:40.959 +and I'm just seeing the moon holding the telescope. + +00:14:40.960 --> 00:14:47.999 +This is a complex linguistic ambiguity, a linguistic + +00:14:48.000 --> 00:14:53.599 +phenomenon that requires world knowledge, + +00:14:53.600 --> 00:14:55.719 +and it's called the PP attachment problem, + +00:14:55.720 --> 00:14:59.239 +where the prepositional phrase attachment + +00:14:59.240 --> 00:15:04.599 +can be ambiguous, and various different contextual cues + +00:15:04.600 --> 00:15:06.879 +have to be used to resolve the ambiguity. + +00:15:06.880 --> 00:15:09.079 +So in this case, as you saw, + +00:15:09.080 --> 00:15:11.199 +both the readings are technically true, + +00:15:11.200 --> 00:15:13.959 +depending on different contexts.
+ +00:15:13.960 --> 00:15:16.599 +So one thing we could do is just + +00:15:16.600 --> 00:15:19.919 +to cut the tree and duplicate it, + +00:15:19.920 --> 00:15:21.599 +and then let's create another node + +00:15:21.600 --> 00:15:24.479 +and call it an "OR" node, + +00:15:24.480 --> 00:15:26.119 +because we are saying + +00:15:26.120 --> 00:15:28.359 +this is one of the two interpretations. + +00:15:28.360 --> 00:15:32.159 +Now let's call one interpretation "a", + +00:15:32.160 --> 00:15:36.159 +and that interpretation essentially + +00:15:36.160 --> 00:15:39.319 +is the child of that node "a", + +00:15:39.320 --> 00:15:41.799 +and that says that the moon + +00:15:41.800 --> 00:15:43.999 +is holding the telescope. + +00:15:44.000 --> 00:15:46.359 +Now we can create another representation "b" + +00:15:46.360 --> 00:15:53.919 +where we capture the other interpretation, + +00:15:53.920 --> 00:15:59.959 +where I am actually + +00:15:59.960 --> 00:16:00.519 +holding the telescope, + +00:16:00.520 --> 00:16:06.799 +and watching the moon using it. + +00:16:06.800 --> 00:16:09.199 +So now we have two separate interpretations + +00:16:09.200 --> 00:16:11.679 +in the same structure, + +00:16:11.680 --> 00:16:15.519 +and we were able to do all this + +00:16:15.520 --> 00:16:18.159 +with very quick keystrokes now... + +00:16:18.160 --> 00:16:22.439 +While we are at it, let's add another interesting thing, + +00:16:22.440 --> 00:16:25.159 +this node that represents "I": + +00:16:25.160 --> 00:16:28.919 +it can be "He". It can be "She". + +00:16:28.920 --> 00:16:35.759 +It can be "the children," or it can be "the people". + +00:16:35.760 --> 00:16:45.039 +Basically, any entity that has the capability to "see" + +00:16:45.040 --> 00:16:53.359 +can be substituted in this particular node. + +00:16:53.360 --> 00:16:57.399 +Let's see what we have here now.
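The manipulation performed in the video can be sketched as a plain Org outline. This is a hedged reconstruction of the demo buffer, with hypothetical node labels, not the exact text shown on screen:

```org
* OR
** a: the moon is holding the telescope
*** S
**** NP: I
**** VP
***** V: saw
***** NP
****** NP: the moon
****** PP: with a telescope
** b: I am viewing the moon through a telescope
*** S
**** NP: I | He | She | the children | the people
**** VP
***** V: saw
***** NP: the moon
***** PP: with a telescope
```

Duplicating a subtree, demoting it under the new "OR" node, and moving the PP to a different attachment point are each single structural-editing commands in Org (for example, `C-c C-x M-w` copies a subtree and `M-right` demotes one), which is what makes capturing both readings so quick.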
+ +00:16:57.400 --> 00:17:01.239 +We're just getting sort of a zoomed view + +00:17:01.240 --> 00:17:04.599 +of the entire structure that we created, + +00:17:04.600 --> 00:17:08.039 +and essentially you can see that + +00:17:08.040 --> 00:17:11.879 +by just, you know, using a few keystrokes, + +00:17:11.880 --> 00:17:17.839 +we were able to capture two different interpretations + +00:17:17.840 --> 00:17:20.879 +of a simple sentence, + +00:17:20.880 --> 00:17:23.759 +and we were also able to add + +00:17:23.760 --> 00:17:27.799 +these alternate pieces of information + +00:17:27.800 --> 00:17:30.559 +that could help machine learning algorithms + +00:17:30.560 --> 00:17:32.439 +generalize better. + +00:17:32.440 --> 00:17:36.239 +All right. + +NOTE Different readings + +00:17:36.240 --> 00:17:40.359 +Now, let's look at the next thing. So in a sense, + +00:17:40.360 --> 00:17:46.679 +we can use this power of functional data structures + +00:17:46.680 --> 00:17:50.239 +to represent various potentially conflicting + +00:17:50.240 --> 00:17:55.559 +structural readings of that piece of text. + +00:17:55.560 --> 00:17:58.079 +In addition to that, we can also create more texts, + +00:17:58.080 --> 00:17:59.799 +each with a different structure, + +00:17:59.800 --> 00:18:01.559 +and have them all in the same place. + +00:18:01.560 --> 00:18:04.239 +This allows us to address the interpretation + +00:18:04.240 --> 00:18:06.879 +of a static sentence that might be occurring in the world, + +00:18:06.880 --> 00:18:09.639 +while simultaneously inserting information + +00:18:09.640 --> 00:18:11.519 +that would add more value to it. + +00:18:11.520 --> 00:18:14.999 +This makes the enrichment process also very efficient.
+ +00:18:15.000 --> 00:18:19.519 +Additionally, we can envision + +00:18:19.520 --> 00:18:23.999 +a power user of the future, or present, + +00:18:24.000 --> 00:18:27.479 +who can not only annotate a span, + +00:18:27.480 --> 00:18:31.279 +but also edit the information in situ + +00:18:31.280 --> 00:18:34.639 +in a way that would help machine learning algorithms + +00:18:34.640 --> 00:18:36.879 +generalize better by making more efficient use + +00:18:36.880 --> 00:18:37.719 +of the annotations. + +00:18:37.720 --> 00:18:41.519 +So together, Emacs and Org mode can speed up + +00:18:41.520 --> 00:18:42.959 +the enrichment of the signals + +00:18:42.960 --> 00:18:44.519 +in a way that allows us + +00:18:44.520 --> 00:18:47.719 +to focus on certain aspects and ignore others. + +00:18:47.720 --> 00:18:50.839 +An extremely complex landscape of rich structures + +00:18:50.840 --> 00:18:53.039 +can be captured consistently, + +00:18:53.040 --> 00:18:55.639 +in a fashion that allows computers + +00:18:55.640 --> 00:18:56.759 +to understand language. + +00:18:56.760 --> 00:19:00.879 +We can then build tools to enhance the tasks + +00:19:00.880 --> 00:19:03.319 +that we do in our everyday life. + +00:19:03.320 --> 00:19:10.759 +YAMR is an acronym--the file type, or specification, + +00:19:10.760 --> 00:19:15.239 +that we are creating to capture this new + +00:19:15.240 --> 00:19:17.679 +rich representation. + +NOTE Spontaneous speech + +00:19:17.680 --> 00:19:21.959 +We'll now look at an example of spontaneous speech + +00:19:21.960 --> 00:19:24.799 +that occurs in spoken conversations. + +00:19:24.800 --> 00:19:28.599 +Conversations frequently contain errors in speech: + +00:19:28.600 --> 00:19:30.799 +interruptions, disfluencies, + +00:19:30.800 --> 00:19:33.959 +verbal sounds such as coughs or laughs, + +00:19:33.960 --> 00:19:35.039 +and other noises.
+ +00:19:35.040 --> 00:19:38.199 +In this sense, spontaneous speech is similar + +00:19:38.200 --> 00:19:39.799 +to a functional data stream. + +00:19:39.800 --> 00:19:42.759 +We cannot take back words that come out of our mouths, + +00:19:42.760 --> 00:19:47.239 +but we tend to make mistakes, and we correct ourselves + +00:19:47.240 --> 00:19:49.039 +as soon as we realize that we have made-- + +00:19:49.040 --> 00:19:50.679 +we have misspoken. + +00:19:50.680 --> 00:19:53.159 +This process manifests through a combination + +00:19:53.160 --> 00:19:56.279 +of a handful of mechanisms, including immediate correction + +00:19:56.280 --> 00:20:00.959 +after an error, and we do this unconsciously. + +00:20:00.960 --> 00:20:02.719 +Computers, on the other hand, + +00:20:02.720 --> 00:20:06.639 +must be taught to understand these cases. + +00:20:06.640 --> 00:20:12.799 +What we see here is an example document or outline, + +00:20:12.800 --> 00:20:18.119 +or part of a document that illustrates + +00:20:18.120 --> 00:20:22.919 +various different aspects of the representation. + +00:20:22.920 --> 00:20:25.919 +We don't have a lot of time to go through + +00:20:25.920 --> 00:20:28.239 +many of the details. + +00:20:28.240 --> 00:20:31.759 +I would highly encourage you to play a... + +00:20:31.760 --> 00:20:39.159 +I'm planning on making some videos, or asciinema recordings, + +00:20:39.160 --> 00:20:42.559 +that I'll be posting, and, + +00:20:42.560 --> 00:20:46.759 +if you're interested, you can go through those. + +00:20:46.760 --> 00:20:50.359 +The idea here is to try to do + +00:20:50.360 --> 00:20:54.599 +a slightly more complex use case.
+ +00:20:54.600 --> 00:20:57.639 +But again, given the time constraint + +00:20:57.640 --> 00:21:00.279 +and the amount of information + +00:21:00.280 --> 00:21:01.519 +that needs to fit on the screen, + +00:21:01.520 --> 00:21:05.559 +this may not be very informative, + +00:21:05.560 --> 00:21:08.399 +but at least it will give you some idea + +00:21:08.400 --> 00:21:10.439 +of what can be possible. + +00:21:10.440 --> 00:21:13.279 +And in this particular case, what you're seeing is that + +00:21:13.280 --> 00:21:18.319 +there is a sentence which is "What I'm I'm tr- telling now." + +00:21:18.320 --> 00:21:21.159 +Essentially, there is a repetition of the word "I'm", + +00:21:21.160 --> 00:21:23.279 +and then there is a partial word + +00:21:23.280 --> 00:21:25.159 +that somebody tried to say "telling", + +00:21:25.160 --> 00:21:29.599 +but started saying "tr-", and then corrected themselves + +00:21:29.600 --> 00:21:30.959 +and said, "telling now." + +00:21:30.960 --> 00:21:39.239 +So in this case, you see, we can capture words + +00:21:39.240 --> 00:21:44.919 +or a sequence of words, or a sequence of tokens. + +00:21:44.920 --> 00:21:52.279 +One thing to... An interesting thing to note is that in NLP, + +00:21:52.280 --> 00:21:55.319 +sometimes we have to break + +00:21:55.320 --> 00:22:01.199 +words that don't have spaces into two separate words, + +00:22:01.200 --> 00:22:04.119 +especially contractions like "I'm", + +00:22:04.120 --> 00:22:08.199 +so the syntactic parser needs two separate nodes. + +00:22:08.200 --> 00:22:11.199 +But anyway, so I'll... You can see that here. + +00:22:11.200 --> 00:22:15.759 +The other... This view.
What this view shows is that + +00:22:15.760 --> 00:22:19.759 +for each of the nodes in the sentence + +00:22:19.760 --> 00:22:23.079 +or in the representation, + +00:22:23.080 --> 00:22:26.079 +you can have a lot of different properties + +00:22:26.080 --> 00:22:27.559 +that you can attach to them, + +00:22:27.560 --> 00:22:30.119 +and these properties are typically hidden, + +00:22:30.120 --> 00:22:32.719 +like you saw in the earlier slide. + +00:22:32.720 --> 00:22:35.599 +But you can make use of all these properties + +00:22:35.600 --> 00:22:39.439 +to do various kinds of searches and filtering. + +00:22:39.440 --> 00:22:43.519 +And on the right-hand side here-- + +00:22:43.520 --> 00:22:48.799 +this is actually not legitimate syntax-- + +00:22:48.800 --> 00:22:51.279 +but on the right are descriptions + +00:22:51.280 --> 00:22:53.479 +of what each of these represents. + +00:22:53.480 --> 00:22:57.319 +All the information is also available in the article. + +00:22:57.320 --> 00:23:04.279 +You can see there... It shows how much rich context + +00:23:04.280 --> 00:23:05.879 +you can capture. + +00:23:05.880 --> 00:23:08.799 +This is just a closer snapshot + +00:23:08.800 --> 00:23:10.159 +of the properties on the node, + +00:23:10.160 --> 00:23:13.119 +and you can see we can have things like, + +00:23:13.120 --> 00:23:14.799 +whether the word is a token or not, + +00:23:14.800 --> 00:23:17.359 +or that it's incomplete, whether some words + +00:23:17.360 --> 00:23:19.959 +might want to be filtered out for parsing, + +00:23:19.960 --> 00:23:23.039 +and we can say this: PARSE_IGNORE, + +00:23:23.040 --> 00:23:25.519 +or some words or restart markers... + +00:23:25.520 --> 00:23:29.239 +We can add a RESTART_MARKER, or sometimes, + +00:23:29.240 --> 00:23:31.999 +some of these might have durations. Things like that.
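In raw Org text, the hidden properties on nodes like the partial word might look as follows. The property names PARSE_IGNORE and RESTART_MARKER come from the talk; the drawer layout and the values shown are a sketch, not the actual YAMR file:

```org
** tr-
:PROPERTIES:
:INCOMPLETE:     t
:PARSE_IGNORE:   t
:DURATION:       0.21
:END:
** telling
:PROPERTIES:
:RESTART_MARKER: t
:END:
```

A property match (for example, a sparse tree over `PARSE_IGNORE="t"`) or a column view over these properties can then pull out exactly the tokens a parser should skip, leaving the fluent word sequence behind.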
+
+NOTE Editing properties in column view
+
+00:23:32.000 --> 00:23:38.799
+The other fascinating thing about this representation
+
+00:23:38.800 --> 00:23:42.599
+is that you can edit properties in the column view.
+
+00:23:42.600 --> 00:23:45.399
+And suddenly, you have this tabular data structure
+
+00:23:45.400 --> 00:23:48.879
+combined with the hierarchical data structure.
+
+00:23:48.880 --> 00:23:53.119
+And as you can--you may not be able to see it here,
+
+00:23:53.120 --> 00:23:56.879
+but what has also happened here is that
+
+00:23:56.880 --> 00:24:01.159
+some of the tags have been inherited
+
+00:24:01.160 --> 00:24:02.479
+from the earlier nodes.
+
+00:24:02.480 --> 00:24:07.919
+And so you get a much fuller picture of things.
+
+00:24:07.920 --> 00:24:13.919
+Essentially, you can filter out things
+
+00:24:13.920 --> 00:24:15.319
+that you want to process,
+
+00:24:15.320 --> 00:24:20.279
+process them, and then reintegrate them into the whole.
+
+NOTE Conclusion
+
+00:24:20.280 --> 00:24:25.479
+So, in conclusion, today we have proposed and demonstrated
+
+00:24:25.480 --> 00:24:27.559
+the use of an architecture (GRAIL),
+
+00:24:27.560 --> 00:24:31.319
+which allows the representation, manipulation,
+
+00:24:31.320 --> 00:24:34.759
+and aggregation of rich linguistic structures
+
+00:24:34.760 --> 00:24:36.519
+in a systematic fashion.
+
+00:24:36.520 --> 00:24:41.359
+We have shown how GRAIL advances the tools
+
+00:24:41.360 --> 00:24:44.599
+available for building machine learning models
+
+00:24:44.600 --> 00:24:46.879
+that simulate understanding.
+
+00:24:46.880 --> 00:24:51.679
+Thank you very much for your time and attention today.
+
+00:24:51.680 --> 00:24:54.639
+My contact information is on this slide.
+
+00:24:54.640 --> 00:25:02.599
+If you are interested in an additional example
+
+00:25:02.600 --> 00:25:05.439
+that demonstrates the representation
+
+00:25:05.440 --> 00:25:08.039
+of speech and written text together,
+
+00:25:08.040 --> 00:25:10.719
+please continue watching.
+
+00:25:10.720 --> 00:25:12.199
+Otherwise, you can stop here
+
+00:25:12.200 --> 00:25:15.279
+and enjoy the rest of the conference.
+
+NOTE Bonus material
+
+00:25:15.280 --> 00:25:39.079
+Welcome to the bonus material.
+
+00:25:39.080 --> 00:25:43.959
+I'm glad for those of you who stuck around.
+
+00:25:43.960 --> 00:25:46.559
+We are now going to examine an instance
+
+00:25:46.560 --> 00:25:49.159
+of speech and text signals together
+
+00:25:49.160 --> 00:25:51.479
+that produce multiple layers.
+
+00:25:51.480 --> 00:25:54.839
+When we have--when we take a spoken conversation
+
+00:25:54.840 --> 00:25:58.719
+and use the best language processing models available,
+
+00:25:58.720 --> 00:26:00.679
+we suddenly hit a hard spot,
+
+00:26:00.680 --> 00:26:03.239
+because the tools are typically not trained
+
+00:26:03.240 --> 00:26:05.359
+to filter out the unnecessary cruft
+
+00:26:05.360 --> 00:26:07.559
+in order to automatically interpret
+
+00:26:07.560 --> 00:26:09.559
+the part of what is being said
+
+00:26:09.560 --> 00:26:11.799
+that is actually relevant.
+
+00:26:11.800 --> 00:26:14.639
+Over time, language researchers
+
+00:26:14.640 --> 00:26:17.719
+have created many interdependent layers of annotations,
+
+00:26:17.720 --> 00:26:21.039
+yet the assumptions underlying them are seldom the same.
+
+00:26:21.040 --> 00:26:25.039
+Piecing together such related but disjointed annotations
+
+00:26:25.040 --> 00:26:28.039
+or their predictions poses a huge challenge.
+
+00:26:28.040 --> 00:26:30.719
+This is another place where we can leverage
+
+00:26:30.720 --> 00:26:33.119
+the data model underlying the Emacs editor,
+
+00:26:33.120 --> 00:26:35.359
+along with the structural editing capabilities
+
+00:26:35.360 --> 00:26:38.519
+of Org mode, to improve current tools.
+
+00:26:38.520 --> 00:26:42.839
+Let's take this very simple-looking utterance:
+
+00:26:42.840 --> 00:26:48.039
+"Um {lipsmack} and that's it. ({laugh})"
+
+00:26:48.040 --> 00:26:50.319
+Looks like the person-- so this is--
+
+00:26:50.320 --> 00:26:54.519
+what you are seeing here is a transcript of an audio signal
+
+00:26:54.520 --> 00:27:00.759
+that has a lip smack and a laugh as part of it,
+
+00:27:00.760 --> 00:27:04.199
+and there is also an "Um"-like interjection.
+
+00:27:04.200 --> 00:27:08.199
+So this has a few interesting noises
+
+00:27:08.200 --> 00:27:13.999
+and specific things that would be illustrative
+
+00:27:14.000 --> 00:27:20.479
+of what we are going to, how we are going to represent it.
+
+NOTE Syntactic analysis
+
+00:27:20.480 --> 00:27:25.839
+Okay. So let's say you want to have
+
+00:27:25.840 --> 00:27:28.879
+a syntactic analysis of this sentence or utterance.
+
+00:27:28.880 --> 00:27:30.959
+One common technique people use
+
+00:27:30.960 --> 00:27:32.879
+is just to remove the cruft, and, you know,
+
+00:27:32.880 --> 00:27:35.079
+write some rules, clean up the utterance,
+
+00:27:35.080 --> 00:27:36.719
+make it look like it's proper English,
+
+00:27:36.720 --> 00:27:40.239
+and then, you know, tokenize it,
+
+00:27:40.240 --> 00:27:43.079
+and basically just use standard tools to process it.
+
+00:27:43.080 --> 00:27:47.279
+But in that process, they end up eliminating
+
+00:27:47.280 --> 00:27:51.119
+valid pieces of signal that have meaning to others
+
+00:27:51.120 --> 00:27:52.799
+studying different phenomena of language.
+
+00:27:52.800 --> 00:27:56.479
+Here you have the rich transcript,
+
+00:27:56.480 --> 00:28:00.119
+the input to the syntactic parser.
+
+00:28:00.120 --> 00:28:05.919
+As you can see, there is a little tokenization happening,
+
+00:28:05.920 --> 00:28:07.199
+where you'll be inserting a space
+
+00:28:07.200 --> 00:28:12.119
+between "that" and the contracted is ('s),
+
+00:28:12.120 --> 00:28:15.599
+and between the period and the "it,"
+
+00:28:15.600 --> 00:28:18.199
+and the output of the syntactic parser is shown below,
+
+00:28:18.200 --> 00:28:21.639
+which (surprise) is an S-expression.
+
+00:28:21.640 --> 00:28:24.919
+Like I said, the parse trees, when they were created,
+
+00:28:24.920 --> 00:28:29.799
+and still largely when they are used, are S-expressions,
+
+00:28:29.800 --> 00:28:32.999
+and most of the viewers here
+
+00:28:33.000 --> 00:28:35.119
+should not have much problem reading it.
+
+00:28:35.120 --> 00:28:37.279
+You can see the tree structure
+
+00:28:37.280 --> 00:28:39.279
+of this syntactic parse here.
+
+NOTE Forced alignment
+
+00:28:39.280 --> 00:28:40.919
+Now let's say you want to integrate
+
+00:28:40.920 --> 00:28:44.479
+phonetic information, or a phonetic layer,
+
+00:28:44.480 --> 00:28:49.119
+that's in the audio signal, and do some analysis.
+
+00:28:49.120 --> 00:28:57.519
+Now, it would need you to do a few-- take a few steps.
+
+00:28:57.520 --> 00:29:01.679
+First, you would need to align the transcript
+
+00:29:01.680 --> 00:29:06.479
+with the audio. This process is called forced alignment,
+
+00:29:06.480 --> 00:29:10.399
+where you already know what the transcript is,
+
+00:29:10.400 --> 00:29:14.599
+and you have the audio, and you can get a good alignment
+
+00:29:14.600 --> 00:29:17.599
+using both pieces of information.
+
+00:29:17.600 --> 00:29:20.119
+And this is typically a technique that is used to
+
+00:29:20.120 --> 00:29:23.079
+create training data for training
+
+00:29:23.080 --> 00:29:25.839
+automatic speech recognizers.
+
+00:29:25.840 --> 00:29:29.639
+One interesting thing is that in order to do
+
+00:29:29.640 --> 00:29:32.879
+this forced alignment, you have to keep
+
+00:29:32.880 --> 00:29:35.799
+the non-speech events in the transcript,
+
+00:29:35.800 --> 00:29:39.079
+because they consume some audio signal,
+
+00:29:39.080 --> 00:29:41.399
+and if you don't have that signal,
+
+00:29:41.400 --> 00:29:44.399
+the alignment process doesn't know exactly...
+
+00:29:44.400 --> 00:29:45.759
+you know, it doesn't do a good job,
+
+00:29:45.760 --> 00:29:50.039
+because it needs to align all parts of the signal
+
+00:29:50.040 --> 00:29:54.999
+with something: either pause or silence or noise or words.
+
+00:29:55.000 --> 00:29:59.719
+Interestingly, punctuation really doesn't factor in,
+
+00:29:59.720 --> 00:30:01.559
+because we don't speak in punctuation.
+
+00:30:01.560 --> 00:30:04.239
+So one of the things that you need to do
+
+00:30:04.240 --> 00:30:05.679
+is remove most of the punctuation,
+
+00:30:05.680 --> 00:30:08.039
+although you'll see there are some punctuation marks
+
+00:30:08.040 --> 00:30:12.599
+that can be kept, or that are to be kept.
+
+NOTE Alignment before tokenization
+
+00:30:12.600 --> 00:30:15.319
+And the other thing is that the alignment has to be done
+
+00:30:15.320 --> 00:30:20.159
+before tokenization, as it impacts pronunciation.
+
+00:30:20.160 --> 00:30:24.399
+To show an example: Here you see "that's".
+
+00:30:24.400 --> 00:30:26.919
+When it's one word,
+
+00:30:26.920 --> 00:30:31.959
+it has a slightly different pronunciation
+
+00:30:31.960 --> 00:30:35.679
+than when it is two words, which is "that is",
+
+00:30:35.680 --> 00:30:38.399
+like you can see with "is."
And so,
+
+00:30:38.400 --> 00:30:44.279
+if you split the tokens or split the words
+
+00:30:44.280 --> 00:30:48.119
+in order for the syntactic parser to process it,
+
+00:30:48.120 --> 00:30:51.599
+you would end up getting the wrong phonetic analysis.
+
+00:30:51.600 --> 00:30:54.239
+And if you have--if you process it
+
+00:30:54.240 --> 00:30:55.319
+through the phonetic analysis,
+
+00:30:55.320 --> 00:30:59.159
+and you don't know how to integrate it
+
+00:30:59.160 --> 00:31:02.719
+with the tokenized syntax, you can, you know...
+
+00:31:02.720 --> 00:31:07.519
+that can be pretty tricky. And a lot of the time,
+
+00:31:07.520 --> 00:31:10.759
+people write one-off pieces of code that handle these,
+
+00:31:10.760 --> 00:31:14.279
+but the idea here is to try to have a general architecture
+
+00:31:14.280 --> 00:31:17.239
+that seamlessly integrates all these pieces.
+
+00:31:17.240 --> 00:31:21.319
+Then you do the syntactic parsing of the remaining tokens.
+
+00:31:21.320 --> 00:31:24.799
+Then you align the data and the two annotations,
+
+00:31:24.800 --> 00:31:27.959
+and then integrate the two layers.
+
+00:31:27.960 --> 00:31:31.359
+Once that is done, then you can do all kinds of
+
+00:31:31.360 --> 00:31:33.919
+interesting analysis, and test various hypotheses
+
+00:31:33.920 --> 00:31:35.279
+and generate the statistics,
+
+00:31:35.280 --> 00:31:39.359
+but without that, you are only dealing
+
+00:31:39.360 --> 00:31:42.879
+with one or the other part.
+
+NOTE Layers
+
+00:31:42.880 --> 00:31:48.319
+Let's just take a quick look at what each of the layers
+
+00:31:48.320 --> 00:31:51.159
+that are involved looks like.
+
+00:31:51.160 --> 00:31:56.719
+So this is "Um {lipsmack}, and that's it. {laugh}"
+
+00:31:56.720 --> 00:32:00.159
+This is the transcript, and on the right hand side,
+
+00:32:00.160 --> 00:32:04.199
+you see the same thing as a transcript
+
+00:32:04.200 --> 00:32:06.239
+listed vertically in a column.
+
+00:32:06.240 --> 00:32:08.199
+You'll see why, in just a second.
+
+00:32:08.200 --> 00:32:09.879
+And there are some place--
+
+00:32:09.880 --> 00:32:11.279
+there are some rows that are empty,
+
+00:32:11.280 --> 00:32:15.079
+some rows that are wider than the others, and we'll see why.
+
+00:32:15.080 --> 00:32:19.319
+The next is the tokenized sentence,
+
+00:32:19.320 --> 00:32:20.959
+where you have space added,
+
+00:32:20.960 --> 00:32:23.599
+you know, space between these two tokens:
+
+00:32:23.600 --> 00:32:26.599
+"that" and the apostrophe "s" ('s),
+
+00:32:26.600 --> 00:32:28.079
+and the "it" and the period.
+
+00:32:28.080 --> 00:32:30.679
+And you see on the right hand side
+
+00:32:30.680 --> 00:32:33.559
+that the tokens have attributes.
+
+00:32:33.560 --> 00:32:36.439
+So there is a token index, and there are 1, 2,
+
+00:32:36.440 --> 00:32:38.839
+you know, 0, 1, 2, 3, 4, 5 tokens,
+
+00:32:38.840 --> 00:32:41.479
+and each token has a start and end character,
+
+00:32:41.480 --> 00:32:45.799
+and space (sp) also has a start and end character,
+
+00:32:45.800 --> 00:32:50.399
+and space is represented by an "sp". And there are
+
+00:32:50.400 --> 00:32:54.319
+these other things that we removed,
+
+00:32:54.320 --> 00:32:56.239
+like the "{LS}" which is for "{lipsmack}"
+
+00:32:56.240 --> 00:32:59.399
+and "{LG}" which is "{laugh}"; they are shown grayed out,
+
+00:32:59.400 --> 00:33:02.439
+and you'll see why some of these things are grayed out
+
+00:33:02.440 --> 00:33:03.399
+in a little bit.
+
+00:33:03.400 --> 00:33:11.919
+This is what the forced alignment tool produces.
+
+00:33:11.920 --> 00:33:17.159
+Basically, it takes the transcript,
+
+00:33:17.160 --> 00:33:19.159
+and this is the transcript
+
+00:33:19.160 --> 00:33:24.119
+that has slightly different symbols,
+
+00:33:24.120 --> 00:33:26.239
+because different tools use different symbols
+
+00:33:26.240 --> 00:33:28.159
+and their various configurational things.
+
+00:33:28.160 --> 00:33:33.679
+But this is what is used to get an alignment,
+
+00:33:33.680 --> 00:33:36.039
+or time alignment, with phones.
+
+00:33:36.040 --> 00:33:40.079
+So this column shows the phones, and so each word...
+
+00:33:40.080 --> 00:33:43.879
+So, for example, "and" has been aligned with these phones,
+
+00:33:43.880 --> 00:33:46.879
+and these on the start and end
+
+00:33:46.880 --> 00:33:52.959
+are essentially timestamps that it aligned--
+
+00:33:52.960 --> 00:33:54.279
+that has been aligned to it.
+
+00:33:54.280 --> 00:34:00.759
+Interestingly, sometimes we don't really have any pause
+
+00:34:00.760 --> 00:34:05.159
+or any time duration between some words,
+
+00:34:05.160 --> 00:34:08.199
+and those are highlighted as gray here.
+
+00:34:08.200 --> 00:34:12.759
+See, there's this space... Actually,
+
+00:34:12.760 --> 00:34:17.799
+it does not have any temporal content,
+
+00:34:17.800 --> 00:34:21.319
+whereas this other space has some duration.
+
+00:34:21.320 --> 00:34:24.839
+So the ones that have some duration are captured,
+
+00:34:24.840 --> 00:34:29.519
+while the others are the ones that in the earlier diagram
+
+00:34:29.520 --> 00:34:31.319
+we saw were left out.
+
+NOTE Variations
+
+00:34:31.320 --> 00:34:37.639
+And the aligner actually produces multiple files.
+
+00:34:37.640 --> 00:34:44.399
+One of the files has a different, slightly different
+
+00:34:44.400 --> 00:34:46.679
+variation on the same information,
+
+00:34:46.680 --> 00:34:49.999
+and in this case, you can see
+
+00:34:50.000 --> 00:34:52.399
+that the punctuation is missing,
+
+00:34:52.400 --> 00:34:57.599
+and the punctuation is, you know, deliberately missing,
+
+00:34:57.600 --> 00:35:02.279
+because there is no time associated with it,
+
+00:35:02.280 --> 00:35:06.439
+and you see that it's not the tokenized sentence--
+
+00:35:06.440 --> 00:35:17.119
+a tokenized word. This...
Now it gives you a full table,
+
+00:35:17.120 --> 00:35:21.239
+and you can't really look into it very carefully.
+
+00:35:21.240 --> 00:35:25.879
+But we can focus on the part that seems legible,
+
+00:35:25.880 --> 00:35:28.559
+or, you know, a properly written sentence,
+
+00:35:28.560 --> 00:35:32.879
+process it, and reincorporate it back into the whole.
+
+00:35:32.880 --> 00:35:35.879
+So if somebody wants to look at, for example,
+
+00:35:35.880 --> 00:35:39.679
+how many pauses the person made while they were talking,
+
+00:35:39.680 --> 00:35:42.919
+they can actually measure the pauses--the number,
+
+00:35:42.920 --> 00:35:46.279
+the duration--and make connections between that
+
+00:35:46.280 --> 00:35:49.639
+and the rich syntactic structure that is being produced.
+
+00:35:49.640 --> 00:35:57.279
+And in order to do that, you have to get these layers
+
+00:35:57.280 --> 00:35:59.039
+to align with each other,
+
+00:35:59.040 --> 00:36:04.359
+and this table is just a tabular representation
+
+00:36:04.360 --> 00:36:08.679
+of the information that we'll be storing in the YAMR file.
+
+00:36:08.680 --> 00:36:11.719
+Congratulations! You have reached
+
+00:36:11.720 --> 00:36:13.479
+the end of this demonstration.
+
+00:36:13.480 --> 00:36:17.000
+Thank you for your time and attention.