WEBVTT captioned by Sameer NOTE Introduction 00:00:00.000 --> 00:00:05.839 Thank you for joining me today. I'm Sameer Pradhan 00:00:05.840 --> 00:00:07.799 from the Linguistic Data Consortium 00:00:07.800 --> 00:00:10.079 at the University of Pennsylvania 00:00:10.080 --> 00:00:14.519 and founder of cemantix.org. 00:00:14.520 --> 00:00:16.879 Today we'll be addressing research 00:00:16.880 --> 00:00:18.719 in computational linguistics, 00:00:18.720 --> 00:00:22.039 also known as natural language processing, 00:00:22.040 --> 00:00:24.719 a subarea of artificial intelligence 00:00:24.720 --> 00:00:27.759 with a focus on modeling and predicting 00:00:27.760 --> 00:00:31.919 complex linguistic structures from various signals. 00:00:31.920 --> 00:00:35.799 The work we present is limited to text and speech signals, 00:00:35.800 --> 00:00:38.639 but it can be extended to other signals. 00:00:38.640 --> 00:00:40.799 We propose an architecture, 00:00:40.800 --> 00:00:42.959 and we call it GRAIL, which allows 00:00:42.960 --> 00:00:44.639 the representation and aggregation 00:00:44.640 --> 00:00:50.199 of such rich structures in a systematic fashion. 00:00:50.200 --> 00:00:52.679 I'll demonstrate a proof of concept 00:00:52.680 --> 00:00:56.559 for representing and manipulating data and annotations 00:00:56.560 --> 00:00:58.519 for the specific purpose of building 00:00:58.520 --> 00:01:02.879 machine learning models that simulate understanding. 00:01:02.880 --> 00:01:05.679 These technologies have the potential for impact 00:01:05.680 --> 00:01:09.119 in almost every conceivable field 00:01:09.120 --> 00:01:13.399 that generates and uses data. NOTE Processing language 00:01:13.400 --> 00:01:15.039 We process human language 00:01:15.040 --> 00:01:16.719 when our brains receive and assimilate 00:01:16.720 --> 00:01:20.079 various signals, which are then manipulated 00:01:20.080 --> 00:01:23.879 and interpreted within a syntactic structure. 00:01:23.880 --> 00:01:27.319 It's a complex process that I have simplified here 00:01:27.320 --> 00:01:30.759 for the purpose of comparison to machine learning. 00:01:30.760 --> 00:01:33.959 Recent machine learning models tend to require 00:01:33.960 --> 00:01:37.039 a large amount of raw, naturally occurring data 00:01:37.040 --> 00:01:40.199 and a varying amount of manually enriched data, 00:01:40.200 --> 00:01:43.199 commonly known as "annotations". 00:01:43.200 --> 00:01:45.959 Owing to the complexity and sheer number 00:01:45.960 --> 00:01:49.959 of linguistic phenomena, we have most often used 00:01:49.960 --> 00:01:52.999 a divide-and-conquer approach. 00:01:53.000 --> 00:01:55.399 The strength of this approach is that it allows us 00:01:55.400 --> 00:01:58.159 to focus on a single, or perhaps a few related, 00:01:58.160 --> 00:02:00.439 linguistic phenomena. 00:02:00.440 --> 00:02:03.879 The weaknesses are, first, that the universe of these phenomena 00:02:03.880 --> 00:02:07.239 keeps expanding as language itself 00:02:07.240 --> 00:02:09.359 evolves and changes over time, 00:02:09.360 --> 00:02:13.119 and second, that this approach requires the additional task 00:02:13.120 --> 00:02:14.839 of aggregating the interpretations, 00:02:14.840 --> 00:02:18.359 creating more opportunities for computer error.
00:02:18.360 --> 00:02:21.519 Our challenge, then, is to find the sweet spot 00:02:21.520 --> 00:02:25.239 that allows us to encode complex information 00:02:25.240 --> 00:02:27.719 without the use of manual annotation, 00:02:27.720 --> 00:02:34.559 or without the additional task of aggregation by computers. NOTE Annotation 00:02:34.560 --> 00:02:37.119 So what do I mean by "annotation"? 00:02:37.120 --> 00:02:39.759 In this talk, the word annotation refers to 00:02:39.760 --> 00:02:43.519 the manual assignment of certain attributes 00:02:43.520 --> 00:02:48.639 to portions of a signal, which is necessary 00:02:48.640 --> 00:02:51.639 to perform the end task. 00:02:51.640 --> 00:02:54.439 For example, in order for the algorithm 00:02:54.440 --> 00:02:57.439 to accurately interpret a pronoun, 00:02:57.440 --> 00:03:00.279 it needs to know 00:03:00.280 --> 00:03:03.799 what that pronoun refers back to. 00:03:03.800 --> 00:03:06.719 We may find this task trivial; however, 00:03:06.720 --> 00:03:10.599 current algorithms repeatedly fail at this task. 00:03:10.600 --> 00:03:13.319 So the complexities of understanding 00:03:13.320 --> 00:03:16.639 in computational linguistics require annotation. 00:03:16.640 --> 00:03:20.799 The word annotation itself is a useful example, 00:03:20.800 --> 00:03:22.679 because it also reminds us 00:03:22.680 --> 00:03:25.119 that words have multiple meanings, 00:03:25.120 --> 00:03:27.519 as annotation itself does— 00:03:27.520 --> 00:03:30.559 just as I needed to define it in this context, 00:03:30.560 --> 00:03:33.799 so that my message won't be misinterpreted. 00:03:33.800 --> 00:03:39.039 So, too, must annotators do this for algorithms 00:03:39.040 --> 00:03:43.239 through manual intervention. NOTE Learning from data 00:03:43.240 --> 00:03:44.759 Learning from raw data 00:03:44.760 --> 00:03:47.039 (commonly known as unsupervised learning) 00:03:47.040 --> 00:03:50.079 poses limitations for machine learning. 00:03:50.080 --> 00:03:53.039 As I described, modeling complex phenomena 00:03:53.040 --> 00:03:55.559 requires manual annotations. 00:03:55.560 --> 00:03:58.559 The learning algorithm uses these annotations 00:03:58.560 --> 00:04:01.319 as examples to build statistical models. 00:04:01.320 --> 00:04:04.879 This is called supervised learning. 00:04:04.880 --> 00:04:06.319 Without going into too much detail, 00:04:06.320 --> 00:04:10.039 I'll simply note that the recent popularity 00:04:10.040 --> 00:04:12.519 of the concept of deep learning 00:04:12.520 --> 00:04:14.679 is that evolutionary step 00:04:14.680 --> 00:04:17.319 where we have learned to train models 00:04:17.320 --> 00:04:20.799 using trillions of parameters in ways that they can 00:04:20.800 --> 00:04:25.079 learn richer hierarchical structures 00:04:25.080 --> 00:04:29.399 from very large amounts of unannotated data. 00:04:29.400 --> 00:04:32.319 These models can then be fine-tuned, 00:04:32.320 --> 00:04:35.599 using varying amounts of annotated examples 00:04:35.600 --> 00:04:37.639 depending on the complexity of the task 00:04:37.640 --> 00:04:39.679 to generate better predictions. NOTE Manual annotation 00:04:39.680 --> 00:04:44.919 As you might imagine, manually annotating 00:04:44.920 --> 00:04:47.359 complex linguistic phenomena 00:04:47.360 --> 00:04:51.719 can be a very specific, labor-intensive task.
00:04:51.720 --> 00:04:54.279 For example, imagine if we were 00:04:54.280 --> 00:04:56.399 to go back through this presentation 00:04:56.400 --> 00:04:58.399 and connect all the pronouns 00:04:58.400 --> 00:04:59.919 with the nouns to which they refer. 00:04:59.920 --> 00:05:03.239 Even for a short 18-minute presentation, 00:05:03.240 --> 00:05:05.239 this would require hundreds of annotations. 00:05:05.240 --> 00:05:08.519 The models we build are only as good 00:05:08.520 --> 00:05:11.119 as the quality of the annotations we make. 00:05:11.120 --> 00:05:12.679 We need guidelines 00:05:12.680 --> 00:05:15.759 that ensure that the annotations are done 00:05:15.760 --> 00:05:19.719 by at least two humans who have substantial agreement 00:05:19.720 --> 00:05:22.119 with each other in their interpretations. 00:05:22.120 --> 00:05:25.599 We know that if we try to train a model using annotations 00:05:25.600 --> 00:05:28.519 that are very subjective, or have too much noise, 00:05:28.520 --> 00:05:30.919 we will receive poor predictions. 00:05:30.920 --> 00:05:33.679 Additionally, there is the concern of introducing 00:05:33.680 --> 00:05:37.079 various unexpected biases into one's models. 00:05:37.080 --> 00:05:44.399 So annotation is really both an art and a science. NOTE How can we develop a unified representation? 00:05:44.400 --> 00:05:47.439 In the remaining time, 00:05:47.440 --> 00:05:49.999 we will turn to two fundamental questions. 00:05:50.000 --> 00:05:54.239 First, how can we develop a unified representation 00:05:54.240 --> 00:05:55.599 of data and annotations 00:05:55.600 --> 00:05:59.759 that encompasses arbitrary levels of linguistic information? 00:05:59.760 --> 00:06:03.839 There is a long history of attempting to answer 00:06:03.840 --> 00:06:04.839 this first question. 00:06:04.840 --> 00:06:08.839 This history is documented in our recent article, 00:06:08.840 --> 00:06:11.519 and you can refer to that article. 00:06:11.520 --> 00:06:16.719 It will be on the website. 00:06:16.720 --> 00:06:18.999 It is as if we, as a community, 00:06:19.000 --> 00:06:22.519 have been searching for our own Holy Grail. NOTE What role might Emacs and Org mode play? 00:06:22.520 --> 00:06:26.519 The second question we will pose is 00:06:26.520 --> 00:06:30.159 what role might Emacs, along with Org mode, 00:06:30.160 --> 00:06:31.919 play in this process? 00:06:31.920 --> 00:06:35.359 Well, the solution itself may not be tied to Emacs, 00:06:35.360 --> 00:06:38.359 but Emacs has built-in capabilities 00:06:38.360 --> 00:06:42.599 that could be useful for evaluating potential solutions. 00:06:42.600 --> 00:06:45.759 It's also one of the most extensively documented 00:06:45.760 --> 00:06:48.519 and most customizable 00:06:48.520 --> 00:06:51.599 pieces of software that I have ever come across, 00:06:51.600 --> 00:06:55.279 and many would agree with that. NOTE The complex structure of language 00:06:55.280 --> 00:07:00.639 In order to approach this second question, 00:07:00.640 --> 00:07:03.919 we turn to the complex structure of language itself. 00:07:03.920 --> 00:07:07.679 At first glance, language appears to us 00:07:07.680 --> 00:07:09.879 as a series of words. 00:07:09.880 --> 00:07:13.439 Words form sentences, sentences form paragraphs, 00:07:13.440 --> 00:07:16.239 and paragraphs form a complete text.
00:07:16.240 --> 00:07:19.039 If this were a sufficient description 00:07:19.040 --> 00:07:21.159 of the complexity of language, 00:07:21.160 --> 00:07:24.199 all of us would be able to speak and read 00:07:24.200 --> 00:07:26.559 at least ten different languages. 00:07:26.560 --> 00:07:29.279 We know it is much more complex than this. 00:07:29.280 --> 00:07:33.199 There is a rich, underlying recursive tree structure-- 00:07:33.200 --> 00:07:36.439 in fact, many possible tree structures-- 00:07:36.440 --> 00:07:39.439 which make a particular sequence meaningful 00:07:39.440 --> 00:07:42.079 and many others meaningless. 00:07:42.080 --> 00:07:45.239 One of the better-understood tree structures 00:07:45.240 --> 00:07:47.119 is the syntactic structure. 00:07:47.120 --> 00:07:49.439 While natural language 00:07:49.440 --> 00:07:51.679 has rich ambiguities and complexities, 00:07:51.680 --> 00:07:55.119 programming languages are designed to be parsed 00:07:55.120 --> 00:07:56.999 and interpreted deterministically. 00:07:57.000 --> 00:08:02.159 Emacs has been used for programming very effectively. 00:08:02.160 --> 00:08:05.359 So there is a potential for using Emacs 00:08:05.360 --> 00:08:06.559 as a tool for annotation. 00:08:06.560 --> 00:08:10.799 This would significantly improve our current set of tools. NOTE Annotation tools 00:08:10.800 --> 00:08:16.559 It is important to note that most of the annotation tools 00:08:16.560 --> 00:08:19.639 that have been developed over the past few decades 00:08:19.640 --> 00:08:22.879 have relied on graphical interfaces, 00:08:22.880 --> 00:08:26.919 even those used for enriching textual information. 00:08:26.920 --> 00:08:30.399 Most of the tools in current use 00:08:30.400 --> 00:08:36.159 are designed for an end user to add very specific, 00:08:36.160 --> 00:08:38.639 very restricted information. 00:08:38.640 --> 00:08:42.799 We have not really made use of the potential 00:08:42.800 --> 00:08:45.639 that an editor or a rich editing environment like Emacs 00:08:45.640 --> 00:08:47.239 can add to the mix. 00:08:47.240 --> 00:08:52.479 Emacs has long enabled the editing and manipulation of 00:08:52.480 --> 00:08:56.359 complex embedded tree structures abundant in source code. 00:08:56.360 --> 00:08:58.599 So it's not difficult to imagine that it would have 00:08:58.600 --> 00:09:00.359 many capabilities that we need 00:09:00.360 --> 00:09:02.599 to represent actual language. 00:09:02.600 --> 00:09:04.759 In fact, it already does that with features 00:09:04.760 --> 00:09:06.399 that allow us to quickly navigate 00:09:06.400 --> 00:09:07.919 through sentences and paragraphs, 00:09:07.920 --> 00:09:09.799 needing only a few keystrokes, 00:09:09.800 --> 00:09:13.599 or to add various text properties to text spans 00:09:13.600 --> 00:09:17.039 and to create overlays, to name but a few. 00:09:17.040 --> 00:09:22.719 Emacs figured out how to handle Unicode, 00:09:22.720 --> 00:09:26.799 so you don't even have to worry about the complexity 00:09:26.800 --> 00:09:29.439 of managing multiple languages. 00:09:29.440 --> 00:09:34.039 It's built into Emacs. In fact, this is not the first time 00:09:34.040 --> 00:09:37.399 Emacs has been used for linguistic analysis.
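NOTE Editor's aside: The text-property and overlay features mentioned above can be sketched in a few lines of Emacs Lisp. This is a minimal illustration added for readers of the transcript, not code from the talk; the 'pos property name and the buffer positions are hypothetical placeholders.
    ;; Assume the current buffer contains "I saw the moon with a telescope."
    ;; and that positions 7-15 cover the span "the moon" (placeholder positions).
    (put-text-property 7 15 'pos 'noun-phrase)    ; attach a hypothetical annotation to the span
    (let ((ov (make-overlay 7 15)))               ; create an overlay over the same span
      (overlay-put ov 'face 'highlight)           ; make the annotated span visually distinct
      (overlay-put ov 'help-echo "NP: the moon")) ; show a tooltip describing the annotation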
00:09:37.400 --> 00:09:41.159 One of the breakthrough moments in 00:09:41.160 --> 00:09:44.439 natural language processing was the creation 00:09:44.440 --> 00:09:48.639 of manually created syntactic trees 00:09:48.640 --> 00:09:50.439 for a 1-million-word collection 00:09:50.440 --> 00:09:52.399 of Wall Street Journal articles. 00:09:52.400 --> 00:09:54.879 This was around 1992, 00:09:54.880 --> 00:09:59.279 before Java or graphical interfaces were common. 00:09:59.280 --> 00:10:03.279 The tool that was used to create that corpus was Emacs. 00:10:03.280 --> 00:10:08.959 It was created at UPenn, and is famously known as 00:10:08.960 --> 00:10:12.719 the Penn Treebank. '92 was about when 00:10:12.720 --> 00:10:16.439 the Linguistic Data Consortium was also established, 00:10:16.440 --> 00:10:18.039 and it's been about 30 years 00:10:18.040 --> 00:10:20.719 that it has been creating various 00:10:20.720 --> 00:10:22.359 language-related resources. NOTE Org mode 00:10:22.360 --> 00:10:28.519 Org mode--in particular, the outlining mode, 00:10:28.520 --> 00:10:32.399 or rather the enhanced form of outlining mode-- 00:10:32.400 --> 00:10:35.599 allows us to create rich outlines, 00:10:35.600 --> 00:10:37.799 attaching properties to nodes, 00:10:37.800 --> 00:10:41.119 and provides commands for easily customizing the 00:10:41.120 --> 00:10:43.879 sorting of various pieces of information 00:10:43.880 --> 00:10:45.639 as per one's requirements. 00:10:45.640 --> 00:10:50.239 This can also be a very useful tool. 00:10:50.240 --> 00:10:59.159 This enhanced form of outline-mode adds more power to Emacs. 00:10:59.160 --> 00:11:03.359 It provides commands for easily customizing 00:11:03.360 --> 00:11:05.159 and filtering information, 00:11:05.160 --> 00:11:08.999 while at the same time hiding unnecessary context. 00:11:09.000 --> 00:11:11.919 It also allows structural editing. 00:11:11.920 --> 00:11:16.039 This can be a very useful tool to enrich corpora 00:11:16.040 --> 00:11:20.919 where we are focusing on a limited set of phenomena. 00:11:20.920 --> 00:11:24.519 The two together allow us to create 00:11:24.520 --> 00:11:27.199 a rich representation 00:11:27.200 --> 00:11:32.999 that can simultaneously capture multiple possible sequences, 00:11:33.000 --> 00:11:38.759 capture details necessary to recreate the original source, 00:11:38.760 --> 00:11:42.079 allow the creation of hierarchical representations, 00:11:42.080 --> 00:11:44.679 and provide structural editing capabilities 00:11:44.680 --> 00:11:47.439 that can take advantage of the concept of inheritance 00:11:47.440 --> 00:11:48.999 within the tree structure. 00:11:49.000 --> 00:11:54.279 Together they allow local manipulations of structures, 00:11:54.280 --> 00:11:56.199 thereby minimizing data coupling. 00:11:56.200 --> 00:11:59.119 The concept of tags in Org mode 00:11:59.120 --> 00:12:01.599 complements the hierarchy. 00:12:01.600 --> 00:12:03.839 Hierarchies can be very rigid, 00:12:03.840 --> 00:12:06.039 but by adding tags to hierarchies, 00:12:06.040 --> 00:12:08.839 we can have multifaceted representations. 00:12:08.840 --> 00:12:12.759 As a matter of fact, Org mode has the ability for the tags 00:12:12.760 --> 00:12:15.039 to have their own hierarchical structure, 00:12:15.040 --> 00:12:18.639 which further enhances the representational power.
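NOTE Editor's aside: As a rough sketch of the outline, property, and tag features just described, one annotated node in an Org buffer might look something like the following. The headline names, property keys, and tags are hypothetical, chosen only to illustrate the idea; they are not taken from the GRAIL files shown in the talk.
    * Sentence 1: "I saw the moon with a telescope."        :syntax:example:
    ** NP "the moon"
       :PROPERTIES:
       :START_CHAR: 7
       :END_CHAR:   15
       :HEAD_WORD:  moon
       :END:
Tags such as :syntax: are inherited by child headlines, and the property drawer stays folded until it is needed, which is part of what allows the "hiding unnecessary context" described above.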
00:12:18.640 --> 00:12:22.639 All of this can be done as a sequence 00:12:22.640 --> 00:12:25.679 of mostly functional data transformations, 00:12:25.680 --> 00:12:27.439 because most of the capabilities 00:12:27.440 --> 00:12:29.759 can be configured and customized. 00:12:29.760 --> 00:12:32.799 It is not necessary to do everything at once. 00:12:32.800 --> 00:12:36.199 Instead, it allows us to incrementally increase 00:12:36.200 --> 00:12:37.919 the complexity of the representation. 00:12:37.920 --> 00:12:39.799 Finally, all of this can be done 00:12:39.800 --> 00:12:42.359 in a plain-text representation, 00:12:42.360 --> 00:12:45.479 which comes with its own advantages. NOTE Example 00:12:45.480 --> 00:12:50.679 Now let's take a simple example. 00:12:50.680 --> 00:12:55.999 This is a short video that I'll play. 00:12:56.000 --> 00:12:59.679 The sentence is "I saw the moon with a telescope," 00:12:59.680 --> 00:13:03.999 and let's just make a copy of the sentence. 00:13:04.000 --> 00:13:09.199 What we can do now is to see: 00:13:09.200 --> 00:13:11.879 what does this sentence comprise? 00:13:11.880 --> 00:13:13.679 It has a noun phrase "I," 00:13:13.680 --> 00:13:17.479 followed by the word "saw." 00:13:17.480 --> 00:13:21.359 Then "the moon" is another noun phrase, 00:13:21.360 --> 00:13:24.839 and "with a telescope" is a prepositional phrase. 00:13:24.840 --> 00:13:30.759 Now one thing that you might remember 00:13:30.760 --> 00:13:36.119 from grammar school or syntax is that 00:13:36.120 --> 00:13:41.279 there is a syntactic structure. 00:13:41.280 --> 00:13:44.359 And in this particular case-- 00:13:44.360 --> 00:13:47.919 because we know that the moon is not typically 00:13:47.920 --> 00:13:51.679 something that can hold a telescope-- 00:13:51.680 --> 00:13:56.239 the seeing must be done by me, or "I," 00:13:56.240 --> 00:14:01.039 and the telescope must be in my hand; 00:14:01.040 --> 00:14:04.479 that is, "I" am viewing the moon with a telescope. 00:14:04.480 --> 00:14:13.519 However, it is possible that in a different context 00:14:13.520 --> 00:14:17.159 the moon could be referring to an animated character 00:14:17.160 --> 00:14:22.319 in an animated series, and could actually hold the telescope. 00:14:22.320 --> 00:14:23.479 And in that case 00:14:23.480 --> 00:14:24.839 the situation might be 00:14:24.840 --> 00:14:26.319 that I'm actually seeing 00:14:26.320 --> 00:14:30.959 the moon holding a telescope... 00:14:30.960 --> 00:14:36.079 I mean, the moon is holding the telescope, 00:14:36.080 --> 00:14:40.959 and I'm just seeing the moon holding the telescope. 00:14:40.960 --> 00:14:47.999 This is a complex linguistic ambiguity, a linguistic 00:14:48.000 --> 00:14:53.599 phenomenon that requires world knowledge, 00:14:53.600 --> 00:14:55.719 and it's called the PP attachment problem, 00:14:55.720 --> 00:14:59.239 where the prepositional phrase attachment 00:14:59.240 --> 00:15:04.599 can be ambiguous, and various different contextual cues 00:15:04.600 --> 00:15:06.879 have to be used to resolve the ambiguity. 00:15:06.880 --> 00:15:09.079 So in this case, as you saw, 00:15:09.080 --> 00:15:11.199 both readings are technically true, 00:15:11.200 --> 00:15:13.959 depending on the context.
00:15:13.960 --> 00:15:16.599 So one thing we could do is just 00:15:16.600 --> 00:15:19.919 to cut the tree and duplicate it, 00:15:19.920 --> 00:15:21.599 and then let's create another node 00:15:21.600 --> 00:15:24.479 and call it an "OR" node. 00:15:24.480 --> 00:15:26.119 That's because we are saying 00:15:26.120 --> 00:15:28.359 this is one of the two interpretations. 00:15:28.360 --> 00:15:32.159 Now let's call one interpretation "a", 00:15:32.160 --> 00:15:36.159 and that interpretation essentially 00:15:36.160 --> 00:15:39.319 is this child of the node "a", 00:15:39.320 --> 00:15:41.799 and it says that the moon 00:15:41.800 --> 00:15:43.999 is holding the telescope. 00:15:44.000 --> 00:15:46.359 Now we can create another representation "b" 00:15:46.360 --> 00:15:53.919 where we capture the other interpretation, 00:15:53.920 --> 00:15:59.959 in which I am actually 00:15:59.960 --> 00:16:00.519 holding the telescope 00:16:00.520 --> 00:16:06.799 and watching the moon using it. 00:16:06.800 --> 00:16:09.199 So now we have two separate interpretations 00:16:09.200 --> 00:16:11.679 in the same structure, 00:16:11.680 --> 00:16:15.519 and we were able to do all of this 00:16:15.520 --> 00:16:18.159 with a few very quick keystrokes. 00:16:18.160 --> 00:16:22.439 While we are at it, let's add another interesting thing 00:16:22.440 --> 00:16:25.159 to this node that represents "I": 00:16:25.160 --> 00:16:28.919 it can be "he," it can be "she," 00:16:28.920 --> 00:16:35.759 it can be "the children," or it can be "the people." 00:16:35.760 --> 00:16:45.039 Basically, any entity that has the capability to "see" 00:16:45.040 --> 00:16:53.359 can be substituted in this particular node. 00:16:53.360 --> 00:16:57.399 Let's see what we have here now. 00:16:57.400 --> 00:17:01.239 We're just getting sort of a zoomed-out view 00:17:01.240 --> 00:17:04.599 of the entire structure that we created, 00:17:04.600 --> 00:17:08.039 and essentially you can see that 00:17:08.040 --> 00:17:11.879 by just, you know, using a few keystrokes, 00:17:11.880 --> 00:17:17.839 we were able to capture two different interpretations 00:17:17.840 --> 00:17:20.879 of a simple sentence, 00:17:20.880 --> 00:17:23.759 and we were also able to add 00:17:23.760 --> 00:17:27.799 these alternate pieces of information 00:17:27.800 --> 00:17:30.559 that could help machine learning algorithms 00:17:30.560 --> 00:17:32.439 generalize better. 00:17:32.440 --> 00:17:36.239 All right. NOTE Different readings 00:17:36.240 --> 00:17:40.359 Now, let's look at the next thing. So in a sense, 00:17:40.360 --> 00:17:46.679 we can use this power of functional data structures 00:17:46.680 --> 00:17:50.239 to represent various potentially conflicting 00:17:50.240 --> 00:17:55.559 structural readings of that piece of text. 00:17:55.560 --> 00:17:58.079 In addition to that, we can also create more texts, 00:17:58.080 --> 00:17:59.799 each with a different structure, 00:17:59.800 --> 00:18:01.559 and have them all in the same place. 00:18:01.560 --> 00:18:04.239 This allows us to address the interpretation 00:18:04.240 --> 00:18:06.879 of a static sentence that might be occurring in the world, 00:18:06.880 --> 00:18:09.639 while simultaneously inserting information 00:18:09.640 --> 00:18:11.519 that would add more value to it. 00:18:11.520 --> 00:18:14.999 This makes the enrichment process also very efficient.
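NOTE Editor's aside: The OR-node manipulation just demonstrated might be captured in an Org outline roughly like this. This is an illustrative reconstruction by the transcript editor, not the actual buffer from the demo; the bracketed pseudo-parses and tag names are hypothetical.
    * Sentence: "I saw the moon with a telescope."
    ** OR
    *** a                                       :moon_holds_telescope:
        I [saw [the moon [with a telescope]]]
    *** b                                       :I_hold_telescope:
        I [saw [the moon] [with a telescope]]
Structural editing commands for cutting, duplicating, and demoting subtrees are what make splitting one reading into two alternatives like this a matter of a few keystrokes.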
00:18:15.000 --> 00:18:19.519 Additionally, we can envision 00:18:19.520 --> 00:18:23.999 a power user of the future, or the present, 00:18:24.000 --> 00:18:27.479 who can not only annotate a span, 00:18:27.480 --> 00:18:31.279 but also edit the information in situ 00:18:31.280 --> 00:18:34.639 in a way that would help machine learning algorithms 00:18:34.640 --> 00:18:36.879 generalize better by making more efficient use 00:18:36.880 --> 00:18:37.719 of the annotations. 00:18:37.720 --> 00:18:41.519 So together, Emacs and Org mode can speed up 00:18:41.520 --> 00:18:42.959 the enrichment of the signals 00:18:42.960 --> 00:18:44.519 in a way that allows us 00:18:44.520 --> 00:18:47.719 to focus on certain aspects and ignore others. 00:18:47.720 --> 00:18:50.839 An extremely complex landscape of rich structures 00:18:50.840 --> 00:18:53.039 can be captured consistently, 00:18:53.040 --> 00:18:55.639 in a fashion that allows computers 00:18:55.640 --> 00:18:56.759 to understand language. 00:18:56.760 --> 00:19:00.879 We can then build tools to enhance the tasks 00:19:00.880 --> 00:19:03.319 that we do in our everyday life. 00:19:03.320 --> 00:19:10.759 YAMR is the acronym, or the file type or specification, 00:19:10.760 --> 00:19:15.239 that we are creating to capture this new 00:19:15.240 --> 00:19:17.679 rich representation. NOTE Spontaneous speech 00:19:17.680 --> 00:19:21.959 We'll now look at an example of spontaneous speech 00:19:21.960 --> 00:19:24.799 that occurs in spoken conversations. 00:19:24.800 --> 00:19:28.599 Conversations frequently contain errors in speech: 00:19:28.600 --> 00:19:30.799 interruptions, disfluencies, 00:19:30.800 --> 00:19:33.959 verbal sounds such as a cough or a laugh, 00:19:33.960 --> 00:19:35.039 and other noises. 00:19:35.040 --> 00:19:38.199 In this sense, spontaneous speech is similar 00:19:38.200 --> 00:19:39.799 to a functional data stream. 00:19:39.800 --> 00:19:42.759 We cannot take back words that come out of our mouths, 00:19:42.760 --> 00:19:47.239 but we tend to make mistakes, and we correct ourselves 00:19:47.240 --> 00:19:49.039 as soon as we realize that 00:19:49.040 --> 00:19:50.679 we have misspoken. 00:19:50.680 --> 00:19:53.159 This process manifests through a combination 00:19:53.160 --> 00:19:56.279 of a handful of mechanisms, including immediate correction 00:19:56.280 --> 00:20:00.959 after an error, and we do this unconsciously. 00:20:00.960 --> 00:20:02.719 Computers, on the other hand, 00:20:02.720 --> 00:20:06.639 must be taught to understand these cases. 00:20:06.640 --> 00:20:12.799 What we see here is an example document or outline, 00:20:12.800 --> 00:20:18.119 or part of a document that illustrates 00:20:18.120 --> 00:20:22.919 various different aspects of the representation. 00:20:22.920 --> 00:20:25.919 We don't have a lot of time to go through 00:20:25.920 --> 00:20:28.239 many of the details. 00:20:28.240 --> 00:20:31.759 I would highly encourage you to... 00:20:31.760 --> 00:20:39.159 I'm planning on making some videos, or asciinema recordings, 00:20:39.160 --> 00:20:42.559 that I'll be posting, and, 00:20:42.560 --> 00:20:46.759 if you're interested, you can go through those. 00:20:46.760 --> 00:20:50.359 The idea here is to try to do 00:20:50.360 --> 00:20:54.599 a slightly more complex use case.
00:20:54.600 --> 00:20:57.639 But again, given the time constraint 00:20:57.640 --> 00:21:00.279 and the amount of information 00:21:00.280 --> 00:21:01.519 that needs to fit on the screen, 00:21:01.520 --> 00:21:05.559 this may not be very informative, 00:21:05.560 --> 00:21:08.399 but at least it will give you some idea 00:21:08.400 --> 00:21:10.439 of what is possible. 00:21:10.440 --> 00:21:13.279 And in this particular case, what you're seeing is that 00:21:13.280 --> 00:21:18.319 there is a sentence which is "What I'm I'm tr- telling now." 00:21:18.320 --> 00:21:21.159 Essentially, there is a repetition of the word "I'm", 00:21:21.160 --> 00:21:23.279 and then there is a partial word 00:21:23.280 --> 00:21:25.159 where somebody tried to say "telling", 00:21:25.160 --> 00:21:29.599 but started saying "tr-", and then corrected themselves 00:21:29.600 --> 00:21:30.959 and said, "telling now." 00:21:30.960 --> 00:21:39.239 So in this case, you see, we can capture words 00:21:39.240 --> 00:21:44.919 or a sequence of words, or a sequence of tokens. 00:21:44.920 --> 00:21:52.279 An interesting thing to note is that in NLP, 00:21:52.280 --> 00:21:55.319 we sometimes have to break 00:21:55.320 --> 00:22:01.199 words that don't have spaces into two separate words, 00:22:01.200 --> 00:22:04.119 especially contractions like "I'm", 00:22:04.120 --> 00:22:08.199 so the syntactic parser needs two separate nodes. 00:22:08.200 --> 00:22:11.199 But anyway, you can see that here. 00:22:11.200 --> 00:22:15.759 What this view shows is that 00:22:15.760 --> 00:22:19.759 with each of the nodes in the sentence 00:22:19.760 --> 00:22:23.079 or in the representation, 00:22:23.080 --> 00:22:26.079 you can have a lot of different properties 00:22:26.080 --> 00:22:27.559 that you can attach to them, 00:22:27.560 --> 00:22:30.119 and these properties are typically hidden, 00:22:30.120 --> 00:22:32.719 like you saw in the earlier slide. 00:22:32.720 --> 00:22:35.599 But you can make use of all these properties 00:22:35.600 --> 00:22:39.439 to do various kinds of searches and filtering. 00:22:39.440 --> 00:22:43.519 And on the right-hand side here-- 00:22:43.520 --> 00:22:48.799 this is actually not legitimate syntax-- 00:22:48.800 --> 00:22:51.279 but on the right are descriptions 00:22:51.280 --> 00:22:53.479 of what each of these represents. 00:22:53.480 --> 00:22:57.319 All the information is also available in the article. 00:22:57.320 --> 00:23:04.279 You can see there how much rich context 00:23:04.280 --> 00:23:05.879 you can capture. 00:23:05.880 --> 00:23:08.799 This is just a closer snapshot 00:23:08.800 --> 00:23:10.159 of the properties on the node, 00:23:10.160 --> 00:23:13.119 and you can see we can have things like 00:23:13.120 --> 00:23:14.799 whether the word is a token or not, 00:23:14.800 --> 00:23:17.359 or whether it's incomplete, whether some words 00:23:17.360 --> 00:23:19.959 should be filtered out for parsing, 00:23:19.960 --> 00:23:23.039 which we can mark with PARSE_IGNORE, 00:23:23.040 --> 00:23:25.519 or whether some words are restart markers, 00:23:25.520 --> 00:23:29.239 which we can mark with a RESTART_MARKER, or sometimes 00:23:29.240 --> 00:23:31.999 some of these might have durations. Things like that. NOTE Editing properties in column view 00:23:32.000 --> 00:23:38.799 The other fascinating thing about this representation 00:23:38.800 --> 00:23:42.599 is that you can edit properties in the column view.
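NOTE Editor's aside: Token-level properties of the kind listed above (PARSE_IGNORE, RESTART_MARKER, durations) and the column view described next can be sketched roughly as follows. Apart from the two property names mentioned in the talk, the keys, values, and the #+COLUMNS specification are hypothetical.
    #+COLUMNS: %20ITEM %TOKEN %INCOMPLETE %PARSE_IGNORE %RESTART_MARKER %DURATION
    ** tr-
       :PROPERTIES:
       :TOKEN:          t
       :INCOMPLETE:     t
       :PARSE_IGNORE:   t
       :RESTART_MARKER: t
       :DURATION:       0.21
       :END:
Running org-columns (C-c C-x C-c) on such a subtree is what produces the editable tabular view discussed below, with the property values laid out as columns.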
00:23:42.600 --> 00:23:45.399 And suddenly, you have this tabular data structure 00:23:45.400 --> 00:23:48.879 combined with the hierarchical data structure. 00:23:48.880 --> 00:23:53.119 And you may not be able to see it here, 00:23:53.120 --> 00:23:56.879 but what has also happened here is that 00:23:56.880 --> 00:24:01.159 some of the tags have been inherited 00:24:01.160 --> 00:24:02.479 from the earlier nodes. 00:24:02.480 --> 00:24:07.919 And so you get a much fuller picture of things. 00:24:07.920 --> 00:24:13.919 Essentially, you can filter out the things 00:24:13.920 --> 00:24:15.319 that you want to process, 00:24:15.320 --> 00:24:20.279 process them, and then reintegrate them into the whole. NOTE Conclusion 00:24:20.280 --> 00:24:25.479 So, in conclusion, today we have proposed and demonstrated 00:24:25.480 --> 00:24:27.559 the use of an architecture (GRAIL), 00:24:27.560 --> 00:24:31.319 which allows the representation, manipulation, 00:24:31.320 --> 00:24:34.759 and aggregation of rich linguistic structures 00:24:34.760 --> 00:24:36.519 in a systematic fashion. 00:24:36.520 --> 00:24:41.359 We have shown how GRAIL advances the tools 00:24:41.360 --> 00:24:44.599 available for building machine learning models 00:24:44.600 --> 00:24:46.879 that simulate understanding. 00:24:46.880 --> 00:24:51.679 Thank you very much for your time and attention today. 00:24:51.680 --> 00:24:54.639 My contact information is on this slide. 00:24:54.640 --> 00:25:02.599 If you are interested in an additional example 00:25:02.600 --> 00:25:05.439 that demonstrates the representation 00:25:05.440 --> 00:25:08.039 of speech and written text together, 00:25:08.040 --> 00:25:10.719 please continue watching. 00:25:10.720 --> 00:25:12.199 Otherwise, you can stop here 00:25:12.200 --> 00:25:15.279 and enjoy the rest of the conference. NOTE Bonus material 00:25:15.280 --> 00:25:39.079 Welcome to the bonus material. 00:25:39.080 --> 00:25:43.959 I'm glad that some of you stuck around. 00:25:43.960 --> 00:25:46.559 We are now going to examine an instance 00:25:46.560 --> 00:25:49.159 of speech and text signals together 00:25:49.160 --> 00:25:51.479 that produce multiple layers. 00:25:51.480 --> 00:25:54.839 When we take a spoken conversation 00:25:54.840 --> 00:25:58.719 and use the best language processing models available, 00:25:58.720 --> 00:26:00.679 we suddenly hit a hard spot 00:26:00.680 --> 00:26:03.239 because the tools are typically not trained 00:26:03.240 --> 00:26:05.359 to filter out the unnecessary cruft 00:26:05.360 --> 00:26:07.559 in order to automatically interpret 00:26:07.560 --> 00:26:09.559 the part of what is being said 00:26:09.560 --> 00:26:11.799 that is actually relevant. 00:26:11.800 --> 00:26:14.639 Over time, language researchers 00:26:14.640 --> 00:26:17.719 have created many interdependent layers of annotations, 00:26:17.720 --> 00:26:21.039 yet the assumptions underlying them are seldom the same. 00:26:21.040 --> 00:26:25.039 Piecing together such related but disjointed annotations 00:26:25.040 --> 00:26:28.039 and their predictions poses a huge challenge. 00:26:28.040 --> 00:26:30.719 This is another place where we can leverage 00:26:30.720 --> 00:26:33.119 the data model underlying the Emacs editor, 00:26:33.120 --> 00:26:35.359 along with the structural editing capabilities 00:26:35.360 --> 00:26:38.519 of Org mode, to improve current tools. 00:26:38.520 --> 00:26:42.839 Let's take this very simple-looking utterance.
00:26:42.840 --> 00:26:48.039 "Um {lipsmack} and that's it. ({laugh})" 00:26:48.040 --> 00:26:50.319 So this is-- 00:26:50.320 --> 00:26:54.519 what you are seeing here is a transcript of an audio signal 00:26:54.520 --> 00:27:00.759 that has a lip smack and a laugh as part of it, 00:27:00.760 --> 00:27:04.199 and there is also an interjection, "um." 00:27:04.200 --> 00:27:08.199 So this has a few interesting noises 00:27:08.200 --> 00:27:13.999 and specific things that will be illustrative 00:27:14.000 --> 00:27:20.479 of how we are going to represent it. NOTE Syntactic analysis 00:27:20.480 --> 00:27:25.839 Okay. So let's say you want to have 00:27:25.840 --> 00:27:28.879 a syntactic analysis of this sentence or utterance. 00:27:28.880 --> 00:27:30.959 One common technique people use 00:27:30.960 --> 00:27:32.879 is just to remove the cruft, and, you know, 00:27:32.880 --> 00:27:35.079 write some rules, clean up the utterance, 00:27:35.080 --> 00:27:36.719 make it look like it's proper English, 00:27:36.720 --> 00:27:40.239 and then, you know, tokenize it, 00:27:40.240 --> 00:27:43.079 and basically just use standard tools to process it. 00:27:43.080 --> 00:27:47.279 But in that process, they end up eliminating 00:27:47.280 --> 00:27:51.119 valid pieces of signal that have meaning to others 00:27:51.120 --> 00:27:52.799 studying different phenomena of language. 00:27:52.800 --> 00:27:56.479 Here you have the rich transcript, 00:27:56.480 --> 00:28:00.119 the input to the syntactic parser. 00:28:00.120 --> 00:28:05.919 As you can see, there is a little tokenization happening 00:28:05.920 --> 00:28:07.199 where you'll be inserting a space 00:28:07.200 --> 00:28:12.119 between "that" and the contracted is ('s), 00:28:12.120 --> 00:28:15.599 and between the "it" and the period, 00:28:15.600 --> 00:28:18.199 and the output of the syntactic parser is shown below, 00:28:18.200 --> 00:28:21.639 which (surprise) is an S-expression. 00:28:21.640 --> 00:28:24.919 Like I said, the parse trees, when they were created, 00:28:24.920 --> 00:28:29.799 and still largely when they are used, are S-expressions, 00:28:29.800 --> 00:28:32.999 and most of the viewers here 00:28:33.000 --> 00:28:35.119 should not have much problem reading them. 00:28:35.120 --> 00:28:37.279 You can see the tree structure 00:28:37.280 --> 00:28:39.279 of the syntactic parse here. NOTE Forced alignment 00:28:39.280 --> 00:28:40.919 Now let's say you want to integrate 00:28:40.920 --> 00:28:44.479 phonetic information, or the phonetic layer, 00:28:44.480 --> 00:28:49.119 that's in the audio signal, and do some analysis. 00:28:49.120 --> 00:28:57.519 Now, that would require a few steps. 00:28:57.520 --> 00:29:01.679 First, you would need to align the transcript 00:29:01.680 --> 00:29:06.479 with the audio. This process is called forced alignment, 00:29:06.480 --> 00:29:10.399 where you already know what the transcript is, 00:29:10.400 --> 00:29:14.599 and you have the audio, and you can get a good alignment 00:29:14.600 --> 00:29:17.599 using both pieces of information. 00:29:17.600 --> 00:29:20.119 And this is typically a technique that is used to 00:29:20.120 --> 00:29:23.079 create training data for training 00:29:23.080 --> 00:29:25.839 automatic speech recognizers.
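NOTE Editor's aside: The parser output referred to above appears on a slide rather than in the captions. For readers following the text alone, a Penn-Treebank-style S-expression for the cleaned-up utterance "and that's it." might look roughly like the following; this is an illustrative parse supplied by the transcript editor, not the one on the slide.
    (S (CC and)
       (NP (DT that))
       (VP (VBZ 's)
           (NP (PRP it)))
       (. .))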
00:29:25.840 --> 00:29:29.639 One interesting thing is that in order to do 00:29:29.640 --> 00:29:32.879 this forced alignment, you have to keep 00:29:32.880 --> 00:29:35.799 the non-speech events in the transcript, 00:29:35.800 --> 00:29:39.079 because they consume some audio signal, 00:29:39.080 --> 00:29:41.399 and if you don't have that signal, 00:29:41.400 --> 00:29:44.399 the alignment process doesn't know exactly... 00:29:44.400 --> 00:29:45.759 you know, it doesn't do a good job, 00:29:45.760 --> 00:29:50.039 because it needs to align all parts of the signal 00:29:50.040 --> 00:29:54.999 with something: either a pause, silence, noise, or words. 00:29:55.000 --> 00:29:59.719 Interestingly, punctuation really doesn't factor in, 00:29:59.720 --> 00:30:01.559 because we don't speak in punctuation. 00:30:01.560 --> 00:30:04.239 So one of the things that you need to do 00:30:04.240 --> 00:30:05.679 is remove most of the punctuation, 00:30:05.680 --> 00:30:08.039 although you'll see there are some punctuation marks 00:30:08.040 --> 00:30:12.599 that can be kept, or that have to be kept. NOTE Alignment before tokenization 00:30:12.600 --> 00:30:15.319 And the other thing is that the alignment has to be done 00:30:15.320 --> 00:30:20.159 before tokenization, as tokenization impacts pronunciation. 00:30:20.160 --> 00:30:24.399 To show an example: here you see "that's". 00:30:24.400 --> 00:30:26.919 When it's one word, 00:30:26.920 --> 00:30:31.959 it has a slightly different pronunciation 00:30:31.960 --> 00:30:35.679 than when it is two words, "that" and "is", 00:30:35.680 --> 00:30:38.399 as you can see with "is." And so, 00:30:38.400 --> 00:30:44.279 if you split the tokens or split the words 00:30:44.280 --> 00:30:48.119 in order for the syntactic parser to process them, 00:30:48.120 --> 00:30:51.599 you would end up getting the wrong phonetic analysis. 00:30:51.600 --> 00:30:54.239 And if you process it 00:30:54.240 --> 00:30:55.319 through the phonetic analysis, 00:30:55.320 --> 00:30:59.159 and you don't know how to integrate it 00:30:59.160 --> 00:31:02.719 with the tokenized syntax, 00:31:02.720 --> 00:31:07.519 that can be pretty tricky. And a lot of the time, 00:31:07.520 --> 00:31:10.759 people write one-off pieces of code that handle these cases, 00:31:10.760 --> 00:31:14.279 but the idea here is to try to have a general architecture 00:31:14.280 --> 00:31:17.239 that seamlessly integrates all these pieces. 00:31:17.240 --> 00:31:21.319 Then you do the syntactic parsing of the remaining tokens. 00:31:21.320 --> 00:31:24.799 Then you align the data and the two annotations, 00:31:24.800 --> 00:31:27.959 and then integrate the two layers. 00:31:27.960 --> 00:31:31.359 Once that is done, then you can do all kinds of 00:31:31.360 --> 00:31:33.919 interesting analyses, test various hypotheses, 00:31:33.920 --> 00:31:35.279 and generate statistics, 00:31:35.280 --> 00:31:39.359 but without that you are only dealing 00:31:39.360 --> 00:31:42.879 with one or the other part. NOTE Layers 00:31:42.880 --> 00:31:48.319 Let's just take a quick look at what each of the layers 00:31:48.320 --> 00:31:51.159 involved looks like. 00:31:51.160 --> 00:31:56.719 So this is "Um {lipsmack}, and that's it. {laugh}" 00:31:56.720 --> 00:32:00.159 This is the transcript, and on the right-hand side, 00:32:00.160 --> 00:32:04.199 you see the same transcript 00:32:04.200 --> 00:32:06.239 listed vertically in a column.
00:32:06.240 --> 00:32:08.199 You'll see why in just a second. 00:32:08.200 --> 00:32:09.879 And there are 00:32:09.880 --> 00:32:11.279 some rows that are empty, 00:32:11.280 --> 00:32:15.079 some rows that are wider than the others, and we'll see why. 00:32:15.080 --> 00:32:19.319 The next is the tokenized sentence 00:32:19.320 --> 00:32:20.959 where you have spaces added, 00:32:20.960 --> 00:32:23.599 you know, a space between these two tokens: 00:32:23.600 --> 00:32:26.599 "that" and the apostrophe "s" ('s), 00:32:26.600 --> 00:32:28.079 and between the "it" and the period. 00:32:28.080 --> 00:32:30.679 And you see on the right-hand side 00:32:30.680 --> 00:32:33.559 that the tokens have attributes. 00:32:33.560 --> 00:32:36.439 So there is a token index, and there are, 00:32:36.440 --> 00:32:38.839 you know, 0, 1, 2, 3, 4, 5 tokens, 00:32:38.840 --> 00:32:41.479 and each token has a start and end character, 00:32:41.480 --> 00:32:45.799 and a space also has a start and end character, 00:32:45.800 --> 00:32:50.399 and is represented by "sp". And there are 00:32:50.400 --> 00:32:54.319 these other things that we removed, 00:32:54.320 --> 00:32:56.239 like the "{LS}" which is for "{lipsmack}" 00:32:56.240 --> 00:32:59.399 and "{LG}" which is for "{laugh}", shown grayed out, 00:32:59.400 --> 00:33:02.439 and you'll see why some of these things are grayed out 00:33:02.440 --> 00:33:03.399 in a little bit. 00:33:03.400 --> 00:33:11.919 This is what the forced alignment tool produces. 00:33:11.920 --> 00:33:17.159 Basically, it takes the transcript, 00:33:17.160 --> 00:33:19.159 and this is the transcript 00:33:19.160 --> 00:33:24.119 with slightly different symbols, 00:33:24.120 --> 00:33:26.239 because different tools use different symbols 00:33:26.240 --> 00:33:28.159 and have their various configuration details. 00:33:28.160 --> 00:33:33.679 But this is what is used to get an alignment 00:33:33.680 --> 00:33:36.039 or time alignment with phones. 00:33:36.040 --> 00:33:40.079 So this column shows the phones for each word. 00:33:40.080 --> 00:33:43.879 So, for example, "and" has been aligned with these phones, 00:33:43.880 --> 00:33:46.879 and the start and end here 00:33:46.880 --> 00:33:52.959 are essentially temporal timestamps 00:33:52.960 --> 00:33:54.279 that have been aligned to it. 00:33:54.280 --> 00:34:00.759 Interestingly, sometimes we don't really have any pause 00:34:00.760 --> 00:34:05.159 or any time duration between some words, 00:34:05.160 --> 00:34:08.199 and those are highlighted in gray here. 00:34:08.200 --> 00:34:12.759 See, there's this space... Actually 00:34:12.760 --> 00:34:17.799 it does not have any temporal content, 00:34:17.800 --> 00:34:21.319 whereas this other space has some duration. 00:34:21.320 --> 00:34:24.839 So the ones that have some duration are captured, 00:34:24.840 --> 00:34:29.519 while the others are the ones that in the earlier diagram 00:34:29.520 --> 00:34:31.319 we saw were left out. NOTE Variations 00:34:31.320 --> 00:34:37.639 And the aligner actually produces multiple files.
00:34:37.640 --> 00:34:44.399 One of the files has a slightly different 00:34:44.400 --> 00:34:46.679 variation on the same information, 00:34:46.680 --> 00:34:49.999 and in this case, you can see 00:34:50.000 --> 00:34:52.399 that the punctuation is missing, 00:34:52.400 --> 00:34:57.599 and the punctuation is, you know, deliberately missing, 00:34:57.600 --> 00:35:02.279 because there is no time associated with it, 00:35:02.280 --> 00:35:06.439 and you see that it's not the tokenized sentence-- 00:35:06.440 --> 00:35:17.119 not the tokenized words. This now gives you a full table, 00:35:17.120 --> 00:35:21.239 and you can't really look at it very carefully here. 00:35:21.240 --> 00:35:25.879 But we can focus on the part that seems legible, 00:35:25.880 --> 00:35:28.559 or, you know, like a properly written sentence, 00:35:28.560 --> 00:35:32.879 process it, and reincorporate it back into the whole. 00:35:32.880 --> 00:35:35.879 So if somebody wants to look at, for example, 00:35:35.880 --> 00:35:39.679 how many pauses the person made while they were talking, 00:35:39.680 --> 00:35:42.919 they can actually measure the pauses--the number, 00:35:42.920 --> 00:35:46.279 the duration--and make connections between that 00:35:46.280 --> 00:35:49.639 and the rich syntactic structure that is being produced. 00:35:49.640 --> 00:35:57.279 And in order to do that, you have to get these layers 00:35:57.280 --> 00:35:59.039 to align with each other, 00:35:59.040 --> 00:36:04.359 and this table is just a tabular representation 00:36:04.360 --> 00:36:08.679 of the information that we'll be storing in the YAMR file. 00:36:08.680 --> 00:36:11.719 Congratulations! You have reached 00:36:11.720 --> 00:36:13.479 the end of this demonstration. 00:36:13.480 --> 00:36:17.000 Thank you for your time and attention.