WEBVTT captioned by sameer

NOTE Introduction

00:00:00.000 --> 00:00:05.839
Thank you for joining me today. I'm Sameer Pradhan

00:00:05.840 --> 00:00:07.799
from the Linguistic Data Consortium

00:00:07.800 --> 00:00:10.079
at the University of Pennsylvania

00:00:10.080 --> 00:00:14.519
and founder of cemantix.org.

00:00:14.520 --> 00:00:16.879
Today we'll be addressing research

00:00:16.880 --> 00:00:18.719
in computational linguistics,

00:00:18.720 --> 00:00:22.039
also known as natural language processing,

00:00:22.040 --> 00:00:24.719
a sub-area of artificial intelligence

00:00:24.720 --> 00:00:27.759
with a focus on modeling and predicting

00:00:27.760 --> 00:00:31.919
complex linguistic structures from various signals.

00:00:31.920 --> 00:00:35.799
The work we present is limited to text and speech signals,

00:00:35.800 --> 00:00:38.639
but it can be extended to other signals.

00:00:38.640 --> 00:00:40.799
We propose an architecture,

00:00:40.800 --> 00:00:42.959
and we call it GRAIL, which allows

00:00:42.960 --> 00:00:44.639
the representation and aggregation

00:00:44.640 --> 00:00:50.199
of such rich structures in a systematic fashion.

00:00:50.200 --> 00:00:52.679
I'll demonstrate a proof of concept

00:00:52.680 --> 00:00:56.559
for representing and manipulating data and annotations

00:00:56.560 --> 00:00:58.519
for the specific purpose of building

00:00:58.520 --> 00:01:02.879
machine learning models that simulate understanding.

00:01:02.880 --> 00:01:05.679
These technologies have the potential for impact

00:01:05.680 --> 00:01:09.119
in almost every conceivable field

00:01:09.120 --> 00:01:13.399
that generates and uses data.

NOTE Processing language

00:01:13.400 --> 00:01:15.039
We process human language

00:01:15.040 --> 00:01:16.719
when our brains receive and assimilate

00:01:16.720 --> 00:01:20.079
various signals which are then manipulated

00:01:20.080 --> 00:01:23.879
and interpreted within a syntactic structure.

00:01:23.880 --> 00:01:27.319
It's a complex process that I have simplified here

00:01:27.320 --> 00:01:30.759
for the purpose of comparison to machine learning.

00:01:30.760 --> 00:01:33.959
Recent machine learning models tend to require

00:01:33.960 --> 00:01:37.039
a large amount of raw, naturally occurring data

00:01:37.040 --> 00:01:40.199
and a varying amount of manually enriched data,

00:01:40.200 --> 00:01:43.199
commonly known as "annotations".

00:01:43.200 --> 00:01:45.959
Owing to the complex and numerous nature

00:01:45.960 --> 00:01:49.959
of linguistic phenomena, we have most often used

00:01:49.960 --> 00:01:52.999
a divide and conquer approach.

00:01:53.000 --> 00:01:55.399
The strength of this approach is that it allows us

00:01:55.400 --> 00:01:58.159
to focus on a single, or perhaps a few related

00:01:58.160 --> 00:02:00.439
linguistic phenomena.

00:02:00.440 --> 00:02:03.879
The weaknesses are, first, that the universe of these phenomena

00:02:03.880 --> 00:02:07.239
keeps expanding, as language itself

00:02:07.240 --> 00:02:09.359
evolves and changes over time,

00:02:09.360 --> 00:02:13.119
and second, this approach requires an additional task

00:02:13.120 --> 00:02:14.839
of aggregating the interpretations,

00:02:14.840 --> 00:02:18.359
creating more opportunities for computer error.

00:02:18.360 --> 00:02:21.519
Our challenge, then, is to find the sweet spot

00:02:21.520 --> 00:02:25.239
that allows us to encode complex information

00:02:25.240 --> 00:02:27.719
without the use of manual annotation,

00:02:27.720 --> 00:02:34.559
or without the additional task of aggregation by computers.

NOTE Annotation

00:02:34.560 --> 00:02:37.119
So what do I mean by "annotation"?

00:02:37.120 --> 00:02:39.759
In this talk the word annotation refers to

00:02:39.760 --> 00:02:43.519
the manual assignment of certain attributes

00:02:43.520 --> 00:02:48.639
to portions of a signal which is necessary

00:02:48.640 --> 00:02:51.639
to perform the end task.

00:02:51.640 --> 00:02:54.439
For example, in order for the algorithm

00:02:54.440 --> 00:02:57.439
to accurately interpret a pronoun,

00:02:57.440 --> 00:03:00.279
it needs to know

00:03:00.280 --> 00:03:03.799
what that pronoun refers back to.

00:03:03.800 --> 00:03:06.719
We may find this task trivial, however,

00:03:06.720 --> 00:03:10.599
current algorithms repeatedly fail in this task.

00:03:10.600 --> 00:03:13.319
So the complexities of understanding

00:03:13.320 --> 00:03:16.639
in computational linguistics require annotation.

00:03:16.640 --> 00:03:20.799
The word "annotation" itself is a useful example,

00:03:20.800 --> 00:03:22.679
because it also reminds us

00:03:22.680 --> 00:03:25.119
that words have multiple meanings

00:03:25.120 --> 00:03:27.519
as annotation itself does—

00:03:27.520 --> 00:03:30.559
just as I needed to define it in this context,

00:03:30.560 --> 00:03:33.799
so that my message won't be misinterpreted.

00:03:33.800 --> 00:03:39.039
So, too, must annotators do this for algorithms

00:03:39.040 --> 00:03:43.239
through manual intervention.

NOTE Learning from data

00:03:43.240 --> 00:03:44.759
Learning from raw data

00:03:44.760 --> 00:03:47.039
(commonly known as unsupervised learning)

00:03:47.040 --> 00:03:50.079
poses limitations for machine learning.

00:03:50.080 --> 00:03:53.039
As I described, modeling complex phenomena

00:03:53.040 --> 00:03:55.559
needs manual annotations.

00:03:55.560 --> 00:03:58.559
The learning algorithm uses these annotations

00:03:58.560 --> 00:04:01.319
as examples to build statistical models.

00:04:01.320 --> 00:04:04.879
This is called supervised learning.

00:04:04.880 --> 00:04:06.319
Without going into too much detail,

00:04:06.320 --> 00:04:10.039
I'll simply note that the recent popularity

00:04:10.040 --> 00:04:12.519
of the concept of deep learning

00:04:12.520 --> 00:04:14.679
is that evolutionary step

00:04:14.680 --> 00:04:17.319
where we have learned to train models

00:04:17.320 --> 00:04:20.799
using trillions of parameters in ways that they can

00:04:20.800 --> 00:04:25.079
learn richer hierarchical structures

00:04:25.080 --> 00:04:29.399
from very large amounts of unannotated data.

00:04:29.400 --> 00:04:32.319
These models can then be fine-tuned,

00:04:32.320 --> 00:04:35.599
using varying amounts of annotated examples

00:04:35.600 --> 00:04:37.639
depending on the complexity of the task

00:04:37.640 --> 00:04:39.679
to generate better predictions.

NOTE Manual annotation

00:04:39.680 --> 00:04:44.919
As you might imagine, manually annotating

00:04:44.920 --> 00:04:47.359
complex linguistic phenomena

00:04:47.360 --> 00:04:51.719
can be a very specific, labor-intensive task.

00:04:51.720 --> 00:04:54.279
For example, imagine if we were

00:04:54.280 --> 00:04:56.399
to go back through this presentation

00:04:56.400 --> 00:04:58.399
and connect all the pronouns

00:04:58.400 --> 00:04:59.919
with the nouns to which they refer.

00:04:59.920 --> 00:05:03.239
Even for a short 18-minute presentation,

00:05:03.240 --> 00:05:05.239
this would require hundreds of annotations.

00:05:05.240 --> 00:05:08.519
The models we build are only as good

00:05:08.520 --> 00:05:11.119
as the quality of the annotations we make.

00:05:11.120 --> 00:05:12.679
We need guidelines

00:05:12.680 --> 00:05:15.759
that ensure that the annotations are done

00:05:15.760 --> 00:05:19.719
by at least two humans who have substantial agreement

00:05:19.720 --> 00:05:22.119
with each other in their interpretations.

00:05:22.120 --> 00:05:25.599
We know that if we try to train a model using annotations

00:05:25.600 --> 00:05:28.519
that are very subjective, or have more noise,

00:05:28.520 --> 00:05:30.919
we will receive poor predictions.

00:05:30.920 --> 00:05:33.679
Additionally, there is the concern of introducing

00:05:33.680 --> 00:05:37.079
various unexpected biases into one's models.

00:05:37.080 --> 00:05:44.399
So annotation is really both an art and a science.

NOTE How can we develop a unified representation?

00:05:44.400 --> 00:05:47.439
In the remaining time,

00:05:47.440 --> 00:05:49.999
we will turn to two fundamental questions.

00:05:50.000 --> 00:05:54.239
First, how can we develop a unified representation

00:05:54.240 --> 00:05:55.599
of data and annotations

00:05:55.600 --> 00:05:59.759
that encompasses arbitrary levels of linguistic information?

00:05:59.760 --> 00:06:03.839
There is a long history of attempting to answer

00:06:03.840 --> 00:06:04.839
this first question.

00:06:04.840 --> 00:06:08.839
This history is documented in our recent article,

00:06:08.840 --> 00:06:11.519
and you can refer to that article.

00:06:11.520 --> 00:06:16.719
It will be on the website.

00:06:16.720 --> 00:06:18.999
It is as if we, as a community,

00:06:19.000 --> 00:06:22.519
have been searching for our own Holy Grail.

NOTE What role might Emacs and Org mode play?

00:06:22.520 --> 00:06:26.519
The second question we will pose is

00:06:26.520 --> 00:06:30.159
what role might Emacs, along with Org mode,

00:06:30.160 --> 00:06:31.919
play in this process?

00:06:31.920 --> 00:06:35.359
Well, the solution itself may not be tied to Emacs.

00:06:35.360 --> 00:06:38.359
But Emacs has built-in capabilities

00:06:38.360 --> 00:06:42.599
that could be useful for evaluating potential solutions.

00:06:42.600 --> 00:06:45.759
It's also one of the most extensively documented

00:06:45.760 --> 00:06:48.519
pieces of software and the most customizable

00:06:48.520 --> 00:06:51.599
piece of software that I have ever come across,

00:06:51.600 --> 00:06:55.279
and many would agree with that.

NOTE The complex structure of language

00:06:55.280 --> 00:07:00.639
In order to approach this second question,

00:07:00.640 --> 00:07:03.919
we turn to the complex structure of language itself.

00:07:03.920 --> 00:07:07.679
At first glance, language appears to us

00:07:07.680 --> 00:07:09.879
as a series of words.

00:07:09.880 --> 00:07:13.439
Words form sentences, sentences form paragraphs,

00:07:13.440 --> 00:07:16.239
and paragraphs form a complete text.

00:07:16.240 --> 00:07:19.039
If this was a sufficient description

00:07:19.040 --> 00:07:21.159
of the complexity of language,

00:07:21.160 --> 00:07:24.199
all of us would be able to speak and read

00:07:24.200 --> 00:07:26.559
at least ten different languages.

00:07:26.560 --> 00:07:29.279
We know it is much more complex than this.

00:07:29.280 --> 00:07:33.199
There is a rich, underlying recursive tree structure--

00:07:33.200 --> 00:07:36.439
in fact, many possible tree structures

00:07:36.440 --> 00:07:39.439
which makes a particular sequence meaningful

00:07:39.440 --> 00:07:42.079
and many others meaningless.

00:07:42.080 --> 00:07:45.239
One of the better understood tree structures

00:07:45.240 --> 00:07:47.119
is the syntactic structure.

00:07:47.120 --> 00:07:49.439
While natural language

00:07:49.440 --> 00:07:51.679
has rich ambiguities and complexities,

00:07:51.680 --> 00:07:55.119
programming languages are designed to be parsed

00:07:55.120 --> 00:07:56.999
and interpreted deterministically.

00:07:57.000 --> 00:08:02.159
Emacs has been used for programming very effectively.

00:08:02.160 --> 00:08:05.359
So there is a potential for using Emacs

00:08:05.360 --> 00:08:06.559
as a tool for annotation.

00:08:06.560 --> 00:08:10.799
This would significantly improve our current set of tools.

NOTE Annotation tools

00:08:10.800 --> 00:08:16.559
It is important to note that most of the annotation tools

00:08:16.560 --> 00:08:19.639
that have been developed over the past few decades

00:08:19.640 --> 00:08:22.879
have relied on graphical interfaces,

00:08:22.880 --> 00:08:26.919
even those used for enriching textual information.

00:08:26.920 --> 00:08:30.399
Most of the tools in current use

00:08:30.400 --> 00:08:36.159
are designed for an end user to add very specific,

00:08:36.160 --> 00:08:38.639
very restricted information.

00:08:38.640 --> 00:08:42.799
We have not really made use of the potential

00:08:42.800 --> 00:08:45.639
that an editor or a rich editing environment like Emacs

00:08:45.640 --> 00:08:47.239
can add to the mix.

00:08:47.240 --> 00:08:52.479
Emacs has long enabled the editing and manipulation of

00:08:52.480 --> 00:08:56.359
complex embedded tree structures abundant in source code.

00:08:56.360 --> 00:08:58.599
So it's not difficult to imagine that it would have

00:08:58.600 --> 00:09:00.359
many capabilities that we need

00:09:00.360 --> 00:09:02.599
to represent actual language.

00:09:02.600 --> 00:09:04.759
In fact, it already does that with features

00:09:04.760 --> 00:09:06.399
that allow us to quickly navigate

00:09:06.400 --> 00:09:07.919
through sentences and paragraphs,

00:09:07.920 --> 00:09:09.799
with just a few keystrokes,

00:09:09.800 --> 00:09:13.599
or to add various text properties to text spans

00:09:13.600 --> 00:09:17.039
to create overlays, to name but a few.

00:09:17.040 --> 00:09:22.719
Emacs has figured out how to handle Unicode,

00:09:22.720 --> 00:09:26.799
so you don't even have to worry about the complexity

00:09:26.800 --> 00:09:29.439
of managing multiple languages.

00:09:29.440 --> 00:09:34.039
It's built into Emacs. In fact, this is not the first time

00:09:34.040 --> 00:09:37.399
Emacs has been used for linguistic analysis.

00:09:37.400 --> 00:09:41.159
One of the breakthrough moments in

00:09:41.160 --> 00:09:44.439
natural language processing was the creation

00:09:44.440 --> 00:09:48.639
of manually created syntactic trees

00:09:48.640 --> 00:09:50.439
for a 1 million word collection

00:09:50.440 --> 00:09:52.399
of Wall Street Journal articles.

00:09:52.400 --> 00:09:54.879
This was back around 1992,

00:09:54.880 --> 00:09:59.279
before Java or graphical interfaces were common.

00:09:59.280 --> 00:10:03.279
The tool that was used to create that corpus was Emacs.

00:10:03.280 --> 00:10:08.959
It was created at UPenn, and is famously known as

00:10:08.960 --> 00:10:12.719
the Penn Treebank. '92 was about when

00:10:12.720 --> 00:10:16.439
the Linguistic Data Consortium was also established,

00:10:16.440 --> 00:10:18.039
and it's been about 30 years

00:10:18.040 --> 00:10:20.719
that it has been creating various

00:10:20.720 --> 00:10:22.359
language-related resources.

NOTE Org mode

00:10:22.360 --> 00:10:28.519
Org mode--in particular, the outlining mode,

00:10:28.520 --> 00:10:32.399
or rather the enhanced form of outlining mode--

00:10:32.400 --> 00:10:35.599
allows us to create rich outlines,

00:10:35.600 --> 00:10:37.799
attaching properties to nodes,

00:10:37.800 --> 00:10:41.119
and provides commands for easily customizing

00:10:41.120 --> 00:10:43.879
sorting of various pieces of information

00:10:43.880 --> 00:10:45.639
as per one's requirement.

00:10:45.640 --> 00:10:50.239
This can also be a very useful tool.

00:10:50.240 --> 00:10:59.159
This enhanced form of outline-mode adds more power to Emacs.

00:10:59.160 --> 00:11:03.359
It provides commands for easily customizing

00:11:03.360 --> 00:11:05.159
and filtering information,

00:11:05.160 --> 00:11:08.999
while at the same time hiding unnecessary context.

00:11:09.000 --> 00:11:11.919
It also allows structural editing.

00:11:11.920 --> 00:11:16.039
This can be a very useful tool to enrich corpora

00:11:16.040 --> 00:11:20.919
where we are focusing on a limited set of phenomena.

00:11:20.920 --> 00:11:24.519
The two together allow us to create

00:11:24.520 --> 00:11:27.199
a rich representation

00:11:27.200 --> 00:11:32.999
that can simultaneously capture multiple possible sequences,

00:11:33.000 --> 00:11:38.759
capture details necessary to recreate the original source,

00:11:38.760 --> 00:11:42.079
allow the creation of hierarchical representation,

00:11:42.080 --> 00:11:44.679
provide structural editing capabilities

00:11:44.680 --> 00:11:47.439
that can take advantage of the concept of inheritance

00:11:47.440 --> 00:11:48.999
within the tree structure.

00:11:49.000 --> 00:11:54.279
Together they allow local manipulations of structures,

00:11:54.280 --> 00:11:56.199
thereby minimizing data coupling.

00:11:56.200 --> 00:11:59.119
The concept of tags in Org mode

00:11:59.120 --> 00:12:01.599
complements the hierarchy part.

00:12:01.600 --> 00:12:03.839
Hierarchies can be very rigid,

00:12:03.840 --> 00:12:06.039
but by adding tags to hierarchies,

00:12:06.040 --> 00:12:08.839
we can have multifaceted representations.

00:12:08.840 --> 00:12:12.759
As a matter of fact, Org mode has the ability for the tags

00:12:12.760 --> 00:12:15.039
to have their own hierarchical structure

00:12:15.040 --> 00:12:18.639
which further enhances the representational power.

00:12:18.640 --> 00:12:22.639
All of this can be done as a sequence

00:12:22.640 --> 00:12:25.679
of mostly functional data transformations,

00:12:25.680 --> 00:12:27.439
because most of the capabilities

00:12:27.440 --> 00:12:29.759
can be configured and customized.

00:12:29.760 --> 00:12:32.799
It is not necessary to do everything at once.

00:12:32.800 --> 00:12:36.199
Instead, it allows us to incrementally increase

00:12:36.200 --> 00:12:37.919
the complexity of the representation.

00:12:37.920 --> 00:12:39.799
Finally, all of this can be done

00:12:39.800 --> 00:12:42.359
in plain-text representation

00:12:42.360 --> 00:12:45.479
which comes with its own advantages.

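NOTE A minimal plain-text sketch of such an Org node, to make the
idea concrete. The heading, tag, and property names here are
illustrative only, not the actual GRAIL schema:
* I saw the moon with a telescope. :sentence:
** the moon :NP:
:PROPERTIES:
:START_CHAR: 6
:END_CHAR: 13
:END:
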
NOTE Example

00:12:45.480 --> 00:12:50.679
Now let's take a simple example.

00:12:50.680 --> 00:12:55.999
This is a short video that I'll play.

00:12:56.000 --> 00:12:59.679
The sentence is "I saw the moon with a telescope,"

00:12:59.680 --> 00:13:03.999
and let's just make a copy of the sentence.

00:13:04.000 --> 00:13:09.199
What we can do now is to see:

00:13:09.200 --> 00:13:11.879
what does this sentence comprise?

00:13:11.880 --> 00:13:13.679
It has a noun phrase "I,"

00:13:13.680 --> 00:13:17.479
followed by a word "saw."

00:13:17.480 --> 00:13:21.359
Then "the moon" is another noun phrase,

00:13:21.360 --> 00:13:24.839
and "with the telescope" is a prepositional phrase.

00:13:24.840 --> 00:13:30.759
Now one thing that you might remember,

00:13:30.760 --> 00:13:36.119
from grammar school or syntax is that

00:13:36.120 --> 00:13:41.279
there is a syntactic structure.

00:13:41.280 --> 00:13:44.359
And in this particular case--

00:13:44.360 --> 00:13:47.919
because we know that the moon is not typically

00:13:47.920 --> 00:13:51.679
something that can hold the telescope,

00:13:51.680 --> 00:13:56.239
that the seeing must be done by me or "I,"

00:13:56.240 --> 00:14:01.039
and the telescope must be in my hand,

00:14:01.040 --> 00:14:04.479
or "I" am viewing the moon with a telescope.

00:14:04.480 --> 00:14:13.519
However, it is possible that in a different context

00:14:13.520 --> 00:14:17.159
the moon could be referring to an animated character

00:14:17.160 --> 00:14:22.319
in an animated series, and could actually hold the telescope.

00:14:22.320 --> 00:14:23.479
And this is one of the most--

00:14:23.480 --> 00:14:24.839
the oldest and one of the most--

00:14:24.840 --> 00:14:26.319
and in that case the situation might be

00:14:26.320 --> 00:14:30.959
that I'm actually seeing the moon holding a telescope...

00:14:30.960 --> 00:14:36.079
I mean. The moon is holding the telescope,

00:14:36.080 --> 00:14:40.959
and I'm just seeing the moon holding the telescope.

00:14:40.960 --> 00:14:47.999
This is a complex linguistic ambiguity, a linguistic

00:14:48.000 --> 00:14:53.599
phenomenon that requires world knowledge,

00:14:53.600 --> 00:14:55.719
and it's called the PP attachment problem

00:14:55.720 --> 00:14:59.239
where the prepositional phrase attachment

00:14:59.240 --> 00:15:04.599
can be ambiguous, and various different contextual cues

00:15:04.600 --> 00:15:06.879
have to be used to resolve the ambiguity.

00:15:06.880 --> 00:15:09.079
So in this case, as you saw,

00:15:09.080 --> 00:15:11.199
both the readings are technically true,

00:15:11.200 --> 00:15:13.959
depending on different contexts.

00:15:13.960 --> 00:15:16.599
So one thing we could do is just

00:15:16.600 --> 00:15:19.919
to cut the tree and duplicate it,

00:15:19.920 --> 00:15:21.599
and then let's create another node

00:15:21.600 --> 00:15:24.479
and call it an "OR" node.

00:15:24.480 --> 00:15:26.119
And because we are saying,

00:15:26.120 --> 00:15:28.359
this is one of the two interpretations.

00:15:28.360 --> 00:15:32.159
Now let's call one interpretation "a",

00:15:32.160 --> 00:15:36.159
and that interpretation essentially

00:15:36.160 --> 00:15:39.319
is this child of that node "a"

00:15:39.320 --> 00:15:41.799
and that says that the moon

00:15:41.800 --> 00:15:43.999
is holding the telescope.

00:15:44.000 --> 00:15:46.359
Now we can create another representation "b"

00:15:46.360 --> 00:15:53.919
where we capture the other interpretation,

00:15:53.920 --> 00:15:59.959
where I am actually

00:15:59.960 --> 00:16:00.519
holding the telescope,

00:16:00.520 --> 00:16:06.799
and watching the moon using it.

00:16:06.800 --> 00:16:09.199
So now we have two separate interpretations

00:16:09.200 --> 00:16:11.679
in the same structure,

00:16:11.680 --> 00:16:15.519
and we were able to do all of this

00:16:15.520 --> 00:16:18.159
with very quick keystrokes now...

00:16:18.160 --> 00:16:22.439
While we are at it, let's add another interesting thing,

00:16:22.440 --> 00:16:25.159
this node that represents "I":

00:16:25.160 --> 00:16:28.919
"He." It can be "She".

00:16:28.920 --> 00:16:35.759
It can be "the children," or it can be "The people".

00:16:35.760 --> 00:16:45.039
Basically, any entity that has the capability to "see"

00:16:45.040 --> 00:16:53.359
can be substituted in this particular node.

00:16:53.360 --> 00:16:57.399
Let's see what we have here now.

00:16:57.400 --> 00:17:01.239
We just are getting sort of a zoom view

00:17:01.240 --> 00:17:04.599
of the entire structure, what we created,

00:17:04.600 --> 00:17:08.039
and essentially you can see that

00:17:08.040 --> 00:17:11.879
by just, you know, using a few keystrokes,

00:17:11.880 --> 00:17:17.839
we were able to capture two different interpretations

00:17:17.840 --> 00:17:20.879
of a simple sentence,

00:17:20.880 --> 00:17:23.759
and we were also able to add

00:17:23.760 --> 00:17:27.799
these alternate pieces of information

00:17:27.800 --> 00:17:30.559
that could help machine learning algorithms

00:17:30.560 --> 00:17:32.439
generalize better.

00:17:32.440 --> 00:17:36.239
All right.

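NOTE A rough sketch of the outline built in this demo. The "OR", "a",
and "b" labels follow the video; the bracketing shown for each
reading is illustrative:
* OR
** a: the moon is holding the telescope
(S (NP I) (VP saw (NP (NP the moon) (PP with a telescope))))
** b: I am holding the telescope
(S (NP I) (VP saw (NP the moon) (PP with a telescope)))
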
NOTE Different readings

00:17:36.240 --> 00:17:40.359
Now, let's look at the next thing. So in a sense,

00:17:40.360 --> 00:17:46.679
we can use this power of functional data structures

00:17:46.680 --> 00:17:50.239
to represent various potentially conflicting

00:17:50.240 --> 00:17:55.559
structural readings of that piece of text.

00:17:55.560 --> 00:17:58.079
In addition to that, we can also create more texts,

00:17:58.080 --> 00:17:59.799
each with different structure,

00:17:59.800 --> 00:18:01.559
and have them all in the same place.

00:18:01.560 --> 00:18:04.239
This allows us to address the interpretation

00:18:04.240 --> 00:18:06.879
of a static sentence that might be occurring in the world,

00:18:06.880 --> 00:18:09.639
while simultaneously inserting information

00:18:09.640 --> 00:18:11.519
that would add more value to it.

00:18:11.520 --> 00:18:14.999
This makes the enrichment process also very efficient.

00:18:15.000 --> 00:18:19.519
Additionally, we can envision

00:18:19.520 --> 00:18:23.999
a power user of the future, or present,

00:18:24.000 --> 00:18:27.479
who can not only annotate a span,

00:18:27.480 --> 00:18:31.279
but also edit the information in situ

00:18:31.280 --> 00:18:34.639
in a way that would help machine algorithms

00:18:34.640 --> 00:18:36.879
generalize better by making more efficient use

00:18:36.880 --> 00:18:37.719
of the annotations.

00:18:37.720 --> 00:18:41.519
So together, Emacs and Org mode can speed up

00:18:41.520 --> 00:18:42.959
the enrichment of the signals

00:18:42.960 --> 00:18:44.519
in a way that allows us

00:18:44.520 --> 00:18:47.719
to focus on certain aspects and ignore others.

00:18:47.720 --> 00:18:50.839
An extremely complex landscape of rich structures

00:18:50.840 --> 00:18:53.039
can be captured consistently,

00:18:53.040 --> 00:18:55.639
in a fashion that allows computers

00:18:55.640 --> 00:18:56.759
to understand language.

00:18:56.760 --> 00:19:00.879
We can then build tools to enhance the tasks

00:19:00.880 --> 00:19:03.319
that we do in our everyday life.

00:19:03.320 --> 00:19:10.759
YAMR is the acronym for the file type, or specification,

00:19:10.760 --> 00:19:15.239
that we are creating to capture this new

00:19:15.240 --> 00:19:17.679
rich representation.

NOTE Spontaneous speech

00:19:17.680 --> 00:19:21.959
We'll now look at an example of spontaneous speech

00:19:21.960 --> 00:19:24.799
that occurs in spoken conversations.

00:19:24.800 --> 00:19:28.599
Conversations frequently contain errors in speech:

00:19:28.600 --> 00:19:30.799
interruptions, disfluencies,

00:19:30.800 --> 00:19:33.959
verbal sounds such as a cough or a laugh,

00:19:33.960 --> 00:19:35.039
and other noises.

00:19:35.040 --> 00:19:38.199
In this sense, spontaneous speech is similar

00:19:38.200 --> 00:19:39.799
to a functional data stream.

00:19:39.800 --> 00:19:42.759
We cannot take back words that come out of our mouth,

00:19:42.760 --> 00:19:47.239
but we tend to make mistakes, and we correct ourselves

00:19:47.240 --> 00:19:49.039
as soon as we realize that we have made--

00:19:49.040 --> 00:19:50.679
we have misspoken.

00:19:50.680 --> 00:19:53.159
This process manifests through a combination

00:19:53.160 --> 00:19:56.279
of a handful of mechanisms, including immediate correction

00:19:56.280 --> 00:20:00.959
after an error, and we do this unconsciously.

00:20:00.960 --> 00:20:02.719
Computers, on the other hand,

00:20:02.720 --> 00:20:06.639
must be taught to understand these cases.

00:20:06.640 --> 00:20:12.799
What we see here is an example document or outline,

00:20:12.800 --> 00:20:18.119
or part of a document that illustrates

00:20:18.120 --> 00:20:22.919
various different aspects of the representation.

00:20:22.920 --> 00:20:25.919
We don't have a lot of time to go through

00:20:25.920 --> 00:20:28.239
many of the details.

00:20:28.240 --> 00:20:31.759
I would highly encourage you to play a...

00:20:31.760 --> 00:20:39.159
I'm planning on making some videos, or asciinema recordings,

00:20:39.160 --> 00:20:42.559
that I'll be posting, and

00:20:42.560 --> 00:20:46.759
if you're interested, you can go through those.

00:20:46.760 --> 00:20:50.359
The idea here is to try to do

00:20:50.360 --> 00:20:54.599
a slightly more complex use case.

00:20:54.600 --> 00:20:57.639
But again, given the time constraint

00:20:57.640 --> 00:21:00.279
and the amount of information

00:21:00.280 --> 00:21:01.519
that needs to fit in the screen,

00:21:01.520 --> 00:21:05.559
this may not be very informative,

00:21:05.560 --> 00:21:08.399
but at least it will give you some idea

00:21:08.400 --> 00:21:10.439
of what can be possible.

00:21:10.440 --> 00:21:13.279
And in this particular case, what you're seeing is that

00:21:13.280 --> 00:21:18.319
there is a sentence which is "What I'm I'm tr- telling now."

00:21:18.320 --> 00:21:21.159
Essentially, there is a repetition of the word "I'm",

00:21:21.160 --> 00:21:23.279
and then there is a partial word

00:21:23.280 --> 00:21:25.159
that somebody tried to say "telling",

00:21:25.160 --> 00:21:29.599
but started saying "tr-", and then corrected themselves

00:21:29.600 --> 00:21:30.959
and said, "telling now."

00:21:30.960 --> 00:21:39.239
So in this case, you see, we can capture words

00:21:39.240 --> 00:21:44.919
or a sequence of words, or a sequence of tokens.

00:21:44.920 --> 00:21:52.279
An interesting thing to note is that in NLP,

00:21:52.280 --> 00:21:55.319
sometimes we have to break

00:21:55.320 --> 00:22:01.199
words that don't have spaces into two separate words,

00:22:01.200 --> 00:22:04.119
especially contractions like "I'm",

00:22:04.120 --> 00:22:08.199
so the syntactic parser needs two separate nodes.

00:22:08.200 --> 00:22:11.199
But anyway, so I'll... You can see that here.

00:22:11.200 --> 00:22:15.759
The other thing... What this view shows is that

00:22:15.760 --> 00:22:19.759
with each of the nodes in the sentence

00:22:19.760 --> 00:22:23.079
or in the representation,

00:22:23.080 --> 00:22:26.079
you can have a lot of different properties

00:22:26.080 --> 00:22:27.559
that you can attach to them,

00:22:27.560 --> 00:22:30.119
and these properties are typically hidden,

00:22:30.120 --> 00:22:32.719
like you saw in the earlier slide.

00:22:32.720 --> 00:22:35.599
But you can make use of all these properties

00:22:35.600 --> 00:22:39.439
to do various kind of searches and filtering.

00:22:39.440 --> 00:22:43.519
And on the right hand side here--

00:22:43.520 --> 00:22:48.799
this is actually not a legitimate syntax--

00:22:48.800 --> 00:22:51.279
but on the right are descriptions

00:22:51.280 --> 00:22:53.479
of what each of these represent.

00:22:53.480 --> 00:22:57.319
All the information is also available in the article.

00:22:57.320 --> 00:23:04.279
You can see there... It shows how much rich context

00:23:04.280 --> 00:23:05.879
you can capture.

00:23:05.880 --> 00:23:08.799
This is just a closer snapshot

00:23:08.800 --> 00:23:10.159
of the properties on the node,

00:23:10.160 --> 00:23:13.119
and you can see we can have things like,

00:23:13.120 --> 00:23:14.799
whether the word is a token or not,

00:23:14.800 --> 00:23:17.359
or that it's incomplete, whether some words

00:23:17.360 --> 00:23:19.959
might want to be filtered out for parsing,

00:23:19.960 --> 00:23:23.039
and we can mark this as PARSE_IGNORE,

00:23:23.040 --> 00:23:25.519
or some words are restart markers...

00:23:25.520 --> 00:23:29.239
We can add a RESTART_MARKER, or sometimes,

00:23:29.240 --> 00:23:31.999
some of these might have durations. Things like that.

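NOTE An illustrative property drawer for the disfluent sentence above,
"What I'm I'm tr- telling now." PARSE_IGNORE and RESTART_MARKER are
the names mentioned in the talk; the other property names, values,
and their exact placement are hypothetical:
* I'm (repeated)
:PROPERTIES:
:TOKEN: yes
:PARSE_IGNORE: t
:END:
* tr- (partial word)
:PROPERTIES:
:TOKEN: yes
:INCOMPLETE: t
:RESTART_MARKER: t
:DURATION: <seconds>
:END:
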
NOTE Editing properties in column view

00:23:32.000 --> 00:23:38.799
The other fascinating thing about this representation

00:23:38.800 --> 00:23:42.599
is that you can edit properties in the column view.

00:23:42.600 --> 00:23:45.399
And suddenly, you have this tabular data structure

00:23:45.400 --> 00:23:48.879
combined with the hierarchical data structure.

00:23:48.880 --> 00:23:53.119
And you may not be able to see it here,

00:23:53.120 --> 00:23:56.879
but what has also happened here is that

00:23:56.880 --> 00:24:01.159
some of the tags have been inherited

00:24:01.160 --> 00:24:02.479
from the earlier nodes.

00:24:02.480 --> 00:24:07.919
And so you get a much fuller picture of things.

00:24:07.920 --> 00:24:13.919
Essentially, you can filter out things

00:24:13.920 --> 00:24:15.319
that you want to process,

00:24:15.320 --> 00:24:20.279
process them, and then reintegrate them into the whole.

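NOTE In stock Org mode this tabular view is column view. A minimal
sketch, assuming the property names used above (which, apart from
PARSE_IGNORE, are illustrative):
#+COLUMNS: %40ITEM %TOKEN %PARSE_IGNORE %DURATION
Then M-x org-columns (C-c C-x C-c) turns the outline into an
editable table of those properties, and the edits made there flow
straight back into the property drawers of the outline.
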
NOTE Conclusion

00:24:20.280 --> 00:24:25.479
So, in conclusion, today we have proposed and demonstrated

00:24:25.480 --> 00:24:27.559
the use of an architecture (GRAIL),

00:24:27.560 --> 00:24:31.319
which allows the representation, manipulation,

00:24:31.320 --> 00:24:34.759
and aggregation of rich linguistic structures

00:24:34.760 --> 00:24:36.519
in a systematic fashion.

00:24:36.520 --> 00:24:41.359
We have shown how GRAIL advances the tools

00:24:41.360 --> 00:24:44.599
available for building machine learning models

00:24:44.600 --> 00:24:46.879
that simulate understanding.

00:24:46.880 --> 00:24:51.679
Thank you very much for your time and attention today.

00:24:51.680 --> 00:24:54.639
My contact information is on this slide.

00:24:54.640 --> 00:25:02.599
If you are interested in an additional example

00:25:02.600 --> 00:25:05.439
that demonstrates the representation

00:25:05.440 --> 00:25:08.039
of speech and written text together,

00:25:08.040 --> 00:25:10.719
please continue watching.

00:25:10.720 --> 00:25:12.199
Otherwise, you can stop here

00:25:12.200 --> 00:25:15.279
and enjoy the rest of the conference.

NOTE Bonus material

00:25:15.280 --> 00:25:39.079
Welcome to the bonus material.

00:25:39.080 --> 00:25:43.959
I'm glad for those of you who have stuck around.

00:25:43.960 --> 00:25:46.559
We are now going to examine an instance

00:25:46.560 --> 00:25:49.159
of speech and text signals together

00:25:49.160 --> 00:25:51.479
that produce multiple layers.

00:25:51.480 --> 00:25:54.839
When we take a spoken conversation

00:25:54.840 --> 00:25:58.719
and use the best language processing models available,

00:25:58.720 --> 00:26:00.679
we suddenly hit a hard spot

00:26:00.680 --> 00:26:03.239
because the tools are typically not trained

00:26:03.240 --> 00:26:05.359
to filter out the unnecessary cruft

00:26:05.360 --> 00:26:07.559
in order to automatically interpret

00:26:07.560 --> 00:26:09.559
the part of what is being said

00:26:09.560 --> 00:26:11.799
that is actually relevant.

00:26:11.800 --> 00:26:14.639
Over time, language researchers

00:26:14.640 --> 00:26:17.719
have created many interdependent layers of annotations,

00:26:17.720 --> 00:26:21.039
yet the assumptions underlying them are seldom the same.

00:26:21.040 --> 00:26:25.039
Piecing together such related but disjointed annotations

00:26:25.040 --> 00:26:28.039
and their predictions poses a huge challenge.

00:26:28.040 --> 00:26:30.719
This is another place where we can leverage

00:26:30.720 --> 00:26:33.119
the data model underlying the Emacs editor,

00:26:33.120 --> 00:26:35.359
along with the structural editing capabilities

00:26:35.360 --> 00:26:38.519
of Org mode to improve current tools.

00:26:38.520 --> 00:26:42.839
Let's take this very simple looking utterance.

00:26:42.840 --> 00:26:48.039
"Um \{lipsmack\} and that's it. (\{laugh\})"

00:26:48.040 --> 00:26:50.319
Looks like the person-- so this is--

00:26:50.320 --> 00:26:54.519
what you are seeing here is a transcript of an audio signal

00:26:54.520 --> 00:27:00.759
that has a lip smack and a laugh as part of it,

00:27:00.760 --> 00:27:04.199
and there is also a "Um" like interjection.

00:27:04.200 --> 00:27:08.199
So this has a few interesting noises

00:27:08.200 --> 00:27:13.999
and specific things that would be illustrative

00:27:14.000 --> 00:27:20.479
of how we are going to represent it.

NOTE Syntactic analysis

00:27:20.480 --> 00:27:25.839
Okay. So let's say you want to have

00:27:25.840 --> 00:27:28.879
a syntactic analysis of this sentence or utterance.

00:27:28.880 --> 00:27:30.959
One common technique people use

00:27:30.960 --> 00:27:32.879
is just to remove the cruft, and, you know,

00:27:32.880 --> 00:27:35.079
write some rules, clean up the utterance,

00:27:35.080 --> 00:27:36.719
make it look like it's proper English,

00:27:36.720 --> 00:27:40.239
and then, you know, tokenize it,

00:27:40.240 --> 00:27:43.079
and basically just use standard tools to process it.

00:27:43.080 --> 00:27:47.279
But in that process, they end up eliminating

00:27:47.280 --> 00:27:51.119
valid pieces of signal that have meaning to others

00:27:51.120 --> 00:27:52.799
studying different phenomena of language.

00:27:52.800 --> 00:27:56.479
Here you have the rich transcript,

00:27:56.480 --> 00:28:00.119
the input to the syntactic parser.

00:28:00.120 --> 00:28:05.919
As you can see, there is a little tokenization happening

00:28:05.920 --> 00:28:07.199
where you'll be inserting space

00:28:07.200 --> 00:28:12.119
between "that" and the contracted is ('s),

00:28:12.120 --> 00:28:15.599
and between the period and the "it,"

00:28:15.600 --> 00:28:18.199
and the output of the syntactic parser is shown below,

00:28:18.200 --> 00:28:21.639
which (surprise) is an S-expression.

00:28:21.640 --> 00:28:24.919
Like I said, the parse trees, when they were created,

00:28:24.920 --> 00:28:29.799
and still largely when they are used, are S-expressions,

00:28:29.800 --> 00:28:32.999
and most of the viewers here

00:28:33.000 --> 00:28:35.119
should not have much problem reading it.

00:28:35.120 --> 00:28:37.279
You can see this tree structure

00:28:37.280 --> 00:28:39.279
of the syntactic parse here.

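NOTE The parser output mentioned above is, roughly, a
Penn-Treebank-style S-expression like the following. The exact
labels on screen may differ; this bracketing is illustrative:
(S (CC And) (NP (DT that)) (VP (VBZ 's) (NP (PRP it))) (. .))
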
NOTE Forced alignment

00:28:39.280 --> 00:28:40.919
Now let's say you want to integrate

00:28:40.920 --> 00:28:44.479
phonetic information or phonetic layer

00:28:44.480 --> 00:28:49.119
that's in the audio signal, and do some analysis.

00:28:49.120 --> 00:28:57.519
Now, it would need you to take a few steps.

00:28:57.520 --> 00:29:01.679
First, you would need to align the transcript

00:29:01.680 --> 00:29:06.479
with the audio. This process is called forced alignment,

00:29:06.480 --> 00:29:10.399
where you already know what the transcript is,

00:29:10.400 --> 00:29:14.599
and you have the audio, and you can get a good alignment

00:29:14.600 --> 00:29:17.599
using both pieces of information.

00:29:17.600 --> 00:29:20.119
And this is typically a technique that is used to

00:29:20.120 --> 00:29:23.079
create training data for training

00:29:23.080 --> 00:29:25.839
automatic speech recognizers.

00:29:25.840 --> 00:29:29.639
One interesting thing is that in order to do

00:29:29.640 --> 00:29:32.879
this forced alignment, you have to keep

00:29:32.880 --> 00:29:35.799
the non-speech events in the transcript,

00:29:35.800 --> 00:29:39.079
because they consume some audio signal,

00:29:39.080 --> 00:29:41.399
and if you don't have that signal,

00:29:41.400 --> 00:29:44.399
the alignment process doesn't know exactly...

00:29:44.400 --> 00:29:45.759
you know, it doesn't do a good job,

00:29:45.760 --> 00:29:50.039
because it needs to align all parts of the signal

00:29:50.040 --> 00:29:54.999
with something, either pause or silence or noise or words.

00:29:55.000 --> 00:29:59.719
Interestingly, punctuations really don't factor in,

00:29:59.720 --> 00:30:01.559
because we don't speak in punctuations.

00:30:01.560 --> 00:30:04.239
So one of the things that you need to do

00:30:04.240 --> 00:30:05.679
is remove most of the punctuations,

00:30:05.680 --> 00:30:08.039
although you'll see there are some punctuations

00:30:08.040 --> 00:30:12.599
that can be kept, or that are to be kept.

NOTE Alignment before tokenization

00:30:12.600 --> 00:30:15.319
And the other thing is that the alignment has to be done

00:30:15.320 --> 00:30:20.159
before tokenization, as it impacts pronunciation.

00:30:20.160 --> 00:30:24.399
To show an example: Here you see "that's".

00:30:24.400 --> 00:30:26.919
When it's one word,

00:30:26.920 --> 00:30:31.959
it has a slightly different pronunciation

00:30:31.960 --> 00:30:35.679
than when it is two words, which is "that is",

00:30:35.680 --> 00:30:38.399
like you can see "is." And so,

00:30:38.400 --> 00:30:44.279
if you split the tokens or split the words

00:30:44.280 --> 00:30:48.119
in order for the syntactic parser to process it,

00:30:48.120 --> 00:30:51.599
you would end up getting the wrong phonetic analysis.

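NOTE Roughly, in ARPAbet-style phones: the single token "that's" is
pronounced DH AE T S, while the two tokens "that is" come out as
DH AE T followed by IH Z, which is why the alignment has to happen
before tokenization.
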
00:30:51.600 --> 00:30:54.239
And if you process it

00:30:54.240 --> 00:30:55.319
through the phonetic analysis,

00:30:55.320 --> 00:30:59.159
and you don't know how to integrate it

00:30:59.160 --> 00:31:02.719
with the tokenized syntax, you know,

00:31:02.720 --> 00:31:07.519
that can be pretty tricky. And a lot of the time,

00:31:07.520 --> 00:31:10.759
people write one-off pieces of code that handle these,

00:31:10.760 --> 00:31:14.279
but the idea here is to try to have a general architecture

00:31:14.280 --> 00:31:17.239
that seamlessly integrates all these pieces.

00:31:17.240 --> 00:31:21.319
Then you do the syntactic parsing of the remaining tokens.

00:31:21.320 --> 00:31:24.799
Then you align the data and the two annotations,

00:31:24.800 --> 00:31:27.959
and then integrate the two layers.

00:31:27.960 --> 00:31:31.359
Once that is done, then you can do all kinds of

00:31:31.360 --> 00:31:33.919
interesting analysis, and test various hypotheses

00:31:33.920 --> 00:31:35.279
and generate the statistics,

00:31:35.280 --> 00:31:39.359
but without that you are only dealing

00:31:39.360 --> 00:31:42.879
with one or the other part.

NOTE Layers

00:31:42.880 --> 00:31:48.319
Let's just take a quick look at what each of the layers

00:31:48.320 --> 00:31:51.159
that are involved looks like.

00:31:51.160 --> 00:31:56.719
So this is "Um \{lipsmack\}, and that's it. \{laugh\}"

00:31:56.720 --> 00:32:00.159
This is the transcript, and on the right hand side,

00:32:00.160 --> 00:32:04.199
you see the same thing as a transcript

00:32:04.200 --> 00:32:06.239
listed vertically in a column.

00:32:06.240 --> 00:32:08.199
You'll see why, in just a second.

00:32:08.200 --> 00:32:09.879
And there are some place--

00:32:09.880 --> 00:32:11.279
there are some rows that are empty,

00:32:11.280 --> 00:32:15.079
some rows that are wider than the others, and we'll see why.

00:32:15.080 --> 00:32:19.319
The next is the tokenized sentence

00:32:19.320 --> 00:32:20.959
where you have space added,

00:32:20.960 --> 00:32:23.599
you know space between these two tokens:

00:32:23.600 --> 00:32:26.599
"that" and the apostrophe "s" ('s),

00:32:26.600 --> 00:32:28.079
and the "it" and the "period".

00:32:28.080 --> 00:32:30.679
And you see on the right hand side

00:32:30.680 --> 00:32:33.559
that the tokens have attributes.

00:32:33.560 --> 00:32:36.439
So there is a token index, and there are 1, 2,

00:32:36.440 --> 00:32:38.839
you know 0, 1, 2, 3, 4, 5 tokens,

00:32:38.840 --> 00:32:41.479
and each token has a start and end character,

00:32:41.480 --> 00:32:45.799
and space (sp) also has a start and end character,

00:32:45.800 --> 00:32:50.399
and space is represented by a "sp".  And there are

00:32:50.400 --> 00:32:54.319
these other things that we removed,

00:32:54.320 --> 00:32:56.239
like the "\{LS\}" which is for "\{lipsmack\}"

00:32:56.240 --> 00:32:59.399
and "\{LG\}" which is "\{laugh\}" are showing grayed out,

00:32:59.400 --> 00:33:02.439
and you'll see why some of these things are grayed out

00:33:02.440 --> 00:33:03.399
in a little bit.

00:33:03.400 --> 00:33:11.919
This is what the forced alignment tool produces.

00:33:11.920 --> 00:33:17.159
Basically, it takes the transcript,

00:33:17.160 --> 00:33:19.159
and this is the transcript

00:33:19.160 --> 00:33:24.119
that has slightly different symbols,

00:33:24.120 --> 00:33:26.239
because different tools use different symbols

00:33:26.240 --> 00:33:28.159
and various configuration settings.

00:33:28.160 --> 00:33:33.679
But this is what is used to get an alignment

00:33:33.680 --> 00:33:36.039
or time alignment with phones.

00:33:36.040 --> 00:33:40.079
So this column shows the phones, and so each word...

00:33:40.080 --> 00:33:43.879
So, for example, "and" has been aligned with these phones,

00:33:43.880 --> 00:33:46.879
and these on the start and end

00:33:46.880 --> 00:33:52.959
are essentially temporal or time stamps that it aligned--

00:33:52.960 --> 00:33:54.279
that have been aligned to it.

00:33:54.280 --> 00:34:00.759
Interestingly, sometimes we don't really have any pause

00:34:00.760 --> 00:34:05.159
or any time duration between some words

00:34:05.160 --> 00:34:08.199
and those are highlighted as gray here.

00:34:08.200 --> 00:34:12.759
See, there's this space... Actually

00:34:12.760 --> 00:34:17.799
it does not have any temporal content,

00:34:17.800 --> 00:34:21.319
whereas this other space has some duration.

00:34:21.320 --> 00:34:24.839
So the ones that have some duration are captured,

00:34:24.840 --> 00:34:29.519
while the others are the ones that in the earlier diagram

00:34:29.520 --> 00:34:31.319
we saw were left out.

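NOTE An illustrative sketch of that alignment as a table. The phones
are ARPAbet-style, and the start/end values are placeholders, not the
real timestamps from the demo:
| word   | phones     | start | end |
| um     | AH M       | t0    | t1  |
| \{LS\} | (lipsmack) | t1    | t2  |
| and    | AE N D     | t2    | t3  |
| that's | DH AE T S  | t3    | t4  |
| it     | IH T       | t4    | t5  |
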
NOTE Variations

00:34:31.320 --> 00:34:37.639
And the aligner actually produces multiple files.

00:34:37.640 --> 00:34:44.399
One of the files has a slightly different

00:34:44.400 --> 00:34:46.679
variation on the same information,

00:34:46.680 --> 00:34:49.999
and in this case, you can see

00:34:50.000 --> 00:34:52.399
that the punctuation is missing,

00:34:52.400 --> 00:34:57.599
and the punctuation is, you know, deliberately missing,

00:34:57.600 --> 00:35:02.279
because there is no time associated with it,

00:35:02.280 --> 00:35:06.439
and you see that it's not the tokenized sentence--

00:35:06.440 --> 00:35:17.119
a tokenized word. Now this gives you a full table,

00:35:17.120 --> 00:35:21.239
and you can't really look into it very carefully.

00:35:21.240 --> 00:35:25.879
But we can focus on the part that seems legible,

00:35:25.880 --> 00:35:28.559
or, you know, a properly written sentence,

00:35:28.560 --> 00:35:32.879
process it and reincorporate it back into the whole.

00:35:32.880 --> 00:35:35.879
So if somebody wants to look at, for example,

00:35:35.880 --> 00:35:39.679
how many pauses the person made while they were talking,

00:35:39.680 --> 00:35:42.919
they can actually measure the pauses--the number,

00:35:42.920 --> 00:35:46.279
the duration, and make connections between that

00:35:46.280 --> 00:35:49.639
and the rich syntactic structure that is being produced.

00:35:49.640 --> 00:35:57.279
And in order to do that, you have to get these layers

00:35:57.280 --> 00:35:59.039
to align with each other,

00:35:59.040 --> 00:36:04.359
and this table is just a tabular representation

00:36:04.360 --> 00:36:08.679
of the information that we'll be storing in the YAMR file.

00:36:08.680 --> 00:36:11.719
Congratulations! You have reached

00:36:11.720 --> 00:36:13.479
the end of this demonstration.

00:36:13.480 --> 00:36:17.000
Thank you for your time and attention.