[[!meta title="Enhancing productivity with voice computing"]]
[[!meta copyright="Copyright © 2023 Blaine Mooers"]]
[[!inline pages="internal(2023/info/voice-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. --->

# Enhancing productivity with voice computing
Blaine Mooers (he/him/his) - Pronunciation: pronounced like "moors", blaine-mooers(at)ouhsc.edu, <https://basicsciences.ouhsc.edu/bmb/Faculty/bio_details/mooers-blaine-hm-phd>, <https://twitter.com/BlaineMooers>, <https://github.com/MooersLab>, <https://codeberg.org/MooersLab>, mastodon(at)bhmooers

[[!inline pages="internal(2023/info/voice-before)" raw="yes"]]

Voice computing uses speech recognition software to convert speech into text, commands, or code.
While there is a venerated program called Emacspeak for converting text into speech, an
"EmacsListens" for converting speech into text is not available yet.
The Emacs Wiki describes the underdeveloped state of speech-to-text in Emacs.
I will explain how two external software packages convert my speech into text and computer
commands that can be used with Emacs.

First, I present some motivations for using voice computing.
These can be divided into two categories: productivity improvement and health-related issues.
In the second category, there is the under-appreciated cure for "standing desk envy";
the cure is a large dose of voice computing while standing.

I found one software package, Voice In Plus (<https://dictanote.co/voicein/plus/>), to be quite
accurate for speech-to-text (dictation) but less versatile for speech-to-commands.
I have used this package daily, and I saw a three-fold increase in my daily word count almost
immediately.
Of course, there are limits here; you can talk for only so many hours per day.

Second, I found another software package, Talon Voice (<http://talon.wiki/>), that has a less
accurate language model but supports custom commands that can be executed anywhere you can
place the cursor, including in virtual machines and on remote servers.
Talon Voice will appeal to those who like to tinker with configuration files, yet it is easy to
use.

I will explain how I have integrated these two packages into my workflow.
I have developed a library of commands that expand 94 English contractions when spoken.
This library eliminates tedious downstream editing of formal prose where I do not use
contractions.
The library is available on GitHub for both Voice In Plus
(<https://github.com/mooersLab/voice-in-plus-contractions>) and Talon Voice
(<https://github.com/MooersLab/talon-contractions>).

I also supply interactive quizzes for mastering the basic Voice In commands
(<https://github.com/MooersLab/voice-in-basics-quiz>) and the Talon Voice phonetic alphabet
(<https://github.com/MooersLab/talon-voice-quizzes/qTalonAlphabet.py>).
I learned the Talon alphabet in one day by taking the quiz at spaced intervals.
Once I was proficient, the quiz took only 60 seconds to complete.

# About the speaker:

I am an Associate Professor of Biochemistry at the University of
Oklahoma Health Sciences Center. I use X-ray crystallography to study
the structures of RNA, proteins, and protein-drug complexes. I have
been using Python and LaTeX for a dozen years, and Jupyter Notebooks
since 2013. I have been using Emacs every day for 2.5 years. I
discovered voice computing this summer when my chronic repetitive
stress injury flared up while entering data in a spreadsheet. I
tripled my daily word count by using speech-to-text, and I get a
kick out of running remote computers with speech-to-commands.

# Transcript

[slide 1]
Hi, I'm Blaine Mooers. I'm an associate professor of biochemistry at
the University of Oklahoma Health Sciences Center in Oklahoma City.
My lab studies the role of RNA structure in RNA editing. We use X-ray
crystallography to study the structures of these RNAs. We spend a lot
of time in the lab preparing our samples for structural studies, and then
we also spend a lot of time at the computer analyzing the resulting data.
I was seeking ways of using voice computing to try to enhance
my productivity.

[slide 2]
I divide voice computing into three activities: speech-to-text or dictation,
speech-to-commands, and speech-to-code. I'll be talking about
speech-to-text and speech-to-commands today because these are the two
activities that are probably most broadly applicable to the workflows of
people attending this conference.

[slide 3]
This talk will not be about Emacspeak. That is a venerated program for
converting text to speech. We're talking about the flow of information in
the opposite direction, speech-to-text. We need an Emacs Listens. We
don't have one, so I had to seek help from outside the Emacs world via
Voice In Plus. This runs in the Google Chrome web browser, and it's
very good for speech-to-text and very easy to learn how to use. It also
has some speech-to-commands. However, Talon Voice is much better
with speech-to-commands, and it's also great at speech-to-code.

[slide 4]
The motivations, as I mentioned already, include improved
productivity. If you're a fast typist who types faster than you can speak,
you might nonetheless still benefit from voice computing when you
grow tired of using the keyboard. On the other hand, you might be a
slow typist who talks faster than you can type. In this case, you're
definitely going to benefit from dictation because you'll be able to enter
more words into text documents in a given day. If you're a coder,
then you may get a kick out of opening programs, websites, and
coding projects by using your voice.

[slide 5]
Then there are health-related reasons. You may have impaired use
of your hands, eyes, or both due to accident or disease, or you may
suffer from a repetitive stress injury. Many of us have a mild but
chronic form of it. We can't take a three-month sabbatical from
the keyboard without losing our jobs, so these injuries tend to persist.
And then you may have learned that it's not good for your health to
sit for prolonged periods of time staring at a computer screen.
You can actually dictate to your computer from 20 feet away while
looking out the window, thereby giving your lower body
and your eyes a break.

[slide 5]
I'm not God, so I have to bring data. I have two data points here:
the number of words that I wrote in June and July of this year and in
September and October. I adopted voice computing in
the middle of August. As you can see, I got more than a three-fold
increase in my output.

[slide 6]
So this is the Chrome Web Store page for Voice In. It's only available
for Google Chrome. You just hit the install button to install it. To configure it,
you need to select a language. It has support for 40 languages, and
it supports about a dozen different dialects of English, including Australian.

[slide 7]
It works on web pages with text areas. I use it regularly on
Overleaf and 750words.com, a distraction-free environment for writing.
It also works in webmail. It works in Google Docs. It works in JupyterLab,
of course, because that runs in the browser. It also works in Jupyter
Notebook and Colab notebooks. It should work in Cloudmacs. I've
mapped option-L to opening Voice In when the cursor is on a web page
that has a text area. The presence of a text area is the main limiting factor.

[slide 8]
Voice In has a number of built-in commands. You can turn it off by saying
"stop dictation". It doesn't distinguish between a command mode and
a dictation mode. It has an undo command. You use the command
"copy that" to copy the selection. You say "press enter" to issue a
command or submit text that has been entered in a web form, and
"press tab" will open up the next tab in a web browser. The scroll up
and scroll down commands let you navigate a web page. I've put together
a quiz about these commands so that you can go through it several
times until you get at least 90 percent of the questions correct, in order
to boost your recall of the commands. The quiz is a Python script that
you can probably pound through in less than a minute once you know
the commands. I also provide an Elisp version of this quiz, but it's a
little slower to operate.

[slide 9]
These are some common errors that I've run into with Voice In. It
likes to contract statements like "I will" into "I'll". Contractions are not
used in formal writing, and most of my writing is formal writing, so
this annoys me. I will show you how I corrected that problem. It
also drops the first word in sentences quite often. This might be some
speech issue that I have. It inserts the wrong word when the intended
word is not in the vocabulary that was used to train it. So, for example,
PyMOL is the name of a molecular graphics program that we use in
our field. Voice In doesn't recognize PyMOL. Instead, it substitutes the
word "primal". Since I don't use "primal" very often, I've mapped the
word "primal" to "PyMOL" in some custom commands I'll talk about
in a minute. Then there's the problem that existing commands might
get executed when you speak them when, in fact, you only wanted to use
those words in your dictation. This is a pitfall of Voice In: it doesn't
have a command mode that's separate from the dictation mode.

[slide 10]
Through a very easy-to-use GUI, you can set up custom voice
commands mapped to whatever you want inserted, so this is how
misinterpreted words can be corrected: you just map the misinterpreted
word to the intended word. You can also map contractions to their
expansions. I did this for 94 English contractions, and you can find
these on GitHub. You can also insert acronyms and expand them.
I apply the same approach to the first names of colleagues.
I say "expand Fred", for example, to get Fred's first and last name
with the correct spelling of his very long German name. You can also
insert other trivia like favorite URLs. You can insert LaTeX snippets.
It handles multi-line snippets correctly; you just have to enclose them
in double quotes. You can even insert BibTeX cite keys for references
that you use frequently. Every field has key references for certain
methods or topics.
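
To make this concrete, here is a sketch of a few custom commands of the kind
described above, written as spoken-phrase and inserted-text pairs as they would
be entered in the Voice In Plus custom commands table (the colleague's name and
the LaTeX snippet are illustrative, not my actual entries):

```
# spoken phrase       text that Voice In Plus inserts
primal                PyMOL
I'll                  I will
expand fred           Fred Mustermann
insert equation       "\begin{equation}
                       \end{equation}"
```

The double quotes around the LaTeX snippet are what let it span multiple lines,
as noted above.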

[slide 11]
Then Voice In has a set of commands that you can customize for
speech-to-commands, to get the computer to do something like open
up a specific website or save the current writing. In this case, we have
"press: command-s" for saving the current writing. You can change the
language with "lang:", and you can change the case of the text with "case:".
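
As a sketch, a command-style entry of the kind just described pairs a spoken
phrase with a command rather than with plain text; the spoken phrase below is
illustrative, and only the "press:" prefix shown in the talk is used:

```
# spoken phrase       command entry (instead of plain text)
save my writing       press: command-s
```

The "lang:" and "case:" prefixes are entered the same way, changing the
dictation language or the letter case of the inserted text.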

[slide 12]
But the speech-to-command repertoire is quite limited in Voice In,
so it's now time to pick up Talon Voice. This is an open source project.
It's free. It is highly configurable via TalonScript, a simple scripting
language, or via Python, but it's easier to code up your configuration
in TalonScript. Talon has a Python interpreter embedded in it, so you
don't have to mess around with installing yet another Python interpreter.
It runs on all platforms, and it has a dictation mode that's separate from
its command mode. When you activate it, it starts out asleep, listening
only for the wake command. You just bark out "talon wake" to wake it
up, and "talon sleep" to put it back to sleep. It has a very welcoming
community in the Talon Slack channel. Then I need to point out that
there are several packages that others have developed that run on top
of Talon, but one of particular note, Cursorless, is by Pokey Rule. He has
on his website some really well-done videos that demonstrate how he
uses Cursorless to move the cursor around using voice commands. This,
however, runs in VS Code; at least that's the text editor for which he's
primarily developing Cursorless.

[slide 13]
I followed the install protocol outlined by Tara Roys. She has a collection
of tutorials on YouTube as well as on GitHub that are quite helpful. I
followed her tutorial for installing Talon on macOS without any issues,
but allow half an hour to an hour to go through the process. When
you're done, a Talon icon will appear in the menu bar on the Mac.
When it has a diagonal line across it, that means it's in the sleep state.
Clicking the icon leads to cascading pull-down menus; this is about all
there is to the GUI. One of your first tasks is to select the language model
that will be used to interpret the sounds that you generate as words. The
other key feature is that, under Scripting, there's a "View log" item
that opens a window displaying the log file. Whenever you make a
change in a Talon configuration file, that change is implemented
immediately. You do not have to restart Talon for the change
to take effect.

[slide 14]
This is an example of a Talon file. It has two components: a header
above the dash that describes the scope of the commands, and the
commands themselves below the dash. Each command is separated by
a blank line. If a voice command is mapped to multiple actions, these
are listed on separate indented lines below the first line. Words in
square brackets are optional. So, I have mapped the phrase "toggle
voice in" to the keyboard shortcut Alt-L in order to toggle Voice In on
or off. If I toggle Voice In on, I need to immediately toggle off Talon,
and this is done through a key command, Control-T, which is mapped
to speech toggle. Then there are a couple of other examples. The header
is an optional feature of Talon files; if no header is present, the
commands in the file apply in all situations, in all modes.
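
A minimal sketch of a Talon file along these lines (the header, phrases, and
key bindings are illustrative, not the exact file shown on the slide):

```
# Header: restrict these commands to macOS (illustrative scope).
os: mac
-
# Words in [square brackets] are optional; this command presses Alt-L in the
# active window to toggle the Voice In browser extension on or off.
toggle voice [in]: key(alt-l)

# A keyboard shortcut can also be bound to an action; here Control-T toggles
# Talon's own speech recognition so that both tools are not listening at once.
key(ctrl-t): speech.toggle()
```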

[slide 15]
Here we have two restrictions. These commands will only work when
using the iTerm2 terminal emulator for the Mac, and then only when the
title of the window contains a particular address, which is what
appears when I've logged into the supercomputer at the University of
Oklahoma. One of the commands in this file is "check jobs". It's mapped
to a bash alias called cj, for "check jobs", which in turn is mapped
to a script called checkjobs.sh that, when it's run, returns a listing of the
pending and running jobs on the supercomputer in a format that I find
pleasing. The backslash-n after cj, the newline character, enters the
command, so I don't have to do that as an additional step. Likewise,
there's a similar setup for interacting with an Ubuntu virtual machine.
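
A sketch of that kind of context-restricted Talon file (the window-title
pattern is a placeholder for my actual login host):

```
# Only active in the iTerm2 terminal, and only when the window title matches
# the (placeholder) address shown after logging in to the supercomputer.
app: iTerm2
title: /supercomputer.example.edu/
-
# Say "check jobs" to type the bash alias cj; the \n presses Enter so the
# command is submitted without an additional step.
check jobs: insert("cj\n")
```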

[slide 16]
In terms of picking up voice computing, these are my recommendations.
You're going to run into more errors than you may like initially, so
you need some patience in dealing with those. It will also take you
a while to get your head wrapped around Talon and how it works. You'll
definitely want to use custom commands to correct the errors and
shortcomings of the language models. And you've seen how
opening up projects by voice command can reduce the friction of
restarting work on a project. You've seen how Voice In is
preferred for more accurate dictation. I think my error rate is about
1 to 2 percent; that is, 1 to 2 out of 100 words are incorrect, versus
Talon Voice, where I think the error rate is closer to 5 percent. I have
put together a library of English contractions and their expansions for
Talon too, and they can be found on GitHub. And I have also posted
a quiz of 600 questions about some basic Talon commands.

[slide 17]
I'd like to thank the people who've helped me out on the Talon Slack 
channel and members of the Oklahoma Data Science Workshop where 
I gave an hour-long talk on this topic several weeks ago. I'd like to thank 
my friends at the Berlin and Austin Emacs Meetup and at the M-x research 
Slack channel. And I thank these grant funding agencies for supporting 
my work. I'll be happy to take any questions.

[[!inline pages="internal(2023/info/voice-after)" raw="yes"]]

[[!inline pages="internal(2023/info/voice-nav)" raw="yes"]]