[[!meta title="Enhancing productivity with voice computing"]]
[[!meta copyright="Copyright © 2023 Blaine Mooers"]]
[[!inline pages="internal(2023/info/voice-nav)" raw="yes"]]

<!-- Initially generated with emacsconf-publish-talk-page and then left alone for manual editing -->
<!-- You can manually edit this file to update the abstract, add links, etc. --->

# Enhancing productivity with voice computing
Blaine Mooers (he/him/his) - Pronunciation: pronounced like "moors", blaine-mooers(at)ouhsc.edu, <https://basicsciences.ouhsc.edu/bmb/Faculty/bio_details/mooers-blaine-hm-phd>, <https://twitter.com/BlaineMooers>, <https://github.com/MooersLab>, <https://codeberg.org/MooersLab>, mastodon(at)bhmooers

[[!inline pages="internal(2023/info/voice-before)" raw="yes"]]

Voice computing uses speech recognition software to convert speech into text, commands, or code.
While there is a venerated program called Emacspeak for converting text into speech, an
"EmacsListens" for converting speech into text is not available yet.
The Emacs Wiki describes the underdeveloped situation for speech-to-text in Emacs.
I will explain how two external software packages convert my speech into text and computer
commands that can be used with Emacs.

First, I present some motivations for using voice computing.
These can be divided into two categories: productivity improvement and health-related issues.
In this second category, there is the underappreciated cure for "standing desk envy";
the cure is achievable with a large dose of voice computing while standing.

I found one software package (Voice In Plus, <https://dictanote.co/voicein/plus/>) to be quite
accurate for speech-to-text, or dictation, but less versatile for speech-to-commands.
I have used this package daily, and I found a three-fold increase in my daily word count almost
immediately.
Of course, there are limits here; you can talk for only so many hours per day.

Second, I found another software package (Talon Voice, <http://talon.wiki/>) that has a less
accurate language model but that supports custom commands that can be executed anywhere you can
place the cursor, including in virtual machines and on remote servers.
Talon Voice will appeal to those who like to tinker with configuration files, yet it is easy to
use.

I will explain how I have integrated these two packages into my workflow.
I have developed a library of commands that expand 94 English contractions when spoken.
This library eliminates tedious downstream editing of formal prose where I do not use
contractions.
The library is available on GitHub for both Voice In Plus
(<https://github.com/mooersLab/voice-in-plus-contractions>) and Talon Voice
(<https://github.com/MooersLab/talon-contractions>).
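
The idea behind the contraction library can be sketched in a few lines of Python. This is not the code in the repositories above, which are written for Voice In Plus and Talon Voice; it is a minimal illustration that post-processes dictated text instead. The five-entry table and the function name `expand_contractions` are assumptions made for the example, and ambiguous forms such as "it's" (which can also mean "it has") are resolved to one reading only.

```python
import re

# Hypothetical five-entry sample; the library described above covers 94 contractions.
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",  # assumption: "it has" is the other possible reading
    "I'll": "I will",
    "won't": "will not",
}

# Case-insensitive lookup table and one combined pattern for all contractions.
_LOOKUP = {key.lower(): value for key, value in CONTRACTIONS.items()}
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction in text with its expanded form."""

    def replace(match):
        word = match.group(0)
        expansion = _LOOKUP[word.lower()]
        # Keep a leading capital, for example at the start of a sentence.
        if word[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return _PATTERN.sub(replace, text)


print(expand_contractions("I don't think it's ready, and I'll wait."))
```

A table-driven design like this keeps the mapping separate from the matching logic, so extending it to the full set of contractions only means adding dictionary entries.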

I also supply interactive quizzes to master the basic Voice In commands
(<https://github.com/MooersLab/voice-in-basics-quiz>) and the Talon Voice phonetic alphabet
(<https://github.com/MooersLab/talon-voice-quizzes/qTalonAlphabet.py>).
I learned the Talon alphabet in one day by taking the quiz at spaced intervals.
The quiz took only 60 seconds to complete when I was proficient.
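
A quiz of this kind can be sketched as follows. This is not the `qTalonAlphabet.py` script linked above; it is a small illustration under stated assumptions. The letter-to-word table is the default Talon community alphabet as best recalled here and should be checked against your own Talon configuration, and the helper names `spell`, `shuffled_letters`, and `grade` are hypothetical.

```python
import random

# Assumption: default Talon community alphabet; verify against your own setup.
TALON_ALPHABET = {
    "a": "air", "b": "bat", "c": "cap", "d": "drum", "e": "each",
    "f": "fine", "g": "gust", "h": "harp", "i": "sit", "j": "jury",
    "k": "crunch", "l": "look", "m": "made", "n": "near", "o": "odd",
    "p": "pit", "q": "quench", "r": "red", "s": "sun", "t": "trap",
    "u": "urge", "v": "vest", "w": "whale", "x": "plex", "y": "yank",
    "z": "zip",
}


def spell(word):
    """Return the Talon words that spell out the letters of word."""
    return [TALON_ALPHABET[ch] for ch in word.lower() if ch in TALON_ALPHABET]


def shuffled_letters(seed=None):
    """Return the 26 letters in a random quiz order."""
    letters = list(TALON_ALPHABET)
    random.Random(seed).shuffle(letters)
    return letters


def grade(responses):
    """Score a dict of letter -> answered word; return (correct, missed letters)."""
    missed = sorted(
        letter
        for letter, answer in responses.items()
        if TALON_ALPHABET.get(letter) != answer.strip().lower()
    )
    return len(responses) - len(missed), missed


print(spell("Emacs"))
print(grade({"a": "air", "b": "bat", "c": "cat"}))
```

Returning the missed letters, and not only the score, makes it easy to repeat just those letters at the next spaced interval.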

I store my daily writing in a multi-file LaTeX document with one tex file per day.
The 365 files are compiled into one PDF per year, which usually runs to about 1000 pages.
I am not going to push my luck with a multiyear document.
Each month is a chapter. The resulting PDF is a breeze to scroll and search.
It has an autogenerated table of contents and an index. I have posted
a blank version for 2023 and another for the upcoming year
(<https://github.com/MooersLab/diary2024inLaTeX>).
One could take a similar approach in org-mode by using Bastian Bechtold's 
org-journal package (<https://github.com/bastibe/org-journal>).
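
The layout described above (one tex file per day, one chapter per month, a table of contents and an index) can be generated with a short script. The sketch below is not taken from the diary repository; the file-naming scheme (`YYYY-MM-DD.tex`), the `book` document class, and the function names are assumptions chosen for illustration.

```python
import calendar
from datetime import date, timedelta


def daily_filenames(year):
    """Yield (date, filename) pairs, one per day of the given year."""
    day = date(year, 1, 1)
    while day.year == year:
        # Hypothetical naming scheme: 2023-01-01.tex, 2023-01-02.tex, ...
        yield day, f"{day.isoformat()}.tex"
        day += timedelta(days=1)


def master_document(year):
    """Return a LaTeX master file with one chapter per month and one input per day."""
    lines = [
        r"\documentclass{book}",
        r"\usepackage{makeidx}",
        r"\makeindex",
        r"\begin{document}",
        r"\tableofcontents",
    ]
    current_month = None
    for day, filename in daily_filenames(year):
        if day.month != current_month:
            # Start a new chapter whenever the month changes.
            current_month = day.month
            lines.append(rf"\chapter{{{calendar.month_name[day.month]}}}")
        # Drop the ".tex" extension, which \input adds by itself.
        lines.append(rf"\input{{{filename[:-4]}}}")
    lines += [r"\printindex", r"\end{document}"]
    return "\n".join(lines)


print(master_document(2023).splitlines()[5])
```

Writing the returned string to a master tex file and creating an empty tex file for each name from `daily_filenames` gives a blank skeleton for the year; leap years are handled because the loop walks the calendar instead of assuming 365 days.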

I gave a 60-minute talk on this topic to the Oklahoma Data Science Workshop
on November 16, 2023 (<https://mediasite.ouhsc.edu/Mediasite/Channel/python>).
This workshop meets once a month and is for people interested in data 
science and scientific computing. You do not have to be an Oklahoma
resident to attend. Send me an e-mail if you want to be added to our mailing list.

# About the speaker:

I am an Associate Professor of Biochemistry at the University of
Oklahoma Health Sciences Center. I use X-ray crystallography to study
the structures of RNA, proteins, and protein-drug complexes. I have
been using Python and LaTeX for a dozen years, and Jupyter Notebooks
since 2013. I have been using Emacs every day for 2.5 years. I
discovered voice computing this summer when my chronic repetitive
stress injury flared up while entering data in a spreadsheet. I
tripled my daily word count by using speech-to-text, and I get a
kick out of running remote computers by speech-to-command.

# Discussion

## Questions and answers

-   Q: Comment: there is a text-to-command tool called clipea that
    would be awesome <https://github.com/dave1010/clipea>
    -   A: <https://sourceforge.net/projects/sox/> is also a good
        alternative.
-   Q: Could you comment on how speaking vs. typing affects your
    logic/content.  Thanks!
    -   A: I find that this is like the difference between writing your thoughts
        down on a blank piece of printer paper versus paper bound in a
        leather notebook. I do not think there is any real difference. I know
        that some people believe there is a definite difference, but I am
        using this for the purpose of generating the first draft. Because
        my skills at using my voice to edit my text are still not very well
        developed, I am still more efficient using the keyboard for that
        stage.

		So the hardest part about
		writing generally is getting the first crappy draft written. I
		have found that dictation is perfectly fine for that phase. I
		find it actually very conducive for just getting the text out. The
		biggest problem that most of us have is applying our internal editor and
		that inhibits us from generating words in a free-flowing
		fashion. 

		I generally do my generative writing--actually, I divide my writing
		into two categories: generative writing (generating the first crappy
		draft) and then rewriting. Rewriting is probably 80-90% of writing
		where you can go back and rework the order of the sentences, order of
		paragraphs, the order of words in a sentence and so forth. It is
		really hard work that is best done later in the day when I am more
        awake. I do my generative writing first thing in the morning when I
        feel horrible. That is when my internal editor is not very awake and I
        can get more words past that gatekeeper. I can do this
        sitting down. I can do this standing up. I can do this 20 feet away
        from my computer looking out the window to give my eyes a break. I find
        it is just very enjoyable to use it in this fashion. The downside is
        that I wind up generating three times as much text. That makes for
        three times as much work when it comes to rewriting the text, and that
        means I am using the keyboard a lot later on in the day.

		I have not made any progress on recovering from my own repetitive
		stress injury. I hope that I will add the use of voice commands,
		speech-to-commands, for editing the text in the future and I will
		eventually give my hands more of a break.

		This allows you to actually separate those two activities not only by
		time... So many professional writers will spend several hours in the
		morning doing the generative part and then they will spend the rest of
        the day rewriting. They have separated these two activities temporally.
        What most people actually do is write one sentence of the generative
        part and then apply that internal editor
		right away because they want to write the first draft as a perfect
		version, as a final draft, and that is what slows them down
		dramatically.

		This also allows you to separate these two activities in terms of
		modality. You are going to do the generative writing by Voice In, the
		rewriting by keyboard. I think this is like what most people... One way
		that many people can get into using speech-to-text in a productive way
		that sounds great...
    -   A: (not the author, just an audience member): So, for example, when
        you're talking, you have an immense feeling of the topic you
        have. You can close your eyes and do your body gestures to
        manipulate a concept or idea, and you have... I just feel you
        feel more creative than just tapping. Definitely you have much
        more of a speed advantage over tapping, but the more important thing is
        you use your body as a whole to interact with those ideas.
        \[this one is done via voice...\]
        -   but typing is definitely good for accurate control, such as
            M-x some-command ...
-   Q: Have you tried the ChatGPT voice chat interface, and if so, how has
    your experience of it been? As someone experienced with voice
    control, interested to hear your thoughts, performance relative to
    the open source tools in particular. 
    -   A: I do not have much experience with that particular software. I have
        used Whisper a little bit, and so that is related. Of course, you have
        this problem of lag. I find that Whisper is good for spitting out a
        sentence, maybe for a docstring in a programming file. I find that it
        is very prone to hallucinations. I find myself spending half my
        time deleting the hallucinations, and I feel like the net gain is
        diminished as a result, or there is not much of a net gain in terms of
        what I am getting out of it.
-   Q: Are any of these voice command/dictation tools freemium?
    -   A: To be able to add custom commands in Voice In, you have to pay
        $48 a year. The Talon Voice software is free and the only
		limitation there is access to the language model. If you want to get
		the beta version, you need to subscribe to Patreon to support the
		developer. I did that, and I really did not find much of
		an improvement. I really do not intend to do that in the future.
		But otherwise in Talon Voice, everything is open and free. The Slack
		community is incredibly welcoming. Its parallels with
		the Emacs Community are pretty striking.
-   Q: How good is Talon compared to Whisper?
    - A: With Talon, when I am doing dictation, I find that the first part
        of the sentence will be fairly accurate, and then towards
        the end, the errors... In general, I think its error rate is
        about five words out of 100 or so. Whisper is
        wonderful because it will insert punctuation for you, but I
        guess its errors are longer in that it will hallucinate full
        sentences for you. So they both have significant error rates.
        They are just different kinds of errors. Hopefully, both over
        time... [Talon] errors are generally shorter in extent. It does
        not hallucinate for as long.
- Q: Are any of those voice command/dictation tools libre? I cannot find that information on the web.
  - (not the speaker):
    - This FAQ <https://talon.wiki/faq/> says that Talon Voice is closed source.
    - Talon Voice is non-free: <https://talonvoice.com/EULA.txt>
    - Mistral 7B is under the Apache 2.0 license, i.e., no restrictions


## Notes

- From the speaker: I really appreciate the high level of accuracy that I am getting from
Voice In. I would use Talon Voice for dictation, but at this point,
there is a significant difference between the level of accuracy of
Voice In versus Talon Voice. The difference is large enough that I will
probably use Voice In for a while until I can figure out how to get
Talon Voice to generate more accurate text.
-   When you do Org mode and you have the bullets, it allows you to naturally shard your thoughts in a way that is really easy to edit. ... It has a
summarizing capability. It allows you to pull back and get an
overview.
- Great stuff, definitely going to test-drive Talon


[[!inline pages="internal(2023/info/voice-after)" raw="yes"]]

[[!inline pages="internal(2023/info/voice-nav)" raw="yes"]]