Experiments with voice control on Linux
A few months ago, I was messing around with creating a dictation tool in PureScript that used sherpa-onnx to run a variety of the best open source STT models efficiently. One of my main aims was to create a practical, if basic, voice input tool for Linux that was packaged for Flatpak and Snapcraft distribution. How nice would it be, I thought, if there were a voice input tool Linux-focused enough to be packaged in the most convenient way for Linux users?
Unfortunately, I found Snapcraft difficult to work with, and I didn't have the time or patience to finish the packaging part of it. On the other hand, I did make a little voice input tool with a simple CLI interface, and I thought I would write this post to explain both that tool and the other experiment with voice input that I did a few years ago but never wrote about.
Background
I've had an interest in voice-based input since 2019, when I experienced a brief period of chronic pain due to poor typing ergonomics. In the process of researching alternative input methods, I discovered a whole subculture of programmers experimenting with voice-based coding. One of my early introductions to this idea was this YouTube video, which shows programming in Emacs using Dragonfly, an abstraction layer over Dragon and other STT providers. In a similar vein, this video presents voice coding using Talon.
I'm far from an expert in this field, so take my opinions with a grain of salt, but some of my observations about the software options for voice-based coding and how they've evolved over time are that:
- There has been a trend towards diversification of STT models and engines. Early systems, as far as I can tell, often relied on one of a small number of proprietary speech recognition engines, like Dragon, under the hood. Over time, projects like Talon, Serenade, and others started to develop their own in-house STT models.
- Dragon does not run on Linux, so if a Linux user wanted to code by voice in the early days, they often had to create some wild setup involving running Windows in a virtual machine and shuttling transcriptions over the network to the host system. Over time, the projects with their own proprietary STT models were able to support Linux more easily and could offer AppImages that (sometimes) worked out of the box. As of the last time I checked a few months ago, I still feel that there is no option that checks all of these boxes: open source, highly accurate, and easy to install and run on Linux out of the box.
- The design space is very large and there have been lots of creative experiments. Voice-input tools vary along many dimensions, but one of the most important is what I think of as "low-level" vs. "high-level." A "low-level" tool is one in which you are expected to write a "grammar" that maps regular expressions to actions such as system commands to run, keystrokes to execute, and so on. Every "utterance" that you speak is matched against the regular expressions you defined, and if there is a match, the corresponding action is executed. (By "utterance," I mean a silence-delimited segment of user speech captured from the microphone.) The low-level tool might provide utilities for basics like executing keystrokes or spawning system commands, but not much more; there is a sketch of this kind of matching loop just after this list. A "high-level" voice input tool, by contrast, comes with a batteries-included grammar out of the box. In the context of voice-based coding, that grammar would include commands for high-level operations like renaming a variable across your whole project. Dictating source code with a low-level tool often meant creating an efficient grammar for dictating key chords and then using it to control text editors like Vim or Emacs, which are already designed to edit text efficiently with a small number of keystrokes. Serenade was an example of a high-level tool: it integrated with IDEs out of the box and provided lots of high-level commands for writing code in different languages.
- Related to what I'm calling the "low-level"/"high-level" distinction, there is the "symbolic" vs. "connectionist" distinction. A low-level tool as described above, where you configure how your speech maps to commands by writing a system of regular expressions, could be called symbolic: it's a deterministic, human-designed mapping from patterns of speech to actions. On the other hand, something like a voice interface to an LLM is connectionist. In that case, the mapping from patterns of speech to actions (say, tools that the LLM is equipped with that trigger actions on your computer) is much fuzzier and more probabilistic. It's up to the internals of the LLM, which no one truly understands, to decide whether a given utterance should trigger a particular action. I'm not really sure how Serenade was built, but IMO it definitely had the feel of a connectionist system.
- I'm not sure about the exact relation between the low-level/high-level distinction and the symbolic/connectionist distinction. I suspect that symbolic pairs naturally with low-level and connectionist with high-level, but there could be exceptions. If you do most of your development in a terminal-based text editor like Vim, for instance, then a high-level, connectionist, LLM-based voice input system would be inefficient (slower, less predictable, unnecessarily costly). If you, like most people, use a mainstream IDE like VSCode and want to issue high-level commands like "add an environment variable named supabase secret key to settings.py" without memorizing a complex grammar, then you need a high-level, connectionist system. I think the ideal system would combine the two, coming with a powerful high-level grammar out of the box but also letting you drop into an efficient low-level grammar for more fine-grained, predictable input when needed.
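To make the "low-level" end of that spectrum concrete, here is a minimal sketch of the matching loop such a tool runs. It is not taken from any of the tools above, and the example phrases, regular expressions, and commands are made up.
import re
import subprocess

# A hypothetical "low-level" grammar: each regular expression is paired with an
# action that receives the regex match object.
GRAMMAR = [
    (re.compile(r"^open terminal$"),
     lambda m: subprocess.Popen(["x-terminal-emulator"])),  # command name is illustrative
    (re.compile(r"^say (?P<words>.+)$"),
     lambda m: print(m.group("words"))),
]

def handle_utterance(transcription: str) -> None:
    """Match one silence-delimited utterance against the grammar."""
    for pattern, action in GRAMMAR:
        match = pattern.match(transcription)
        if match:
            action(match)
            return
    # No rule matched; a real tool might fall back to plain dictation here.

# These strings stand in for transcriptions coming from an STT engine.
handle_utterance("say hello world")   # prints "hello world"
handle_utterance("open terminal")     # spawns a terminal, if that command exists
Real tools layer a lot on top of this loop (dynamic grammars, editor integration, continuous dictation modes), but the core shape is roughly this.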
Over the years, I've experimented with a few of these voice input programs, but for whatever reason, they have never become a stable part of my workflow. I recovered from my pain in 2019 mostly by improving my ergonomics and diversifying my input devices - alternating between right-handed and left-handed mice and using foot pedals for clicking. Still, that experience in 2019 sparked a long-standing interest in voice-based input, and a few years ago I tried to build a voice-coding tool as a side project.
Vocoder
My first experiment with trying to create one of these tools myself was Vocoder. With Vocoder, I tried to distill what I saw as the core idea behind existing "speech command grammars" and generalize it. A simplified model of programs like Dragonfly is that they identify utterances - segments of user speech delimited by silence - and match their transcriptions against a collection of user-defined regular expressions. Each regular expression is associated with an action, such as executing a system command or calling a Python function. The utterance may also be parsed to extract structured data for the action.
Conceptually, another way of viewing this type of system is that the user has defined a grammar that specifies a finite state machine (FSM). As the user speaks within a given utterance, each word advances the FSM through states, and when the utterance ends, the final state determines what action is executed. For the next utterance, we again start at the initial state of the FSM.
In Vocoder, I explored a generalization of this idea in which the user's speech throughout the entire session, rather than within a particular utterance, is the input tape to the FSM. Intermediate states can trigger actions and we do not reset to the FSM's initial state for each utterance.
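To make that concrete before looking at the DSL itself, here is a rough, hand-rolled sketch (not Vocoder's actual implementation) of a matcher whose state survives across utterances, using a simple sleep/wake toggle as the example; the DSL version of the same idea appears below.
# A minimal sketch of a session-level state machine. Its state is not reset at
# utterance boundaries, and actions fire on intermediate transitions.
class SessionMatcher:
    def __init__(self):
        self.asleep = False  # state that outlives any single utterance

    def feed_word(self, word: str) -> None:
        if self.asleep:
            if word == "wake":
                self.asleep = False  # transition with no visible action
            return
        if word == "sleep":
            self.asleep = True
            return
        print(word, end=" ")  # action: dictate the word

    def feed_utterance_boundary(self) -> None:
        # An utterance boundary is just another input symbol; nothing resets.
        print()

matcher = SessionMatcher()
for utterance in [["hello", "world"], ["sleep", "ignore", "this", "wake"], ["back", "again"]]:
    for word in utterance:
        matcher.feed_word(word)
    matcher.feed_utterance_boundary()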
This model makes it possible to express higher‑level concepts directly in the grammar. For example, the following Vocoder program defines a global “sleep mode.” When active, speech is ignored until the user explicitly wakes the system up again. The grammar is written using Vocoder’s embedded DSL, which is essentially a special syntax for defining regular expressions:
from vocoder.app import App
from vocoder.grammar import Grammar
from vocoder.lexicons import en_frequent
g = Grammar()
g(f"""
!start = < ~(vocoder sleep) <* :en - vocoder > ~(vocoder wake)
| ~< :en - vocoder > -> %{g(lambda words: print(" ".join(words)))}
>
:en = :{g(en_frequent(30_000))}
""")
App(g).run()
When you run this program, Vocoder continuously prints your speech to stdout. Saying “vocoder sleep” causes it to ignore all input until you say “vocoder wake,” at which point transcription resumes. Notably, the entire concept of a sleep mode is expressed purely in the grammar, using regular‑expression‑like constructs.
In plain English, the grammar says something like:
Say either A or B one or more times, where A is the phrase "vocoder sleep" followed by zero or more words followed by "vocoder wake," and B is zero or more words that don't include "vocoder." Whenever B is said, print those words. When A is said, do nothing.
(The details of the DSL are in the repo.)
An expression prefixed with a tilde must be matched within a single utterance; otherwise it does not match. This lets utterance boundaries participate meaningfully in state transitions and makes the model strictly more general than simple utterance-based matching (i.e. the Dragonfly-like system that we started with).
Vocoder’s DSL also takes advantage of Python’s f‑strings and dynamic typing. Word lists, dictionaries (for example, mapping spoken numbers to numeric values), and action functions can be injected directly into the grammar at the point where they are conceptually relevant.
Another feature of the DSL is structured capture. Grammar expressions are parsed into a tree that mirrors their regex structure, and named captures are passed to action functions based on parameter names. For example, the following program defines a grammar to execute key chords by voice:
from pynput.keyboard import Controller, Key
from vocoder.app import App
from vocoder.grammar import Grammar
keyboard = Controller()
def execute_chord(mods, term):
    # Hold the captured modifier keys while tapping the terminal key.
    with keyboard.pressed(*mods):
        keyboard.press(term)
        keyboard.release(term)
g = Grammar()
g(f"""
!start = < !chord >
!chord ~= <*:modifier> @mods :terminal @term => %{g(execute_chord)}
:modifier = :{g({
"super": Key.cmd,
"control": Key.ctrl,
"shift": Key.shift,
"meta": Key.alt,
# etc
})}
:terminal = :{g({
"alfa": "a",
"bravo": "b",
"charlie": "c",
# etc
})}
""")
The structured capture feature means that the action execute_chord receives a list-like container of pynput key objects as its mods argument, which makes its function definition very concise. That's because @mods captures the <*:modifier> expression, which gathers zero or more "modifiers" into a list, and the definition of ":modifier" is a dictionary that maps spoken words to pynput key objects.
Vocoder was a fun side project to work on, but unfortunately the underlying STT model is not very accurate, especially without a top-notch microphone. Moreover, the grammar system was tightly integrated with that specific STT model, which makes it difficult to plug in different models.
Voice
So, that was Vocoder. As mentioned earlier, a few months ago I experimented with writing a new dictation tool from scratch in PureScript, which I have named "Voice"; the source code is now available on GitHub. Unlike Vocoder, I didn't make Voice to explore any particular idea about voice input; it was just a side project. I was aiming to make a simple dictation tool that I could package for the main modern Linux software distribution channels, Snap and Flatpak. I also wanted to see how easy it would be to use a purely functional programming language like PureScript, together with modern open source STT libraries and models, to make something that works at a basic level.
For speech recognition, I used sherpa-onnx. Sherpa-onnx provides a Node addon that Voice uses to run models efficiently, and it uses ONNX to integrate lots of models from different research groups and make them available through a common interface. At the moment, you can use Voice with
- Parakeet V2
- Parakeet V3
- Moonshine
- Whisper tiny
- Whisper small
You can download one of these models by running something like
voice download-model -m parakeetv3
Or rather, you will be able to invoke Voice like that once it is installable via Flatpak or Snap. Until then, you will have to clone the repo, install the dependencies, and invoke it with npm run start. So the above command would in fact be
npm run start -- download-model -m parakeetv3
Still, because it looks cooler, I will continue to describe the interface as if it's available via the command "voice."
Voice has commands to record audio, transcribe audio, and either dictate text or execute commands, and these can be combined using pipes. For example, the following pipeline records audio, transcribes it, and then types what you said, using xdotool to simulate keystrokes:
voice record | voice transcribe -m parakeetv3 | voice dictate
If you have a yaml file, commands.yaml, that maps phrases to system commands, you can instead execute commands when their trigger phrases are spoken:
voice record | voice transcribe -m parakeetv3 | voice execute -c commands.yaml
The program comes with Moonshine as the default model, which is used if you omit a model choice like -m parakeetv3.
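Under the hood, the transcribe step hands the recorded audio to sherpa-onnx. Voice drives it through the Node addon, but sherpa-onnx also ships Python bindings with a very similar shape, so here is a rough sketch of what offline transcription looks like with them. The model file names are placeholders for wherever the downloaded model lives, and I'm assuming a 16 kHz mono WAV read with the soundfile package.
import sherpa_onnx
import soundfile as sf

# Build an offline recognizer from a downloaded Whisper model.
# The paths depend on which model you downloaded and where you put it.
recognizer = sherpa_onnx.OfflineRecognizer.from_whisper(
    encoder="whisper-tiny-encoder.onnx",
    decoder="whisper-tiny-decoder.onnx",
    tokens="whisper-tiny-tokens.txt",
    num_threads=2,
)

# Read the recording as mono float32 samples in [-1, 1].
samples, sample_rate = sf.read("recording.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)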
I'm not sure what the future of Voice is, although I do plan on finishing the packaging of it for Snap and Flatpak at some point.