Ricky's Blog

Experiments with voice control on Linux

A few months ago, I was messing around with creating a dictation tool in PureScript that used sherpa-onnx to run a variety of the best open source STT models efficiently. One of my main aims was to create a practical, if basic, voice input tool for Linux that was packaged for Flatpak and Snapcraft distribution. How nice would it be, I thought, if there was a voice input tool that was Linux-focused enough to be packaged in the maximally convenient way for Linux users?

Unfortunately, I found Snapcraft difficult to work with, and I didn't have the time or patience to finish the packaging part of it. On the other hand, I did make a little voice input tool with a simple CLI interface, and I thought I would write this post to explain both that tool and the other experiment with voice input that I did a few years ago but never wrote about.

Background

I've had an interest in voice-based input since 2019, when I experienced a brief period of chronic pain due to poor typing ergonomics. In the process of researching alternative input methods, I discovered a whole subculture of programmers experimenting with voice-based coding. One of my early introductions to this idea was this YouTube video, which shows programming in Emacs using Dragonfly, an abstraction layer over Dragon and other STT providers. In a similar vein, this video presents voice coding using Talon.

I'm far from an expert in this field, so take my opinions with a grain of salt, but some of my observations about the software options for voice-based coding and how they've evolved over time are that:

Over the years, I've experimented with a few of these voice input programs, but for whatever reason, they have never become a stable part of my workflow. I recovered from my pain in 2019 mostly by improving my ergonomics and diversifying my input devices: alternating between right-handed and left-handed mice and using foot pedals for clicking. Still, that experience sparked a long-standing interest in voice-based input, and a few years ago I tried to build a voice-coding tool as a side project.

Vocoder

My first attempt at creating one of these tools myself was Vocoder. With Vocoder, I tried to distill what I saw as the core idea behind the existing notion of a "speech command grammar" and generalize it. A simplified model of programs like Dragonfly is that they identify utterances, or segments of user speech delimited by silence, and match their transcriptions against a collection of user-defined regular expressions. Each regular expression is associated with an action, such as executing a system command or calling a Python function. The utterance may also be parsed to extract structured data for the action.
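
As a rough Python sketch of that simplified model (this is purely illustrative, not Dragonfly's actual API), the core loop might look something like this:

import re

# Hypothetical command table: each regular expression is paired with an action.
COMMANDS = [
    (re.compile(r"^open (?P<app>\w+)$"), lambda m: print(f"launching {m['app']}")),
    (re.compile(r"^say (?P<text>.+)$"), lambda m: print(m["text"])),
]

def handle_utterance(transcript):
    # Match one silence-delimited utterance against the command table.
    for pattern, action in COMMANDS:
        match = pattern.match(transcript)
        if match:
            action(match)  # named groups give the action structured data
            return
    # No rule matched: ignore the utterance (or fall back to plain dictation).

handle_utterance("say hello world")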

Conceptually, another way of viewing this type of system is that the user has defined a grammar that specifies a finite state machine (FSM). As the user speaks within a given utterance, each word advances the FSM through states, and when the utterance ends, the final state determines what action is executed. For the next utterance, we again start at the initial state of the FSM.

In Vocoder, I explored a generalization of this idea in which the user's speech throughout the entire session, rather than within a particular utterance, is the input tape to the FSM. Intermediate states can trigger actions, and the FSM does not reset to its initial state after each utterance.

This model makes it possible to express higher‑level concepts directly in the grammar. For example, the following Vocoder program defines a global “sleep mode.” When active, speech is ignored until the user explicitly wakes the system up again. The grammar is written using Vocoder’s embedded DSL, which is essentially a special syntax for defining regular expressions:

from vocoder.app import App
from vocoder.grammar import Grammar
from vocoder.lexicons import en_frequent

g = Grammar()
g(f"""
!start = < ~(vocoder sleep) <* :en - vocoder > ~(vocoder wake)
           |  ~< :en - vocoder > -> %{g(lambda words: print(" ".join(words)))}
         >
:en = :{g(en_frequent(30_000))}
""")

App(g).run()

When you run this program, Vocoder continuously prints your speech to stdout. Saying “vocoder sleep” causes it to ignore all input until you say “vocoder wake,” at which point transcription resumes. Notably, the entire concept of a sleep mode is expressed purely in the grammar, using regular‑expression‑like constructs.

In plain English, the grammar says something like:

Say either A or B one or more times, where A is the phrase "vocoder sleep" followed by zero or more words followed by "vocoder wake," and B is zero or more words that don't include "vocoder." Whenever B is said, print those words. When A is said, do nothing.
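
For intuition, here is the same behavior written as ordinary imperative code. This is only an illustrative sketch, not anything Vocoder generates, and for simplicity it assumes "vocoder sleep" and "vocoder wake" are each spoken as their own utterance:

def print_unless_asleep(utterances):
    # Sketch of the sleep-mode behavior described above.
    asleep = False
    for utterance in utterances:
        if asleep:
            if utterance == "vocoder wake":
                asleep = False
        elif utterance == "vocoder sleep":
            asleep = True
        else:
            print(utterance)

print_unless_asleep([
    "hello there",
    "vocoder sleep",
    "none of this is printed",
    "vocoder wake",
    "back again",
])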

(The details of the DSL are in the repo.)

Expressions prefixed with a tilde must occur within a single utterance, or else they are not considered part of the grammar. This lets utterance boundaries participate meaningfully in state transitions and makes the model strictly more general than simple utterance‑based matching (i.e. the Dragonfly-like system that we started with).

Vocoder’s DSL also takes advantage of Python’s f‑strings and dynamic typing. Word lists, dictionaries (for example, mapping spoken numbers to numeric values), and action functions can be injected directly into the grammar at the point where they are conceptually relevant.

Another feature of the DSL is structured capture. Grammar expressions are parsed into a tree that mirrors their regex structure, and named captures are passed to action functions based on parameter names. For example, the following program defines a grammar to execute key chords by voice:

from pynput.keyboard import Controller, Key
from vocoder.grammar import Grammar  # as in the first example

keyboard = Controller()

def execute_chord(mods, term):
    with keyboard.pressed(*mods):
        keyboard.press(term)
        keyboard.release(term)

g = Grammar()

g(f"""
!start = < !chord >
!chord ~= <*:modifier> @mods :terminal @term => %{g(execute_chord)}

:modifier = :{g({
    "super": Key.cmd,
    "control": Key.ctrl,
    "shift": Key.shift,
    "meta": Key.alt,
    # etc
})}
:terminal = :{g({
    "alfa": "a",
    "bravo": "b",
    "charlie": "c",
    # etc
})}
""")

The structured capture feature means that the action execute_chord receives a list-like container of pynput key objects as its mods argument, which keeps the function definition very concise. That's because @mods captures the <*:modifier> expression, which gathers zero or more "modifiers" into a list, and the definition of ":modifier" is a dictionary that maps spoken words to pynput key objects.
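
Concretely, saying something like "control shift charlie" should result in a call along these lines (an illustrative invocation, with the dictionary lookups already applied):

execute_chord(mods=[Key.ctrl, Key.shift], term="c")  # presses Ctrl+Shift+C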

Vocoder was a fun side project to work on, but unfortunately the STT model it relies on is not especially accurate, particularly without a top-notch microphone. Moreover, the grammar system was tightly integrated with that specific STT model, which makes it difficult to plug in different models.

Voice

So, that was Vocoder. As mentioned earlier, a few months ago I experimented with writing a new dictation tool from scratch in PureScript, which I've named "Voice"; the source code is now available on GitHub. Unlike Vocoder, Voice wasn't built to explore any particular idea about voice input. I was aiming to make a simple dictation tool that I could package for the main modern Linux software distribution channels, Snap and Flatpak, and I wanted to see how easy it would be to use a purely functional programming language like PureScript, together with modern open source STT libraries and models, to make something that works at a basic level.

For speech recognition, I used sherpa-onnx. Sherpa-onnx uses a Node addon to run models efficiently, and it uses ONNX to integrate models from many different research groups and make them available through a common interface. At the moment, you can use Voice with a few of these models, including parakeetv3 and moonshine.

You can download one of these models by running something like

voice download-model -m parakeetv3

Or rather, you will be able to invoke Voice like that once it is installable via Flatpak or Snap. Until then, you will have to clone the repo, install the dependencies, and invoke it with npm run start. So the above command would in fact be

npm run start -- download-model -m parakeetv3

Still, because it looks cooler, I will continue to describe the interface as if it's available via the command "voice."

Voice has subcommands to record audio, transcribe audio, and either dictate text or execute commands, and these can be combined using pipes. For example, the following pipeline records audio, transcribes it, and then types what you said, using xdotool to simulate keystrokes:

voice record | voice transcribe -m parakeetv3 | voice dictate

If you have a YAML file, commands.yaml, that maps phrases to system commands, you can instead execute those commands when their trigger phrases are spoken:

voice record | voice transcribe -m parakeetv3 | voice execute -c commands.yaml
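
For illustration, a commands.yaml along these lines would map each trigger phrase to a shell command. The phrases and commands here are just hypothetical examples; see the repo for the exact format:

# Hypothetical example; see the repo for the exact format.
"open browser": "firefox"
"lock screen": "loginctl lock-session"
"new terminal": "gnome-terminal"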

The program comes with moonshine as the default model, which is used if you omit a model flag like -m parakeetv3.

I'm not sure what the future of Voice is, although I do plan on finishing the packaging of it for Snap and Flatpak at some point.