<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="pretty-atom-feed.xsl" type="text/xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Ricky&#39;s Blog</title>
  <subtitle>Programming and stuff</subtitle>
  <link href="https://blog.ricky0123.com/feed/feed.xml" rel="self" />
  <link href="https://blog.ricky0123.com/" />
  <updated>2026-02-15T00:00:00Z</updated>
  <id>https://blog.ricky0123.com/</id>
  <author>
    <name>Ricky Samore</name>
  </author>
  <entry>
    <title>Experiments with voice control on Linux</title>
    <link href="https://blog.ricky0123.com/blog/voice/" />
    <updated>2026-02-15T00:00:00Z</updated>
    <id>https://blog.ricky0123.com/blog/voice/</id>
    <content type="html">&lt;p&gt;A few months ago, I was experimenting with building a dictation tool in PureScript that used &lt;a href=&quot;https://github.com/k2-fsa/sherpa-onnx&quot; target=&quot;_blank&quot;&gt;sherpa-onnx&lt;/a&gt; to run a variety of the best open source STT models efficiently. One of my main aims was to create a practical, if basic, voice input tool for Linux that was packaged for Flatpak and Snapcraft distribution. How nice would it be, I thought, if there were a voice input tool that was Linux-focused enough to be packaged in the most convenient way for Linux users?&lt;/p&gt;
&lt;p&gt;Unfortunately, I found Snapcraft difficult to work with, and I didn&#39;t have the time or patience to finish the packaging part of it. On the other hand, I did make a little &lt;a href=&quot;https://github.com/ricky0123/voice&quot; target=&quot;_blank&quot;&gt;voice input tool&lt;/a&gt; with a simple CLI, and I thought I would write this post to explain both that tool and an earlier experiment with voice input that I did a few years ago but never wrote about.&lt;/p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
&lt;p&gt;I&#39;ve had an interest in voice-based input since 2019, when I experienced a brief period of chronic pain due to poor typing ergonomics. In the process of researching alternative input methods, I discovered a whole subculture of programmers experimenting with voice-based coding. One of my early introductions to this idea was &lt;a href=&quot;https://www.youtube.com/watch?v=8SkdfdXWYaI&quot; target=&quot;_blank&quot;&gt;this YouTube video&lt;/a&gt;, which shows programming in Emacs using &lt;a href=&quot;https://github.com/dictation-toolbox/dragonfly&quot; target=&quot;_blank&quot;&gt;dragonfly&lt;/a&gt;, an abstraction layer over &lt;a href=&quot;https://dragon.nuance.com/en-us/home-professional-and-consumer&quot; target=&quot;_blank&quot;&gt;Dragon&lt;/a&gt; and other STT providers. In a similar vein, &lt;a href=&quot;https://www.youtube.com/watch?v=YKuRkGkf5HU0&quot; target=&quot;_blank&quot;&gt;this video&lt;/a&gt; presents voice coding using &lt;a href=&quot;https://talonvoice.com/&quot; target=&quot;_blank&quot;&gt;Talon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&#39;m far from an expert in this field, so take my opinions with a grain of salt, but some of my observations about the software options for voice-based coding and how they&#39;ve evolved over time are that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;There has been a trend towards diversification of STT models and engines. Early systems, as far as I can tell, often relied on one of a small number of proprietary speech recognition engines, like Dragon, under the hood. Over time, projects like Talon, &lt;a href=&quot;https://serenade.ai/&quot; target=&quot;_blank&quot;&gt;Serenade&lt;/a&gt;, and others started to develop their own in-house STT models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dragon does not run on Linux, so if a Linux user wanted to code by voice in the early days, they would often have to create some wild setup involving running Windows in a virtual machine and shuttling transcriptions over the network to the host system. Over time, the projects with their own proprietary STT models were able to more easily support Linux and could offer AppImages that (sometimes) worked out of the box on Linux. As of the last time I checked a few months ago, I still feel that there is no option that checks all of these boxes: open source, highly accurate, easy to install on Linux and works out of the box.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The design space is very large and there have been lots of creative experiments. Voice-input tools can vary in many dimensions, but one of the most important is what I think of as being &amp;quot;low-level&amp;quot; vs &amp;quot;high-level.&amp;quot; A &amp;quot;low-level&amp;quot; tool would be one in which you are expected to write a &amp;quot;grammar&amp;quot; that maps regular expressions to actions such as system commands to run, keystrokes to execute, etc. Every &amp;quot;utterance&amp;quot; that you speak is matched against the regular expressions that you defined, and if there is a match, then the corresponding action is executed. By &amp;quot;utterance,&amp;quot; I mean a silence-delimited segment of user speech captured from the microphone. The low-level tool might provide some utilities for basic things like executing keystrokes, spawning system commands, etc., but not much more. A &amp;quot;high-level&amp;quot; voice input tool, by contrast, would be one that comes with a batteries-included grammar out of the box. In the context of voice-based coding, the grammar would include commands for high-level operations like renaming a variable across your whole project. Dictating source code using a low-level voice input tool often meant creating an efficient grammar for dictating key chords and then using that grammar to control text editors like Vim or Emacs, which are already designed to efficiently edit text with a small number of keystrokes. Serenade was an example of a high-level tool. It integrated with IDEs out of the box and provided lots of high-level commands for writing code in different languages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Related to what I&#39;m calling the &amp;quot;low-level&amp;quot;/&amp;quot;high-level&amp;quot; distinction, there is the &amp;quot;symbolic&amp;quot; vs &amp;quot;connectionist&amp;quot; distinction. A low-level tool like I mentioned above, where you have to configure how your speech maps to commands by writing a system of regular expressions, could be called symbolic. It&#39;s a deterministic, human-designed mapping of patterns of speech to actions. On the other hand, something like a voice interface to an LLM is connectionist. In this case, the mapping from patterns of speech to actions (say, tools that the LLM is equipped with that will trigger actions on your computer) is much more fuzzy and probabilistic. It&#39;s up to the internals of the LLM, which no one truly understands, to decide whether a given utterance should trigger a particular action. I&#39;m not really sure how Serenade was built, but IMO it definitely had the feel of a connectionist system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;I&#39;m not sure about the exact relation between the low-level/high-level distinction and the symbolic/connectionist distinction. I suspect that being symbolic and low-level goes together well and that being connectionist and high-level goes together well, but that there could be exceptions. If you do most of your development in a terminal-based text editor like Vim, for instance, then a high-level, connectionist, LLM-based voice input system would be inefficient (slower, less predictable, unnecessarily costly). If you, like most people, use a mainstream IDE like VSCode and want to be able to issue high-level commands like &amp;quot;add an environment variable named supabase secret key to settings.py&amp;quot; without memorizing a complex grammar, then you need a high-level, connectionist system. I think the ideal system would combine the two, coming with a powerful high-level grammar out of the box, but also letting you drop into an efficient low-level grammar for more fine-grained, predictable input when needed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Over the years, I&#39;ve experimented with a few of these voice input programs, but for whatever reason, they have never become a stable part of my workflow. I recovered from my pain in 2019 mostly by improving my ergonomics and diversifying my input devices - alternating between right-handed and left-handed mice and using foot pedals for clicking. Still, that experience sparked a long-standing interest in voice-based input, and a few years ago I tried to build a voice-coding tool as a side project.&lt;/p&gt;
&lt;h2 id=&quot;vocoder&quot;&gt;Vocoder&lt;/h2&gt;
&lt;p&gt;My first attempt at creating one of these tools myself was &lt;a href=&quot;https://github.com/ricky0123/vocoder&quot; target=&quot;_blank&quot;&gt;Vocoder&lt;/a&gt;. With Vocoder, I tried to distill what I saw as the core of the existing notion of a &amp;quot;speech command grammar&amp;quot; and generalize it. A simplified model of programs like Dragonfly is that they identify &lt;em&gt;utterances&lt;/em&gt;, or segments of user speech delimited by silence, and match their transcriptions against a collection of user-defined regular expressions. Each regular expression is associated with an &lt;em&gt;action&lt;/em&gt;, such as executing a system command or calling a Python function. The utterance may also be parsed to extract structured data for the action.&lt;/p&gt;
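&lt;p&gt;As a rough illustration of that simplified model (a sketch only, not how Dragonfly or Vocoder is implemented), the following matches each utterance transcript against a list of regular expressions and runs the first matching action. The names &lt;code&gt;GRAMMAR&lt;/code&gt;, &lt;code&gt;open_app&lt;/code&gt;, &lt;code&gt;press_key&lt;/code&gt;, and &lt;code&gt;handle_utterance&lt;/code&gt; are made up for this example, and the actions just return strings instead of touching the system:&lt;/p&gt;

```python
import re


def open_app(app):
    # Stand-in for spawning a system command.
    return f"launching {app}"


def press_key(key):
    # Stand-in for simulating a keystroke.
    return f"pressing {key}"


# Hypothetical grammar: (pattern, action) pairs, tried in order.
GRAMMAR = [
    (re.compile(r"open (\w+)"), open_app),
    (re.compile(r"press (\w+)"), press_key),
]


def handle_utterance(transcript):
    """Match one utterance against the grammar and run the first matching action."""
    text = transcript.strip().lower()
    for pattern, action in GRAMMAR:
        match = pattern.fullmatch(text)
        if match:
            # Captured groups are the structured data passed to the action.
            return action(*match.groups())
    return None  # no rule matched, so the utterance is ignored


print(handle_utterance("open firefox"))  # launching firefox
print(handle_utterance("hello there"))   # None
```

&lt;p&gt;Each call to &lt;code&gt;handle_utterance&lt;/code&gt; is independent of the previous one, which is the per-utterance behaviour that this model assumes.&lt;/p&gt;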
&lt;p&gt;Conceptually, another way of viewing this type of system is that the user has defined a grammar that specifies a finite state machine (FSM). As the user speaks within a given utterance, each word advances the FSM through states, and when the utterance ends, the final state determines what action is executed. For the next utterance, we again start at the initial state of the FSM.&lt;/p&gt;
&lt;p&gt;In Vocoder, I explored a generalization of this idea in which the user&#39;s speech throughout the entire session, rather than within a particular utterance, is the input tape to the FSM. Intermediate states can trigger actions and we do not reset to the FSM&#39;s initial state for each utterance.&lt;/p&gt;
&lt;p&gt;This model makes it possible to express higher‑level concepts directly in the grammar. For example, the following Vocoder program defines a global “sleep mode.” When active, speech is ignored until the user explicitly wakes the system up again. The grammar is written using Vocoder’s embedded DSL, which is essentially a special syntax for defining regular expressions:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; vocoder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;app &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; App
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; vocoder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;grammar &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Grammar
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; vocoder&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lexicons &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; en_frequent

g &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Grammar&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&quot;&quot;
!start = &amp;lt; ~(vocoder sleep) &amp;lt;* :en - vocoder &gt; ~(vocoder wake)
           |  ~&amp;lt; :en - vocoder &gt; -&gt; %&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; words&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;words&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;
         &gt;
:en = :&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;en_frequent&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;30_000&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;
&quot;&quot;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

App&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;run&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you run this program, Vocoder continuously prints your speech to stdout. Saying “vocoder sleep” causes it to ignore all input until you say “vocoder wake,” at which point transcription resumes. Notably, the entire concept of a sleep mode is expressed purely in the grammar, using regular‑expression‑like constructs.&lt;/p&gt;
&lt;p&gt;In plain English, the grammar says something like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Say either &lt;code&gt;A&lt;/code&gt; or &lt;code&gt;B&lt;/code&gt; one or more times, where &lt;code&gt;A&lt;/code&gt; is the phrase &amp;quot;vocoder sleep&amp;quot; followed by zero or more words followed by &amp;quot;vocoder wake,&amp;quot; and &lt;code&gt;B&lt;/code&gt; is zero or more words that don&#39;t include &amp;quot;vocoder.&amp;quot; Whenever &lt;code&gt;B&lt;/code&gt; is said, print those words. When &lt;code&gt;A&lt;/code&gt; is said, do nothing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(The details of the DSL are in &lt;a href=&quot;https://github.com/ricky0123/vocoder/blob/master/docs/user-guide.md&quot; target=&quot;_blank&quot;&gt;the repo&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Expressions prefixed with a tilde must occur within a single utterance, or else they are not considered part of the grammar. This lets utterance boundaries participate meaningfully in state transitions and makes the model strictly more general than simple utterance‑based matching (i.e. the Dragonfly-like system that we started with).&lt;/p&gt;
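&lt;p&gt;To make the session-wide idea concrete, here is a plain Python model of the sleep-mode behaviour, written without Vocoder&#39;s DSL. It is a simplified sketch under one loudly stated assumption: a command only counts if it is the entire utterance, which is stricter than the grammar above. The class name &lt;code&gt;SleepModeSession&lt;/code&gt; and its methods are hypothetical:&lt;/p&gt;

```python
class SleepModeSession:
    """State machine whose state persists across utterances."""

    def __init__(self):
        self.asleep = False
        self.printed = []  # what would have been printed while awake

    def feed(self, utterance):
        """Consume one silence-delimited utterance."""
        words = utterance.lower().split()
        if words == ["vocoder", "sleep"]:
            self.asleep = True
        elif words == ["vocoder", "wake"]:
            self.asleep = False
        elif not self.asleep and "vocoder" not in words:
            self.printed.append(" ".join(words))
        # Anything else (speech while asleep, stray "vocoder") is ignored.


session = SleepModeSession()
for utterance in ["hello world", "vocoder sleep", "ignored text",
                  "vocoder wake", "back again"]:
    session.feed(utterance)
print(session.printed)  # ['hello world', 'back again']
```

&lt;p&gt;The point of the sketch is that &lt;code&gt;asleep&lt;/code&gt; survives from one utterance to the next, whereas a purely utterance-based matcher would start from scratch each time.&lt;/p&gt;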
&lt;p&gt;Vocoder’s DSL also takes advantage of Python’s f‑strings and dynamic typing. Word lists, dictionaries (for example, mapping spoken numbers to numeric values), and action functions can be injected directly into the grammar at the point where they are conceptually relevant.&lt;/p&gt;
&lt;p&gt;Another feature of the DSL is structured capture. Grammar expressions are parsed into a tree that mirrors their regex structure, and named captures are passed to action functions based on parameter names. For example, the following program defines a grammar to execute key chords by voice:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; pynput&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;keyboard &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Controller&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Key

keyboard &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Controller&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;execute_chord&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;mods&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; term&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;with&lt;/span&gt; keyboard&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;pressed&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;mods&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        keyboard&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;press&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;term&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
        keyboard&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;release&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;term&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

g &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Grammar&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string-interpolation&quot;&gt;&lt;span class=&quot;token string&quot;&gt;f&quot;&quot;&quot;
!start = &amp;lt; !chord &gt;
!chord ~= &amp;lt;*:modifier&gt; @mods :terminal @term =&gt; %&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;execute_chord&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;

:modifier = :&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;super&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Key&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;cmd&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;control&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Key&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ctrl&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;shift&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Key&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;shift&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;meta&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; Key&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;alt&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# etc&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;
:terminal = :&lt;/span&gt;&lt;span class=&quot;token interpolation&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;g&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;alfa&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;a&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;bravo&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;b&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token string&quot;&gt;&quot;charlie&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;c&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;token comment&quot;&gt;# etc&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;
&quot;&quot;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The structured capture feature means that the action &lt;code&gt;execute_chord&lt;/code&gt; receives a list-like container of pynput key objects as its &lt;code&gt;mods&lt;/code&gt; argument, which makes its function definition very concise. That&#39;s because &lt;code&gt;@mods&lt;/code&gt; captures the &lt;code&gt;&amp;lt;*:modifier&amp;gt;&lt;/code&gt; expression, which gathers zero or more &amp;quot;modifiers&amp;quot; into a list, and the definition of &amp;quot;:modifier&amp;quot; is a dictionary that maps spoken words to pynput key objects.&lt;/p&gt;
&lt;p&gt;Vocoder was a fun side project to work on, but unfortunately the STT model is not very accurate, especially without a top-notch microphone. Moreover, the grammar system is tightly integrated with the specific STT model, which makes it difficult to plug in different models.&lt;/p&gt;
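&lt;p&gt;The idea of passing captures to an action based on its parameter names can be sketched in a few lines of standard-library Python. This is only an illustration of the general technique, not Vocoder&#39;s actual implementation. The helper &lt;code&gt;call_with_captures&lt;/code&gt;, the action &lt;code&gt;describe_chord&lt;/code&gt;, and the capture dictionary are hypothetical, and plain strings stand in for pynput key objects:&lt;/p&gt;

```python
import inspect


def call_with_captures(action, captures):
    """Call action with only the captures that its signature names."""
    wanted = inspect.signature(action).parameters
    kwargs = {name: value for name, value in captures.items() if name in wanted}
    return action(**kwargs)


def describe_chord(mods, term):
    # Stand-in for an action like execute_chord; it only builds a label.
    return "+".join([*mods, term])


# Hypothetical parse result for the utterance "control shift alfa".
captures = {"mods": ["ctrl", "shift"], "term": "a", "unused": None}
print(call_with_captures(describe_chord, captures))  # ctrl+shift+a
```

&lt;p&gt;Because the lookup is by name, an action can ask for any subset of the captures, and captures that it does not name are simply dropped.&lt;/p&gt;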
&lt;h2 id=&quot;voice&quot;&gt;Voice&lt;/h2&gt;
&lt;p&gt;So, that was Vocoder. As mentioned earlier, a few months ago I experimented with writing a new dictation tool from scratch in PureScript. I named it &amp;quot;Voice,&amp;quot; and the source code is now &lt;a href=&quot;https://github.com/ricky0123/voice&quot; target=&quot;_blank&quot;&gt;available on GitHub&lt;/a&gt;. Unlike Vocoder, Voice was not meant to explore any particular idea about voice input. I was aiming to make a simple dictation tool that I could package for the main, modern Linux packaging systems like Snap and Flatpak. Moreover, I wanted to see how easy it would be to use a purely functional programming language like PureScript, together with modern open source STT libraries and models, to make something that works at a basic level.&lt;/p&gt;
&lt;p&gt;For speech recognition, I used &lt;a href=&quot;https://github.com/k2-fsa/sherpa-onnx&quot; target=&quot;_blank&quot;&gt;sherpa-onnx&lt;/a&gt;. It uses a Node addon to run models efficiently, and it uses ONNX to integrate lots of models from different research groups and make them available through a common interface. At the moment, you can use Voice with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parakeet V2&lt;/li&gt;
&lt;li&gt;Parakeet V3&lt;/li&gt;
&lt;li&gt;Moonshine&lt;/li&gt;
&lt;li&gt;Whisper tiny&lt;/li&gt;
&lt;li&gt;Whisper small&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can download one of these models by running something like&lt;/p&gt;
&lt;pre class=&quot;language-sh&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-sh&quot;&gt;voice download-model &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; parakeetv3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or rather, you &lt;em&gt;will&lt;/em&gt; be able to invoke Voice like that once it is installable via Flatpak or Snap. Until then, you will have to clone the repo, install the dependencies, and invoke it with &lt;code&gt;npm run start&lt;/code&gt;. So the above command would in fact be&lt;/p&gt;
&lt;pre class=&quot;language-sh&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-sh&quot;&gt;&lt;span class=&quot;token function&quot;&gt;npm&lt;/span&gt; run start -- download-model &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; parakeetv3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Still, because it looks cooler, I will continue to describe the interface as if it&#39;s available via the command &amp;quot;voice.&amp;quot;&lt;/p&gt;
&lt;p&gt;Voice has subcommands to record audio, transcribe audio, and execute commands or dictate text, and you can combine them using pipes. For example, the following command records audio, transcribes it, and then types what you said using &lt;code&gt;xdotool&lt;/code&gt; to simulate keystrokes:&lt;/p&gt;
&lt;pre class=&quot;language-sh&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-sh&quot;&gt;voice record &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; voice transcribe &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; parakeetv3 &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; voice dictate&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have a yaml file, &lt;code&gt;commands.yaml&lt;/code&gt;, that maps phrases to system commands, you can instead execute commands when their trigger phrases are spoken:&lt;/p&gt;
&lt;pre class=&quot;language-sh&quot; tabindex=&quot;0&quot;&gt;&lt;code class=&quot;language-sh&quot;&gt;voice record &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; voice transcribe &lt;span class=&quot;token parameter variable&quot;&gt;-m&lt;/span&gt; parakeetv3 &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; voice execute &lt;span class=&quot;token parameter variable&quot;&gt;-c&lt;/span&gt; commands.yaml&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The program comes with &lt;a href=&quot;https://github.com/moonshine-ai/moonshine&quot; target=&quot;_blank&quot;&gt;moonshine&lt;/a&gt; as a default model, which is used if you omit a model choice like &lt;code&gt;-m parakeetv3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I&#39;m not sure what the future of Voice is, although I do plan on finishing the packaging of it for Snap and Flatpak at some point.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>Reproducing the results of a computational life experiment, Part 1</title>
    <link href="https://blog.ricky0123.com/blog/complexity_v2/" />
    <updated>2025-11-25T00:00:00Z</updated>
    <id>https://blog.ricky0123.com/blog/complexity_v2/</id>
    <content type="html">&lt;p&gt;A while back, I listened to a &lt;a href=&quot;https://www.preposterousuniverse.com/podcast/2024/08/19/286-blaise-aguera-y-arcas-on-the-emergence-of-replication-and-computation/&quot; target=&quot;_blank&quot;&gt;podcast interview&lt;/a&gt; with one of the authors of the paper &lt;a href=&quot;https://arxiv.org/pdf/2406.19108&quot; target=&quot;_blank&quot;&gt;Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interactions&lt;/a&gt;. The paper is about a computational life experiment where you randomly initialize a bunch of Brainfuck programs and have them &amp;quot;interact&amp;quot; with one another repeatedly over time to produce new programs. Interactions work by concatenating two programs, running the resulting composite program as a variant of Brainfuck in which the program itself is its own output tape, and breaking apart the modified program in order to produce two offspring programs. The paper describes how over time the programs become more complex and an ecosystem of self-replicating programs emerges.&lt;/p&gt;
&lt;p&gt;I was hooked from early on in the interview. It&#39;s reasonably counterintuitive to me that anything interesting comes out of this experiment. I might have thought that no matter how long you run the experiment, the array of ASCII codes that make up the population of Brainfuck programs would look more or less like an IID random sample of ASCII characters, most of which are no-ops. Instead, it turns out that computation itself is an attractor, and there is typically a state transition where self-replicating programs arise and the population rapidly becomes more complex and ordered. They make a number of other interesting observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The system never settles into a static state in which the same self-replicators dominate forever. Instead, there is an ever-changing ecosystem of new replicators that arise and overwrite old ones.&lt;/li&gt;
&lt;li&gt;Random mutations are not necessary for the system to evolve towards complexity and self-replicators. Rather, it&#39;s the randomness inherent in the initial population, the interactions, and the self-modifying aspect of the programs that leads to this behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&#39;s tempting to take this in a philosophical and grandiose direction. Does the arc of the Platonic mathematical universe bend towards greater complexity and computation? Is this a kind of natural law driving the emergence of life and the complexity that we see around us?&lt;/p&gt;
&lt;p&gt;I don&#39;t know what kind of generalizations are valid here, and I&#39;m open to the possibility that there is some way of looking at the experiment that makes the result seem intuitive or superficial. In any case, I found the experiment entertaining and wanted to implement it. Before discussing my implementation, though, it&#39;s important to understand the experiment better. In a typical run of the experiment, we initialize &lt;math display=&quot;inline&quot;&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;17&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;≈&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;130&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; programs of length 64 (a command in the language Brainfuck is a single character), where the individual characters are drawn randomly from the 128 code points of ASCII. During each interaction, two programs are chosen at random and concatenated. The resulting composite program is interpreted and executed as a variant of Brainfuck in which the data array and input and output streams all refer to the program itself. In detail, there are a read head, a write head, and an instruction head, each pointing to a location along the 128-character composite program, and there are 10 instructions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Semantics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move read head to the left&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move read head to the right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;{&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move write head to the left&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move write head to the right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;decrement tape at read head&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;increment tape at read head&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;replace value of tape at write head with the value at read head&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;,&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;replace value of tape at read head with the value at write head&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;if &lt;code&gt;tape[readHead] == 0&lt;/code&gt;, then jump forwards to matching &lt;code&gt;]&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;if &lt;code&gt;tape[readHead] != 0&lt;/code&gt;, then jump backwards to matching &lt;code&gt;[&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From looking at the examples in the paper, I inferred that movements of the read head and write head wrap around the ends of the tape, so that, for instance, moving the read head left when it is at 0 causes it to jump to 127. I assume that the instruction pointer does not wrap, so that the program terminates once it has executed the final instruction (unless the final instruction is &lt;code&gt;]&lt;/code&gt; and &lt;code&gt;tape[readHead] != 0&lt;/code&gt;, in which case it jumps back). Incrementing and decrementing tape values wrap in the same way: the highest value, 127, increments to the lowest value, 0, and 0 decrements to 127.&lt;/p&gt;
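&lt;p&gt;To make these semantics concrete, here is a minimal Python sketch of how one composite program could be executed. It is neither the paper&#39;s code nor my implementation: the names &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;find_match&lt;/code&gt;, the 8192-step limit, and the choice to halt on an unmatched bracket are assumptions made for illustration, and the wrap-around rules follow my reading above.&lt;/p&gt;

```python
# Sketch of the self-modifying Brainfuck variant described above.
# Assumptions (not confirmed by the paper): cell values live in 0..127,
# a fixed step limit stops runaway loops, and an unmatched bracket
# simply ends the run.
TAPE_LEN = 128   # two concatenated 64-character programs
CELL_MOD = 128   # tape values wrap within 0..127

# The two angle-bracket commands, spelled with chr() so that the
# listing contains no literal markup characters.
READ_LEFT, READ_RIGHT = chr(60), chr(62)


def find_match(tape, ip, step):
    """Index of the bracket matching tape[ip], scanning in direction step."""
    open_c, close_c = ord("["), ord("]")
    depth = 0
    i = ip
    while i in range(len(tape)):
        if tape[i] == open_c:
            depth += step
        elif tape[i] == close_c:
            depth -= step
        if depth == 0:
            return i
        i += step
    return None  # no matching bracket on the tape


def run(tape, max_steps=8192):
    """Execute the tape in place and return the number of steps taken."""
    read = write = ip = steps = 0
    while ip in range(len(tape)) and steps != max_steps:
        cmd = chr(tape[ip])
        if cmd == READ_LEFT:
            read = (read - 1) % len(tape)      # heads wrap around
        elif cmd == READ_RIGHT:
            read = (read + 1) % len(tape)
        elif cmd == "{":
            write = (write - 1) % len(tape)
        elif cmd == "}":
            write = (write + 1) % len(tape)
        elif cmd == "-":
            tape[read] = (tape[read] - 1) % CELL_MOD
        elif cmd == "+":
            tape[read] = (tape[read] + 1) % CELL_MOD
        elif cmd == ".":
            tape[write] = tape[read]
        elif cmd == ",":
            tape[read] = tape[write]
        elif cmd == "[" and tape[read] == 0:
            target = find_match(tape, ip, +1)
            if target is None:
                break
            ip = target
        elif cmd == "]" and tape[read] != 0:
            target = find_match(tape, ip, -1)
            if target is None:
                break
            ip = target
        # Every other character is a no-op; the instruction pointer
        # never wraps, so running off the end terminates the program.
        ip += 1
        steps += 1
    return steps
```

&lt;p&gt;The step limit is there because a self-modifying program can easily loop forever; any run that reaches it is simply cut off.&lt;/p&gt;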
&lt;p&gt;How many interactions does it take to reach the state transition described above? In one graph, the authors show the state transition lasting from around epochs 2300 to 2800. They don&#39;t explicitly say what an epoch is in this context, but I am guessing it is a round of &lt;code&gt;N&lt;/code&gt; interactions, with &lt;code&gt;N&lt;/code&gt; being the size of the population (in this case, &lt;math display=&quot;inline&quot;&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;17&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;≈&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;130&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;). So in the case of that specific run, the transition occurred after &lt;math display=&quot;inline&quot;&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;17&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;·&lt;/mo&gt;&lt;mn&gt;2300&lt;/mn&gt;&lt;mo&gt;≈&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;300&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; interactions. However, there is a large amount of variability. Using their complexity metric as a proxy for the state transition, they report that 12% of runs reach it within 2000 epochs, 32% of runs reach it between 2000 and 16,000 epochs, and 56% of runs fail to reach it within 16,000 epochs.&lt;/p&gt;
&lt;p&gt;How do we know what is happening to the population of programs during a run of the experiment? The authors use two main methods. The first is that they estimate a complexity metric that they call &amp;quot;higher order entropy&amp;quot; and define as the difference between Shannon entropy and normalized Kolmogorov complexity, which amounts to&lt;/p&gt;
&lt;p&gt;&lt;math display=&quot;block&quot;&gt;&lt;mrow&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;munderover&gt;&lt;mo&gt;∑&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;mn&gt;127&lt;/mn&gt;&lt;/munderover&gt;&lt;mrow&gt;&lt;mo fence=&quot;true&quot;&gt;(&lt;/mo&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mi class=&quot;mathup-function-ident&quot;&gt;log&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;msub&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo fence=&quot;true&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;mfrac&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mrow&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mo&gt;·&lt;/mo&gt;&lt;mn&gt;64&lt;/mn&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;/math&gt;&lt;/p&gt;
&lt;p&gt;where &lt;math display=&quot;inline&quot;&gt;&lt;msub&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;p&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/math&gt; is the relative frequency of the ASCII code point &lt;math display=&quot;inline&quot;&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/math&gt; in the population, &lt;math display=&quot;inline&quot;&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/math&gt; is the Kolmogorov complexity, estimated as the length of the output of a state-of-the-art compression algorithm run on the population, and &lt;math display=&quot;inline&quot;&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mo&gt;·&lt;/mo&gt;&lt;mn&gt;64&lt;/mn&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;17&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;·&lt;/mo&gt;&lt;mn&gt;64&lt;/mn&gt;&lt;mo&gt;≈&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;400&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; is the total character length of the population.&lt;/p&gt;
&lt;p&gt;The higher order entropy metric is meant to &amp;quot;capture the amount of information that can only be explained by relations between different characters.&amp;quot; In their graph of the state transition, it starts low and then shoots up as the first self-replicators take over the population.&lt;/p&gt;
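&lt;p&gt;As a rough illustration of the metric, the sketch below computes it for a population stored as lists of code points. Several things here are my assumptions rather than the authors&#39; choices: the function name &lt;code&gt;higher_order_entropy&lt;/code&gt; is made up, &lt;code&gt;zlib&lt;/code&gt; from the Python standard library stands in for the state-of-the-art compressor, and the compressed length is measured in bits so that both terms are in bits per character.&lt;/p&gt;

```python
import math
import zlib
from collections import Counter


def higher_order_entropy(population):
    """Shannon entropy minus compressed size, both in bits per character.

    population is a list of programs, each a list of ints in 0..127.
    zlib is only a stand-in for a stronger compressor (an assumption).
    """
    # Flatten the whole population into one byte string.
    data = bytes(ch for program in population for ch in program)
    total = len(data)

    # First term: empirical Shannon entropy of the character distribution.
    counts = Counter(data)
    shannon = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())

    # Second term: estimated Kolmogorov complexity, normalized by length.
    compressed_bits = 8 * len(zlib.compress(data, 9))
    return shannon - compressed_bits / total
```

&lt;p&gt;On a population of independent random programs the two terms nearly cancel, while a population made of copies of a single program keeps a high Shannon entropy but compresses to almost nothing, so the difference is large.&lt;/p&gt;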
&lt;p&gt;The second method that the authors use to gain insight into what is happening during the experiment is to augment the underlying representation of individual Brainfuck commands with metadata that lets them track the &amp;quot;source&amp;quot; of commands that have been copied from one location to another via operations such as &lt;code&gt;,&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt;. Specifically, each Brainfuck command is represented as a 64-bit integer that, at the time of initialization of the command, encodes not only the command&#39;s ASCII code point but also the overall position of the command within the population and the current epoch. Together, the encoded &lt;code&gt;epoch&lt;/code&gt; and &lt;code&gt;position&lt;/code&gt; function as a kind of unique &amp;quot;token&amp;quot; ID, which gets copied along with the ASCII code point by the copy operations &lt;code&gt;,&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;According to my description of the experiment so far, the encoded epoch would always be zero, because commands are only initialized when they are sampled from a uniform distribution over ASCII code points at the beginning of the experiment. However, the authors also did runs with random mutations, in which case the epoch could be nonzero. The idea that random mutations are unnecessary for the results was a fascinating aspect of the experiment for me, so I ignored the mutation variant.&lt;/p&gt;
&lt;p&gt;The authors say that tracking the number of unique token IDs in the population, which starts at &lt;math display=&quot;inline&quot;&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;17&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;·&lt;/mo&gt;&lt;mn&gt;64&lt;/mn&gt;&lt;mo&gt;≈&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;400&lt;/mn&gt;&lt;mo&gt;,&lt;/mo&gt;&lt;mn&gt;000&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; at initialization and decreases as copy operations are executed, is a simple way to detect the state transition. When the state transition occurs, the higher order entropy rises and the number of unique tokens drops rapidly.&lt;/p&gt;
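&lt;p&gt;The token bookkeeping can be sketched in a few lines of Python. The paper, as far as I have described it, only says that a 64-bit integer holds the code point, position, and epoch; the particular field widths below and the helper names &lt;code&gt;encode&lt;/code&gt;, &lt;code&gt;decode&lt;/code&gt;, &lt;code&gt;token_id&lt;/code&gt;, and &lt;code&gt;unique_tokens&lt;/code&gt; are hypothetical choices of mine.&lt;/p&gt;

```python
# Hypothetical field layout for one 64-bit token (an assumption):
#   low 8 bits    ASCII code point
#   next 24 bits  position in the population at initialization
#   high 32 bits  epoch at initialization
CHAR_RANGE = 2 ** 8
POSITION_RANGE = 2 ** 24
EPOCH_RANGE = 2 ** 32


def encode(char, position, epoch=0):
    """Pack a code point and its origin into a single integer."""
    if char not in range(CHAR_RANGE):
        raise ValueError("char does not fit in 8 bits")
    if position not in range(POSITION_RANGE):
        raise ValueError("position does not fit in 24 bits")
    if epoch not in range(EPOCH_RANGE):
        raise ValueError("epoch does not fit in 32 bits")
    # Multiplication by a power of two is the arithmetic form of a shift.
    return (epoch * POSITION_RANGE + position) * CHAR_RANGE + char


def decode(token):
    """Return (char, position, epoch) for a packed token."""
    rest, char = divmod(token, CHAR_RANGE)
    epoch, position = divmod(rest, POSITION_RANGE)
    return char, position, epoch


def token_id(token):
    """The (epoch, position) part, which travels with every copy."""
    return token // CHAR_RANGE


def unique_tokens(population):
    """Count the distinct token IDs across all programs."""
    return len({token_id(t) for program in population for t in program})
```

&lt;p&gt;Because a copy operation overwrites the destination&#39;s whole integer, the destination&#39;s old ID disappears from the population, which is why the count of unique IDs can only fall when there are no mutations.&lt;/p&gt;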
&lt;p&gt;In a follow-up post, I&#39;m going to describe my implementation of the experiment in purely functional programming languages and some of the false starts I encountered along the way.&lt;/p&gt;
</content>
  </entry>
</feed>