VoxInput

Transcribe input from your microphone and turn it into key presses on a virtual keyboard. This lets you use speech-to-text in any application or window system on Linux. In fact, you can even use it on the system console.

VoxInput is meant to be used with LocalAI, but it will work with any OpenAI-compatible API that provides the transcription endpoint or the realtime API.

Features

Speech-to-Text Daemon: Runs as a background process to listen for signals to start or stop recording audio.
Audio Capture: Records audio from the microphone or any other device, including audio you are listening to.
Transcription: Converts recorded audio into text using a local or remote transcription service.
Text Automation: Simulates typing the transcribed text into an application using dotool.
Voice Activity Detection: In realtime mode VoxInput uses VAD to detect speech segments and automatically transcribe them.
Visual Notification: In realtime mode, a GUI notification informs you when recording (VAD) has started or stopped.

Requirements

dotool (for simulating keyboard input)
The user that runs the VoxInput daemon must be in the input group
The following udev rule must be in place:

KERNEL=="uinput", GROUP="input", MODE="0620", OPTIONS+="static_node=uinput"

This can be set in your NixOS config as follows

services.udev.extraRules = ''
KERNEL=="uinput", GROUP="input", MODE="0620", OPTIONS+="static_node=uinput"
'';
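
Both prerequisites can be checked before starting the daemon. The following is only a sketch: `has_group` and `check_uinput` are helper names made up for this example, and it assumes the device node is /dev/uinput.

```shell
#!/bin/sh
# Sketch: verify the dotool prerequisites listed above.
# has_group and check_uinput are hypothetical helpers, not part of VoxInput.

# has_group GROUP "SPACE SEPARATED GROUP LIST"
# Succeeds if GROUP appears as a whole word in the list.
has_group() {
    for g in $2; do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

# check_uinput [DEVICE]
# Reports whether the current user is in the input group and can write to
# the uinput device (assumed to be /dev/uinput unless DEVICE is given).
check_uinput() {
    dev="${1:-/dev/uinput}"
    ok=0
    if ! has_group input "$(id -nG)"; then
        echo "user is not in the 'input' group" >&2
        ok=1
    fi
    if [ ! -w "$dev" ]; then
        echo "$dev is not writable by this user" >&2
        ok=1
    fi
    return "$ok"
}
```

For example, `check_uinput && voxinput listen` only starts the daemon when both checks pass. Note that a newly added group membership normally takes effect at the next login.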

Installation

Clone the repository:

git clone https://github.com/richiejp/VoxInput.git
cd VoxInput

Build the project:
```
go build -o voxinput
```
Ensure dotool is installed on your system and that it can make key presses.
It makes sense to bind the record and write commands to keys using your window manager. For instance, my Sway config contains the following:

bindsym $mod+Shift+t exec voxinput record
bindsym $mod+t exec voxinput write

Alternatively you can use the Nix flake.

Usage

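Note: VOXINPUT_ variables take precedence over variables with other prefixes. Unless you don't mind running VoxInput as root, you also need the dotool setup described under Requirements (input group membership and the uinput udev rule).

VoxInput reads the following environment variables: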

OPENAI_API_KEY or VOXINPUT_API_KEY: Your OpenAI API key for Whisper transcription. If you have a local instance with no key, then just leave it unset.
OPENAI_BASE_URL or VOXINPUT_BASE_URL: The base URL of the OpenAI-compatible API server (defaults to http://localhost:8080/v1).
VOXINPUT_LANG or LANG: Language code for transcription (defaults to empty).
VOXINPUT_TRANSCRIPTION_MODEL: Transcription model (default: whisper-1).
VOXINPUT_TRANSCRIPTION_TIMEOUT: Timeout duration (default: 30s).
VOXINPUT_SHOW_STATUS: Show GUI notifications (yes/no, default: yes).
VOXINPUT_CAPTURE_DEVICE: Specific audio capture device name (run voxinput devices to list).
VOXINPUT_OUTPUT_FILE: Path to save the transcribed text to a file instead of typing it with dotool.
XDG_RUNTIME_DIR or VOXINPUT_RUNTIME_DIR: Used for the PID and state files; defaults to /run/voxinput if neither is set.
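
The precedence between a VOXINPUT_ variable, its fallback and the default can be sketched in shell as below. This is an illustration, not VoxInput's own code: `first_set` and `base_url` are hypothetical helpers, and the sketch assumes an empty value is treated like an unset one.

```shell
#!/bin/sh
# Sketch of the documented precedence: the VOXINPUT_ variable wins, then the
# fallback variable, then the default. first_set and base_url are hypothetical.
# Assumption: an empty value counts as unset.

# first_set DEFAULT VALUE...
# Prints the first non-empty VALUE, or DEFAULT if all of them are empty.
first_set() {
    default="$1"
    shift
    for value in "$@"; do
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done
    printf '%s\n' "$default"
}

# base_url
# Resolves the API base URL using the default given in the list above.
base_url() {
    first_set "http://localhost:8080/v1" \
        "${VOXINPUT_BASE_URL:-}" "${OPENAI_BASE_URL:-}"
}
```

With only OPENAI_BASE_URL set, `base_url` prints that value. Once VOXINPUT_BASE_URL is also set, it prints the VOXINPUT_ value instead.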

Commands

listen: Start speech to text daemon.
- --replay: Play the audio just recorded for transcription (non-realtime mode only).
- --no-realtime: Use the HTTP API instead of the realtime API; disables VAD.
- --no-show-status: Don't show when recording has started or stopped.
- --output-file <path>: Save transcript to file instead of typing.
- --prompt <text>: Text used to condition model output. This could be previously transcribed text or uncommon words you expect to use.
```
./voxinput listen
```
record: Tell existing listener to start recording audio. In realtime mode it also begins transcription.
```
./voxinput record
```
write or stop: Tell existing listener to stop recording audio and, if not in realtime mode, begin transcription. The stop alias makes more sense in realtime mode.
```
./voxinput write
```
toggle: Toggle recording on/off (start recording if idle, stop if recording).
```
./voxinput toggle
```
status: Show whether the server is listening and if it's currently recording.
```
./voxinput status
```
devices: List capture devices.
```
./voxinput devices
```
help: Show help message.
```
./voxinput help
```
ver: Print version.
```
./voxinput ver
```

Example Realtime Workflow

Start the daemon in a terminal window:

OPENAI_BASE_URL=http://ai.local:8081/v1 OPENAI_WS_BASE_URL=ws://ai.local:8081/v1/realtime ./voxinput listen

Select a text box you want to speak into and use a global shortcut to run the following
```
./voxinput record
```
Begin speaking. When you pause for a second or two, your speech will be transcribed and typed into the active application.
Send a signal to stop recording:
```
./voxinput stop
```

Example Workflow

Start the daemon in a terminal window:

OPENAI_BASE_URL=http://ai.local:8081/v1 ./voxinput listen --no-realtime

Select a text box you want to speak into and use a global shortcut to run the following
```
./voxinput record
```
After speaking, send a signal to stop recording and transcribe:
```
./voxinput write
```
The transcribed text will be typed into the active application.

Example Workflow: Transcribing an Online Meeting or Video Stream

To create a transcript of an online meeting or video stream by capturing system audio:

List available capture devices:
```
./voxinput devices
```
Identify the monitor device, e.g., "Monitor of Built-in Audio Analog Stereo".

Start the daemon specifying the device and output file:

VOXINPUT_CAPTURE_DEVICE="Monitor of Built-in Audio Analog Stereo" ./voxinput listen --output-file meeting_transcript.txt

Note: Add --no-realtime if you prefer the HTTP API.

Start recording:
```
./voxinput record
```
Play your online meeting or video stream; the system audio will be captured.
Stop recording:
```
./voxinput stop
```
The transcript is now in meeting_transcript.txt.

Quick start with LocalAI

Follow https://localai.io/installation/ to install LocalAI. The simplest way is to use Docker:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

Open http://localhost:8080 in your browser to access the LocalAI web interface and install the whisper-1 and silero-vad-ggml models.
Test out VoxInput:

VOXINPUT_TRANSCRIPTION_MODEL=whisper-1 VOXINPUT_TRANSCRIPTION_TIMEOUT=30s voxinput listen
voxinput record && sleep 30s && voxinput write

Displaying recording status

The realtime mode has a UI that displays the actions being taken by VoxInput. However, you can also read the status from the status file or with the status command, and then display it via your desktop manager (e.g. waybar). For an example, see the PR which added it.
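
As a rough sketch of the status command route, the helper below wraps arbitrary status text in the single-line JSON object that a Waybar custom module with "return-type": "json" reads. `to_waybar_json` is a hypothetical name, and the text is passed through as-is because no particular output format of `voxinput status` is assumed here.

```shell
#!/bin/sh
# Sketch: turn status text into JSON for a Waybar custom module.
# to_waybar_json is a hypothetical helper, not part of VoxInput.

# to_waybar_json TEXT
# Replaces newlines and tabs with spaces, escapes backslashes and double
# quotes, and prints one JSON object on a single line.
to_waybar_json() {
    escaped=$(printf '%s' "$1" | tr '\n\t' '  ' | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"text": "%s"}\n' "$escaped"
}
```

A module script could then call `to_waybar_json "$(voxinput status)"` at whatever interval suits your bar.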

TODO

Put playback behind a debug switch
Create a release
Realtime Transcription
GUI and system tray
Voice detection and activation (partial, see below)
Code words to start and stop transcription
Allow user to describe a button they want to press (requires submitting screen shot and transcription to LocalAGI)

Signals

SIGUSR1: Start recording audio.
SIGUSR2: Stop recording and transcribe audio.
SIGTERM: Stop the daemon.
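
The signals can also be sent by hand with kill. The sketch below assumes you know where the daemon's PID file is: its file name is not given here, so the path is an argument. `signal_for` and `send_to_daemon` are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: send the signals listed above to a running daemon.
# signal_for and send_to_daemon are hypothetical helpers; the PID file path
# must be supplied by the caller because its name is not documented here.

# signal_for ACTION
# Maps an action name to the matching signal from the list above.
signal_for() {
    case "$1" in
        start) echo USR1 ;;   # SIGUSR1: start recording audio
        stop)  echo USR2 ;;   # SIGUSR2: stop recording and transcribe audio
        quit)  echo TERM ;;   # SIGTERM: stop the daemon
        *)     echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# send_to_daemon ACTION PIDFILE
# Reads the PID from PIDFILE and sends the signal for ACTION to it.
send_to_daemon() {
    sig=$(signal_for "$1") || return 1
    if [ ! -r "$2" ]; then
        echo "cannot read PID file: $2" >&2
        return 1
    fi
    kill -s "$sig" "$(cat "$2")"
}
```

For everyday use the record, write and stop commands are the simpler interface. Sending signals directly is mainly useful from scripts or service managers.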

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

malgo for audio handling.
go-openai for OpenAI API integration.
numen and dotool. I did consider modifying numen to use LocalAI, but decided to go with a new tool for now.
