VAD-driven streaming voice dictation for macOS with editing commands, tech-term correction, and keyboard output

voice-command

VAD-driven streaming voice dictation for macOS (Apple Silicon). Speak into your mic and the text is typed directly into whichever app has focus, while a terminal UI (TUI) mirrors the transcript.

All inference runs locally — Whisper for ASR, Silero for VAD, Qwen for tech-term correction. No cloud APIs.

Requirements

  • macOS with Apple Silicon
  • Python 3.12+
  • Microphone access (System Settings > Privacy > Microphone)
  • Accessibility access for type mode (System Settings > Privacy > Accessibility)

Install

From PyPI

pip install voice-command
# or
uv tool install voice-command

From source

git clone https://github.com/depoledna/voice-command.git
cd voice-command
uv sync

Models download automatically on first run: ~300MB for Whisper, plus ~1GB for Qwen if llm_correction is enabled.

Usage

voice-cmd                # launch the TUI
voice-cmd --version      # print version and exit
voice-cmd --help         # usage

Dictation always types into the focused window; the TUI's own window is skipped automatically to avoid feedback loops. The TUI mirrors what was transcribed so you can see it and edit it by voice. To run from source: uv run python voice_cmd.py.

Configuration

Persistent settings live at ~/.config/voice-command/settings.json (or $XDG_CONFIG_HOME/voice-command/settings.json). The file is auto-created with defaults on first launch.

| Key | Default | Description |
|---|---|---|
| device | null | Audio input device index; null → auto-detect |
| llm_correction | false | Run Qwen tech-term correction after ASR (downloads ~1GB) |
| vad_threshold | 0.45 | VAD speech threshold (0.10–0.95) |
| min_silence_ms | 600 | Silence (ms) required to end an utterance |
| inactivity_clear_seconds | 5 | Idle time (seconds) after which the TUI buffer and status message are cleared; 0 disables auto-clear |

Disabling llm_correction skips loading the ~1GB Qwen model entirely.

Hotkeys

The TUI shows a sticky 3-line header and accepts hotkeys at any time:

| Key | Action |
|---|---|
| P / Space | Pause / resume listening (live) |
| L | Toggle LLM tech-term correction (live) |
| D | Pick audio device (live; mic is reattached on save) |
| V | Adjust VAD threshold (live) |
| S | Adjust min-silence (live) |
| ? | Show help + voice commands |
| Q / ^C | Quit |

All hotkey changes auto-save to settings.json.
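A live adjustment such as V followed by an auto-save might look like the sketch below. The function names, the idea of a fixed step (`delta`), and the atomic write strategy (temp file plus `os.replace`) are assumptions for illustration; only the 0.10–0.95 range comes from the configuration table.

```python
import json
import os
import tempfile
from pathlib import Path

VAD_MIN, VAD_MAX = 0.10, 0.95  # documented range for vad_threshold


def adjust_vad_threshold(settings: dict, delta: float) -> float:
    """Nudge the VAD threshold by `delta`, clamped to the documented range."""
    value = round(settings["vad_threshold"] + delta, 2)
    value = min(VAD_MAX, max(VAD_MIN, value))
    settings["vad_threshold"] = value
    return value


def save_settings(settings: dict, path: Path) -> None:
    """Write settings atomically: temp file in the same directory, then os.replace."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(settings, f, indent=2)
    os.replace(tmp, path)  # readers never observe a half-written file
```

Writing to a sibling temp file and renaming means a crash mid-save leaves the previous settings.json intact.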

Voice Commands

| Command | Action |
|---|---|
| period / comma / question mark | Insert punctuation |
| new line | Line break |
| new paragraph | Double line break |
| scratch that | Delete last ~5 words |
| delete last N words | Delete last N words |
| undo | Undo last action |
| clear all | Clear buffer |
| stop listening | Pause |
| start listening | Resume |
| copy all | Copy to clipboard |
| done | Copy to clipboard and exit |
| show commands | Show help overlay |
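The buffer-editing commands in the table (scratch that, delete last N words, undo, clear all) can be modeled as a text buffer with a snapshot stack. This is a hypothetical sketch: `Buffer`, `handle_edit_command`, the five-word scratch size, and the small number-word map are assumptions, not the project's implementation.

```python
import re

SCRATCH_WORDS = 5  # "scratch that" removes roughly the last five words
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
DELETE_RE = re.compile(r"^delete last (\w+) words?$")


class Buffer:
    """Dictation buffer with a snapshot stack so every edit can be undone."""

    def __init__(self) -> None:
        self.text = ""
        self._history: list[str] = []

    def _snapshot(self) -> None:
        self._history.append(self.text)

    def append(self, chunk: str) -> None:
        self._snapshot()
        self.text = f"{self.text} {chunk}" if self.text else chunk

    def delete_words(self, n: int) -> None:
        self._snapshot()
        # rsplit keeps the untouched prefix (including its line breaks) intact.
        parts = self.text.rsplit(maxsplit=n)
        self.text = parts[0] if len(parts) > n else ""

    def clear(self) -> None:
        self._snapshot()
        self.text = ""

    def undo(self) -> None:
        if self._history:
            self.text = self._history.pop()


def handle_edit_command(buf: Buffer, utterance: str) -> bool:
    """Apply an editing command; return False if the utterance is not one."""
    # ASR output often arrives capitalized with trailing punctuation.
    cmd = utterance.strip().lower().rstrip(".!?")
    if cmd == "scratch that":
        buf.delete_words(SCRATCH_WORDS)
    elif cmd == "undo":
        buf.undo()
    elif cmd == "clear all":
        buf.clear()
    elif (m := DELETE_RE.match(cmd)) is not None:
        token = m.group(1)
        n = int(token) if token.isdigit() else WORD_NUMBERS.get(token)
        if n is None:
            return False
        buf.delete_words(n)
    else:
        return False
    return True
```

Snapshotting the whole string before each edit is memory-hungry for very long sessions but makes undo trivially correct for every command type.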

Commands can appear inline with dictated text: "Send the email period new line Don't forget the attachment" produces "Send the email." and "Don't forget the attachment" on two separate lines.
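The inline example above can be reproduced with a single regex pass over the utterance. Note that the pipeline describes sentence splitting plus leading/trailing command extraction; this sketch swaps in a simpler substitute-anywhere approach, and `apply_inline_commands` and `INLINE_COMMANDS` are hypothetical names.

```python
import re

# Spoken command -> text it inserts (subset of the voice-command table).
INLINE_COMMANDS = {
    "new paragraph": "\n\n",
    "new line": "\n",
    "question mark": "?",
    "period": ".",
    "comma": ",",
}

# Longest phrases first; \b stops "comma" from matching inside "command".
# [.,]? swallows punctuation the ASR may attach to the command word itself.
_COMMAND_RE = re.compile(
    r"\s*\b("
    + "|".join(re.escape(k) for k in sorted(INLINE_COMMANDS, key=len, reverse=True))
    + r")\b[.,]?\s*",
    re.IGNORECASE,
)


def apply_inline_commands(text: str) -> str:
    """Replace spoken punctuation/layout commands anywhere in an utterance."""

    def replace(match: re.Match) -> str:
        out = INLINE_COMMANDS[match.group(1).lower()]
        # Punctuation is followed by a space; line breaks are not.
        return out if out.startswith("\n") else out + " "

    result = _COMMAND_RE.sub(replace, text)
    result = re.sub(r" +\n", "\n", result)  # no trailing spaces before breaks
    return result.strip(" ")
```

For example, `apply_inline_commands("Is it done question mark")` yields `"Is it done?"`. A pure substitution cannot tell a literal "period" from the command, which is one reason a real implementation might restrict commands to utterance boundaries.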

Pipeline

  1. Audio - sounddevice captures mic input, which is resampled to 16 kHz
  2. VAD - Silero VAD with hysteresis detects speech boundaries (32ms frames, pre-roll buffering)
  3. ASR - MLX Whisper (small, 8-bit) with dev-vocabulary prompt
  4. LLM - Qwen3 1.7B (4-bit) fixes tech terms: "fast api" -> "FastAPI", "type script" -> "TypeScript" (off by default; toggle with L or set llm_correction: true)
  5. Commands - Sentence splitting + leading/trailing command extraction
  6. Output - TUI buffer display or keystroke diff-typing via pynput

Output

Each utterance is typed into whichever app is currently focused via pynput. If your own terminal (the one running voice-cmd) is the frontmost window, typing is suppressed to avoid echoing into your shell. A 3-second countdown after launch lets you switch focus to the target app.
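The pipeline calls this step "keystroke diff-typing". One plausible reading, assumed here, is: keep what is already on screen, backspace only past the longest common prefix, then type the new suffix. `plan_edit` and `apply_edit` are hypothetical names; `pynput.keyboard.Controller` with `press`/`release`/`type` and `Key.backspace` are real pynput APIs, imported lazily because they need macOS Accessibility access.

```python
import os


def plan_edit(typed: str, target: str) -> tuple[int, str]:
    """Return (backspaces, suffix) that turn `typed` into `target` on screen."""
    common = len(os.path.commonprefix([typed, target]))
    return len(typed) - common, target[common:]


def apply_edit(typed: str, target: str) -> str:
    """Send the minimal keystrokes via pynput; return the new on-screen text."""
    from pynput.keyboard import Controller, Key  # needs Accessibility permission

    backspaces, suffix = plan_edit(typed, target)
    kb = Controller()
    for _ in range(backspaces):
        kb.press(Key.backspace)
        kb.release(Key.backspace)
    if suffix:
        kb.type(suffix)
    return target
```

For example, `plan_edit("hello wrld", "hello world")` returns `(3, "orld")`: three backspaces, then four characters, instead of retyping the whole line.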

Benchmarks

# Compare ASR models (requires test fixtures in tests/fixtures/)
uv run python tests/benchmark.py

# Pipeline diagnostics
uv run python tests/diagnose_pipeline.py

Releasing

  1. Update the version in pyproject.toml
  2. Commit: git commit -am "chore: bump version to X.Y.Z"
  3. Tag: git tag vX.Y.Z
  4. Push: git push origin main --tags

The GitHub Actions workflow builds and publishes to PyPI automatically via trusted publishers (OIDC).

First-time PyPI setup

  1. Go to https://pypi.org/manage/account/publishing/
  2. Add a "pending publisher":
    • Package name: voice-command
    • Owner: depoledna
    • Repository: voice-command
    • Workflow: release.yml
    • Environment: pypi
  3. In the GitHub repo, go to Settings > Environments and create an environment named pypi

License

MIT

Download files

Source Distribution

voice_command-0.2.0.tar.gz (22.5 kB)

Built Distribution

voice_command-0.2.0-py3-none-any.whl (26.3 kB)

File details

Details for the file voice_command-0.2.0.tar.gz.

File metadata

  • Download URL: voice_command-0.2.0.tar.gz
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv 0.11.7 (uv publish, CI on Ubuntu 24.04)

File hashes

Hashes for voice_command-0.2.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | bf8ab520a32eb5fc2286a3e936bfc06cbe704463644415d8388f827a77aef2f2 |
| MD5 | b049e28ad2864c3b0ab9a452d34de353 |
| BLAKE2b-256 | f7f010885fab7b272c09b815682c2becc758fd7e219f0604ff58844db8e8f3e5 |

File details

Details for the file voice_command-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: voice_command-0.2.0-py3-none-any.whl
  • Size: 26.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv 0.11.7 (uv publish, CI on Ubuntu 24.04)

File hashes

Hashes for voice_command-0.2.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | d2ec01902bdb8dd287abb0bf1f74b5a8513138d3ae61fe99fdf98e0ca06b71a9 |
| MD5 | ee187386c847817278b5c8f8d3b3b9fa |
| BLAKE2b-256 | 1d8db40ef250974db490ebbdbbe72567fb963fc77ece1ee52ca169446407e934 |
