Talk-to-Tux

Voice-to-smart-paste pipeline for Linux. Hold a mouse button, speak, and the transcribed, LLM-rewritten text is pasted into your active application, formatted for the context you're in.

flowchart LR
    A["Hold Button"] --> B["Record Audio"]
    B --> C["STT\nhosted Groq Whisper or BYO chain"]
    C --> D["Rewrite\nhosted Groq Scout or BYO BAML"]
    D --> E["Smart Paste"]

    F["App Context\nwindow + AT-SPI text + optional screenshot"] --> D

How It Works

  1. Hold your mouse side button (or keyboard hotkey) and speak. The microphone stream keeps a local in-memory 2-second ring buffer, so any speech immediately before the button press is also captured.
  2. Release — audio is checked for speech by VAD, then sent to the STT provider chain
  3. The transcription is rewritten by an LLM using your active window's context (app name, window title, AT-SPI widget text, optional screenshot)
  4. The result is pasted into the focused application using the correct shortcut
  5. Double-tap the side button after a paste to send Enter (e.g., submit a chat message)
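
The pre-roll behaviour in step 1 can be sketched with a bounded deque. This is a minimal illustration, not the project's recorder: the `Recorder` class, its method names, and the 16 kHz sample rate are assumptions made for the example.

```python
from collections import deque

SAMPLE_RATE = 16_000   # assumed sample rate in Hz; the project's value is not stated
PRE_ROLL_SECONDS = 2   # the 2-second ring buffer described in step 1


class Recorder:
    """Hypothetical recorder that prepends the last few seconds of audio."""

    def __init__(self, sample_rate=SAMPLE_RATE, pre_roll_seconds=PRE_ROLL_SECONDS):
        # A deque with maxlen drops the oldest samples once it is full,
        # which is exactly the behaviour of a ring buffer.
        self._pre_roll = deque(maxlen=int(sample_rate * pre_roll_seconds))
        self._recording = None  # None while the button is not held

    def on_audio(self, chunk):
        """Called for every chunk of samples from the microphone stream."""
        self._pre_roll.extend(chunk)
        if self._recording is not None:
            self._recording.extend(chunk)

    def press(self):
        """Button down: seed the recording with the buffered pre-roll."""
        self._recording = list(self._pre_roll)

    def release(self):
        """Button up: return everything captured and go back to idle."""
        audio, self._recording = self._recording, None
        return audio
```

With a tiny buffer (`Recorder(sample_rate=4, pre_roll_seconds=2)`), feeding ten samples, pressing, feeding two more, and releasing returns the last eight pre-press samples followed by the two recorded ones.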

Works on both X11 and Wayland (GNOME, tested on Ubuntu 24.04+).

Architecture

flowchart TB
    subgraph Trigger["Trigger Layer"]
        SB["Side Button / Keyboard Hotkey"]
    end

    subgraph Recording["Recording Phase"]
        SB -->|hold| REC["Audio Recorder\n(sounddevice + 2s ring buffer)"]
        SB -->|press| CTX0["Capture Context\nwindow + AT-SPI + screenshot"]
    end

    subgraph Processing["Processing Phase"]
        REC -->|release| VAD["VAD Gate\n(Silero)"]
        VAD -->|raw WAV| STT{"API Mode"}
        STT -->|hosted| HSTT["Supabase v1-stt\nGroq Whisper"]
        STT -->|BYO| BSTT["STT Chain\ndefault: groq, openai, google\noptional: gpu_server, local_whisper, elevenlabs"]
        HSTT --> TX["Transcription"]
        BSTT --> TX
        CTX0 --> RP["Rewrite Prompt\n6-layer context"]
        TX --> RP
        RP --> RW{"API Mode"}
        RW -->|hosted| HRW["Supabase v1-rewrite\nGroq Scout text\noptional visual model"]
        RW -->|BYO| BRW["BAML SmartRewrite\nLocalAI, Groq, Gemini, Ollama fallback"]
    end

    subgraph Output["Paste Phase"]
        HRW --> TT["Tooltip / Notification\n(confirm or auto-paste)"]
        BRW --> TT
        TT --> PASTE["Paster\nxclip/wl-copy + xdotool/wtype"]
    end

Prerequisites

  • OS: Linux with GNOME (X11 or Wayland).
  • Python: 3.11+.
  • STT backend: hosted beta account, or a BYO key/local server for Groq, OpenAI, Google, ElevenLabs, gpu_server, or local_whisper.

System packages

| Tool(s) | Group | Ubuntu/Debian (apt) | Arch (pacman) | Fedora (dnf) |
|---|---|---|---|---|
| libportaudio2 | Audio | libportaudio2 | portaudio | portaudio |
| grim | Screenshots (Wayland) | grim | grim | grim |
| scrot | Screenshots (X11/XWayland) | scrot | scrot | scrot |
| xclip | Clipboard (X11) | xclip | xclip | xclip |
| wl-copy, wl-paste | Clipboard (Wayland) | wl-clipboard | wl-clipboard | wl-clipboard |
| xdotool | Keystroke injection (X11) | xdotool | xdotool | xdotool |
| wtype | Keystroke injection (Wayland/wlroots) | wtype | wtype | wtype |
| ydotool + ydotoold | Keystroke injection (GNOME Wayland) | see note below | ydotool | ydotool |
| dbus-send, busctl, gdbus | D-Bus utilities | dbus / systemd | dbus / systemd | dbus / systemd |
| notify-send | Desktop notifications | libnotify-bin | libnotify | libnotify |
| evtest | Input device debug | evtest | evtest | evtest |
| pgrep | Process checks | procps | procps-ng | procps-ng |

ydotool on Ubuntu/Debian — build v1.0+ from source

Ubuntu's apt ships ydotool 0.1.8, which has no daemon and produces garbage key injection. You need v1.0+ built from source:

# Build dependencies
sudo apt install cmake libevdev-dev libudev-dev

# Clone and build
git clone https://github.com/ReimuNotMoe/ydotool
cd ydotool && cmake -B build && cmake --build build && sudo cmake --install build

# Enable the daemon and grant /dev/uinput access
systemctl --user enable --now ydotoold
sudo usermod -aG input $USER   # re-login required
# or add udev rule: echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/99-uinput.rules

Arch and Fedora ship a working ydotool via their package managers.

Self-verify

uv run talk-to-tux --doctor

Quick Start

git clone https://github.com/viperjuice/talk-to-tux.git
cd talk-to-tux
uv sync --all-groups

# First run launches the setup wizard. Choose hosted beta or BYO-key mode.
uv run talk-to-tux

On first run, the setup wizard writes ~/.config/talk-to-tux/config.toml. It can also install a desktop launcher and optionally enable login autostart so daily use does not require a terminal.

For production-style local use, run without --debug:

uv run talk-to-tux

For launcher and startup integration:

uv run talk-to-tux desktop status
uv run talk-to-tux desktop install-launcher
uv run talk-to-tux desktop enable-autostart
uv run talk-to-tux desktop disable-autostart
uv run talk-to-tux desktop remove-launcher

--debug enables debug logging and the debug popup. Leave it off for normal beta/production use.

Hosted mode signs in with GitHub OAuth. BYO-key mode tells you where to put provider keys in ~/.config/talk-to-tux/secrets.env. .env.example is a reference file; do not copy it wholesale unless you want every sample override.

Trigger Modes and Key Mapping

Default: Mouse Side Buttons (hold-to-record)

| Button | evdev Code | Action |
|---|---|---|
| BTN_SIDE (thumb back) | 275 | Either button starts recording |
| BTN_EXTRA (thumb forward) | 276 | Release ALL buttons to stop |

The device is grabbed exclusively so side buttons don't trigger browser back/forward. All other mouse events (movement, clicks, scroll) are forwarded transparently via uinput.
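
The hold-to-record rule in the table (either button starts, releasing all buttons stops) amounts to tracking the set of held buttons. The sketch below shows only that logic. `HoldTracker` and its return values are hypothetical names for illustration, and the actual device grab and uinput forwarding are left out.

```python
SIDE_BUTTONS = {275, 276}  # BTN_SIDE and BTN_EXTRA, as listed in the table


class HoldTracker:
    """Hypothetical helper: turns EV_KEY events into start/stop decisions."""

    def __init__(self, button_codes=SIDE_BUTTONS):
        self._codes = set(button_codes)
        self._held = set()

    def handle(self, code, value):
        """Process one key event.

        value follows the evdev convention: 1 = press, 0 = release, 2 = repeat.
        Returns "start", "stop", "forward" (not a trigger button), or None.
        """
        if code not in self._codes:
            return "forward"  # movement, clicks, scroll: pass through untouched
        if value == 1:
            was_idle = not self._held
            self._held.add(code)
            return "start" if was_idle else None
        if value == 0:
            was_held = bool(self._held)
            self._held.discard(code)
            # Stop only when the last held trigger button is released.
            return "stop" if was_held and not self._held else None
        return None  # autorepeat events carry no new information
```

Pressing 275, then 276, then releasing 275 keeps recording; only the release of 276 yields "stop".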

Alternative: Keyboard Hotkey

| Key Combo | evdev Names | Action |
|---|---|---|
| Ctrl + Super (left) | KEY_LEFTCTRL+KEY_LEFTMETA | Toggle recording |

Note: The Super key may trigger GNOME Activities. Disable with: gsettings set org.gnome.mutter overlay-key ''

Customizing the Trigger

Option 1: TOML config (~/.config/talk-to-tux/config.toml)

[trigger]
mode = "mouse"           # "auto", "mouse", or "keyboard"
record_mode = "hold"     # "hold" (release to stop) or "toggle" (tap/tap)

[trigger.mouse]
button_codes = [275, 276]          # any evdev button codes
device_name = "Logitech G502"     # match by name substring (stable across reboots and USB replug)
# device_path = "/dev/input/event5"  # or explicit path (fragile)
grab = true

[trigger.keyboard]
hotkey = "KEY_LEFTCTRL+KEY_LEFTMETA"

Option 2: Environment variables

TTT_TRIGGER_MODE=keyboard
TTT_HOTKEY=KEY_RIGHTCTRL
TTT_RECORD_MODE=toggle
# Or nested format:
TTT_TRIGGER__MOUSE__BUTTON_CODES='[275, 276]'
TTT_TRIGGER__MOUSE__DEVICE_NAME="Logitech"
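
One plausible reading of the nested format is that double underscores separate config levels and values are parsed as JSON where possible. The sketch below illustrates that reading only. `parse_env` is a hypothetical helper, not the project's loader, and how flat names such as TTT_TRIGGER_MODE map onto sections is not stated in this document, so flat names are left as single keys here.

```python
import json


def parse_env(environ, prefix="TTT_", delimiter="__"):
    """Build a nested dict from prefixed environment variables (illustrative)."""
    result = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated variables
        path = name[len(prefix):].lower().split(delimiter)
        try:
            value = json.loads(raw)      # '[275, 276]' -> [275, 276]
        except json.JSONDecodeError:
            value = raw                  # plain strings such as 'Logitech'
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return result


example = parse_env({
    "TTT_TRIGGER__MOUSE__BUTTON_CODES": "[275, 276]",
    "TTT_TRIGGER__MOUSE__DEVICE_NAME": "Logitech",
    "TTT_RECORD_MODE": "toggle",
})
```

Here `example["trigger"]["mouse"]["button_codes"]` is a real list of integers rather than a string.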

Option 3: CLI flags

uv run talk-to-tux --trigger keyboard --record-mode toggle

Finding Your Button Codes

# List input devices
sudo evtest

# Pick your mouse, press buttons, note the codes:
#   Event: type 1 (EV_KEY), code 275 (BTN_SIDE), value 1

Configuration

Configuration is loaded with this precedence (highest first):

  1. CLI arguments (--trigger mouse, --debug, etc.)
  2. Environment variables (TTT_* prefix)
  3. ~/.config/talk-to-tux/secrets.env (CWD .env intentionally not loaded — prevents rogue .env in a project dir from overriding secrets)
  4. TOML config (~/.config/talk-to-tux/config.toml)
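
The precedence list can be pictured as layered dictionaries merged from lowest to highest priority. This is a sketch under the assumption that each source has already been parsed into a nested dict. `deep_merge` and `resolve` are illustrative names, not the project's API.

```python
def deep_merge(base, override):
    """Return a copy of base with override applied, recursing into sub-dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(toml_cfg, secrets_env, environ, cli_args):
    """Apply layers lowest-priority first, so later layers win."""
    resolved = {}
    for layer in (toml_cfg, secrets_env, environ, cli_args):
        resolved = deep_merge(resolved, layer)
    return resolved


config = resolve(
    toml_cfg={"trigger": {"mode": "mouse", "record_mode": "hold"}},
    secrets_env={},
    environ={"trigger": {"mode": "keyboard"}},
    cli_args={"trigger": {"record_mode": "toggle"}},
)
```

In this example the environment overrides the TOML trigger mode and the CLI overrides the record mode, while untouched keys fall through from the lower layers.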

Key Settings

| Section | Setting | Default | Description |
|---|---|---|---|
| api_mode | value | byo_key (auto-hosted when a token exists) | Hosted account vs bring-your-own-provider keys |
| stt | providers | groq,openai,google | BYO STT fallback chain (tried in order) |
| stt.gpu_server | url | http://localhost:8000 | Self-hosted Whisper server URL |
| rewrite | enabled | true | Enable LLM smart rewrite |
| rewrite | local_ai_url | http://ai:8002/v1 | LocalAI / vLLM-compatible rewrite endpoint |
| rewrite | ollama_base_url | http://localhost:11434/v1 | Ollama-compatible rewrite fallback endpoint |
| context | screenshot_enabled | true | Include screenshot in LLM context |
| ducking | enabled | true | Reduce other apps' volume while recording |
| ducking | factor | 0.15 | Duck to 15% of original volume |
| learning | enabled | true | Store local correction-learning artifacts |
| tooltip | enabled | false | Show confirm-before-paste tooltip (disabled = auto-paste) |
| tooltip | use_notifications | true | Use desktop notifications (vs GTK tooltip) |
| paste | enabled | true | Auto-paste into active window |
| indicator | enabled | true | Show system tray indicator |
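
The stt providers setting describes a fallback chain tried in order. A minimal sketch of that pattern follows. `transcribe_with_fallback` and the `backends` mapping of provider name to callable are hypothetical, and no real provider API is called.

```python
def transcribe_with_fallback(audio, providers, backends):
    """Try each provider in order and return (name, text) from the first success.

    providers: ordered names, e.g. ["groq", "openai", "google"]
    backends:  hypothetical mapping of provider name -> callable(audio) -> str
    """
    errors = {}
    for name in providers:
        try:
            return name, backends[name](audio)
        except Exception as exc:  # missing backend, network error, quota, ...
            errors[name] = exc
    raise RuntimeError(f"all STT providers failed: {errors}")
```

A provider that raises is skipped and recorded, so a later entry in the list can still return a transcription. If every provider fails, the collected errors are raised together.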

Model Selection Workflow

Talk-to-Tux now treats model settings as dropdowns, not free-text fields. The Providers tab exposes catalog-backed STT and rewrite model selectors, and the Local Inference tab exposes local model selectors once endpoint discovery has something safe to offer.

  • Local endpoint URLs stay editable text fields. Paste the STT or rewrite base URL into the Local Inference tab, then probe the endpoint to refresh bounded health plus discovered model choices.
  • uv run talk-to-tux settings --check-local-inference --json reports the same sanitized discovery summary from the CLI, including reachable/degraded status and bounded model counts.
  • Rewrite model choices are limited to visual-capable models because the rewrite pipeline can include screenshot context. Text-only local discoveries stay visible as read-only feedback, but they are not selectable for rewrite.
  • Unsupported legacy model ids are preserved non-destructively when older configs are loaded. The settings UI surfaces the current unsupported value, and the fix is to replace it with a supported dropdown selection rather than typing a new arbitrary model string.

Per-App Rules

Customize behavior per application in config.toml:

[[app]]
match = "Google-chrome"
match_title = "*ChatGPT*"         # optional title filter (glob or ~regex)
paste_shortcut = "ctrl+shift+v"   # override paste shortcut
rewrite_hint = "Conversational tone, no markdown"

[[app]]
match = "Code"
rewrite_hint = "Generate code in the language of the active file"

[[app]]
match = "kitty"
is_terminal = true
paste_shortcut = "ctrl+shift+v"

Default rules for common apps (browsers, terminals, editors, chat apps) are shipped in src/talk_to_tux/data/default_app_rules.toml. User rules in config.toml take priority.
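
To make the rule fields concrete, here is a sketch of how a rule lookup could work. The semantics are assumptions for illustration: `match` is compared case-insensitively against the window class, a `match_title` starting with "~" is treated as a regular expression and anything else as a glob, and the first matching rule wins. The project's actual matching may differ.

```python
import fnmatch
import re


def title_matches(pattern, title):
    """Glob match by default; a leading '~' switches to regex (assumed syntax)."""
    if pattern.startswith("~"):
        return re.search(pattern[1:], title) is not None
    return fnmatch.fnmatchcase(title, pattern)


def find_rule(rules, wm_class, title):
    """Return the first rule whose match (and optional match_title) applies."""
    for rule in rules:
        if rule["match"].lower() != wm_class.lower():
            continue
        pattern = rule.get("match_title")
        if pattern is None or title_matches(pattern, title):
            return rule
    return None


# The same rules as the TOML example above, written as dicts.
rules = [
    {"match": "Google-chrome", "match_title": "*ChatGPT*",
     "paste_shortcut": "ctrl+shift+v",
     "rewrite_hint": "Conversational tone, no markdown"},
    {"match": "kitty", "is_terminal": True, "paste_shortcut": "ctrl+shift+v"},
]
```

With these rules, a Chrome window titled "ChatGPT - Google Chrome" picks up the first rule, while a Chrome window with an unrelated title matches nothing and would fall back to the defaults.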

API Keys

In BYO-key mode, store API keys in ~/.config/talk-to-tux/secrets.env (CWD .env is not loaded):

TTT_OPENAI_API_KEY=sk-...
TTT_GROQ_API_KEY=gsk_...
TTT_GOOGLE_API_KEY=AIza...
TTT_ELEVENLABS_API_KEY=...

GPU Server Deployment

The STT server runs faster-whisper on NVIDIA GPUs.

cd server
uv sync
uv run ttt-server --host 0.0.0.0 --port 8000

# Or Docker:
docker build -f deploy/Dockerfile.server -t ttt-server .
docker run --gpus all -p 8000:8000 ttt-server

Systemd service files are in deploy/.

Development

uv sync --all-groups          # install all deps including dev
uv run pytest tests/ -q       # run desktop tests
cd server && uv run pytest tests/ -q  # run GPU server tests
make lint                     # ruff check
make format                   # ruff format
uv run baml-cli generate      # regenerate BAML client after .baml changes

See CONTRIBUTING.md for the full developer guide.

Privacy & data flow

Talk-to-Tux captures your voice, screen, and active-app context to power the voice-to-paste pipeline. What leaves your machine depends on the API mode you pick during setup:

  • Hosted mode — recorded WAV audio and metadata are sent to Supabase Edge Functions for Groq Whisper STT. Rewrite requests send the transcript, active window, static/dynamic/internal context, app-specific hints, and optional screenshot bytes to Supabase. The hosted rewrite backend uses Groq Scout for text by default; Groq receives image bytes only when a hosted visual rewrite model is configured. Your account email/GitHub identity, quota rows, and audit data are retained per the privacy policy.
  • BYO-key mode — the same data is sent only to the providers whose keys you configure (TTT_OPENAI_API_KEY, TTT_GROQ_API_KEY, etc.); nothing hits our servers.

The local debug run cache (audio, transcripts, screenshots, rewrites) is off by default in both modes. Enable it only when you need to diagnose a bug:

TTT_CACHE_ENABLED=true uv run talk-to-tux
# or set [cache] enabled = true in ~/.config/talk-to-tux/config.toml

Correction-learning and feedback artifacts are stored separately under ~/.config/talk-to-tux/ and are not affected by the cache toggle.

CLI Reference

uv run talk-to-tux [COMMAND] [OPTIONS]

Daemon options (no subcommand):
  --trigger {auto,mouse,keyboard}   Trigger mode
  --record-mode {toggle,hold}       Recording mode
  --no-indicator                    Disable system tray
  --no-tooltip                      Disable tooltip/notifications
  --no-validation                   Disable recording validation sound
  --retry [RUN_ID]                  Retry from a cached run (default: latest)
  --show-config                     Print resolved config and exit
  --doctor                          Run diagnostics and exit
  --setup                           Run interactive first-run setup wizard
  --migrate-config                  Convert .env to config.toml
  --debug                           Enable debug logging

Hosted-mode subcommands (beta):
  login                             Sign in to hosted mode (GitHub OAuth)
  logout                            Clear saved hosted-mode token
  whoami                            Show signed-in account + tier + token expiry
  usage                             Show current quota (STT hours, rewrite calls)
  switch-mode {hosted, byo-key}     Switch between hosted and BYO-key API modes

Settings subcommand:
  settings [--json]                 Show resolved non-secret settings
  settings --set PATH=VALUE         Update allowlisted non-secret settings
  settings --reset-learning         Remove local global learning artifacts
  settings --reset-learning-app APP_ID
  settings --learning-export PATH
  settings --learning-import PATH
  settings --learning-reset SCOPE
  settings --learning-forget TARGET

The tray settings window now reports whether each saved change applied live, rebuilt runtime providers, or requires a restart. talk-to-tux settings --set prints the same apply mode metadata for local writes, but it only updates local config on disk; it does not hot-patch another running Talk-to-Tux process. Secret saves are still separate: use the tray settings window BYOK key controls or edit ~/.config/talk-to-tux/secrets.env directly. The CLI settings --set path is intentionally limited to allowlisted non-secret values in ~/.config/talk-to-tux/config.toml.

Settings Recovery

If a settings change leaves Talk-to-Tux pointed at a bad local endpoint or a restart-required combination, recover with the same storage split the app uses:

  1. Run uv run talk-to-tux settings --json to inspect the current non-secret state, or add --check-local-inference to see bounded endpoint health results.
  2. Fix or remove the bad non-secret value in ~/.config/talk-to-tux/config.toml, then restart Talk-to-Tux if the saved change was marked restart_required.
  3. Fix, replace, or clear BYOK provider keys only in ~/.config/talk-to-tux/secrets.env.
  4. Run uv run talk-to-tux --doctor if desktop integration or provider setup still looks wrong after the config/secrets fix.

On a fresh install with no config.toml, no TTT_API_MODE env var, no secrets.env, and no stored hosted token, the first invocation auto-launches the setup wizard (same as running --setup). The wizard asks you to pick hosted (GitHub OAuth, quota-managed) or byo-key (provide your own OpenAI/Groq/ElevenLabs keys in secrets.env). switch-mode byo-key never creates or modifies secrets.env — you edit it yourself.

Running as a Service

To start Talk-to-Tux automatically on login and restart on crash:

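# Copy the service file (create the user unit directory if it does not exist yet)
mkdir -p ~/.config/systemd/user
cp deploy/talk-to-tux.service ~/.config/systemd/user/

# Make systemd pick up the new unit, then enable and start it
systemctl --user daemon-reload
systemctl --user enable --now talk-to-tux

# Check status / logs
systemctl --user status talk-to-tux
journalctl --user -u talk-to-tux -f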

The service auto-restarts within 3 seconds if the app crashes. The GNOME Shell extension also auto-hides the recording overlay after 30 seconds if the app stops responding.

License

AGPL-3.0-only — see LICENSE. Hosted beta service terms are tracked separately on the Talk-to-Tux website.
