Local desktop voice and computer-control agent with MCP tools

VoiceUse

A local desktop voice agent that controls your computer hands-free. All AI inference is cloud-based; the agent itself runs natively on your machine and controls the OS.

Features

  • Wake word ("Computer") or hotkey (hold Right Ctrl) activation
  • Voice Activity Detection — knows when you stop speaking
  • Speaks back with TTS for confirmations, errors, and status updates
  • Cross-platform window control, typing, and screenshots (Windows primary, Linux secondary, macOS best-effort)
  • Multi-monitor support — screenshots only the monitor containing the target window
  • Safety layer — spoken confirmation before destructive actions (close, quit, delete, system commands, etc.)
  • Vision-powered clicking — uses Codex CLI or Anthropic Computer Use API to locate UI elements from screenshots
  • Grok Voice plugin — optional end-to-end voice via the xAI Realtime API (replaces the default STT→LLM→TTS pipeline)

Quick Start

1. Prerequisites

  • Python 3.10+
  • API keys for the cloud services you plan to use:
    • GROQ_API_KEY — required for STT and primary LLM
    • OPENAI_API_KEY — optional fallback LLM
    • CEREBRAS_API_KEY — optional, for using Cerebras as primary or fallback LLM
    • ANTHROPIC_API_KEY — optional, only if using Anthropic for vision
    • XAI_API_KEY — optional, only if using the Grok Voice plugin

2. Install

MCP computer-control tools only (recommended first install):

pipx install voice-computer-use-agent

This installs the global MCP server command:

voiceuse-computer-control-mcp

Then register it with an MCP-capable agent. For Codex CLI:

codex mcp add voiceuse-computer-control -- voiceuse-computer-control-mcp

Full voice assistant install:

pipx install "voice-computer-use-agent[all]"

Use this when you want the microphone, hotkey, STT, TTS, and realtime voice plugin dependencies installed into the pipx environment.

Local development install:

# Clone or download the repository
cd voiceuse

# Create a virtual environment (recommended)
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

# Install the package and all runtime dependencies
pip install -e .

# Or install with dev dependencies (tests, lint, type-check)
pip install -e ".[dev]"

3. Set API keys

Linux / macOS:

export GROQ_API_KEY="gsk_..."
export OPENAI_API_KEY="sk-..."        # optional fallback
export CEREBRAS_API_KEY="csk_..."     # optional Cerebras LLM
export ANTHROPIC_API_KEY="sk-ant-..." # optional vision
export XAI_API_KEY="xai-..."          # optional Grok Voice

Windows (PowerShell):

$env:GROQ_API_KEY="gsk_..."
$env:OPENAI_API_KEY="sk-..."

4. Run

# Normal run
python -m voiceuse

# Dry-run mode — no API calls, uses mock responses (great for first-time validation)
python -m voiceuse --dry-run

# Check that all dependencies are present
python -m voiceuse --check-install

# Enable rotating file logs
python -m voiceuse --log-file voiceuse.log

# Verbose debug output
python -m voiceuse --verbose

The first run creates a default config.yaml in the working directory if one does not exist.

5. Using the agent

  1. Hold Right Ctrl and speak, then release to submit.
  2. Or say "Computer" (if wake word is enabled) and speak until VAD detects silence.
  3. The agent transcribes your command, plans actions with the LLM, executes them, and speaks the result.
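The loop above can be sketched in a few lines of Python. Everything here is illustrative: the function names (`transcribe`, `plan_actions`, `execute`, `speak`) are hypothetical stand-ins for the real pipeline, stubbed with canned values so the flow is visible without a microphone or API keys.

```python
# Illustrative sketch of the transcribe -> plan -> execute -> speak loop
# described above. All functions are hypothetical stubs, not VoiceUse APIs.

def transcribe(audio: bytes) -> str:
    # Real pipeline: Whisper via the Groq API.
    return "open the browser"

def plan_actions(command: str) -> list[str]:
    # Real pipeline: the LLM turns the transcript into tool calls.
    return [f"open_app('chrome')  # from: {command}"]

def execute(actions: list[str]) -> str:
    # Real pipeline: OS-level window control, typing, and clicking.
    return f"Executed {len(actions)} action(s)"

def speak(text: str) -> str:
    # Real pipeline: TTS playback (edge-tts by default).
    return text

def handle_utterance(audio: bytes) -> str:
    command = transcribe(audio)
    result = execute(plan_actions(command))
    return speak(result)

print(handle_utterance(b"\x00" * 320))  # -> Executed 1 action(s)
```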

Configuration (config.yaml)

All runtime settings live in config.yaml. A default file is generated automatically.

audio:
  sample_rate: 16000
  hotkey: "right ctrl"
  wake_word: "computer"        # free Porcupine keywords: computer, jarvis, alexa, etc.
  wake_word_model_path: null

stt:
  provider: groq
  model: whisper-large-v3
  api_key: null          # falls back to GROQ_API_KEY env var

llm:
  provider: groq          # "groq", "cerebras", or "openai"
  model: llama-3.3-70b-versatile
  api_key: null           # falls back to GROQ_API_KEY env var
  fallback_provider: openai
  fallback_model: gpt-4o-mini
  fallback_api_key: null  # falls back to OPENAI_API_KEY env var
  cerebras_api_key: null  # falls back to CEREBRAS_API_KEY env var

tts:
  provider: edge
  voice: en-US-AriaNeural
  enabled: true

computer_use:
  provider: codex          # "codex" (Codex CLI, OAuth) or "anthropic" (API key)
  api_key: null            # only needed for anthropic; codex uses `codex login`

agent:
  backend: external_agent   # "native" or "external_agent"
  runner: codex_cli         # first external runner implementation
  command: codex
  working_directory: "."
  timeout_seconds: 300
  model: null
  sandbox: null
  skip_git_repo_check: true

safety:
  confirm_destructive: true
  destructive_keywords:
    - close
    - quit
    - delete
    - remove
    - kill
    - terminate
    - shutdown
    - reboot
    - format
    - rm -rf
    - type password
    - enter password
    - input password
  confirmation_timeout_seconds: 10

app:
  preferred_browser: chrome
  preferred_terminal: cmd
  codex_app_name: Codex
  dry_run: false           # overridden by --dry-run CLI flag
  aliases:                 # spoken name → exact Windows app name
    comet: "Comet Browser"
    # vscode: "Visual Studio Code"
    # edge: "Microsoft Edge"

plugins:
  grok_voice:
    enabled: false
    api_key: null          # falls back to XAI_API_KEY env var
    model: grok-voice-think-fast-1.0
    voice: Eve
    instructions: "You are a desktop voice assistant..."
    sample_rate: 24000
    turn_detection_type: server_vad
    input_audio_transcription_model: grok-2-audio

Key points:

  • api_key: null means "read from the environment variable."
  • --dry-run on the CLI forces app.dry_run: true for that run.
  • agent.backend: external_agent keeps VoiceUse as the voice shell and sends desktop work to an MCP-capable action agent.
  • plugins.grok_voice.enabled: true replaces the default STT→LLM→TTS pipeline with the xAI Realtime WebSocket.

PyPI / pipx Packaging

VoiceUse ships two console commands:

voiceuse
voiceuse-computer-control-mcp

Install options:

# Lightweight desktop-control MCP server
pipx install voice-computer-use-agent

# Full voice assistant with audio, STT, TTS, LLM, vision, and realtime extras
pipx install "voice-computer-use-agent[all]"

Publishing is handled by .github/workflows/publish-pypi.yml on version tags. Before tagging a release, configure PyPI Trusted Publishing for this GitHub repository with the pypi environment.

App Aliases

VoiceUse passes your currently open windows and app aliases to the LLM before every command. This means the LLM knows what's running and can resolve nicknames like "comet" → "Comet Browser".

Add aliases in config.yaml:

app:
  aliases:
    comet: "Comet Browser"
    vscode: "Visual Studio Code"
    edge: "Microsoft Edge"

How it works:

  1. You say: "Open Comet"
  2. Whisper transcribes: "Open comment" (STT error)
  3. Cerebras receives your open windows list + aliases
  4. Cerebras knows "comment" is close to "Comet Browser" and emits open_app("Comet Browser")
  5. find_window uses fuzzy matching (difflib) as a final safety net

Grok Voice Plugin (Optional)

The Grok Voice plugin uses the xAI Realtime API to stream audio end-to-end (STT + LLM + TTS in one WebSocket). When enabled, the default Brain/Whisper/edge-tts pipeline is disabled.

To enable:

  1. Set XAI_API_KEY environment variable.
  2. Edit config.yaml:
    plugins:
      grok_voice:
        enabled: true
        voice: Eve   # Eve, Ara, Leo, Rex, Sal
    
  3. Run python -m voiceuse as normal.

The plugin streams 24 kHz PCM audio to xAI and plays back assistant responses directly via PyAudio. It supports the same OS control tools as the default pipeline.

Per-OS Setup Notes

Windows (Primary)

  • pywin32 is required for robust window management and is installed automatically via requirements.txt on Windows.
  • Install ffmpeg and add it to PATH for ffplay TTS playback (optional but recommended).
  • If pyaudio fails to install, use a pre-built wheel:
    pip install pipwin
    pipwin install pyaudio
    

Linux

# Debian / Ubuntu
sudo apt-get update
sudo apt-get install -y \
    python3-pyaudio portaudio19-dev \
    python3-xlib xdotool wmctrl ffmpeg

# Arch
sudo pacman -S python-pyaudio portaudio xdotool wmctrl ffmpeg
  • xdotool and wmctrl are used for window management.
  • If you run Wayland, xdotool may not work; switch to X11 or use XWayland.

macOS (Best-effort)

brew install portaudio ffmpeg
  • Window management uses AppleScript and Quartz APIs (if pyobjc-framework-Quartz is installed).
  • afplay is used as a TTS playback fallback.

Vision Setup (Optional)

VoiceUse can click UI elements described in natural language using computer vision.

Codex CLI (default provider):

# macOS / Linux
brew install openai/codex/codex
# or
npm install -g @openai/codex
  • Authenticate with codex login (uses your ChatGPT Plus/Pro subscription via OAuth).
  • No API key needed; computer_use.api_key should stay null.

Anthropic (alternative provider):

  • Set ANTHROPIC_API_KEY.
  • Change computer_use.provider to anthropic in config.yaml.

Safety

Before any destructive action (close, quit, delete, system commands, password fields, etc.), the agent:

  1. Speaks a confirmation prompt.
  2. Listens for your spoken response.
  3. Proceeds only if you say yes, yep, yeah, or sure.
  4. Cancels on no, nope, cancel, timeout (10 s), or any other response.

System commands run through an allow-list by default (shell=False). If a command is not in the allow-list, it is blocked with an error message.
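The allow-list check can be sketched with shlex and subprocess. Passing an argv list (rather than a string with `shell=True`) means no shell ever interprets the command, so the transcript cannot smuggle in pipes, globs, or variable expansion. The allowed set below is hypothetical; VoiceUse's actual allow-list may differ.

```python
import shlex
import subprocess

ALLOWED = {"echo", "notepad", "calc"}  # hypothetical allow-list

def run_system_command(command: str) -> str:
    argv = shlex.split(command)        # tokenize without shell interpretation
    if not argv or argv[0] not in ALLOWED:
        name = argv[0] if argv else command
        return f"Blocked: '{name}' is not in the allow-list"
    # argv list + default shell=False: executed directly, never via a shell.
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout.strip()

print(run_system_command("rm -rf /"))    # Blocked: 'rm' is not in the allow-list
print(run_system_command("echo hello"))  # runs echo directly
```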

Troubleshooting

  • pyaudio install fails → Install the PortAudio system library first (see per-OS setup)
  • No audio playback → Install ffmpeg (for ffplay) or mpv
  • Wake word not detected → Set PORCUPINE_ACCESS_KEY if using a custom model; the built-in "computer" keyword works without a key
  • Codex CLI not found → Install with npm install -g @openai/codex or brew install openai/codex/codex
  • Window focus fails on Linux → Make sure xdotool is installed and you are on X11 (not Wayland)
  • Low confidence on clicks → Increase lighting, reduce monitor scaling, or rephrase the description
  • STT / LLM calls hang → Check your API keys and network connection; run with --verbose for details
  • Grok Voice plugin won't start → Ensure XAI_API_KEY is set and websockets is installed

Development

# Run tests
pytest -v

# Lint
ruff check voiceuse tests
ruff format voiceuse tests

# Type check
mypy voiceuse

# Build wheel
python -m build

License

MIT
