VoiceUse
Local desktop voice and computer-control agent with MCP tools.
A local desktop voice agent that controls your computer hands-free. All AI inference is cloud-based; the agent itself runs natively on your machine and controls the OS.
Features
- Wake word ("Computer") or hotkey (hold Right Ctrl) activation
- Voice Activity Detection — knows when you stop speaking
- Speaks back with TTS for confirmations, errors, and status updates
- Cross-platform window control, typing, and screenshots (Windows primary, Linux secondary, macOS best-effort)
- Multi-monitor support — screenshots only the monitor containing the target window
- Safety layer — spoken confirmation before destructive actions (close, quit, delete, system commands, etc.)
- Vision-powered clicking — uses Codex CLI or Anthropic Computer Use API to locate UI elements from screenshots
- Grok Voice plugin — optional end-to-end voice via the xAI Realtime API (replaces the default STT→LLM→TTS pipeline)
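The multi-monitor behavior above amounts to a largest-overlap test. Here is an illustrative sketch with a hypothetical Rect type; the real screenshot backend's window and monitor types will differ:

```python
from dataclasses import dataclass

@dataclass
class Rect:
    """Axis-aligned rectangle in screen coordinates (hypothetical stand-in)."""
    left: int
    top: int
    width: int
    height: int

    def overlap_area(self, other: "Rect") -> int:
        # Area of the intersection of the two rectangles (0 if disjoint).
        dx = min(self.left + self.width, other.left + other.width) - max(self.left, other.left)
        dy = min(self.top + self.height, other.top + other.height) - max(self.top, other.top)
        return max(dx, 0) * max(dy, 0)

def monitor_for_window(window: Rect, monitors: list[Rect]) -> Rect:
    # Screenshot only the monitor that contains most of the target window.
    return max(monitors, key=lambda m: m.overlap_area(window))
```

For a window sitting on a secondary display, the overlap with that display dominates, so only that monitor is captured.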
Quick Start
1. Prerequisites
- Python 3.10+
- API keys for the cloud services you plan to use:
- GROQ_API_KEY — required for STT and the primary LLM
- OPENAI_API_KEY — optional fallback LLM
- CEREBRAS_API_KEY — optional, for using Cerebras as the primary or fallback LLM
- ANTHROPIC_API_KEY — optional, only if using Anthropic for vision
- XAI_API_KEY — optional, only if using the Grok Voice plugin
2. Install
MCP computer-control tools only (recommended first install):
pipx install voice-computer-use-agent
This installs the global MCP server command:
voiceuse-computer-control-mcp
Then register it with an MCP-capable agent. For Codex CLI:
codex mcp add voiceuse-computer-control -- voiceuse-computer-control-mcp
Full voice assistant install:
pipx install "voice-computer-use-agent[all]"
Use this when you want the microphone, hotkey, STT, TTS, and realtime voice plugin dependencies installed into the pipx environment.
Local development install:
# Clone or download the repository
cd voiceuse
# Create a virtual environment (recommended)
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
# Install the package and all runtime dependencies
pip install -e .
# Or install with dev dependencies (tests, lint, type-check)
pip install -e ".[dev]"
3. Set API keys
Linux / macOS:
export GROQ_API_KEY="gsk_..."
export OPENAI_API_KEY="sk-..." # optional fallback
export CEREBRAS_API_KEY="csk_..." # optional Cerebras LLM
export ANTHROPIC_API_KEY="sk-ant-..." # optional vision
export XAI_API_KEY="xai-..." # optional Grok Voice
Windows (PowerShell):
$env:GROQ_API_KEY="gsk_..."
$env:OPENAI_API_KEY="sk-..."
4. Run
# Normal run
python -m voiceuse
# Dry-run mode — no API calls, uses mock responses (great for first-time validation)
python -m voiceuse --dry-run
# Check that all dependencies are present
python -m voiceuse --check-install
# Enable rotating file logs
python -m voiceuse --log-file voiceuse.log
# Verbose debug output
python -m voiceuse --verbose
The first run creates a default config.yaml in the working directory if one does not exist.
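That first-run behavior is a simple write-if-missing check. A sketch of the idea, where DEFAULT_CONFIG is a trimmed stand-in for the package's real defaults:

```python
from pathlib import Path

# Trimmed stand-in for the real default configuration shipped with the package.
DEFAULT_CONFIG = "app:\n  dry_run: false\n"

def ensure_config(path: str = "config.yaml") -> Path:
    # Create a default config.yaml in the working directory if one does not exist;
    # an existing file is never overwritten.
    p = Path(path)
    if not p.exists():
        p.write_text(DEFAULT_CONFIG, encoding="utf-8")
    return p
```

Because the file is only written when absent, your edits survive subsequent runs.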
5. Using the agent
- Hold Right Ctrl and speak, then release to submit.
- Or say "Computer" (if wake word is enabled) and speak until VAD detects silence.
- The agent transcribes your command, plans actions with the LLM, executes them, and speaks the result.
Configuration (config.yaml)
All runtime settings live in config.yaml. A default file is generated automatically.
audio:
sample_rate: 16000
hotkey: "right ctrl"
wake_word: "computer" # free Porcupine keywords: computer, jarvis, alexa, etc.
wake_word_model_path: null
stt:
provider: groq
model: whisper-large-v3
api_key: null # falls back to GROQ_API_KEY env var
llm:
provider: groq # "groq", "cerebras", or "openai"
model: llama-3.3-70b-versatile
api_key: null # falls back to GROQ_API_KEY env var
fallback_provider: openai
fallback_model: gpt-4o-mini
fallback_api_key: null # falls back to OPENAI_API_KEY env var
cerebras_api_key: null # falls back to CEREBRAS_API_KEY env var
tts:
provider: edge
voice: en-US-AriaNeural
enabled: true
computer_use:
provider: codex # "codex" (Codex CLI, OAuth) or "anthropic" (API key)
api_key: null # only needed for anthropic; codex uses `codex login`
agent:
backend: external_agent # "native" or "external_agent"
runner: codex_cli # first external runner implementation
command: codex
working_directory: "."
timeout_seconds: 300
model: null
sandbox: null
skip_git_repo_check: true
safety:
confirm_destructive: true
destructive_keywords:
- close
- quit
- delete
- remove
- kill
- terminate
- shutdown
- reboot
- format
- rm -rf
- type password
- enter password
- input password
confirmation_timeout_seconds: 10
app:
preferred_browser: chrome
preferred_terminal: cmd
codex_app_name: Codex
dry_run: false # overridden by --dry-run CLI flag
aliases: # spoken name → exact Windows app name
comet: "Comet Browser"
# vscode: "Visual Studio Code"
# edge: "Microsoft Edge"
plugins:
grok_voice:
enabled: false
api_key: null # falls back to XAI_API_KEY env var
model: grok-voice-think-fast-1.0
voice: Eve
instructions: "You are a desktop voice assistant..."
sample_rate: 24000
turn_detection_type: server_vad
input_audio_transcription_model: grok-2-audio
Key points:
- api_key: null means "read from the environment variable."
- --dry-run on the CLI forces app.dry_run: true for that run.
- agent.backend: external_agent keeps VoiceUse as the voice shell and sends desktop work to an MCP-capable action agent.
- plugins.grok_voice.enabled: true replaces the default STT→LLM→TTS pipeline with the xAI Realtime WebSocket.
PyPI / pipx Packaging
VoiceUse ships two console commands:
voiceuse
voiceuse-computer-control-mcp
Install options:
# Lightweight desktop-control MCP server
pipx install voice-computer-use-agent
# Full voice assistant with audio, STT, TTS, LLM, vision, and realtime extras
pipx install "voice-computer-use-agent[all]"
Publishing is handled by .github/workflows/publish-pypi.yml on version tags.
Before tagging a release, configure PyPI Trusted Publishing for this GitHub
repository with the pypi environment.
App Aliases
VoiceUse passes your currently open windows and app aliases to the LLM before every command. This means the LLM knows what's running and can resolve nicknames like "comet" → "Comet Browser".
Add aliases in config.yaml:
app:
aliases:
comet: "Comet Browser"
vscode: "Visual Studio Code"
edge: "Microsoft Edge"
How it works:
- You say: "Open Comet"
- Whisper transcribes: "Open comment" (STT error)
- Cerebras receives your open-windows list plus the aliases
- Cerebras knows "comment" is close to "Comet Browser" and emits open_app("Comet Browser")
- find_window uses fuzzy matching (difflib) as a final safety net
Grok Voice Plugin (Optional)
The Grok Voice plugin uses the xAI Realtime API to stream audio end-to-end (STT + LLM + TTS in one WebSocket). When enabled, the default Brain/Whisper/edge-tts pipeline is disabled.
To enable:
1. Set the XAI_API_KEY environment variable.
2. Edit config.yaml:
   plugins:
     grok_voice:
       enabled: true
       voice: Eve  # Eve, Ara, Leo, Rex, Sal
3. Run python -m voiceuse as normal.
The plugin streams 24 kHz PCM audio to xAI and plays back assistant responses directly via PyAudio. It supports the same OS control tools as the default pipeline.
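At 24 kHz, 16-bit mono PCM comes to 48,000 bytes per second, which determines the size of each streamed chunk. A sketch of that arithmetic (the chunk duration below is an illustrative example, not a documented plugin setting):

```python
SAMPLE_RATE = 24_000   # samples per second (matches sample_rate: 24000)
SAMPLE_WIDTH = 2       # bytes per sample for 16-bit PCM
CHANNELS = 1           # mono

def chunk_bytes(duration_ms: int) -> int:
    # Size in bytes of one audio chunk covering duration_ms of speech.
    samples = SAMPLE_RATE * duration_ms // 1000
    return samples * SAMPLE_WIDTH * CHANNELS
```

A 20 ms chunk, for example, works out to 480 samples, or 960 bytes on the wire.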
Per-OS Setup Notes
Windows (Primary)
- pywin32 is required for robust window management and is installed automatically via requirements.txt on Windows.
- Install ffmpeg and add it to PATH for ffplay TTS playback (optional but recommended).
- If pyaudio fails to install, use a pre-built wheel:
  pip install pipwin
  pipwin install pyaudio
Linux
# Debian / Ubuntu
sudo apt-get update
sudo apt-get install -y \
python3-pyaudio portaudio19-dev \
python3-xlib xdotool wmctrl ffmpeg
# Arch
sudo pacman -S python-pyaudio portaudio xdotool wmctrl ffmpeg
- xdotool and wmctrl are used for window management.
- If you run Wayland, xdotool may not work; switch to X11 or use XWayland.
macOS (Best-effort)
brew install portaudio ffmpeg
- Window management uses AppleScript and Quartz APIs (if pyobjc-framework-Quartz is installed).
- afplay is used as a TTS playback fallback.
Vision Setup (Optional)
VoiceUse can click UI elements described in natural language using computer vision.
Codex CLI (default provider):
# macOS / Linux
brew install openai/codex/codex
# or
npm install -g @openai/codex
- Authenticate with codex login (uses your ChatGPT Plus/Pro subscription via OAuth).
- No API key needed; computer_use.api_key should stay null.
Anthropic (alternative provider):
- Set ANTHROPIC_API_KEY.
- Change computer_use.provider to anthropic in config.yaml.
Safety
Before any destructive action (close, quit, delete, system commands, password fields, etc.), the agent:
- Speaks a confirmation prompt.
- Listens for your spoken response.
- Proceeds only if you say yes, yep, yeah, or sure.
- Cancels on no, nope, cancel, timeout (10 s), or any other response.
System commands run through an allow-list by default (shell=False). If a command is not in the allow-list, it is blocked with an error message.
Troubleshooting
| Symptom | Fix |
|---|---|
| pyaudio install fails | Install PortAudio system library first (see per-OS setup) |
| No audio playback | Install ffmpeg (for ffplay) or mpv |
| Wake word not detected | Set PORCUPINE_ACCESS_KEY if using a custom model; built-in "computer" keyword works without a key |
| Codex CLI not found | Install with npm install -g @openai/codex or brew install openai/codex/codex |
| Window focus fails on Linux | Make sure xdotool is installed and you are on X11 (not Wayland) |
| Low confidence on clicks | Increase lighting, reduce monitor scaling, or rephrase the description |
| STT / LLM calls hang | Check your API keys and network connection; run with --verbose for details |
| Grok Voice plugin won't start | Ensure XAI_API_KEY is set and websockets is installed |
Development
# Run tests
pytest -v
# Lint
ruff check voiceuse tests
ruff format voiceuse tests
# Type check
mypy voiceuse
# Build wheel
python -m build
License
MIT