Local desktop voice and computer-control agent with MCP tools

VoiceUse

A local desktop voice agent that controls your computer hands-free. All AI inference is cloud-based; the agent itself runs natively on your machine and controls the OS.

Features

  • Wake word ("Computer") or hotkey (hold Right Ctrl) activation
  • Voice Activity Detection — knows when you stop speaking
  • Speaks back with TTS for confirmations, errors, and status updates
  • Cross-platform window control, typing, and screenshots (Windows primary, Linux secondary, macOS best-effort)
  • Multi-monitor support — screenshots only the monitor containing the target window
  • Safety layer — spoken confirmation before destructive actions (close, quit, delete, system commands, etc.)
  • Vision-powered clicking — uses Codex CLI or Anthropic Computer Use API to locate UI elements from screenshots
  • Grok Voice plugin — optional end-to-end voice via the xAI Realtime API (replaces the default STT→LLM→TTS pipeline)

Quick Start

1. Prerequisites

  • Python 3.10+
  • API keys for the cloud services you plan to use:
    • GROQ_API_KEY — required for STT and primary LLM
    • OPENAI_API_KEY — optional fallback LLM
    • CEREBRAS_API_KEY — optional, for using Cerebras as primary or fallback LLM
    • ANTHROPIC_API_KEY — optional, only if using Anthropic for vision
    • XAI_API_KEY — optional, only if using the Grok Voice plugin

2. Install

MCP computer-control tools only (recommended first install):

pipx install voice-computer-use-agent

This installs the global MCP server command:

voiceuse-computer-control-mcp

Then register it with an MCP-capable agent. For Codex CLI:

codex mcp add voiceuse-computer-control -- voiceuse-computer-control-mcp

Full voice assistant install:

pipx install "voice-computer-use-agent[all]"

Use this when you want the microphone, hotkey, STT, TTS, and realtime voice plugin dependencies installed into the pipx environment.

Local development install:

# Clone or download the repository
cd voiceuse

# Create a virtual environment (recommended)
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

# Install the package and all runtime dependencies
pip install -e .

# Or install with dev dependencies (tests, lint, type-check)
pip install -e ".[dev]"

3. Set API keys

Linux / macOS:

export GROQ_API_KEY="gsk_..."
export OPENAI_API_KEY="sk-..."        # optional fallback
export CEREBRAS_API_KEY="csk_..."     # optional Cerebras LLM
export ANTHROPIC_API_KEY="sk-ant-..." # optional vision
export XAI_API_KEY="xai-..."          # optional Grok Voice

Windows (PowerShell):

$env:GROQ_API_KEY="gsk_..."
$env:OPENAI_API_KEY="sk-..."

4. Run

# Normal run
python -m voiceuse

# Dry-run mode — no API calls, uses mock responses (great for first-time validation)
python -m voiceuse --dry-run

# Check that all dependencies are present
python -m voiceuse --check-install

# Enable rotating file logs
python -m voiceuse --log-file voiceuse.log

# Verbose debug output
python -m voiceuse --verbose

The first run creates a default config.yaml in the working directory if one does not exist.

5. Using the agent

  1. Hold Right Ctrl and speak, then release to submit.
  2. Or say "Computer" (if wake word is enabled) and speak until VAD detects silence.
  3. The agent transcribes your command, plans actions with the LLM, executes them, and speaks the result.
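The loop above can be sketched in a few lines of Python. Everything here is illustrative: the function names (`transcribe`, `plan_actions`, `execute`, `speak`) are hypothetical stand-ins for the real pipeline, stubbed with canned values so the flow is visible without a microphone or API keys.

```python
# Illustrative sketch of the transcribe -> plan -> execute -> speak loop
# described above. All functions are hypothetical stubs, not VoiceUse APIs.

def transcribe(audio: bytes) -> str:
    # Real pipeline: Whisper via the Groq API.
    return "open the browser"

def plan_actions(command: str) -> list[str]:
    # Real pipeline: the LLM turns the transcript into tool calls.
    return [f"open_app('chrome')  # from: {command}"]

def execute(actions: list[str]) -> str:
    # Real pipeline: OS-level window control, typing, and clicking.
    return f"Executed {len(actions)} action(s)"

def speak(text: str) -> str:
    # Real pipeline: TTS playback (edge-tts by default).
    return text

def handle_utterance(audio: bytes) -> str:
    command = transcribe(audio)
    result = execute(plan_actions(command))
    return speak(result)

print(handle_utterance(b"\x00" * 320))  # -> Executed 1 action(s)
```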

Configuration (config.yaml)

All runtime settings live in config.yaml. A default file is generated automatically.

audio:
  sample_rate: 16000
  hotkey: "right ctrl"
  wake_word: "computer"        # free Porcupine keywords: computer, jarvis, alexa, etc.
  wake_word_model_path: null

stt:
  provider: groq
  model: whisper-large-v3
  api_key: null          # falls back to GROQ_API_KEY env var

llm:
  provider: groq          # "groq", "cerebras", or "openai"
  model: llama-3.3-70b-versatile
  api_key: null           # falls back to GROQ_API_KEY env var
  fallback_provider: openai
  fallback_model: gpt-4o-mini
  fallback_api_key: null  # falls back to OPENAI_API_KEY env var
  cerebras_api_key: null  # falls back to CEREBRAS_API_KEY env var

tts:
  provider: edge
  voice: en-US-AriaNeural
  enabled: true

computer_use:
  provider: codex          # "codex" (Codex CLI, OAuth) or "anthropic" (API key)
  api_key: null            # only needed for anthropic; codex uses `codex login`

agent:
  backend: external_agent   # "native" or "external_agent"
  runner: codex_cli         # first external runner implementation
  command: codex
  working_directory: "."
  timeout_seconds: 300
  model: null
  sandbox: null
  skip_git_repo_check: true

safety:
  confirm_destructive: true
  destructive_keywords:
    - close
    - quit
    - delete
    - remove
    - kill
    - terminate
    - shutdown
    - reboot
    - format
    - rm -rf
    - type password
    - enter password
    - input password
  confirmation_timeout_seconds: 10

app:
  preferred_browser: chrome
  preferred_terminal: cmd
  codex_app_name: Codex
  dry_run: false           # overridden by --dry-run CLI flag
  aliases:                 # spoken name → exact Windows app name
    comet: "Comet Browser"
    # vscode: "Visual Studio Code"
    # edge: "Microsoft Edge"

plugins:
  grok_voice:
    enabled: false
    api_key: null          # falls back to XAI_API_KEY env var
    model: grok-voice-think-fast-1.0
    voice: Eve
    instructions: "You are a desktop voice assistant..."
    sample_rate: 24000
    turn_detection_type: server_vad
    input_audio_transcription_model: grok-2-audio

Key points:

  • api_key: null means "read from the environment variable."
  • --dry-run on the CLI forces app.dry_run: true for that run.
  • agent.backend: external_agent keeps VoiceUse as the voice shell and sends desktop work to an MCP-capable action agent.
  • plugins.grok_voice.enabled: true replaces the default STT→LLM→TTS pipeline with the xAI Realtime WebSocket.

PyPI / pipx Packaging

VoiceUse ships two console commands:

voiceuse
voiceuse-computer-control-mcp

Install options:

# Lightweight desktop-control MCP server
pipx install voice-computer-use-agent

# Full voice assistant with audio, STT, TTS, LLM, vision, and realtime extras
pipx install "voice-computer-use-agent[all]"

Publishing is handled by .github/workflows/publish-pypi.yml on version tags. Before tagging a release, configure PyPI Trusted Publishing for this GitHub repository with the pypi environment.

App Aliases

VoiceUse passes your currently open windows and app aliases to the LLM before every command. This means the LLM knows what's running and can resolve nicknames like "comet" → "Comet Browser".

Add aliases in config.yaml:

app:
  aliases:
    comet: "Comet Browser"
    vscode: "Visual Studio Code"
    edge: "Microsoft Edge"

How it works:

  1. You say: "Open Comet"
  2. Whisper transcribes: "Open comment" (STT error)
  3. Cerebras receives your open windows list + aliases
  4. Cerebras knows "comment" is close to "Comet Browser" and emits open_app("Comet Browser")
  5. find_window uses fuzzy matching (difflib) as a final safety net

Grok Voice Plugin (Optional)

The Grok Voice plugin uses the xAI Realtime API to stream audio end-to-end (STT + LLM + TTS in one WebSocket). When enabled, the default Brain/Whisper/edge-tts pipeline is disabled.

To enable:

  1. Set XAI_API_KEY environment variable.
  2. Edit config.yaml:
    plugins:
      grok_voice:
        enabled: true
        voice: Eve   # Eve, Ara, Leo, Rex, Sal
    
  3. Run python -m voiceuse as normal.

The plugin streams 24 kHz PCM audio to xAI and plays back assistant responses directly via PyAudio. It supports the same OS control tools as the default pipeline.

Per-OS Setup Notes

Windows (Primary)

  • pywin32 is required for robust window management and is installed automatically via requirements.txt on Windows.
  • Install ffmpeg and add it to PATH for ffplay TTS playback (optional but recommended).
  • If pyaudio fails to install, use a pre-built wheel:
    pip install pipwin
    pipwin install pyaudio
    

Linux

# Debian / Ubuntu
sudo apt-get update
sudo apt-get install -y \
    python3-pyaudio portaudio19-dev \
    python3-xlib xdotool wmctrl ffmpeg

# Arch
sudo pacman -S python-pyaudio portaudio xdotool wmctrl ffmpeg
  • xdotool and wmctrl are used for window management.
  • If you run Wayland, xdotool may not work; switch to X11 or use XWayland.

macOS (Best-effort)

brew install portaudio ffmpeg
  • Window management uses AppleScript and Quartz APIs (if pyobjc-framework-Quartz is installed).
  • afplay is used as a TTS playback fallback.

Vision Setup (Optional)

VoiceUse can click UI elements described in natural language using computer vision.

Codex CLI (default provider):

# macOS / Linux
brew install openai/codex/codex
# or
npm install -g @openai/codex
  • Authenticate with codex login (uses your ChatGPT Plus/Pro subscription via OAuth).
  • No API key needed; computer_use.api_key should stay null.

Anthropic (alternative provider):

  • Set ANTHROPIC_API_KEY.
  • Change computer_use.provider to anthropic in config.yaml.

Safety

Before any destructive action (close, quit, delete, system commands, password fields, etc.), the agent:

  1. Speaks a confirmation prompt.
  2. Listens for your spoken response.
  3. Proceeds only if you say yes, yep, yeah, or sure.
  4. Cancels on no, nope, cancel, timeout (10 s), or any other response.

System commands run through an allow-list by default (shell=False). If a command is not in the allow-list, it is blocked with an error message.
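The allow-list check can be sketched with shlex and subprocess. Passing an argv list (rather than a string with `shell=True`) means no shell ever interprets the command, so the transcript cannot smuggle in pipes, globs, or variable expansion. The allowed set below is hypothetical; VoiceUse's actual allow-list may differ.

```python
import shlex
import subprocess

ALLOWED = {"echo", "notepad", "calc"}  # hypothetical allow-list

def run_system_command(command: str) -> str:
    argv = shlex.split(command)        # tokenize without shell interpretation
    if not argv or argv[0] not in ALLOWED:
        name = argv[0] if argv else command
        return f"Blocked: '{name}' is not in the allow-list"
    # argv list + default shell=False: executed directly, never via a shell.
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout.strip()

print(run_system_command("rm -rf /"))    # Blocked: 'rm' is not in the allow-list
print(run_system_command("echo hello"))  # runs echo directly
```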

Troubleshooting

  • pyaudio install fails → Install the PortAudio system library first (see per-OS setup)
  • No audio playback → Install ffmpeg (for ffplay) or mpv
  • Wake word not detected → Set PORCUPINE_ACCESS_KEY if using a custom model; the built-in "computer" keyword works without a key
  • Codex CLI not found → Install with npm install -g @openai/codex or brew install openai/codex/codex
  • Window focus fails on Linux → Make sure xdotool is installed and you are on X11 (not Wayland)
  • Low confidence on clicks → Increase lighting, reduce monitor scaling, or rephrase the description
  • STT / LLM calls hang → Check your API keys and network connection; run with --verbose for details
  • Grok Voice plugin won't start → Ensure XAI_API_KEY is set and websockets is installed

Development

# Run tests
pytest -v

# Lint
ruff check voiceuse tests
ruff format voiceuse tests

# Type check
mypy voiceuse

# Build wheel
python -m build

License

MIT
