An evaluation arena for realtime voice agents.

VoxArena

License: MIT · Python 3.11+ · Built with Pipecat · PRs Welcome

VoxArena is a reproducible benchmarking harness for realtime voice agents. Run the same scripted conversation across Gemini Live, OpenAI Realtime, and other Pipecat-supported providers — and compare them apples-to-apples on latency, tool-call accuracy, and hallucinations.

Drop it into your CI pipeline, your dev loop, or the bundled control panel.

📚 Interactive Documentation Guide: A visual dashboard guide with an interactive CLI command builder is located in the docs/ directory (hostable on GitHub Pages).


🚀 CI & Pipeline Integration

VoxArena ships a voxarena CLI designed for headless use in your build pipeline. It returns a non-zero exit code when metrics fall below thresholds you define, and emits JUnit XML for native CI reporting.

pip install voxarena

voxarena run \
  --provider gemini \
  --script ./script/utterances.json \
  --min-tool-accuracy 0.9 \
  --max-hallucinations 0 \
  --max-avg-ttfa-ms 1500 \
  --output result.json \
  --junit voxarena.xml
# exit 0 if every threshold passes, 1 otherwise
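
The pass/fail behaviour described above can be pictured with a small Python sketch. It is not VoxArena's implementation: the metric keys (`tool_accuracy`, `hallucinations`, `avg_ttfa_ms`) and the helper names are assumptions made for illustration, and only the three limits mirror the flags in the command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults mirror the flag values used in the command above.
    min_tool_accuracy: float = 0.9
    max_hallucinations: int = 0
    max_avg_ttfa_ms: float = 1500.0


def violations(metrics: dict, limits: Thresholds) -> list[str]:
    """Return one message per breached threshold; an empty list means pass."""
    problems = []
    # The metric keys below are hypothetical, not the real result.json schema.
    if metrics["tool_accuracy"] < limits.min_tool_accuracy:
        problems.append(
            f"tool accuracy {metrics['tool_accuracy']:.2f} < {limits.min_tool_accuracy}"
        )
    if metrics["hallucinations"] > limits.max_hallucinations:
        problems.append(
            f"hallucinations {metrics['hallucinations']} > {limits.max_hallucinations}"
        )
    if metrics["avg_ttfa_ms"] > limits.max_avg_ttfa_ms:
        problems.append(
            f"avg TTFA {metrics['avg_ttfa_ms']} ms > {limits.max_avg_ttfa_ms} ms"
        )
    return problems


def exit_code(metrics: dict, limits: Thresholds) -> int:
    """Map the check onto the 0/1 convention a CI runner expects."""
    return 1 if violations(metrics, limits) else 0
```

A CI wrapper would typically pass the result of `exit_code(...)` to `sys.exit` so the build step fails when any limit is breached.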

Compare two providers in one shot

voxarena compare \
  --gemini-model gemini-3.1-flash-live-preview \
  --openai-model gpt-realtime-2 \
  --num-turns 5 \
  --min-tool-accuracy 0.9 \
  --output compare.json

GitHub Actions

- name: Voice agent regression check
  env:
    GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
  run: |
    pip install voxarena
    voxarena run --provider gemini \
      --min-tool-accuracy 0.92 --max-hallucinations 0 \
      --junit voxarena.xml --quiet

- uses: mikepenz/action-junit-report@v4
  if: always()
  with:
    report_paths: voxarena.xml
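
If you want to inspect the JUnit report outside a CI reporter, a short Python sketch like the following may help. The sample XML is invented to show the common JUnit layout (`testsuite` / `testcase` / `failure`); the exact attributes and test names that VoxArena emits are not documented here and may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the common JUnit layout, not real VoxArena output.
SAMPLE = """<testsuite name="voxarena" tests="3" failures="1" errors="0">
  <testcase name="u01"/>
  <testcase name="u02"/>
  <testcase name="u04"><failure message="expected tool get_hours"/></testcase>
</testsuite>"""


def summarize_junit(xml_text: str) -> dict:
    """Count test cases and list the names of those with a failure or error."""
    root = ET.fromstring(xml_text)
    # iter() also finds <testcase> elements when the root is <testsuites>.
    cases = list(root.iter("testcase"))
    failed = [
        case.get("name")
        for case in cases
        if case.find("failure") is not None or case.find("error") is not None
    ]
    return {"total": len(cases), "failed": failed}


summary = summarize_junit(SAMPLE)
print(f"{summary['total']} cases, failed: {summary['failed']}")
```

To read a file produced by `--junit voxarena.xml`, pass the file's text to `summarize_junit` instead of `SAMPLE`.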

Subcommands

| Command | What it does |
| --- | --- |
| voxarena run | Single-provider scripted run; exits 0/1 against thresholds. |
| voxarena compare | Runs Gemini and OpenAI in parallel against the same script. |
| voxarena report | Generates a markdown comparison report from past runs. |

Run voxarena <command> --help for the full flag set.


🖥️ Web Control Panel UI (Zero Setup)

You can configure credentials, build test scripts, and run the benchmark suite entirely from your web browser:

pip install voxarena
voxarena ui

This starts a local server and automatically opens the dashboard in your default browser at http://127.0.0.1:8000.

From the UI, you can:

  • Set Up API Keys: Add and save Google Gemini and OpenAI API keys securely in the local database.
  • Select Models: Pick from preloaded Gemini and OpenAI realtime models, or write in your own custom model identifiers.
  • Edit Test Utterances: Create, edit, and delete turns in your test scripts using the interactive visual list editor (no raw YAML/JSON formatting needed).
  • Run & Inspect: Start live comparison runs and watch real-time transcripts, metrics, audio playbacks, and tool-call correctness side-by-side.

Note: If you run voxarena ui in a clean, empty directory, it will automatically bootstrap default script files and pre-recorded audio so you can run benchmarks immediately.


Features

  • 🎙️ Provider-agnostic agent — one Pipecat pipeline drives every provider; swap models without re-implementing your agent
  • 🔁 Scripted conversations — multi-turn JSON or YAML scripts with pre-recorded WAV inputs and expected tool calls / response content
  • 📊 Automated scoring — tool-call correctness, response matching, hallucination counts, time-to-first-audio, interruption-stop latency
  • 🆚 Side-by-side comparisons — run multiple providers in parallel against the same script
  • 🗄️ Persistent run history — JSON manifests on disk, indexed in SQLite
  • 🖥️ Web control panel — React UI to launch runs, watch live status, browse results, and edit scripts
  • 🧩 Extensible — add a new provider by implementing one adapter class

Architecture

[Figure: VoxArena architecture diagram]

Local Dev (with UI)

git clone https://github.com/simkeyur/vox-arena.git
cd vox-arena
cp .env.example .env  # add GOOGLE_API_KEY / OPENAI_API_KEY

python3 -m venv .venv && source .venv/bin/activate
pip install -e .

uvicorn voxarena.main:app --reload --port 8000

Then in another terminal:

cd ui && npm install && npm run dev

Open the control panel at http://localhost:5173.

Bring Your Own Agent

The demo ships with the "Saffron Leaf" restaurant agent so you can run end-to-end on day one. To evaluate your own:

  1. Replace the system prompt and tool schemas in voxarena/agent.py
  2. Implement (or stub) your tools in voxarena/tools.py
  3. Re-record script/audio/*.wav and update script/utterances.yaml (or script/utterances.json) to reflect your real workload
  4. Run the arena as normal — every provider gets scored against your scripts

Scripted Conversations

Conversations live in either JSON or YAML script files (e.g., script/utterances.json or script/utterances.yaml). Each turn pairs an utterance ID with an expect block describing the correct tool call and/or response content:

- id: u04
  text: "Are you open on Sundays?"
  expect:
    tool: get_hours
    args:
      day: sunday
    response_contains:
      - "closed"

The harness plays script/audio/{id}.wav into the pipeline and scores the agent's actual tool calls and transcript against expect.
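
As a rough illustration of scoring one turn against its expect block, here is a minimal Python sketch. The function name, the shape of the recorded tool calls (`name` and `args` keys), and the case-insensitive substring match are all assumptions; VoxArena's actual scorer may work differently, for example by using the evaluation models listed under Configuration.

```python
def score_turn(expect: dict, tool_calls: list[dict], transcript: str) -> dict:
    """Compare one turn's observed behaviour with its expect block."""
    expected_tool = expect.get("tool")
    expected_args = expect.get("args", {})

    # With no expected tool, the tool check passes trivially (an assumption).
    tool_ok = True
    if expected_tool is not None:
        tool_ok = any(
            call.get("name") == expected_tool
            and all(call.get("args", {}).get(k) == v for k, v in expected_args.items())
            for call in tool_calls
        )

    # Assumed matching rule: every phrase must appear, ignoring case.
    text = transcript.lower()
    missing = [p for p in expect.get("response_contains", []) if p.lower() not in text]
    return {"tool_ok": tool_ok, "response_ok": not missing, "missing": missing}


# The u04 turn from the script above, with a hypothetical agent response.
expect_u04 = {
    "tool": "get_hours",
    "args": {"day": "sunday"},
    "response_contains": ["closed"],
}
observed_calls = [{"name": "get_hours", "args": {"day": "sunday"}}]
result = score_turn(expect_u04, observed_calls, "Sorry, we're closed on Sundays.")
print(result)
```

Returning the list of missing phrases, rather than a bare boolean, makes it easier to see why a turn failed when reading a report.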

Configuration

| Variable | Description |
| --- | --- |
| GOOGLE_API_KEY / OPENAI_API_KEY | Provider credentials |
| GEMINI_MODEL / OPENAI_MODEL | Realtime model under test |
| GEMINI_EVAL_MODEL / OPENAI_EVAL_MODEL | Cheaper text models for grading |
| PORT | FastAPI server port |
| BASE_DIR | Override workdir (CLI: --workdir) |

Contributing

To add a new provider: implement an adapter in voxarena/providers/ following the pattern in gemini.py / openai.py, wire it into voxarena/harness.py and voxarena/config.py, and open a PR.
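
To give a feel for the adapter pattern, here is a purely hypothetical Python sketch of a registry of provider adapters. None of these names (`ProviderAdapter`, `create_service`, `register`, `get_adapter`, `EchoAdapter`) come from VoxArena; the real interface is whatever gemini.py and openai.py implement, so treat those files as the source of truth.

```python
from abc import ABC, abstractmethod


class ProviderAdapter(ABC):
    """Hypothetical base class; VoxArena's real adapter interface may differ."""

    name: str  # short identifier, e.g. a value passed to --provider

    @abstractmethod
    def create_service(self, model: str, api_key: str) -> object:
        """Return the provider-specific service that the shared pipeline drives."""


_REGISTRY: dict[str, type[ProviderAdapter]] = {}


def register(cls: type[ProviderAdapter]) -> type[ProviderAdapter]:
    """Class decorator that makes an adapter discoverable by name."""
    _REGISTRY[cls.name] = cls
    return cls


@register
class EchoAdapter(ProviderAdapter):
    """Toy adapter used only to exercise the registry."""

    name = "echo"

    def create_service(self, model: str, api_key: str) -> object:
        # A real adapter would construct a realtime service object here.
        return {"provider": self.name, "model": model}


def get_adapter(name: str) -> ProviderAdapter:
    """Instantiate the adapter registered under the given name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
```

The point of the pattern is that the harness only needs a provider name to obtain a service, so adding a provider does not require changes to the agent itself.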

For bugs and feature requests, please open an issue.

License

MIT.
