An evaluation arena for realtime voice agents.
VoxArena is a reproducible benchmarking harness for realtime voice agents. Run the same scripted conversation across Gemini Live, OpenAI Realtime, and other Pipecat-supported providers — and compare them apples-to-apples on latency, tool-call accuracy, and hallucinations.
Drop it into your CI pipeline, your dev loop, or the bundled control panel.
🚀 CI & Pipeline Integration
VoxArena ships a voxarena CLI designed for headless use in your build pipeline. It returns a non-zero exit code when metrics fall below thresholds you define, and emits JUnit XML for native CI reporting.
pip install voxarena
voxarena run \
  --provider gemini \
  --script ./script/utterances.yaml \
  --min-tool-accuracy 0.9 \
  --max-hallucinations 0 \
  --max-avg-ttfa-ms 1500 \
  --output result.json \
  --junit voxarena.xml
# exit 0 if every threshold passes, 1 otherwise
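The threshold flags above amount to a pass/fail gate over a handful of metrics. A minimal sketch of that logic follows, assuming metric names such as `tool_accuracy`, `hallucinations`, and `avg_ttfa_ms`; these keys are hypothetical and are not taken from the actual `result.json` schema, which this page does not document.

```python
from typing import Mapping


def check_thresholds(
    metrics: Mapping[str, float],
    *,
    min_tool_accuracy: float,
    max_hallucinations: int,
    max_avg_ttfa_ms: float,
) -> list[str]:
    """Return one message per violated threshold; an empty list means pass."""
    failures = []
    if metrics["tool_accuracy"] < min_tool_accuracy:
        failures.append(f"tool accuracy {metrics['tool_accuracy']} < {min_tool_accuracy}")
    if metrics["hallucinations"] > max_hallucinations:
        failures.append(f"hallucinations {metrics['hallucinations']} > {max_hallucinations}")
    if metrics["avg_ttfa_ms"] > max_avg_ttfa_ms:
        failures.append(f"avg TTFA {metrics['avg_ttfa_ms']} ms > {max_avg_ttfa_ms} ms")
    return failures


def exit_code(failures: list[str]) -> int:
    """Map failures to the 0/1 convention described above."""
    return 1 if failures else 0


# Hypothetical metrics for illustration only, not real benchmark output.
example = {"tool_accuracy": 0.95, "hallucinations": 1, "avg_ttfa_ms": 1200.0}
failures = check_thresholds(
    example, min_tool_accuracy=0.9, max_hallucinations=0, max_avg_ttfa_ms=1500
)
print(failures, exit_code(failures))
```

Collecting every failure before choosing the exit code, rather than stopping at the first, keeps the CI log useful when several metrics regress at once.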
Compare two providers in one shot
voxarena compare \
  --gemini-model gemini-3.1-flash-live-preview \
  --openai-model gpt-realtime-2 \
  --num-turns 5 \
  --min-tool-accuracy 0.9 \
  --output compare.json
GitHub Actions
- name: Voice agent regression check
  env:
    GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
  run: |
    pip install voxarena
    voxarena run --provider gemini \
      --min-tool-accuracy 0.92 --max-hallucinations 0 \
      --junit voxarena.xml --quiet
- uses: mikepenz/action-junit-report@v4
  if: always()
  with:
    report_paths: voxarena.xml
Subcommands
| Command | What it does |
|---|---|
| voxarena run | Single-provider scripted run; exits 0/1 against thresholds. |
| voxarena compare | Runs Gemini and OpenAI in parallel against the same script. |
| voxarena report | Generates a markdown comparison report from past runs. |
Run voxarena <command> --help for the full flag set.
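To illustrate what running providers in parallel against the same script can look like, here is a minimal `asyncio` sketch. It is not VoxArena's implementation: `run_provider` is a hypothetical stand-in that does no network I/O, and the result shape is an assumption.

```python
import asyncio


async def run_provider(name: str, turns: list[str]) -> dict:
    """Hypothetical stand-in for one provider session over the shared script."""
    await asyncio.sleep(0)  # a real session would await audio and network I/O here
    return {"provider": name, "turns": len(turns)}


async def compare(providers: list[str], turns: list[str]) -> dict:
    """Run every provider concurrently on the same turns and key results by name."""
    results = await asyncio.gather(*(run_provider(p, turns) for p in providers))
    return {r["provider"]: r for r in results}


results = asyncio.run(compare(["gemini", "openai"], ["u01", "u02"]))
print(results)
```

Because each provider receives the identical list of turns, any difference in the results comes from the provider rather than from the input.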
Features
- 🎙️ Provider-agnostic agent — one Pipecat pipeline drives every provider; swap models without re-implementing your agent
- 🔁 Scripted conversations — multi-turn YAML scripts with pre-recorded WAV inputs and expected tool calls / response content
- 📊 Automated scoring — tool-call correctness, response matching, hallucination counts, time-to-first-audio, interruption-stop latency
- 🆚 Side-by-side comparisons — run multiple providers in parallel against the same script
- 🗄️ Persistent run history — JSON manifests on disk, indexed in SQLite
- 🖥️ Web control panel — React UI to launch runs, watch live status, browse results, and edit scripts
- 🧩 Extensible — add a new provider by implementing one adapter class
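As an illustration of the run-history feature listed above (JSON manifests on disk, indexed in SQLite), the sketch below indexes a directory of manifests into a table. The table name, columns, and manifest fields (`run_id`, `provider`) are assumptions made for the example; the real schema is not described here.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def index_manifests(run_dir: Path, conn: sqlite3.Connection) -> int:
    """Insert one row per JSON manifest found in run_dir; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(run_id TEXT PRIMARY KEY, provider TEXT, manifest_path TEXT)"
    )
    count = 0
    for path in sorted(run_dir.glob("*.json")):
        manifest = json.loads(path.read_text())
        # INSERT OR REPLACE keeps re-indexing idempotent.
        conn.execute(
            "INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
            (manifest["run_id"], manifest["provider"], str(path)),
        )
        count += 1
    conn.commit()
    return count


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    # Hypothetical manifest written only for this demonstration.
    (run_dir / "run-001.json").write_text(
        json.dumps({"run_id": "run-001", "provider": "gemini"})
    )
    conn = sqlite3.connect(":memory:")
    indexed = index_manifests(run_dir, conn)
    rows = conn.execute("SELECT run_id, provider FROM runs").fetchall()
    conn.close()

print(indexed, rows)
```

Keeping the manifests as the source of truth and treating SQLite as a rebuildable index means the database can be deleted and regenerated without losing any run data.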
Local Dev (with UI)
git clone https://github.com/simkeyur/vox-arena.git
cd vox-arena
cp .env.example .env # add GOOGLE_API_KEY / OPENAI_API_KEY
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
uvicorn voxarena.main:app --reload --port 8000
Then in another terminal:
cd ui && npm install && npm run dev
Open the control panel at http://localhost:5173.
Bring Your Own Agent
The demo ships with the "Saffron Leaf" restaurant agent so you can run end-to-end on day one. To evaluate your own:
- Replace the system prompt and tool schemas in voxarena/agent.py
- Implement (or stub) your tools in voxarena/tools.py
- Re-record script/audio/*.wav and update script/utterances.yaml to reflect your real workload
- Run the arena as normal; every provider gets scored against your scripts
Scripted Conversations
Conversations live in script/utterances.yaml. Each turn pairs an utterance id with an expect block describing the correct tool call and/or response content:
- id: u04
  text: "Are you open on Sundays?"
  expect:
    tool: get_hours
    args:
      day: sunday
    response_contains:
      - "closed"
The harness plays script/audio/{id}.wav into the pipeline and scores the agent's actual tool calls and transcript against expect.
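The scoring step can be sketched as a comparison between the `expect` block and what the agent actually did. This is an illustration under assumptions, not the harness's real scorer: the function name, the shape of the recorded tool calls (`name` and `args`), the case-insensitive substring match, and the rule that only the expected arguments need to match are all choices made for the example.

```python
def score_turn(expect: dict, tool_calls: list[dict], transcript: str) -> dict:
    """Score one turn: was the expected tool called, and does the reply match?"""
    tool_ok = True
    if "tool" in expect:
        wanted_args = expect.get("args", {})
        # Pass if any recorded call has the right name and every expected argument.
        tool_ok = any(
            call["name"] == expect["tool"]
            and all(call.get("args", {}).get(k) == v for k, v in wanted_args.items())
            for call in tool_calls
        )
    lowered = transcript.lower()
    response_ok = all(
        snippet.lower() in lowered for snippet in expect.get("response_contains", [])
    )
    return {"tool_ok": tool_ok, "response_ok": response_ok}


# Mirrors the u04 turn shown above; the call and transcript are made up.
expect = {
    "tool": "get_hours",
    "args": {"day": "sunday"},
    "response_contains": ["closed"],
}
calls = [{"name": "get_hours", "args": {"day": "sunday"}}]
result = score_turn(expect, calls, "Sorry, we are closed on Sundays.")
print(result)
```

A turn that omits `tool` or `response_contains` passes that check vacuously in this sketch, which matches the "and/or" wording of the script format.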
Configuration
| Variable | Description |
|---|---|
| GOOGLE_API_KEY / OPENAI_API_KEY | Provider credentials |
| GEMINI_MODEL / OPENAI_MODEL | Realtime model under test |
| GEMINI_EVAL_MODEL / OPENAI_EVAL_MODEL | Cheaper text models for grading |
| PORT | FastAPI server port |
| BASE_DIR | Override workdir (CLI: --workdir) |
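For orientation, the variables above could be read into a typed settings object along these lines. This is a sketch, not the contents of voxarena/config.py: the `Settings` class, the `load_settings` helper, and the defaults (port 8000, taken from the local dev command, and the current directory for `BASE_DIR`) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    google_api_key: Optional[str]
    openai_api_key: Optional[str]
    gemini_model: Optional[str]
    openai_model: Optional[str]
    port: int
    base_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; defaults here are assumed."""
    return Settings(
        google_api_key=env.get("GOOGLE_API_KEY"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        gemini_model=env.get("GEMINI_MODEL"),
        openai_model=env.get("OPENAI_MODEL"),
        port=int(env.get("PORT", "8000")),  # assumed default
        base_dir=env.get("BASE_DIR", "."),  # assumed default
    )


settings = load_settings({"GOOGLE_API_KEY": "example-key", "PORT": "9000"})
print(settings.port, settings.openai_api_key)
```

Passing the environment in as a mapping makes the loader easy to test without mutating `os.environ`.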
Contributing
To add a new provider: implement an adapter in voxarena/providers/ following the pattern in gemini.py / openai.py, wire it into voxarena/harness.py and voxarena/config.py, and open a PR.
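The actual adapter interface is defined by gemini.py and openai.py in the repository and is not reproduced on this page. Purely as an illustration of the "one adapter class per provider" idea, a hypothetical shape might look like the following; `ProviderAdapter`, `build_service`, `REGISTRY`, and `EchoAdapter` are invented names, not VoxArena APIs.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping provider names to adapter classes.
REGISTRY: dict = {}


class ProviderAdapter(ABC):
    """Hypothetical base class; the real pattern lives in voxarena/providers/."""

    name: str

    @abstractmethod
    def build_service(self, model: str) -> object:
        """Return whatever object the pipeline needs to talk to this provider."""


def register(adapter_cls):
    """Class decorator that makes an adapter discoverable by its name."""
    REGISTRY[adapter_cls.name] = adapter_cls
    return adapter_cls


@register
class EchoAdapter(ProviderAdapter):
    """Toy adapter used only to show the registration flow."""

    name = "echo"

    def build_service(self, model: str) -> object:
        return {"provider": self.name, "model": model}


service = REGISTRY["echo"]().build_service("echo-1")
print(service)
```

Check the existing provider modules before copying this shape, since the real wiring goes through voxarena/harness.py and voxarena/config.py as described above.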
For bugs and feature requests, please open an issue.
License
MIT.