Reproducible CLI benchmark for audio ML models.

audiobench

audiobench is a reproducible CLI benchmark for audio ML models. It emphasizes failure modes that clean-set scores hide, and records auditable run artifacts (run_hash, per-condition metrics, and suite-specific evidence such as hallucination findings).

Docs site: https://thenirock.github.io/audiobench/
Repository: https://github.com/THENIROCK/audiobench

Install

Requires Python 3.10+. A virtual environment keeps dependencies isolated; activate it in each new shell.

From PyPI

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install audiobench
audiobench --help

Optional extras:

pip install "audiobench[gui]"    # local Gradio UI (audiobench --gui)
pip install "audiobench[clap]"   # LAION-CLAP adapter for ab/sound-id
pip install "audiobench[qwen]"   # local Qwen2-Audio adapter

From source (development)

For hacking on the repo, clone and install in editable mode so changes under src/ apply immediately:

git clone https://github.com/THENIROCK/audiobench.git
cd audiobench
python -m venv .venv
source .venv/bin/activate
pip install -e .
audiobench --help

Use pip install -e ".[gui]" (and .[clap], .[qwen]) for the same extras from a checkout.

If you hit ModuleNotFoundError: No module named 'audiobench' on macOS + Python 3.13 after an editable install, see the workaround in docs/quickstart.md.

Local GUI (optional)

A small local Gradio app helps structure multi-suite testing sessions (matrix YAML) and browse run results by session:

pip install "audiobench[gui]"   # or: pip install -e ".[gui]" from a clone
audiobench --gui

It opens a local browser tab with a Test Builder (compose matrix.yaml) and a Results view (sessions = directories under results/ containing summary.json). The other CLI verbs (run, compare, inspect, gate, push) are available as simple forms, or you can stay in the terminal — the GUI is a thin wrapper, not a replacement. See docs/guides/gui.md for details.

Core CLI commands

# Discover suites and adapters
audiobench list
audiobench list-models
audiobench info ab/asr-hallucination

# Run a suite and write a JSON artifact
audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination.json

# Run several (suite, model) cells at once and aggregate results
audiobench run-matrix \
    --suite ab/sound-id --model heuristic-v0 --model heuristic-weak \
    --profile demo-fast --output-dir results/matrix

# Compare two run artifacts (same suite)
audiobench compare results/run-a.json results/run-b.json

# Inspect a single record (per-clip for ASR, per-mixture for sound-id)
audiobench inspect results/hallucination.json --clip 1
audiobench inspect results/sound-id.json --mixture 1

# Gate a run artifact against thresholds (exits non-zero on failure)
audiobench gate results/hallucination.json --max-hallucination-rate 0.1

# Publish a run artifact to the leaderboard dataset
hf auth login
audiobench push results/hallucination.json

Shipped suites

Task suites (model = transcriber / sound-event identifier):

Suite	Purpose	Headline metric
`ab/asr-robust`	Speech recognition under perturbations (`clean`, noise, bandlimit, reverb)	`weighted_mean_wer` (lower is better)
`ab/asr-hallucination`	Non-speech ASR hallucination stress test (`silence`, `music`, `noise`) with statistical findings	`weighted_hallucination_rate` (lower is better) + finding validation status
`ab/sound-id`	Sound-event identification on labeled mixtures	`weighted_recall` (higher is better) + `weighted_fpr` (lower is better)

Signal suites (model = AudioProcessor adapter that takes (audio, sr) and returns processed audio):

Suite	Purpose	Headline metric
`ab/fidelity-roundtrip`	Audio fidelity under round-trip (procedural sweep, noise, impulses, low-level + near-clip tones; identity / band-limit / +3 dB conditions)	`weighted_si_sdr_db` (higher is better), `max_true_peak_dbtp` (lower is better)
`ab/psychoacoustic-masking`	Whether the processor preserves audible tones and leaves masked tones masked (tone-in-noise fixtures)	`masking_respect_score` (higher is better)
`ab/phase-coherence`	Stereo polarity, inter-channel correlation, M/S round-trip, sub-sample delay preservation	`phase_coherence_score`, `mean_polarity_score` (higher is better for both)
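For readers unfamiliar with the metric behind weighted_si_sdr_db, here is a minimal sketch of the standard scale-invariant SDR computation. It assumes plain Python sequences of samples and does not reproduce audiobench's own implementation or its per-condition weighting, neither of which is shown in this README; the function name si_sdr_db is hypothetical.

```python
import math


def si_sdr_db(reference, estimate):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Hypothetical helper for illustration; not audiobench's implementation.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference is silent; SI-SDR is undefined")

    # Project the estimate onto the reference to find the best-fitting scale.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]

    # Whatever the scaled reference does not explain counts as distortion.
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)

    if noise_energy == 0.0:
        return math.inf   # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)
```

Because of the projection step, a pure gain change (such as a +3 dB condition) leaves the score unchanged, while added noise or quantization error lowers it.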

Temporal task suites (frame-level scoring with IoU-matched events or Hungarian-aligned speakers):

Suite	Purpose	Headline metric
`ab/sed-urban`	Sound event detection on labeled urban-noise soundscapes (sirens, dog barks, alarms, engines, glass)	`event_f1_iou50` (higher is better), `segment_f1_1s` (higher is better)
`ab/diarization-cw`	Speaker diarization on procedural conversations (DER decomposed into miss / FA / confusion with 0.25 s collar)	`der` (lower is better), `mean_speaker_count_error` (lower is better)
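To make the idea behind event_f1_iou50 concrete, the sketch below scores predicted events against reference events using temporal IoU with a 0.5 threshold. It is an illustration under assumptions: the greedy one-to-one matching, the treatment of two empty lists, and the function names are all hypothetical, and audiobench's actual matching procedure may differ. Events use the label / start_s / end_s keys from the SED adapter contract.

```python
def interval_iou(a, b):
    """Temporal intersection-over-union of two events with start_s / end_s."""
    inter = max(0.0, min(a["end_s"], b["end_s"]) - max(a["start_s"], b["start_s"]))
    union = (a["end_s"] - a["start_s"]) + (b["end_s"] - b["start_s"]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(reference, predicted, iou_threshold=0.5):
    """Event-level F1 with greedy one-to-one matching (an assumed strategy)."""
    unmatched = list(reference)
    true_positives = 0
    for pred in predicted:
        best, best_iou = None, iou_threshold
        for ref in unmatched:
            if ref["label"] != pred["label"]:
                continue
            iou = interval_iou(ref, pred)
            if iou >= best_iou:
                best, best_iou = ref, iou
        if best is not None:
            # Each reference event can be claimed by at most one prediction.
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(reference) - true_positives
    denom = 2 * true_positives + false_positives + false_negatives
    # Convention chosen here: nothing to find and nothing predicted scores 1.0.
    return 2 * true_positives / denom if denom else 1.0
```

A prediction with the right label but poorly aligned boundaries counts as both a false positive and a false negative, which is why boundary jitter shows up as an F1 regression.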

Run each suite quickly:

audiobench run ab/asr-robust --model whisper-tiny
audiobench run ab/asr-hallucination --model whisper-tiny
audiobench run ab/sound-id --model heuristic-v0

audiobench run ab/fidelity-roundtrip      --model passthrough
audiobench run ab/psychoacoustic-masking  --model passthrough
audiobench run ab/phase-coherence         --model passthrough

audiobench run ab/sed-urban       --model oracle-sed
audiobench run ab/diarization-cw  --model oracle-diarization

The signal suites ship three reference adapters out of the box:

passthrough — identity (the upper bound; should pass every check).
passthrough-quantize8 — 8-bit quantizer (visibly degrades fidelity, useful as a regression demo).
polarity-flip-right — flips the right channel polarity (fails phase coherence on purpose).

Implement your own by writing an adapter that fulfils AudioProcessor (one process(audio, sample_rate) -> (audio_out, sample_rate_out) method) and either registering it in models/signal_registry.py or publishing it as an audiobench.signal_models entry point.
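As a rough illustration of that contract, the sketch below implements an 8-bit quantizer similar in spirit to passthrough-quantize8. It assumes audio arrives as a sequence of floats in [-1.0, 1.0]; the container type audiobench actually passes is not stated here, and the class name is hypothetical. Registration is omitted because the registry API is not shown in this README.

```python
class Quantize8Processor:
    """Hypothetical AudioProcessor-style adapter: snaps samples to 8-bit levels."""

    LEVELS = 127  # symmetric signed 8-bit range, an illustrative choice

    def process(self, audio, sample_rate):
        out = []
        for sample in audio:
            clipped = max(-1.0, min(1.0, sample))  # guard against overs
            out.append(round(clipped * self.LEVELS) / self.LEVELS)
        # The sample rate is unchanged, so it is returned as-is.
        return out, sample_rate
```

Usage under the same assumptions: `Quantize8Processor().process([0.0, 0.5, 1.0], 16000)` returns the quantized samples together with 16000.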

The temporal suites use task-specific adapter contracts:

SED — implement SEDAdapter with detect(audio, sample_rate) -> list[{"label", "start_s", "end_s"}] and register via models/sed_registry.py or audiobench.sed_models entry points. Bundled adapters: oracle-sed (sanity upper bound), oracle-sed-jittered (boundary-jittered regression demo), null-sed (worst case).
Diarization — implement DiarizationAdapter with diarize(audio, sample_rate) -> list[{"speaker_id", "start_s", "end_s"}] and register via models/diarization_registry.py or audiobench.diarization_models entry points. Bundled adapters: oracle-diarization, merged-diarization (all speakers collapsed — confusion regression), single-speaker (one big turn — FA + confusion).

Both oracle baselines accept a set_oracle_hint(ground_truth) call from the suite runner so they can answer the procedural manifest perfectly; production adapters simply ignore that hint and rely on the audio.
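The SED contract above can be sketched with a toy energy-threshold detector. This is an illustration only: it assumes mono audio as a sequence of floats, emits a single fixed label, and uses hypothetical names and defaults (EnergyThresholdSED, frame_s, threshold). Only the detect(audio, sample_rate) signature and the label / start_s / end_s keys come from the contract described here.

```python
import math


class EnergyThresholdSED:
    """Hypothetical SEDAdapter-style detector based on per-frame RMS energy."""

    def __init__(self, label="alarm", frame_s=0.1, threshold=0.1):
        self.label = label
        self.frame_s = frame_s
        self.threshold = threshold

    def detect(self, audio, sample_rate):
        frame_len = max(1, int(round(self.frame_s * sample_rate)))
        n_frames = math.ceil(len(audio) / frame_len)
        events = []
        start = None  # start time of the event currently open, if any
        for i in range(n_frames):
            frame = audio[i * frame_len:(i + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / len(frame))
            t = i * frame_len / sample_rate
            if rms >= self.threshold and start is None:
                start = t
            elif rms < self.threshold and start is not None:
                events.append({"label": self.label, "start_s": start, "end_s": t})
                start = None
        if start is not None:
            # Close an event that is still active at the end of the clip.
            events.append({"label": self.label, "start_s": start,
                           "end_s": len(audio) / sample_rate})
        return events
```

A diarization adapter would follow the same shape, returning speaker_id instead of label from a diarize(audio, sample_rate) method.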

Findings and validation flow (`ab/asr-hallucination`)

Each run includes ranked findings with:

effect size (effect_size)
bootstrap confidence interval (ci_lower, ci_upper)
Benjamini-Hochberg corrected p-value (adjusted_p_value)
validation status (validated, candidate, rejected)
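To show what a Benjamini-Hochberg corrected p-value means in practice, here is a minimal sketch of the textbook step-up adjustment. How audiobench computes adjusted_p_value internally is not shown in this README, so treat the function below (a hypothetical name) as an explanation of the method rather than the package's code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A finding whose adjusted value stays below the chosen false-discovery rate survives the correction; how that maps onto validated, candidate, and rejected is defined by the suite, not by this sketch.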

A useful flow is to run the same suite on a baseline and on your own model, then compare the two artifacts:

audiobench run ab/asr-hallucination --model whisper-tiny --output results/hallucination-whisper.json
audiobench run ab/asr-hallucination --model my-asr-model --output results/hallucination-my-model.json
audiobench compare results/hallucination-whisper.json results/hallucination-my-model.json

For the publishable-claim policy and the reproducibility checklist, see docs/guides/repro-launch-flow.md.

Leaderboard publish flow

audiobench push uploads a run artifact to:

submissions/<suite-with-/-replaced-by-__>/<run_hash>.json
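The path template above can be read as the following small sketch. The function name and the example run hash are hypothetical; only the submissions/ prefix, the replacement of / with __, and the .json suffix come from the template.

```python
def submission_path(suite, run_hash):
    """Build the upload path for a run artifact from its suite id and run hash."""
    # Suite ids such as "ab/sound-id" contain a slash, which would otherwise
    # create an extra directory level, so it is replaced by a double underscore.
    return f"submissions/{suite.replace('/', '__')}/{run_hash}.json"
```

For example, `submission_path("ab/asr-hallucination", "abc123")` gives `submissions/ab__asr-hallucination/abc123.json`.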

If --repo is omitted, push auto-targets:

<your-username>/audiobench-leaderboard-submissions

Example:

hf auth login
audiobench push results/hallucination-my-model.json --pretty-json

Optional flags:

--repo my-org/audiobench-leaderboard-submissions
--space <org-or-user>/<space-name>
--notes "..." --tags "cpu,baseline"
--dry-run to validate payload without upload

The Space app scaffold for a hosted leaderboard is in spaces/leaderboard/.

Models and adapters

List adapters:

audiobench list-models
audiobench list-models --suite ab/asr-hallucination

Built-in ab/sound-id adapters include heuristic-v0, heuristic-weak, clap-base, and qwen2-audio-7b. Install optional extras when needed:

pip install "audiobench[clap]"
pip install "audiobench[qwen]"

Bring-your-own-model guide: docs/guides/bring-your-own-model.md.

Methodology and docs

Quickstart: docs/quickstart.md
Suites overview: docs/suites/index.md
ab/asr-hallucination reference: docs/suites/asr-hallucination.md
Reproducibility guarantees: docs/reference/reproducibility.md
Repro launch policy for findings: docs/guides/repro-launch-flow.md
Leaderboard integration guide: docs/guides/hf-leaderboard.md

Extras for `ab/sound-id`

# Pack discovery and availability
audiobench list-packs
audiobench info ab/sound-id --pack demo

# Prompt controls
audiobench prompts show
audiobench prompts export results/my_prompts.yaml

# Mixture authoring
audiobench run ab/sound-id --mix "siren+dog_bark" --model heuristic-v0
audiobench run ab/sound-id --recipes examples/scenarios/factory_floor.yaml --model heuristic-v0
audiobench mix preview --recipes examples/scenarios/factory_floor.yaml --name factory_alarm --output preview.wav

# Per-mixture debugging
audiobench inspect results/run.json --mixture 1

CI integration

audiobench gate evaluates a run JSON against thresholds and exits non-zero when any check fails, so it can fail PRs on regressions. Thresholds can come from CLI flags or a YAML/JSON file with suite-aware sections.

# Inline thresholds (sound-id headline metrics)
audiobench gate results/sound-id.json --min-recall 0.6 --max-fpr 0.1

# Inline thresholds (ASR robust, including per-condition caps)
audiobench gate results/asr-robust.json \
    --max-wer 30 \
    --max-wer-condition clean=10 \
    --max-wer-condition noise-cafe-10db=40

# File-based thresholds
audiobench gate results/run.json --thresholds gate.yaml --json

Example gate.yaml:

asr_robust:
  max_weighted_mean_wer: 30.0
  max_wer:
    clean: 10.0
    noise-cafe-10db: 40.0
asr_hallucination:
  max_weighted_hallucination_rate: 0.10
  max_non_speech_hallucination_rate: 0.15
sound_id:
  min_weighted_recall: 0.60
  max_weighted_fpr: 0.10
  min_components_understood: 20
fidelity_roundtrip:
  min_weighted_si_sdr_db: 80.0
  max_true_peak_dbtp: 6.0
  max_mean_loudness_delta_lu: 1.0
psychoacoustic_masking:
  min_masking_respect_score: 0.99
  max_inaudible_energy_delta_db: 3.0
phase_coherence:
  min_phase_coherence_score: 0.99
  min_mean_polarity_score: 0.99
sed_urban:
  min_event_f1: 0.6
  min_segment_f1: 0.6
  min_event_recall: 0.5
diarization_cw:
  max_der: 0.25
  max_speaker_count_error: 0.5
  max_miss_rate: 0.15
  max_false_alarm_rate: 0.15

Inline shortcuts cover the most common signal and task checks too: --min-si-sdr, --max-true-peak, --min-masking-respect, --min-phase-coherence, --min-polarity, --min-event-f1, --min-segment-f1, --max-der, --max-speaker-count-error.

gate also accepts --junit out.xml to emit a JUnit-style report (one testcase per check), which GitHub Actions, GitLab, Jenkins, and other CI systems can parse natively.
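The min_ / max_ naming used in gate.yaml suggests how a single suite section could be evaluated. The sketch below is an assumption-laden illustration: it takes a flat dictionary of headline metrics (the real run JSON layout is not documented in this README), handles only scalar thresholds (not nested per-condition maps such as max_wer), and uses a hypothetical function name.

```python
def check_thresholds(metrics, thresholds):
    """Return a list of (metric, value, bound, limit) tuples for failed checks."""
    failures = []
    for key, limit in thresholds.items():
        # "min_weighted_recall" -> bound "min", metric name "weighted_recall"
        bound, _, name = key.partition("_")
        value = metrics[name]
        if bound == "min" and value < limit:
            failures.append((name, value, bound, limit))
        elif bound == "max" and value > limit:
            failures.append((name, value, bound, limit))
    return failures
```

A CI wrapper built on this idea would exit with a non-zero status whenever the returned list is non-empty, mirroring the behaviour described for audiobench gate.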

The repo ships a reference workflow at .github/workflows/ci.yml that runs the test suite and then a smoke audiobench run-matrix with --gate and --junit against ab/sound-id on the bundled demo pack.

Orchestration: `run-matrix`

audiobench run-matrix runs many (suite, model) cells in one invocation, writes a per-cell run JSON, and emits an aggregated summary. The same matrix can be built and run from the Test Builder tab of audiobench --gui. With --gate each cell is also threshold-checked; with --junit the result becomes a single XML report consumable by CI dashboards. The command exits non-zero if any cell errors out or any gate check fails.

# Cartesian over repeated --suite / --model flags
audiobench run-matrix \
    --suite ab/sound-id \
    --model heuristic-v0 --model heuristic-weak \
    --profile demo-fast \
    --output-dir results/matrix \
    --gate gate.yaml \
    --junit results/matrix/junit.xml

# Or fully declarative via a matrix YAML
audiobench run-matrix --matrix matrix.yaml

Example matrix.yaml:

output_dir: results/matrix
seed: 1337
cells:
  - suite: ab/sound-id
    model: heuristic-v0
    profile: demo-fast
  - suite: ab/sound-id
    model: heuristic-weak
    profile: demo-fast
  - suite: ab/asr-robust
    model: tiny
    conditions: [clean]
    limit: 2
gate:
  sound_id:
    min_weighted_recall: 0.4
  asr_robust:
    max_weighted_mean_wer: 80.0
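The Cartesian expansion of repeated --suite and --model flags can be pictured with the sketch below, which produces cell dictionaries shaped like the entries under cells: in matrix.yaml. The function name is hypothetical, and the ordering of cells is an assumption; audiobench's own expansion logic is not shown here.

```python
from itertools import product


def expand_cells(suites, models, profile=None):
    """Pair every suite with every model, optionally attaching one profile."""
    cells = []
    for suite, model in product(suites, models):
        cell = {"suite": suite, "model": model}
        if profile is not None:
            cell["profile"] = profile
        cells.append(cell)
    return cells
```

With one suite and two models this yields two cells, matching the first two entries of the example matrix.yaml; the declarative file is the better fit when cells need different profiles, conditions, or limits.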

The aggregated summary lands at <output-dir>/summary.json with one entry per cell (run hash, headline metrics, gate result, duration) so downstream tooling can render scorecards without re-reading each run artifact.

Inspect: per-record forensics

audiobench inspect opens a single record from a run JSON. The flag depends on the suite:

# ab/asr-robust and ab/asr-hallucination: per clip
audiobench inspect results/asr-robust.json --clip 1
audiobench inspect results/asr-hallucination.json --clip 3

# ab/sound-id: per mixture
audiobench inspect results/sound-id.json --mixture 12

ASR clip view shows the reference, every condition's hypothesis, per-clip WER, latency, and flags (empty, hallucination, error). Sound-id mixture view shows ground-truth components, per-label yes/no probes, and (when ensembling) per-paraphrase breakdowns.
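For context on the per-clip WER shown in the ASR clip view, here is a minimal sketch of word error rate as a word-level edit distance. It assumes whitespace tokenization with no text normalization, and it reports a percentage because the gate thresholds in this README (for example 30.0) suggest that scale; both are inferences, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Since WER is undefined for an empty reference, non-speech clips need a different measure, which is consistent with ab/asr-hallucination reporting a hallucination rate instead.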

Release history

0.2.0 (this version): May 20, 2026
0.1.1: May 7, 2026
0.1.0: May 7, 2026

Download files

Source distribution: audiobench-0.2.0.tar.gz (1.7 MB), uploaded May 20, 2026 (Source)
Built distribution: audiobench-0.2.0-py3-none-any.whl (1.7 MB), uploaded May 20, 2026 (Python 3)

File details

Details for the file audiobench-0.2.0.tar.gz.

File metadata

Download URL: audiobench-0.2.0.tar.gz
Upload date: May 20, 2026
Size: 1.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for audiobench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0cc6c307b748cc8820d3b29c321a178d3c470ac6f27602a475786220bc7bd027`
MD5	`2d05fcc2b7f923025276a244c9f41afb`
BLAKE2b-256	`ae86e96376a0d07f5471d9722a6c212a354117b1b26d562c05ea8560368ec7f8`

File details

Details for the file audiobench-0.2.0-py3-none-any.whl.

File metadata

Download URL: audiobench-0.2.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 1.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for audiobench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7c9990f3042a33e22d9eeb4d6d2c6860af45a6d8029143bd1b4a4fbd11e25a69`
MD5	`7071520af34ddbb5a7793baeb0ed026c`
BLAKE2b-256	`fa0ff0fadb339f8ecc432648fc2d432da15a11d84c4597f240d16725168ae0a0`
