audiobench MVP CLI
audiobench is a reproducible CLI benchmark for audio ML models. It demonstrates the core idea from audiobench.dev: a single clean-set metric hides failure modes, so the benchmark reports performance across realistic perturbations and mixtures.
Suites in this MVP
- ab/asr-robust — speech recognition under noise, bandlimiting, and reverb. Default model: Whisper.
- ab/sound-id — sound-event identification on mixtures of labeled clips. Default model: a bundled heuristic baseline; real models via CLAP and Qwen2-Audio adapters.
Quickstart
python -m venv .venv
source .venv/bin/activate
pip install -e .
Troubleshooting: ModuleNotFoundError: No module named 'audiobench' on macOS
If audiobench --help raises ModuleNotFoundError: No module named 'audiobench' immediately after pip install -e ., this is a known macOS + pip + Python 3.13 site.py interaction (Python issue #127012 / pip issue #13153). Pip-installed files inherit a com.apple.provenance xattr that carries the UF_HIDDEN flag, and Python 3.13's site.py skips .pth files with that flag, so the editable-install pointer never lands on sys.path. Clear the flag on the venv's site-packages:
chflags -R nohidden .venv/lib/python3.13/site-packages
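To confirm the flag is (or is no longer) set, here is a small diagnostic sketch. The path assumes the Quickstart venv, and the st_flags / UF_HIDDEN check is macOS/BSD-only:

```python
# Diagnostic sketch: list site-packages entries whose BSD "hidden" flag is set.
# These are the files Python 3.13's site.py will skip.
import os
import stat

site = ".venv/lib/python3.13/site-packages"  # adjust to your venv
for name in sorted(os.listdir(site)):
    flags = os.stat(os.path.join(site, name)).st_flags
    if flags & stat.UF_HIDDEN:
        print("hidden:", name)
```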
ab/asr-robust (speech)
audiobench run ab/asr-robust --model whisper-tiny
Conditions: clean, noise-cafe-10db, noise-pink-5db, bandlimited-8k, reverb-medium. Reports per-condition WER and weighted mean.
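WER here is the standard word-level edit distance (substitutions + deletions + insertions, divided by reference length). A minimal reference implementation for orientation, not necessarily the exact scorer audiobench ships:

```python
# Word error rate via word-level Levenshtein distance.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("turn off the alarm", "turn of the alarm"))  # 0.25 (one substitution)
```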
ab/sound-id (sound events)
For each mixture of labeled clips, the model is asked once per candidate label using the bundled prompt set (canonical wording: "Do you hear a {label}?"). The exact wording, version, and any ensemble setting are pinned in the run hash — see the Prompt protocol section below.
The benchmark scores how many components of the mixture were correctly identified.
audiobench run ab/sound-id --model heuristic-v0
Conditions are mixture sizes:
- solo — N=1 (sanity check)
- pair — N=2
- triple — N=3
- quad — N=4
For every (pack, condition) row, the benchmark reports:
- recall — of the sounds actually in the mixture, what fraction did the model correctly say "yes" to? (1.0 = caught every component; lower = missed some.)
- precision — of the times the model said "yes", what fraction were actually present? (1.0 = no false alarms; lower = it claims to hear things that aren't there.)
- F1 — a single combined score that blends recall and precision; useful when you want one number.
- FPR (false-positive rate) — for sounds that are NOT in the mixture (distractors), how often does the model still say "yes"? (0.0 = never hallucinates; higher = it answers "yes" too eagerly.)
Headline number: components understood: X / Y — across every mixture, X is how many ground-truth components the model identified out of Y total. This is the number you'd quote in a tweet.
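For the curious, here is how those numbers fall out of the raw yes/no answers. A minimal sketch (names are illustrative, not audiobench internals):

```python
# Per-mixture metrics from yes/no probe answers.
# truth  = labels actually in the mixture
# yes    = labels the model answered "yes" to
# probed = every candidate label asked about (truth + distractors)
def mixture_metrics(truth: set, yes: set, probed: set) -> dict:
    tp = len(truth & yes)                  # components correctly identified
    distractors = probed - truth
    recall = tp / len(truth) if truth else 0.0
    precision = tp / len(yes) if yes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = len(yes & distractors) / len(distractors) if distractors else 0.0
    return {"recall": recall, "precision": precision, "F1": f1, "FPR": fpr}

# Matches the forensic-view example later in this README:
# recall 0.67, precision 0.67, FPR 0.5
print(mixture_metrics({"siren", "dog_bark", "glass_breaking"},
                      {"siren", "dog_bark", "chainsaw"},
                      {"siren", "dog_bark", "glass_breaking", "chainsaw", "car_horn"}))
```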
Demo: compare two models on ab/sound-id
audiobench run ab/sound-id --model heuristic-v0 --output results/sound-id-heuristic.json
audiobench run ab/sound-id --model heuristic-weak --output results/sound-id-weak.json
audiobench compare results/sound-id-heuristic.json results/sound-id-weak.json
compare dispatches on the suite id in each run JSON, so the same command works for ab/asr-robust (lower-WER-wins) and ab/sound-id (higher-recall-wins, lower-FPR-wins).
For a live presentation, use the demo-fast profile (~30 mixtures; it finishes in under 90 s on a laptop):
audiobench run ab/sound-id --profile demo-fast --model heuristic-v0 --output results/demo-heuristic.json
audiobench run ab/sound-id --profile demo-fast --model heuristic-weak --output results/demo-weak.json
audiobench compare results/demo-heuristic.json results/demo-weak.json
Packs
Each ab/sound-id run targets one or more packs. Each pack defines a label set and source dataset(s).
| Pack | Source | Labels | License |
|---|---|---|---|
| demo | Procedural (bundled, no download) | siren, alarm, dog_bark, engine, glass_breaking, baby_cry, coughing, water, vacuum, speech | bundled |
| core | FSD50K (single-positive PP filter) | ~80 high-confidence classes from the AudioSet ontology | CC-BY 4.0 / CC0 (user-supplied) |
| home | DESED synthetic subset | alarm_bell, cat, dishes, frying, blender, water, speech, vacuum, dog, electric_shaver | open (user-supplied) |
| cabin | FSD50K + UrbanSound8K | engine, traffic, baby_cry, music, speech, car_horn, siren, drilling | non-commercial research |
| security | UrbanSound8K | gun_shot, siren, car_horn, dog_bark, jackhammer | non-commercial research |
| health | ESC-50 medical subset | coughing, sneezing, breathing, snoring, crying_baby | non-clinical scope |
The demo pack runs with no downloads and powers the headline demo. Other packs require user-supplied data at ~/.cache/audiobench/sound_id/<source>/; see the Bringing your own data section below.
audiobench list-packs
audiobench info ab/sound-id
audiobench info ab/sound-id --pack home
How users create mixtures
Three layers, additive.
Default — canned, seeded mixture set per pack. Zero authoring:
audiobench run ab/sound-id --pack demo --model heuristic-v0
Inline --mix — one mixture per flag, +-separated labels:
audiobench run ab/sound-id --mix "siren+glass_breaking+baby_cry" --model heuristic-v0
audiobench run ab/sound-id --mix "engine+baby_cry" --mix "engine+baby_cry+music" --model heuristic-v0
Recipe file (YAML or JSON) — repeatable scenarios with per-source dB levels and optional pinned source files:
mixtures:
  - name: factory_alarm
    labels: [siren, glass_breaking]
    snr_db: 0
  - name: cabin_baby_over_engine
    label_levels:
      engine: 0
      baby_cry: -3
      vacuum: -6
audiobench run ab/sound-id --recipes scenarios/factory_floor.yaml --model heuristic-v0
When --mix or --recipes is used, results land under a custom condition. The run hash includes the canonicalized mixture spec so any custom run is bit-reproducible.
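The reproducibility trick is just canonical serialization before hashing. A minimal sketch of the idea (hypothetical helper, not the actual audiobench internals):

```python
import hashlib
import json

def mixture_spec_hash(spec: dict) -> str:
    # Sorted keys + fixed separators make the JSON byte-identical across runs.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order and incidental whitespace no longer matter:
a = mixture_spec_hash({"labels": ["siren", "glass_breaking"], "snr_db": 0})
b = mixture_spec_hash({"snr_db": 0, "labels": ["siren", "glass_breaking"]})
assert a == b
```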
Prompt protocol
Every ab/sound-id run uses a versioned prompt set that lives at src/audiobench/data/sound_id/prompts.yaml. The version, the parser version, the ensemble size, and a hash of the paraphrase list all feed into run_hash, so two runs with different prompt configurations cannot silently be confused.
Default behavior
By default the suite asks one prompt per probe — the canonical wording "Do you hear a {label}?". The run summary records prompt_version=yesno-v1, parser=v1, ensemble=off.
Inspect or export the bundled prompts
audiobench prompts show
audiobench prompts export results/my_prompts.yaml
prompts export writes a starter file you can edit; pass it back with --prompts.
Custom prompts (--prompts)
Edit the exported YAML, bump the version so old runs aren't confused with the new ones, then point the runner at the file:
version: my-clean-room-v1
parser_version: v1
paraphrases:
- "Do you hear a {label}?"
- "Listen carefully. Is a {label} present? Reply yes or no."
audiobench run ab/sound-id --prompts results/my_prompts.yaml --model heuristic-v0
Schema:
- version (required) — opaque label folded into run_hash. Any change in wording should bump it.
- parser_version (optional, default v1) — pinned to the yes/no parser in audiobench.probes. Leave it at v1 unless the parser also changes.
- paraphrases (required, ≥ 1) — every entry must contain the literal placeholder {label}.
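A quick pre-flight check of a custom prompts file against this schema (assumes PyYAML for parsing; this mirrors the rules above rather than calling audiobench's own loader):

```python
import yaml  # pip install pyyaml

with open("results/my_prompts.yaml") as f:
    prompts = yaml.safe_load(f)

assert prompts.get("version"), "version is required"
assert prompts.get("paraphrases"), "at least one paraphrase is required"
for p in prompts["paraphrases"]:
    assert "{label}" in p, f"missing literal {{label}} placeholder: {p!r}"
```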
Prompt ensembles (--prompt-ensemble N)
Reduce wording sensitivity by asking N paraphrases per probe and taking a majority vote. The vote is recorded along with each individual paraphrase answer in the run JSON.
audiobench run ab/sound-id --model qwen2-audio-7b --prompt-ensemble 5
N must be ≤ the number of paraphrases in the prompts file (5 in the bundled set). The first N paraphrases are used in order.
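The vote itself is a plain majority. A sketch, assuming ties (possible with an even N) count as "no":

```python
def ensemble_vote(answers: list) -> bool:
    # answers: one bool per paraphrase, True = "yes".
    yes = sum(answers)
    return yes > len(answers) - yes   # strict majority; a tie counts as "no"

print(ensemble_vote([True, True, False, False, True]))  # True (3 of 5)
```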
Comparison guard rails
audiobench compare refuses to compare two ab/sound-id runs whose prompt_version, parser_version, or prompt_ensemble disagree:
$ audiobench compare results/run-bundled.json results/run-ensemble.json
Invalid value: runs disagree on prompt_ensemble: A=None vs B=3.
Re-run with matching prompts, or pass --allow-mismatched-prompt.
Pass --allow-mismatched-prompt to override (the comparison header annotates the mismatch).
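Under the hood this amounts to a three-field equality check on run metadata. A sketch, with field names taken from the error message above:

```python
def check_prompt_match(run_a: dict, run_b: dict) -> None:
    for key in ("prompt_version", "parser_version", "prompt_ensemble"):
        a, b = run_a.get(key), run_b.get(key)
        if a != b:
            raise ValueError(f"runs disagree on {key}: A={a} vs B={b}")
```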
Mixture preview
Render a mixture WAV without running probes — useful for demo prep, debugging levels, and authoring recipes:
audiobench mix preview --labels siren,glass_breaking,baby_cry --output preview.wav
audiobench mix preview --recipes scenarios/factory_floor.yaml --name cabin_baby_over_engine --output cabin.wav
Per-mixture forensic view
audiobench inspect results/sound-id-heuristic.json --mixture 12
mixture 12 (pack=demo, condition=triple)
ground truth: siren, glass_breaking, dog_bark
source clips:
siren demo://siren@0
glass_breaking demo://glass_breaking@0
dog_bark demo://dog_bark@0
model: heuristic-v0
prompts: version=yesno-v1, parser=v1, ensemble=off (single prompt), source=bundled
yes responses:
siren ✓
dog_bark ✓
glass_breaking ✗ FALSE NEGATIVE
chainsaw ✗ FALSE POSITIVE (distractor)
car_horn ✗ (distractor, correct)
recall : 2/3 = 0.67
precision : 2/3 = 0.67
components understood: 2 of 3
When the run was made with --prompt-ensemble N, inspect also prints a per-paraphrase breakdown showing each rendered prompt and the model's individual yes/no for it.
Models
ab/sound-id ships four model adapters:
- heuristic-v0 (bundled, CPU) — the strong bundled baseline. See How the bundled heuristics work below.
- heuristic-weak (bundled, CPU) — a deliberately weaker variant of the same algorithm so audiobench compare has something to show out of the box. See How the bundled heuristics work below.
- clap-base — LAION-CLAP zero-shot. Requires pip install laion-clap (lazy import). First run downloads weights.
- qwen2-audio-7b — Qwen2-Audio-Instruct via HuggingFace transformers. Requires a GPU (~16 GB VRAM) locally, or set AUDIOBENCH_QWEN_ENDPOINT=https://... to point at a remote inference endpoint. See docs/models/qwen2-audio.md for the endpoint contract, a deployable Modal recipe, a free Google Colab + Cloudflared alternative, and Apple Silicon notes.
Add your own model adapter in src/audiobench/models/ and register it in src/audiobench/models/registry.py.
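A hypothetical skeleton of what an adapter needs to provide (every name below is illustrative; the real base class and registration API are defined in those two files):

```python
# src/audiobench/models/my_model.py  (illustrative; not the actual API)
import numpy as np

class MyAdapter:
    """One yes/no answer per probe for ab/sound-id."""
    name = "my-model-v0"

    def answer(self, audio: np.ndarray, sample_rate: int, prompt: str) -> bool:
        # prompt is a rendered paraphrase, e.g. "Do you hear a siren?"
        raise NotImplementedError
```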
How the bundled heuristics work
The bundled heuristics aren't ML models — they're a deterministic spectral matcher. They exist so audiobench runs end-to-end on a fresh laptop with no GPU, no weight downloads, and no network. The algorithm:
- Fingerprint the input audio. Pad to the next power of two, take an FFT, and bin power into 24 log-spaced frequency bands from 50 Hz to 7.5 kHz. Apply log1p to compress dynamic range and L2-normalize. The result is a 24-D unit vector that captures the audio's spectral shape.
- Pre-compute one fingerprint per known label by running the same recipe on the canonical procedural clip for each of the demo pack's 10 labels (siren, engine, dog_bark, …). These reference fingerprints are built once at import time.
- Score the probe. For a question "Do you hear a {label}?", compute the cosine similarity between the input fingerprint and the reference for {label} (target_score), and the mean cosine similarity to every other known label (baseline). The decision metric is the discriminative margin margin = target_score − baseline. Using a margin (rather than the raw similarity) keeps false positives down: in a quad mixture every reference still has decent absolute similarity, but only the components that are actually present beat the rest by a clear margin.
- Threshold the margin. Answer "yes" if margin >= margin_threshold, else "no".
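A compressed NumPy sketch of the fingerprint and margin steps, using the parameters from the description above (the sample rate and band-edge handling are assumptions; this is not the exact bundled implementation):

```python
import numpy as np

def fingerprint(audio: np.ndarray, sr: int = 16000, n_bands: int = 24) -> np.ndarray:
    n = 1 << max(len(audio) - 1, 1).bit_length()       # pad to next power of two
    power = np.abs(np.fft.rfft(audio, n=n)) ** 2       # FFT power spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    edges = np.geomspace(50.0, 7500.0, n_bands + 1)    # 24 log-spaced bands
    bands = [power[(freqs >= lo) & (freqs < hi)].sum()
             for lo, hi in zip(edges[:-1], edges[1:])]
    v = np.log1p(np.array(bands))                      # compress dynamic range
    return v / (np.linalg.norm(v) or 1.0)              # 24-D unit vector

def margin(probe_fp: np.ndarray, refs: dict, label: str) -> float:
    target = float(probe_fp @ refs[label])             # cosine sim of unit vectors
    baseline = np.mean([probe_fp @ v for k, v in refs.items() if k != label])
    return target - float(baseline)                    # discriminative margin
```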
The two adapters differ only in two parameters:
| | margin_threshold | noise_amplitude |
|---|---|---|
| heuristic-v0 | 0.20 | 0.0 |
| heuristic-weak | 0.30 | 0.10 |

- A higher margin_threshold makes the model more conservative: it answers "yes" only when the target label's match clearly stands out from every other reference. heuristic-weak's 0.30 is well above the typical margin for a true positive in a quad mixture, so it misses many components there — that's the recall hit you'll see in compare.
- A non-zero noise_amplitude adds a small per-decision jitter (±0.10 here, deterministically derived from a SHA-1 of the label, margin, and audio length, so runs are still reproducible). This lets heuristic-weak flip near-threshold decisions, simulating a noisy classifier without breaking the run hash.
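One plausible shape for that jitter derivation (the text above pins only the SHA-1 inputs; the byte-to-amplitude mapping here is an assumption):

```python
import hashlib

def jitter(label: str, margin: float, n_samples: int, amplitude: float) -> float:
    # Hash the pinned inputs, map the first 8 bytes to [0, 1), scale to +/-amplitude.
    key = f"{label}|{margin:.6f}|{n_samples}".encode()
    u = int.from_bytes(hashlib.sha1(key).digest()[:8], "big") / 2**64
    return (2.0 * u - 1.0) * amplitude

# Same inputs always give the same jitter, so runs stay reproducible.
assert jitter("siren", 0.25, 48000, 0.10) == jitter("siren", 0.25, 48000, 0.10)
```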
Because both heuristics are pure functions of the audio fingerprint and the prompt, they're CPU-only, deterministic, and finish a full --profile demo-fast run in well under a second. They're not meant to be competitive with CLAP or Qwen2-Audio — they're meant to make the harness honestly demonstrable on its own.
Bringing your own data
The demo pack runs out of the box. The other packs reference real datasets that you supply:
~/.cache/audiobench/sound_id/
fsd50k/
FSD50K.dev_audio/...
urbansound8k/
audio/fold1/...
desed/
synthetic21_train/soundscapes/...
esc50/
audio/...
audiobench info ab/sound-id --pack <name> prints the expected layout. If files are missing, the suite skips that pack with a clear message rather than failing.
Other CLI commands
# Pre-download the Whisper checkpoint so the next `run` doesn't pay for it.
audiobench warmup --model whisper-tiny
# List every suite this build knows about (stable + in-design).
audiobench list
# Print metadata for a suite: clip count, conditions, expected layout.
audiobench info ab/asr-robust
# Run only two ASR conditions and print the run JSON instead of a table.
audiobench run ab/asr-robust --model whisper-tiny --conditions clean,bandlimited-8k --pretty-json
# Print the bundled prompts.yaml (version, parser_version, paraphrase list).
audiobench prompts show
# Copy the bundled prompts.yaml to a path you can edit, then pass it via `--prompts`.
audiobench prompts export results/my_prompts.yaml
# Compare two ab/sound-id runs even if their prompt_version / ensemble settings differ.
# By default `compare` refuses mismatched prompts to keep numbers honest.
audiobench compare results/a.json results/b.json --allow-mismatched-prompt
# Local-only "push" stub: prints a signed payload (suite, revision, run_hash,
# payload_sha256). No network traffic in MVP mode.
audiobench push results/sound-id-heuristic.json --pretty-json
Reproducibility guarantees
- Manifest, mixture, and probe seeds are fixed.
- The mixer is deterministic (peak-normalize, RMS-match, sum).
- The prompt set is versioned and pinned in run_hash (prompt_version, parser_version, prompt_ensemble, plus a SHA-256 over the canonicalized paraphrase list).
- Every run writes a JSON artifact with:
  - suite, revision, model, seed, config
  - per-clip / per-mixture hypotheses (with per-paraphrase answers when ensembling)
  - per-condition metrics and weighted mean
  - run_hash (SHA-256 over canonicalized run payload, including mixture spec and prompt config)
- audiobench push is suite-agnostic and works for both suites; it only reads suite, revision, and run_hash.
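To eyeball the pinned fields of an artifact (field names from the list above; the actual JSON may nest prompt config differently):

```python
import json

with open("results/sound-id-heuristic.json") as f:
    run = json.load(f)

for key in ("suite", "revision", "model", "seed", "run_hash"):
    print(f"{key:10} {run.get(key)}")
```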
Extend this MVP
- Add a model adapter in src/audiobench/models/ and register it in models/registry.py.
- Add a perturbation in src/audiobench/perturbations.py for ab/asr-robust.
- Add a pack in src/audiobench/data/sound_id/packs/ for ab/sound-id.
- Author a custom prompt set: audiobench prompts export my_prompts.yaml, edit, then run with --prompts my_prompts.yaml.
Scope limits
- ab/sound-id ships with a procedural demo pack so it runs end-to-end without network. Real-data packs (core, home, cabin, security, health) require user-supplied datasets.
- CLAP and Qwen2-Audio adapters lazy-import heavy deps and are documented but not bundled.
- push is local-only and does not send network traffic.
- CPU-first workflow; the bundled heuristic and CLAP models are the CPU-friendly paths.
- English-only data in ab/asr-robust.
Building and deploying the docs
The full docs live at docs/ and render as a Material-for-MkDocs site. The [docs] extra pulls in mkdocs, mkdocs-material, and pymdown-extensions.
Run locally with hot reload:
pip install -e ".[docs]"
mkdocs serve
# open http://127.0.0.1:8000
mkdocs serve watches docs/ and mkdocs.yml; saving any file rebuilds and refreshes the browser tab.
Build a static site (output in site/, gitignored):
mkdocs build --strict
--strict turns warnings into errors. Useful in CI to catch broken internal links before they ship.
Deploy to GitHub Pages:
mkdocs gh-deploy --force
This builds the site and pushes it to a gh-pages branch. Then in the GitHub repo settings, under Pages → Source, point at the gh-pages branch. The site lives at https://<user>.github.io/audiobench/ (configured in mkdocs.yml as site_url). For a CI-driven deploy, the official GitHub Action does the same thing on every push to main.