PitchBench -- Python Package
Benchmark suite for evaluating pitch and acoustic perception in audio language models (ALMs). Probes pitch identification, temporal localisation, chord recognition, melodic contour, robustness to audio effects, and more — reporting per-format accuracy (MIDI, SPN, doremi, Hz) to expose where verbal decoding fails.
Setup
1. Install FluidSynth
| Platform | Command |
|---|---|
| Linux / WSL | sudo apt install fluidsynth |
| macOS | brew install fluid-synth |
| Windows | choco install fluidsynth or download from fluidsynth.org |
A GM soundfont is also required. Place any .sf2 or .sf3 in data/soundfonts/ — no further configuration needed.
Linux / WSL:
sudo apt install fluid-soundfont-gm
mkdir -p data/soundfonts
cp /usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/
macOS / cross-platform:
curl -L "http://ftp.debian.org/debian/pool/main/f/fluid-soundfont/fluid-soundfont-gm_3.1-5.3_all.deb" \
-o /tmp/fluid.deb
(cd /tmp && ar x fluid.deb && tar xf data.tar.* ./usr/share/sounds/sf2/FluidR3_GM.sf2) # subshell keeps you in the project directory
mkdir -p data/soundfonts && cp /tmp/usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/
SF2 resolution order: PITCHBENCH_SF2 env var → data/soundfonts/ → /usr/share/sounds/sf2/FluidR3_GM.sf2.
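The resolution order above can be sketched in a few lines of Python. This is an illustration only, not PitchBench's actual code: `resolve_soundfont` is a hypothetical helper, and both the choice of the first file in sorted order and the decision to trust the override without checking that it exists are assumptions.

```python
import os
from pathlib import Path

# System-wide fallback path named in the resolution order above.
SYSTEM_SF2 = Path("/usr/share/sounds/sf2/FluidR3_GM.sf2")


def resolve_soundfont(root=".", env=None):
    """Return the first soundfont found, or None (hypothetical helper)."""
    env = os.environ if env is None else env

    # 1. Explicit override via the PITCHBENCH_SF2 environment variable.
    override = env.get("PITCHBENCH_SF2")
    if override:
        return Path(override)

    # 2. Any .sf2 or .sf3 file in data/soundfonts/ under the project root.
    #    Picking the first name in sorted order is an assumption.
    sf_dir = Path(root) / "data" / "soundfonts"
    if sf_dir.is_dir():
        candidates = sorted(
            p for p in sf_dir.iterdir() if p.suffix.lower() in {".sf2", ".sf3"}
        )
        if candidates:
            return candidates[0]

    # 3. System-wide default installed by fluid-soundfont-gm.
    if SYSTEM_SF2.is_file():
        return SYSTEM_SF2
    return None
```

Calling `resolve_soundfont()` from the project root returns a `Path`, or `None` when nothing is installed.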
2. Install PitchBench
python3 -m venv .venv
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows
pip install pitchbench
For development tools:
uv sync --extra dev # installs pytest + pytest-mock
Requires Python ≥ 3.12.
3. Configure a model backend
OpenRouter — add to .env:
OPENROUTER_KEY=sk-or-...
Use --model openrouter/<provider>/<model>. "Latest" alias slugs require a ~ prefix, e.g. openrouter/~google/gemini-flash-latest.
DashScope — add to .env:
DASHSCOPE_API_KEY=sk-...
Use --model dashscope/<model>.
Local server (support is included in the base pip install pitchbench dependencies, nothing extra to install):
--model http://localhost:8001 --name my-local-model
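The three `--model` forms above differ only in their prefix, so telling them apart is straightforward. The sketch below shows one way to do it; `ModelSpec` and `parse_model_spec` are hypothetical names for illustration, not part of the PitchBench API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    backend: str  # "openrouter", "dashscope", or "local"
    target: str   # model slug, or base URL for a local server


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a --model value into backend and target (hypothetical helper)."""
    # A URL means a local server.
    if spec.startswith(("http://", "https://")):
        return ModelSpec("local", spec.rstrip("/"))

    # Otherwise the first path segment names the hosted backend; the rest
    # (including any "~" alias prefix) is passed through untouched.
    backend, sep, rest = spec.partition("/")
    if sep and rest and backend in {"openrouter", "dashscope"}:
        return ModelSpec(backend, rest)

    raise ValueError(f"unrecognised model spec: {spec!r}")
```

For example, `parse_model_spec("openrouter/~google/gemini-flash-latest")` yields the backend `openrouter` with the target `~google/gemini-flash-latest`.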
Usage
PitchBench separates stimulus generation from model evaluation. Generate once, evaluate any number of models.
Generate
pitchbench generate all
pitchbench generate a # single category
pitchbench generate a1 # single experiment
Writes WAV files and a Parquet dataset to data/generated/<exp>/. Re-running is safe: existing rows are skipped.
data/generated/pitchbench_a1_single_pitch_id/
*.wav # one file per condition
_questions.parquet # HuggingFace-compatible schema
_questions.csv # human-readable companion
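To inspect a generated dataset without extra dependencies, the CSV companion can be read with the standard library. The sketch below assumes only the `_questions.csv` filename shown above and UTF-8 encoding; it makes no assumption about column names, and `load_questions` is a hypothetical helper.

```python
import csv
from pathlib import Path


def load_questions(exp_dir):
    """Read _questions.csv from one generated experiment directory.

    Returns a list of dicts keyed by whatever header the file carries.
    """
    csv_path = Path(exp_dir) / "_questions.csv"
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

After `pitchbench generate a1`, calling `load_questions("data/generated/pitchbench_a1_single_pitch_id")` and printing `list(rows[0])` shows the real column names.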
Evaluate
pitchbench --list
pitchbench evaluate all --model openrouter/<provider>/<model>
pitchbench evaluate a1 --model openrouter/<provider>/<model>
# Quick test (20 stimuli, stratified)
pitchbench evaluate a1 --model openrouter/<provider>/<model> --sample-n 20 --sample-seed 0
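For intuition about what a stratified, seeded sample does, here is a minimal sketch. It is not the package's `sampling.py`; the round-robin strategy, the `stratified_sample` name, and the idea of stratifying by a single key are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Pick up to n items, spreading picks evenly across strata."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same sample

    # Group items by stratum, then shuffle within each stratum.
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    for name in sorted(groups):  # sorted so the shuffle order is deterministic
        rng.shuffle(groups[name])

    # Round-robin: take one item per stratum per pass until n is reached.
    picked = []
    depth = 0
    while len(picked) < n:
        took_any = False
        for name in sorted(groups):
            group = groups[name]
            if depth < len(group) and len(picked) < n:
                picked.append(group[depth])
                took_any = True
        if not took_any:  # every stratum is exhausted
            break
        depth += 1
    return picked
```

With four equally sized strata and `n=20`, each stratum contributes five items, and repeating the call with the same seed returns the same selection.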
Results land in results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/.
In tmux (recommended for long runs):
tmux new-session -d -s mymodel "cd /path/to/PitchBench && source .venv/bin/activate && \
pitchbench evaluate all --model openrouter/<provider>/<model> 2>&1 | tee logs/eval_mymodel_\$(date +%Y%m%d_%H%M%S).log"
Analyze
pitchbench analyze --preset q1 --model openrouter/<provider>/<model>
Writes to results/analysis/<model_slug>/<run>/ and generates a summary of the ablations.
Batch (multiple models):
python -m pitchbench.analysis.run_analysis q1 \
--models openrouter/<provider>/<model1> \
openrouter/<provider>/<model2> \
dashscope/<model3>
Output
results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/
pitchbench_<exp>/
results_<model>.json # metadata, git commit, per-item responses
results_<model>.csv # one row per stimulus
accuracies_<model>.csv # aggregate metrics
plots/ # per-format CI plots
results/evaluation/summary/ is populated automatically after each run (aggregated_accuracies.csv, accuracy plots by instrument / notation / note).
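Because run directories are named `<YYYYMMDD_HHMMSS>`, the most recent run for a model can be found by parsing directory names. The sketch below relies only on that naming scheme; `latest_run` is a hypothetical helper, not a PitchBench function.

```python
from datetime import datetime
from pathlib import Path


def latest_run(model_dir):
    """Return the newest timestamped run directory, or None if there is none."""
    runs = []
    for child in Path(model_dir).iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip anything that is not a timestamped run
        runs.append((stamp, child))
    if not runs:
        return None
    return max(runs, key=lambda pair: pair[0])[1]
```

Pass it a path of the form `results/evaluation/<model_slug>`; the per-experiment `accuracies_<model>.csv` files then sit under the returned directory.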
Experiment categories
32 experiment IDs across 7 categories, as listed in the tables below.
Category A — Single-pitch identification
| ID | Script | What it tests |
|---|---|---|
| a1 | single_pitch_id | Baseline: identify one sustained note across all sources and formats |
| a2 | single_pitch_by_loudness | Pitch accuracy under different loudness levels |
| a3 | single_pitch_by_duration | Pitch accuracy under different durations |
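The answer formats named in the introduction (MIDI, SPN, Hz) are different spellings of the same pitch. The sketch below shows the standard conversions, assuming 12-tone equal temperament, A4 = 440 Hz, and sharps-only note names; the spelling PitchBench accepts may differ, and the doremi format is not covered here. Function names are illustrative, not the package's `music.py` API.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(midi, a4=440.0):
    """Equal-tempered frequency of a MIDI note number (A4 = MIDI 69)."""
    return a4 * 2 ** ((midi - 69) / 12)


def midi_to_spn(midi):
    """Scientific pitch notation, where MIDI 60 is C4 (sharps only)."""
    octave = midi // 12 - 1
    return f"{NOTE_NAMES[midi % 12]}{octave}"
```

For instance, MIDI 69 maps to `A4` at 440 Hz, and MIDI 60 maps to `C4` at roughly 261.63 Hz.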
Category B — Time-localised pitch
| ID | Script | What it tests |
|---|---|---|
| b1 | single_pitch_within_silence | Hidden note in a silent clip |
| b2 | pitch_at_timestamp | Which pitch is playing at a queried timestamp in a multi-note sequence |
| b3 | timestamp_single_pitch | Detect when a single note starts/ends |
| b4 | timestamp_specific_pitch | Onset/offset of a named target pitch among distractors |
| b5 | timestamp_multiple_pitches | Full timing transcription of all notes in a sequence |
Category C — Chords and simultaneous pitches
| ID | Script | What it tests |
|---|---|---|
| c1 | chord_count_pitches | Count distinct pitches in a chord (dyads, triads, 7ths) |
| c2 | chord_dyad_interval | Name the interval (semitone count) between two simultaneous tones |
| c3 | chord_quality | Name chord quality and/or root+quality from a sounding chord |
| c4 | chord_pitches | List every note in a chord |
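For the interval task (c2), the semitone count is the absolute difference between two MIDI note numbers. The sketch below adds conventional interval names for reference; the name table and both function names are illustrative assumptions, and the benchmark's expected answer strings may differ.

```python
INTERVAL_NAMES = {
    0: "unison", 1: "minor second", 2: "major second", 3: "minor third",
    4: "major third", 5: "perfect fourth", 6: "tritone", 7: "perfect fifth",
    8: "minor sixth", 9: "major sixth", 10: "minor seventh",
    11: "major seventh", 12: "octave",
}


def interval_semitones(midi_a, midi_b):
    """Unsigned distance in semitones between two MIDI note numbers."""
    return abs(midi_a - midi_b)


def interval_name(midi_a, midi_b):
    """Conventional name for intervals up to an octave, else a plain count."""
    n = interval_semitones(midi_a, midi_b)
    return INTERVAL_NAMES.get(n, f"{n} semitones")
```

C4 and G4 (MIDI 60 and 67) are 7 semitones apart, a perfect fifth.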
Category D — Sequences, contour, intervals
| ID | Script | What it tests |
|---|---|---|
| d1 | sequence_count_pitches | Count distinct pitches in a sequential passage |
| d2 | dyad_lower_higher_difference | Binary higher/lower judgment |
| d3 | contour_discrete | Output up/down tokens for each step-wise transition |
| d4 | contour_continuous | Output up/down tokens for each monotonic movement |
| d5 | sequence_ranking_by_pitch | Rank sequential tones from lowest to highest |
| d6 | sequence_dyad_interval | Name the melodic interval between two sequential notes |
| d7a | pitch_with_reference | Pitch identification given a labelled reference tone (variant used in the evaluation) |
| d7b | pitch_with_reference_split | Same, but reference and target in separate audio clips |
| d7c | pitch_with_reference | Variant with different set of pitches |
| d7d | pitch_with_reference_split | Split-clip variant with a different set of pitches |
| d8 | sequence_pitches | Transcribe all pitches in a note sequence in order |
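The two contour tasks (d3, d4) can be illustrated on a list of MIDI note numbers. This sketch rests on two assumptions that the table does not confirm: that "monotonic movement" means a run of consecutive steps in the same direction, and that repeated notes get a `same` token in the discrete case. The function names are hypothetical.

```python
def discrete_contour(pitches):
    """One token per transition between consecutive notes."""
    tokens = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            tokens.append("up")
        elif cur < prev:
            tokens.append("down")
        else:
            tokens.append("same")  # assumption: how repeats are labelled
    return tokens


def continuous_contour(pitches):
    """One token per monotonic run: consecutive same-direction steps merge."""
    merged = []
    for token in discrete_contour(pitches):
        if token == "same":
            continue  # assumption: repeats do not break or start a run
        if not merged or merged[-1] != token:
            merged.append(token)
    return merged
```

For the sequence 60, 62, 64, 62, 60, 65 the discrete contour is up, up, down, down, up, and the merged contour is up, down, up.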
Category E — Robustness
| ID | Script | What it tests |
|---|---|---|
| e1 | audio_effects | Pitch under effects like high-pass/low-pass filtering, distortion, reverb, chorus |
| e2 | background | Pitch over real-world backgrounds (crowd, rain, bells, street) at varying loudness levels |
| e3 | harmonic_saturation | Pitch with harmonic saturation / overdrive |
| e4 | time_stretching | Pitch with time-stretching (speed change without pitch shift) |
| e5 | vibrato | Pitch with vibrato at varying rates and depths |
| e6 | slightly_off | Detuned tones (up to 45 % of a semitone) |
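A detuning of 45 % of a semitone is 45 cents, which keeps the tone closer to its nominal pitch than to either neighbour. The arithmetic can be checked with the sketch below, which assumes equal temperament and A4 = 440 Hz; the function names are illustrative and not taken from the package.

```python
import math


def detuned_hz(midi, cents, a4=440.0):
    """Frequency of a MIDI note shifted by some cents (100 cents = 1 semitone)."""
    return a4 * 2 ** ((midi - 69) / 12 + cents / 1200)


def nearest_midi(hz, a4=440.0):
    """Closest equal-tempered MIDI note number for a frequency."""
    return round(69 + 12 * math.log2(hz / a4))
```

C4 raised by 45 cents sits at about 268.5 Hz and still rounds to MIDI 60. Only past 50 cents would the nearest note flip to C#4.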
Category F — Polyphony
| ID | Script | What it tests |
|---|---|---|
| f1 | melodic_line_atonal | Transcribe one designated voice from 2–3 simultaneous synthetic voices (atonal) |
| f2 | melodic_line_tonal | Same task over tonal melodic material from Bach chorales |
Category Y — Format variants
| ID | Script | What it tests |
|---|---|---|
| y1 | single_pitch_id_mcq | MCQ variant of a1: pick the correct pitch from labelled options |
Stimulus engine
- Waveforms (always available): sine, sawtooth, square, triangle
- GM instruments (FluidSynth): piano, electric_keyboard, guitar, flute, trumpet, trombone, clarinet, oboe, violin, cello, organ, bass, synth_lead, synth_pad, voice
- Backgrounds (e2): white_noise synthesised; real recordings in data/preloaded/background/<name>.mp3
Repository layout
src/pitchbench/
  config.py               # runtime constants and env wiring
  configs/
    benchmark_config.py   # paper-locked data-gen parameters
    analysis_config.py    # analysis-mode presets/overrides
    plot_config.py        # labels/colors for figures
    user_config.py        # user-overridable settings
  sound/
    engine.py             # central audio synthesis engine
  model/
    query.py              # ALM query facade (local/OpenRouter/DashScope)
    dispatcher.py         # provider-aware request dispatch
    cost.py               # API usage/cost tracking
  experiments/
    run.py                # pitchbench CLI
    scripts/              # one .py per experiment (32 total)
    helpers/
      cat_a.py ... cat_f.py # per-category generate + evaluate helpers
      audit.py            # condition/result audit utilities
      data.py             # Parquet I/O
      music.py            # pitch/notation conversion helpers
      plots.py            # experiment plotting helpers
      results.py          # result writers, aggregators, summary CSVs
      sampling.py         # stratified sampling
      setup.py            # experiment setup helpers
      timing_layout.py    # timing-grid utilities
  analysis/
    analyze_a1.py         # a1 line plots + heatmaps
    analyze.py            # core analysis pipeline
    a1.py                 # alternate A1 analysis entrypoint
    ablation.py           # ablation summaries
    combine.py            # combine multi-run CSV outputs
    overview.py           # overview plots/tables
    run_analysis.py       # batch analysis CLI
data/
  preloaded/              # background recordings (gitignored)
  generated/              # created by `pitchbench generate`
results/                  # created by evaluate/analyze runs
Set PITCHBENCH_ROOT to override the project root for data/ and results/.
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| PITCHBENCH_ROOT | . | Project root (data/, results/) |
| PITCHBENCH_SF2 | auto-discovered | GM soundfont path override |
| PITCHBENCH_LOCAL_URL | http://localhost:8001 | Default local model server |
| OPENROUTER_KEY | — | OpenRouter API key |
| PITCHBENCH_CONCURRENCY_OPENROUTER | 20 | Max parallel OpenRouter calls |
| PITCHBENCH_CONCURRENCY_DASHSCOPE | 10 | Max parallel DashScope calls |
| LOCAL_CONCURRENCY | 2 | Max parallel local server calls |
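The defaults in the table can be mirrored in a script that needs the same settings. The sketch below reads the variables with the documented fallbacks; `Settings` and `load_settings` are hypothetical names, and the package's own `config.py` may be organised differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    root: Path         # project root holding data/ and results/
    local_url: str     # default local model server
    concurrency: dict  # max parallel calls per backend


def load_settings(env=None):
    """Read PitchBench-related variables, using the defaults from the table."""
    env = os.environ if env is None else env
    return Settings(
        root=Path(env.get("PITCHBENCH_ROOT", ".")),
        local_url=env.get("PITCHBENCH_LOCAL_URL", "http://localhost:8001"),
        concurrency={
            "openrouter": int(env.get("PITCHBENCH_CONCURRENCY_OPENROUTER", "20")),
            "dashscope": int(env.get("PITCHBENCH_CONCURRENCY_DASHSCOPE", "10")),
            "local": int(env.get("LOCAL_CONCURRENCY", "2")),
        },
    )
```

Passing an explicit mapping, for example `load_settings({"LOCAL_CONCURRENCY": "4"})`, makes the override behaviour easy to check without touching the real environment.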
Tests
uv run pytest
77 tests covering CLI resolution, Parquet I/O, result aggregation, sampling, and API routing. No network calls or audio generation required.
License
MIT — see LICENSE.