PitchBench -- Python Package

Benchmark suite for evaluating pitch and acoustic perception in audio language models (ALMs). Probes pitch identification, temporal localisation, chord recognition, melodic contour, robustness to audio effects, and more — reporting per-format accuracy (MIDI, SPN, doremi, Hz) to expose where verbal decoding fails.
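The answer formats named above are different spellings of the same pitch. As a rough illustration (not PitchBench's own conversion code), the sketch below maps between MIDI numbers, SPN, and Hz. It assumes 12-tone equal temperament, A4 = 440 Hz, and the C4 = MIDI 60 octave convention. The function names are hypothetical, and doremi is left out because the source does not specify its syllable mapping.

```python
import math

# Sharp-only spelling of the 12 pitch classes, starting at C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(midi: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency of a MIDI note (A4 is MIDI 69)."""
    return a4_hz * 2.0 ** ((midi - 69) / 12.0)


def hz_to_midi(hz: float, a4_hz: float = 440.0) -> int:
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(hz / a4_hz))


def midi_to_spn(midi: int) -> str:
    """Scientific pitch notation, assuming C4 = MIDI 60."""
    octave = midi // 12 - 1
    return f"{NOTE_NAMES[midi % 12]}{octave}"


if __name__ == "__main__":
    for m in (60, 69, 72):
        print(m, midi_to_spn(m), round(midi_to_hz(m), 2))
```

A model that hears the right pitch but answers in the wrong octave convention would score differently per format, which is the kind of gap per-format reporting is meant to surface.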


Setup

1. Install FluidSynth

Platform     Command
Linux / WSL  sudo apt install fluidsynth
macOS        brew install fluid-synth
Windows      choco install fluidsynth (or download from fluidsynth.org)

A GM soundfont is also required. Place any .sf2 or .sf3 in data/soundfonts/ — no further configuration needed.

Linux / WSL:

sudo apt install fluid-soundfont-gm
mkdir -p data/soundfonts
cp /usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

macOS / cross-platform:

curl -L "http://ftp.debian.org/debian/pool/main/f/fluid-soundfont/fluid-soundfont-gm_3.1-5.3_all.deb" \
     -o /tmp/fluid.deb
cd /tmp && ar x fluid.deb && tar xf data.tar.* ./usr/share/sounds/sf2/FluidR3_GM.sf2
mkdir -p data/soundfonts && cp /tmp/usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

SF2 resolution order: PITCHBENCH_SF2 env var → data/soundfonts/ → /usr/share/sounds/sf2/FluidR3_GM.sf2.
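The lookup order can be pictured with the following sketch. It is an illustration under stated assumptions, not the package's actual resolver: the function name resolve_soundfont is hypothetical, and picking the alphabetically first file in data/soundfonts/ is an assumed tie-break.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# System-wide location used by the Debian/Ubuntu fluid-soundfont-gm package.
SYSTEM_SF2 = Path("/usr/share/sounds/sf2/FluidR3_GM.sf2")


def resolve_soundfont(
    root: Path = Path("."),
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the first soundfont found, or None if nothing is available."""
    # 1. An explicit override always wins.
    override = env.get("PITCHBENCH_SF2")
    if override:
        return Path(override)

    # 2. Any .sf2 / .sf3 file placed in data/soundfonts/.
    #    (Assumption: the alphabetically first match is used.)
    sf_dir = root / "data" / "soundfonts"
    if sf_dir.is_dir():
        candidates = sorted(
            p for p in sf_dir.iterdir() if p.suffix.lower() in {".sf2", ".sf3"}
        )
        if candidates:
            return candidates[0]

    # 3. Fall back to the system-wide soundfont.
    if SYSTEM_SF2.is_file():
        return SYSTEM_SF2
    return None


if __name__ == "__main__":
    print(resolve_soundfont())
```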

2. Install PitchBench

python3 -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\activate         # Windows
pip install pitchbench

For development tools:

uv sync --extra dev              # installs pytest + pytest-mock

Requires Python ≥ 3.12.

3. Configure a model backend

OpenRouter — add to .env:

OPENROUTER_KEY=sk-or-...

Use --model openrouter/<provider>/<model>. "Latest" alias slugs require a ~ prefix, e.g. openrouter/~google/gemini-flash-latest.

DashScope — add to .env:

DASHSCOPE_API_KEY=sk-...

Use --model dashscope/<model>.

Local server (supported by the base pip install pitchbench, no extra dependencies):

--model http://localhost:8001 --name my-local-model
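The three --model forms above differ only in their prefix. The sketch below shows one way such a string could be split into a backend and a target. It is a hypothetical illustration: ModelSpec and parse_model_spec are invented names, and PitchBench's real dispatcher may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    backend: str  # "openrouter", "dashscope", or "local"
    target: str   # model slug, or the server URL for a local backend


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a --model value into (backend, target)."""
    # A URL is treated as a local server.
    if spec.startswith(("http://", "https://")):
        return ModelSpec("local", spec)

    # Otherwise the first path segment names the hosted backend; the rest,
    # including any "~" alias prefix, is passed through untouched.
    backend, sep, target = spec.partition("/")
    if sep and target and backend in ("openrouter", "dashscope"):
        return ModelSpec(backend, target)

    raise ValueError(f"unrecognised model spec: {spec!r}")


if __name__ == "__main__":
    print(parse_model_spec("openrouter/~google/gemini-flash-latest"))
    print(parse_model_spec("http://localhost:8001"))
```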

Usage

PitchBench separates stimulus generation from model evaluation. Generate once, evaluate any number of models.

Generate

pitchbench generate all
pitchbench generate a          # single category
pitchbench generate a1         # single experiment

Writes WAV files and a Parquet dataset to data/generated/<exp>/. Re-running is safe: existing rows are skipped. The default experiment parameters generate more than 100,000 audio files, which can take several hours. To widen the parameter coverage, and with it the dataset, supply a custom experiment config:

pitchbench generate all --source your_config.py

An example config is available in the PitchBench GitHub repository.

data/generated/pitchbench_a1_single_pitch_id/
  *.wav               # one file per condition
  _questions.parquet  # HuggingFace-compatible schema
  _questions.csv      # human-readable companion
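A quick way to sanity-check a generated experiment directory is to count its WAV files and confirm the dataset files exist. The helper below is a hypothetical, standard-library-only sketch that relies solely on the layout shown above. It makes no assumption about the column names inside _questions.csv.

```python
import csv
from pathlib import Path


def summarise_experiment_dir(exp_dir: Path) -> dict:
    """Report what a generated experiment directory contains."""
    csv_path = exp_dir / "_questions.csv"
    n_rows = None
    if csv_path.is_file():
        with csv_path.open(newline="") as fh:
            # Count data rows, excluding the header line.
            n_rows = max(sum(1 for _ in csv.reader(fh)) - 1, 0)
    return {
        "experiment": exp_dir.name,
        "n_wav": len(list(exp_dir.glob("*.wav"))),
        "has_parquet": (exp_dir / "_questions.parquet").is_file(),
        "n_csv_rows": n_rows,
    }


if __name__ == "__main__":
    for d in sorted(Path("data/generated").glob("pitchbench_*")):
        print(summarise_experiment_dir(d))
```

If pandas and a Parquet engine are installed, pandas.read_parquet on _questions.parquet should load the same table as a DataFrame.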

Evaluate

pitchbench --list
pitchbench evaluate all --model openrouter/<provider>/<model>
pitchbench evaluate a1  --model openrouter/<provider>/<model>

# Quick test (20 stimuli, stratified)
pitchbench evaluate a1  --model openrouter/<provider>/<model> --sample-n 20 --sample-seed 0
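To make the idea of a seeded, stratified quick test concrete, here is a minimal sketch. It is not PitchBench's sampling code: the function name, the choice of stratification key, and the round-robin allocation (equal draws per stratum rather than proportional) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw up to n items, cycling over strata so each is represented."""
    rng = random.Random(seed)

    # Group items by stratum.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Shuffle each stratum; iterate keys in sorted order so the result
    # depends only on the seed, not on input ordering of the groups.
    pools = []
    for k in sorted(strata):
        pool = list(strata[k])
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin: take one item from each non-empty stratum in turn.
    sample = []
    while len(sample) < n and any(pools):
        for pool in pools:
            if pool and len(sample) < n:
                sample.append(pool.pop())
    return sample


if __name__ == "__main__":
    stimuli = [
        {"instrument": inst, "midi": m}
        for inst in ("piano", "sine", "violin", "flute")
        for m in range(60, 72)
    ]
    picked = stratified_sample(stimuli, key=lambda s: s["instrument"], n=20, seed=0)
    print(len(picked))
```

Fixing the seed means two models can be compared on the same subset of stimuli.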

Results land in results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/.

In tmux (recommended for long runs):

tmux new-session -d -s mymodel "cd /path/to/PitchBench && mkdir -p logs && source .venv/bin/activate && \
  pitchbench evaluate all --model openrouter/<provider>/<model> 2>&1 | tee logs/eval_mymodel_\$(date +%Y%m%d_%H%M%S).log"

Analyze

pitchbench analyze --preset q1 --model openrouter/<provider>/<model>

Writes to results/analysis/<model_slug>/<run>/ and generates a summary of the ablations.

Batch (multiple models):

python -m pitchbench.analysis.run_analysis q1 \
    --models openrouter/<provider>/<model1> \
             openrouter/<provider>/<model2> \
             dashscope/<model3>

Output

results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/
  pitchbench_<exp>/
    results_<model>.json      # metadata, git commit, per-item responses
    results_<model>.csv       # one row per stimulus
    accuracies_<model>.csv    # aggregate metrics
    plots/                    # per-format CI plots

results/evaluation/summary/ is populated automatically after each run (aggregated_accuracies.csv, accuracy plots by instrument / notation / note).
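Because run directories are named YYYYMMDD_HHMMSS, sorting them as strings also sorts them chronologically. The sketch below uses that to find the most recent run for a model and collect its per-experiment accuracy files. The helper names are hypothetical, and no column names are assumed: rows are returned as plain dictionaries keyed by whatever headers the CSV has.

```python
import csv
from pathlib import Path
from typing import Optional


def latest_run(eval_root: Path, model_slug: str) -> Optional[Path]:
    """Most recent run directory for a model, or None if there is none."""
    model_dir = eval_root / model_slug
    if not model_dir.is_dir():
        return None
    runs = sorted(p for p in model_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_accuracies(run_dir: Path) -> list:
    """Read every accuracies_*.csv under a run into a list of dicts."""
    rows = []
    for csv_path in sorted(run_dir.glob("pitchbench_*/accuracies_*.csv")):
        with csv_path.open(newline="") as fh:
            for row in csv.DictReader(fh):
                # Record which experiment directory the row came from.
                row["_experiment_dir"] = csv_path.parent.name
                rows.append(row)
    return rows


if __name__ == "__main__":
    run = latest_run(Path("results/evaluation"), "my_model_slug")
    if run is not None:
        for row in load_accuracies(run):
            print(row)
```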


Experiment categories

28 experiments across 7 categories.

Category A — Single-pitch identification

ID  Script                    What it tests
a1  single_pitch_id           Baseline: identify one sustained note across all sources and formats
a2  single_pitch_by_loudness  Pitch accuracy under different loudness levels
a3  single_pitch_by_duration  Pitch accuracy under different durations

Category B — Time-localised pitch

ID  Script                       What it tests
b1  single_pitch_within_silence  Hidden note in a silent clip
b2  pitch_at_timestamp           Which pitch is playing at a queried timestamp in a multi-note sequence
b3  timestamp_single_pitch       Detect when a single note starts/ends
b4  timestamp_specific_pitch     Onset/offset of a named target pitch among distractors
b5  timestamp_multiple_pitches   Full timing transcription of all notes in a sequence

Category C — Chords and simultaneous pitches

ID  Script               What it tests
c1  chord_count_pitches  Count distinct pitches in a chord (dyads, triads, 7ths)
c2  chord_dyad_interval  Name the interval (semitone count) between two simultaneous tones
c3  chord_quality        Name chord quality and/or root+quality from a sounding chord
c4  chord_pitches        List every note in a chord

Category D — Sequences, contour, intervals

ID   Script                        What it tests
d1   sequence_count_pitches        Count distinct pitches in a sequential passage
d2   dyad_lower_higher_difference  Binary higher/lower judgment
d3   contour_discrete              Output up/down tokens for each step-wise transition
d4   contour_continuous            Output up/down tokens for each monotonic movement
d5   sequence_ranking_by_pitch     Rank sequential tones from lowest to highest
d6   sequence_dyad_interval        Name the melodic interval between two sequential notes
d7a  pitch_with_reference          Pitch identification given a labelled reference tone (variant used in the evaluation)
d7b  pitch_with_reference_split    Same, but reference and target in separate audio clips
d7c  pitch_with_reference          Variant of d7a with a different set of pitches
d7d  pitch_with_reference_split    Split-clip variant of d7b with a different set of pitches
d8   sequence_pitches              Transcribe all pitches in a note sequence in order
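The difference between the two contour tasks (d3 and d4) can be illustrated with a short sketch. This is one reading of the descriptions above, not the benchmark's scoring code. It assumes the discrete form emits one token per note-to-note transition and the continuous form merges consecutive transitions in the same direction. The "same" token for repeated notes is an added assumption, and the function names are hypothetical.

```python
def discrete_contour(midi_notes):
    """One token per transition between adjacent notes."""
    tokens = []
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        if cur > prev:
            tokens.append("up")
        elif cur < prev:
            tokens.append("down")
        else:
            tokens.append("same")  # assumption: repeated notes get their own token
    return tokens


def continuous_contour(midi_notes):
    """One token per monotonic run: merge adjacent same-direction steps."""
    merged = []
    for tok in discrete_contour(midi_notes):
        if tok == "same":
            continue  # assumption: a repeated note does not break a run
        if not merged or merged[-1] != tok:
            merged.append(tok)
    return merged


if __name__ == "__main__":
    melody = [60, 62, 64, 63, 65]
    print(discrete_contour(melody))
    print(continuous_contour(melody))
```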

Category E — Robustness

ID  Script               What it tests
e1  audio_effects        Pitch under effects like high-pass/low-pass filtering, distortion, reverb, chorus
e2  background           Pitch over real-world backgrounds (crowd, rain, bells, street) at varying loudness levels
e3  harmonic_saturation  Pitch with harmonic saturation / overdrive
e4  time_stretching      Pitch with time-stretching (speed change without pitch shift)
e5  vibrato              Pitch with vibrato at varying rates and depths
e6  slightly_off         Detuned tones (up to 45% of a semitone)
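For a sense of scale in e6: 45% of a semitone is 45 cents, and in equal temperament a shift of c cents multiplies the frequency by 2^(c/1200). The sketch below works this through. It assumes A4 = 440 Hz and uses hypothetical function names; how PitchBench actually applies the detuning is not stated in this document.

```python
import math


def detuned_hz(midi: int, detune_semitones: float, a4_hz: float = 440.0) -> float:
    """Frequency of a MIDI note shifted by a fraction of a semitone.

    detune_semitones = 0.45 means 45 cents sharp; negative values are flat.
    """
    return a4_hz * 2.0 ** ((midi - 69 + detune_semitones) / 12.0)


def cents_between(f_ref: float, f_other: float) -> float:
    """Signed distance from f_ref to f_other in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f_other / f_ref)


if __name__ == "__main__":
    in_tune = detuned_hz(69, 0.0)
    sharp = detuned_hz(69, 0.45)
    # A4 detuned by 45 cents lands roughly 11.6 Hz above 440 Hz.
    print(round(in_tune, 2), round(sharp, 2), round(cents_between(in_tune, sharp), 1))
```

Since the cap stays below 50 cents, the nearest equal-tempered note is still unambiguous, which is presumably why the limit sits just under half a semitone.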

Category F — Polyphony

ID  Script               What it tests
f1  melodic_line_atonal  Transcribe one designated voice from 2–3 simultaneous synthetic voices (atonal)
f2  melodic_line_tonal   Same task over tonal melodic material from Bach chorales

Category Y — Format variants

ID  Script               What it tests
y1  single_pitch_id_mcq  MCQ variant of a1: pick the correct pitch from labelled options

Stimulus engine

  • Waveforms (always available): sine, sawtooth, square, triangle
  • GM instruments (FluidSynth): piano, electric_keyboard, guitar, flute, trumpet, trombone, clarinet, oboe, violin, cello, organ, bass, synth_lead, synth_pad, voice
  • Backgrounds (e2): white_noise synthesised; real recordings in data/preloaded/background/<name>.mp3
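The four basic waveforms need no soundfont because each is a simple function of phase. The sketch below renders them with the standard library only and writes a 16-bit mono WAV. It is a naive, non-band-limited illustration with hypothetical function names, not the PitchBench engine, so the square and sawtooth shapes will alias at high frequencies.

```python
import math
import struct
import wave


def waveform_sample(shape: str, phase: float) -> float:
    """Amplitude in [-1, 1] for a phase in [0, 1)."""
    if shape == "sine":
        return math.sin(2.0 * math.pi * phase)
    if shape == "square":
        return 1.0 if phase < 0.5 else -1.0
    if shape == "sawtooth":
        return 2.0 * phase - 1.0
    if shape == "triangle":
        return 1.0 - 4.0 * abs(phase - 0.5)
    raise ValueError(f"unknown waveform: {shape!r}")


def render_tone(shape, freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """List of float samples for a steady tone."""
    n = int(duration_s * sample_rate)
    return [
        amplitude * waveform_sample(shape, (freq_hz * i / sample_rate) % 1.0)
        for i in range(n)
    ]


def write_wav(path, samples, sample_rate=44100):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(frames)


if __name__ == "__main__":
    write_wav("a4_sine.wav", render_tone("sine", 440.0, 1.0))
```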

Repository layout

src/pitchbench/
  config.py                    # runtime constants and env wiring
  configs/
    benchmark_config.py        # paper-locked data-gen parameters
    analysis_config.py         # analysis-mode presets/overrides
    plot_config.py             # labels/colors for figures
    user_config.py             # user-overridable settings
  sound/
    engine.py                  # central audio synthesis engine
  model/
    query.py                   # ALM query facade (local/OpenRouter/DashScope)
    dispatcher.py              # provider-aware request dispatch
    cost.py                    # API usage/cost tracking
  experiments/
    run.py                     # pitchbench CLI
    scripts/                   # one .py per experiment (32 total)
    helpers/
      cat_a.py ... cat_f.py    # per-category generate + evaluate helpers
      audit.py                 # condition/result audit utilities
      data.py                  # Parquet I/O
      music.py                 # pitch/notation conversion helpers
      plots.py                 # experiment plotting helpers
      results.py               # result writers, aggregators, summary CSVs
      sampling.py              # stratified sampling
      setup.py                 # experiment setup helpers
      timing_layout.py         # timing-grid utilities
  analysis/
    analyze_a1.py              # a1 line plots + heatmaps
    analyze.py                 # core analysis pipeline
    a1.py                      # alternate A1 analysis entrypoint
    ablation.py                # ablation summaries
    combine.py                 # combine multi-run CSV outputs
    overview.py                # overview plots/tables
    run_analysis.py            # batch analysis CLI
data/
  preloaded/                   # background recordings (gitignored)
  generated/                   # created by `pitchbench generate`
results/                       # created by evaluate/analyze runs

Set PITCHBENCH_ROOT to override the project root for data/ and results/.


Environment variables

Variable                           Default                Purpose
PITCHBENCH_ROOT                    .                      Project root (data/, results/)
PITCHBENCH_SF2                     auto-discovered        GM soundfont path override
PITCHBENCH_LOCAL_URL               http://localhost:8001  Default local model server
OPENROUTER_KEY                     -                      OpenRouter API key
PITCHBENCH_CONCURRENCY_OPENROUTER  20                     Max parallel OpenRouter calls
PITCHBENCH_CONCURRENCY_DASHSCOPE   10                     Max parallel DashScope calls
LOCAL_CONCURRENCY                  2                      Max parallel local server calls
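A per-backend cap on parallel calls is commonly enforced with a semaphore. The sketch below reads the limits from the variables above, falling back to the documented defaults, and bounds a batch of coroutines with asyncio.Semaphore. It is an illustrative pattern with hypothetical helper names. Whether PitchBench uses asyncio internally is not stated here.

```python
import asyncio
import os

# Defaults taken from the environment-variable table.
DEFAULT_CONCURRENCY = {
    "PITCHBENCH_CONCURRENCY_OPENROUTER": 20,
    "PITCHBENCH_CONCURRENCY_DASHSCOPE": 10,
    "LOCAL_CONCURRENCY": 2,
}


def concurrency_limit(var: str, env=os.environ) -> int:
    """Limit from the environment, or the documented default if unset."""
    raw = env.get(var)
    return int(raw) if raw else DEFAULT_CONCURRENCY[var]


async def run_bounded(coro_fns, limit: int):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(fn):
        async with sem:
            return await fn()

    # gather preserves the input order of results.
    return await asyncio.gather(*(guarded(fn) for fn in coro_fns))


if __name__ == "__main__":
    async def fake_call():
        await asyncio.sleep(0.01)  # stand-in for a model request
        return "ok"

    limit = concurrency_limit("LOCAL_CONCURRENCY")
    print(asyncio.run(run_bounded([fake_call] * 5, limit)))
```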

Tests

uv run pytest

77 tests covering CLI resolution, Parquet I/O, result aggregation, sampling, and API routing. No network calls or audio generation required.

License

MIT — see LICENSE.

Download files

Source distribution: pitchbench-0.1.6.tar.gz (179.5 kB)
Built distribution: pitchbench-0.1.6-py3-none-any.whl (235.8 kB)
