PitchBench -- Python Package

Benchmark suite for evaluating pitch and acoustic perception in audio language models (ALMs). Probes pitch identification, temporal localisation, chord recognition, melodic contour, robustness to audio effects, and more — reporting per-format accuracy (MIDI, SPN, doremi, Hz) to expose where verbal decoding fails.
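
The answer formats are different names for the same pitch. The following is a minimal sketch of the conversions between MIDI, SPN, and Hz, assuming 12-tone equal temperament with A4 = 440 Hz = MIDI 69 and C4 = MIDI 60. The function names are illustrative and are not PitchBench's API; doremi is left out because its exact convention is not described here.

```python
import math
import re

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
_SPN = re.compile(r"^([A-G]#?)(-?\d+)$")  # sharps only; flats are not handled


def midi_to_hz(midi: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency of a MIDI note number."""
    return a4_hz * 2.0 ** ((midi - 69) / 12)


def hz_to_midi(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Nearest MIDI note number for a frequency in Hz."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))


def midi_to_spn(midi: int) -> str:
    """Scientific pitch notation, e.g. 60 -> 'C4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def spn_to_midi(spn: str) -> int:
    """Inverse of midi_to_spn for sharp-spelled names."""
    match = _SPN.match(spn)
    if match is None:
        raise ValueError(f"not a sharp-spelled SPN name: {spn!r}")
    name, octave = match.group(1), int(match.group(2))
    return NOTE_NAMES.index(name) + 12 * (octave + 1)
```

With these definitions, MIDI 69, "A4", and 440 Hz all refer to the same pitch, so a model can be right in one format and wrong in another for the same stimulus.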


Setup

1. Install FluidSynth

| Platform | Command |
| --- | --- |
| Linux / WSL | sudo apt install fluidsynth |
| macOS | brew install fluid-synth |
| Windows | choco install fluidsynth, or download from fluidsynth.org |

A GM soundfont is also required. Place any .sf2 or .sf3 in data/soundfonts/ — no further configuration needed.

Linux / WSL:

sudo apt install fluid-soundfont-gm
mkdir -p data/soundfonts
cp /usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

macOS / cross-platform:

curl -L "http://ftp.debian.org/debian/pool/main/f/fluid-soundfont/fluid-soundfont-gm_3.1-5.3_all.deb" \
     -o /tmp/fluid.deb
cd /tmp && ar x fluid.deb && tar xf data.tar.* ./usr/share/sounds/sf2/FluidR3_GM.sf2
mkdir -p data/soundfonts && cp /tmp/usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

SF2 resolution order: PITCHBENCH_SF2 env var → data/soundfonts/ → /usr/share/sounds/sf2/FluidR3_GM.sf2.
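
The lookup order can be sketched as below. This is an illustration, not PitchBench's own code: resolve_soundfont is a hypothetical name, and two details are assumptions, namely that the env override is returned without checking that it exists and that the alphabetically first file wins when data/soundfonts/ holds several.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

SYSTEM_SF2 = Path("/usr/share/sounds/sf2/FluidR3_GM.sf2")


def resolve_soundfont(
    root: Path = Path("."),
    env: Optional[Mapping[str, str]] = None,
    system_sf2: Path = SYSTEM_SF2,
) -> Optional[Path]:
    """Return the first soundfont found, or None if nothing is available."""
    env = os.environ if env is None else env

    # 1. Explicit override via environment variable.
    override = env.get("PITCHBENCH_SF2")
    if override:
        return Path(override)

    # 2. Any .sf2 / .sf3 placed in data/soundfonts/ (assumed: first by name).
    sf_dir = root / "data" / "soundfonts"
    if sf_dir.is_dir():
        candidates = sorted(
            p for p in sf_dir.iterdir() if p.suffix.lower() in {".sf2", ".sf3"}
        )
        if candidates:
            return candidates[0]

    # 3. System-wide install location used by the Debian/Ubuntu package.
    return system_sf2 if system_sf2.is_file() else None
```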

2. Install PitchBench

python3 -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\activate         # Windows
pip install pitchbench

For development tools:

uv sync --extra dev              # installs pytest + pytest-mock

Requires Python ≥ 3.12.

3. Configure a model backend

OpenRouter — add to .env:

OPENROUTER_KEY=sk-or-...

Use --model openrouter/<provider>/<model>. "Latest" alias slugs require a ~ prefix, e.g. openrouter/~google/gemini-flash-latest.

DashScope — add to .env:

DASHSCOPE_API_KEY=sk-...

Use --model dashscope/<model>.

Local server: no extra install is needed; support is part of the base pip install pitchbench dependencies. Pass the server URL as the model and give it a name:

--model http://localhost:8001 --name my-local-model
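
The three --model forms above differ only in their prefix. A small sketch of how such a string could be split into backend and target follows; ModelSpec and parse_model are hypothetical names for illustration and do not describe how PitchBench's dispatcher is actually written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    backend: str  # "openrouter", "dashscope", or "local"
    target: str   # model slug, or base URL for a local server


def parse_model(model: str) -> ModelSpec:
    """Split a --model value into (backend, target)."""
    # A URL means a local server.
    if model.startswith(("http://", "https://")):
        return ModelSpec("local", model.rstrip("/"))

    # Otherwise the first path segment names the hosted backend.
    backend, sep, slug = model.partition("/")
    if sep and slug and backend in {"openrouter", "dashscope"}:
        return ModelSpec(backend, slug)

    raise ValueError(f"unrecognised --model value: {model!r}")
```

Under this reading, openrouter/~google/gemini-flash-latest splits into backend openrouter and target ~google/gemini-flash-latest, keeping the ~ prefix intact.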

Usage

PitchBench separates stimulus generation from model evaluation. Generate once, evaluate any number of models.

Generate

pitchbench generate all
pitchbench generate a          # single category
pitchbench generate a1         # single experiment

Writes WAV files and a Parquet dataset to data/generated/<exp>/. Re-running is safe: existing rows are skipped.

data/generated/pitchbench_a1_single_pitch_id/
  *.wav               # one file per condition
  _questions.parquet  # HuggingFace-compatible schema
  _questions.csv      # human-readable companion
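
To check what a generate run produced, the layout above can be walked with the standard library alone. This is a sketch that relies only on the file names shown in the listing; summarize_generated is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def summarize_generated(root: Path) -> dict[str, dict]:
    """Count WAV files and check for the question tables in each experiment dir."""
    summary = {}
    for exp_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[exp_dir.name] = {
            "wav_files": sum(1 for _ in exp_dir.glob("*.wav")),
            "has_parquet": (exp_dir / "_questions.parquet").is_file(),
            "has_csv": (exp_dir / "_questions.csv").is_file(),
        }
    return summary


if __name__ == "__main__":
    generated = Path("data/generated")
    if generated.is_dir():
        for name, info in summarize_generated(generated).items():
            print(name, info)
```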

Evaluate

pitchbench --list
pitchbench evaluate all --model openrouter/<provider>/<model>
pitchbench evaluate a1  --model openrouter/<provider>/<model>

# Quick test (20 stimuli, stratified)
pitchbench evaluate a1  --model openrouter/<provider>/<model> --sample-n 20 --sample-seed 0
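
A seeded, stratified draw of this kind can be sketched as follows. How PitchBench defines its strata and allocates the sample is not stated here, so this version assumes a caller-supplied key and a simple round-robin over strata; stratified_sample is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Pick up to n items, cycling over strata so each is represented evenly."""
    rng = random.Random(seed)  # a local RNG keeps the draw reproducible

    # Group items by stratum.
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    # Shuffle within each stratum; sorted keys make the order deterministic.
    pools = []
    for stratum in sorted(groups):
        pool = list(groups[stratum])
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin: take one item per stratum until n are picked or all are used.
    picked = []
    while len(picked) < n and any(pools):
        for pool in pools:
            if pool and len(picked) < n:
                picked.append(pool.pop())
    return picked
```

Because the generator is seeded, the same seed and the same input always give the same subset, which keeps a quick test comparable across models.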

Results land in results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/.
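
Because run directories are named by timestamp, the most recent run for a model can be found by parsing the directory names. A minimal sketch, assuming the <YYYYMMDD_HHMMSS> pattern shown above; latest_run is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

RUN_FORMAT = "%Y%m%d_%H%M%S"


def latest_run(model_dir: Path) -> Optional[Path]:
    """Return the newest timestamped run directory under a model's results dir."""
    runs = []
    for path in model_dir.iterdir():
        if not path.is_dir():
            continue
        try:
            stamp = datetime.strptime(path.name, RUN_FORMAT)
        except ValueError:
            continue  # ignore directories that are not timestamped runs
        runs.append((stamp, path))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]
```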

In tmux (recommended for long runs):

tmux new-session -d -s mymodel "cd /path/to/PitchBench && mkdir -p logs && source .venv/bin/activate && \
  pitchbench evaluate all --model openrouter/<provider>/<model> 2>&1 | tee logs/eval_mymodel_\$(date +%Y%m%d_%H%M%S).log"

Analyze

pitchbench analyze --preset q1 --model openrouter/<provider>/<model>

Writes to results/analysis/<model_slug>/<run>/ and generates a summary of the ablations.

Batch (multiple models):

python -m pitchbench.analysis.run_analysis q1 \
    --models openrouter/<provider>/<model1> \
             openrouter/<provider>/<model2> \
             dashscope/<model3>

Output

results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/
  pitchbench_<exp>/
    results_<model>.json      # metadata, git commit, per-item responses
    results_<model>.csv       # one row per stimulus
    accuracies_<model>.csv    # aggregate metrics
    plots/                    # per-format CI plots

results/evaluation/summary/ is populated automatically after each run (aggregated_accuracies.csv, accuracy plots by instrument / notation / note).
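
For ad-hoc inspection of a single run, the per-experiment accuracy files can be gathered with the standard library. This sketch relies only on the directory and file name patterns shown above; the column names inside the CSVs are not documented here, so rows are kept as plain dictionaries. load_accuracies and the added _experiment_dir key are hypothetical.

```python
import csv
from pathlib import Path


def load_accuracies(run_dir: Path) -> list[dict]:
    """Read every accuracies_*.csv under a run directory into one list of rows."""
    rows = []
    for csv_path in sorted(run_dir.glob("pitchbench_*/accuracies_*.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Remember which experiment directory the row came from.
                row["_experiment_dir"] = csv_path.parent.name
                rows.append(row)
    return rows
```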


Experiment categories

32 experiments across 7 categories, counting the d7a to d7d variants separately.

Category A — Single-pitch identification

| ID | Script | What it tests |
| --- | --- | --- |
| a1 | single_pitch_id | Baseline: identify one sustained note across all sources and formats |
| a2 | single_pitch_by_loudness | Pitch accuracy under different loudness levels |
| a3 | single_pitch_by_duration | Pitch accuracy under different durations |

Category B — Time-localised pitch

| ID | Script | What it tests |
| --- | --- | --- |
| b1 | single_pitch_within_silence | Hidden note in a silent clip |
| b2 | pitch_at_timestamp | Which pitch is playing at a queried timestamp in a multi-note sequence |
| b3 | timestamp_single_pitch | Detect when a single note starts/ends |
| b4 | timestamp_specific_pitch | Onset/offset of a named target pitch among distractors |
| b5 | timestamp_multiple_pitches | Full timing transcription of all notes in a sequence |

Category C — Chords and simultaneous pitches

| ID | Script | What it tests |
| --- | --- | --- |
| c1 | chord_count_pitches | Count distinct pitches in a chord (dyads, triads, 7ths) |
| c2 | chord_dyad_interval | Name the interval (semitone count) between two simultaneous tones |
| c3 | chord_quality | Name chord quality and/or root+quality from a sounding chord |
| c4 | chord_pitches | List every note in a chord |

Category D — Sequences, contour, intervals

| ID | Script | What it tests |
| --- | --- | --- |
| d1 | sequence_count_pitches | Count distinct pitches in a sequential passage |
| d2 | dyad_lower_higher_difference | Binary higher/lower judgment |
| d3 | contour_discrete | Output up/down tokens for each step-wise transition |
| d4 | contour_continuous | Output up/down tokens for each monotonic movement |
| d5 | sequence_ranking_by_pitch | Rank sequential tones from lowest to highest |
| d6 | sequence_dyad_interval | Name the melodic interval between two sequential notes |
| d7a | pitch_with_reference | Pitch identification given a labelled reference tone (variant used in the evaluation) |
| d7b | pitch_with_reference_split | Same, but reference and target in separate audio clips |
| d7c | pitch_with_reference | Variant with a different set of pitches |
| d7d | pitch_with_reference_split | Split-clip variant with a different set of pitches |
| d8 | sequence_pitches | Transcribe all pitches in a note sequence in order |
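
The difference between one token per transition (d3) and one token per monotonic movement (d4) can be illustrated on a list of MIDI note numbers. This is only one reading of the two task descriptions, not the benchmark's scoring code: the token spellings, the function names, and the choice to skip repeated pitches are all assumptions.

```python
from itertools import groupby


def contour_per_step(midi_notes):
    """One 'up'/'down' token for every change between consecutive notes."""
    tokens = []
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        if cur > prev:
            tokens.append("up")
        elif cur < prev:
            tokens.append("down")
        # equal pitches produce no token (assumption)
    return tokens


def contour_per_movement(midi_notes):
    """Collapse consecutive identical tokens into one token per monotonic run."""
    return [token for token, _ in groupby(contour_per_step(midi_notes))]
```

For the sequence 60, 62, 64, 62, 60, 65 the first function gives up, up, down, down, up, while the second gives up, down, up.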

Category E — Robustness

| ID | Script | What it tests |
| --- | --- | --- |
| e1 | audio_effects | Pitch under effects like high-pass/low-pass filtering, distortion, reverb, chorus |
| e2 | background | Pitch over real-world backgrounds (crowd, rain, bells, street) at varying loudness levels |
| e3 | harmonic_saturation | Pitch with harmonic saturation / overdrive |
| e4 | time_stretching | Pitch with time-stretching (speed change without pitch shift) |
| e5 | vibrato | Pitch with vibrato at varying rates and depths |
| e6 | slightly_off | Detuned tones (up to 45% of a semitone, i.e. 45 cents) |
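
For e6, a detuning of 45% of a semitone is 45 cents, since a semitone is 100 cents and an octave is 1200. A short sketch of the standard equal-temperament relation between cents and frequency follows; the function names are illustrative and not taken from the package.

```python
import math


def detune(freq_hz: float, cents: float) -> float:
    """Shift a frequency by a number of cents (100 cents = 1 semitone)."""
    return freq_hz * 2.0 ** (cents / 1200.0)


def cents_between(f1_hz: float, f2_hz: float) -> float:
    """Signed distance from f1 to f2 in cents."""
    return 1200.0 * math.log2(f2_hz / f1_hz)
```

A tone 45 cents above A4 is still closer to A4 than to A#4, so rounding to the nearest semitone recovers the nominal pitch.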

Category F — Polyphony

| ID | Script | What it tests |
| --- | --- | --- |
| f1 | melodic_line_atonal | Transcribe one designated voice from 2–3 simultaneous synthetic voices (atonal) |
| f2 | melodic_line_tonal | Same task over tonal melodic material from Bach chorales |

Category Y — Format variants

| ID | Script | What it tests |
| --- | --- | --- |
| y1 | single_pitch_id_mcq | MCQ variant of a1: pick the correct pitch from labelled options |

Stimulus engine

  • Waveforms (always available): sine, sawtooth, square, triangle
  • GM instruments (FluidSynth): piano, electric_keyboard, guitar, flute, trumpet, trombone, clarinet, oboe, violin, cello, organ, bass, synth_lead, synth_pad, voice
  • Backgrounds (e2): white_noise synthesised; real recordings in data/preloaded/background/<name>.mp3
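
The four basic waveforms can be written as functions of phase. The sketch below uses naive (non-band-limited) shapes in pure Python to show the idea; it is not the code in sound/engine.py, and the names, default sample rate, and amplitude are assumptions.

```python
import math


def waveform_sample(shape: str, phase: float) -> float:
    """Value in [-1, 1] for a phase measured in cycles."""
    t = phase % 1.0  # position within the current cycle
    if shape == "sine":
        return math.sin(2.0 * math.pi * t)
    if shape == "sawtooth":
        return 2.0 * t - 1.0
    if shape == "square":
        return 1.0 if t < 0.5 else -1.0
    if shape == "triangle":
        return 1.0 - 4.0 * abs(t - 0.5)
    raise ValueError(f"unknown waveform: {shape!r}")


def render(shape, freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Render a tone as a list of float samples."""
    n_samples = int(duration_s * sample_rate)
    return [
        amplitude * waveform_sample(shape, freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]
```

Naive sawtooth and square waves alias at high pitches, so a production engine would typically use band-limited synthesis instead.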

Repository layout

src/pitchbench/
  config.py                    # runtime constants and env wiring
  configs/
    benchmark_config.py        # paper-locked data-gen parameters
    analysis_config.py         # analysis-mode presets/overrides
    plot_config.py             # labels/colors for figures
    user_config.py             # user-overridable settings
  sound/
    engine.py                  # central audio synthesis engine
  model/
    query.py                   # ALM query facade (local/OpenRouter/DashScope)
    dispatcher.py              # provider-aware request dispatch
    cost.py                    # API usage/cost tracking
  experiments/
    run.py                     # pitchbench CLI
    scripts/                   # one .py per experiment (32 total)
    helpers/
      cat_a.py ... cat_f.py    # per-category generate + evaluate helpers
      audit.py                 # condition/result audit utilities
      data.py                  # Parquet I/O
      music.py                 # pitch/notation conversion helpers
      plots.py                 # experiment plotting helpers
      results.py               # result writers, aggregators, summary CSVs
      sampling.py              # stratified sampling
      setup.py                 # experiment setup helpers
      timing_layout.py         # timing-grid utilities
  analysis/
    analyze_a1.py              # a1 line plots + heatmaps
    analyze.py                 # core analysis pipeline
    a1.py                      # alternate A1 analysis entrypoint
    ablation.py                # ablation summaries
    combine.py                 # combine multi-run CSV outputs
    overview.py                # overview plots/tables
    run_analysis.py            # batch analysis CLI
data/
  preloaded/                   # background recordings (gitignored)
  generated/                   # created by `pitchbench generate`
results/                       # created by evaluate/analyze runs

Set PITCHBENCH_ROOT to override the project root for data/ and results/.


Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| PITCHBENCH_ROOT | . | Project root (data/, results/) |
| PITCHBENCH_SF2 | auto-discovered | GM soundfont path override |
| PITCHBENCH_LOCAL_URL | http://localhost:8001 | Default local model server |
| OPENROUTER_KEY | | OpenRouter API key |
| PITCHBENCH_CONCURRENCY_OPENROUTER | 20 | Max parallel OpenRouter calls |
| PITCHBENCH_CONCURRENCY_DASHSCOPE | 10 | Max parallel DashScope calls |
| LOCAL_CONCURRENCY | 2 | Max parallel local server calls |
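
The variables and defaults above can be read into one settings object as sketched below. The variable names and defaults come from the table; Settings and load_settings are hypothetical names and do not describe how config.py is written.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    root: Path
    sf2: Optional[Path]
    local_url: str
    openrouter_key: Optional[str]
    concurrency_openrouter: int
    concurrency_dashscope: int
    concurrency_local: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    sf2 = env.get("PITCHBENCH_SF2")
    return Settings(
        root=Path(env.get("PITCHBENCH_ROOT", ".")),
        sf2=Path(sf2) if sf2 else None,  # None means auto-discovery
        local_url=env.get("PITCHBENCH_LOCAL_URL", "http://localhost:8001"),
        openrouter_key=env.get("OPENROUTER_KEY"),
        concurrency_openrouter=int(env.get("PITCHBENCH_CONCURRENCY_OPENROUTER", "20")),
        concurrency_dashscope=int(env.get("PITCHBENCH_CONCURRENCY_DASHSCOPE", "10")),
        concurrency_local=int(env.get("LOCAL_CONCURRENCY", "2")),
    )
```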

Tests

uv run pytest

77 tests covering CLI resolution, Parquet I/O, result aggregation, sampling, and API routing. No network calls or audio generation required.

License

MIT — see LICENSE.

Download files

Source Distribution

pitchbench-0.1.5.tar.gz (179.0 kB)

Uploaded Source

Built Distribution

pitchbench-0.1.5-py3-none-any.whl (235.6 kB)

Uploaded Python 3

File details

Details for the file pitchbench-0.1.5.tar.gz.

File metadata

  • Download URL: pitchbench-0.1.5.tar.gz
  • Size: 179.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for pitchbench-0.1.5.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | bfe048a0c84621c1cb94a5722f1a399188d016136a916c1504bcf8f0e4ff1bba |
| MD5 | 981d8bdb40f4a7c7da0dc2d81c8493a1 |
| BLAKE2b-256 | da90e88133b686fd6ae63eaf79b02e625da543f28418f90f9524d4ba0550fe2d |

File details

Details for the file pitchbench-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: pitchbench-0.1.5-py3-none-any.whl
  • Size: 235.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for pitchbench-0.1.5-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | c2f6237e6451b6e6b033587898ec51f9ea44dcb8384f672469f0de8609605914 |
| MD5 | 96e0d6a65d9bc35ddaf1b0bd079a0b94 |
| BLAKE2b-256 | 8a9f41909adf7a83c5577e47f0df489b716e9f161d7287a2d7a8cbdf954021f4 |
