PitchBench -- Python Package

Benchmark suite for evaluating pitch and acoustic perception in audio language models (ALMs). Probes pitch identification, temporal localisation, chord recognition, melodic contour, robustness to audio effects, and more — reporting per-format accuracy (MIDI, SPN, doremi, Hz) to expose where verbal decoding fails.

Setup

1. Install FluidSynth

Platform	Command
Linux / WSL	`sudo apt install fluidsynth`
macOS	`brew install fluid-synth`
Windows	`choco install fluidsynth` or download from fluidsynth.org

A GM soundfont is also required. Place any .sf2 or .sf3 in data/soundfonts/ — no further configuration needed.

Linux / WSL:

sudo apt install fluid-soundfont-gm
mkdir -p data/soundfonts
cp /usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

macOS / cross-platform:

curl -L "http://ftp.debian.org/debian/pool/main/f/fluid-soundfont/fluid-soundfont-gm_3.1-5.3_all.deb" \
     -o /tmp/fluid.deb
cd /tmp && ar x fluid.deb && tar xf data.tar.* ./usr/share/sounds/sf2/FluidR3_GM.sf2
mkdir -p data/soundfonts && cp /tmp/usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/

SF2 resolution order: PITCHBENCH_SF2 env var → data/soundfonts/ → /usr/share/sounds/sf2/FluidR3_GM.sf2.
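
The resolution order above can be sketched in a few lines of Python. This is an illustration only, not PitchBench's actual code: the function name `resolve_soundfont`, the alphabetical tie-break between several files in data/soundfonts/, and the choice to fall through when `PITCHBENCH_SF2` points at a missing file are all assumptions.

```python
import os
from pathlib import Path

# System-wide location quoted in the resolution order above.
SYSTEM_SF2 = Path("/usr/share/sounds/sf2/FluidR3_GM.sf2")


def resolve_soundfont(root=".", env=None, system_path=SYSTEM_SF2):
    """Return the first soundfont found in the documented order, or None.

    Hypothetical helper: the real lookup inside PitchBench may differ.
    """
    env = os.environ if env is None else env

    # 1. Explicit override through the PITCHBENCH_SF2 environment variable.
    override = env.get("PITCHBENCH_SF2")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Any .sf2 or .sf3 file placed in data/soundfonts/.
    font_dir = Path(root) / "data" / "soundfonts"
    if font_dir.is_dir():
        candidates = sorted(
            p for p in font_dir.iterdir() if p.suffix.lower() in {".sf2", ".sf3"}
        )
        if candidates:
            return candidates[0]  # assumption: alphabetical tie-break

    # 3. System-wide fallback installed by fluid-soundfont-gm.
    if Path(system_path).is_file():
        return Path(system_path)
    return None
```

Calling `resolve_soundfont()` from the project root returns a `Path`, or `None` when no soundfont can be found at any of the three locations.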

2. Install PitchBench

python3 -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\activate         # Windows
pip install pitchbench

For development tools (run from a clone of the repository):

uv sync --extra dev              # installs pytest + pytest-mock

Requires Python ≥ 3.12.

3. Configure a model backend

OpenRouter — add to .env:

OPENROUTER_KEY=sk-or-...

Use --model openrouter/<provider>/<model>. "Latest" alias slugs require a ~ prefix, e.g. openrouter/~google/gemini-flash-latest.

DashScope — add to .env:

DASHSCOPE_API_KEY=sk-...

Use --model dashscope/<model>.

Local server (support ships with the base pip install pitchbench dependencies, so no extra install is needed):

--model http://localhost:8001 --name my-local-model
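
The three `--model` forms above differ only in their prefix, so routing can be pictured as a small parser. The sketch below is hypothetical: `ModelSpec` and `parse_model_spec` are illustrative names, and the real dispatch logic in the package may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    backend: str  # "openrouter", "dashscope", or "local"
    target: str   # model slug, or base URL for a local server


def parse_model_spec(value: str) -> ModelSpec:
    """Split a --model string into a backend and a target (illustrative only)."""
    # A URL selects the local-server backend.
    if value.startswith(("http://", "https://")):
        return ModelSpec("local", value)

    # Otherwise the first path segment names the hosted backend.
    backend, sep, slug = value.partition("/")
    if sep and slug and backend in {"openrouter", "dashscope"}:
        return ModelSpec(backend, slug)

    raise ValueError(f"unrecognised --model value: {value!r}")
```

With this sketch, `openrouter/~google/gemini-flash-latest` yields the backend `openrouter` and the slug `~google/gemini-flash-latest`, keeping the `~` prefix intact.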

Usage

PitchBench separates stimulus generation from model evaluation. Generate once, evaluate any number of models.

Generate

pitchbench generate all
pitchbench generate a          # single category
pitchbench generate a1         # single experiment

Writes WAV files and a Parquet dataset to data/generated/<exp>/. Re-running is safe: existing rows are skipped.

data/generated/pitchbench_a1_single_pitch_id/
  *.wav               # one file per condition
  _questions.parquet  # HuggingFace-compatible schema
  _questions.csv      # human-readable companion
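
One way to picture why re-running generation is safe: rows already present in the dataset are filtered out before any new audio is rendered. The sketch below assumes each row carries a unique key; the column name `stimulus_id` and the function `rows_to_generate` are hypothetical and not taken from the PitchBench schema.

```python
def rows_to_generate(planned, existing, key="stimulus_id"):
    """Return the planned rows whose key is not yet in the existing dataset.

    `planned` and `existing` are lists of dicts; `key` is an assumed
    unique identifier column, not a documented PitchBench field.
    """
    seen = {row[key] for row in existing}
    return [row for row in planned if row[key] not in seen]
```

Running the filter twice with the same plan produces an empty list the second time, which is what makes the operation idempotent.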

Evaluate

pitchbench --list
pitchbench evaluate all --model openrouter/<provider>/<model>
pitchbench evaluate a1  --model openrouter/<provider>/<model>

# Quick test (20 stimuli, stratified)
pitchbench evaluate a1  --model openrouter/<provider>/<model> --sample-n 20 --sample-seed 0

Results land in results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/.
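
To illustrate what a seeded, stratified quick test means, here is a minimal sketch of proportional stratified sampling. It is not the package's sampling code: the function name, the choice of stratum, and the largest-remainder allocation are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, n, seed=0):
    """Pick n items, spread proportionally across strata (illustrative)."""
    items = list(items)
    total = len(items)
    if n >= total:
        return items

    # Group items by stratum, keeping first-seen order for determinism.
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)

    # Largest-remainder allocation: floor of the proportional share,
    # then give leftover slots to the largest fractional parts.
    shares = {k: n * len(v) / total for k, v in groups.items()}
    quota = {k: int(s) for k, s in shares.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(groups, key=lambda k: shares[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    # A private Random instance makes the draw reproducible per seed.
    rng = random.Random(seed)
    sample = []
    for k, members in groups.items():
        sample.extend(rng.sample(members, quota[k]))
    return sample
```

With 30 piano stimuli and 10 sine stimuli, asking for 20 items returns 15 piano and 5 sine items, and the same seed always returns the same selection.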

In tmux (recommended for long runs):

tmux new-session -d -s mymodel "cd /path/to/PitchBench && mkdir -p logs && source .venv/bin/activate && \
  pitchbench evaluate all --model openrouter/<provider>/<model> 2>&1 | tee logs/eval_mymodel_\$(date +%Y%m%d_%H%M%S).log"

Analyze

pitchbench analyze --preset q1 --model openrouter/<provider>/<model>

Writes to results/analysis/<model_slug>/<run>/ and generates a summary of the ablations.

Batch (multiple models):

python -m pitchbench.analysis.run_analysis q1 \
    --models openrouter/<provider>/<model1> \
             openrouter/<provider>/<model2> \
             dashscope/<model3>

Output

results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/
  pitchbench_<exp>/
    results_<model>.json      # metadata, git commit, per-item responses
    results_<model>.csv       # one row per stimulus
    accuracies_<model>.csv    # aggregate metrics
    plots/                    # per-format CI plots

results/evaluation/summary/ is populated automatically after each run (aggregated_accuracies.csv, accuracy plots by instrument / notation / note).
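
Because each run lives in a directory named `<YYYYMMDD_HHMMSS>`, the most recent run for a model can be located by parsing those names. The helper below is a sketch under that assumption; `latest_run` is a hypothetical name, and how a model name maps to `<model_slug>` is not covered here.

```python
from datetime import datetime
from pathlib import Path


def latest_run(results_root, model_slug):
    """Return the newest timestamped run directory for a model, or None."""
    model_dir = Path(results_root) / "evaluation" / model_slug
    runs = []
    if model_dir.is_dir():
        for child in model_dir.iterdir():
            try:
                stamp = datetime.strptime(child.name, "%Y%m%d_%H%M%S")
            except ValueError:
                continue  # skip anything that is not a timestamped run
            if child.is_dir():
                runs.append((stamp, child))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]
```

For example, `latest_run("results", "<model_slug>")` would point at the directory holding the newest `pitchbench_<exp>/` result folders.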

Experiment categories

32 experiments across 7 categories.

Category A — Single-pitch identification

ID	Script	What it tests
a1	`single_pitch_id`	Baseline: identify one sustained note across all sources and formats
a2	`single_pitch_by_loudness`	Pitch accuracy under different loudness levels
a3	`single_pitch_by_duration`	Pitch accuracy under different durations
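
Since accuracy is reported per answer format, it helps to see how the same pitch looks in MIDI, SPN, and Hz. The sketch below uses the standard conventions (MIDI 69 = A4 = 440 Hz, twelve-tone equal temperament, sharps for accidentals). The function names are illustrative, and the doremi format is left out because its exact mapping is not specified here.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_spn(midi: int) -> str:
    """MIDI note number to scientific pitch notation, e.g. 69 -> 'A4'."""
    octave = midi // 12 - 1  # MIDI 0 is C-1 in the common convention
    return f"{NOTE_NAMES[midi % 12]}{octave}"


def midi_to_hz(midi: float, a4: float = 440.0) -> float:
    """MIDI note number to frequency in equal temperament."""
    return a4 * 2 ** ((midi - 69) / 12)


print(midi_to_spn(60), midi_to_hz(69))  # C4 440.0
```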

Category B — Time-localised pitch

ID	Script	What it tests
b1	`single_pitch_within_silence`	Identify a single note embedded in an otherwise silent clip
b2	`pitch_at_timestamp`	Which pitch is playing at a queried timestamp in a multi-note sequence
b3	`timestamp_single_pitch`	Detect when a single note starts/ends
b4	`timestamp_specific_pitch`	Onset/offset of a named target pitch among distractors
b5	`timestamp_multiple_pitches`	Full timing transcription of all notes in a sequence

Category C — Chords and simultaneous pitches

ID	Script	What it tests
c1	`chord_count_pitches`	Count distinct pitches in a chord (dyads, triads, 7ths)
c2	`chord_dyad_interval`	Name the interval (semitone count) between two simultaneous tones
c3	`chord_quality`	Name chord quality and/or root+quality from a sounding chord
c4	`chord_pitches`	List every note in a chord
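
For the dyad task, the semitone count follows directly from the two MIDI note numbers. The sketch below pairs that count with conventional interval names. The naming table and the function `dyad_interval` are illustrative; the answer format PitchBench actually expects may differ.

```python
INTERVAL_NAMES = {
    0: "unison", 1: "minor second", 2: "major second", 3: "minor third",
    4: "major third", 5: "perfect fourth", 6: "tritone", 7: "perfect fifth",
    8: "minor sixth", 9: "major sixth", 10: "minor seventh",
    11: "major seventh", 12: "octave",
}


def dyad_interval(midi_a: int, midi_b: int):
    """Return (semitone count, interval name) for two simultaneous tones."""
    semitones = abs(midi_a - midi_b)
    name = INTERVAL_NAMES.get(semitones, f"compound interval ({semitones} semitones)")
    return semitones, name


print(dyad_interval(60, 67))  # (7, 'perfect fifth')
```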

Category D — Sequences, contour, intervals

ID	Script	What it tests
d1	`sequence_count_pitches`	Count distinct pitches in a sequential passage
d2	`dyad_lower_higher_difference`	Binary higher/lower judgment
d3	`contour_discrete`	Output up/down tokens for each step-wise transition
d4	`contour_continuous`	Output up/down tokens for each monotonic movement
d5	`sequence_ranking_by_pitch`	Rank sequential tones from lowest to highest
d6	`sequence_dyad_interval`	Name the melodic interval between two sequential notes
d7a	`pitch_with_reference`	Pitch identification given a labelled reference tone (variant used in the evaluation)
d7b	`pitch_with_reference_split`	Same, but reference and target in separate audio clips
d7c	`pitch_with_reference`	Variant with different set of pitches
d7d	`pitch_with_reference_split`	Split-clip variant with a different set of pitches
d8	`sequence_pitches`	Transcribe all pitches in a note sequence in order
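
The difference between the two contour tasks (d3 and d4) can be shown with a short sketch. It reflects one reading of the table above: d3 emits a token per note-to-note transition, while d4 emits a token per run of same-direction movement. The handling of repeated notes (a `same` token in the discrete case, ignored in the continuous case) is an assumption, and both function names are hypothetical.

```python
from itertools import groupby


def contour_discrete(pitches):
    """One token per adjacent pair of notes."""
    tokens = []
    for prev, cur in zip(pitches, pitches[1:]):
        tokens.append("up" if cur > prev else "down" if cur < prev else "same")
    return tokens


def contour_continuous(pitches):
    """One token per monotonic run: consecutive equal directions collapse."""
    moving = [t for t in contour_discrete(pitches) if t != "same"]
    return [token for token, _ in groupby(moving)]


melody = [60, 62, 64, 63, 61, 65]
print(contour_discrete(melody))    # ['up', 'up', 'down', 'down', 'up']
print(contour_continuous(melody))  # ['up', 'down', 'up']
```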

Category E — Robustness

ID	Script	What it tests
e1	`audio_effects`	Pitch under effects like high-pass/low-pass filtering, distortion, reverb, chorus
e2	`background`	Pitch over real-world backgrounds (crowd, rain, bells, street) at varying loudness levels
e3	`harmonic_saturation`	Pitch with harmonic saturation / overdrive
e4	`time_stretching`	Pitch with time-stretching (speed change without pitch shift)
e5	`vibrato`	Pitch with vibrato at varying rates and depths
e6	`slightly_off`	Detuned tones (up to 45 % of a semitone)
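
A detuning of 45% of a semitone equals 45 cents, since a semitone spans 100 cents. The sketch below shows the standard equal-temperament arithmetic for such an offset. The function names are illustrative and say nothing about how e6 builds its stimuli.

```python
import math


def detuned_hz(midi: float, cents: float, a4: float = 440.0) -> float:
    """Frequency of a MIDI note shifted by a number of cents."""
    return a4 * 2 ** ((midi - 69 + cents / 100) / 12)


def cents_between(f_ref: float, f: float) -> float:
    """Signed distance from f_ref to f in cents (1200 cents per octave)."""
    return 1200 * math.log2(f / f_ref)


sharp_a4 = detuned_hz(69, 45)           # A4 raised by 45 cents, about 451.6 Hz
offset = cents_between(440.0, sharp_a4)  # recovers roughly 45 cents
```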

Category F — Polyphony

ID	Script	What it tests
f1	`melodic_line_atonal`	Transcribe one designated voice from 2–3 simultaneous synthetic voices (atonal)
f2	`melodic_line_tonal`	Same task over tonal melodic material from Bach chorales

Category Y — Format variants

ID	Script	What it tests
y1	`single_pitch_id_mcq`	MCQ variant of a1: pick the correct pitch from labelled options

Stimulus engine

Waveforms (always available): sine, sawtooth, square, triangle
GM instruments (FluidSynth): piano, electric_keyboard, guitar, flute, trumpet, trombone, clarinet, oboe, violin, cello, organ, bass, synth_lead, synth_pad, voice
Backgrounds (e2): white_noise synthesised; real recordings in data/preloaded/background/<name>.mp3
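
The four basic waveforms can be generated without FluidSynth because each is a simple function of phase. The sketch below uses naive (not band-limited) textbook definitions in pure Python. It is an illustration only; the synthesis engine in the package may use different shapes, anti-aliasing, or envelopes.

```python
import math


def waveform_sample(kind: str, phase: float) -> float:
    """Amplitude in [-1, 1] for a phase measured in cycles."""
    phase %= 1.0
    if kind == "sine":
        return math.sin(2 * math.pi * phase)
    if kind == "sawtooth":
        return 2 * phase - 1               # ramps from -1 up to +1
    if kind == "square":
        return 1.0 if phase < 0.5 else -1.0
    if kind == "triangle":
        return 1 - 4 * abs(phase - 0.5)    # -1 at phase 0, +1 at phase 0.5
    raise ValueError(f"unknown waveform: {kind!r}")


def render(kind: str, freq_hz: float, seconds: float, sample_rate: int = 44100):
    """Return a list of float samples for a steady tone."""
    n = round(seconds * sample_rate)
    return [waveform_sample(kind, freq_hz * i / sample_rate) for i in range(n)]
```

A call such as `render("square", 440.0, 1.0)` returns 44100 samples that could then be written to a WAV file with the standard `wave` module.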

Repository layout

src/pitchbench/
  config.py                    # runtime constants and env wiring
  configs/
    benchmark_config.py        # paper-locked data-gen parameters
    analysis_config.py         # analysis-mode presets/overrides
    plot_config.py             # labels/colors for figures
    user_config.py             # user-overridable settings
  sound/
    engine.py                  # central audio synthesis engine
  model/
    query.py                   # ALM query facade (local/OpenRouter/DashScope)
    dispatcher.py              # provider-aware request dispatch
    cost.py                    # API usage/cost tracking
  experiments/
    run.py                     # pitchbench CLI
    scripts/                   # one .py per experiment (32 total)
    helpers/
      cat_a.py ... cat_f.py    # per-category generate + evaluate helpers
      audit.py                 # condition/result audit utilities
      data.py                  # Parquet I/O
      music.py                 # pitch/notation conversion helpers
      plots.py                 # experiment plotting helpers
      results.py               # result writers, aggregators, summary CSVs
      sampling.py              # stratified sampling
      setup.py                 # experiment setup helpers
      timing_layout.py         # timing-grid utilities
  analysis/
    analyze_a1.py              # a1 line plots + heatmaps
    analyze.py                 # core analysis pipeline
    a1.py                      # alternate A1 analysis entrypoint
    ablation.py                # ablation summaries
    combine.py                 # combine multi-run CSV outputs
    overview.py                # overview plots/tables
    run_analysis.py            # batch analysis CLI
data/
  preloaded/                   # background recordings (gitignored)
  generated/                   # created by `pitchbench generate`
results/                       # created by evaluate/analyze runs

Set PITCHBENCH_ROOT to override the project root for data/ and results/.

Environment variables

Variable	Default	Purpose
`PITCHBENCH_ROOT`	`.`	Project root (data/, results/)
`PITCHBENCH_SF2`	auto-discovered	GM soundfont path override
`PITCHBENCH_LOCAL_URL`	`http://localhost:8001`	Default local model server
`OPENROUTER_KEY`	—	OpenRouter API key
`DASHSCOPE_API_KEY`	none	DashScope API key
`PITCHBENCH_CONCURRENCY_OPENROUTER`	`20`	Max parallel OpenRouter calls
`PITCHBENCH_CONCURRENCY_DASHSCOPE`	`10`	Max parallel DashScope calls
`LOCAL_CONCURRENCY`	`2`	Max parallel local server calls
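
As a sketch of how the defaults in this table could be applied, the snippet below reads each variable and falls back to the listed default. The `Settings` dataclass and `load_settings` function are hypothetical names for illustration; the package's own configuration code may be organised differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    root: str
    local_url: str
    concurrency_openrouter: int
    concurrency_dashscope: int
    concurrency_local: int


def load_settings(env=None) -> Settings:
    """Read the documented variables, using the table's defaults when unset."""
    env = os.environ if env is None else env
    return Settings(
        root=env.get("PITCHBENCH_ROOT", "."),
        local_url=env.get("PITCHBENCH_LOCAL_URL", "http://localhost:8001"),
        concurrency_openrouter=int(env.get("PITCHBENCH_CONCURRENCY_OPENROUTER", "20")),
        concurrency_dashscope=int(env.get("PITCHBENCH_CONCURRENCY_DASHSCOPE", "10")),
        concurrency_local=int(env.get("LOCAL_CONCURRENCY", "2")),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.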

Tests

uv run pytest

77 tests covering CLI resolution, Parquet I/O, result aggregation, sampling, and API routing. No network calls or audio generation required.

License

MIT — see LICENSE.

Release history

0.1.6 (May 12, 2026)
0.1.5 (May 12, 2026), this version