Audio → instrument-aware caption for AI music generation (ACE-Step, Suno, Udio prompts)

Project description

wav2caption


Audio → instrument-aware caption for AI music generation. A Suno / Udio / ACE-Step prompt generator that describes what is playing *and* how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

Point it at a WAV/MP3/FLAC file and get back a structured analysis plus a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model. A typical caption looks like this:

live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section

Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).
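For orientation, the spectral centroid mentioned above is the amplitude-weighted mean frequency of a frame's magnitude spectrum. wav2caption gets it from Essentia; the pure-Python sketch below (naive DFT, illustration only) just shows what the number means:

```python
import math

def spectral_centroid(frame: list[float], sample_rate: int) -> float:
    """Amplitude-weighted mean frequency of a frame's magnitude spectrum.

    Uses a naive O(n^2) DFT over the first n/2 bins; real analyzers use an FFT.
    """
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        freqs.append(k * sample_rate / n)       # bin k maps to k * sr / n Hz
        mags.append(math.hypot(re, im))
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0
```

A pure sine lands its centroid on its own frequency; noisy or cymbal-heavy frames pull it upward, which is what the "bright=…Hz" figures in the section output reflect.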


Why this exists

Most "audio → tag" tools stop at a flat list of instruments. When you feed that into a prompt-conditioned music model, the arrangement gets lost — instruments are named but their role is missing, and dynamics are dropped entirely. wav2caption was factored out of a production pipeline that captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and it keeps two things other tools don't:

  • Role grouping. drums and bass are not just instruments; they are the rhythm and bass roles. A section that also has strings + brass gets tagged as "string section, brass section" rather than five indistinguishable labels.
  • Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", "metallic percussion accents" — the kind of descriptors music LLMs actually condition on.

Install

pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"

Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. The wav2caption code itself is Apache-2.0.

Models

The pretrained weights are not bundled (they are CC-BY-NC-SA 4.0 and non-commercial). Download them once, then verify the SHA-256 digests:

mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb

# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1  discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a  mtg_jamendo_instrument-discogs-effnet-1.pb
EOF

The same check is available programmatically:

from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())

or automatically on every analyze(...) call by setting WAV2CAPTION_VERIFY_DIGESTS=1 in your environment.

⚠️ Supply-chain note. The .pb files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.

Or point WAV2CAPTION_MODELS_DIR (or --models-dir) at an existing folder.
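If you prefer not to depend on the library's own helper, the digest check is plain hashlib streaming. A standalone sketch, with EXPECTED mirroring the listing above:

```python
import hashlib
from pathlib import Path

# Expected digests, keyed by filename (values from the listing above).
EXPECTED = {
    "discogs-effnet-bs64-1.pb":
        "3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1",
    "mtg_jamendo_instrument-discogs-effnet-1.pb":
        "2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a",
}

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large .pb graphs fit in constant memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def check_models(models_dir: Path) -> None:
    """Raise if any model file is missing or its digest does not match."""
    for name, expected in EXPECTED.items():
        actual = sha256_of(models_dir / name)
        if actual != expected:
            raise ValueError(f"{name}: digest mismatch ({actual})")
```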

Quick start

CLI

wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5

Example output

On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:

=== song.wav ===
duration: 3:32  tempo: 132.9 BPM  key: Eb major (conf 0.87)  danceability: 1.10

[ detected instruments ]
  drums                0.402  ################
  electricguitar       0.308  ############
  bass                 0.286  ###########
  guitar               0.274  ##########
  piano                0.222  ########
  acousticguitar       0.177  #######
  synthesizer          0.176  #######
  violin               0.126  #####
  ...

[ role scores ]
  rhythm             0.468
  acoustic_guitar    0.450
  harmony            0.377
  lead_guitar        0.308
  bass               0.286
  strings            0.219
  synth              0.176
  brass              0.118
  vocal              0.067
  woodwind           0.061

[ sections ]
  0:20-0:30  loud=1301  bright=1019Hz  Eb major
    roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
    features: metallic percussion accents, string harmonies, brass accents
  0:30-0:40  loud=1224  bright=1278Hz  Eb major
    roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
    features: metallic percussion accents, staccato stabs

[ caption ]
  live drums, electric guitar, piano, bass, string section, acoustic guitar,
  Eb major, 133 BPM, dynamic build-up, breakdown section

Python

from wav2caption import analyze, build_caption

result = analyze("song.wav")
print(build_caption(result))

for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s  {roles}  {s.features}")

AnalysisResult is a typed dataclass:

@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]   # (label, probability)
    role_scores: dict[str, float]                   # aggregated per role
    sections: list[Section]
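Because it is a plain dataclass, serializing a result for a downstream prompt pipeline is mostly dataclasses.asdict. The sketch below uses a simplified stand-in (sections and several fields omitted), not the library's real class:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

# Simplified stand-in for wav2caption's AnalysisResult, for illustration only.
@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str
    key_confidence: float
    detected_instruments: list[tuple[str, float]] = field(default_factory=list)

def to_json(result: AnalysisResult) -> str:
    """Dump the dataclass as JSON; Path needs converting by hand."""
    payload = asdict(result)
    payload["path"] = str(payload["path"])  # Path is not JSON-serializable
    return json.dumps(payload, indent=2)
```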

Role taxonomy

role             instruments
rhythm           drums, drummachine, beat, percussion, bongo
bass             bass, acousticbassguitar, doublebass
harmony          piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion
lead_guitar      electricguitar
acoustic_guitar  acousticguitar, classicalguitar, guitar
strings          strings, violin, viola, cello, orchestra
brass            brass, trumpet, trombone, horn, saxophone
woodwind         flute, clarinet, oboe
synth            synthesizer, pad, sampler, computer
bells            bell, harp, harmonica
vocal            voice

The mapping is intentionally opinionated and biased toward production arrangement labels rather than strict orchestration (e.g. guitar goes to acoustic_guitar because the MTG-Jamendo label is ambiguous and the acoustic interpretation is safer for caption conditioning). Override ROLE_MAP if you disagree — it's just a dict[str, tuple[str, ...]].

Project layout

src/wav2caption/
    __init__.py       # public API
    analyzer.py       # analyze() + build_caption() + dataclasses
    constants.py      # INSTRUMENTS, ROLE_MAP, get_role()
    models.py         # model-path discovery
    cli.py            # wav2caption console script
tests/                # no-Essentia unit tests

Development

git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src

The unit tests intentionally do not require Essentia, so CI stays fast and free of TensorFlow. Real-audio smoke tests belong in examples/.
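If you do add an optional Essentia-backed test, one cheap guard is to probe for the module without importing TensorFlow at all. This is a generic pattern, not necessarily what this repo ships:

```python
import importlib.util

def has_essentia() -> bool:
    """True when the optional AGPL-licensed Essentia backend is importable."""
    return importlib.util.find_spec("essentia") is not None
```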

License

  • Source code: Apache 2.0 (see LICENSE).
  • Runtime dep Essentia: AGPL-3.0 (opt-in via pip install "wav2caption[essentia]").
  • Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).

Full third-party notices: NOTICE.md.

If you need a commercial pipeline you will have to either license Essentia from MTG-UPF or swap in a different backend. The Apache-2.0-licensed code in this repo is backend-agnostic enough that a torch / onnxruntime port is straightforward — PRs welcome.
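One possible shape for that seam (purely hypothetical; wav2caption defines no such interface today) is a structural Protocol that any classifier backend can satisfy:

```python
from typing import Protocol

class InstrumentBackend(Protocol):
    """Hypothetical seam: map raw samples to per-label instrument probabilities."""
    def predict(self, samples: list[float], sample_rate: int) -> dict[str, float]: ...

class SilenceBackend:
    """Trivial stand-in that satisfies the protocol and reports nothing playing."""
    LABELS = ("drums", "bass", "piano")  # truncated label set for illustration

    def predict(self, samples: list[float], sample_rate: int) -> dict[str, float]:
        return {label: 0.0 for label in self.LABELS}

def top_instruments(backend: InstrumentBackend, samples: list[float],
                    sample_rate: int = 44100) -> list[tuple[str, float]]:
    """Sort backend scores the way detected_instruments is presented."""
    scores = backend.predict(samples, sample_rate)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A torch or onnxruntime port would then only need to implement predict; the role grouping and captioning on top stay untouched.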

Project details


Download files


Source Distribution

wav2caption-0.1.2.tar.gz (26.4 kB)

Uploaded Source

Built Distribution


wav2caption-0.1.2-py3-none-any.whl (25.7 kB)

Uploaded Python 3

File details

Details for the file wav2caption-0.1.2.tar.gz.

File metadata

  • Download URL: wav2caption-0.1.2.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wav2caption-0.1.2.tar.gz
Algorithm Hash digest
SHA256 0940f40fa2bbc99e7d0ccf3c7d1a1c82101a7b7a4cc3634ac94be85c1891da99
MD5 5a0e97afaf8fea1035c580af7571c9da
BLAKE2b-256 f24f8a7c2d1f677978a5a2aa63e4b9c60e22aa1617a455733f60291121e43b23


Provenance

The following attestation bundles were made for wav2caption-0.1.2.tar.gz:

Publisher: release.yml on hinanohart/wav2caption

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file wav2caption-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: wav2caption-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 25.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for wav2caption-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 38639b68a7d18586f22be37bfec2d675489901b5aefab4b119e95047d5e456b2
MD5 4ffbbdcad5cbc4f38901067a0874bb9c
BLAKE2b-256 448718ff298630cd25d36451d08ae10c43d3c3f11b50b6d0dccce10f2ffcd775


Provenance

The following attestation bundles were made for wav2caption-0.1.2-py3-none-any.whl:

Publisher: release.yml on hinanohart/wav2caption

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
