Audio → instrument-aware caption for AI music generation (ACE-Step, Suno, Udio prompts)
wav2caption
Audio → instrument-aware caption for AI music generation. A Suno / Udio / ACE-Step prompt generator that describes what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).
Point it at a WAV/MP3/FLAC file and get back a structured analysis and a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model.
live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, 4/4, dynamic build-up, breakdown section
Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).
Why this exists
Most "audio → tag" tools stop at a flat list of instruments. When you feed
that into a prompt-conditioned music model, the arrangement gets lost —
instruments are named but their role is missing, and dynamics are dropped
entirely. wav2caption was factored out of a production pipeline that
captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and
it keeps two things other tools don't:
- Role grouping. drums and bass are not just instruments; they are the rhythm and bass roles. A section that also has strings + brass gets tagged as "string section, brass section" rather than five indistinguishable labels.
- Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", "metallic percussion accents": the kind of descriptors music LLMs actually condition on.
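To make the section-feature idea concrete, here is a minimal sketch of how per-window loudness could be turned into the "quiet" and "peak energy" descriptors quoted above. It is not wav2caption's implementation: the function name `label_sections` and both ratio thresholds are hypothetical, chosen only for illustration.

```python
from statistics import median


def label_sections(loudness, quiet_ratio=0.6, peak_ratio=1.3):
    """Label each window relative to the track's median loudness.

    quiet_ratio and peak_ratio are hypothetical thresholds picked for
    this sketch; they are not taken from wav2caption.
    """
    if not loudness:
        return []
    reference = median(loudness)
    labels = []
    for value in loudness:
        if value < reference * quiet_ratio:
            labels.append("quiet (breakdown/interlude)")
        elif value > reference * peak_ratio:
            labels.append("peak energy (chorus/climax)")
        else:
            labels.append("")  # nothing notable about this window
    return labels


# Five made-up loudness values, one per window.
for label in label_sections([400.0, 1000.0, 1050.0, 1600.0, 980.0]):
    print(repr(label))
```

Comparing against the track's own median rather than a fixed level keeps the labels relative, so a quietly mastered track can still have a "peak energy" section.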
Install
pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"
Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. The wav2caption code itself is Apache-2.0.
Models
The pretrained weights are not bundled (they are CC-BY-NC-SA 4.0 and non-commercial). Download them once, then verify the SHA-256 digests:
mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb
# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1 discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a mtg_jamendo_instrument-discogs-effnet-1.pb
EOF
The same check is available programmatically:
from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())
or automatically on every analyze(...) call by setting
WAV2CAPTION_VERIFY_DIGESTS=1 in your environment.
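For readers who want to see what such a check involves, the following is a standard-library sketch of SHA-256 verification. It is not the body of verify_digests; the helper names `sha256_of` and `check_digests` are hypothetical, and the expected digests are the two listed above.

```python
import hashlib
from pathlib import Path

# Digests copied from the sha256sum listing above.
EXPECTED = {
    "discogs-effnet-bs64-1.pb":
        "3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1",
    "mtg_jamendo_instrument-discogs-effnet-1.pb":
        "2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_digests(models_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or do not match."""
    bad = []
    for name, want in expected.items():
        path = models_dir / name
        if not path.is_file() or sha256_of(path) != want.lower():
            bad.append(name)
    return bad


if __name__ == "__main__":
    models_dir = Path.home() / ".cache" / "wav2caption" / "models"
    failures = check_digests(models_dir, EXPECTED)
    print("all digests match" if not failures else f"mismatch: {failures}")
```

A missing file is reported the same way as a mismatched one, so a partial download cannot slip through as "nothing to check".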
⚠️ Supply-chain note. The .pb files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.
If the models already live somewhere else, point WAV2CAPTION_MODELS_DIR (or --models-dir) at that folder instead of downloading them again.
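One way such a lookup could be ordered is sketched below. The precedence (command-line value, then environment variable, then the cache folder used in the download commands) is an assumption, and `pick_models_dir` is a hypothetical helper, not wav2caption's resolve_models.

```python
import os
from pathlib import Path

# Same folder as in the download commands above.
DEFAULT_MODELS_DIR = Path.home() / ".cache" / "wav2caption" / "models"


def pick_models_dir(cli_value=None, env=None):
    """Pick a models folder: CLI value, then env var, then the default.

    The ordering is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = env.get("WAV2CAPTION_MODELS_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_MODELS_DIR


print(pick_models_dir())
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.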
Quick start
CLI
wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5
Example output
On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:
=== song.wav ===
duration: 3:32 tempo: 132.9 BPM key: Eb major (conf 0.87) danceability: 1.10
[ detected instruments ]
drums 0.402 ################
electricguitar 0.308 ############
bass 0.286 ###########
guitar 0.274 ##########
piano 0.222 ########
acousticguitar 0.177 #######
synthesizer 0.176 #######
violin 0.126 #####
...
[ role scores ]
rhythm 0.468
acoustic_guitar 0.450
harmony 0.377
lead_guitar 0.308
bass 0.286
strings 0.219
synth 0.176
brass 0.118
vocal 0.067
woodwind 0.061
[ sections ]
0:20-0:30 loud=1301 bright=1019Hz Eb major
roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
features: metallic percussion accents, string harmonies, brass accents
0:30-0:40 loud=1224 bright=1278Hz Eb major
roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
features: metallic percussion accents, staccato stabs
[ caption ]
live drums, electric guitar, piano, bass, string section, acoustic guitar,
Eb major, 133 BPM, 4/4, dynamic build-up, breakdown section
Python
from wav2caption import analyze, build_caption
result = analyze("song.wav")
print(build_caption(result))
for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s {roles} {s.features}")
AnalysisResult is a typed dataclass:
@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]  # (label, probability)
    role_scores: dict[str, float]  # aggregated per role
    sections: list[Section]
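Because the fields are plain Python types, custom prompt fragments can be assembled without going through build_caption. The sketch below works on bare values shaped like the fields above so that it runs without Essentia; `tempo_key_tags`, `top_instruments`, and the 0.2 cut-off are hypothetical and are not part of the wav2caption API.

```python
def tempo_key_tags(key: str, scale: str, bpm: float) -> str:
    """Format key, scale, and tempo the way the sample caption shows them."""
    return f"{key} {scale}, {round(bpm)} BPM"


def top_instruments(detected, min_prob=0.2):
    """Keep labels at or above a (hypothetical) probability cut-off."""
    return [label for label, prob in detected if prob >= min_prob]


# Values taken from the example output above.
detected = [
    ("drums", 0.402),
    ("electricguitar", 0.308),
    ("bass", 0.286),
    ("guitar", 0.274),
    ("piano", 0.222),
    ("acousticguitar", 0.177),
]

print(tempo_key_tags("Eb", "major", 132.9))  # Eb major, 133 BPM
print(top_instruments(detected))
```

With a real result, the same helpers would take result.key, result.scale, result.bpm, and result.detected_instruments.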
Role taxonomy
| role | instruments |
|---|---|
| rhythm | drums, drummachine, beat, percussion, bongo |
| bass | bass, acousticbassguitar, doublebass |
| harmony | piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion |
| lead_guitar | electricguitar |
| acoustic_guitar | acousticguitar, classicalguitar, guitar |
| strings | strings, violin, viola, cello, orchestra |
| brass | brass, trumpet, trombone, horn, saxophone |
| woodwind | flute, clarinet, oboe |
| synth | synthesizer, pad, sampler, computer |
| bells | bell, harp, harmonica |
| vocal | voice |
The mapping is intentionally opinionated and biased toward production
arrangement labels rather than strict orchestration (e.g. guitar goes to
acoustic_guitar because the MTG-Jamendo label is ambiguous and the
acoustic interpretation is safer for caption conditioning). Override
ROLE_MAP if you disagree — it's just a dict[str, tuple[str, ...]].
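As an illustration of what an override could look like, the sketch below rebuilds the table as a local dict of the stated type and moves guitar to the lead_guitar role. It does not import wav2caption. The helpers `role_of` and `sum_by_role` are hypothetical, and summing probabilities per role is an assumption: the README does not say how role_scores are aggregated.

```python
# Local copy of the taxonomy table above (role -> instrument labels).
ROLE_MAP: dict[str, tuple[str, ...]] = {
    "rhythm": ("drums", "drummachine", "beat", "percussion", "bongo"),
    "bass": ("bass", "acousticbassguitar", "doublebass"),
    "harmony": ("piano", "electricpiano", "keyboard", "rhodes", "organ",
                "pipeorgan", "accordion"),
    "lead_guitar": ("electricguitar",),
    "acoustic_guitar": ("acousticguitar", "classicalguitar", "guitar"),
    "strings": ("strings", "violin", "viola", "cello", "orchestra"),
    "brass": ("brass", "trumpet", "trombone", "horn", "saxophone"),
    "woodwind": ("flute", "clarinet", "oboe"),
    "synth": ("synthesizer", "pad", "sampler", "computer"),
    "bells": ("bell", "harp", "harmonica"),
    "vocal": ("voice",),
}


def role_of(label, role_map=ROLE_MAP):
    """Return the role an instrument label belongs to, or None."""
    for role, labels in role_map.items():
        if label in labels:
            return role
    return None


def sum_by_role(detected, role_map=ROLE_MAP):
    """Sum instrument probabilities per role (an assumed aggregation)."""
    scores = {}
    for label, prob in detected:
        role = role_of(label, role_map)
        if role is not None:
            scores[role] = scores.get(role, 0.0) + prob
    return scores


# Override: treat the ambiguous "guitar" label as a lead guitar.
custom = dict(ROLE_MAP)
custom["acoustic_guitar"] = ("acousticguitar", "classicalguitar")
custom["lead_guitar"] = ("electricguitar", "guitar")

detected = [("drums", 0.402), ("guitar", 0.274), ("acousticguitar", 0.177)]
print(sum_by_role(detected))
print(sum_by_role(detected, custom))
```

Copying the dict before editing leaves the default mapping untouched, so both interpretations can be compared on the same track.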
Project layout
src/wav2caption/
    __init__.py    # public API
    analyzer.py    # analyze() + build_caption() + dataclasses
    constants.py   # INSTRUMENTS, ROLE_MAP, get_role()
    models.py      # model-path discovery
    cli.py         # wav2caption console script
tests/             # no-Essentia unit tests
Development
git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src
The unit tests intentionally do not require Essentia, so CI stays fast
and free of TensorFlow. Real-audio smoke tests belong in examples/.
License
- Source code: Apache 2.0 (see LICENSE).
- Runtime dep Essentia: AGPL-3.0 (opt-in via pip install "wav2caption[essentia]").
- Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).
Full third-party notices: NOTICE.md.
If you need a commercial pipeline you will have to either license Essentia
from MTG-UPF or swap in a different backend. The Apache-2.0-licensed code
in this repo is backend-agnostic enough that a torch / onnxruntime
port is straightforward — PRs welcome.