Audio → instrument-aware caption for AI music generation (ACE-Step, Suno, Udio prompts)

wav2caption

Audio → instrument-aware caption for AI music generation: a Suno / Udio / ACE-Step prompt generator that describes what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

Point it at a WAV/MP3/FLAC file and get back a structured analysis and a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model.

Disclaimer: This is an independent third-party tool. It is not affiliated with, endorsed by, or sponsored by Suno, Udio, ACE-Step, Essentia, MTG-Jamendo, or Discogs. Those names appear nominatively to identify the downstream prompt formats and upstream models / datasets this tool integrates with. The pretrained model weights are not bundled and keep their original CC-BY-NC-SA 4.0 license (see Models); users are responsible for verifying that audio inputs they analyse are properly licensed.

Example caption:

live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section

Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).


Why this exists

Most "audio → tag" tools stop at a flat list of instruments. When you feed that into a prompt-conditioned music model, the arrangement gets lost — instruments are named but their role is missing, and dynamics are dropped entirely. wav2caption was factored out of a production pipeline that captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and it keeps two things other tools don't:

  • Role grouping. drums and bass are not just instruments; they are the rhythm and bass roles. A section that also has strings + brass gets tagged as "string section, brass section" rather than five indistinguishable labels.
  • Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", "metallic percussion accents" — the kind of descriptors music LLMs actually condition on.

Install

pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"

Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. The wav2caption code itself is MIT.

Models

The pretrained weights are not bundled (they are CC-BY-NC-SA 4.0 and non-commercial). Download them once, then verify the SHA-256 digests:

mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb

# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1  discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a  mtg_jamendo_instrument-discogs-effnet-1.pb
EOF

The same check is available programmatically:

from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())

To run the same check automatically on every analyze(...) call, set WAV2CAPTION_VERIFY_DIGESTS=1 in your environment.

⚠️ Supply-chain note. The .pb files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.
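
Where sha256sum is not available (for example on Windows), the same check can be sketched with the Python standard library. This is a minimal illustration, not wav2caption's own verify_digests implementation: the helper names sha256_of and find_mismatches are hypothetical, and the digests are the ones from the sha256sum listing above.

```python
import hashlib
from pathlib import Path

# Digests copied from the sha256sum listing in this README.
EXPECTED = {
    "discogs-effnet-bs64-1.pb":
        "3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1",
    "mtg_jamendo_instrument-discogs-effnet-1.pb":
        "2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(models_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the file names that are missing or whose digest differs."""
    bad = []
    for name, want in expected.items():
        path = models_dir / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad


models_dir = Path.home() / ".cache" / "wav2caption" / "models"
problems = find_mismatches(models_dir, EXPECTED)
print("OK" if not problems else f"missing or mismatched: {problems}")
```

Treating a missing file the same as a mismatched one keeps the check fail-closed: anything other than an exact digest match is reported.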

If you already have the model files in another folder, point WAV2CAPTION_MODELS_DIR (or --models-dir) at it instead.

Quick start

CLI

wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5

Example output

On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:

=== song.wav ===
duration: 3:32  tempo: 132.9 BPM  key: Eb major (conf 0.87)  danceability: 1.10

[ detected instruments ]
  drums                0.402  ################
  electricguitar       0.308  ############
  bass                 0.286  ###########
  guitar               0.274  ##########
  piano                0.222  ########
  acousticguitar       0.177  #######
  synthesizer          0.176  #######
  violin               0.126  #####
  ...

[ role scores ]
  rhythm             0.468
  acoustic_guitar    0.450
  harmony            0.377
  lead_guitar        0.308
  bass               0.286
  strings            0.219
  synth              0.176
  brass              0.118
  vocal              0.067
  woodwind           0.061

[ sections ]
  0:20-0:30  loud=1301  bright=1019Hz  Eb major
    roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
    features: metallic percussion accents, string harmonies, brass accents
  0:30-0:40  loud=1224  bright=1278Hz  Eb major
    roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
    features: metallic percussion accents, staccato stabs

[ caption ]
  live drums, electric guitar, piano, bass, string section, acoustic guitar,
  Eb major, 133 BPM, dynamic build-up, breakdown section

Python

from wav2caption import analyze, build_caption

result = analyze("song.wav")
print(build_caption(result))

for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s  {roles}  {s.features}")
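
For captioning a whole folder of reference tracks, a small wrapper along these lines may be enough. It is a sketch under stated assumptions: caption_folder and AUDIO_SUFFIXES are hypothetical names, the sidecar .txt layout is just one common convention, and the analysis functions are passed in as arguments so the helper itself can be exercised without Essentia installed.

```python
from pathlib import Path
from typing import Any, Callable

# File types named in this README; extend as needed.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def caption_folder(
    folder: Path,
    analyze: Callable[[str], Any],
    build_caption: Callable[[Any], str],
) -> dict[str, str]:
    """Write a <name>.txt caption next to each audio file; return captions by file name."""
    captions: dict[str, str] = {}
    # sorted() materialises the listing first, so the .txt files written
    # below do not affect the iteration.
    for audio in sorted(folder.iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        caption = build_caption(analyze(str(audio)))
        audio.with_suffix(".txt").write_text(caption + "\n", encoding="utf-8")
        captions[audio.name] = caption
    return captions


# Usage with the real library (needs wav2caption[essentia] and the models);
# "refs" is a placeholder folder name:
#   from wav2caption import analyze, build_caption
#   caption_folder(Path("refs"), analyze, build_caption)
```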

AnalysisResult is a typed dataclass:

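from __future__ import annotations  # allows list[Section] before Section is defined

from dataclasses import dataclass
from pathlib import Path

# Section is the per-window dataclass; per the project layout it lives in analyzer.py.

@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]   # (label, probability)
    role_scores: dict[str, float]                   # aggregated per role
    sections: list[Section]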
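
As a hedged illustration of consuming these fields, the sketch below rebuilds the "Eb major, 133 BPM" tail of a caption and filters instruments by probability. It uses a small stand-in dataclass (TrackSummary, a hypothetical name) rather than the real AnalysisResult so that it runs without Essentia. It assumes key holds only the tonic ("Eb") with the mode in scale, and the 0.2 threshold is an arbitrary choice, not a wav2caption default.

```python
from dataclasses import dataclass


@dataclass
class TrackSummary:
    """Hypothetical stand-in carrying a subset of the AnalysisResult fields."""
    bpm: float
    key: str
    scale: str
    detected_instruments: list[tuple[str, float]]


def key_tempo_line(t: TrackSummary) -> str:
    # Round the tempo to a whole BPM, as the example caption does.
    return f"{t.key} {t.scale}, {round(t.bpm)} BPM"


def confident_instruments(t: TrackSummary, threshold: float = 0.2) -> list[str]:
    # Keep labels whose probability meets the (illustrative) threshold.
    return [label for label, prob in t.detected_instruments if prob >= threshold]


track = TrackSummary(
    bpm=132.9,
    key="Eb",
    scale="major",
    detected_instruments=[("drums", 0.402), ("electricguitar", 0.308), ("violin", 0.126)],
)
print(key_tempo_line(track))         # Eb major, 133 BPM
print(confident_instruments(track))  # ['drums', 'electricguitar']
```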

Role taxonomy

role             instruments
---------------  ----------------------------------------------------------------
rhythm           drums, drummachine, beat, percussion, bongo
bass             bass, acousticbassguitar, doublebass
harmony          piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion
lead_guitar      electricguitar
acoustic_guitar  acousticguitar, classicalguitar, guitar
strings          strings, violin, viola, cello, orchestra
brass            brass, trumpet, trombone, horn, saxophone
woodwind         flute, clarinet, oboe
synth            synthesizer, pad, sampler, computer
bells            bell, harp, harmonica
vocal            voice

The mapping is intentionally opinionated and biased toward production arrangement labels rather than strict orchestration (e.g. guitar goes to acoustic_guitar because the MTG-Jamendo label is ambiguous and the acoustic interpretation is safer for caption conditioning). Override ROLE_MAP if you disagree — it's just a dict[str, tuple[str, ...]].
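
A sketch of what such an override might look like. It works on a local excerpt of the role table rather than importing wav2caption, and it assumes the role-to-instruments direction suggested by the dict[str, tuple[str, ...]] type. The helper names move_instrument and role_scores are hypothetical. Summing probabilities per role is also an assumption, although it is consistent with the example output (guitar 0.274 + acousticguitar 0.177 ≈ acoustic_guitar 0.450).

```python
# Local excerpt of the role table; per the project layout the real
# ROLE_MAP lives in constants.py.
ROLE_MAP: dict[str, tuple[str, ...]] = {
    "lead_guitar": ("electricguitar",),
    "acoustic_guitar": ("acousticguitar", "classicalguitar", "guitar"),
    "bass": ("bass", "acousticbassguitar", "doublebass"),
}


def move_instrument(
    role_map: dict[str, tuple[str, ...]], instrument: str, new_role: str
) -> dict[str, tuple[str, ...]]:
    """Return a copy of role_map with `instrument` reassigned to `new_role`."""
    updated = {
        role: tuple(i for i in instruments if i != instrument)
        for role, instruments in role_map.items()
    }
    updated[new_role] = updated.get(new_role, ()) + (instrument,)
    return updated


def role_scores(
    role_map: dict[str, tuple[str, ...]], detected: list[tuple[str, float]]
) -> dict[str, float]:
    """Sum instrument probabilities per role (assumed aggregation rule)."""
    probs = dict(detected)
    return {
        role: round(sum(probs.get(i, 0.0) for i in instruments), 3)
        for role, instruments in role_map.items()
    }


detected = [
    ("electricguitar", 0.308),
    ("bass", 0.286),
    ("guitar", 0.274),
    ("acousticguitar", 0.177),
]

print(role_scores(ROLE_MAP, detected))

# Treat the ambiguous "guitar" label as electric instead of acoustic.
rock_map = move_instrument(ROLE_MAP, "guitar", "lead_guitar")
print(role_scores(rock_map, detected))
```

Building a modified copy instead of mutating the shared dict keeps the default mapping intact for other callers.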

Project layout

src/wav2caption/
    __init__.py       # public API
    analyzer.py       # analyze() + build_caption() + dataclasses
    constants.py      # INSTRUMENTS, ROLE_MAP, get_role()
    models.py         # model-path discovery
    cli.py            # wav2caption console script
tests/                # no-Essentia unit tests

Development

git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src

The unit tests intentionally do not require Essentia, so CI stays fast and free of TensorFlow. Real-audio smoke tests belong in examples/.

Verification (sigstore)

Releases from v_next_ (released after 2026-05-16) include a sigstore keyless signature bundle (.sigstore per artifact) attached to the GitHub Release.

Verify a PyPI install

pip download <pkg-name>==<version> --no-deps -d ./verify
python -m sigstore verify github \
    --cert-identity 'https://github.com/hinanohart/wav2caption/.github/workflows/release.yml@refs/tags/v<version>' \
    --cert-oidc-issuer 'https://token.actions.githubusercontent.com' \
    ./verify/*.whl ./verify/*.tar.gz

The corresponding .sigstore bundles can be downloaded from the GitHub Release page.

Historic releases (pre-2026-05-16)

Earlier releases were published without sigstore bundles. Re-installing those versions provides no cryptographic provenance — pin to a current release if assurance matters.

License

  • Source code: MIT (see LICENSE).
  • Runtime dep Essentia: AGPL-3.0 (opt-in via pip install "wav2caption[essentia]").
  • Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).

Full third-party notices: NOTICE.md.

If you need a commercial pipeline you will have to either license Essentia from MTG-UPF or swap in a different backend. The MIT-licensed code in this repo is backend-agnostic enough that a torch / onnxruntime port is straightforward — PRs welcome.

Download files

Source Distribution

wav2caption-0.1.4.tar.gz (22.5 kB view details)

Uploaded Source

Built Distribution

wav2caption-0.1.4-py3-none-any.whl (18.8 kB view details)

Uploaded Python 3

File details

Details for the file wav2caption-0.1.4.tar.gz.

File metadata

  • Download URL: wav2caption-0.1.4.tar.gz
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for wav2caption-0.1.4.tar.gz
Algorithm Hash digest
SHA256 6ee513d34e31b66ebb7f4367c498d2fa62c61d23771f7927e8d2c4513e2689a9
MD5 c5d1c100c096f565ebdc50074eecbee9
BLAKE2b-256 5d6a6f16e72d0d15bef0632d128e4ad13dadd9af84cb567253bd1bd0bb313b3d

Provenance

The following attestation bundles were made for wav2caption-0.1.4.tar.gz:

Publisher: release.yml on hinanohart/wav2caption

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file wav2caption-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: wav2caption-0.1.4-py3-none-any.whl
  • Size: 18.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for wav2caption-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 4297d446f9c6c6d1daa2559bc5560ac1ed63d428939b235a4a413d4552212b52
MD5 fe73c70bcad6a3015d9ce4517c9305fd
BLAKE2b-256 bd8fe328b709cc4791c61102cccf8c17075561f4ae145ca7bc9bf6769f6ef39c

Provenance

The following attestation bundles were made for wav2caption-0.1.4-py3-none-any.whl:

Publisher: release.yml on hinanohart/wav2caption

