Audio → instrument-aware caption for AI music generation (ACE-Step, Suno, Udio prompts)

wav2caption

Audio → instrument-aware caption for AI music generation. A Suno / Udio / ACE-Step prompt generator that describes what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

Point it at a WAV/MP3/FLAC file and get back a structured analysis and a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model.

Disclaimer: This is an independent third-party tool. It is not affiliated with, endorsed by, or sponsored by Suno, Udio, ACE-Step, Essentia, MTG-Jamendo, or Discogs. Those names appear nominatively to identify the downstream prompt formats and upstream models / datasets this tool integrates with. The pretrained model weights are not bundled and remain under their original license (CC-BY-NC-SA 4.0; see Models and License). Users are responsible for verifying that the audio inputs they analyse are properly licensed.

Example caption:

live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section

Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

Why this exists

Most "audio → tag" tools stop at a flat list of instruments. When you feed that into a prompt-conditioned music model, the arrangement gets lost — instruments are named but their role is missing, and dynamics are dropped entirely. wav2caption was factored out of a production pipeline that captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and it keeps two things other tools don't:

- Role grouping. drums and bass are not just instruments; they are the rhythm and bass roles. A section that also has strings + brass gets tagged as "string section, brass section" rather than five indistinguishable labels.
- Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", and "metallic percussion accents", the kind of descriptors music LLMs actually condition on.
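
To make the section-feature idea concrete, here is a minimal sketch of how per-window loudness could be mapped to descriptors. It is not wav2caption's implementation: the function name `label_sections`, the "steady" label, and the 0.6 / 1.3 ratio thresholds are assumptions chosen purely for illustration.

```python
from statistics import median


def label_sections(loudness: list[float], low: float = 0.6, high: float = 1.3) -> list[str]:
    """Label each window by its loudness relative to the track median.

    The thresholds are illustrative assumptions, not values used by wav2caption.
    """
    if not loudness:
        return []
    ref = median(loudness)
    labels = []
    for value in loudness:
        # Guard against a silent track, where the median would be zero.
        ratio = value / ref if ref > 0 else 1.0
        if ratio < low:
            labels.append("quiet (breakdown/interlude)")
        elif ratio > high:
            labels.append("peak energy (chorus/climax)")
        else:
            labels.append("steady")  # hypothetical neutral label
    return labels


# Hypothetical per-window loudness values for a short track.
windows = [400.0, 1000.0, 1100.0, 1500.0, 1000.0]
labels = label_sections(windows)
for value, label in zip(windows, labels):
    print(f"{value:>7.1f}  {label}")
```

Comparing against the track's own median rather than a fixed level keeps the labels relative, so a uniformly quiet recording is not tagged as one long breakdown.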

Install

pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"

Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. The wav2caption code itself is MIT.

Models

The pretrained weights are not bundled (they are CC-BY-NC-SA 4.0 and non-commercial). Download them once, then verify the SHA-256 digests:

mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb

# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1  discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a  mtg_jamendo_instrument-discogs-effnet-1.pb
EOF

The same check is available programmatically:

from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())

or automatically on every analyze(...) call by setting WAV2CAPTION_VERIFY_DIGESTS=1 in your environment.
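
For readers who want to see what such a check involves without importing the library, the following is a stdlib-only sketch. It is not the implementation of `verify_digests`; the helper names `sha256_of` and `check_digests` are hypothetical, and the digests are copied from the `sha256sum` listing above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_digests(models_dir: Path, expected: dict[str, str]) -> list[str]:
    """Return the file names that are missing or whose digest does not match."""
    bad = []
    for name, want in expected.items():
        path = models_dir / name
        if not path.is_file() or sha256_of(path) != want.lower():
            bad.append(name)
    return bad


# Digests copied from the sha256sum listing above.
EXPECTED = {
    "discogs-effnet-bs64-1.pb":
        "3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1",
    "mtg_jamendo_instrument-discogs-effnet-1.pb":
        "2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a",
}

if __name__ == "__main__":
    failed = check_digests(Path.home() / ".cache" / "wav2caption" / "models", EXPECTED)
    print("all digests match" if not failed else f"mismatch or missing: {failed}")
```

Reading in fixed-size chunks matters here because the model files are large binary graphs; hashing them in one `read()` call would hold the whole file in memory.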

⚠️ Supply-chain note. The .pb files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.

Alternatively, if the model files already live elsewhere, point WAV2CAPTION_MODELS_DIR (or --models-dir) at that folder instead of downloading them again.
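
As an illustration of how such a lookup can work, here is a small sketch that resolves a models directory from an explicit value, the environment variable, or the cache path used above. The function `pick_models_dir` is hypothetical, and the precedence order (explicit argument first, then environment, then default) is an assumption rather than documented behaviour.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

# Default location taken from the download instructions above.
DEFAULT_MODELS_DIR = Path.home() / ".cache" / "wav2caption" / "models"


def pick_models_dir(cli_value: str | None = None,
                    env: Mapping[str, str] | None = None) -> Path:
    """Choose a models directory: explicit argument, then env var, then default.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    chosen = cli_value or env.get("WAV2CAPTION_MODELS_DIR")
    return Path(chosen).expanduser() if chosen else DEFAULT_MODELS_DIR


print(pick_models_dir(env={"WAV2CAPTION_MODELS_DIR": "/data/essentia-models"}))
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.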

Quick start

CLI

wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5

Example output

On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:

=== song.wav ===
duration: 3:32  tempo: 132.9 BPM  key: Eb major (conf 0.87)  danceability: 1.10

[ detected instruments ]
  drums                0.402  ################
  electricguitar       0.308  ############
  bass                 0.286  ###########
  guitar               0.274  ##########
  piano                0.222  ########
  acousticguitar       0.177  #######
  synthesizer          0.176  #######
  violin               0.126  #####
  ...

[ role scores ]
  rhythm             0.468
  acoustic_guitar    0.450
  harmony            0.377
  lead_guitar        0.308
  bass               0.286
  strings            0.219
  synth              0.176
  brass              0.118
  vocal              0.067
  woodwind           0.061

[ sections ]
  0:20-0:30  loud=1301  bright=1019Hz  Eb major
    roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
    features: metallic percussion accents, string harmonies, brass accents
  0:30-0:40  loud=1224  bright=1278Hz  Eb major
    roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
    features: metallic percussion accents, staccato stabs

[ caption ]
  live drums, electric guitar, piano, bass, string section, acoustic guitar,
  Eb major, 133 BPM, dynamic build-up, breakdown section
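
The [ sections ] block above reports fixed-length windows. A minimal sketch of that windowing, assuming a 10-second default (inferred from the 0:20-0:30 timestamps, not confirmed) and a shorter final window; `section_bounds` and `mmss` are hypothetical helper names, not part of the wav2caption API:

```python
def section_bounds(duration_sec: float, section_seconds: float = 10.0) -> list[tuple[float, float]]:
    """Split [0, duration_sec] into consecutive windows; the last one may be shorter."""
    if section_seconds <= 0:
        raise ValueError("section_seconds must be positive")
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + section_seconds, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds


def mmss(seconds: float) -> str:
    """Format seconds as m:ss, matching the timestamps in the example output."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"


# A 3:32 track (212 s) cut into 10-second windows; show the third and fourth.
for start, end in section_bounds(212.0)[2:4]:
    print(f"{mmss(start)}-{mmss(end)}")
```

With `--section-seconds 5` the same track would simply yield twice as many windows, which trades smoother per-section statistics for finer time resolution.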

Python

from wav2caption import analyze, build_caption

result = analyze("song.wav")
print(build_caption(result))

for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s  {roles}  {s.features}")

AnalysisResult is a typed dataclass:

@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]   # (label, probability)
    role_scores: dict[str, float]                   # aggregated per role
    sections: list[Section]
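
To show how these fields can be consumed, the sketch below builds a one-line summary from a stand-in object. `TrackSummary` and `headline` are hypothetical and only mirror a few of the fields listed above; the sketch assumes `key` holds the tonic (for example "Eb") and `scale` the mode, and it is not how `build_caption` works.

```python
from dataclasses import dataclass


@dataclass
class TrackSummary:
    """Stand-in holding a few AnalysisResult-like fields (hypothetical type)."""
    bpm: float
    key: str
    scale: str
    detected_instruments: list[tuple[str, float]]


def headline(track: TrackSummary, min_prob: float = 0.2) -> str:
    """Join the confident instrument labels with the key and the rounded tempo."""
    names = [label for label, prob in track.detected_instruments if prob >= min_prob]
    return ", ".join(names + [f"{track.key} {track.scale}", f"{round(track.bpm)} BPM"])


# Values taken from the example output above.
track = TrackSummary(
    bpm=132.9,
    key="Eb",
    scale="major",
    detected_instruments=[
        ("drums", 0.402), ("electricguitar", 0.308), ("bass", 0.286),
        ("guitar", 0.274), ("piano", 0.222), ("acousticguitar", 0.177),
    ],
)
print(headline(track))
```

The `min_prob` cut-off is an arbitrary illustration value; in practice it would be tuned to how many instruments the downstream prompt can usefully carry.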

Role taxonomy

role	instruments
`rhythm`	drums, drummachine, beat, percussion, bongo
`bass`	bass, acousticbassguitar, doublebass
`harmony`	piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion
`lead_guitar`	electricguitar
`acoustic_guitar`	acousticguitar, classicalguitar, guitar
`strings`	strings, violin, viola, cello, orchestra
`brass`	brass, trumpet, trombone, horn, saxophone
`woodwind`	flute, clarinet, oboe
`synth`	synthesizer, pad, sampler, computer
`bells`	bell, harp, harmonica
`vocal`	voice

The mapping is intentionally opinionated and biased toward production arrangement labels rather than strict orchestration (e.g. guitar goes to acoustic_guitar because the MTG-Jamendo label is ambiguous and the acoustic interpretation is safer for caption conditioning). Override ROLE_MAP if you disagree — it's just a dict[str, tuple[str, ...]].
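
The following sketch shows what an override of that shape could look like and how it changes aggregated role scores. It uses a local stand-in for a subset of the table rather than importing the library, and it assumes role scores are the sum of member-instrument probabilities; the example output is roughly consistent with that (guitar + acousticguitar is close to the reported acoustic_guitar score), but the actual aggregation rule is not stated here.

```python
# Stand-in for a subset of the table above; the real ROLE_MAP lives in wav2caption.
ROLE_MAP: dict[str, tuple[str, ...]] = {
    "rhythm": ("drums", "drummachine", "beat", "percussion", "bongo"),
    "bass": ("bass", "acousticbassguitar", "doublebass"),
    "lead_guitar": ("electricguitar",),
    "acoustic_guitar": ("acousticguitar", "classicalguitar", "guitar"),
}


def role_scores(probs: dict[str, float],
                role_map: dict[str, tuple[str, ...]]) -> dict[str, float]:
    """Sum instrument probabilities per role (summation is an assumption)."""
    return {
        role: round(sum(probs.get(name, 0.0) for name in names), 3)
        for role, names in role_map.items()
    }


# Instrument probabilities taken from the example output.
probs = {"drums": 0.402, "electricguitar": 0.308, "bass": 0.286,
         "guitar": 0.274, "acousticguitar": 0.177}

# Override: treat the ambiguous "guitar" label as a lead instrument instead.
custom = dict(ROLE_MAP)
custom["acoustic_guitar"] = ("acousticguitar", "classicalguitar")
custom["lead_guitar"] = ("electricguitar", "guitar")

default_scores = role_scores(probs, ROLE_MAP)
custom_scores = role_scores(probs, custom)
print(default_scores)
print(custom_scores)
```

The sketch passes the map explicitly and copies it before editing, which avoids mutating shared state; how the library itself picks up a modified ROLE_MAP is not covered here.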

Project layout

src/wav2caption/
    __init__.py       # public API
    analyzer.py       # analyze() + build_caption() + dataclasses
    constants.py      # INSTRUMENTS, ROLE_MAP, get_role()
    models.py         # model-path discovery
    cli.py            # wav2caption console script
tests/                # no-Essentia unit tests

Development

git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src

The unit tests intentionally do not require Essentia, so CI stays fast and free of TensorFlow. Real-audio smoke tests belong in examples/.

Verification (sigstore)

Releases published after 2026-05-16 include a sigstore keyless signature bundle (one .sigstore file per artifact) attached to the GitHub Release.

Verify a PyPI install

pip download wav2caption==<version> --no-deps -d ./verify
python -m sigstore verify identity \
    --cert-identity 'https://github.com/hinanohart/wav2caption/.github/workflows/release.yml@refs/tags/v<version>' \
    --cert-oidc-issuer 'https://token.actions.githubusercontent.com' \
    ./verify/*.whl ./verify/*.tar.gz

The corresponding .sigstore bundles can be downloaded from the GitHub Release page.

Historic releases (pre-2026-05-16)

Earlier releases were published without sigstore bundles. Re-installing those versions provides no cryptographic provenance — pin to a current release if assurance matters.

License

Source code: MIT (see LICENSE).
Runtime dep Essentia: AGPL-3.0 (opt-in via pip install "wav2caption[essentia]").
Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).

Full third-party notices: NOTICE.md.

If you need a commercial pipeline you will have to either license Essentia from MTG-UPF or swap in a different backend. The MIT-licensed code in this repo is backend-agnostic enough that a torch / onnxruntime port is straightforward — PRs welcome.

Project details

Maintainers

hinanohart

Release history

0.1.4 (this version): May 19, 2026
0.1.2: Apr 17, 2026
0.1.1: Apr 17, 2026

Download files

Source Distribution

wav2caption-0.1.4.tar.gz (22.5 kB)

Uploaded May 19, 2026 Source

Built Distribution

wav2caption-0.1.4-py3-none-any.whl (18.8 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file wav2caption-0.1.4.tar.gz.

File metadata

Download URL: wav2caption-0.1.4.tar.gz
Upload date: May 19, 2026
Size: 22.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for wav2caption-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`6ee513d34e31b66ebb7f4367c498d2fa62c61d23771f7927e8d2c4513e2689a9`
MD5	`c5d1c100c096f565ebdc50074eecbee9`
BLAKE2b-256	`5d6a6f16e72d0d15bef0632d128e4ad13dadd9af84cb567253bd1bd0bb313b3d`

Provenance

The following attestation bundles were made for wav2caption-0.1.4.tar.gz:

Publisher: release.yml on hinanohart/wav2caption

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wav2caption-0.1.4.tar.gz
- Subject digest: 6ee513d34e31b66ebb7f4367c498d2fa62c61d23771f7927e8d2c4513e2689a9
- Sigstore transparency entry: 1573588068
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: hinanohart/wav2caption@76feb27a22d4383dcfcc6c9f140f40c9773ecca8
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/hinanohart
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76feb27a22d4383dcfcc6c9f140f40c9773ecca8
- Trigger Event: push

File details

Details for the file wav2caption-0.1.4-py3-none-any.whl.

File metadata

Download URL: wav2caption-0.1.4-py3-none-any.whl
Upload date: May 19, 2026
Size: 18.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for wav2caption-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4297d446f9c6c6d1daa2559bc5560ac1ed63d428939b235a4a413d4552212b52`
MD5	`fe73c70bcad6a3015d9ce4517c9305fd`
BLAKE2b-256	`bd8fe328b709cc4791c61102cccf8c17075561f4ae145ca7bc9bf6769f6ef39c`

Provenance

The following attestation bundles were made for wav2caption-0.1.4-py3-none-any.whl:

Publisher: release.yml on hinanohart/wav2caption

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: wav2caption-0.1.4-py3-none-any.whl
- Subject digest: 4297d446f9c6c6d1daa2559bc5560ac1ed63d428939b235a4a413d4552212b52
- Sigstore transparency entry: 1573588081
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: hinanohart/wav2caption@76feb27a22d4383dcfcc6c9f140f40c9773ecca8
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/hinanohart
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76feb27a22d4383dcfcc6c9f140f40c9773ecca8
- Trigger Event: push
