Terminal-first singing voice synthesis framework for the Claude Code era.

vocaboot

Terminal-friendly singing voice synthesis framework for the Claude Code era.

Status: stage 1 smoke under development (2026-05-12). Not yet released.

Why

Existing singing voice synthesis (SVS) tools such as SynthV, VOCALOID, NEUTRINO, and CeVIO AI are GUI-only. You cannot script them from a terminal, pipe them through an agent like Claude Code, automate batch covers, or integrate them into CI.

vocaboot is the bridge: after a single pip install you can synthesize a singing voice from your shell, from a Python import, or from an MCP-aware agent. Stage 1 ships a CLI scaffold. Which surface (CLI, MCP, Skill, or library-only) should be primary for the Claude Code era is treated as an open question, to be re-evaluated at the end of Stage 1 against actual usage. (See "Distribution" below.)

Design constraints

Cost: $0 (inference path) — CPU-only numeric verify is the $0 invariant. The self-train path (see "Stage 0 result") is not $0 by definition: it requires user GPU compute and audio data.
License: Apache-2.0 (code) — model weights inherit their upstream license; no NC weights bundled.
Generality: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
"Roughly close" is good enough: chasing literal pitch-perfect voice cloning is a structural failure mode (see _failures/); the framework optimizes for direction of a target timbre, not literal match.
Auto-failure museum (R8): every failed run writes a structured post-mortem under _failures/, so the next attempt cannot repeat the same mistake. Paths are scrubbed to {basename, sha256} before persistence (R15); cyclic exception detail is replaced with <cycle> placeholders so that the museum itself never raises.
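
The scrubbing and cycle-handling rules in the museum constraint can be illustrated with a minimal sketch. The function names `scrub_path` and `break_cycles` are hypothetical, not vocaboot's API, and the sketch assumes the SHA-256 digest is taken over the full path string rather than the file contents, which the text above does not specify.

```python
import hashlib
import os


def scrub_path(path: str) -> dict:
    """Reduce a filesystem path to its basename plus a SHA-256 digest.

    Assumption: the digest covers the full path string, not the file contents.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": os.path.basename(path), "sha256": digest}


def break_cycles(value, _ancestors=frozenset()):
    """Copy nested dicts/lists, replacing any back-reference with "<cycle>"."""
    if isinstance(value, (dict, list)):
        if id(value) in _ancestors:
            return "<cycle>"
        # Track only the current ancestor chain, so shared (non-cyclic)
        # sub-objects are still copied in full.
        ancestors = _ancestors | {id(value)}
        if isinstance(value, dict):
            return {str(k): break_cycles(v, ancestors) for k, v in value.items()}
        return [break_cycles(v, ancestors) for v in value]
    return value
```

With this shape, a post-mortem record never stores a user's directory layout, and a self-referencing exception payload can still be serialized to JSON.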

Architecture overview (locked, 4 stages, strict superset)

Stage	What works	Modules added
0	`ckpt` acquisition gate — closed 2026-05-12 with result (B): 0 license-clean pairs; project ships `--accept-nc` opt-in + self-train as dual smoke path.	(research, no new module)
1	One engine, one voice, 5-second wav, CPU, agent-verify gate	input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= 11 layer modules; `cli.py` is the CLI entry point and is excluded from the module count)
2	Any voice, any song	+ profile_match (optional target spec)
3	"Roughly like X" via voice conversion	+ svc (DDSP-SVC default; RVC v2 / Seed-VC plugin)
4	Natural-language input (`vocaboot "sing 'Yume' like a soft anime voice"`)	+ nl_parse

Hard ceilings (anti-bloat invariant): layers ≤ 9, layer modules ≤ 12 (CLI entry, MCP entry, etc. are surface-level and counted separately), hooks ≤ 2. Left unmerged, the table above reaches 13 layer modules at Stage 3 and 14 at Stage 4, so by Stage 3 at the latest score_align will be folded into input and agent_advisory into agent_verify to keep the ceiling intact.
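
The module arithmetic behind the ceiling can be checked with a short sketch. The module names are copied from the stage table; the helper `modules_at` and the constant names are illustrative only and do not exist in vocaboot.

```python
LAYER_MODULE_CEILING = 12

# Stage 1 layer modules, as listed in the architecture table.
STAGE_1 = [
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
]
# Modules added by each later stage, per the same table.
ADDED = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}
# Planned merges: score_align into input, agent_advisory into agent_verify.
FOLDED_AWAY = {"score_align", "agent_advisory"}


def modules_at(stage: int, folded: bool) -> int:
    """Count layer modules at a stage, with or without the planned merges."""
    mods = list(STAGE_1)
    for s in range(2, stage + 1):
        mods += ADDED[s]
    if folded:
        mods = [m for m in mods if m not in FOLDED_AWAY]
    return len(mods)


for stage in (1, 2, 3, 4):
    unfolded, folded = modules_at(stage, False), modules_at(stage, True)
    print(f"stage {stage}: unfolded={unfolded} folded={folded} "
          f"ok_if_folded={folded <= LAYER_MODULE_CEILING}")
```

Under these counts the unfolded total first exceeds 12 at Stage 3 (13 modules), and the folded total reaches exactly 12 at Stage 4.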

Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with 0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs for the DiffSinger format:

vocoder: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are not drop-in compatible with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
acoustic: every public DiffSinger community voicebank (qixuan, renri, nyaru, mitsudate …) is CC-BY-NC-SA-4.0. The openvpi code is Apache-2.0; the weights are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the substantive difference must be stated: vocaboot has no pip install-and-go commercial-OK default path, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

Path	Tier	$0?	Use
`--accept-nc` opt-in	CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic)	Yes (inference only)	personal, non-commercial smoke
self-train	Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio	No — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path)	commercial / OSS-clean redistribution

The default tier in docs/ckpt_registry.md stays empty until upstream licensing changes. Engine pivot (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See docs/ckpt_registry.md for the deferral rationale.

Output wav licensing (NC opt-in path)

When --accept-nc is set, the user holds the copyright in the resulting .wav, and CC's public position is that CC license conditions on the model do not automatically extend to model outputs (Creative Commons FAQ, Using CC-licensed Works for AI Training, 2025). However, the act of running NC weights is itself bound by the NC clause, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless --accept-nc is explicitly passed.
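
The refusal rule can be sketched as a small gate that runs before any checkpoint is read. This is an illustration under stated assumptions, not the actual license_check module: the function `check_ckpt_license`, the exception `LicenseRefused`, the tier labels, and the set of license identifiers treated as non-commercial are all hypothetical.

```python
# Assumed set of non-commercial license identifiers; the real registry may differ.
NC_LICENSES = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0", "CC-BY-NC-ND-4.0"}


class LicenseRefused(Exception):
    """Raised when NC weights are requested without explicit opt-in."""


def check_ckpt_license(license_id: str, accept_nc: bool) -> str:
    """Return the tier a checkpoint falls into, or refuse to proceed."""
    if license_id in NC_LICENSES:
        if not accept_nc:
            raise LicenseRefused(
                f"{license_id} weights require --accept-nc (non-commercial use only)"
            )
        return "nc-opt-in"
    return "default"
```

Raising before the weights are opened keeps the opt-in explicit: a caller that forgets the flag fails fast instead of silently producing NC-bound output.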

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured LICENSE_NOTICE.txt emission into the failure museum and stdout when an NC ckpt is loaded; see docs/ckpt_registry.md for the literal text.

"OSS-1 accessible" invariant — Stage 0 result reframe

Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:

OSS-1 (revised, 2026-05-12): 2 commands max to smoke wav. After pip install vocaboot[diffsinger,verify], running vocaboot demo --accept-nc must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled examples/short.ds, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The demo subcommand is a Stage 1 step 2 deliverable. The --accept-nc requirement is the deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.
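
A reviewer could check the revised OSS-1 invariant mechanically with something like the sketch below. The helper `meets_oss1` is hypothetical; it only encodes the two conditions stated above (exit code 0 or 5, within 60 seconds) and assumes the package is already installed.

```python
import subprocess

ACCEPTED_EXIT_CODES = {0, 5}  # APPROVE, REVIEW
TIME_BUDGET_S = 60


def meets_oss1(cmd, budget_s=TIME_BUDGET_S) -> bool:
    """Run cmd and report whether it exits with 0 or 5 inside the time budget."""
    try:
        result = subprocess.run(cmd, timeout=budget_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run once the budget is exceeded.
        return False
    return result.returncode in ACCEPTED_EXIT_CODES


# Intended use, once `pip install vocaboot[diffsinger,verify]` has been run:
#   meets_oss1(["vocaboot", "demo", "--accept-nc"])
```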

Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via [verify]). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
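
A minimal sketch of how a bootstrap confidence interval can drive a three-way verdict is shown here. Only the exit codes 0 / 5 / 6 come from the text above. Everything else is an assumption for illustration: the percentile bootstrap over the mean, the per-frame metric values, the threshold, and the decision rule (APPROVE when the whole interval clears the threshold, REJECT when none of it does, REVIEW otherwise) are not confirmed as vocaboot's R5 procedure. The sketch uses only the standard library rather than numpy.

```python
import random
import statistics
from enum import IntEnum


class Verdict(IntEnum):
    """Verdicts double as process exit codes."""
    APPROVE = 0
    REVIEW = 5
    REJECT = 6


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of samples."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def verdict_for(samples, threshold) -> Verdict:
    """Assumed rule: decide only when the CI lies entirely on one side."""
    lo, hi = bootstrap_ci(samples)
    if lo >= threshold:
        return Verdict.APPROVE
    if hi < threshold:
        return Verdict.REJECT
    return Verdict.REVIEW  # interval straddles the threshold: undetermined
```

A CLI wrapper would then end with `sys.exit(verdict_for(values, threshold))`, which is what lets CI treat exit code 5 differently from exit code 6.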

Code review is not a per-synth concern; it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a --verify=agent mode that invoked claude -p for static code review, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A, so the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a [verify-similarity] extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as museum-annotation only — structurally excluded from the verdict by design.

--no-agent-verify exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.

Distribution

Stage 1 ships:

pip install vocaboot[diffsinger,verify] → either
- vocaboot demo --accept-nc (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), or
- vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc (BYO ckpt path).
import vocaboot → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

MCP server stdio — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
Anthropic Skill / Subagent — natural surface for Claude Code users; tradeoff is client coupling.
Headless library only (no CLI) — would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did not rule any of them out.

Related work (delta)

vocaboot is not the first attempt at headless DiffSinger inference. Pre-existing projects:

Project	What it does	What `vocaboot` adds
openvpi DiffSingerMiniEngine	Python HTTP server wrapping ONNX inference	License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface
diffscope/dsinfer	C++ low-level inference SDK	Higher-level Python API + agent integration + verify gate
HighCWu / Jobsecond ONNX-Infer	Reference ONNX inference snippet	Production-grade gates + structured failure capture + R8 reproducibility
OpenUtau (headless mode)	GUI-first with scripted batch	$0 verify gate, no GUI bridge required
nnsvs/nnsvs	MIT SVS framework; weights per-voicebank	If Stage 0 reopens, nnsvs is the first re-target candidate (license-clean framework, weights TBD)

If, during Stage 0 or 1, a clean integration into one of these projects becomes more valuable than a standalone framework, the project is willing to redirect toward a PR contribution.

Stage 1 (current target)

# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.

examples/short.ds uses generic CV phoneme tokens (d o r e m i f a s o); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts examples/short.ds to match its dictionary; for other ckpts, users must align phoneme tokens themselves.
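
For users aligning tokens themselves, a quick pre-flight check is to list the score tokens a checkpoint's dictionary does not know. The sketch below assumes the score's phonemes are available as a space-separated ph_seq string and the dictionary as a set of tokens; `missing_phonemes` is a hypothetical helper, not part of vocaboot.

```python
def missing_phonemes(ph_seq: str, dictionary: set) -> list:
    """Return score tokens absent from the dictionary, in first-seen order."""
    missing = []
    for token in ph_seq.split():
        if token not in dictionary and token not in missing:
            missing.append(token)
    return missing


# Example with a deliberately incomplete, made-up dictionary:
print(missing_phonemes("d o r e m i f a s o", {"d", "o", "r", "e", "m", "i"}))
# prints ['f', 'a', 's']
```

An empty result means every token in the score has an entry; anything else needs remapping before synthesis.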

`--voice synthetic` is an oscillator, not a voice

The Stage 1 synthetic-acoustic path produces a 4-partial harmonic oscillator driven by the score's note sequence. It is not vocal-tract-filtered voice synthesis — the ph_seq (phoneme) field is deliberately ignored, and the output has no vowel formants or consonants. Its purpose is to prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without depending on a license-clean acoustic checkpoint that does not exist (Stage 0 result B). Vowel-formant / phoneme-aware synthesis arrives with the BYO acoustic path in Stage 2.
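
To make "4-partial harmonic oscillator" concrete, here is a minimal additive-synthesis sketch. It is a simplification under stated assumptions: it renders a time-domain waveform directly, whereas the path described above emits mel + f0 for the vocoder; the 1/k partial amplitudes, the (MIDI note, duration) input shape, and the function names are all illustrative.

```python
import math

SAMPLE_RATE = 44100
# Assumed amplitudes for partials 1..4 (roughly 1/k rolloff).
PARTIAL_AMPS = (1.0, 0.5, 0.33, 0.25)


def midi_to_hz(note: int) -> float:
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def render(notes, sample_rate=SAMPLE_RATE):
    """Render (midi_note, duration_seconds) pairs as a list of floats in [-1, 1].

    Phonemes play no part: only pitch and duration shape the output.
    Phase restarts at each note, so note boundaries may click.
    """
    norm = sum(PARTIAL_AMPS)
    out = []
    for midi, duration in notes:
        f0 = midi_to_hz(midi)
        for n in range(int(duration * sample_rate)):
            t = n / sample_rate
            sample = sum(
                amp * math.sin(2 * math.pi * f0 * (k + 1) * t)
                for k, amp in enumerate(PARTIAL_AMPS)
            )
            out.append(sample / norm)
    return out
```

Because every partial is an exact multiple of f0, the result has a clear pitch for F0 analysis to track, but no formant structure, which is why it cannot stand in for a voice.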

A NUMERIC verify APPROVE on the synthetic path proves the pipeline rings, not that the output sounds like a voice. Stage 2 adds spectral/timbral metrics behind [verify-similarity] so APPROVE discriminates synthetic from real-voice output.

Roadmap

Architecture lock-in (rounds 28–32; per-round audit trails are inlined into docs/stage_2.md ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in CHANGELOG.md)
Stage 0: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
Stage 1 step 2 (5c): vocaboot demo subcommand wired (click.group, bundled examples/short.ds, --accept-nc mandatory, SHA-256 runtime verify of the pinned vocoder).
Stage 1 step 2 (5d): LICENSE_NOTICE.txt attribution emission for the NC vocoder (sidecar <out>.LICENSE_NOTICE.txt + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
Stage 1: smoke wav (CPU, 5 s) via vocaboot demo --accept-nc (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in test_stage1_synthetic_smoke_wav_end_to_end already serve as the CI canary.
Stage 1.5 closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) stage_1.py → pipeline.py rename, (B) privacy.refuse_user_audio_without_consent + museum user-audio hard-exclude rule, (C) docs/quality_gate.md Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) docs/ckpt_registry.md acquisition Path B (NC-only public release) chosen, (E) docs/stage_2.md spec.
Stage 2 (docs/stage_2.md): profile_match module + audio_ingest module + Tier 2 verify metrics (mcd, wavlm_cosine).
Stage 2.5: MCP server (vocaboot mcp-serve) — thin wrapper around pipeline.run.
Stage 3: SVC fallback chain (DDSP-SVC primary) + self-train recipe (acquisition Path C).
Stage 4: natural-language interface.
Public OSS release on Path B (NC-only, local screencast demo; HF Space deferred pending Default-tier ckpt).

Non-goals

Pitch-perfect voice cloning (a literal match to the target is a known failure mode)
Bundled model weights (we point at upstream URLs and verify SHA-256)
GUI
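
Since weights are fetched from upstream rather than bundled, integrity rests on the SHA-256 check. A minimal sketch of such a check follows; `sha256_of` and `verify_pinned` are hypothetical names, and where the expected digest comes from (for example, a registry entry) is left to the caller.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the pinned value (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A loader would call `verify_pinned` after download and refuse to continue on a mismatch, so a changed or corrupted upstream file is caught before inference.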

License

Code: Apache-2.0. Model weights: inherit from upstream — vocaboot itself bundles none.
