Terminal-first singing voice synthesis framework for the Claude Code era.
vocaboot
Terminal-friendly singing voice synthesis framework for the Claude Code era.
Status: stage 1 smoke under development (2026-05-12). Not yet released.
Why
Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI.
vocaboot is the bridge: after a single pip install you can synthesize a singing voice from your shell, from a Python import, or from an MCP-aware agent. Stage 1 ships a CLI scaffold; which surface (CLI, MCP, Skill, or library-only) should be primary for the Claude Code era is treated as an open question, to be re-evaluated at the end of Stage 1 against actual usage. (See "Distribution" below.)
Design constraints
- Cost: $0 (inference path) — CPU-only numeric verify is the $0 invariant. The self-train path (see "Stage 0 result") is not $0 by definition: it requires user GPU compute and audio data.
- License: Apache-2.0 (code) — model weights inherit their upstream license; no NC weights bundled.
- Generality: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
- "Roughly close" is good enough: chasing literal pitch-perfect voice cloning is a structural failure mode (see _failures/); the framework optimizes for direction of a target timbre, not literal match.
- Auto-failure museum (R8): every failed run writes a structured post-mortem under _failures/, so the next attempt cannot repeat the same mistake. Paths are scrubbed to {basename, sha256} before persistence (R15); cyclic exception detail is replaced with <cycle> placeholders so museum itself never raises.
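The scrubbing and cycle handling described for the failure museum can be sketched as follows. This is a minimal illustration, not vocaboot's actual code: the function names `scrub_path` and `break_cycles` are hypothetical, and hashing the full path string (rather than the file contents) is an assumption, since the text above does not say which is hashed.

```python
import hashlib
import json
from pathlib import Path


def scrub_path(path: str) -> dict:
    """Reduce a path to {basename, sha256} so directory names never persist.

    Assumption: the digest covers the full path string, not the file bytes.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": Path(path).name, "sha256": digest}


def break_cycles(obj, _ancestors=frozenset()):
    """Copy nested dicts/lists, replacing back-references with "<cycle>"."""
    if isinstance(obj, (dict, list)):
        if id(obj) in _ancestors:
            return "<cycle>"
        ancestors = _ancestors | {id(obj)}  # only containers on this branch
        if isinstance(obj, dict):
            return {str(k): break_cycles(v, ancestors) for k, v in obj.items()}
        return [break_cycles(v, ancestors) for v in obj]
    return obj


# A self-referencing detail dict would make json.dumps raise ValueError.
detail = {"error": "vocoder mismatch"}
detail["self"] = detail

record = {
    "ckpt": scrub_path("/home/alice/ckpts/voice.onnx"),  # hypothetical path
    "detail": break_cycles(detail),
}
print(json.dumps(record, indent=2))
```

The serialized record keeps the file name and a stable digest for de-duplication, while the user's home directory never reaches disk.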
Architecture overview (locked, 4 stages, strict superset)
| Stage | What works | Modules added |
|---|---|---|
| 0 | ckpt acquisition gate, closed 2026-05-12 with result (B): 0 license-clean pairs; project ships --accept-nc opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, agent-verify gate | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= 11 layer modules; cli.py is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / Seed-VC plugin) |
| 4 | Natural-language input (vocaboot "sing 'Yume' like a soft anime voice") | + nl_parse |
Hard ceilings (anti-bloat invariant): layers ≤ 9, layer modules ≤ 12 (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Stage 4 lands at 14 layer modules → at the latest by Stage 3, score_align will be folded into input and agent_advisory into agent_verify to keep the ceiling intact.
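The module arithmetic behind the ceiling can be checked with a few lines. This sketch only counts the module names listed in the stage table above; the helper `layer_modules` and the constant names are illustrative, not part of vocaboot.

```python
# Module names copied from the stage table; everything else is illustrative.
STAGE_1_MODULES = [
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
]
ADDED_BY_STAGE = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}
FOLDED_AWAY = {"score_align", "agent_advisory"}  # merged into input / agent_verify
LAYER_MODULE_CEILING = 12


def layer_modules(stage: int, folded: bool = False) -> list:
    """Return the layer modules present at a stage, optionally after folding."""
    modules = list(STAGE_1_MODULES)
    for s in range(2, stage + 1):
        modules += ADDED_BY_STAGE[s]
    if folded:
        modules = [m for m in modules if m not in FOLDED_AWAY]
    return modules


for stage in (1, 2, 3, 4):
    plain = len(layer_modules(stage))
    merged = len(layer_modules(stage, folded=True))
    print(f"stage {stage}: {plain} modules, {merged} after folding "
          f"(ceiling {LAYER_MODULE_CEILING})")
```

Without folding, the count already exceeds 12 at Stage 3 (13 modules), which is why the fold has to land by then rather than at Stage 4.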
Stage 0 result (2026-05-12, closed: B)
Stage 0 ckpt research closed with 0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs for the DiffSinger format:
- vocoder: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are not drop-in compatible with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
- acoustic: every public DiffSinger community voicebank (qixuan, renri, nyaru, mitsudate…) is CC-BY-NC-SA-4.0. The openvpi code is Apache-2.0; the weights are not.
vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the substantive difference must be stated: vocaboot has no pip install-and-go commercial-OK default path, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:
| Path | Tier | $0? | Use |
|---|---|---|---|
| --accept-nc opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | No — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |
The default tier in docs/ckpt_registry.md stays empty until upstream licensing changes. Engine pivot (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See docs/ckpt_registry.md for the deferral rationale.
Output wav licensing (NC opt-in path)
When --accept-nc is set, the resulting .wav is the user's copyright, and CC's public position is that CC license conditions on the model do not automatically extend to model outputs (Creative Commons FAQ, Using CC-licensed Works for AI Training, 2025). However: the act of running NC weights is itself bound by the NC clause, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless --accept-nc is explicitly passed.
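The refusal rule can be pictured as a small gate that runs before any weights are read. This is a sketch under assumptions: `is_non_commercial`, `may_load`, and `require_license_ok` are hypothetical names, and detecting NC licenses by looking for an `NC` component in the license identifier is an illustrative heuristic, not the registry's actual logic.

```python
class LicenseRefused(RuntimeError):
    """Raised when NC weights are requested without explicit opt-in."""


def is_non_commercial(license_id: str) -> bool:
    # Heuristic assumption: Creative Commons NC variants carry an "NC" part,
    # e.g. "CC-BY-NC-SA-4.0".
    return "NC" in license_id.upper().split("-")


def may_load(license_id: str, accept_nc: bool) -> bool:
    """True if a checkpoint under this license may be loaded."""
    return accept_nc or not is_non_commercial(license_id)


def require_license_ok(license_id: str, accept_nc: bool) -> None:
    if not may_load(license_id, accept_nc):
        raise LicenseRefused(
            f"{license_id} weights need an explicit --accept-nc opt-in"
        )


require_license_ok("CC-BY-NC-SA-4.0", accept_nc=True)  # passes silently
print(may_load("CC-BY-NC-SA-4.0", accept_nc=False))
```

Keeping the check as a pure function of the license string and the flag makes it cheap to unit-test without any checkpoint on disk.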
The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured LICENSE_NOTICE.txt emission into the failure museum and stdout when an NC ckpt is loaded; see docs/ckpt_registry.md for the literal text.
"OSS-1 accessible" invariant — Stage 0 result reframe
Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:
OSS-1 (revised, 2026-05-12): 2 commands max to smoke wav. After pip install vocaboot[diffsinger,verify], running vocaboot demo --accept-nc must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled examples/short.ds, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.
The demo subcommand is a Stage 1 step 2 deliverable. The --accept-nc requirement is the deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.
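A CI job could assert the revised OSS-1 invariant roughly like this. The sketch is hedged: `smoke_ok` is a hypothetical helper, and because vocaboot may not be installed where this runs, the example exercises it with a stand-in Python command that simply exits with code 5; in a real job the command list would be the demo invocation quoted above.

```python
import subprocess
import sys

ACCEPTABLE_EXIT_CODES = {0, 5}  # APPROVE or REVIEW, per the OSS-1 wording


def smoke_ok(cmd: list, budget_s: float = 60.0) -> bool:
    """Run cmd; pass only if it exits 0 or 5 within the time budget."""
    try:
        proc = subprocess.run(cmd, timeout=budget_s)
    except subprocess.TimeoutExpired:
        return False  # exceeding the 60-second budget breaks the invariant
    return proc.returncode in ACCEPTABLE_EXIT_CODES


# Stand-in for ["vocaboot", "demo", "--accept-nc"]; it only mimics a REVIEW exit.
stand_in = [sys.executable, "-c", "import sys; sys.exit(5)"]
print(smoke_ok(stand_in))
```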
Verify gate ($0, no Claude dependency)
The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via [verify]). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
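To make the verdict-to-exit-code mapping concrete, here is a stdlib-only sketch. Only the three exit codes come from the text above; the percentile bootstrap, the decision rule in `decide`, and the threshold are assumptions for illustration, and the real gate computes its metrics with pyworld rather than taking a list of numbers.

```python
import random
import statistics
from enum import Enum


class Verdict(Enum):
    APPROVE = 0
    REVIEW = 5
    REJECT = 6


def bootstrap_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of samples (assumed method)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def decide(ci, threshold):
    """Hypothetical rule: approve or reject only when the whole CI agrees."""
    lo, hi = ci
    if lo >= threshold:
        return Verdict.APPROVE
    if hi < threshold:
        return Verdict.REJECT
    return Verdict.REVIEW  # CI straddles the threshold: undetermined


# Hypothetical per-frame HNR values in dB and a made-up 10 dB threshold.
frames = [0.0, 20.0] * 10
verdict = decide(bootstrap_ci(frames), threshold=10.0)
print(verdict.name, "-> exit code", verdict.value)
```

A CLI entry point would end with `sys.exit(verdict.value)`, which is what lets CI treat 5 (undetermined) differently from 6 (rejected).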
Code review is not a per-synth concern — it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a --verify=agent mode that invoked claude -p for static codereview, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).
Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a [verify-similarity] extra so the default Stage 1 install stays footprint-minimal.
Agent C (qualitative LLM-as-judge advisory) still runs as museum-annotation only — structurally excluded from the verdict by design.
--no-agent-verify exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.
Distribution
Stage 1 ships:
- pip install vocaboot[diffsinger,verify] → either vocaboot demo --accept-nc (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), or vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc (BYO ckpt path).
- import vocaboot → Python library (same package)
Treated as Stage 2 retrospective decisions (no irrevocable rejection now):
- MCP server stdio — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
- Anthropic Skill / Subagent — natural surface for Claude Code users; tradeoff is client coupling.
- Headless library only (no CLI) — would simplify install footprint if the CLI proves redundant against MCP/Skill.
Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did not rule any of them out.
Related work (delta)
vocaboot is not the first attempt at headless DiffSinger inference. Pre-existing projects:
| Project | What it does | What vocaboot adds |
|---|---|---|
| openvpi DiffSingerMiniEngine | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| diffscope/dsinfer | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| HighCWu / Jobsecond ONNX-Infer | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| nnsvs/nnsvs | MIT SVS framework; weights per-voicebank | If Stage 0 reopens, nnsvs is the first re-target candidate (license-clean framework, weights TBD) |
If, during Stage 0 or 1, a clean integration into one of these projects becomes more valuable than a standalone framework, the project is willing to redirect toward a PR contribution.
Stage 1 (current target)
# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc
# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.
examples/short.ds uses generic CV phoneme tokens (d o r e m i f a s o); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts examples/short.ds to match its dictionary; for other ckpts, users must align phoneme tokens themselves.
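Before pointing a different ckpt at the example score, it may help to list which score tokens the ckpt's dictionary lacks. The sketch below assumes you have already extracted the score's phoneme tokens and the ckpt's phoneme set yourself; `missing_phonemes` and the sample dictionary are hypothetical.

```python
def missing_phonemes(score_tokens, ckpt_dictionary):
    """Return score tokens absent from the ckpt dictionary, in first-seen order."""
    missing = []
    for token in score_tokens:
        if token not in ckpt_dictionary and token not in missing:
            missing.append(token)
    return missing


# Tokens as listed for examples/short.ds above.
score_tokens = "d o r e m i f a s o".split()

# Hypothetical phoneme set for some ckpt; real dictionaries differ per voicebank.
ckpt_dictionary = {"a", "i", "u", "e", "o", "k", "s", "t", "n", "m", "r", "d"}

print(missing_phonemes(score_tokens, ckpt_dictionary))  # ['f']
```

An empty result means the score can be used as-is; anything else is the list of tokens to remap by hand.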
--voice synthetic is an oscillator, not a voice
The Stage 1 synthetic-acoustic path produces a 4-partial harmonic oscillator driven by the score's note sequence. It is not vocal-tract-filtered voice synthesis — the ph_seq (phoneme) field is deliberately ignored, and the output has no vowel formants or consonants. Its purpose is to prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without depending on a license-clean acoustic checkpoint that does not exist (Stage 0 result B). Vowel-formant / phoneme-aware synthesis arrives with the BYO acoustic path in Stage 2.
A NUMERIC verify APPROVE on the synthetic path proves the pipeline rings, not that the output sounds like a voice. Stage 2 adds spectral/timbral metrics behind [verify-similarity] so APPROVE discriminates synthetic from real-voice output.
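For intuition about what a 4-partial harmonic oscillator sounds like, here is a minimal stdlib sketch that renders one directly to samples. It is not the Stage 1 code path: the real pipeline produces mel + f0 and runs a vocoder, whereas this sketch skips both, and the 1/k partial amplitudes, the MIDI note input, and the function names are assumptions.

```python
import math


def midi_to_hz(note: int) -> float:
    """Equal-tempered frequency with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render_notes(notes, sample_rate=44100, partials=4):
    """Render (midi_note, seconds) pairs as a sum of harmonic sine partials.

    Partial k has frequency k * f0 and amplitude 1/k (assumed weighting);
    the sum is normalized so every sample stays within [-1, 1].
    """
    norm = sum(1.0 / k for k in range(1, partials + 1))
    samples = []
    for midi_note, seconds in notes:
        f0 = midi_to_hz(midi_note)
        for i in range(int(round(seconds * sample_rate))):
            t = i / sample_rate
            value = sum(
                math.sin(2 * math.pi * f0 * k * t) / k
                for k in range(1, partials + 1)
            )
            samples.append(value / norm)
    return samples


# Ten half-second notes give a 5-second signal; pitches are arbitrary.
melody = [(60, 0.5), (62, 0.5), (64, 0.5), (65, 0.5), (67, 0.5),
          (69, 0.5), (71, 0.5), (72, 0.5), (71, 0.5), (69, 0.5)]
audio = render_notes(melody)
print(len(audio) / 44100, "seconds")
```

Nothing in the loop looks at phonemes, which is the point made above: the result has pitch and harmonics but no formants or consonants.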
Roadmap
- Architecture lock-in (rounds 28–32; per-round audit trails are inlined into docs/stage_2.md ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in CHANGELOG.md)
- Stage 0: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
- Stage 1 step 2 (5c): vocaboot demo subcommand wired (click.group, bundled examples/short.ds, --accept-nc mandatory, SHA-256 runtime verify of the pinned vocoder).
- Stage 1 step 2 (5d): LICENSE_NOTICE.txt attribution emission for the NC vocoder (sidecar <out>.LICENSE_NOTICE.txt + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
- Stage 1: smoke wav (CPU, 5 s) via vocaboot demo --accept-nc (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
- Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e: LLM-as-judge fragility, code review delegated to CI).
- Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review: OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in test_stage1_synthetic_smoke_wav_end_to_end already serve as the CI canary.
- Stage 1.5 closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) stage_1.py → pipeline.py rename, (B) privacy.refuse_user_audio_without_consent + museum user-audio hard-exclude rule, (C) docs/quality_gate.md Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) docs/ckpt_registry.md acquisition Path B (NC-only public release) chosen, (E) docs/stage_2.md spec.
- Stage 2 (docs/stage_2.md): profile_match module + audio_ingest module + Tier 2 verify metrics (mcd, wavlm_cosine).
- Stage 2.5: MCP server (vocaboot mcp-serve), a thin wrapper around pipeline.run.
- Stage 3: SVC fallback chain (DDSP-SVC primary) + self-train recipe (acquisition Path C).
- Stage 4: natural-language interface.
- Public OSS release on Path B (NC-only, local screencast demo; HF Space deferred pending Default-tier ckpt).
Non-goals
- Pitch-perfect voice cloning (targeting a literal match is a known failure mode)
- Bundled model weights (we point at upstream URLs and verify SHA-256)
- GUI
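Since weights are fetched from upstream rather than bundled, the "verify SHA-256" step carries the integrity guarantee. A minimal sketch of such a check follows; `sha256_of` and `verify_pinned` are hypothetical names, and the digest a caller passes in would come from whatever registry pins the file.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: str, expected_hex: str) -> bool:
    """Compare the file's digest against a pinned value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A caller would refuse to load the checkpoint when `verify_pinned` returns False, so a corrupted or swapped download fails before any inference runs.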
License
Code: Apache-2.0.
Model weights: inherit from upstream — vocaboot itself bundles none.