vocaboot

Terminal-friendly singing voice synthesis framework for the Claude Code era.

Status: stage 1 smoke under development (2026-05-12). Not yet released.

Why

Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI.

vocaboot is the bridge: a single pip install and you can synthesize a singing voice from your shell, from a Python import, or from an MCP-aware agent. Stage 1 ships a CLI scaffold. Which surface (CLI, MCP, Skill, or library-only) should be primary for the Claude Code era is treated as an open question, to be re-evaluated at the end of Stage 1 against actual usage. (See "Distribution" below.)

Design constraints

  • Cost: $0 (inference path) — CPU-only numeric verify is the $0 invariant. The self-train path (see "Stage 0 result") is not $0 by definition: it requires user GPU compute and audio data.
  • License: Apache-2.0 (code) — model weights inherit their upstream license; no NC weights bundled.
  • Generality: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
  • "Roughly close" is good enough: chasing literal pitch-perfect voice cloning is a structural failure mode (see _failures/); the framework optimizes for moving toward a target timbre, not for a literal match.
  • Auto-failure museum (R8): every failed run writes a structured post-mortem under _failures/, so the next attempt can avoid repeating the same mistake. Paths are scrubbed to {basename, sha256} before persistence (R15); cyclic exception detail is replaced with <cycle> placeholders so the museum itself never raises.
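
The scrubbing and cycle-handling rules in the last bullet can be sketched as follows. This is a minimal illustration, not vocaboot's implementation: `scrub_path` and `decycle` are hypothetical names, and hashing the path string (rather than the file contents) is an assumption, since the text does not say what the sha256 covers.

```python
import hashlib
import json
import os


def scrub_path(path: str) -> dict:
    """Reduce a path to its basename plus a SHA-256 digest.

    Assumption: the digest covers the path string, not the file contents.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": os.path.basename(path), "sha256": digest}


def decycle(value, _stack=None):
    """Copy nested dicts/lists, replacing back-references with "<cycle>"."""
    stack = _stack if _stack is not None else set()
    if isinstance(value, (dict, list, tuple)):
        if id(value) in stack:
            # This container is already being visited higher up: a cycle.
            return "<cycle>"
        stack.add(id(value))
        try:
            if isinstance(value, dict):
                return {str(k): decycle(v, stack) for k, v in value.items()}
            return [decycle(v, stack) for v in value]
        finally:
            stack.discard(id(value))
    return value


# Hypothetical failure detail that refers to itself.
detail = {"score": "/home/alice/songs/short.ds"}
detail["self"] = detail

record = decycle(detail)
record["score"] = scrub_path(record["score"])
serialized = json.dumps(record)  # succeeds because the cycle is gone
```

Tracking only the containers on the current recursion path means a structure that is shared but not cyclic is copied in full rather than flagged.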

Architecture overview (locked, 4 stages, strict superset)

| Stage | What works | Modules added |
|---|---|---|
| 0 | ckpt acquisition gate, closed 2026-05-12 with result (B): 0 license-clean pairs; project ships --accept-nc opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, agent-verify gate | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= 11 layer modules; cli.py is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / Seed-VC plugin) |
| 4 | Natural-language input (vocaboot "sing 'Yume' like a soft anime voice") | + nl_parse |

Hard ceilings (anti-bloat invariant): layers ≤ 9, layer modules ≤ 12 (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Without consolidation, Stage 3 would reach 13 layer modules and Stage 4 would reach 14, so by Stage 3 at the latest score_align will be folded into input and agent_advisory into agent_verify to keep the ceiling intact.
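
The module arithmetic behind the ceiling can be checked with a short sketch. The module names come from the stage table above; the helper `modules_at` and the dictionary layout are illustrative only.

```python
# Layer modules listed for Stage 1 (cli.py is excluded, as in the table).
STAGE_1 = [
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
]
# One module is added per later stage.
ADDED = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}
# Planned folds: the key disappears into the value.
FOLDS = {"score_align": "input", "agent_advisory": "agent_verify"}
CEILING = 12


def modules_at(stage: int, folded: bool) -> list:
    """Return the layer modules present at a stage, with or without the folds."""
    mods = list(STAGE_1)
    for s in range(2, stage + 1):
        mods += ADDED[s]
    if folded:
        mods = [m for m in mods if m not in FOLDS]
    return mods


for stage in (1, 2, 3, 4):
    raw = len(modules_at(stage, folded=False))
    kept = len(modules_at(stage, folded=True))
    print(f"stage {stage}: {raw} modules unfolded, {kept} folded, ceiling {CEILING}")
```

Unfolded, the count first exceeds 12 at Stage 3; with both folds applied, Stage 4 lands exactly on the ceiling.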

Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with 0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs for the DiffSinger format:

  • vocoder: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are not drop-in compatible with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
  • acoustic: every public DiffSinger community voicebank (qixuan, renri, nyaru, mitsudate …) is CC-BY-NC-SA-4.0. The openvpi code is Apache-2.0; the weights are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the substantive difference must be stated: vocaboot has no pip install-and-go commercial-OK default path, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

| Path | Tier | $0? | Use |
|---|---|---|---|
| --accept-nc opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | No: user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |

The default tier in docs/ckpt_registry.md stays empty until upstream licensing changes. Engine pivot (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See docs/ckpt_registry.md for the deferral rationale.

Output wav licensing (NC opt-in path)

When --accept-nc is set, the copyright in the resulting .wav belongs to the user, and CC's public position is that CC license conditions on the model do not automatically extend to model outputs (Creative Commons FAQ, Using CC-licensed Works for AI Training, 2025). However, running NC weights is itself bound by the NC clause, so any commercial use requires the self-train path. vocaboot refuses to load NC weights unless --accept-nc is explicitly passed.
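
The refusal rule in the last sentence can be illustrated with a small gate. This is a sketch under assumptions: `check_ckpt_license`, `LicenseRefused`, and the way a non-commercial license is detected (an "NC" component in an SPDX-style identifier) are hypothetical and not taken from vocaboot's license_check module.

```python
class LicenseRefused(Exception):
    """Raised when a checkpoint may not be loaded under the current flags."""


def is_non_commercial(license_id: str) -> bool:
    """Assumption: an "NC" component in the identifier marks an NC license."""
    return "NC" in license_id.upper().split("-")


def check_ckpt_license(license_id: str, accept_nc: bool) -> None:
    """Refuse NC-licensed weights unless the caller opted in explicitly."""
    if is_non_commercial(license_id) and not accept_nc:
        raise LicenseRefused(
            f"{license_id} weights require an explicit --accept-nc opt-in"
        )


# Permissive weights load without the flag; NC weights need the opt-in.
check_ckpt_license("Apache-2.0", accept_nc=False)
check_ckpt_license("CC-BY-NC-SA-4.0", accept_nc=True)
try:
    check_ckpt_license("CC-BY-NC-SA-4.0", accept_nc=False)
    refused = False
except LicenseRefused:
    refused = True
```

Making the gate raise, rather than warn, keeps the default path from ever loading NC weights by accident.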

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured LICENSE_NOTICE.txt emission into the failure museum and stdout when an NC ckpt is loaded; see docs/ckpt_registry.md for the literal text.
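
A sidecar notice of the kind described above might be written as in the sketch below. The function name, the field layout, and the placeholder URL are assumptions; the literal attribution text lives in docs/ckpt_registry.md and is not reproduced here, and this sketch names the license rather than embedding a full copy of it.

```python
import tempfile
from pathlib import Path


def write_license_notice(out_wav, license_id, notice, project_url):
    """Write <out>.LICENSE_NOTICE.txt next to the output wav and return its path.

    Assumption: the sidecar name is the full output filename plus the suffix.
    """
    sidecar = Path(str(out_wav) + ".LICENSE_NOTICE.txt")
    lines = [
        f"License: {license_id}",
        f"Attribution: {notice}",
        f"Project page: {project_url}",
        "",
    ]
    sidecar.write_text("\n".join(lines), encoding="utf-8")
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_license_notice(
        Path(tmp) / "smoke.wav",
        "CC-BY-NC-SA-4.0",
        "OpenVPI Community / DiffSinger Community",
        "https://example.invalid/project-page",  # placeholder, not the real link
    )
    sidecar_name = sidecar.name
    sidecar_text = sidecar.read_text(encoding="utf-8")
    print(sidecar_text)  # stdout summary alongside the sidecar
```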

"OSS-1 accessible" invariant — Stage 0 result reframe

The Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:

OSS-1 (revised, 2026-05-12): 2 commands max to smoke wav. After pip install vocaboot[diffsinger,verify], running vocaboot demo --accept-nc must fetch a registry-pinned NC vocoder, synthesize a 5-second wav from the bundled examples/short.ds, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The demo subcommand is a Stage 1 step 2 deliverable. The --accept-nc requirement is the deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.
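
A CI job could check the revised OSS-1 invariant roughly as follows. This is a sketch: `check_smoke` is a hypothetical helper, and a stand-in child process replaces the real command so the example runs without vocaboot installed.

```python
import subprocess
import sys

# Exit codes that OSS-1 accepts, as stated in the invariant.
ACCEPTED = {0: "APPROVE", 5: "REVIEW"}


def check_smoke(cmd, budget_s=60.0):
    """Run cmd and report whether it met the exit-code and time budget."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    label = ACCEPTED.get(proc.returncode)
    if label is None:
        return False, f"exit {proc.returncode}"
    return True, label


def stand_in(code):
    """A stand-in child process that only exits with the given code."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


# In CI the command would be ["vocaboot", "demo", "--accept-nc"].
review = check_smoke(stand_in(5))
reject = check_smoke(stand_in(6))
```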

Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via [verify]). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
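
The verdict logic can be pictured with a percentile bootstrap over a per-frame error metric. Everything specific here is an assumption for illustration: the metric (F0 deviation in cents), the limit, the resample count, and the rule that maps a confidence interval to a verdict. The actual gate uses pyworld and its own R5 procedure; this sketch uses only the standard library.

```python
import random
import statistics

# Verdict-to-exit-code mapping from the text.
EXIT_CODES = {"APPROVE": 0, "REVIEW": 5, "REJECT": 6}


def bootstrap_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def verdict(ci, limit):
    """APPROVE if the whole interval is within the limit, REJECT if the whole
    interval is beyond it, REVIEW when the interval straddles the limit."""
    lo, hi = ci
    if hi <= limit:
        return "APPROVE"
    if lo > limit:
        return "REJECT"
    return "REVIEW"


# Hypothetical per-frame F0 deviations in cents, with a hypothetical 50-cent limit.
clean = [10.0] * 40
mixed = [0.0] * 10 + [100.0] * 10
for name, errors in (("clean", clean), ("mixed", mixed)):
    label = verdict(bootstrap_ci(errors), limit=50.0)
    print(name, label, EXIT_CODES[label])
```

A three-way verdict is what lets CI treat an interval that straddles the limit as "undetermined" (exit 5) instead of forcing it into pass or fail.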

Code review is a per-commit concern, not a per-synth one, and runs separately via CI (kluster + GitHub Actions). The original Round-35 design included a --verify=agent mode that invoked claude -p for static code review, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a [verify-similarity] extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as museum-annotation only — structurally excluded from the verdict by design.

--no-agent-verify exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.

Distribution

Stage 1 ships:

  • pip install vocaboot[diffsinger,verify] → either
    • vocaboot demo --accept-nc (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), or
    • vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc (BYO ckpt path).
  • import vocaboot → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

  • MCP server stdio — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
  • Anthropic Skill / Subagent — natural surface for Claude Code users; tradeoff is client coupling.
  • Headless library only (no CLI) — would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did not rule any of them out.

Related work (delta)

vocaboot is not the first attempt at headless DiffSinger inference. Pre-existing projects:

| Project | What it does | What vocaboot adds |
|---|---|---|
| openvpi DiffSingerMiniEngine | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| diffscope/dsinfer | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| HighCWu / Jobsecond ONNX-Infer | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| nnsvs/nnsvs | MIT SVS framework; weights per-voicebank | If Stage 0 reopens, nnsvs is the first re-target candidate (license-clean framework, weights TBD) |

If, during Stage 0 or 1, a clean integration into one of these projects becomes more valuable than a standalone framework, the project is willing to redirect toward a PR contribution.

Stage 1 (current target)

# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.

examples/short.ds uses generic CV phoneme tokens (d o r e m i f a s o); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts examples/short.ds to match its dictionary; for other ckpts, users must align phoneme tokens themselves.

--voice synthetic is an oscillator, not a voice

The Stage 1 synthetic-acoustic path produces a 4-partial harmonic oscillator driven by the score's note sequence. It is not vocal-tract-filtered voice synthesis — the ph_seq (phoneme) field is deliberately ignored, and the output has no vowel formants or consonants. Its purpose is to prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without depending on a license-clean acoustic checkpoint that does not exist (Stage 0 result B). Vowel-formant / phoneme-aware synthesis arrives with the BYO acoustic path in Stage 2.

A NUMERIC verify APPROVE on the synthetic path proves the pipeline rings, not that the output sounds like a voice. Stage 2 adds spectral/timbral metrics behind [verify-similarity] so APPROVE discriminates synthetic from real-voice output.
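
To make "4-partial harmonic oscillator" concrete, the sketch below renders a note sequence as a sum of four sine partials. It is an illustration only: the 1/k partial amplitudes, the MIDI note encoding, and the function names are assumptions, and it writes samples directly instead of producing the mel + f0 features that the real path hands to the vocoder.

```python
import math


def midi_to_hz(note: int) -> float:
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def harmonic_tone(f0_hz, duration_s, sample_rate=44100, n_partials=4):
    """Sum n_partials sine harmonics of f0_hz with 1/k amplitudes.

    The result is normalized so that samples stay within [-1, 1].
    """
    n = int(duration_s * sample_rate)
    norm = sum(1.0 / k for k in range(1, n_partials + 1))
    return [
        sum(
            math.sin(2 * math.pi * k * f0_hz * t / sample_rate) / k
            for k in range(1, n_partials + 1)
        ) / norm
        for t in range(n)
    ]


# Hypothetical score: (MIDI note, duration in seconds). Phonemes are ignored,
# mirroring how the synthetic path ignores ph_seq.
notes = [(60, 0.25), (62, 0.25), (64, 0.25)]
samples = [s for note, dur in notes for s in harmonic_tone(midi_to_hz(note), dur)]
```

Each note restarts its phase at zero, so note boundaries are not smoothed. That is acceptable for a pipeline check and is one more reason the output does not sound like a voice.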

Roadmap

  • Architecture lock-in (rounds 28–32; per-round audit trails are inlined into docs/stage_2.md ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in CHANGELOG.md)
  • Stage 0: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
  • Stage 1 step 2 (5c): vocaboot demo subcommand wired (click.group, bundled examples/short.ds, --accept-nc mandatory, SHA-256 runtime verify of the pinned vocoder).
  • Stage 1 step 2 (5d): LICENSE_NOTICE.txt attribution emission for the NC vocoder (sidecar <out>.LICENSE_NOTICE.txt + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
  • Stage 1: smoke wav (CPU, 5 s) via vocaboot demo --accept-nc (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
  • Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
  • Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in test_stage1_synthetic_smoke_wav_end_to_end already serve as the CI canary.
  • Stage 1.5 closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) stage_1.py → pipeline.py rename, (B) privacy.refuse_user_audio_without_consent + museum user-audio hard-exclude rule, (C) docs/quality_gate.md Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) docs/ckpt_registry.md acquisition Path B (NC-only public release) chosen, (E) docs/stage_2.md spec.
  • Stage 2 (docs/stage_2.md): profile_match module + audio_ingest module + Tier 2 verify metrics (mcd, wavlm_cosine).
  • Stage 2.5: MCP server (vocaboot mcp-serve) — thin wrapper around pipeline.run.
  • Stage 3: SVC fallback chain (DDSP-SVC primary) + self-train recipe (acquisition Path C).
  • Stage 4: natural-language interface.
  • Public OSS release on Path B (NC-only, local screencast demo; HF Space deferred pending Default-tier ckpt).

Non-goals

  • Pitch-perfect voice cloning (targeting a literal match is a known failure mode)
  • Bundled model weights (we point at upstream URLs and verify SHA-256)
  • GUI
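
Because weights are fetched from upstream rather than bundled, the SHA-256 check mentioned above does the integrity work. A minimal sketch, assuming hypothetical helper names and using a temporary file in place of a downloaded checkpoint:

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"SHA-256 mismatch: expected {expected_hex}, got {actual}")


# Stand-in for a downloaded checkpoint and its registry-pinned digest.
payload = b"not a real checkpoint"
pinned = hashlib.sha256(payload).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "vocoder.ckpt")
    with open(ckpt, "wb") as handle:
        handle.write(payload)
    verify_download(ckpt, pinned)  # passes silently
    try:
        verify_download(ckpt, "0" * 64)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```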

License

Code: Apache-2.0. Model weights: inherit from upstream — vocaboot itself bundles none.
