Terminal-first singing voice synthesis framework for the Claude Code era.
vocaboot
Singing voice synthesis from your terminal, callable from Claude Code via MCP.
Status: 0.3.1 Beta (pip install vocaboot==0.3.1, PyPI live 2026-05-14). DiffSinger BYO acoustic provider wired. 227 passed / 4 skipped / 0 failed (231 collected, pytest -q).
30-second demo
pip install 'vocaboot[diffsinger,verify]'
vocaboot demo --accept-nc # 5s wav, CPU; first run fetches ~50 MB NSF-HiFiGAN vocoder
# --accept-nc opts you into the CC-BY-NC-SA-4.0 vocoder weights — personal / non-commercial only.
🎧 Listen to a pre-generated demo (5s, 60 KB MP3).
Honest disclosure: Stage 1's --voice synthetic path is a 4-partial harmonic oscillator passed through the NSF-HiFiGAN vocoder — it proves the pipeline rings end-to-end and is the zero-dependency smoke target. Voice synthesis itself runs via the DiffSinger BYO acoustic provider (--voice <path>/acoustic.onnx, services/svs_engine.py:229 3-provider dispatch; landed in v0.3.0). Acquisition guide: docs/diffsinger_byo_spec.md. The NC posture (--accept-nc) still applies because the only license-clean vocoder (NSF-HiFiGAN) is CC-BY-NC-SA-4.0; the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) is a future-cycle roadmap item, not a promise. Commercial-OK usage is NOT shipped in 0.x — the path is self-train (Stage 0 result B).
Why use vocaboot
- Terminal-first. Every operation works headlessly. No GUI, no web UI, no Electron.
- Claude-Code-MCP ready. vocaboot mcp-serve exposes synth + version tools over JSON-RPC stdio so an agent loop can drive synthesis directly. Setup: docs/mcp_setup.md.
- $0 verify gate. Every synth runs pyworld F0 / HNR with R5 bootstrap confidence intervals. APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI can distinguish "undetermined" from "rejected".
- Privacy by default. audio_ingest refuses paths matching DEFAULT_PROTECTED_SUBSTRINGS or your ~/.config/vocaboot/protected_patterns.txt, and requires explicit consent flags before persisting anything (see src/vocaboot/privacy.py).
- Failure museum. Every failed run writes a scrubbed _failures/<id>/metadata.json (paths reduced to {basename, sha256}) so the next attempt cannot repeat the same mistake. Raw audio is never committed.
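The privacy bullet above can be pictured with a minimal sketch of substring-based path refusal. The example patterns, the helper names (load_user_patterns, is_protected), and the one-pattern-per-line file format are assumptions for illustration only; the actual logic lives in src/vocaboot/privacy.py and may differ.

```python
from pathlib import Path

# Hypothetical defaults: the real DEFAULT_PROTECTED_SUBSTRINGS is not listed in this README.
EXAMPLE_PROTECTED_SUBSTRINGS = ("/private_voice/", "/recordings/")


def load_user_patterns(pattern_file: Path) -> tuple[str, ...]:
    """Read one substring per line, skipping blanks and '#' comments (assumed format)."""
    if not pattern_file.is_file():
        return ()
    lines = pattern_file.read_text(encoding="utf-8").splitlines()
    return tuple(s.strip() for s in lines if s.strip() and not s.lstrip().startswith("#"))


def is_protected(path: str, patterns: tuple[str, ...]) -> bool:
    """True when any protected substring occurs in the POSIX-style form of the path."""
    normalized = Path(path).expanduser().as_posix()
    return any(p in normalized for p in patterns)
```

An ingest step built this way would check is_protected before opening the file, so a refused path is never read at all.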
Quickstart with Claude Desktop
Add to ~/.config/Claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
  "mcpServers": {
    "vocaboot": {
      "command": "vocaboot",
      "args": ["mcp-serve", "--accept-nc"]
    }
  }
}
Restart Claude Desktop. Ask Claude: "Synthesize a singing voice from examples/short.ds and save to ~/out.wav." Claude routes the request to the synth tool via MCP. The ingest family (user-voice consumption, CLI subcommand vocaboot ingest) is deliberately not exposed — an LLM cannot emit a human-signal consent flag, so that boundary stays CLI-only (docs/stage_2.md "MCP boundary invariant").
Full setup, smithery.ai registry posting, and troubleshooting: docs/mcp_setup.md.
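For readers wiring their own agent loop instead of Claude Desktop, here is a hedged sketch of what a single tool invocation looks like on the wire. The tools/call method and the name / arguments envelope come from the MCP specification; the argument keys score and out are hypothetical placeholders, so check docs/mcp_setup.md for the real schema. A real client must also complete the MCP initialize handshake before sending this message.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message, separators=(",", ":"))


# Argument names are hypothetical; only the tool name "synth" comes from this README.
line = tools_call_request(1, "synth", {"score": "examples/short.ds", "out": "out.wav"})
print(line)
```

Over the stdio transport each message is written as one line to the server's stdin, which is why the sketch serializes without embedded newlines.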
How vocaboot fits in
Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI. vocaboot is the bridge: a single pip install and you can synthesize from your shell, from a Python import, or from an MCP-aware agent.
vocaboot does not out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate, a privacy guard with structural user-audio exclusion, and an auto-failure museum. See "Related work (delta)" below.
Design constraints
- Cost: $0 (inference path) — CPU-only numeric verify is the $0 invariant. The self-train path (see "Stage 0 result") is not $0 by definition: it requires user GPU compute and audio data.
- License: Apache-2.0 (code) — model weights inherit their upstream license; no NC weights bundled.
- Generality: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
- "Roughly close" is good enough: chasing literal pitch-perfect voice cloning is a structural failure mode (see _failures/); the framework optimizes for direction of a target timbre, not literal match.
- Auto-failure museum (R8): every failed run writes a structured post-mortem under _failures/, so the next attempt cannot repeat the same mistake. Paths are scrubbed to {basename, sha256} before persistence (R15); cyclic exception detail is replaced with <cycle> placeholders so the museum itself never raises.
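The scrubbing rules in the museum bullet can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, and the digest is taken over the path string here because the README does not say whether sha256 covers the path or the file contents.

```python
import hashlib
from pathlib import PurePosixPath


def scrub_path(path: str) -> dict:
    """Reduce a path to its basename plus a digest (assumption: digest of the path string)."""
    return {
        "basename": PurePosixPath(path).name,
        "sha256": hashlib.sha256(path.encode("utf-8")).hexdigest(),
    }


def scrub(value, _seen=None):
    """Recursively copy dicts/lists, replacing a container that contains itself with '<cycle>'."""
    seen = set() if _seen is None else _seen
    if isinstance(value, (dict, list)):
        if id(value) in seen:
            return "<cycle>"  # we are already inside this container
        seen.add(id(value))
        if isinstance(value, dict):
            result = {k: scrub(v, seen) for k, v in value.items()}
        else:
            result = [scrub(v, seen) for v in value]
        seen.discard(id(value))  # shared but non-cyclic references are still copied
        return result
    return value
```

Because the cycle check happens before recursion, serializing the scrubbed copy to metadata.json cannot recurse forever, which is the property the "museum itself never raises" invariant needs.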
Architecture overview (locked, 4 stages, strict superset)
| Stage | What works | Modules added |
|---|---|---|
| 0 | ckpt acquisition gate, closed 2026-05-12 with result (B): 0 license-clean pairs; project ships --accept-nc opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, agent-verify gate | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= 11 layer modules; cli.py is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / SoulX-Singer plugin) |
| 4 | Natural-language input (vocaboot "sing 'Yume' like a soft anime voice") | + nl_parse |
Hard ceilings (anti-bloat invariant): layers ≤ 9, layer modules ≤ 12 (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Stage 4 lands at 14 layer modules → at the latest by Stage 3, score_align will be folded into input and agent_advisory into agent_verify to keep the ceiling intact.
Stage 0 result (2026-05-12, closed: B)
Stage 0 ckpt research closed with 0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs for the DiffSinger format:
- vocoder: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are not drop-in compatible with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
- acoustic: every public DiffSinger community voicebank (qixuan, renri, nyaru, mitsudate…) is CC-BY-NC-SA-4.0. The openvpi code is Apache-2.0; the weights are not.
vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the substantive difference must be stated: vocaboot has no pip install-and-go commercial-OK default path, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:
| Path | Tier | $0? | Use |
|---|---|---|---|
| --accept-nc opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | No — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |
The default tier in docs/ckpt_registry.md stays empty until upstream licensing changes. Engine pivot (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See docs/ckpt_registry.md for the deferral rationale.
Output wav licensing (NC opt-in path)
When --accept-nc is set, the resulting .wav is the user's copyright, and CC's public position is that CC license conditions on the model do not automatically extend to model outputs (Creative Commons FAQ, Using CC-licensed Works for AI Training, 2025). However: the act of running NC weights is itself bound by the NC clause, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless --accept-nc is explicitly passed.
The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured LICENSE_NOTICE.txt emission into the failure museum and stdout when an NC ckpt is loaded; see docs/ckpt_registry.md for the literal text.
"OSS-1 accessible" invariant — Stage 0 result reframe
Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:
OSS-1 (revised, 2026-05-12): 2 commands max to smoke wav. After pip install vocaboot[diffsinger,verify], running vocaboot demo --accept-nc must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled examples/short.ds, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.
The demo subcommand is a Stage 1 step 2 deliverable. The --accept-nc requirement is the deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.
Verify gate ($0, no Claude dependency)
The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via [verify]). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
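To make the three-way verdict concrete, here is a minimal sketch of a percentile bootstrap confidence interval feeding a verdict. It is not the project's R5 implementation: the statistic (mean of per-frame values), the threshold, and the function names are assumptions. Only the exit-code mapping is taken from this README.

```python
import random
from statistics import fmean

EXIT_CODES = {"APPROVE": 0, "REVIEW": 5, "REJECT": 6}  # mapping stated in this README


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(fmean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def verdict(samples, threshold):
    """APPROVE if the whole CI clears the (hypothetical) threshold, REJECT if none of it does."""
    lo, hi = bootstrap_ci(samples)
    if lo >= threshold:
        return "APPROVE"
    if hi < threshold:
        return "REJECT"
    return "REVIEW"  # CI straddles the threshold: undetermined
```

The design point is that REVIEW is not a soft REJECT: it means the interval straddles the threshold, so a CI job can route exit code 5 to a human instead of failing the build the way it would on exit code 6.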
Code review is not a per-synth concern — it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a --verify=agent mode that invoked claude -p for static codereview, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).
Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a [verify-similarity] extra so the default Stage 1 install stays footprint-minimal.
Agent C (qualitative LLM-as-judge advisory) still runs as museum-annotation only — structurally excluded from the verdict by design.
--no-agent-verify exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.
Distribution
Stage 1 ships:
- pip install vocaboot[diffsinger,verify] → either vocaboot demo --accept-nc (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), or vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc (BYO ckpt path).
- import vocaboot → Python library (same package)
Treated as Stage 2 retrospective decisions (no irrevocable rejection now):
- MCP server stdio — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
- Anthropic Skill / Subagent — natural surface for Claude Code users; tradeoff is client coupling.
- Headless library only (no CLI) — would simplify install footprint if the CLI proves redundant against MCP/Skill.
Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did not rule any of them out.
Related work (delta)
vocaboot is not the first OSS SVS/SVC project. The slot it occupies is terminal- and agent-loop-friendly wrapping around upstream engines, not engine R&D itself. Direct comparisons:
| Project | What it does | What vocaboot adds |
|---|---|---|
| SVC engines (Stage 3 plugin set) | | |
| RVC-Project / Retrieval-based-Voice-Conversion-WebUI (35.6k★, MIT, large community voicebank set) | GUI-first realtime SVC, voicebanks per-license-audited | Terminal & MCP surface, $0 numeric verify gate, R5 CI, failure museum, agent-callable; vocaboot wraps RVC v2 as a plugin, not competing on engine quality |
| yxlllc / DDSP-SVC (2.5k★, MIT, CPU-capable) | Lightweight DDSP-based SVC, mature CLI inference | Primary engine of record for Stage 3; vocaboot adds protocol bifurcation, registry-pinned weights, and verify-against-target post-conversion gate |
| Soul-AILab / SoulX-Singer (608★, Apache-2.0 code + weights, 42 000 h vocal train) | Transcription-free zero-shot SVC; Apache-2.0 across the board (only one in the plugin set) | Default-tier candidate; vocaboot integrates as remote (HF Space) or subprocess depending on the 5.6 GB / GPU envelope (see docs/stage_3.md for the A/B/C distribution-form simulation) |
| SVS engines (Stage 1 plugin set) | | |
| openvpi / DiffSingerMiniEngine | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| diffscope / dsinfer | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| Jobsecond / diffsinger-onnx-infer | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| nnsvs / nnsvs | MIT SVS framework; weights per-voicebank | Stage 0 re-target candidate if upstream weight licensing changes |
vocaboot does not try to out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate (R5 bootstrap CI), a privacy guard with structural user-audio exclusion, and an auto-failure museum. If a clean integration into one of the above projects becomes more valuable than a standalone wrapper, vocaboot will redirect toward a PR contribution.
Stage 1 (current target)
# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc
# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
--voice <ckpt_path> \
--score examples/short.ds \
--out smoke.wav \
--accept-nc \
--ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.
examples/short.ds uses generic CV phoneme tokens (d o r e m i f a s o); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts examples/short.ds to match its dictionary; for other ckpts, users must align phoneme tokens themselves.
Two acoustic paths: synthetic (smoke) and DiffSinger BYO (voice)
--voice synthetic produces a 4-partial harmonic oscillator driven by the score's note sequence — ph_seq is deliberately ignored, output has no vowel formants or consonants. Purpose: prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without a checkpoint. A NUMERIC verify APPROVE on the synthetic path proves the pipeline rings, not that the output sounds like a voice.
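A rough sketch of what "4-partial harmonic oscillator driven by the note sequence" means as a signal. This is a simplification: it renders straight to a waveform, whereas the real synthetic path produces mel + f0 and hands them to the NSF-HiFiGAN vocoder. The 1/k partial amplitudes and the function names are assumptions.

```python
import math


def harmonic_tone(f0_hz: float, duration_s: float, sample_rate: int = 44100,
                  n_partials: int = 4) -> list[float]:
    """Sum the first `n_partials` harmonics of f0 with 1/k amplitudes, scaled so |x| <= 1."""
    n_samples = int(duration_s * sample_rate)
    norm = sum(1.0 / k for k in range(1, n_partials + 1))
    return [
        sum(math.sin(2.0 * math.pi * k * f0_hz * i / sample_rate) / k
            for k in range(1, n_partials + 1)) / norm
        for i in range(n_samples)
    ]


def render_notes(notes: list[tuple[float, float]], sample_rate: int = 44100) -> list[float]:
    """Concatenate one tone per (f0_hz, duration_s) note; phonemes play no role."""
    wav: list[float] = []
    for f0_hz, duration_s in notes:
        wav.extend(harmonic_tone(f0_hz, duration_s, sample_rate))
    return wav
```

Nothing in this signal depends on ph_seq, which is why the output carries pitch but no vowel formants or consonants.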
--voice <path>/acoustic.onnx activates the DiffSinger BYO acoustic provider (v0.3.0+, services/svs_engine.py:229): acoustic.onnx + sibling phonemes.json are loaded by onnxruntime CPU, ph_seq / ph_dur / note_seq are converted to (tokens: int64, durations: int64, f0: float32) per the openvpi v2.3.0 contract, and the mel [1, n_frames, 128] is handed to NSF-HiFiGAN. This is real vocal-tract-filtered synthesis. Acquisition guide + on-disk layout + failure-mode table: docs/diffsinger_byo_spec.md.
G1 wiring-sanity gate (NOT a voice-quality gate): tests/integration/test_byo_acoustic.py asserts wav.size ≥ 0.75 × expected + np.isfinite(wav).all(). This proves the BYO ckpt is wire-loadable and produces a finite waveform; it does not assert that the output is voice-like or matches the input voicebank. Voice-quality objective evaluation (wavlm_cosine ≥ 0.60 + MCD CI upper bound ≤ 500 dB librosa-MFCC scale, see docs/quality_gate.md:14,:45, on N≥3 voicebanks) is post-0.3.0 work tracked in CHANGELOG.md under ## [Unreleased].
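The G1 assertion is small enough to restate as a standalone check. The sketch below uses stdlib equivalents of the wav.size and np.isfinite(wav).all() expressions quoted above; the function name is hypothetical.

```python
import math


def g1_wiring_sanity(wav: list[float], expected_samples: int) -> bool:
    """True when the waveform is long enough and every sample is finite."""
    long_enough = len(wav) >= 0.75 * expected_samples
    all_finite = all(math.isfinite(x) for x in wav)
    return long_enough and all_finite
```

Note that a buffer of pure silence passes this check, which illustrates why G1 is a wiring gate and says nothing about voice quality.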
Stage 2 spectral/timbral metrics behind [verify-similarity] discriminate synthetic from real-voice output post-synth (wavlm_cosine + MCD with 2-sided threshold band).
Stage 3 DDSP-SVC voice conversion (v0.2.0-rc1, BYO)
pip install vocaboot[svc-ddsp]
git clone https://github.com/yxlllc/DDSP-SVC.git ~/ddsp-svc
# ...follow upstream README to install its requirements, train or
# download a ckpt, and compute sha256sum model.pt
export VOCABOOT_DDSP_PATH=~/ddsp-svc/exp/combsub-test/model_300000.pt
export VOCABOOT_DDSP_SHA256=<digest>
export VOCABOOT_DDSP_SRC=~/ddsp-svc
vocaboot ddsp provision # verify sha256 + weights-only probe
vocaboot ddsp convert --source singer.wav --out converted.wav
The convert subcommand walks a 4-stage refusal ladder (unconfirmed
sha256 → ckpt not found → hash mismatch → engine not wired) and
dispatches model.forward to the upstream subprocess. An in-process
vendored adapter making the VOCABOOT_DDSP_SRC step optional is on
the post-1.0 roadmap (no version promise; tracked under v1.1.0+
deferred items below).
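The ordering of the refusal ladder can be sketched as a function that returns the first failing rung. The rung names come from the paragraph above; the function signature and the reading of "engine not wired" as "upstream source directory missing" are assumptions, not the project's actual code.

```python
import hashlib
from enum import Enum
from pathlib import Path
from typing import Optional


class Refusal(Enum):
    UNCONFIRMED_SHA256 = "unconfirmed sha256"
    CKPT_NOT_FOUND = "ckpt not found"
    HASH_MISMATCH = "hash mismatch"
    ENGINE_NOT_WIRED = "engine not wired"


def first_refusal(ckpt_path: Optional[str], expected_sha256: Optional[str],
                  engine_src: Optional[str]) -> Optional[Refusal]:
    """Return the first failing rung of the ladder, or None if conversion may proceed."""
    if not expected_sha256:
        return Refusal.UNCONFIRMED_SHA256
    if not ckpt_path or not Path(ckpt_path).is_file():
        return Refusal.CKPT_NOT_FOUND
    actual = hashlib.sha256(Path(ckpt_path).read_bytes()).hexdigest()
    if actual != expected_sha256.lower():
        return Refusal.HASH_MISMATCH
    # Assumption: "engine not wired" means the upstream checkout is absent.
    if not engine_src or not Path(engine_src).is_dir():
        return Refusal.ENGINE_NOT_WIRED
    return None
```

Checking the digest commitment first means a user who never exported a sha256 is refused before any file on disk is touched.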
Completion criteria (OSS 1k★ level, R14 round 5 concept-level, 2026-05-14 snapshot)
vocaboot counts as "completed" when it reaches the level needed to earn roughly 1k OSS stars, judged by 5 product-level criteria. These are outward-facing conditions layered on top of the v0.1.0 alpha internal-quality gates (R5 verify / R8 museum / R15 privacy), derived by working backward from what comparable SVS/SVC projects (RVC 35k★ / MoonInTheRiver/DiffSinger 4.7k★ / OpenUtau 3.8k★ / openvpi/DiffSinger 3.1k★ / DDSP-SVC 2.5k★) have in common.
- C1. Sound within 30 seconds of pip install vocaboot[...]: v0.2.0 opened the DiffSinger ONNX BYO path, and v0.3.0 completed the BYO acoustic provider. A fully no-NC OSS demo default arrives in the future cycle in which the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) lands (outstanding round 8 critic finding; must land before v1.0.0 ships).
- C2. The output "sounds like a singing voice": v0.3.0 landed the DiffSinger BYO acoustic provider (services/svs_engine.py:229 3-provider dispatch, acoustic.onnx + phonemes.json companion, openvpi v2.3.0 baseline). The v0.3.0 ship gate is the G1 wiring sanity gate (tests/integration/test_byo_acoustic.py: wav.size ≥ 0.75 × expected + np.isfinite(wav).all(); it makes no voice-quality claim). The v1.0.0 ship gate additionally requires the G2 objective gate (wavlm_cosine ≥ 0.60 + MCD CI upper bound ≤ 500 dB librosa-MFCC scale, see docs/quality_gate.md:14,:45, on N≥3 voicebanks) plus a maintainer-attested N=1 listening pass; community feedback and real-ckpt verification accumulate over the 0.x lifetime.
- C3. A working demo wav in the README (<audio> embed) + a 30-second mp4 demo: partially met by linking examples/demo_output.mp3 (5s, 60 KB) at the top of the README. The mp4 demo belongs to the v1.0.0 candidate cycle.
- C4. A 1-click try page on HF Space / Replicate: via Stage 3 SoulX distribution form C, future cycle.
- C5. Launch posts on Twitter/X + HN + Reddit r/MachineLearning: carried out at the v1.0.0 release; C3-C4 must land before then.
vocaboot's own axis of "terminal-first + verify gate + failure museum" delivers developer-tooling value and reaches the 20% of the OSS star audience that is the dev community, whereas C1-C5 are required for the remaining 80% non-engineer audience. v1.0.0 completion-declaration criteria (concept-level, future cycle): A (DiffSinger BYO acoustic provider, landed in v0.3.0) + G2 objective gate pass (real ckpt, N≥3) + MIT vocoder bridge landed (resolving the NC posture) + maintainer-attested voice listening + R14 round 9 session-separation audit pass + final packaging. The 0.3.0 → 1.0.0 jump uses at minimum (G2 pass + MIT bridge landed) as its ship gate.
Roadmap
- Architecture lock-in (rounds 28–32; per-round audit trails are inlined into docs/stage_2.md ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in CHANGELOG.md)
- Stage 0: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
- Stage 1 step 2 (5c): vocaboot demo subcommand wired (click.group, bundled examples/short.ds, --accept-nc mandatory, SHA-256 runtime verify of the pinned vocoder).
- Stage 1 step 2 (5d): LICENSE_NOTICE.txt attribution emission for the NC vocoder (sidecar <out>.LICENSE_NOTICE.txt + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
- Stage 1: smoke wav (CPU, 5 s) via vocaboot demo --accept-nc (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
- Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e because of LLM-as-judge fragility, with code review delegated to CI).
- Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review because OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in test_stage1_synthetic_smoke_wav_end_to_end already serve as the CI canary.
- Stage 1.5 closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) stage_1.py → pipeline.py rename, (B) privacy.refuse_user_audio_without_consent + museum user-audio hard-exclude rule, (C) docs/quality_gate.md Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) docs/ckpt_registry.md acquisition Path B (NC-only public release) chosen, (E) docs/stage_2.md spec.
- Public OSS release (2026-05-13): Apache-2.0 LICENSE pristine, NOTICE chain audited, GitHub license.spdx_id=apache-2.0 restoration confirmed. 4 commits (c267afd + 18733d1 + 9f0ae47 + 26a3589) on origin/main.
- v0.2.0 release tag (2026-05-13, current: 0.2.1): C1+C2 BYO-scope achieved + Stage 3 DDSP-SVC real inference wiring complete (R14 round 5 concept-level judgment). PyPI live: pip install vocaboot==0.2.1.
- Stage 2.5 MCP server (vocaboot mcp-serve, bundled in v0.2.1, 2026-05-13): a thin wrapper over pipeline.run (services/mcp_server.py). The [mcp] extra wires in mcp>=0.9 and provides the synth / version tools over stdio JSON-RPC. audio_ingest and the consent-gated family are not exposed, per the MCP boundary invariant (docs/stage_2.md); accept_nc is fixed as a human signal at server start via --accept-nc.
- Stage 2 closure (2026-05-13, labeled in v0.2.1): the 9 acceptance criteria in docs/stage_2.md (Tier 2 canary / discrimination / consent gate / VoiceProfile round-trip / no Stage 1 regression / layer count / WavLM pin / SpeakerSimilarityProtocol) all had green test gates as of 0.2.0.post2 and are stated explicitly in the changelog as of 0.2.1.
v0.3.0 (released 2026-05-14, BYO acoustic provider landing)
- PR0 prerequisite: mypy strict configuration + 8 genuine bugs fixed (latent bugs such as pipeline.py:66/165 were detected by an actual mypy strict run), a full rewrite of the design document (docs/round8_w1_a_design.md), and G2 wording turned into an objective gate.
- W1 = A: DiffSinger BYO acoustic provider at services/svs_engine.py:229 (openvpi/DiffSinger v2.3.0 baseline, via acoustic.onnx + phonemes.json, ~140 LoC, 21 unit + 1 integration test).
- packaging: NC posture explicit, PyPI 0.3.0 shipped. The G1 wiring sanity gate (tests/integration/test_byo_acoustic.py) was adopted as the 0.3.0 ship gate.
Toward v1.0.0 (concept-level, future cycle, NOT 0.3.0 ship blockers)
Per the audit by both agents (critic + analyst), the 1.0.0 ship blockers are: G2 objective gate pass + MIT vocoder bridge landed + maintainer-attested voice listening + R14 round 9 session-separation audit + an API stability commitment in place. These accumulate over the v1.0.0 candidate cycle so that 1.0.0 ships in a state where the "production stable" claim and the reality do not diverge. Inserting 0.4.0 / 0.5.0 / ... between 0.3.0 and 1.0.0 remains an open option.
Deferred (future cycles, no version promises)
- B: Stage 3 SVC remaining (services/svc/rvc.py RVC v2 per-voicebank license gate + services/svc/soulx.py SoulX-Singer Apache-2.0 subprocess + HF Space remote).
- C: Stage 4 natural-language interface (NL prompt → DiffSinger .ds JSON, [nl] extra).
- MIT vocoder bridge (BigVGAN / Vocos + mel-format adapter): the real path to removing the NC default. v1.0.0 ship gate (resolves the outstanding round 8 critic finding).
- G2 objective gate verification: measure wavlm_cosine / MCD with a real openvpi DiffSinger ckpt on N≥3 voicebanks. v1.0.0 ship gate.
- Round 9 R14 session-separation audit: rounds 7/8/W1 all ran inside the same originSessionId 850adffc, leaving an integrity residual (critic C1.1 residual 2.5/10). To be run in a separate session; v1.0.0 ship gate.
- DiffSinger fork variants (mel_bin=80 etc., upstream schema evolution): handled via the mel-format adapter.
- Completion C3-C5: 30s mp4 demo + HF Space 1-click + HN/Reddit launch.
- DDSP vendored adapter: replace the subprocess driven via VOCABOOT_DDSP_SRC with an in-process adapter.
- MCP-server-start telemetry instrumentation: to be implemented as an observe-only PR.
Non-goals
- Pitch-perfect voice cloning (target literal match is a known failure mode)
- Bundled model weights (we point at upstream URLs and verify SHA-256)
- GUI
Security: malicious ckpt threat model
vocaboot loads PyTorch checkpoint files (.pt) supplied by the user via
VOCABOOT_DDSP_PATH / VOCABOOT_VOCODER_PATH / VOCABOOT_WAVLM_PATH. A
malicious .pt file can embed pickle payloads that execute arbitrary
code at deserialization time (OWASP A08 / CWE-502 — insecure
deserialization). Defenses:
- sha256 verify before load: core.ckpt_registry.verify() refuses any ckpt whose bytes don't match the user-committed digest (VOCABOOT_*_SHA256). A man-in-the-middle download swap is caught at the byte boundary, before PyTorch's parser is reached.
- torch.load(weights_only=True): the default in core.ckpt_registry.safe_torch_load(). PyTorch's weights-only deserializer refuses arbitrary classes; legitimate ckpts bundling argparse.Namespace (e.g. DDSP-SVC v5.0) trigger UnsafeCkptError so the user explicitly opts in to allow_unsafe=True rather than silently widening the trust boundary.
- Subprocess isolation for upstream inference: vocaboot ddsp convert invokes yxlllc/DDSP-SVC's main.py in a child process (VOCABOOT_DDSP_SRC). The weights are loaded in upstream's address space; a successful exploit would still need to break out of the subprocess to affect vocaboot's process.
Social-engineering caveat (cannot be defended automatically): if a
user is socially engineered into exporting an adversary-supplied
digest into VOCABOOT_*_SHA256 AND passing allow_unsafe=True, no
in-tree defense catches it. Pin sha256 values from sources you trust
— cross-check against docs/ckpt_registry.md or the upstream model
card before exporting.
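The first defense, digest verification before any deserializer runs, can be sketched with the standard library alone. This is an illustration of the technique, not the body of core.ckpt_registry.verify(); the function names and the ValueError are assumptions.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_before_load(path: Path, expected_hex: str) -> Path:
    """Raise before any deserializer touches the bytes if the digest differs."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"sha256 mismatch for {path.name}")
    return path
```

Only a path returned by verify_before_load would then be handed to a loader such as torch.load(path, weights_only=True), so the two defenses stack instead of substituting for each other.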
License
Code: Apache-2.0.
Model weights: inherit from upstream — vocaboot itself bundles none.