Terminal-first singing voice synthesis framework for the Claude Code era.

vocaboot

Singing voice synthesis from your terminal, callable from Claude Code via MCP.

Status: 0.3.1 Beta (pip install vocaboot==0.3.1, PyPI live 2026-05-14). DiffSinger BYO acoustic provider wired. 227 passed / 4 skipped / 0 failed (231 collected, pytest -q).

30-second demo

pip install 'vocaboot[diffsinger,verify]'
vocaboot demo --accept-nc            # 5s wav, CPU; first run fetches ~50 MB NSF-HiFiGAN vocoder
# --accept-nc opts you into the CC-BY-NC-SA-4.0 vocoder weights — personal / non-commercial only.

🎧 Listen to a pre-generated demo (5s, 60 KB MP3).

Honest disclosure: Stage 1's --voice synthetic path is a 4-partial harmonic oscillator passed through the NSF-HiFiGAN vocoder — it proves the pipeline rings end-to-end and is the zero-dependency smoke target. Voice synthesis itself runs via the DiffSinger BYO acoustic provider (--voice <path>/acoustic.onnx, services/svs_engine.py:229 3-provider dispatch; landed in v0.3.0). Acquisition guide: docs/diffsinger_byo_spec.md. The NC posture (--accept-nc) still applies because the only license-clean vocoder (NSF-HiFiGAN) is CC-BY-NC-SA-4.0; the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) is a future-cycle roadmap item, not a promise. Commercial-OK usage is NOT shipped in 0.x — the path is self-train (Stage 0 result B).

Why use vocaboot

Terminal-first. Every operation works headlessly. No GUI, no web UI, no Electron.
Claude-Code-MCP ready. vocaboot mcp-serve exposes synth + version tools over JSON-RPC stdio so an agent loop can drive synthesis directly. Setup: docs/mcp_setup.md.
$0 verify gate. Every synth runs pyworld F0 / HNR with R5 bootstrap confidence intervals. APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI can distinguish "undetermined" from "rejected".
Privacy by default. audio_ingest refuses paths matching DEFAULT_PROTECTED_SUBSTRINGS or your ~/.config/vocaboot/protected_patterns.txt, and requires explicit consent flags before persisting anything (see src/vocaboot/privacy.py).
Failure museum. Every failed run writes a scrubbed _failures/<id>/metadata.json (paths reduced to {basename, sha256}) so the next attempt cannot repeat the same mistake. Raw audio is never committed.
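
A CI job could branch on the three verdict exit codes roughly as follows. This is a minimal sketch: `Verdict`, `classify_exit_code`, and `run_synth` are illustrative names, not part of vocaboot, and treating every other exit code as a generic tool error is an assumption.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    """Exit codes documented for the verify gate."""
    APPROVE = 0
    REVIEW = 5
    REJECT = 6


def classify_exit_code(code: int) -> str:
    """Map a process exit code to a verdict label."""
    try:
        return Verdict(code).name
    except ValueError:
        # Assumption: anything else is a tool failure, not a verdict.
        return "ERROR"


def run_synth(args: list[str]) -> str:
    """Run a vocaboot command line and classify how it exited."""
    completed = subprocess.run(["vocaboot", *args], check=False)
    return classify_exit_code(completed.returncode)
```

A pipeline might then pass on `APPROVE`, flag the artifact for a human on `REVIEW`, and fail the job on `REJECT` or `ERROR`.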

Quickstart with Claude Desktop

Add to ~/.config/Claude/claude_desktop_config.json (Linux) or ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "vocaboot": {
      "command": "vocaboot",
      "args": ["mcp-serve", "--accept-nc"]
    }
  }
}

Restart Claude Desktop. Ask Claude: "Synthesize a singing voice from examples/short.ds and save to ~/out.wav." Claude routes the request to the synth tool via MCP. The ingest family (user-voice consumption, CLI subcommand vocaboot ingest) is deliberately not exposed — an LLM cannot emit a human-signal consent flag, so that boundary stays CLI-only (docs/stage_2.md "MCP boundary invariant").
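
If the config file already lists other MCP servers, the entry can be merged in rather than pasted by hand. The helper below is a sketch only: `add_vocaboot_server` is a hypothetical name, and it assumes the file, when present, contains valid JSON.

```python
import json
from pathlib import Path


def add_vocaboot_server(config_path: Path) -> dict:
    """Merge the vocaboot MCP entry into a Claude Desktop config file."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    # Keep any servers that are already configured.
    servers = config.setdefault("mcpServers", {})
    servers["vocaboot"] = {
        "command": "vocaboot",
        "args": ["mcp-serve", "--accept-nc"],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On Linux this would be called with `Path.home() / ".config/Claude/claude_desktop_config.json"`, the path given above.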

Full setup, smithery.ai registry posting, and troubleshooting: docs/mcp_setup.md.

How vocaboot fits in

Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI. vocaboot is the bridge: a single pip install and you can synthesize from your shell, from a Python import, or from an MCP-aware agent.

vocaboot does not out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate, a privacy guard with structural user-audio exclusion, and an auto-failure museum. See "Related work (delta)" below.

Design constraints

Cost: $0 (inference path) — CPU-only numeric verify is the $0 invariant. The self-train path (see "Stage 0 result") is not $0 by definition: it requires user GPU compute and audio data.
License: Apache-2.0 (code) — model weights inherit their upstream license; no NC weights bundled.
Generality: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
"Roughly close" is good enough: chasing literal pitch-perfect voice cloning is a structural failure mode (see _failures/); the framework optimizes for direction of a target timbre, not literal match.
Auto-failure museum (R8): every failed run writes a structured post-mortem under _failures/, so the next attempt cannot repeat the same mistake. Paths are scrubbed to {basename, sha256} before persistence (R15); cyclic exception detail is replaced with <cycle> placeholders so that the museum itself never raises.
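
The two scrubbing rules in the museum bullet can be illustrated as below. This is a sketch under stated assumptions, not the code in vocaboot: `scrub_path` and `scrub_detail` are hypothetical names, and hashing the path string (rather than the file contents) is a guess at what the sha256 field holds.

```python
import hashlib
from pathlib import PurePath


def scrub_path(path: str) -> dict:
    """Reduce a path to {basename, sha256}.

    Assumption: the digest is taken over the path string itself.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": PurePath(path).name, "sha256": digest}


def scrub_detail(value, _ancestors=None):
    """Copy nested dicts/lists, replacing any reference back to an ancestor with "<cycle>"."""
    ancestors = set() if _ancestors is None else _ancestors
    if not isinstance(value, (dict, list)):
        return value
    if id(value) in ancestors:
        return "<cycle>"
    ancestors.add(id(value))
    try:
        if isinstance(value, dict):
            return {str(k): scrub_detail(v, ancestors) for k, v in value.items()}
        return [scrub_detail(v, ancestors) for v in value]
    finally:
        # Only containers on the current path count as ancestors.
        ancestors.discard(id(value))
```

Because the result contains no cycles, it can be passed to `json.dumps` without raising a circular-reference error.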

Architecture overview (locked, 4 stages, strict superset)

Stage	What works	Modules added
0	`ckpt` acquisition gate — closed 2026-05-12 with result (B): 0 license-clean pairs; project ships `--accept-nc` opt-in + self-train as dual smoke path.	(research, no new module)
1	One engine, one voice, 5-second wav, CPU, agent-verify gate	input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= 11 layer modules; `cli.py` is the CLI entry point and is excluded from the module count)
2	Any voice, any song	+ profile_match (optional target spec)
3	"Roughly like X" via voice conversion	+ svc (DDSP-SVC default; RVC v2 / SoulX-Singer plugin)
4	Natural-language input (`vocaboot "sing 'Yume' like a soft anime voice"`)	+ nl_parse

Hard ceilings (anti-bloat invariant): layers ≤ 9, layer modules ≤ 12 (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Left unchanged, the plan reaches 13 layer modules at Stage 3 and 14 at Stage 4, so by Stage 3 at the latest score_align will be folded into input and agent_advisory into agent_verify to keep the ceiling intact.
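
The module arithmetic behind the ceiling can be checked mechanically. The snippet below only restates the table above; `modules_at` is an illustrative helper, not a vocaboot function.

```python
STAGE_1_MODULES = [
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
]
ADDED_PER_STAGE = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}
# Planned folds: score_align into input, agent_advisory into agent_verify.
FOLDED = {"score_align", "agent_advisory"}
MODULE_CEILING = 12


def modules_at(stage: int, folded: bool = False) -> list[str]:
    """List the layer modules present at a stage, optionally after the folds."""
    modules = list(STAGE_1_MODULES)
    for later_stage in range(2, stage + 1):
        modules += ADDED_PER_STAGE[later_stage]
    if folded:
        modules = [m for m in modules if m not in FOLDED]
    return modules


for stage in (1, 2, 3, 4):
    print(stage, len(modules_at(stage)), len(modules_at(stage, folded=True)))
```

Without the folds the counts are 11, 12, 13, 14; with them Stage 3 drops to 11 and Stage 4 to 12, which is exactly the ceiling.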

Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with 0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs for the DiffSinger format:

vocoder: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are not drop-in compatible with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
acoustic: every public DiffSinger community voicebank (qixuan, renri, nyaru, mitsudate …) is CC-BY-NC-SA-4.0. The openvpi code is Apache-2.0; the weights are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the substantive difference must be stated: vocaboot has no pip install-and-go commercial-OK default path, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

Path	Tier	$0?	Use
`--accept-nc` opt-in	CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic)	Yes (inference only)	personal, non-commercial smoke
self-train	Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio	No — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path)	commercial / OSS-clean redistribution

The default tier in docs/ckpt_registry.md stays empty until upstream licensing changes. Engine pivot (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See docs/ckpt_registry.md for the deferral rationale.

Output wav licensing (NC opt-in path)

When --accept-nc is set, copyright in the resulting .wav rests with the user, and Creative Commons' public position is that the license conditions on a model do not automatically extend to its outputs (Creative Commons FAQ, Using CC-licensed Works for AI Training, 2025). However, running NC weights is itself bound by the NC clause, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless --accept-nc is explicitly passed.

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured LICENSE_NOTICE.txt emission into the failure museum and stdout when an NC ckpt is loaded; see docs/ckpt_registry.md for the literal text.

"OSS-1 accessible" invariant — Stage 0 result reframe

Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:

OSS-1 (revised, 2026-05-12): 2 commands max to smoke wav. After pip install vocaboot[diffsinger,verify], running vocaboot demo --accept-nc must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled examples/short.ds, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The demo subcommand is a Stage 1 step 2 deliverable. The --accept-nc requirement is the deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.

Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via [verify]). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
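
To make the three-way verdict concrete, here is a generic percentile bootstrap over per-frame measurements (for example HNR values in dB), written with the standard library only. It is a sketch, not vocaboot's R5 procedure: the resample count, the acceptance band, and the rule "inside the band approves, disjoint rejects, straddling is undetermined" are all assumptions.

```python
import random
import statistics

EXIT_CODES = {"APPROVE": 0, "REVIEW": 5, "REJECT": 6}


def bootstrap_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def verdict(ci, band):
    """Compare a confidence interval with a (hypothetical) acceptance band."""
    lo, hi = ci
    band_lo, band_hi = band
    if band_lo <= lo and hi <= band_hi:
        return "APPROVE"   # the whole interval is acceptable
    if hi < band_lo or lo > band_hi:
        return "REJECT"    # the whole interval is outside the band
    return "REVIEW"        # the interval straddles an edge: undetermined
```

The straddling case is what motivates a separate exit code: the data neither support nor rule out acceptance, which is different from a confident rejection.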

Code review is not a per-synth concern — it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a --verify=agent mode that invoked claude -p for static codereview, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a [verify-similarity] extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as museum-annotation only — structurally excluded from the verdict by design.

--no-agent-verify exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.

Distribution

Stage 1 ships:

pip install vocaboot[diffsinger,verify] → either
- vocaboot demo --accept-nc (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), or
- vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc (BYO ckpt path).
import vocaboot → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

MCP server stdio — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
Anthropic Skill / Subagent — natural surface for Claude Code users; tradeoff is client coupling.
Headless library only (no CLI) — would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did not rule any of them out.

Related work (delta)

vocaboot is not the first OSS SVS/SVC project. The slot it occupies is terminal- and agent-loop-friendly wrapping around upstream engines, not engine R&D itself. Direct comparisons:

Project	What it does	What `vocaboot` adds
SVC engines (Stage 3 plugin set)
RVC-Project / Retrieval-based-Voice-Conversion-WebUI (35.6k★, MIT, large community voicebank set)	GUI-first realtime SVC, voicebanks per-license-audited	Terminal & MCP surface, $0 numeric verify gate, R5 CI, failure museum, agent-callable; vocaboot wraps RVC v2 as a plugin, not competing on engine quality
yxlllc / DDSP-SVC (2.5k★, MIT, CPU-capable)	Lightweight DDSP-based SVC, mature CLI inference	Primary engine of record for Stage 3; vocaboot adds protocol bifurcation, registry-pinned weights, and verify-against-target post-conversion gate
Soul-AILab / SoulX-Singer (608★, Apache-2.0 code + weights, 42 000 h vocal train)	Transcription-free zero-shot SVC; Apache-2.0 across the board (only one in the plugin set)	Default-tier candidate; vocaboot integrates as remote (HF Space) or subprocess depending on the 5.6 GB / GPU envelope (see `docs/stage_3.md` for the A/B/C distribution-form simulation)
SVS engines (Stage 1 plugin set)
openvpi / DiffSingerMiniEngine	Python HTTP server wrapping ONNX inference	License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface
diffscope / dsinfer	C++ low-level inference SDK	Higher-level Python API + agent integration + verify gate
Jobsecond / diffsinger-onnx-infer	Reference ONNX inference snippet	Production-grade gates + structured failure capture + R8 reproducibility
OpenUtau (headless mode)	GUI-first with scripted batch	$0 verify gate, no GUI bridge required
nnsvs / nnsvs	MIT SVS framework; weights per-voicebank	Stage 0 re-target candidate if upstream weight licensing changes

As noted under "How vocaboot fits in", vocaboot does not try to out-quality RVC or DDSP-SVC on the engine axis; its slot is the terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate (R5 bootstrap CI), a privacy guard with structural user-audio exclusion, and an auto-failure museum. If a clean integration into one of the above projects becomes more valuable than a standalone wrapper, vocaboot will redirect toward a PR contribution.

Stage 1 (current target)

# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.

examples/short.ds uses generic CV phoneme tokens (d o r e m i f a s o); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts examples/short.ds to match its dictionary; for other ckpts, users must align phoneme tokens themselves.

Two acoustic paths: synthetic (smoke) and DiffSinger BYO (voice)

--voice synthetic produces a 4-partial harmonic oscillator driven by the score's note sequence — ph_seq is deliberately ignored, output has no vowel formants or consonants. Purpose: prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without a checkpoint. A NUMERIC verify APPROVE on the synthetic path proves the pipeline rings, not that the output sounds like a voice.
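
For intuition about what a 4-partial harmonic oscillator sounds like, the sketch below renders a note sequence directly to samples. It deliberately differs from the real path, which produces mel + f0 and hands them to the vocoder; the partial amplitudes, the 44.1 kHz rate as a default, and all function names are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 44100  # assumption, matching the 44.1 kHz vocoder mentioned above
PARTIAL_AMPLITUDES = (1.0, 0.5, 0.25, 0.125)  # hypothetical roll-off


def midi_to_hz(midi_note: float) -> float:
    """Equal-temperament pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)


def render_notes(notes, sample_rate=SAMPLE_RATE):
    """Render (midi_note, seconds) pairs as a sum of four harmonics."""
    norm = sum(PARTIAL_AMPLITUDES)
    samples = []
    phase = 0.0  # fundamental phase in cycles, carried across notes to avoid clicks
    for midi_note, seconds in notes:
        f0 = midi_to_hz(midi_note)
        for _ in range(round(seconds * sample_rate)):
            value = sum(
                amp * math.sin(2 * math.pi * (k + 1) * phase)
                for k, amp in enumerate(PARTIAL_AMPLITUDES)
            )
            samples.append(value / norm)
            phase = (phase + f0 / sample_rate) % 1.0
    return samples
```

The result has a clear pitch but a fixed spectrum, with no formants or consonants, which is why it can exercise the pipeline without resembling a voice.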

--voice <path>/acoustic.onnx activates the DiffSinger BYO acoustic provider (v0.3.0+, services/svs_engine.py:229): acoustic.onnx + sibling phonemes.json are loaded by onnxruntime CPU, ph_seq / ph_dur / note_seq are converted to (tokens: int64, durations: int64, f0: float32) per the openvpi v2.3.0 contract, and the mel [1, n_frames, 128] is handed to NSF-HiFiGAN. This is real vocal-tract-filtered synthesis. Acquisition guide + on-disk layout + failure-mode table: docs/diffsinger_byo_spec.md.
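
The score-to-tensor conversion described above might look like the following in outline. Everything here is a sketch: `prepare_inputs` is a hypothetical name, the frame length is passed in because the real hop size is not stated here, a `note_dur` field is assumed alongside `note_seq`, flats are not handled, and the real provider would cast the results to int64 / float32 arrays before calling onnxruntime.

```python
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name: str) -> float:
    """Convert "C4" or "F#3" to Hz; "rest" becomes 0.0 (unvoiced)."""
    if name == "rest":
        return 0.0
    semitone, octave = NOTE_OFFSETS[name[0]], name[1:]
    if octave.startswith("#"):
        semitone, octave = semitone + 1, octave[1:]
    midi = 12 * (int(octave) + 1) + semitone
    return 440.0 * 2.0 ** ((midi - 69) / 12)


def cumulative_frames(durations_s, frame_seconds):
    """Round cumulative end times to frame indices so rounding errors do not pile up."""
    edges, elapsed = [0], 0.0
    for seconds in durations_s:
        elapsed += seconds
        edges.append(round(elapsed / frame_seconds))
    return edges


def prepare_inputs(ph_seq, ph_dur, note_seq, note_dur, vocab, frame_seconds):
    """Return (tokens, durations in frames, per-frame f0 in Hz)."""
    tokens = [vocab[ph] for ph in ph_seq.split()]
    ph_edges = cumulative_frames(map(float, ph_dur.split()), frame_seconds)
    durations = [end - start for start, end in zip(ph_edges, ph_edges[1:])]
    note_edges = cumulative_frames(map(float, note_dur.split()), frame_seconds)
    f0 = []
    for name, start, end in zip(note_seq.split(), note_edges, note_edges[1:]):
        f0.extend([note_to_hz(name)] * (end - start))
    # Pad or trim so f0 has exactly one value per frame.
    n_frames = ph_edges[-1]
    f0 = (f0 + [f0[-1] if f0 else 0.0] * n_frames)[:n_frames]
    return tokens, durations, f0
```

Here `vocab` stands in for the mapping loaded from phonemes.json; a phoneme missing from it raises `KeyError`, which is the alignment problem users of other ckpts have to resolve themselves.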

G1 wiring-sanity gate (NOT a voice-quality gate): tests/integration/test_byo_acoustic.py asserts wav.size ≥ 0.75 × expected + np.isfinite(wav).all(). This proves the BYO ckpt is wire-loadable and produces a finite waveform; it does not assert that the output is voice-like or matches the input voicebank. Voice-quality objective evaluation (wavlm_cosine ≥ 0.60 + MCD CI upper bound ≤ 500 dB librosa-MFCC scale, see docs/quality_gate.md:14,:45, on N≥3 voicebanks) is post-0.3.0 work tracked in CHANGELOG.md under ## [Unreleased].
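
The G1 condition is small enough to restate in full. The version below uses plain Python sequences so it runs without NumPy; the actual test uses `wav.size` and `np.isfinite(wav).all()`, and `wiring_sanity` is an illustrative name.

```python
import math


def wiring_sanity(wav, expected_samples: int) -> bool:
    """True if the waveform is long enough and every sample is finite."""
    long_enough = len(wav) >= 0.75 * expected_samples
    all_finite = all(math.isfinite(x) for x in wav)
    return long_enough and all_finite
```

Note what passes: a buffer of pure silence at 80% of the expected length satisfies both conditions, which is why G1 is described as a wiring check and not a quality check.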

Stage 2 spectral/timbral metrics behind [verify-similarity] discriminate synthetic from real-voice output post-synth (wavlm_cosine + MCD with 2-sided threshold band).

Stage 3 DDSP-SVC voice conversion (v0.2.0-rc1, BYO)

pip install vocaboot[svc-ddsp]
git clone https://github.com/yxlllc/DDSP-SVC.git ~/ddsp-svc
# ...follow upstream README to install its requirements, train or
# download a ckpt, and compute sha256sum model.pt

export VOCABOOT_DDSP_PATH=~/ddsp-svc/exp/combsub-test/model_300000.pt
export VOCABOOT_DDSP_SHA256=<digest>
export VOCABOOT_DDSP_SRC=~/ddsp-svc

vocaboot ddsp provision        # verify sha256 + weights-only probe
vocaboot ddsp convert --source singer.wav --out converted.wav

The convert subcommand walks a 4-stage refusal ladder (unconfirmed sha256 → ckpt not found → hash mismatch → engine not wired) and dispatches model.forward to the upstream subprocess. An in-process vendored adapter making the VOCABOOT_DDSP_SRC step optional is on the post-1.0 roadmap (no version promise; tracked under v1.1.0+ deferred items below).
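
The ordering of the refusal ladder can be sketched as a sequence of early exits. This is an illustration, not vocaboot's implementation: `Refusal` and `check_ladder` are hypothetical names, and interpreting "engine not wired" as "VOCABOOT_DDSP_SRC does not point at a directory" is an assumption.

```python
import hashlib
import os
from pathlib import Path


class Refusal(Exception):
    """Raised when a convert request must be refused."""


def check_ladder(env=None) -> Path:
    """Walk the four checks in order and return the ckpt path if all pass."""
    env = os.environ if env is None else env
    expected = env.get("VOCABOOT_DDSP_SHA256", "")
    if not expected:
        raise Refusal("unconfirmed sha256: VOCABOOT_DDSP_SHA256 is not set")
    ckpt = Path(env.get("VOCABOOT_DDSP_PATH", "")).expanduser()
    if not ckpt.is_file():
        raise Refusal(f"ckpt not found: {ckpt.name}")
    actual = hashlib.sha256(ckpt.read_bytes()).hexdigest()
    if actual != expected.strip().lower():
        raise Refusal("hash mismatch")
    src = env.get("VOCABOOT_DDSP_SRC", "")
    if not src or not Path(src).expanduser().is_dir():
        raise Refusal("engine not wired: VOCABOOT_DDSP_SRC is not a directory")
    return ckpt
```

Putting the digest check first means a user who has never committed to a hash is refused before the file system is touched at all.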

Completion criteria (OSS 1k★ level, R14 round 5 concept stage, 2026-05-14 snapshot)

vocaboot counts as "completed" when it reaches the level of an OSS project earning around 1k stars; these are the 5 product-level criteria for that judgment. They are outward-facing conditions layered on top of the v0.1.0 alpha internal-quality gates (R5 verify / R8 museum / R15 privacy), derived by working backward from what comparable SVS/SVC projects have in common (RVC 35k★ / MoonInTheRiver/DiffSinger 4.7k★ / OpenUtau 3.8k★ / openvpi/DiffSinger 3.1k★ / DDSP-SVC 2.5k★).

C1. Sound within 30 seconds of pip install vocaboot[...]: v0.2.0 opened the DiffSinger ONNX BYO path and v0.3.0 completed the BYO acoustic provider. Meeting this fully, with no NC weights in the OSS demo default, waits for a future cycle in which the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) lands (outstanding round 8 critic finding; must land before v1.0.0 ships).
C2. The output "sounds like a singing voice": the DiffSinger BYO acoustic provider landed in v0.3.0 (services/svs_engine.py:229 3-provider dispatch, acoustic.onnx + phonemes.json companion, openvpi v2.3.0 baseline). The v0.3.0 ship gate is the G1 wiring-sanity gate (tests/integration/test_byo_acoustic.py: wav.size ≥ 0.75 × expected + np.isfinite(wav).all(); it makes no voice-quality claim). The v1.0.0 ship gate additionally requires the G2 objective gate (wavlm_cosine ≥ 0.60 + MCD CI upper bound ≤ 500 dB librosa-MFCC scale, see docs/quality_gate.md:14,:45, on N≥3 voicebanks) plus a maintainer-attested N=1 listening pass; community feedback and real-ckpt verification accumulate over the 0.x lifetime.
C3. A working demo wav (<audio> embed) and a 30-second mp4 demo in the README: partially met by linking examples/demo_output.mp3 (5s, 60 KB) at the top of the README. The mp4 demo belongs to the v1.0.0 candidate cycle.
C4. A 1-click try page on HF Space / Replicate: via Stage 3 SoulX distribution form C, future cycle.
C5. Launch posts on Twitter/X, HN, and Reddit r/MachineLearning: done at the v1.0.0 release; C3-C4 must land before then.

vocaboot's own axis of "terminal-first + verify gate + failure museum" delivers developer-tooling value that reaches the 20% of the OSS star audience that is the dev community, while C1-C5 are required for the remaining 80% non-engineer audience. Criteria for declaring v1.0.0 complete (concept stage, future cycle): A (DiffSinger BYO acoustic provider, landed in v0.3.0) + G2 objective gate pass (real ckpt, N≥3) + MIT vocoder bridge landed (resolving the NC posture) + maintainer-attested voice listening + R14 round 9 session-separation audit pass + final packaging. The 0.3.0 → 1.0.0 jump takes (G2 pass + MIT bridge landed) as its minimum ship gate.

Roadmap

Architecture lock-in (rounds 28–32; per-round audit trails are inlined into docs/stage_2.md ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in CHANGELOG.md)
Stage 0: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
Stage 1 step 2 (5c): vocaboot demo subcommand wired (click.group, bundled examples/short.ds, --accept-nc mandatory, SHA-256 runtime verify of the pinned vocoder).
Stage 1 step 2 (5d): LICENSE_NOTICE.txt attribution emission for the NC vocoder (sidecar <out>.LICENSE_NOTICE.txt + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
Stage 1: smoke wav (CPU, 5 s) via vocaboot demo --accept-nc (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in test_stage1_synthetic_smoke_wav_end_to_end already serve as the CI canary.
Stage 1.5 closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) stage_1.py → pipeline.py rename, (B) privacy.refuse_user_audio_without_consent + museum user-audio hard-exclude rule, (C) docs/quality_gate.md Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) docs/ckpt_registry.md acquisition Path B (NC-only public release) chosen, (E) docs/stage_2.md spec.
Public OSS release (2026-05-13): Apache-2.0 LICENSE pristine, NOTICE chain audited, GitHub license.spdx_id=apache-2.0 confirmed restored. 4 commits (c267afd + 18733d1 + 9f0ae47 + 26a3589) on origin/main.
v0.2.0 release tag (2026-05-13, currently 0.2.1): C1+C2 achieved at BYO scope, and Stage 3 DDSP-SVC real-inference wiring completed (R14 round 5 concept-stage judgment). PyPI live: pip install vocaboot==0.2.1.
Stage 2.5 MCP server (vocaboot mcp-serve, bundled with v0.2.1, 2026-05-13): a thin wrapper over pipeline.run (services/mcp_server.py). The [mcp] extra wires in mcp>=0.9 and provides the synth / version tools over stdio JSON-RPC. audio_ingest and the other consent-gated operations are not exposed, per the MCP boundary invariant (docs/stage_2.md); accept_nc is fixed as a human signal at server start via --accept-nc.
Stage 2 closure (2026-05-13, labeled in v0.2.1): all 9 acceptance criteria in docs/stage_2.md (Tier 2 canary / discrimination / consent gate / VoiceProfile round-trip / no Stage 1 regression / layer count / WavLM pin / SpeakerSimilarityProtocol) had green test gates as of 0.2.0.post2, and 0.2.1 states this explicitly in the changelog.

v0.3.0 (released 2026-05-14, BYO acoustic provider landing)

PR0 prerequisite: mypy strict configuration + 8 genuine bug fixes (latent bugs such as pipeline.py:66/165, detected by actually running mypy strict), a full rewrite of the design document (docs/round8_w1_a_design.md), and G2 wording turned into an objective gate.
W1 = A: DiffSinger BYO acoustic provider at services/svs_engine.py:229 (openvpi/DiffSinger v2.3.0 baseline, via acoustic.onnx + phonemes.json, ~140 LoC, 21 unit + 1 integration test).
packaging: NC posture explicit, PyPI 0.3.0 shipped. The G1 wiring-sanity gate (tests/integration/test_byo_acoustic.py) was adopted as the 0.3.0 ship gate.

Toward v1.0.0 (concept stage, future cycle, NOT 0.3.0 ship blockers)

Per the audit by both agents (critic + analyst), the 1.0.0 ship blockers are: G2 objective gate pass + MIT vocoder bridge landed + maintainer-attested voice listening + R14 round 9 session-separation audit + an API stability commitment in place. These accumulate over the v1.0.0 candidate cycle so that 1.0.0 ships in a state where the "production stable" claim matches reality. Inserting 0.4.0 / 0.5.0 / ... between 0.3.0 and 1.0.0 remains an open option.

Deferred (future cycles, no version promises)

B: Stage 3 SVC remaining (services/svc/rvc.py RVC v2 per-voicebank license gate + services/svc/soulx.py SoulX-Singer Apache-2.0 subprocess + HF Space remote).
C: Stage 4 natural-language interface (NL prompt → DiffSinger .ds JSON, [nl] extra).
MIT vocoder bridge (BigVGAN / Vocos + mel-format adapter): the real path to removing the NC default. v1.0.0 ship gate (resolves the outstanding round 8 critic finding).
G2 objective gate verification: measure wavlm_cosine / MCD with a real openvpi DiffSinger ckpt on N≥3 voicebanks. v1.0.0 ship gate.
Round 9 R14 session-separation audit: rounds 7/8/W1 all ran inside the same originSessionId 850adffc, leaving an integrity residual (critic C1.1 residual 2.5/10). To be run in a separate session; v1.0.0 ship gate.
DiffSinger fork variants (mel_bin=80 etc., upstream schema evolution): handled via the mel-format adapter.
Completion C3-C5: 30s mp4 demo + HF Space 1-click + HN/Reddit launch.
DDSP vendored adapter: replace the subprocess driven via VOCABOOT_DDSP_SRC with an in-process adapter.
MCP-server-start telemetry instrumentation: to be implemented in an observe-only PR.

Non-goals

Pitch-perfect voice cloning (target literal match is a known failure mode)
Bundled model weights (we point at upstream URLs and verify SHA-256)
GUI

Security: malicious ckpt threat model

vocaboot loads PyTorch checkpoint files (.pt) supplied by the user via VOCABOOT_DDSP_PATH / VOCABOOT_VOCODER_PATH / VOCABOOT_WAVLM_PATH. A malicious .pt file can embed pickle payloads that execute arbitrary code at deserialization time (OWASP A08 / CWE-502 — insecure deserialization). Defenses:

sha256 verify before load — core.ckpt_registry.verify() refuses any ckpt whose bytes don't match the user-committed digest (VOCABOOT_*_SHA256). A man-in-the-middle download swap is caught at the byte boundary, before PyTorch's parser is reached.
torch.load(weights_only=True) — the default in core.ckpt_registry.safe_torch_load(). PyTorch's weights-only deserializer refuses arbitrary classes; legitimate ckpts bundling argparse.Namespace (e.g. DDSP-SVC v5.0) trigger UnsafeCkptError so the user explicitly opts in to allow_unsafe=True rather than silently widening the trust boundary.
Subprocess isolation for upstream inference — vocaboot ddsp convert invokes yxlllc/DDSP-SVC's main.py in a child process (VOCABOOT_DDSP_SRC). The weights are loaded in upstream's address space; a successful exploit would still need to break out of the subprocess to affect vocaboot's process.

Social-engineering caveat (cannot be defended automatically): if a user is socially engineered into exporting an adversary-supplied digest into VOCABOOT_*_SHA256 AND passing allow_unsafe=True, no in-tree defense catches it. Pin sha256 values from sources you trust — cross-check against docs/ckpt_registry.md or the upstream model card before exporting.
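
The "verify before load" ordering can be shown without PyTorch by injecting the loader. This is a minimal sketch of the idea, not the code behind core.ckpt_registry: `sha256_of`, `load_verified`, and `DigestMismatch` are hypothetical names.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Any, Callable


class DigestMismatch(Exception):
    """The file's bytes do not match the pinned digest."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def load_verified(path: Path, expected_hex: str, loader: Callable[[Path], Any]) -> Any:
    """Call `loader` only after the file's digest matches the pinned value."""
    if not hmac.compare_digest(sha256_of(path), expected_hex.strip().lower()):
        raise DigestMismatch(f"sha256 mismatch for {Path(path).name}")
    return loader(path)
```

With PyTorch installed, the loader could be `lambda p: torch.load(p, weights_only=True)`; on a mismatch the deserializer is never invoked, so a swapped file is rejected purely on its bytes.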

License

Code: Apache-2.0. Model weights: inherit from upstream — vocaboot itself bundles none.

Release history

0.4.0

May 13, 2026

This version

0.3.1

May 13, 2026

0.3.0

May 13, 2026

0.2.1

May 13, 2026

0.2.0.post2

May 13, 2026

0.1.0

May 12, 2026

Download files

Source Distribution

vocaboot-0.3.1.tar.gz (1.5 MB view details)

Uploaded May 13, 2026 Source

Built Distribution

vocaboot-0.3.1-py3-none-any.whl (89.3 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file vocaboot-0.3.1.tar.gz.

File metadata

Download URL: vocaboot-0.3.1.tar.gz
Upload date: May 13, 2026
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for vocaboot-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`4418b3a6ebdce58648af96da52c9177a7cf2304104cf96ba31037672889efe13`
MD5	`a1ef036ab8d5239ebed03fd3b439154d`
BLAKE2b-256	`c055fe9ba7880875a013b1f8ca7794b8aea3ed0a0ab5d3aac992e1650ba4894a`

File details

Details for the file vocaboot-0.3.1-py3-none-any.whl.

File metadata

Download URL: vocaboot-0.3.1-py3-none-any.whl
Upload date: May 13, 2026
Size: 89.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for vocaboot-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5740a1d0e4441955acc89c9c70d1dfeb73ea559257aa5647b67d17f4179c4c4f`
MD5	`047f0b15ef16289ecc69956017cede13`
BLAKE2b-256	`e75e678992ae4cd06997cb030dd190e4275a5644a246df1d00db5148cde9022b`
