A researcher-friendly, declarative speech data processing toolkit

VoxKitchen

Turn raw speech recordings into clean, inspectable training datasets.

VoxKitchen handles the repetitive audio prep around ASR, TTS, speaker analysis, and data cleaning: convert, segment, label, filter, and export from one Docker-backed YAML pipeline.

Status: Pre-alpha. APIs and Docker image contents may change between releases.

Use VoxKitchen when you want to:

  • turn long recordings into ASR training data;
  • prepare and inspect TTS datasets;
  • diarize speakers, tag languages, or run speech quality checks;
  • clean, filter, and package audio without maintaining one-off scripts.

Why VoxKitchen

Speech data preparation is usually a chain of fragile scripts: convert audio, split speech, denoise, transcribe, diarize, filter, and export. VoxKitchen makes that chain explicit and repeatable:

  • Docker-first execution: prebuilt runtimes avoid local dependency conflicts.
  • One YAML pipeline: define ingest, stages, filters, and output packs in one file.
  • 51 built-in operators: audio prep, VAD, ASR, diarization, TTS, quality metrics, and packing.
  • Resumable by design: every stage checkpoints under ./work.
  • Inspectable outputs: reports, cut statistics, provenance, and per-stage errors.

Quick Start

Requirements:

  • Docker
  • Python 3.10+ for the lightweight vkit launcher
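
The two requirements above can be checked before installing anything. The following is a minimal sketch, not part of vkit: the helper name check_requirements is hypothetical, and it only confirms that a docker executable is on PATH, not that the Docker daemon is running.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The launcher needs Python 3.10 or newer.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # Pipelines run inside Docker, so the CLI must be reachable.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


for problem in check_requirements():
    print(problem)
```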

Install the vkit launcher from PyPI:

pipx install voxkitchen      # recommended — isolates the launcher
# or
pip install voxkitchen

This installs only the lightweight launcher and inspection commands (a few MB, no torch / ASR / TTS dependencies). All pipeline runtime dependencies stay inside the prebuilt Docker images.

Run the included demo with the smallest runtime image. No repository clone is required; the published image includes the demo pipeline and demo audio.

vkit docker pull --tag slim
vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml
vkit inspect run ./work/demo-no-asr

vkit docker run writes run artifacts under ./work and exported datasets under ./output, both created with your host user ID. It also mounts ./data automatically when that directory exists.

What You Can Build

| Goal | Start with | Runtime image |
|---|---|---|
| Clean and filter raw speech audio | vkit init my-cleaning --template cleaning | slim |
| Build ASR training manifests | vkit init my-asr --template asr | asr |
| Analyze speakers and languages | vkit init my-speakers --template speaker | latest |
| Prepare TTS training data (quality gate) | vkit init my-tts --template tts | asr |
| Synthesize speech in a built-in voice | see Speaker TTS tutorial | tts |
| Clone a voice from a 3–10 s reference | see Voice Cloning & TTS tutorial | tts or fish-speech |

How It Works

[Figure: VoxKitchen pipeline overview]

A pipeline is a YAML file. Each stage reads a CutSet, writes a checkpoint, and passes the result to the next stage.

version: "0.1"
name: my-pipeline
work_dir: ./work/${name}-${run_id}

ingest:
  source: dir
  args:
    root: ./data
    recursive: true

stages:
  - name: resample
    op: resample
    args: { target_sr: 16000, target_channels: 1 }

  - name: vad
    op: silero_vad
    args: { threshold: 0.5 }

  - name: asr
    op: faster_whisper_asr
    args: { model: large-v3, compute_type: float16 }

  - name: filter
    op: quality_score_filter
    args:
      conditions: ["duration > 1", "duration < 30", "metrics.snr > 10"]

  - name: pack
    op: pack_jsonl
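
The conditions in the filter stage read as simple comparisons over fields of each cut. The sketch below shows one way such strings could be evaluated. It is an illustration only, not VoxKitchen's parser: the nested-dict cut layout, the helper names, and the choice to combine all conditions with AND are assumptions.

```python
import operator
import re

# Comparison operators supported by this sketch (a hypothetical subset).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Splits e.g. "metrics.snr > 10" into field path, operator, numeric threshold.
PATTERN = re.compile(r"^\s*([\w.]+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def lookup(cut, path):
    """Resolve a dotted path such as 'metrics.snr' in a nested dict."""
    value = cut
    for key in path.split("."):
        value = value[key]
    return value


def matches(cut, conditions):
    """Return True only if the cut satisfies every condition (assumed AND)."""
    for condition in conditions:
        m = PATTERN.match(condition)
        if m is None:
            raise ValueError(f"unsupported condition: {condition!r}")
        path, op, threshold = m.groups()
        if not OPS[op](lookup(cut, path), float(threshold)):
            return False
    return True


conditions = ["duration > 1", "duration < 30", "metrics.snr > 10"]
cuts = [
    {"id": "a", "duration": 4.2, "metrics": {"snr": 18.0}},
    {"id": "b", "duration": 0.6, "metrics": {"snr": 25.0}},  # too short
    {"id": "c", "duration": 12.0, "metrics": {"snr": 7.5}},  # too noisy
]
kept = [cut["id"] for cut in cuts if matches(cut, conditions)]
print(kept)
```

Parsing a fixed comparison grammar, rather than calling eval, keeps pipeline YAML from running arbitrary code.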

Interrupted runs resume from completed checkpoints.
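
The resume behavior can be pictured as "skip any stage whose checkpoint already exists". The sketch below illustrates that idea with plain lists and JSON files. It is not VoxKitchen's implementation: the checkpoint file names, the run_pipeline helper, and the use of JSON are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(stages, cuts, work_dir):
    """Run stages in order, skipping any stage whose checkpoint exists."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for index, (name, func) in enumerate(stages):
        checkpoint = work_dir / f"{index:02d}-{name}.json"  # hypothetical layout
        if checkpoint.exists():
            # Completed earlier: load the saved result instead of recomputing.
            cuts = json.loads(checkpoint.read_text())
            continue
        cuts = func(cuts)
        checkpoint.write_text(json.dumps(cuts))
        executed.append(name)
    return cuts, executed


stages = [
    ("double", lambda cuts: [c * 2 for c in cuts]),
    ("keep_large", lambda cuts: [c for c in cuts if c > 2]),
]

with tempfile.TemporaryDirectory() as tmp:
    first_result, first_run = run_pipeline(stages, [1, 2, 3], tmp)
    second_result, second_run = run_pipeline(stages, [1, 2, 3], tmp)

print(first_run)   # stages executed on the first run
print(second_run)  # stages executed on the rerun
```

On the rerun no stage executes, and the final result is loaded from the last checkpoint.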

Create A Project

vkit init my-project --template asr
cd my-project

# Put your audio files in ./data first.
vkit validate pipeline.yaml
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
vkit inspect run work/

List templates:

vkit init --list-templates

Not sure which image a pipeline needs? Run:

vkit validate pipeline.yaml

It prints the recommended vkit docker pull --tag ... and vkit docker run --tag ... commands for that YAML.

Runtime Images

Every vkit docker command accepts --tag <name>:

| Tag | Use when | GPU | Approx. size |
|---|---|---|---|
| slim | CPU-friendly cleaning, VAD, quality, pack, enhancement | no | ~13 GB |
| asr | Faster-Whisper, FunASR, Qwen3-ASR, forced alignment | yes | ~48 GB |
| diarize | Pyannote speaker diarization | yes | ~32 GB |
| tts | Kokoro, ChatTTS, CosyVoice | yes | ~44 GB |
| fish-speech | Fish-Speech isolated runtime | yes | ~57 GB |
| latest | Mixed pipelines across ASR, diarization, TTS, or Fish-Speech | yes | ~123 GB |

Use latest when one pipeline mixes multiple runtime families, such as ASR plus diarization or ASR plus TTS. Otherwise, prefer the smallest image that contains the operators you need.

Useful checks:

vkit docker pull --tag asr
vkit docker doctor --tag asr --expect asr
vkit docker doctor --tag latest

Configuration

Some operators require API tokens. Create ./.env; vkit docker run passes it into the container automatically.

cp .env.example .env

| Variable | Required by | Notes |
|---|---|---|
| HF_TOKEN | pyannote_diarize | Accept the pyannote model agreement on HuggingFace first. |
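
Before starting a long diarization run, it can help to confirm that ./.env actually defines the token. The sketch below is a small standalone check, not a vkit command: the helper names are hypothetical, and it handles only plain KEY=VALUE lines.

```python
import tempfile
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_tokens(env, required=("HF_TOKEN",)):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Demo on a throwaway file where the token was left blank.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# tokens\nHF_TOKEN=\n")
    missing = missing_tokens(read_env(env_file))
print(missing)
```

Pointing read_env at your real ./.env and getting an empty list back means the variable is at least set. It does not prove the token is valid or that the model agreement was accepted.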

Common Commands

vkit init <path> --template asr           # Scaffold a project
vkit validate pipeline.yaml               # Validate YAML and recommend an image
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
vkit inspect run work/                    # Stage summary
vkit inspect cuts <cuts.jsonl.gz>          # CutSet statistics
vkit inspect errors work/                  # Per-stage failed cuts
vkit operators search <keyword>            # Find operators by name or summary
vkit operators --category quality          # List one category's operators
vkit schema export --out pipeline.schema.json  # Editor autocompletion for YAML
vkit recipes                               # List dataset recipes
vkit docker download --tag slim librispeech --root ./data/librispeech --subsets dev-clean
vkit docker doctor --tag latest            # Check image health
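
For ad hoc analysis outside vkit inspect cuts, a cuts.jsonl.gz file can be read directly. The sketch below assumes each line is one JSON object with a duration field in seconds. That layout is an assumption based on the file extension and is not confirmed by this README, so check a real file before relying on it.

```python
import gzip
import json
import tempfile
from pathlib import Path


def summarize_cuts(path):
    """Count cuts and sum their durations from a gzipped JSONL file."""
    count = 0
    total = 0.0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            cut = json.loads(line)
            count += 1
            total += float(cut.get("duration", 0.0))  # assumed field name
    return {"cuts": count, "hours": total / 3600}


# Demo on a synthetic file, since the real schema is not shown here.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "cuts.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        for duration in (1800.0, 900.0, 900.0):
            handle.write(json.dumps({"duration": duration}) + "\n")
    summary = summarize_cuts(sample)
print(summary)
```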

Agent Skill

The repo includes an agent-neutral VoxKitchen skill at skill/. Claude, Codex, and other SKILL.md-compatible agents can copy, symlink, or import that folder into their own skill search path. The skill follows the Docker-first vkit workflow in this README.

License

Apache 2.0. See LICENSE.

Download files

Source Distribution

voxkitchen-0.3.0.tar.gz (2.0 MB)

Uploaded Source

Built Distribution

voxkitchen-0.3.0-py3-none-any.whl (222.8 kB)

Uploaded Python 3

File details

Details for the file voxkitchen-0.3.0.tar.gz.

File metadata

  • Download URL: voxkitchen-0.3.0.tar.gz
  • Upload date:
  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for voxkitchen-0.3.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 18c90fe358b8e3e2cc7b07f1adbaa7ebcc7067d9ff61347b781a665c311cd647 |
| MD5 | bf7a75fa8d18f065ffe97077543ec26e |
| BLAKE2b-256 | 6ec29cef60f406e3e5dff7a1bdf7de4408a6b1ac1e88fbfd576599e358819554 |
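
A downloaded archive can be compared against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch: the local path voxkitchen-0.3.0.tar.gz is an assumed download location, and the helper names are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path

# SHA256 digest published above for the source distribution.
EXPECTED_SHA256 = "18c90fe358b8e3e2cc7b07f1adbaa7ebcc7067d9ff61347b781a665c311cd647"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Self-check of the helper on a tiny known input.
with tempfile.TemporaryDirectory() as tmp:
    probe = Path(tmp) / "probe.bin"
    probe.write_bytes(b"abc")
    probe_digest = sha256_of(probe)

archive = Path("voxkitchen-0.3.0.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if verify(archive, EXPECTED_SHA256) else "MISMATCH")
```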

Provenance

The following attestation bundles were made for voxkitchen-0.3.0.tar.gz:

Publisher: publish.yml on XqFeng-Josie/VoxKitchen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file voxkitchen-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: voxkitchen-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 222.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for voxkitchen-0.3.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | b5dc7179109d18a8f1d479c321936642e1032ab89ea0e02a71bd694bebe25cb3 |
| MD5 | 4e4a3039eecd1af2d72fe638fd0bb849 |
| BLAKE2b-256 | 8ea0c344b1a5efa6f9ec016d09c47c4bcf55aa3f89aa7eb1112a2fd6b718f9a0 |

Provenance

The following attestation bundles were made for voxkitchen-0.3.0-py3-none-any.whl:

Publisher: publish.yml on XqFeng-Josie/VoxKitchen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
