
yt-dbl

Dub any YouTube video into another language — with the original speaker's voice

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -t ru

[!WARNING] Apple Silicon ONLY (M1–M4), tested on M4 Pro (48 GB)

One command: download, transcribe, translate (Claude), clone each speaker's voice (Qwen3-TTS), and mix with the original background — done. All ML inference runs locally on your Mac's GPU via MLX.

Why yt-dbl

  • Human-quality voice cloning
    Qwen3-TTS per speaker, not a generic synth. Multiple speakers are diarized and voiced separately
  • LLM translation
    Claude handles idioms, context, and produces TTS-friendly text — not word-for-word machine translation
  • Background preserved
    BS-RoFormer separates vocals from music/sfx. Sidechain ducking mixes them back naturally
  • Production audio chain
    Loudnorm (-16 LUFS), de-essing, pitch-preserving speed-up, equal-power crossfade
  • Checkpoint & resume
    Every step saves state. Interrupted? yt-dbl resume continues where it stopped
  • Private
    Everything local except the Claude API call
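
The "sidechain ducking" mentioned above is a standard ffmpeg technique: the speech track drives a compressor on the background track, so music dips while someone speaks and recovers in the pauses. Here is a minimal sketch of such a filtergraph as a Python string builder — the labels, thresholds, and timing values are illustrative assumptions, not yt-dbl's actual internals:

```python
def ducking_filtergraph(threshold: float = 0.05, ratio: int = 8) -> str:
    """Build an ffmpeg filtergraph that ducks the background under speech.

    Assumed inputs: [0:a] = background track, [1:a] = speech track.
    The speech stream is split so it can act both as the compressor's
    sidechain and as a mix input (ffmpeg allows each stream label to be
    consumed only once).
    """
    return (
        "[1:a]asplit[sc][mix];"
        f"[0:a][sc]sidechaincompress="
        f"threshold={threshold}:ratio={ratio}:attack=20:release=300[ducked];"
        "[ducked][mix]amix=inputs=2:duration=longest[out]"
    )
```

Passed to ffmpeg via `-filter_complex`, this produces a `[out]` stream where the background is automatically attenuated whenever the speech track carries signal.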

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)

Requirements

  • macOS with Apple Silicon (M1–M4) — MLX needs Metal
  • Python >= 3.12
  • FFmpeg — audio extraction, postprocessing, final assembly
  • yt-dlp — video download
  • Anthropic API key — translation via Claude

Installation

1. Install system dependencies

brew install ffmpeg yt-dlp

Optional: brew install ffmpeg-full for pitch-preserving speed-up via rubberband. Without it, yt-dbl falls back to ffmpeg's atempo filter (works fine, just no pitch correction)
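
One wrinkle of the atempo fallback: older ffmpeg builds limit each atempo instance to the range [0.5, 2.0], so larger factors are conventionally expressed as a chain of instances. A sketch of that chaining (not yt-dbl's actual code):

```python
def atempo_chain(factor: float) -> str:
    """Express a tempo factor as a chain of atempo filters, keeping each
    instance within the classic [0.5, 2.0] per-instance range."""
    filters = []
    while factor > 2.0:
        filters.append("atempo=2.0")
        factor /= 2.0
    while factor < 0.5:
        filters.append("atempo=0.5")
        factor *= 2.0
    filters.append(f"atempo={factor:g}")
    return ",".join(filters)
```

So a 1.4x speed-up becomes `atempo=1.4`, while a hypothetical 3x would become `atempo=2.0,atempo=1.5`.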

2. Install yt-dbl

# From PyPI
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl

--prerelease=allow is needed because mlx-audio depends on a pre-release version of transformers

If yt-dbl is not found, run uv tool update-shell && source ~/.zshrc

From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

Use uv run yt-dbl instead of yt-dbl when running from source

3. Set up the API key

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or use a .env file:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~8.2 GB) download automatically on first run, or fetch them ahead of time:

yt-dbl models download

Configuration

Priority: CLI args > env vars (YT_DBL_ prefix) > .env file > defaults

cp .env.example .env
Env variable                Default      Description
──────────────────────────  ───────────  ─────────────────────────────────────────
YT_DBL_ANTHROPIC_API_KEY    (required)   Anthropic API key
YT_DBL_TARGET_LANGUAGE      ru           Target language (ISO 639-1)
YT_DBL_OUTPUT_FORMAT        mp4          mp4 / mkv
YT_DBL_SUBTITLE_MODE        softsub      softsub / hardsub / none
YT_DBL_BACKGROUND_VOLUME    0.15         Background volume during speech (0.0–1.0)
YT_DBL_MAX_SPEED_FACTOR     1.4          Max TTS speed-up to fit timing (1.0–2.0)
YT_DBL_MAX_LOADED_MODELS    0 (auto)     Max models in memory (0 = auto by RAM)
YT_DBL_WORK_DIR             dubbed       Output directory

See .env.example for all 33 parameters

Quick start

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"           # dub to Russian (default)
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es                    # dub to Spanish
yt-dbl dub "https://youtu.be/VIDEO_ID" -o ./out                 # custom output dir
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate    # re-run from a specific step
yt-dbl resume VIDEO_ID                                          # resume after interrupt
yt-dbl status VIDEO_ID                                          # check job progress

Commands

dub — dub a video

yt-dbl dub <URL> [options]
Option                  Description                                      Default
──────────────────────  ───────────────────────────────────────────────  ────────
-t, --target-language   Target language                                  ru
-o, --output-dir        Output directory                                 ./dubbed
--bg-volume             Background volume (0.0–1.0)                      0.15
--max-speed             Max TTS speed-up (1.0–2.0)                       1.4
--max-models            Max models in memory                             auto
--from-step             Start from: download / separate / transcribe /
                        translate / synthesize / assemble
--no-subs               Disable subtitles                                false
--sub-mode              softsub / hardsub / none                         softsub
--format                mp4 / mkv                                        mp4

resume — pick up where it stopped

yt-dbl resume <video_id> [--max-models N] [-o DIR]

status — check job progress

yt-dbl status <video_id>

models list / models download

yt-dbl models list        # show models, download status, size
yt-dbl models download    # pre-download all models

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘
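
The "loudnorm (-16 LUFS, 2-pass)" step in stage 5 follows the standard ffmpeg pattern: a first pass measures the input's loudness, and a second pass applies normalization using those measurements. A sketch of the two command lines — the -16 LUFS target comes from this README, while TP/LRA values and the linear mode are illustrative assumptions:

```python
def loudnorm_pass1_cmd(src: str) -> list[str]:
    """Pass 1: measure loudness; ffmpeg prints a JSON blob to stderr."""
    return ["ffmpeg", "-hide_banner", "-i", src,
            "-af", "loudnorm=I=-16:TP=-1.5:LRA=11:print_format=json",
            "-f", "null", "-"]

def loudnorm_pass2_cmd(src: str, dst: str, measured: dict) -> list[str]:
    """Pass 2: apply normalization using pass-1 measurements."""
    af = ("loudnorm=I=-16:TP=-1.5:LRA=11:linear=true"
          f":measured_I={measured['input_i']}"
          f":measured_TP={measured['input_tp']}"
          f":measured_LRA={measured['input_lra']}"
          f":measured_thresh={measured['input_thresh']}")
    return ["ffmpeg", "-i", src, "-af", af, dst]
```

The `input_i` / `input_tp` / `input_lra` / `input_thresh` keys are the names ffmpeg's loudnorm filter emits in its pass-1 JSON report.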

Memory management

LRU model manager — auto-selects how many models to keep loaded based on RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

ASR (~5.7 GB) is unloaded before loading the Aligner to avoid holding both in memory
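
The RAM table above maps directly to a small lookup function; a sketch mirroring those thresholds (not the actual implementation):

```python
def memory_budget(ram_gb: int) -> tuple[int, int]:
    """Return (max loaded models, separation batch size) for a given
    amount of system RAM, mirroring the table above."""
    if ram_gb <= 16:
        return 1, 1
    if ram_gb <= 31:
        return 2, 2
    if ram_gb <= 47:
        return 3, 4
    return 3, 8
```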

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)

Models

Model                           Size      Task
──────────────────────────────  ────────  ─────────────────────────────
VibeVoice-ASR                   ~5.7 GB   ASR + speaker diarization
Qwen3-ForcedAligner             ~600 MB   Word-level alignment
Qwen3-TTS                       ~1.7 GB   TTS + voice cloning
MelBand-RoFormer (BS-RoFormer)  ~200 MB   Vocal/background separation
Claude Sonnet 4.5               —         Translation (API)

All local models run on MLX (Metal GPU), total ~8.2 GB

Development

just check    # lint + format + typecheck + tests
just test     # fast tests (parallel, coverage)
just test-e2e # E2E (needs ffmpeg + network)
just fix      # auto-fix lint
just format   # auto-format

License

MIT
