CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)

yt-dbl

Dub any YouTube video into another language — with the original speaker's voice

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -t ru

[!WARNING] Apple Silicon ONLY (M1–M4), tested on M4 Pro (48 GB)

One command: download, transcribe, translate (Claude), clone each speaker's voice (Qwen3-TTS), mix with the original background — done. All ML inference runs locally on your Mac's GPU via MLX

Why yt-dbl

  • Human-quality voice cloning
    Qwen3-TTS per speaker, not a generic synth. Multiple speakers are diarized and voiced separately
  • LLM translation
    Claude handles idioms, context, and produces TTS-friendly text — not word-for-word machine translation
  • Background preserved
    BS-RoFormer separates vocals from music/sfx. Sidechain ducking mixes them back naturally
  • Production audio chain
    Loudnorm (-16 LUFS), de-essing, pitch-preserving speed-up, equal-power crossfade
  • Checkpoint & resume
    Every step saves state. Interrupted? yt-dbl resume continues where it stopped
  • Private
    Everything local except the Claude API call
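The "background preserved" point can be made concrete as an ffmpeg filtergraph: attenuate the background, compress it with the speech track as sidechain, then mix. The sketch below only builds the filtergraph string; the threshold, ratio, attack, and release values are illustrative placeholders, not yt-dbl's actual settings.

```python
def duck_filtergraph(bg_volume: float = 0.15) -> str:
    """Sketch of an ffmpeg filtergraph that lowers the background while
    speech is present (sidechain compression), then mixes the two tracks.

    Input 0 is the dubbed speech track, input 1 the separated background."""
    return (
        # speech feeds both the final mix and the compressor's sidechain
        "[0:a]asplit=2[voice][sc];"
        # static background attenuation first
        f"[1:a]volume={bg_volume}[bg];"
        # duck the background whenever the sidechain (speech) is loud
        "[bg][sc]sidechaincompress=threshold=0.05:ratio=8:attack=20:release=300[duck];"
        # mix ducked background with speech
        "[voice][duck]amix=inputs=2:duration=first[out]"
    )
```

Such a graph would be passed to ffmpeg via -filter_complex, mapping the [out] label as the audio stream.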

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
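yt-dbl's actual detection logic isn't shown here, but the Unicode-script idea can be sketched with the standard library alone: count which script the alphabetic characters belong to, using their Unicode character names.

```python
import unicodedata
from collections import Counter

# Script keywords to look for in Unicode character names (illustrative subset)
SCRIPTS = ("LATIN", "CYRILLIC", "ARABIC", "DEVANAGARI", "CJK",
           "HIRAGANA", "KATAKANA", "HANGUL")

def dominant_script(text: str) -> str:
    """Guess the writing system of a transcript by majority vote over
    the Unicode names of its alphabetic characters."""
    counts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if script in name:
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```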

Requirements

  • macOS with Apple Silicon (M1–M4) — MLX needs Metal
  • Python >= 3.12
  • FFmpeg — audio extraction, postprocessing, final assembly
  • yt-dlp — video download
  • Anthropic API key — translation via Claude

Installation

1. Install system dependencies

brew install ffmpeg yt-dlp

Optional: brew install ffmpeg-full for pitch-preserving speed-up via rubberband. Without it, yt-dbl falls back to ffmpeg's atempo filter (works fine, just no pitch correction)
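The fallback choice can be sketched as picking between the two ffmpeg audio filters (the helper below is illustrative, not yt-dbl's code). A single atempo instance only accepts factors between 0.5 and 2.0, so larger factors have to be chained:

```python
def speed_filter(factor: float, have_rubberband: bool) -> str:
    """Return an ffmpeg audio-filter expression that speeds a segment up
    by `factor`, preferring the rubberband filter when it is available."""
    if have_rubberband:
        return f"rubberband=tempo={factor}"
    # atempo only accepts 0.5–2.0 per instance; chain instances for more
    chain = []
    while factor > 2.0:
        chain.append("atempo=2.0")
        factor /= 2.0
    chain.append(f"atempo={factor}")
    return ",".join(chain)
```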

2. Install yt-dbl

# From PyPI
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl

--prerelease=allow is needed because mlx-audio depends on a pre-release transformers

If yt-dbl is not found, run uv tool update-shell && source ~/.zshrc

From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

Use uv run yt-dbl instead of yt-dbl when running from source

3. Set up the API key

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or use a .env file:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~8.2 GB) download automatically on first run, or fetch them ahead of time:

yt-dbl models download

Configuration

Priority: CLI args > env vars (YT_DBL_ prefix) > .env file > defaults

cp .env.example .env
Env variable              Default     Description
────────────────────────  ──────────  ─────────────────────────────────────────
YT_DBL_ANTHROPIC_API_KEY  (required)  Anthropic API key
YT_DBL_TARGET_LANGUAGE    ru          Target language (ISO 639-1)
YT_DBL_OUTPUT_FORMAT      mp4         mp4 / mkv
YT_DBL_SUBTITLE_MODE      softsub     softsub / hardsub / none
YT_DBL_BACKGROUND_VOLUME  0.15        Background volume during speech (0.0–1.0)
YT_DBL_MAX_SPEED_FACTOR   1.4         Max TTS speed-up to fit timing (1.0–2.0)
YT_DBL_MAX_LOADED_MODELS  0 (auto)    Max models in memory (0 = auto by RAM)
YT_DBL_WORK_DIR           dubbed      Output directory

See .env.example for all 33 parameters
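The priority chain is the usual layered-configuration pattern. A minimal sketch of resolving one setting (function name and structure hypothetical, not yt-dbl's internals):

```python
import os

# built-in defaults (subset, for illustration)
DEFAULTS = {"YT_DBL_TARGET_LANGUAGE": "ru", "YT_DBL_OUTPUT_FORMAT": "mp4"}

def resolve_setting(name: str, cli_value=None, dotenv=None):
    """Resolve one setting: CLI args > env vars > .env file > defaults."""
    if cli_value is not None:          # highest priority: explicit CLI flag
        return cli_value
    if name in os.environ:             # then the process environment
        return os.environ[name]
    if dotenv and name in dotenv:      # then values parsed from .env
        return dotenv[name]
    return DEFAULTS[name]              # finally the built-in default
```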

Quick start

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"           # dub to Russian (default)
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es                    # dub to Spanish
yt-dbl dub "https://youtu.be/VIDEO_ID" -o ./out                 # custom output dir
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate    # re-run from a specific step
yt-dbl resume VIDEO_ID                                          # resume after interrupt
yt-dbl status VIDEO_ID                                          # check job progress

Commands

dub — dub a video

yt-dbl dub <URL> [options]
Option                 Description                                             Default
─────────────────────  ──────────────────────────────────────────────────────  ────────
-t, --target-language  Target language                                         ru
-o, --output-dir       Output directory                                        ./dubbed
--bg-volume            Background volume (0.0–1.0)                             0.15
--max-speed            Max TTS speed-up (1.0–2.0)                              1.4
--max-models           Max models in memory                                    auto
--from-step            Start from: download / separate / transcribe /
                       translate / synthesize / assemble
--no-subs              Disable subtitles                                       false
--sub-mode             softsub / hardsub / none                                softsub
--format               mp4 / mkv                                               mp4

resume — pick up where it stopped

yt-dbl resume <video_id> [--max-models N] [-o DIR]

status — check job progress

yt-dbl status <video_id>

models list / models download

yt-dbl models list        # show models, download status, size
yt-dbl models download    # pre-download all models

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘

Memory management

LRU model manager — auto-selects how many models to keep loaded based on RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

ASR (~5.7 GB) is unloaded before loading the Aligner to avoid holding both in memory
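A minimal sketch of an LRU model cache with the RAM-based auto limit from the table above (illustrative, not yt-dbl's actual manager):

```python
from collections import OrderedDict

def auto_limit(ram_gb: int) -> int:
    """Pick how many models to keep loaded, mirroring the RAM table above."""
    if ram_gb <= 16:
        return 1
    if ram_gb <= 31:
        return 2
    return 3

class ModelManager:
    """Keep at most `limit` models resident; evict the least recently used."""

    def __init__(self, limit: int, loaders: dict):
        self.limit, self.loaders = limit, loaders
        self.loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)        # mark as recently used
        else:
            if len(self.loaded) >= self.limit:
                self.loaded.popitem(last=False)  # evict LRU, freeing memory
            self.loaded[name] = self.loaders[name]()
        return self.loaded[name]
```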

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)

Models

Model                           Size     Task
──────────────────────────────  ───────  ───────────────────────────
VibeVoice-ASR                   ~5.7 GB  ASR + speaker diarization
Qwen3-ForcedAligner             ~600 MB  Word-level alignment
Qwen3-TTS                       ~1.7 GB  TTS + voice cloning
MelBand-RoFormer (BS-RoFormer)  ~200 MB  Vocal/background separation
Claude Sonnet 4.5               (API)    Translation

All local models run on MLX (Metal GPU), total ~8.2 GB

Development

just check    # lint + format + typecheck + tests
just test     # fast tests (parallel, coverage)
just test-e2e # E2E (needs ffmpeg + network)
just fix      # auto-fix lint
just format   # auto-format

License

MIT
