
CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)

Project description

yt-dbl

Dub any YouTube video into another language — with the original speaker's voice

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -t ru

[!WARNING] Early stage — not yet stable for long videos (30+ min)

[!WARNING] Apple Silicon only (M1–M4), tested on M4 Pro (48 GB)

One command: download, transcribe, translate (Claude), clone each speaker's voice (Qwen3-TTS), mix with the original background — done. All ML inference runs locally on your Mac's GPU via MLX

Why yt-dbl

  • Human-quality voice cloning
    Qwen3-TTS per speaker, not a generic synth. Multiple speakers are diarized and voiced separately
  • LLM translation
    Claude handles idioms, context, and produces TTS-friendly text — not word-for-word machine translation
  • Background preserved
    BS-RoFormer separates vocals from music/sfx. Sidechain ducking mixes them back naturally
  • Production audio chain
    Loudnorm (-16 LUFS), de-essing, pitch-preserving speed-up, equal-power crossfade
  • Checkpoint & resume
    Every step saves state. Interrupted? yt-dbl resume continues where it stopped
  • Private
    Everything local except the Claude API call
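The sidechain-ducking idea from the feature list can be sketched in a few lines. This is a toy illustration, not yt-dbl's actual DSP: the envelope follower, threshold, and decay values here are invented for demonstration.

```python
# Toy sidechain ducking: attenuate the background wherever a crude
# peak-hold envelope of the speech track exceeds a threshold.
# All numbers below are illustrative, not yt-dbl's real parameters.
def duck(background, speech, duck_gain=0.15, threshold=0.05, decay=0.999):
    out, env = [], 0.0
    for b, s in zip(background, speech):
        env = max(abs(s), env * decay)            # follow speech level
        out.append(b * (duck_gain if env > threshold else 1.0))
    return out
```

During silence the background passes through at full volume; under speech it drops to `duck_gain` (the same role `YT_DBL_BACKGROUND_VOLUME` plays in the real pipeline).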

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
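Script-based detection can be approximated with the standard library. A minimal sketch (not yt-dbl's actual detector): count which Unicode script the alphabetic characters belong to, using character names.

```python
import unicodedata

SCRIPTS = ("CYRILLIC", "LATIN", "ARABIC", "DEVANAGARI",
           "CJK", "HIRAGANA", "KATAKANA", "HANGUL")

def dominant_script(text):
    """Return the most frequent Unicode script among letters, or None."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] = counts.get(script, 0) + 1
                break
    return max(counts, key=counts.get) if counts else None
```

For example, `dominant_script("Привет, мир")` yields `"CYRILLIC"`, which maps naturally onto an ASR language hint.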

Requirements

  • macOS with Apple Silicon (M1–M4) — MLX needs Metal
  • Python >= 3.12
  • FFmpeg — audio extraction, postprocessing, final assembly
  • yt-dlp — video download
  • Anthropic API key — translation via Claude

Installation

1. Install system dependencies

brew install ffmpeg yt-dlp

Optional: brew install ffmpeg-full for pitch-preserving speed-up via rubberband. Without it, yt-dbl falls back to ffmpeg's atempo filter (works fine, just no pitch correction)
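The fallback amounts to choosing a different ffmpeg audio filter. A sketch of how that choice could look (the helper name is hypothetical, but `rubberband=tempo=` and `atempo=` are real ffmpeg filter syntax):

```python
# Hypothetical helper: pick the ffmpeg audio filter for a speed-up.
# rubberband keeps pitch; atempo changes tempo without pitch correction.
def build_speed_filter(factor, have_rubberband):
    if not 1.0 <= factor <= 2.0:
        raise ValueError("speed factor must be in [1.0, 2.0]")
    if have_rubberband:
        return f"rubberband=tempo={factor}"   # pitch-preserving
    return f"atempo={factor}"                 # tempo only
```

The string would be passed to ffmpeg as `-filter:a "atempo=1.4"` (or the rubberband equivalent).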

2. Install yt-dbl

# From PyPI
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl

--prerelease=allow is needed because mlx-audio depends on a pre-release version of transformers

If yt-dbl is not found, run uv tool update-shell && source ~/.zshrc

From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

Use uv run yt-dbl instead of yt-dbl when running from source

3. Set up the API key

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or use a .env file:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~8.2 GB) download automatically on first run, or fetch them ahead of time:

yt-dbl models download

Configuration

Priority: CLI args > env vars (YT_DBL_ prefix) > .env file > defaults

cp .env.example .env
Env variable                Default     Description
──────────────────────────  ──────────  ──────────────────────────────────────────
YT_DBL_ANTHROPIC_API_KEY    (required)  Anthropic API key
YT_DBL_TARGET_LANGUAGE      ru          Target language (ISO 639-1)
YT_DBL_OUTPUT_FORMAT        mp4         mp4 / mkv
YT_DBL_SUBTITLE_MODE        softsub     softsub / hardsub / none
YT_DBL_BACKGROUND_VOLUME    0.15        Background volume during speech (0.0–1.0)
YT_DBL_MAX_SPEED_FACTOR     1.4         Max TTS speed-up to fit timing (1.0–2.0)
YT_DBL_MAX_LOADED_MODELS    0 (auto)    Max models in memory (0 = auto by RAM)
YT_DBL_WORK_DIR             dubbed      Output directory

See .env.example for all 33 parameters
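The stated precedence (CLI args > env vars > .env file > defaults) can be sketched as a single lookup. The function and parameter names here are illustrative, not yt-dbl internals:

```python
import os

def resolve(key, cli_value=None, dotenv=None, default=None):
    """Resolve a setting: CLI arg > YT_DBL_* env var > .env file > default."""
    if cli_value is not None:
        return cli_value
    env = os.environ.get(f"YT_DBL_{key}")
    if env is not None:
        return env
    if dotenv and key in dotenv:
        return dotenv[key]
    return default
```

So a `-t es` flag beats `YT_DBL_TARGET_LANGUAGE` in the shell, which beats a `.env` entry, which beats the built-in `ru`.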

Quick start

yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"           # dub to Russian (default)
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es                    # dub to Spanish
yt-dbl dub "https://youtu.be/VIDEO_ID" -o ./out                 # custom output dir
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate    # re-run from a specific step
yt-dbl resume VIDEO_ID                                          # resume after interrupt
yt-dbl status VIDEO_ID                                          # check job progress

Commands

dub — dub a video

yt-dbl dub <URL> [options]
Option                  Description                                      Default
──────────────────────  ───────────────────────────────────────────────  ───────
-t, --target-language   Target language                                  ru
-o, --output-dir        Output directory                                 ./dubbed
--bg-volume             Background volume (0.0–1.0)                      0.15
--max-speed             Max TTS speed-up (1.0–2.0)                       1.4
--max-models            Max models in memory                             auto
--from-step             Start from: download / separate / transcribe /   —
                        translate / synthesize / assemble
--no-subs               Disable subtitles                                false
--sub-mode              softsub / hardsub / none                         softsub
--format                mp4 / mkv                                        mp4

resume — pick up where it stopped

yt-dbl resume <video_id> [--max-models N] [-o DIR]

status — check job progress

yt-dbl status <video_id>

models list / models download

yt-dbl models list        # show models, download status, size
yt-dbl models download    # pre-download all models

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘
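The equal-power crossfade in step 6 can be sketched as follows: over the overlap, the outgoing gain follows cos and the incoming gain sin, so `g_out² + g_in² = 1` at every sample and perceived power stays constant through the join. Illustrative code, not yt-dbl's assembler:

```python
import math

def crossfade(a, b, overlap):
    """Join two sample lists: the last `overlap` samples of a blend into b."""
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        t = i / (overlap - 1) if overlap > 1 else 1.0
        mixed.append(tail[i] * math.cos(t * math.pi / 2)    # 1 → 0
                     + b[i] * math.sin(t * math.pi / 2))    # 0 → 1
    return head + mixed + b[overlap:]

# At 48 kHz, a 50 ms crossfade covers 0.050 * 48_000 = 2400 samples.
```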

Memory management

LRU model manager — auto-selects how many models to keep loaded based on RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

ASR (~5.7 GB) is unloaded before loading the Aligner to avoid holding both in memory
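The auto policy from the table above boils down to a small threshold function (illustrative code, not yt-dbl internals):

```python
def auto_limits(ram_gb):
    """Return (max loaded models, separation batch size) for a given RAM."""
    if ram_gb <= 16:
        return 1, 1
    if ram_gb <= 31:
        return 2, 2
    if ram_gb <= 47:
        return 3, 4
    return 3, 8
```

Setting `YT_DBL_MAX_LOADED_MODELS` to a nonzero value overrides the model count chosen here.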

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)
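Checkpoint-based resume can be sketched from this layout: state.json records which steps finished, so `yt-dbl resume` can skip straight to the first unfinished one. A hypothetical sketch; the `"completed"` field name is invented, not the real schema:

```python
import json

STEPS = ["download", "separate", "transcribe", "translate",
         "synthesize", "assemble"]

def next_step(state_path):
    """Return the first pipeline step not yet marked done, or None."""
    with open(state_path) as f:
        done = set(json.load(f).get("completed", []))
    for step in STEPS:
        if step not in done:
            return step
    return None  # everything finished
```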

Models

Model                           Size      Task
──────────────────────────────  ────────  ─────────────────────────────
VibeVoice-ASR                   ~5.7 GB   ASR + speaker diarization
Qwen3-ForcedAligner             ~600 MB   Word-level alignment
Qwen3-TTS                       ~1.7 GB   TTS + voice cloning
MelBand-RoFormer (BS-RoFormer)  ~200 MB   Vocal/background separation
Claude Sonnet 4.5               —         Translation (API)

All local models run on MLX (Metal GPU), total ~8.2 GB

Development

just check    # lint + format + typecheck + tests
just test     # fast tests (parallel, coverage)
just test-e2e # E2E (needs ffmpeg + network)
just fix      # auto-fix lint
just format   # auto-format

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

yt_dbl-1.6.2.tar.gz (243.3 kB)

Uploaded Source

Built Distribution


yt_dbl-1.6.2-py3-none-any.whl (62.5 kB)

Uploaded Python 3

File details

Details for the file yt_dbl-1.6.2.tar.gz.

File metadata

  • Download URL: yt_dbl-1.6.2.tar.gz
  • Upload date:
  • Size: 243.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yt_dbl-1.6.2.tar.gz
Algorithm Hash digest
SHA256 cc3d0c269be200e0ec6d91cd80be07cb7a8d84a1847fb364d01ab527c9dc2624
MD5 34750489d0d22c7c077be769dfb45d83
BLAKE2b-256 4da0118547ca8987c75b0256a61bfe51392439cec0b4c8c37611be2d178d77b2


Provenance

The following attestation bundles were made for yt_dbl-1.6.2.tar.gz:

Publisher: release.yml on brolnickij/yt-dbl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yt_dbl-1.6.2-py3-none-any.whl.

File metadata

  • Download URL: yt_dbl-1.6.2-py3-none-any.whl
  • Upload date:
  • Size: 62.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yt_dbl-1.6.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1d1e18718fd511ebe902b43315daa4c6530c0e12641bd485dacb0d6fea4662b5
MD5 b2e00c592fa183b09ba887eb1f739c37
BLAKE2b-256 f81dcac8186a07ef7d33699733ea747649bf17637e43388eb35ab8efb7089ba1


Provenance

The following attestation bundles were made for yt_dbl-1.6.2-py3-none-any.whl:

Publisher: release.yml on brolnickij/yt-dbl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
