
CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)

Project description

yt-dbl

[!WARNING] Apple Silicon only (M1/M2/M3/M4) — all ML inference runs on Metal GPU via MLX

Tested on M4 Pro (20-core GPU, 48 GB unified memory)

CLI tool for automatic YouTube video dubbing with voice cloning.

All ML inference (ASR, alignment, TTS) runs locally on Apple Silicon via MLX. Translation is done through the Claude API. The output is a video file dubbed in the target language using the original speaker's cloned voice.

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
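Script-based detection of this kind can be sketched with the standard library alone. The mapping below is illustrative only (yt-dbl's actual script table and detector are internal): it counts letters per Unicode script keyword, as exposed by `unicodedata.name`, and returns the dominant guess.

```python
import unicodedata

# Illustrative mapping from Unicode script keywords (as they appear in
# unicodedata.name) to a language guess -- NOT yt-dbl's internal table.
SCRIPT_HINTS = {
    "CYRILLIC": "ru", "ARABIC": "ar", "DEVANAGARI": "hi",
    "HANGUL": "ko", "HIRAGANA": "ja", "KATAKANA": "ja", "CJK": "zh",
}

def guess_script_language(text: str, default: str = "en") -> str:
    """Count letters per script and return the dominant language hint."""
    counts: dict[str, int] = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for keyword, lang in SCRIPT_HINTS.items():
            if keyword in name.split():
                counts[lang] = counts.get(lang, 0) + 1
                break
        else:
            # Latin and anything unmapped falls back to the default
            counts[default] = counts.get(default, 0) + 1
    return max(counts, key=counts.get) if counts else default
```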

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) — MLX only works on Metal
  • Python >= 3.12
  • FFmpeg — used for audio extraction, post-processing, and final assembly
  • yt-dlp — used to download videos from YouTube
  • Anthropic API key — for translation via Claude

Installation

1. Install system dependencies

# FFmpeg (required)
brew install ffmpeg

# yt-dlp (required)
brew install yt-dlp

Optional: for pitch-preserving speed-up via rubberband, install ffmpeg-full instead:

brew install ffmpeg-full

Without it, the tool falls back to ffmpeg's atempo filter (works fine, just no pitch correction).

2. Install yt-dbl

# From PyPI (recommended)
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl

Note: --prerelease=allow is needed because mlx-audio depends on a pre-release version of transformers.

If yt-dbl is not found after installation, run uv tool update-shell && source ~/.zshrc to add ~/.local/bin to your PATH.

From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

When running from source, use uv run yt-dbl instead of yt-dbl.

3. Set up the API key

The Anthropic API key is required for the translation step. Add it to your shell profile so it persists across sessions:

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or create a .env file in the working directory:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~8.2 GB) are downloaded automatically on first run. To fetch them ahead of time:

yt-dbl models download

Configuration

Settings are loaded in order of priority:

  1. CLI arguments
  2. Environment variables (prefix YT_DBL_)
  3. .env file
  4. Default values

Copy .env.example to .env and adjust as needed:

cp .env.example .env

Key parameters

| Env variable | Default | Description |
| --- | --- | --- |
| YT_DBL_ANTHROPIC_API_KEY | — | Required. Anthropic API key for translation |
| YT_DBL_TARGET_LANGUAGE | ru | Target language (ISO 639-1) |
| YT_DBL_OUTPUT_FORMAT | mp4 | mp4 / mkv |
| YT_DBL_SUBTITLE_MODE | softsub | softsub / hardsub / none |
| YT_DBL_BACKGROUND_VOLUME | 0.15 | Background volume during speech (0.0–1.0) |
| YT_DBL_MAX_SPEED_FACTOR | 1.4 | Max TTS speed-up to fit timing (1.0–2.0) |
| YT_DBL_MAX_LOADED_MODELS | 0 (auto) | Max models in memory (0 = auto by RAM) |
| YT_DBL_WORK_DIR | dubbed | Output directory for all jobs |

See .env.example for the full list of 33 configurable parameters including model selection, separation tuning, TTS sampling, and chunked ASR settings.

Quick start

# Dub a video into Russian (default)
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"

# Custom output directory
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-output

# Specify target language
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es

# Start from a specific step (previous steps are skipped)
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate

# Check job status
yt-dbl status VIDEO_ID

# Resume an interrupted job
yt-dbl resume VIDEO_ID

Commands

dub — dub a video

yt-dbl dub <URL> [options]
| Option | Description | Default |
| --- | --- | --- |
| -t, --target-language | Target language | ru |
| -o, --output-dir | Output directory | ./dubbed |
| --bg-volume | Background volume (0.0–1.0) | 0.15 |
| --max-speed | Max TTS speed-up (1.0–2.0) | 1.4 |
| --max-models | Max models in memory | auto (by RAM) |
| --from-step | Start from step: download / separate / transcribe / translate / synthesize / assemble | — |
| --no-subs | Disable subtitles | false |
| --sub-mode | Subtitle mode: softsub / hardsub / none | softsub |
| --format | Output format: mp4 / mkv | mp4 |

resume — resume an interrupted job

yt-dbl resume <video_id> [--max-models N] [-o DIR]

The pipeline saves state.json after each step. If interrupted, resume picks up from the last incomplete step.

status — check job status

yt-dbl status <video_id>

Shows a table with each step's state (pending / running / completed / failed), execution time, and video metadata.

models list — list ML models

yt-dbl models list

Shows all models, their download status, and size on disk.

models download — pre-download models

yt-dbl models download

Downloads all HuggingFace models. The audio-separator model is downloaded automatically on first use.

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘
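The ducking mix in step 6 can be sketched as an ffmpeg filtergraph built around the real `sidechaincompress` filter: the background is compressed whenever the speech track is loud, then the two are mixed. The compressor parameter values here are illustrative, not yt-dbl's:

```python
def ducking_filtergraph(bg_volume: float = 0.15) -> str:
    """Sketch of a filtergraph for sidechain ducking.
    Input 0 = speech track, input 1 = background track."""
    return (
        "[0:a]asplit=2[sp][sc];"                       # speech reused twice: mix + sidechain
        f"[1:a]volume={bg_volume:g}[bg];"               # base background level
        "[bg][sc]sidechaincompress="
        "threshold=0.05:ratio=8:attack=20:release=300[ducked];"
        "[sp][ducked]amix=inputs=2:duration=first[out]"
    )

# Usage:
# ffmpeg -i speech.wav -i background.wav \
#   -filter_complex "<graph>" -map "[out]" mixed.wav
```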

Memory management

ML models are loaded and unloaded via an LRU manager.

The number of models kept in memory is determined automatically based on available RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

The ASR model (~5.7 GB) is unloaded before loading the Aligner so both don't occupy memory at the same time.
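A minimal sketch of such an LRU cache, built on OrderedDict (the real manager also frees Metal buffers on eviction; this version only drops Python references):

```python
from collections import OrderedDict
from typing import Callable

class ModelLRU:
    """Keep at most `max_loaded` models in memory; loading one more
    evicts the least recently used. Illustrative sketch only."""

    def __init__(self, max_loaded: int):
        self.max_loaded = max_loaded
        self._loaded: "OrderedDict[str, object]" = OrderedDict()

    def get(self, name: str, loader: Callable[[], object]) -> object:
        if name in self._loaded:
            self._loaded.move_to_end(name)     # mark as most recently used
        else:
            while len(self._loaded) >= self.max_loaded:
                self._loaded.popitem(last=False)  # evict the LRU model
            self._loaded[name] = loader()
        return self._loaded[name]

    def unload(self, name: str) -> None:
        """Explicit unload, e.g. dropping ASR before loading the aligner."""
        self._loaded.pop(name, None)
```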

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)

Models

| Model | Size | Task | Inference |
| --- | --- | --- | --- |
| VibeVoice-ASR | ~5.7 GB | ASR + speaker diarization | MLX (Metal) |
| Qwen3-ForcedAligner | ~600 MB | Word-level alignment | MLX (Metal) |
| Qwen3-TTS | ~1.7 GB | TTS with voice cloning | MLX (Metal) |
| MelBand-RoFormer (BS-RoFormer) | ~200 MB | Vocal/background separation | ONNX + CoreML |
| Claude Sonnet 4.5 | — | Text translation | API (Anthropic) |

Development

just check          # lint + format + typecheck + tests
just test           # fast tests (parallel, coverage)
just test-e2e       # E2E tests (requires FFmpeg + network)
just fix            # auto-fix linter
just format         # auto-format

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

yt_dbl-1.4.1.tar.gz (227.2 kB)

Uploaded Source

Built Distribution


yt_dbl-1.4.1-py3-none-any.whl (54.6 kB)

Uploaded Python 3

File details

Details for the file yt_dbl-1.4.1.tar.gz.

File metadata

  • Download URL: yt_dbl-1.4.1.tar.gz
  • Upload date:
  • Size: 227.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yt_dbl-1.4.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 4b3165250d38a606e87cdcd463f151b876a5cc01f13c8e88574e52518749ebe4 |
| MD5 | d9d19fbfafd0c424eb0eb0ac91a9c511 |
| BLAKE2b-256 | 9178737229a321e9d20d41f1fa5e07896d5b820e4fee870a6b8b5a18f29bcd6f |


Provenance

The following attestation bundles were made for yt_dbl-1.4.1.tar.gz:

Publisher: release.yml on brolnickij/yt-dbl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yt_dbl-1.4.1-py3-none-any.whl.

File metadata

  • Download URL: yt_dbl-1.4.1-py3-none-any.whl
  • Upload date:
  • Size: 54.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for yt_dbl-1.4.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 7272222f0469d23d6185e058c3148c0c2b6202bd5657b81fe40a710a37ae7e2b |
| MD5 | 624e7234a4c8bc312964b86cd990c736 |
| BLAKE2b-256 | fbffedaf22479e036bf56bfa7284b41ec881bd04b2cc1c2f367e0ab77151a907 |


Provenance

The following attestation bundles were made for yt_dbl-1.4.1-py3-none-any.whl:

Publisher: release.yml on brolnickij/yt-dbl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
