
yt-dbl

[!WARNING] Apple Silicon only (M1/M2/M3/M4) — all ML inference runs on Metal GPU via MLX

Tested on M4 Pro (20-core GPU, 48 GB unified memory)

CLI tool for automatic YouTube video dubbing with voice cloning.

All ML inference (ASR, alignment, TTS) runs locally on Apple Silicon via MLX. Translation is done through the Claude API. The output is a video file dubbed in the target language using the original speaker's cloned voice.

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
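Language auto-detection by Unicode script can be sketched in a few lines of Python with the standard `unicodedata` module. This is a simplified illustration of the general technique, not the tool's actual implementation; `dominant_script` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Guess the dominant Unicode script of a string.

    Rough heuristic: for most letters, unicodedata.name() begins with
    the script name (e.g. 'CYRILLIC SMALL LETTER A'), so counting the
    first word of each character name approximates script detection.
    """
    counts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        try:
            first_word = unicodedata.name(ch).split()[0]
        except ValueError:
            continue  # unnamed code point
        counts[first_word] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(dominant_script("Привет, мир"))   # CYRILLIC
print(dominant_script("Hello, world"))  # LATIN
```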

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) — MLX only works on Metal
  • Python >= 3.12
  • FFmpeg — used for audio extraction, post-processing, and final assembly
  • yt-dlp — used to download videos from YouTube
  • Anthropic API key — for translation via Claude

Installation

1. Install system dependencies

# FFmpeg (required)
brew install ffmpeg

# yt-dlp (required)
brew install yt-dlp

Optional: for pitch-preserving speed-up via rubberband, install ffmpeg-full instead:

brew install ffmpeg-full

Without it, the tool falls back to ffmpeg's atempo filter (works fine, just no pitch correction).
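The fallback choice can be sketched as a small filter-string builder. `rubberband=tempo=…` and `atempo=…` are real ffmpeg audio-filter syntax; the `speed_filter` helper and the exact strings the tool builds are assumptions for illustration. A single `atempo` instance accepts factors in [0.5, 2.0], so larger factors are conventionally split into a chain (not strictly needed here, since the tool caps speed-up at 2.0).

```python
def speed_filter(factor: float, have_rubberband: bool) -> str:
    """Build an ffmpeg audio-filter string for speeding speech up.

    rubberband changes tempo while preserving pitch; atempo is the
    plain fallback, chained when the factor exceeds its 2.0 limit.
    """
    if have_rubberband:
        return f"rubberband=tempo={factor}"
    parts = []
    remaining = factor
    while remaining > 2.0:
        parts.append("atempo=2.0")
        remaining /= 2.0
    parts.append(f"atempo={remaining}")
    return ",".join(parts)

print(speed_filter(1.4, True))    # rubberband=tempo=1.4
print(speed_filter(1.4, False))   # atempo=1.4
print(speed_filter(3.0, False))   # atempo=2.0,atempo=1.5
```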

2. Install yt-dbl

# From PyPI (recommended)
uv tool install --prerelease=allow yt-dbl

# Or with pipx
pipx install yt-dbl

Note: --prerelease=allow is needed because mlx-audio depends on a pre-release version of transformers.

If yt-dbl is not found after installation, run uv tool update-shell && source ~/.zshrc to add ~/.local/bin to your PATH.

From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

When running from source, use uv run yt-dbl instead of yt-dbl.

3. Set up the API key

The Anthropic API key is required for the translation step. Add it to your shell profile so it persists across sessions:

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or create a .env file in the working directory:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~8.2 GB) are downloaded automatically on first run. To fetch them ahead of time:

yt-dbl models download

Configuration

Settings are loaded in order of priority:

  1. CLI arguments
  2. Environment variables (prefix YT_DBL_)
  3. .env file
  4. Default values
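The precedence chain above can be sketched as a lookup that falls through each layer in order. This is an illustrative sketch with a hypothetical `resolve` helper; the tool itself presumably uses a settings library for this.

```python
import os

def resolve(key: str, cli: dict, dotenv: dict, default=None):
    """Resolve a setting: CLI args win, then YT_DBL_-prefixed
    environment variables, then the .env file, then the default."""
    if key in cli:
        return cli[key]
    env_val = os.environ.get(f"YT_DBL_{key.upper()}")
    if env_val is not None:
        return env_val
    if key in dotenv:
        return dotenv[key]
    return default

os.environ["YT_DBL_TARGET_LANGUAGE"] = "de"

# Environment beats .env, CLI beats everything:
print(resolve("target_language", cli={}, dotenv={"target_language": "fr"}, default="ru"))  # de
print(resolve("target_language", cli={"target_language": "es"}, dotenv={}, default="ru"))  # es
```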

Copy .env.example to .env and adjust as needed:

cp .env.example .env

Key parameters

Env variable                Default     Description
────────────────────────    ────────    ───────────────────────────────────────────
YT_DBL_ANTHROPIC_API_KEY    (none)      Required. Anthropic API key for translation
YT_DBL_TARGET_LANGUAGE      ru          Target language (ISO 639-1)
YT_DBL_OUTPUT_FORMAT        mp4         mp4 / mkv
YT_DBL_SUBTITLE_MODE        softsub     softsub / hardsub / none
YT_DBL_BACKGROUND_VOLUME    0.15        Background volume during speech (0.0–1.0)
YT_DBL_MAX_SPEED_FACTOR     1.4         Max TTS speed-up to fit timing (1.0–2.0)
YT_DBL_MAX_LOADED_MODELS    0 (auto)    Max models in memory (0 = auto by RAM)
YT_DBL_WORK_DIR             dubbed      Output directory for all jobs

See .env.example for the full list of 33 configurable parameters including model selection, separation tuning, TTS sampling, and chunked ASR settings.

Quick start

# Dub a video into Russian (default)
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"

# Custom output directory
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-output

# Specify target language
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es

# Start from a specific step (previous steps are skipped)
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate

# Check job status
yt-dbl status VIDEO_ID

# Resume an interrupted job
yt-dbl resume VIDEO_ID

Commands

dub — dub a video

yt-dbl dub <URL> [options]
Option                   Description                                Default
──────────────────────   ────────────────────────────────────────   ─────────────
-t, --target-language    Target language                            ru
-o, --output-dir         Output directory                           ./dubbed
--bg-volume              Background volume (0.0–1.0)                0.15
--max-speed              Max TTS speed-up (1.0–2.0)                 1.4
--max-models             Max models in memory                       auto (by RAM)
--from-step              Start from step: download / separate /
                         transcribe / translate / synthesize /
                         assemble
--no-subs                Disable subtitles                          false
--sub-mode               Subtitle mode: softsub / hardsub / none    softsub
--format                 Output format: mp4 / mkv                   mp4
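The three subtitle modes map naturally onto different ffmpeg mechanisms: muxing the SRT as a separate stream (softsub), burning it into the video via the `subtitles` filter (hardsub, which forces re-encoding), or skipping it. The sketch below uses real ffmpeg flags (`-c:s mov_text`, `-vf subtitles=…`), but the `subtitle_args` helper and the exact arguments the tool passes are assumptions for illustration.

```python
def subtitle_args(mode: str, srt_path: str) -> list[str]:
    """Map a --sub-mode value to illustrative ffmpeg arguments."""
    if mode == "softsub":
        # Mux SRT as a soft subtitle stream (mov_text for MP4 containers)
        return ["-i", srt_path, "-c:s", "mov_text"]
    if mode == "hardsub":
        # Burn subtitles into the video frames
        return ["-vf", f"subtitles={srt_path}"]
    if mode == "none":
        return []
    raise ValueError(f"unknown subtitle mode: {mode}")

print(subtitle_args("hardsub", "subtitles.srt"))  # ['-vf', 'subtitles=subtitles.srt']
```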

resume — resume an interrupted job

yt-dbl resume <video_id> [--max-models N] [-o DIR]

The pipeline saves state.json after each step. If interrupted, resume picks up from the last incomplete step.

status — check job status

yt-dbl status <video_id>

Shows a table with each step's state (pending / running / completed / failed), execution time, and video metadata.

models list — list ML models

yt-dbl models list

Shows all models, their download status, and size on disk.

models download — pre-download models

yt-dbl models download

Downloads all HuggingFace models. The audio-separator model is downloaded automatically on first use.

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘
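Step 6 joins adjacent synthesized segments with a 50 ms equal-power crossfade. "Equal-power" is the standard sine/cosine gain pair whose squares sum to 1 at every point, so the summed signal keeps roughly constant perceived loudness across the joint. A minimal sketch of the gain curves (illustrating the technique, not the tool's exact code):

```python
import math

def equal_power_gains(t: float) -> tuple[float, float]:
    """Fade-out and fade-in gains at crossfade position t in [0, 1].

    g_out^2 + g_in^2 == 1 for every t, which keeps the summed
    power (and thus perceived loudness) constant through the fade.
    """
    g_out = math.cos(t * math.pi / 2)
    g_in = math.sin(t * math.pi / 2)
    return g_out, g_in

# At the midpoint both gains are ~0.707 and the powers sum to 1:
g_out, g_in = equal_power_gains(0.5)
print(round(g_out**2 + g_in**2, 6))  # 1.0
```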

Memory management

ML models are loaded and unloaded via an LRU manager.

The number of models kept in memory is determined automatically based on available RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8
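The tiers in the table above reduce to a simple threshold function. A sketch (`memory_tier` is a hypothetical name; values are taken directly from the table):

```python
def memory_tier(ram_gb: int) -> tuple[int, int]:
    """Return (max loaded models, separation batch size) for a given RAM size."""
    if ram_gb <= 16:
        return 1, 1
    if ram_gb <= 31:
        return 2, 2
    if ram_gb <= 47:
        return 3, 4
    return 3, 8

print(memory_tier(48))  # (3, 8)
```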

The ASR model (~5.7 GB) is unloaded before loading the Aligner so both don't occupy memory at the same time.
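An LRU model manager of this kind can be sketched with an `OrderedDict`: loading a model past capacity first unloads the least recently used one, mirroring how the ASR model is evicted before the aligner loads. This is an illustrative minimal cache, not the tool's actual manager.

```python
from collections import OrderedDict

class ModelLRU:
    """Minimal LRU cache for loaded models (illustrative sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str, loader):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        while len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)  # unload the LRU model
        self._loaded[name] = loader()
        return self._loaded[name]

cache = ModelLRU(capacity=1)
cache.get("asr", lambda: "ASR weights")
cache.get("aligner", lambda: "aligner weights")  # evicts "asr" first
print(list(cache._loaded))  # ['aligner']
```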

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)

Models

Model                            Size      Task                          Inference
──────────────────────────────   ───────   ───────────────────────────   ───────────────
VibeVoice-ASR                    ~5.7 GB   ASR + speaker diarization     MLX (Metal)
Qwen3-ForcedAligner              ~600 MB   Word-level alignment          MLX (Metal)
Qwen3-TTS                        ~1.7 GB   TTS with voice cloning        MLX (Metal)
MelBand-RoFormer (BS-RoFormer)   ~200 MB   Vocal/background separation   ONNX + CoreML
Claude Sonnet 4.5                n/a       Text translation              API (Anthropic)

Development

just check          # lint + format + typecheck + tests
just test           # fast tests (parallel, coverage)
just test-e2e       # E2E tests (requires FFmpeg + network)
just fix            # auto-fix linter
just format         # auto-format

License

MIT
