
yt-dbl

[!WARNING] Apple Silicon only (M1/M2/M3/M4) — all ML inference runs on Metal GPU via MLX

Tested on M4 Pro (20-core GPU, 48 GB unified memory)

CLI tool for automatic YouTube video dubbing with voice cloning.

All ML inference (ASR, alignment, TTS) runs locally on Apple Silicon via MLX. Translation is done through the Claude API. The output is a video file dubbed in the target language using the original speaker's cloned voice.

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
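
As a rough illustration of how script-based detection can work, the sketch below guesses a string's dominant Unicode script from character names in the standard library's `unicodedata`. The `detect_script` helper is hypothetical and not part of yt-dbl's API; the tool's actual detection logic may differ.

```python
import unicodedata
from collections import Counter

SCRIPTS = ("LATIN", "CYRILLIC", "ARABIC", "DEVANAGARI",
           "CJK", "HANGUL", "HIRAGANA", "KATAKANA")

def detect_script(text: str) -> str:
    """Return the dominant Unicode script of `text` (illustrative only)."""
    counts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names lead with the script,
        # e.g. "CYRILLIC SMALL LETTER A" -> CYRILLIC
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

For example, `detect_script("Привет мир")` yields `"CYRILLIC"`, which maps to Russian-family languages downstream.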

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) — MLX only works on Metal
  • Python >= 3.12
  • FFmpeg — used for audio extraction, post-processing, and final assembly
  • yt-dlp — used to download videos from YouTube
  • Anthropic API key — for translation via Claude

Installation

1. Install system dependencies

# FFmpeg (required)
brew install ffmpeg

# yt-dlp (required)
brew install yt-dlp

Optional: for pitch-preserving speed-up via rubberband, install ffmpeg-full instead:

brew install ffmpeg-full

Without it, the tool falls back to ffmpeg's atempo filter (works fine, just no pitch correction).
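
The two speed-up paths can be sketched as alternative ffmpeg audio filters. The helper below only builds the command line and is hypothetical; yt-dbl's real invocation may differ. Note that a single `atempo` pass is valid for factors in the 0.5–2.0 range, which covers the tool's 1.0–2.0 limit.

```python
def speedup_cmd(src: str, dst: str, factor: float,
                have_rubberband: bool) -> list[str]:
    """Build an ffmpeg command that speeds audio up by `factor`.

    Illustrative sketch -- not yt-dbl's actual invocation.
    """
    if have_rubberband:
        # librubberband time-stretching (needs an ffmpeg built with it)
        afilter = f"rubberband=tempo={factor}"
    else:
        # plain atempo fallback; one pass handles 0.5-2.0
        afilter = f"atempo={factor}"
    return ["ffmpeg", "-y", "-i", src, "-filter:a", afilter, dst]
```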

2. Install yt-dbl

# From PyPI (recommended)
uv tool install yt-dbl

# Or with pipx
pipx install yt-dbl
# Or from source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

When running from source, use uv run yt-dbl instead of yt-dbl.

3. Set up the API key

The Anthropic API key is required for the translation step. Add it to your shell profile so it persists across sessions:

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or create a .env file in the working directory:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~10.5 GB) are downloaded automatically on first run. To fetch them ahead of time:

yt-dbl models download

Configuration

Settings are loaded in order of priority:

  1. CLI arguments
  2. Environment variables (prefix YT_DBL_)
  3. .env file
  4. Default values

Example .env:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...
YT_DBL_TARGET_LANGUAGE=ru
YT_DBL_BACKGROUND_VOLUME=0.2
YT_DBL_MAX_SPEED_FACTOR=1.3
YT_DBL_OUTPUT_FORMAT=mkv
YT_DBL_SUBTITLE_MODE=hardsub
YT_DBL_BACKGROUND_DUCKING=true
YT_DBL_VOICE_REF_DURATION=7.0
YT_DBL_SAMPLE_RATE=48000

All parameters

| Parameter | Env variable | Default | Description |
|---|---|---|---|
| target_language | YT_DBL_TARGET_LANGUAGE | ru | Target language |
| output_format | YT_DBL_OUTPUT_FORMAT | mp4 | mp4 / mkv |
| subtitle_mode | YT_DBL_SUBTITLE_MODE | softsub | softsub / hardsub / none |
| background_volume | YT_DBL_BACKGROUND_VOLUME | 0.15 | Background volume (0.0–1.0) |
| background_ducking | YT_DBL_BACKGROUND_DUCKING | true | Duck background during speech (sidechain) |
| max_speed_factor | YT_DBL_MAX_SPEED_FACTOR | 1.4 | Max TTS speed-up (1.0–2.0) |
| voice_ref_duration | YT_DBL_VOICE_REF_DURATION | 7.0 | Voice reference duration (3–30 sec) |
| max_loaded_models | YT_DBL_MAX_LOADED_MODELS | 0 (auto) | Max models in memory |
| anthropic_api_key | YT_DBL_ANTHROPIC_API_KEY | | Anthropic API key |
| work_dir | YT_DBL_WORK_DIR | work | Working directory |

Quick start

# Dub a video into Russian (default)
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"

# Specify target language
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es

# Start from a specific step (previous steps are skipped)
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate

# Check job status
yt-dbl status VIDEO_ID

# Resume an interrupted job
yt-dbl resume VIDEO_ID

Commands

dub — dub a video

yt-dbl dub <URL> [options]
| Option | Description | Default |
|---|---|---|
| -t, --target-language | Target language | ru |
| --bg-volume | Background volume (0.0–1.0) | 0.15 |
| --max-speed | Max TTS speed-up (1.0–2.0) | 1.4 |
| --max-models | Max models in memory | auto (by RAM) |
| --from-step | Start from step: download / separate / transcribe / translate / synthesize / assemble | |
| --no-subs | Disable subtitles | false |
| --sub-mode | Subtitle mode: softsub / hardsub / none | softsub |
| --format | Output format: mp4 / mkv | mp4 |

resume — resume an interrupted job

yt-dbl resume <video_id> [--max-models N]

The pipeline saves state.json after each step. If interrupted, resume picks up from the last incomplete step.
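
Resuming then boils down to finding the first step not marked complete. The sketch below assumes a `{"steps": {name: {"status": ...}}}` layout; the real state.json schema may differ, so treat the helper as hypothetical.

```python
import json
from pathlib import Path

STEPS = ["download", "separate", "transcribe",
         "translate", "synthesize", "assemble"]

def first_incomplete_step(state_path: Path) -> str:
    """Return the first pipeline step whose status is not 'completed'.
    Assumed state.json layout -- illustrative only.
    """
    state = json.loads(state_path.read_text())
    for step in STEPS:
        if state.get("steps", {}).get(step, {}).get("status") != "completed":
            return step
    return "done"
```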

status — check job status

yt-dbl status <video_id>

Shows a table with each step's state (pending / running / completed / failed), execution time, and video metadata.

models list — list ML models

yt-dbl models list

Shows all models, their download status, and size on disk.

models download — pre-download models

yt-dbl models download

Downloads all HuggingFace models. The audio-separator model is downloaded automatically on first use.

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~8 GB)                          │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
                            ┌───────────────────┐
                            │    result.mp4     │
                            └───────────────────┘
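
The assembly step's "single ffmpeg call" can be approximated with a filtergraph that attenuates the background, ducks it under speech via sidechain compression, mixes the two, and copies the video stream. This is a simplified sketch with made-up filter parameters, not yt-dbl's actual filtergraph.

```python
def assemble_cmd(video: str, speech: str, background: str, out: str,
                 bg_volume: float = 0.15) -> list[str]:
    """Mix speech over a ducked background and mux with the original
    video in one ffmpeg call. Illustrative filtergraph only.
    """
    fg = (
        # attenuate the background track
        f"[2:a]volume={bg_volume}[bg];"
        # sidechaincompress lowers [bg] whenever the speech input is loud
        "[bg][1:a]sidechaincompress=threshold=0.05:ratio=8[duck];"
        # mix speech with the ducked background
        "[1:a][duck]amix=inputs=2:duration=first[aout]"
    )
    return ["ffmpeg", "-y", "-i", video, "-i", speech, "-i", background,
            "-filter_complex", fg,
            "-map", "0:v", "-map", "[aout]",
            "-c:v", "copy", "-c:a", "aac", out]
```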

Memory management

ML models are loaded and unloaded via an LRU manager.

The number of models kept in memory is determined automatically based on available RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

The ASR model (~8 GB) is unloaded before loading the Aligner so both don't occupy memory at the same time.
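
The tier table above is a simple step function of available RAM. A minimal sketch (the `models_for_ram` helper is hypothetical, written only to mirror the documented tiers):

```python
def models_for_ram(ram_gb: int) -> tuple[int, int]:
    """Map unified-memory size to (max loaded models, separation batch),
    following the documented tiers. Illustrative helper only.
    """
    if ram_gb <= 16:
        return (1, 1)
    if ram_gb <= 31:
        return (2, 2)
    if ram_gb <= 47:
        return (3, 4)
    return (3, 8)  # 48 GB and up
```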

Working directory structure

work/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)

Models

| Model | Size | Task | Inference |
|---|---|---|---|
| VibeVoice-ASR | ~8.2 GB | ASR + speaker diarization | MLX (Metal) |
| Qwen3-ForcedAligner | ~600 MB | Word-level alignment | MLX (Metal) |
| Qwen3-TTS | ~1.7 GB | TTS with voice cloning | MLX (Metal) |
| MelBand-RoFormer (BS-RoFormer) | ~200 MB | Vocal/background separation | ONNX + CoreML |
| Claude Sonnet 4.5 | | Text translation | API (Anthropic) |

Development

just check          # lint + format + typecheck + tests
just test           # fast tests (parallel, coverage)
just test-e2e       # E2E tests (requires FFmpeg + network)
just fix            # auto-fix linter
just format         # auto-format

License

MIT
