yt-dbl
CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)
Dub any YouTube video into another language — with the original speaker's voice
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -t ru
[!WARNING] Early stage — not yet stable for long videos (30+ min)
[!WARNING] Apple Silicon only (M1–M4), tested on M4 Pro (48 GB)
One command: download, transcribe, translate (Claude), clone each speaker's voice (Qwen3-TTS), mix with the original background — done. All ML inference runs locally on your Mac's GPU via MLX
Why yt-dbl
- Human-quality voice cloning: Qwen3-TTS per speaker, not a generic synth. Multiple speakers are diarized and voiced separately
- LLM translation: Claude handles idioms, context, and produces TTS-friendly text, not word-for-word machine translation
- Background preserved: BS-RoFormer separates vocals from music/sfx. Sidechain ducking mixes them back naturally
- Production audio chain: loudnorm (-16 LUFS), de-essing, pitch-preserving speed-up, equal-power crossfade
- Checkpoint & resume: every step saves state. Interrupted? yt-dbl resume continues where it stopped
- Private: everything local except the Claude API call
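The checkpoint-and-resume behaviour can be sketched as a small state-file pattern. This is an illustrative sketch only: the step names match the pipeline, but the function and field names here are hypothetical, not yt-dbl's actual code.

```python
import json
from pathlib import Path

# Step order matches the yt-dbl pipeline
STEPS = ["download", "separate", "transcribe", "translate", "synthesize", "assemble"]

def load_state(job_dir: Path) -> dict:
    """Read the checkpoint file, or start fresh if none exists."""
    state_file = job_dir / "state.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"completed": []}

def run_step(step: str) -> None:
    """Stand-in for the real per-step work."""
    print(f"running {step}")

def run_pipeline(job_dir: Path) -> None:
    """Run all steps, skipping any already recorded in state.json."""
    state = load_state(job_dir)
    for step in STEPS:
        if step in state["completed"]:
            continue  # finished in a previous run -- skipped on resume
        run_step(step)
        state["completed"].append(step)
        # Checkpoint after every step, so an interrupt loses at most one step
        (job_dir / "state.json").write_text(json.dumps(state))
```

Calling `run_pipeline` a second time on the same job directory is then a no-op for completed steps, which is exactly the resume semantics.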
Supported languages
TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian
ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
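Script-based auto-detection of this kind can be approximated in a few lines with the standard library. A simplified sketch, not the project's actual detector:

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Guess the dominant writing system from Unicode character names."""
    scripts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts["Cyrillic"] += 1
        elif name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
            scripts["CJK"] += 1
        elif name.startswith("ARABIC"):
            scripts["Arabic"] += 1
        elif name.startswith("DEVANAGARI"):
            scripts["Devanagari"] += 1
        else:
            scripts["Latin"] += 1
    return scripts.most_common(1)[0][0] if scripts else "Unknown"
```

Majority voting over characters keeps the guess stable even when a transcript mixes in loanwords or Latin brand names.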
Requirements
- macOS with Apple Silicon (M1–M4) — MLX needs Metal
- Python >= 3.12
- FFmpeg — audio extraction, postprocessing, final assembly
- yt-dlp — video download
- Anthropic API key — translation via Claude
Installation
1. Install system dependencies
brew install ffmpeg yt-dlp
Optional:
brew install ffmpeg-full enables pitch-preserving speed-up via rubberband. Without it, yt-dbl falls back to ffmpeg's atempo filter (works fine, just no pitch correction)
2. Install yt-dbl
# From PyPI
uv tool install --prerelease=allow yt-dbl
# Or with pipx
pipx install yt-dbl
--prerelease=allow is needed because mlx-audio depends on a pre-release transformers.
If yt-dbl is not found, run uv tool update-shell && source ~/.zshrc
From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync
Use uv run yt-dbl instead of yt-dbl when running from source
3. Set up the API key
echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc
Or use a .env file:
YT_DBL_ANTHROPIC_API_KEY=sk-ant-...
4. Pre-download models (optional)
Models (~8.2 GB) download automatically on first run, or fetch them ahead of time:
yt-dbl models download
Configuration
Priority: CLI args > env vars (YT_DBL_ prefix) > .env file > defaults
cp .env.example .env
| Env variable | Default | Description |
|---|---|---|
| YT_DBL_ANTHROPIC_API_KEY | — | Required — Anthropic API key |
| YT_DBL_TARGET_LANGUAGE | ru | Target language (ISO 639-1) |
| YT_DBL_OUTPUT_FORMAT | mp4 | mp4 / mkv |
| YT_DBL_SUBTITLE_MODE | softsub | softsub / hardsub / none |
| YT_DBL_BACKGROUND_VOLUME | 0.15 | Background volume during speech (0.0–1.0) |
| YT_DBL_MAX_SPEED_FACTOR | 1.4 | Max TTS speed-up to fit timing (1.0–2.0) |
| YT_DBL_MAX_LOADED_MODELS | 0 (auto) | Max models in memory (0 = auto by RAM) |
| YT_DBL_WORK_DIR | dubbed | Output directory |

See .env.example for all 33 parameters
Quick start
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" # dub to Russian (default)
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es # dub to Spanish
yt-dbl dub "https://youtu.be/VIDEO_ID" -o ./out # custom output dir
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate # re-run from a specific step
yt-dbl resume VIDEO_ID # resume after interrupt
yt-dbl status VIDEO_ID # check job progress
Commands
dub — dub a video
yt-dbl dub <URL> [options]
| Option | Description | Default |
|---|---|---|
| -t, --target-language | Target language | ru |
| -o, --output-dir | Output directory | ./dubbed |
| --bg-volume | Background volume (0.0–1.0) | 0.15 |
| --max-speed | Max TTS speed-up (1.0–2.0) | 1.4 |
| --max-models | Max models in memory | auto |
| --from-step | Start from: download / separate / transcribe / translate / synthesize / assemble | — |
| --no-subs | Disable subtitles | false |
| --sub-mode | softsub / hardsub / none | softsub |
| --format | mp4 / mkv | mp4 |
resume — pick up where it stopped
yt-dbl resume <video_id> [--max-models N] [-o DIR]
status — check job progress
yt-dbl status <video_id>
models list / models download
yt-dbl models list # show models, download status, size
yt-dbl models download # pre-download all models
How it works
┌─────────────────────────────────────────────────────────────────────────────────┐
│ YouTube URL │
└─────────────────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 1. DOWNLOAD │
│ │
│ yt-dlp downloads the video, ffmpeg extracts the audio track │
│ Output: video.mp4, audio.wav (48 kHz, mono) │
└─────────────────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 2. SEPARATE │
│ │
│ BS-RoFormer splits audio into vocals and background (ONNX + CoreML) │
│ Output: vocals.wav, background.wav │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
│ │
vocals.wav background.wav
│ │
▼ │
┌──────────────────────────────────────────────────────┐ │
│ 3. TRANSCRIBE │ │
│ │ │
│ VibeVoice-ASR (MLX, ~5.7 GB) │ │
│ → speech segments + speaker diarization │ │
│ Qwen3-ForcedAligner (MLX, ~600 MB) │ │
│ → word-level timestamps │ │
│ + language auto-detection via Unicode scripts │ │
│ │ │
│ Output: segments.json │ │
└──────────────────────────┬───────────────────────────┘ │
│ │
▼ │
┌──────────────────────────────────────────────────────┐ │
│ 4. TRANSLATE │ │
│ │ │
│ Claude API (single-pass, all segments at once) │ │
│ TTS-friendly output: short phrases, spelled-out │ │
│ numbers, no special characters │ │
│ │ │
│ Output: translations.json, subtitles.srt │ │
└──────────────────────────┬───────────────────────────┘ │
│ │
▼ │
┌──────────────────────────────────────────────────────┐ │
│ 5. SYNTHESIZE │ │
│ │ │
│ Qwen3-TTS (MLX, ~1.7 GB) — voice cloning │ │
│ using a voice reference for each speaker │ │
│ Postprocessing (parallel, ThreadPool): │ │
│ • speed-up (rubberband or atempo) │ │
│ • loudnorm (-16 LUFS, 2-pass) │ │
│ • de-essing │ │
│ │ │
│ Output: segment_0000.wav, segment_0001.wav ... │ │
└──────────────────────────┬───────────────────────────┘ │
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 6. ASSEMBLE │
│ │
│ Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking) │
│ + video (copy) + subtitles (softsub / hardsub / none) │
│ All in a single ffmpeg call │
│ │
│ Output: result.mp4 │
└──────────────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌───────────────────┐
│ result.mp4 │
└───────────────────┘
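The 50 ms equal-power crossfade in step 6 follows the standard cosine/sine gain law: across the overlap, the outgoing and incoming gains satisfy g_out² + g_in² = 1, so perceived loudness stays constant where two segments meet. A minimal pure-Python sketch of the idea (not the project's implementation, which does this inside ffmpeg):

```python
import math

def equal_power_crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Crossfade the tail of `a` into the head of `b` over `overlap` samples."""
    out = a[:-overlap] if overlap else a[:]
    for i in range(overlap):
        t = i / max(overlap - 1, 1)          # 0 -> 1 across the overlap
        g_out = math.cos(t * math.pi / 2)    # fade-out gain
        g_in = math.sin(t * math.pi / 2)     # fade-in gain; g_out**2 + g_in**2 == 1
        out.append(a[len(a) - overlap + i] * g_out + b[i] * g_in)
    out.extend(b[overlap:])
    return out
```

At 48 kHz, a 50 ms overlap is 2400 samples; the equal-power law avoids the audible dip a plain linear crossfade produces on correlated signals.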
Memory management
LRU model manager — auto-selects how many models to keep loaded based on RAM:
RAM Models Batch (separation)
───────────── ─────── ──────────────────
<= 16 GB 1 1
17–31 GB 2 2
32–47 GB 3 4
48+ GB 3 8
ASR (~5.7 GB) is unloaded before loading the Aligner to avoid holding both in memory
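The auto-selection table boils down to a small tier lookup. Thresholds below are copied from the table; the function name itself is hypothetical:

```python
def model_limits(ram_gb: int) -> tuple[int, int]:
    """Return (max loaded models, separation batch size) for a given RAM size."""
    if ram_gb <= 16:
        return 1, 1
    if ram_gb <= 31:
        return 2, 2
    if ram_gb <= 47:
        return 3, 4
    return 3, 8   # 48 GB and up
```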
Output directory structure
dubbed/
└── <video_id>/
├── state.json ← pipeline checkpoint (JSON)
├── 01_download/
│ ├── video.mp4 ← original video
│ └── audio.wav ← extracted audio track (48 kHz, mono)
├── 02_separate/
│ ├── vocals.wav ← isolated vocals
│ └── background.wav ← background music/noise
├── 03_transcribe/
│ └── segments.json ← segments, speakers, words with timestamps
├── 04_translate/
│ ├── translations.json ← translated texts
│ └── subtitles.srt ← subtitles (SRT)
├── 05_synthesize/
│ ├── ref_SPEAKER_00.wav ← speaker voice reference
│ ├── segment_0000.wav ← final segments (after postprocessing)
│ ├── segment_0001.wav
│ └── synth_meta.json ← synthesis metadata
├── 06_assemble/
│ └── speech.wav ← assembled speech track
└── result.mp4 ← final output (in job dir root)
Models
| Model | Size | Task |
|---|---|---|
| VibeVoice-ASR | ~5.7 GB | ASR + speaker diarization |
| Qwen3-ForcedAligner | ~600 MB | Word-level alignment |
| Qwen3-TTS | ~1.7 GB | TTS + voice cloning |
| MelBand-RoFormer (BS-RoFormer) | ~200 MB | Vocal/background separation |
| Claude Sonnet 4.5 | — | Translation (API) |
All local models run on MLX (Metal GPU), total ~8.2 GB
Development
just check # lint + format + typecheck + tests
just test # fast tests (parallel, coverage)
just test-e2e # E2E (needs ffmpeg + network)
just fix # auto-fix lint
just format # auto-format
License
MIT