CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)
yt-dbl
[!WARNING] NOT STABLE, PLEASE DON'T USE FOR LONG VIDEOS
[!WARNING] Apple Silicon only (M1/M2/M3/M4) — all ML inference runs on Metal GPU via MLX
Tested on M4 Pro (20-core GPU, 48 GB unified memory)
CLI tool for automatic YouTube video dubbing with voice cloning.
All ML inference (ASR, alignment, TTS) runs locally on Apple Silicon via MLX. Translation is done through the Claude API. The output is a video file dubbed in the target language using the original speaker's cloned voice.
Supported languages
TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian
ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4) — MLX only works on Metal
- Python >= 3.12
- FFmpeg — used for audio extraction, post-processing, and final assembly
- yt-dlp — used to download videos from YouTube
- Anthropic API key — for translation via Claude
Installation
1. Install system dependencies
# FFmpeg (required)
brew install ffmpeg
# yt-dlp (required)
brew install yt-dlp
Optional: for pitch-preserving speed-up via rubberband, install ffmpeg-full instead:
brew install ffmpeg-full
Without it, the tool falls back to ffmpeg's atempo filter (works fine, just no pitch correction).
2. Install yt-dbl
# From PyPI (recommended)
uv tool install --prerelease=allow yt-dbl
# Or with pipx
pipx install yt-dbl
Note: --prerelease=allow is needed because mlx-audio depends on a pre-release version of transformers.
If yt-dbl is not found after installation, run uv tool update-shell && source ~/.zshrc to add ~/.local/bin to your PATH.
From source
git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync
When running from source, use uv run yt-dbl instead of yt-dbl.
3. Set up the API key
The Anthropic API key is required for the translation step. Add it to your shell profile so it persists across sessions:
echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc
Or create a .env file in the working directory:
YT_DBL_ANTHROPIC_API_KEY=sk-ant-...
4. Pre-download models (optional)
Models (~8.2 GB) are downloaded automatically on first run. To fetch them ahead of time:
yt-dbl models download
Configuration
Settings are loaded in order of priority:
- CLI arguments
- Environment variables (prefix YT_DBL_)
- .env file
- Default values
Copy .env.example to .env and adjust as needed:
cp .env.example .env
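The priority order above can be illustrated with a small resolver. This is a sketch under stated assumptions: `parse_dotenv` and `resolve` are hypothetical helpers, the .env parsing is deliberately simplistic, and how yt-dbl actually loads its settings is not stated here.

```python
from collections.abc import Mapping


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(name: str, cli_value, environ: Mapping[str, str],
            dotenv: Mapping[str, str], default, prefix: str = "YT_DBL_"):
    """Pick a setting: CLI argument, then environment, then .env, then default."""
    if cli_value is not None:
        return cli_value
    key = prefix + name.upper()
    if key in environ:
        return environ[key]
    if key in dotenv:
        return dotenv[key]
    return default
```

In real use, `environ` would be `os.environ` and `dotenv` the parsed contents of the .env file in the working directory.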
Key parameters
| Env variable | Default | Description |
|---|---|---|
| YT_DBL_ANTHROPIC_API_KEY | (none) | Required. Anthropic API key for translation |
| YT_DBL_TARGET_LANGUAGE | ru | Target language (ISO 639-1) |
| YT_DBL_OUTPUT_FORMAT | mp4 | mp4 / mkv |
| YT_DBL_SUBTITLE_MODE | softsub | softsub / hardsub / none |
| YT_DBL_BACKGROUND_VOLUME | 0.15 | Background volume during speech (0.0–1.0) |
| YT_DBL_MAX_SPEED_FACTOR | 1.4 | Max TTS speed-up to fit timing (1.0–2.0) |
| YT_DBL_MAX_LOADED_MODELS | 0 (auto) | Max models in memory (0 = auto by RAM) |
| YT_DBL_WORK_DIR | dubbed | Output directory for all jobs |
See .env.example for the full list of 33 configurable parameters, including model selection, separation tuning, TTS sampling, and chunked ASR settings.
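To make the role of YT_DBL_MAX_SPEED_FACTOR concrete, here is a rough sketch of how a capped speed-up might be computed and turned into an ffmpeg filter string. The function names are hypothetical, and the idea that the factor is the ratio of synthesized duration to the available time slot is an assumption. The `atempo` filter and the `tempo` option of the `rubberband` filter are real ffmpeg features.

```python
def speed_factor(speech_seconds: float, slot_seconds: float,
                 max_factor: float = 1.4) -> float:
    """Speed-up needed to fit speech into its slot, clamped to [1.0, max_factor]."""
    if speech_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    needed = speech_seconds / slot_seconds
    # Never slow speech down, and never exceed the configured cap.
    return min(max(needed, 1.0), max_factor)


def tempo_filter(factor: float, use_rubberband: bool) -> str:
    """Build an ffmpeg audio filter string for the given tempo factor."""
    if use_rubberband:
        return f"rubberband=tempo={factor:.3f}"
    return f"atempo={factor:.3f}"
```

With the default cap of 1.4, a 10-second synthesized segment that must fit a 5-second slot is sped up by 1.4 rather than 2.0, so it still overruns; how the tool handles that remainder is not covered here.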
Quick start
# Dub a video into Russian (default)
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"
# Custom output directory
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-output
# Specify target language
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es
# Start from a specific step (previous steps are skipped)
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate
# Check job status
yt-dbl status VIDEO_ID
# Resume an interrupted job
yt-dbl resume VIDEO_ID
Commands
dub — dub a video
yt-dbl dub <URL> [options]
| Option | Description | Default |
|---|---|---|
| -t, --target-language | Target language | ru |
| -o, --output-dir | Output directory | ./dubbed |
| --bg-volume | Background volume (0.0–1.0) | 0.15 |
| --max-speed | Max TTS speed-up (1.0–2.0) | 1.4 |
| --max-models | Max models in memory | auto (by RAM) |
| --from-step | Start from step: download / separate / transcribe / translate / synthesize / assemble | (none) |
| --no-subs | Disable subtitles | false |
| --sub-mode | Subtitle mode: softsub / hardsub / none | softsub |
| --format | Output format: mp4 / mkv | mp4 |
resume — resume an interrupted job
yt-dbl resume <video_id> [--max-models N] [-o DIR]
The pipeline saves state.json after each step. If interrupted, resume picks up from the last incomplete step.
status — check job status
yt-dbl status <video_id>
Shows a table with each step's state (pending / running / completed / failed), execution time, and video metadata.
models list — list ML models
yt-dbl models list
Shows all models, their download status, and size on disk.
models download — pre-download models
yt-dbl models download
Downloads all HuggingFace models. The audio-separator model is downloaded automatically on first use.
How it works
Input: YouTube URL

1. DOWNLOAD
   yt-dlp downloads the video, ffmpeg extracts the audio track.
   Output: video.mp4, audio.wav (48 kHz, mono)
2. SEPARATE
   BS-RoFormer splits the audio into vocals and background (ONNX + CoreML).
   Output: vocals.wav (goes to step 3), background.wav (goes straight to step 6)
3. TRANSCRIBE
   - VibeVoice-ASR (MLX, ~5.7 GB) → speech segments + speaker diarization
   - Qwen3-ForcedAligner (MLX, ~600 MB) → word-level timestamps
   - language auto-detection via Unicode scripts
   Output: segments.json
4. TRANSLATE
   Claude API (single-pass, all segments at once).
   TTS-friendly output: short phrases, spelled-out numbers, no special characters.
   Output: translations.json, subtitles.srt
5. SYNTHESIZE
   Qwen3-TTS (MLX, ~1.7 GB): voice cloning using a voice reference for each speaker.
   Postprocessing (parallel, ThreadPool):
   - speed-up (rubberband or atempo)
   - loudnorm (-16 LUFS, 2-pass)
   - de-essing
   Output: segment_0000.wav, segment_0001.wav ...
6. ASSEMBLE
   Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking) + video (copy) + subtitles (softsub / hardsub / none), all in a single ffmpeg call.
   Output: result.mp4
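The equal-power crossfade used when joining speech segments in the ASSEMBLE step can be illustrated in plain Python. This is a sketch of the general technique, not the tool's implementation (the README says assembly happens inside a single ffmpeg call); the function names are hypothetical, and the 48 kHz rate in the usage note is an assumption borrowed from the extracted audio track.

```python
import math


def crossfade_samples(sample_rate: int, milliseconds: int = 50) -> int:
    """Number of samples covered by a crossfade of the given length."""
    return sample_rate * milliseconds // 1000


def equal_power_crossfade(tail: list[float], head: list[float]) -> list[float]:
    """Mix the end of one segment into the start of the next.

    Gains follow cos/sin curves, so fade_out**2 + fade_in**2 == 1 at every
    point and the combined power stays constant through the transition.
    """
    if len(tail) != len(head):
        raise ValueError("tail and head must have the same length")
    n = len(tail)
    mixed = []
    for i in range(n):
        t = (i + 0.5) / n  # position in the fade, strictly between 0 and 1
        fade_out = math.cos(t * math.pi / 2)
        fade_in = math.sin(t * math.pi / 2)
        mixed.append(tail[i] * fade_out + head[i] * fade_in)
    return mixed
```

At 48 kHz, `crossfade_samples(48000)` gives 2400 samples for a 50 ms fade.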
Memory management
ML models are loaded and unloaded via an LRU manager.
The number of models kept in memory is determined automatically based on available RAM:
| RAM | Models | Batch (separation) |
|---|---|---|
| <= 16 GB | 1 | 1 |
| 17–31 GB | 2 | 2 |
| 32–47 GB | 3 | 4 |
| 48+ GB | 3 | 8 |
The ASR model (~5.7 GB) is unloaded before the Aligner is loaded, so the two never occupy memory at the same time.
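The RAM tiers and the LRU policy can be sketched together. This is an illustrative model under assumptions: `limits_for_ram` and `LRUModelCache` are hypothetical names, the `loader` callable stands in for whatever actually loads a model, and fractional RAM values between the listed tiers are assigned to the next tier by choice.

```python
from collections import OrderedDict
from collections.abc import Callable


def limits_for_ram(ram_gb: float) -> tuple[int, int]:
    """Return (max loaded models, separation batch size) per the RAM tiers."""
    if ram_gb <= 16:
        return 1, 1
    if ram_gb < 32:
        return 2, 2
    if ram_gb < 48:
        return 3, 4
    return 3, 8


class LRUModelCache:
    """Keep at most `max_loaded` models, evicting the least recently used."""

    def __init__(self, max_loaded: int, loader: Callable[[str], object]):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self.max_loaded = max_loaded
        self._loader = loader
        self._models: OrderedDict[str, object] = OrderedDict()

    def get(self, name: str) -> object:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        while len(self._models) >= self.max_loaded:
            self._models.popitem(last=False)  # drop least recently used
        self._models[name] = self._loader(name)
        return self._models[name]

    def unload(self, name: str) -> None:
        """Explicitly drop a model, e.g. ASR before loading the aligner."""
        self._models.pop(name, None)

    def loaded(self) -> list[str]:
        return list(self._models)
```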
Output directory structure
dubbed/
└── <video_id>/
    ├── state.json              ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4           ← original video
    │   └── audio.wav           ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav          ← isolated vocals
    │   └── background.wav      ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json       ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json   ← translated texts
    │   └── subtitles.srt       ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav  ← speaker voice reference
    │   ├── segment_0000.wav    ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json     ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav          ← assembled speech track
    └── result.mp4              ← final output (in job dir root)
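As an illustration of the subtitles.srt file in 04_translate, the sketch below renders timed segments in SRT format. The segment dictionaries with `start`, `end`, and `text` keys are a hypothetical shape chosen for the example; the actual schema of segments.json and translations.json is not described here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(cues)
```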
Models
| Model | Size | Task | Inference |
|---|---|---|---|
| VibeVoice-ASR | ~5.7 GB | ASR + speaker diarization | MLX (Metal) |
| Qwen3-ForcedAligner | ~600 MB | Word-level alignment | MLX (Metal) |
| Qwen3-TTS | ~1.7 GB | TTS with voice cloning | MLX (Metal) |
| MelBand-RoFormer (BS-RoFormer) | ~200 MB | Vocal/background separation | ONNX + CoreML |
| Claude Sonnet 4.5 | — | Text translation | API (Anthropic) |
Development
just check # lint + format + typecheck + tests
just test # fast tests (parallel, coverage)
just test-e2e # E2E tests (requires FFmpeg + network)
just fix # auto-fix linter
just format # auto-format
License
MIT