CLI tool for automatic YouTube video dubbing with voice cloning (Apple Silicon)

yt-dbl

Warning: Apple Silicon only (M1/M2/M3/M4). All ML inference runs on the Metal GPU via MLX.

Tested on M4 Pro (20-core GPU, 48 GB unified memory)

CLI tool for automatic YouTube video dubbing with voice cloning.

All ML inference (ASR, alignment, TTS) runs locally on Apple Silicon via MLX. Translation is done through the Claude API. The output is a video file dubbed in the target language using the original speaker's cloned voice.

Supported languages

TTS (synthesis): Russian, English, German, French, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Dutch, Polish, Ukrainian

ASR (recognition): auto-detected via Unicode scripts (Latin, Cyrillic, Arabic, Devanagari, CJK, etc.)

Requirements

macOS with Apple Silicon (M1/M2/M3/M4) — MLX only works on Metal
Python >= 3.12
FFmpeg — used for audio extraction, post-processing, and final assembly
yt-dlp — used to download videos from YouTube
Anthropic API key — for translation via Claude
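The requirements above can be checked before a first run with a small preflight script. This is a sketch under stated assumptions, not part of yt-dbl: the `preflight` function is hypothetical, and it only looks at the process environment for the API key, so a key stored in a .env file would be reported as missing.

```python
import os
import platform
import shutil
import sys


def preflight(system=None, machine=None, version=None, which=shutil.which, env=None):
    """Return a list of problems; an empty list means every check passed.

    All arguments are injectable so the checks can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    # platform.machine() reports x86_64 when Python itself runs under Rosetta.
    machine = platform.machine() if machine is None else machine
    version = sys.version_info[:2] if version is None else version
    env = os.environ if env is None else env

    problems = []
    if system != "Darwin" or machine != "arm64":
        problems.append("macOS on Apple Silicon (arm64) is required")
    if tuple(version) < (3, 12):
        problems.append("Python >= 3.12 is required")
    for tool in ("ffmpeg", "yt-dlp"):
        if which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    if not env.get("YT_DBL_ANTHROPIC_API_KEY"):
        problems.append("YT_DBL_ANTHROPIC_API_KEY is not set in the environment")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```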

Installation

1. Install system dependencies

# FFmpeg (required)
brew install ffmpeg

# yt-dlp (required)
brew install yt-dlp

Optional: for pitch-preserving speed-up via rubberband, install ffmpeg-full instead:
brew install ffmpeg-full
Without it, the tool falls back to ffmpeg's atempo filter (works fine, just no pitch correction).
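The fallback described above could look roughly like the following. This is a minimal sketch, not yt-dbl's code: `has_rubberband` and `tempo_filter` are hypothetical helpers. It assumes the ffmpeg build lists its filters via `ffmpeg -filters`, and that the `rubberband` filter's `tempo` option and the `atempo` filter take the speed factor directly.

```python
import subprocess


def has_rubberband(ffmpeg: str = "ffmpeg") -> bool:
    """Return True if this ffmpeg build lists a filter named 'rubberband'."""
    try:
        out = subprocess.run(
            [ffmpeg, "-hide_banner", "-filters"],
            capture_output=True, text=True, check=False,
        ).stdout
    except FileNotFoundError:
        return False  # ffmpeg is not installed or not on PATH
    # Each filter line looks like: "<flags> <name> <io> <description>"
    return any(line.split()[1:2] == ["rubberband"] for line in out.splitlines())


def tempo_filter(speed: float, rubberband_available: bool) -> str:
    """Build an ffmpeg audio filter string that speeds audio up by `speed`."""
    if rubberband_available:
        return f"rubberband=tempo={speed:.3f}"
    return f"atempo={speed:.3f}"
```

The resulting string would be passed to ffmpeg's `-af` option, for example `tempo_filter(1.3, has_rubberband())`.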

2. Install yt-dbl

# From PyPI (recommended)
uv tool install yt-dbl

# Or with pipx
pipx install yt-dbl

From source

git clone git@github.com:brolnickij/yt-dbl.git && cd yt-dbl
uv sync

When running from source, use uv run yt-dbl instead of yt-dbl.

3. Set up the API key

The Anthropic API key is required for the translation step. Add it to your shell profile so it persists across sessions:

echo 'export YT_DBL_ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc
source ~/.zshrc

Or create a .env file in the working directory:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...

4. Pre-download models (optional)

Models (~10.5 GB) are downloaded automatically on first run. To fetch them ahead of time:

yt-dbl models download

Configuration

Settings are loaded in the following order of priority (highest first):

1. CLI arguments
2. Environment variables (prefix YT_DBL_)
3. .env file
4. Default values

Example .env:

YT_DBL_ANTHROPIC_API_KEY=sk-ant-...
YT_DBL_TARGET_LANGUAGE=ru
YT_DBL_BACKGROUND_VOLUME=0.2
YT_DBL_MAX_SPEED_FACTOR=1.3
YT_DBL_OUTPUT_FORMAT=mkv
YT_DBL_SUBTITLE_MODE=hardsub
YT_DBL_BACKGROUND_DUCKING=true
YT_DBL_VOICE_REF_DURATION=7.0
YT_DBL_SAMPLE_RATE=48000
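The priority order can be sketched as a lookup that tries each source in turn. This is an illustration only: `parse_dotenv`, `resolve`, and the trimmed `DEFAULTS` table are hypothetical, and the real tool may rely on a settings library with different parsing rules.

```python
PREFIX = "YT_DBL_"

# A few defaults copied from the parameter table; values kept as strings here.
DEFAULTS = {
    "target_language": "ru",
    "output_format": "mp4",
    "background_volume": "0.15",
}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve(name, cli_args, environ, dotenv, defaults=DEFAULTS):
    """Return the value for `name`: CLI, then environment, then .env, then default."""
    env_key = PREFIX + name.upper()
    if cli_args.get(name) is not None:
        return cli_args[name]
    if env_key in environ:
        return environ[env_key]
    if env_key in dotenv:
        return dotenv[env_key]
    return defaults[name]
```

With the example .env above and no CLI flag or environment variable, `resolve("output_format", {}, {}, dotenv)` would return `"mkv"`; passing `--format mp4` on the command line would win over it.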

All parameters

Parameter	Env variable	Default	Description
`target_language`	`YT_DBL_TARGET_LANGUAGE`	`ru`	Target language
`output_format`	`YT_DBL_OUTPUT_FORMAT`	`mp4`	`mp4` / `mkv`
`subtitle_mode`	`YT_DBL_SUBTITLE_MODE`	`softsub`	`softsub` / `hardsub` / `none`
`background_volume`	`YT_DBL_BACKGROUND_VOLUME`	`0.15`	Background volume (0.0–1.0)
`background_ducking`	`YT_DBL_BACKGROUND_DUCKING`	`true`	Duck background during speech (sidechain)
`max_speed_factor`	`YT_DBL_MAX_SPEED_FACTOR`	`1.4`	Max TTS speed-up (1.0–2.0)
`voice_ref_duration`	`YT_DBL_VOICE_REF_DURATION`	`7.0`	Voice reference duration (3–30 sec)
`max_loaded_models`	`YT_DBL_MAX_LOADED_MODELS`	`0` (auto)	Max models in memory
`anthropic_api_key`	`YT_DBL_ANTHROPIC_API_KEY`	—	Anthropic API key
`work_dir`	`YT_DBL_WORK_DIR`	`dubbed`	Output directory
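The ranges and choices in the table can be expressed as validation rules. The sketch below covers a subset of the parameters; the `DubSettings` class and its helpers are hypothetical and only mirror the documented defaults and limits.

```python
from dataclasses import dataclass


def _check_choice(name: str, value: str, choices: tuple) -> None:
    if value not in choices:
        raise ValueError(f"{name} must be one of {choices}, got {value!r}")


def _check_range(name: str, value: float, low: float, high: float) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be within {low}..{high}, got {value!r}")


@dataclass(frozen=True)
class DubSettings:
    """Illustrative container for a subset of the documented parameters."""

    target_language: str = "ru"
    output_format: str = "mp4"
    subtitle_mode: str = "softsub"
    background_volume: float = 0.15
    max_speed_factor: float = 1.4
    voice_ref_duration: float = 7.0

    def __post_init__(self) -> None:
        _check_choice("output_format", self.output_format, ("mp4", "mkv"))
        _check_choice("subtitle_mode", self.subtitle_mode, ("softsub", "hardsub", "none"))
        _check_range("background_volume", self.background_volume, 0.0, 1.0)
        _check_range("max_speed_factor", self.max_speed_factor, 1.0, 2.0)
        _check_range("voice_ref_duration", self.voice_ref_duration, 3.0, 30.0)
```

Constructing `DubSettings(max_speed_factor=2.5)` raises `ValueError`, since the documented range is 1.0 to 2.0.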

Quick start

# Dub a video into Russian (default)
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID"

# Custom output directory
yt-dbl dub "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-output

# Specify target language
yt-dbl dub "https://youtu.be/VIDEO_ID" -t es

# Start from a specific step (previous steps are skipped)
yt-dbl dub "https://youtu.be/VIDEO_ID" --from-step translate

# Check job status
yt-dbl status VIDEO_ID

# Resume an interrupted job
yt-dbl resume VIDEO_ID

Commands

`dub` — dub a video

yt-dbl dub <URL> [options]

Option	Description	Default
`-t`, `--target-language`	Target language	`ru`
`-o`, `--output-dir`	Output directory	`./dubbed`
`--bg-volume`	Background volume (0.0–1.0)	`0.15`
`--max-speed`	Max TTS speed-up (1.0–2.0)	`1.4`
`--max-models`	Max models in memory	auto (by RAM)
`--from-step`	Start from step: `download` / `separate` / `transcribe` / `translate` / `synthesize` / `assemble`	—
`--no-subs`	Disable subtitles	`false`
`--sub-mode`	Subtitle mode: `softsub` / `hardsub` / `none`	`softsub`
`--format`	Output format: `mp4` / `mkv`	`mp4`

`resume` — resume an interrupted job

yt-dbl resume <video_id> [--max-models N] [-o DIR]

The pipeline saves state.json after each step. If interrupted, resume picks up from the last incomplete step.

`status` — check job status

yt-dbl status <video_id>

Shows a table with each step's state (pending / running / completed / failed), execution time, and video metadata.

`models list` — list ML models

yt-dbl models list

Shows all models, their download status, and size on disk.

`models download` — pre-download models

yt-dbl models download

Downloads all Hugging Face models. The audio-separator model is downloaded automatically on first use.

How it works

┌─────────────────────────────────────────────────────────────────────────────────┐
│                                YouTube URL                                      │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  1. DOWNLOAD                                                                    │
│                                                                                 │
│  yt-dlp downloads the video, ffmpeg extracts the audio track                    │
│  Output: video.mp4, audio.wav (48 kHz, mono)                                    │
└─────────────────────────────────────┬───────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  2. SEPARATE                                                                    │
│                                                                                 │
│  BS-RoFormer splits audio into vocals and background (ONNX + CoreML)            │
│  Output: vocals.wav, background.wav                                             │
└───────────────────────────┬────────────────────────────────────────────┬────────┘
                            │                                            │
                       vocals.wav                                  background.wav
                            │                                            │
                            ▼                                            │
┌──────────────────────────────────────────────────────┐                 │
│  3. TRANSCRIBE                                       │                 │
│                                                      │                 │
│  VibeVoice-ASR (MLX, ~5.7 GB)                        │                 │
│    → speech segments + speaker diarization           │                 │
│  Qwen3-ForcedAligner (MLX, ~600 MB)                  │                 │
│    → word-level timestamps                           │                 │
│  + language auto-detection via Unicode scripts       │                 │
│                                                      │                 │
│  Output: segments.json                               │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  4. TRANSLATE                                        │                 │
│                                                      │                 │
│  Claude API (single-pass, all segments at once)      │                 │
│  TTS-friendly output: short phrases, spelled-out     │                 │
│  numbers, no special characters                      │                 │
│                                                      │                 │
│  Output: translations.json, subtitles.srt            │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             │
┌──────────────────────────────────────────────────────┐                 │
│  5. SYNTHESIZE                                       │                 │
│                                                      │                 │
│  Qwen3-TTS (MLX, ~1.7 GB) — voice cloning            │                 │
│  using a voice reference for each speaker            │                 │
│  Postprocessing (parallel, ThreadPool):              │                 │
│    • speed-up (rubberband or atempo)                 │                 │
│    • loudnorm (-16 LUFS, 2-pass)                     │                 │
│    • de-essing                                       │                 │
│                                                      │                 │
│  Output: segment_0000.wav, segment_0001.wav ...      │                 │
└──────────────────────────┬───────────────────────────┘                 │
                           │                                             │
                           ▼                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  6. ASSEMBLE                                                                    │
│                                                                                 │
│  Speech track (crossfade 50 ms, equal-power) + background (sidechain ducking)   │
│  + video (copy) + subtitles (softsub / hardsub / none)                          │
│  All in a single ffmpeg call                                                    │
│                                                                                 │
│  Output: result.mp4                                                             │
└──────────────────────────────────────────┬──────────────────────────────────────┘
                                           │
                                           ▼
                                 ┌───────────────────┐
                                 │    result.mp4     │
                                 └───────────────────┘
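The 50 ms equal-power crossfade in the assemble step can be illustrated in plain Python. According to the diagram, the tool does this inside a single ffmpeg call, so the code below is not how yt-dbl works; it only shows the gain curves, using hypothetical helpers `equal_power_gains` and `crossfade`. At 48 kHz, 50 ms corresponds to 2400 samples.

```python
import math


def equal_power_gains(n: int) -> list[tuple[float, float]]:
    """Return (fade_out, fade_in) gain pairs for an n-sample crossfade.

    The gains follow cos/sin quarter-waves, so fade_out**2 + fade_in**2 == 1
    at every sample and the summed power stays constant.
    """
    gains = []
    for i in range(n):
        t = (i + 0.5) / n  # position within the overlap, in (0, 1)
        gains.append((math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)))
    return gains


def crossfade(a: list[float], b: list[float], n: int) -> list[float]:
    """Join a and b, overlapping the last n samples of a with the first n of b."""
    if n > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the segments")
    start = len(a) - n
    overlap = [
        a[start + i] * fade_out + b[i] * fade_in
        for i, (fade_out, fade_in) in enumerate(equal_power_gains(n))
    ]
    return a[:start] + overlap + b[n:]


SAMPLE_RATE = 48_000
OVERLAP_SAMPLES = 50 * SAMPLE_RATE // 1000  # 50 ms
```

An equal-power curve is usually preferred over a linear one when the two segments are uncorrelated, because a linear fade produces a perceived dip in loudness at the midpoint.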

Memory management

ML models are loaded and unloaded via an LRU manager.

The number of models kept in memory is determined automatically based on available RAM:

RAM              Models     Batch (separation)
─────────────    ───────    ──────────────────
<= 16 GB         1          1
17–31 GB         2          2
32–47 GB         3          4
48+ GB           3          8

The ASR model (~8 GB) is unloaded before loading the Aligner so both don't occupy memory at the same time.
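The table and the LRU behaviour can be sketched together. This is an assumption-laden illustration, not yt-dbl's manager: `limits_for_ram` and `LRUModelCache` are hypothetical names, RAM values between the table's integer boundaries (for example 16.5 GB) are assigned to the next tier up, and the loader and unloader callbacks stand in for real model loading.

```python
from collections import OrderedDict


def limits_for_ram(ram_gb: float) -> tuple[int, int]:
    """Return (max_loaded_models, separation_batch_size) per the table above."""
    if ram_gb <= 16:
        return (1, 1)
    if ram_gb < 32:
        return (2, 2)
    if ram_gb < 48:
        return (3, 4)
    return (3, 8)


class LRUModelCache:
    """Keep at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity, loader, unloader=lambda name, model: None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._loader = loader
        self._unloader = unloader
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        # Free memory before loading, so old and new models never coexist
        # beyond the configured capacity.
        while len(self._models) >= self.capacity:
            old_name, old_model = self._models.popitem(last=False)
            self._unloader(old_name, old_model)
        model = self._loader(name)
        self._models[name] = model
        return model

    def loaded(self) -> list:
        return list(self._models)
```

With a capacity of 1, requesting the aligner after the ASR model evicts the ASR model first, which matches the unload-before-load behaviour described above.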

Output directory structure

dubbed/
└── <video_id>/
    ├── state.json                  ← pipeline checkpoint (JSON)
    ├── 01_download/
    │   ├── video.mp4               ← original video
    │   └── audio.wav               ← extracted audio track (48 kHz, mono)
    ├── 02_separate/
    │   ├── vocals.wav              ← isolated vocals
    │   └── background.wav          ← background music/noise
    ├── 03_transcribe/
    │   └── segments.json           ← segments, speakers, words with timestamps
    ├── 04_translate/
    │   ├── translations.json       ← translated texts
    │   └── subtitles.srt           ← subtitles (SRT)
    ├── 05_synthesize/
    │   ├── ref_SPEAKER_00.wav      ← speaker voice reference
    │   ├── segment_0000.wav        ← final segments (after postprocessing)
    │   ├── segment_0001.wav
    │   └── synth_meta.json         ← synthesis metadata
    ├── 06_assemble/
    │   └── speech.wav              ← assembled speech track
    └── result.mp4                  ← final output (in job dir root)
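For scripts that post-process a finished job, the layout above can be captured with pathlib. A minimal sketch: `job_paths` is a hypothetical helper, the zero-padded segment names follow the `segment_0000.wav` pattern shown in the tree, and only the `result.mp4` name from the tree is covered.

```python
from pathlib import Path


def job_paths(work_dir: str, video_id: str, n_segments: int) -> dict:
    """Map logical names to the files of one job, following the tree above."""
    job = Path(work_dir) / video_id
    synth = job / "05_synthesize"
    return {
        "state": job / "state.json",
        "video": job / "01_download" / "video.mp4",
        "audio": job / "01_download" / "audio.wav",
        "vocals": job / "02_separate" / "vocals.wav",
        "background": job / "02_separate" / "background.wav",
        "segments_json": job / "03_transcribe" / "segments.json",
        "translations": job / "04_translate" / "translations.json",
        "subtitles": job / "04_translate" / "subtitles.srt",
        "segments": [synth / f"segment_{i:04d}.wav" for i in range(n_segments)],
        "speech": job / "06_assemble" / "speech.wav",
        "result": job / "result.mp4",
    }
```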

Models

Model	Size	Task	Inference
VibeVoice-ASR	~5.7 GB	ASR + speaker diarization	MLX (Metal)
Qwen3-ForcedAligner	~600 MB	Word-level alignment	MLX (Metal)
Qwen3-TTS	~1.7 GB	TTS with voice cloning	MLX (Metal)
MelBand-RoFormer (BS-RoFormer)	~200 MB	Vocal/background separation	ONNX + CoreML
Claude Sonnet 4.5	—	Text translation	API (Anthropic)

Development

just check          # lint + format + typecheck + tests
just test           # fast tests (parallel, coverage)
just test-e2e       # E2E tests (requires FFmpeg + network)
just fix            # auto-fix linter
just format         # auto-format

License

MIT

Project details

Maintainers

brolnickij

Release history

Version	Date
1.8.0	Feb 8, 2026
1.7.1	Feb 8, 2026
1.7.0	Feb 7, 2026
1.6.4	Feb 7, 2026
1.6.3	Feb 7, 2026
1.6.2	Feb 7, 2026
1.6.1	Feb 7, 2026
1.6.0	Feb 7, 2026
1.5.0	Feb 6, 2026
1.4.1	Feb 6, 2026
1.4.0	Feb 6, 2026
1.3.0	Feb 6, 2026
1.2.0	Feb 6, 2026 (this version)
1.1.0	Feb 6, 2026
1.0.0	Feb 6, 2026

