Synthesize a timestamp-synced speech track from a subtitle file and mux it into video
srt2speech
Turn a subtitle file into a timestamp-synced speech track and mux it into a video.
Give it a video + an .srt (or .vtt/.ass); it synthesizes audio where each subtitle is spoken
at its timestamp, then optionally muxes the track back in with ffmpeg. Useful for restoring lost
audio, rough translation dubs, narrating silent videos, or adding audio description by reading
only the descriptive/SDH cues.
It does the SRT→audio part well and nothing else: no translation, no transcription — bring an already-final subtitle file.
Requirements
- Python ≥ 3.11, uv
- ffmpeg/ffprobe on PATH
- A TTS backend:
  - piper: a local gopipertts server (free, default; set SRT2SPEECH_PIPER_URL if not on http://localhost:8080)
  - openai: gpt-4o-mini-tts (set OPENAI_API_KEY)
  - elevenlabs: eleven_multilingual_v2 (set ELEVENLABS_API_KEY)
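The requirements above can be checked before a long run. The following is a minimal preflight sketch, not part of srt2speech: the function name `missing_prerequisites` and the `REQUIRED_ENV` table are hypothetical, and only the tool and variable names come from the list above.

```python
import shutil
from typing import Callable, Mapping, Optional

# Environment variable each hosted backend needs, per the requirements list.
# piper needs no key, so it has no entry here.
REQUIRED_ENV = {"openai": "OPENAI_API_KEY", "elevenlabs": "ELEVENLABS_API_KEY"}


def missing_prerequisites(
    backend: str,
    env: Mapping[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = [
        f"{tool} not found on PATH"
        for tool in ("ffmpeg", "ffprobe")
        if which(tool) is None
    ]
    key = REQUIRED_ENV.get(backend)
    if key and not env.get(key):
        problems.append(f"{key} is not set (needed by the {backend} backend)")
    return problems
```

Calling `missing_prerequisites("openai", os.environ)` returns a list you can print before invoking the CLI. The `which` parameter exists only so the check can be exercised without touching the real PATH.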
Install
Run it straight from PyPI with no install — uvx fetches
it on first use:
uvx srt2speech --help
Or install it as a persistent tool (then just call srt2speech):
uv tool install srt2speech
Usage
# generate a synced track with the local piper backend, sized to the video
uvx srt2speech generate subs.srt --video clip.mp4 -o track.wav
# generate + mux into the video in one step
uvx srt2speech run clip.mp4 subs.srt -o dubbed.mp4
# emit one audio file per segment + a manifest, instead of a merged track
uvx srt2speech generate subs.srt --chunks ./chunks
# raw, per-cue synthesis (no time-fitting)
uvx srt2speech generate subs.srt --chunks ./chunks --chunk-by cue --chunk-audio raw
# chunks with ending verification: transcribe each chunk, re-synthesize dropped endings
OPENAI_API_KEY=... uvx srt2speech generate subs.srt --backend openai \
--chunks ./chunks --verify-endings
# surgically re-synthesize chunks 3 and 17 into an existing chunks dir
uvx srt2speech generate subs.srt --chunks ./chunks --only 3,17
# paid backend with delivery guidance
OPENAI_API_KEY=... uvx srt2speech generate subs.srt \
--backend openai --voice coral --instructions "calm documentary narration" -o track.wav
# audio description: only descriptive/SDH cues, mixed over the existing audio
uvx srt2speech run movie.mkv subs.srt --mode descriptive --mux-mode mix -o described.mkv
# mux an existing track yourself
uvx srt2speech mux clip.mp4 track.wav -o dubbed.mp4
# list a backend's voices
uvx srt2speech voices --backend openai
Docker Compose
Runs a local piper server plus an on-demand CLI; no host Python or ffmpeg needed. Put your video
and subtitles in ./data (mounted at /data); pulled voices are cached in ./voices.
# 1. start the piper TTS server (preloads the default voice)
docker compose up -d gopipertts
# 2. run the CLI against files in ./data
docker compose run --rm srt2speech run /data/clip.mp4 /data/subs.srt -o /data/dubbed.mp4
# 3. tear down when done
docker compose down
For the OpenAI backend, put OPENAI_API_KEY=sk-... in a .env file (gitignored) — Compose loads it
automatically and passes it through to the CLI container.
Sync strategies (--strategy)
Speech rarely fits a cue's window exactly. The fit engine offers:
- hybrid (default): fit into the cue window plus the silent gap before the next cue; only then speed up, capped by --max-speedup (default 1.15).
- overflow: never speed up; let speech run into following silence (best quality, can drift).
- precise: fit the exact cue window, speeding up to the cap.
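To make the difference between the three strategies concrete, here is a rough sketch of the arithmetic they imply. It is an illustration under assumptions, not the tool's actual fit engine: `FitPlan` and `plan_fit` are hypothetical names, and the sketch assumes the speedup is a plain playback-rate multiplier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitPlan:
    speedup: float      # playback-rate multiplier (1.0 = natural speed)
    duration_ms: float  # length of the speech after applying the speedup


def plan_fit(
    speech_ms: float,
    window_ms: float,
    gap_ms: float,
    strategy: str = "hybrid",
    max_speedup: float = 1.15,
) -> FitPlan:
    """Pick a speedup for one cue, given its natural synthesized length."""
    if strategy == "overflow":
        # Never speed up; the speech may run past the cue window.
        return FitPlan(1.0, speech_ms)
    if strategy == "precise":
        budget = window_ms            # exact cue window only
    elif strategy == "hybrid":
        budget = window_ms + gap_ms   # window plus the silence before the next cue
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Rate needed to fit the budget; a non-positive budget falls back to the cap.
    needed = speech_ms / budget if budget > 0 else max_speedup
    # Never slow down below natural speed, never exceed the cap.
    speedup = min(max(1.0, needed), max_speedup)
    return FitPlan(speedup, speech_ms / speedup)
```

For example, 2300 ms of speech in a 2000 ms cue followed by 500 ms of silence needs no speedup under hybrid, because the 2500 ms budget already covers it, whereas precise would push the rate up to the cap.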
Modes (--mode)
all (default) · descriptive (SDH/audio-description only) · dialogue (drop sound cues).
Chunked output (--chunks)
Instead of one merged track, generate --chunks DIR writes each piece of speech to its own
.wav and a manifest.json mapping every file back to its timing and text — handy for
re-importing into a video editor. In this mode -o/--video are ignored.
- --chunk-by: segment (default) gives merged sentence-sized units (better prosody); cue gives one file per raw subtitle entry.
- --chunk-audio: fitted (default) is time-shaped to its window per --strategy; raw is the natural synthesis with no time-stretching.
Files are named <index>_<start_ms>ms.wav (e.g. 0003_0012400ms.wav). The manifest records
start_ms/end_ms (the cue/segment window), audio_ms (rendered length), and text per chunk.
- --only 3,17: re-synthesize just those chunk indexes into an existing chunks dir, overwriting their audio and refreshing their manifest entries. Use it for surgical fixes after editing a cue's text, without paying to re-render everything else.
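When re-importing chunks into an editor, the file names and manifest fields described above are usually all you need. The sketch below parses a chunk file name and flags entries whose rendered audio is longer than their window. The helper names are hypothetical, and the top-level layout of manifest.json is not specified here, so `overrunning` takes an already-extracted sequence of per-chunk entries rather than guessing how to load them.

```python
import re
from typing import Iterable, Mapping

# Matches names like 0003_0012400ms.wav: <index>_<start_ms>ms.wav
CHUNK_NAME = re.compile(r"^(\d+)_(\d+)ms\.wav$")


def parse_chunk_name(filename: str) -> tuple[int, int]:
    """Return (index, start_ms) from a chunk file name."""
    match = CHUNK_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a chunk file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def overrunning(entries: Iterable[Mapping]) -> list[Mapping]:
    """Entries whose rendered audio_ms exceeds the start_ms..end_ms window."""
    return [e for e in entries if e["audio_ms"] > e["end_ms"] - e["start_ms"]]
```

Entries returned by `overrunning` are natural candidates for a text edit followed by a targeted `--only` re-run.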
Ending verification (--verify-endings)
Some TTS models — observed extensively on gpt-4o-mini-tts — occasionally drop a short
trailing sentence during synthesis: the audio just ends early, so duration checks pass and the
loss is silent. --verify-endings (chunks mode only) closes that hole: after synthesis it
transcribes each chunk (OpenAI whisper-1, so OPENAI_API_KEY is required regardless of TTS
backend) and checks the chunk's last sentence was actually spoken, comparing content words
so hyphenation, numerals, and currency wording don't false-positive. Chunks that lost their
ending are re-synthesized and re-checked, up to --verify-rounds (default 3) rounds.
Each pass writes a verify.json verdict ({ok, checked, failed[]}) next to the manifest. If a
chunk still fails after all rounds the command exits non-zero and names the cues — the drop is
stochastic but text-dependent, so rewrite the cue (fold the short trailing sentence into the
prior one) and re-run with --only <index> --verify-endings. --verify-thresh (default 0.5)
sets the fraction of last-sentence content words that must be heard.
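The content-word check can be pictured with a small sketch. This is an approximation under stated assumptions, not the tool's implementation: the stopword list, the sentence splitter, and the function names are all hypothetical, and the numeral and currency normalization mentioned above is left out.

```python
import re

# A tiny illustrative stopword list; the real filter is not documented here.
STOPWORDS = frozenset(
    "a an the and or but of to in on at is are was were it this that for with as by be".split()
)


def words(text: str) -> list[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def last_sentence(text: str) -> str:
    """Naive split on ., ! or ? followed by whitespace."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    return parts[-1] if parts else ""


def ending_heard(chunk_text: str, transcript: str, thresh: float = 0.5) -> bool:
    """True if enough content words of the last sentence appear in the transcript."""
    content = [w for w in words(last_sentence(chunk_text)) if w not in STOPWORDS]
    if not content:
        return True  # nothing checkable in the last sentence
    heard = set(words(transcript))
    fraction = sum(w in heard for w in content) / len(content)
    return fraction >= thresh
```

With the chunk text "The storm passed overnight. Nobody was hurt." and a transcript that stops after "overnight", none of the last sentence's content words are heard, so the chunk would be flagged for re-synthesis.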
Development
From a clone of the repo:
uv sync
uv run srt2speech --help
uv run pytest
uv run ruff check