Synthesize a timestamp-synced speech track from a subtitle file and mux it into video

srt2speech

Turn a subtitle file into a timestamp-synced speech track and mux it into a video.

Give it a video + an .srt (or .vtt/.ass); it synthesizes audio where each subtitle is spoken at its timestamp, then optionally muxes the track back in with ffmpeg. Useful for restoring lost audio, rough translation dubs, narrating silent videos, or adding audio description by reading only the descriptive/SDH cues.

It does the SRT→audio part well and nothing else: no translation, no transcription — bring an already-final subtitle file.
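To make "spoken at its timestamp" concrete, here is a minimal sketch of reading an .srt file into timed cues. It is an illustration only, not srt2speech's own parser; the names `Cue` and `parse_srt` are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:12,400 --> 00:00:15,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert hour/minute/second/millisecond fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues (hypothetical helper, not the tool's API)."""
    cues = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cue blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                cues.append(Cue(_to_ms(*g[:4]), _to_ms(*g[4:]), text))
                break
    return cues
```

Each cue's `start_ms` is where its synthesized speech would be placed on the output track.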

Requirements

  • Python ≥ 3.11, uv
  • ffmpeg / ffprobe on PATH
  • A TTS backend:
    • piper: a local gopipertts server (free, default; set SRT2SPEECH_PIPER_URL if not on http://localhost:8080)
    • openai: gpt-4o-mini-tts (set OPENAI_API_KEY)
    • elevenlabs: eleven_multilingual_v2 (set ELEVENLABS_API_KEY)

Install

Run it straight from PyPI with no install — uvx fetches it on first use:

uvx srt2speech --help

Or install it as a persistent tool (then just call srt2speech):

uv tool install srt2speech

Usage

# generate a synced track with the local piper backend, sized to the video
uvx srt2speech generate subs.srt --video clip.mp4 -o track.wav

# generate + mux into the video in one step
uvx srt2speech run clip.mp4 subs.srt -o dubbed.mp4

# emit one audio file per segment + a manifest, instead of a merged track
uvx srt2speech generate subs.srt --chunks ./chunks

# raw, per-cue synthesis (no time-fitting)
uvx srt2speech generate subs.srt --chunks ./chunks --chunk-by cue --chunk-audio raw

# chunks with ending verification: transcribe each chunk, re-synthesize dropped endings
OPENAI_API_KEY=... uvx srt2speech generate subs.srt --backend openai \
    --chunks ./chunks --verify-endings

# surgically re-synthesize chunks 3 and 17 into an existing chunks dir
uvx srt2speech generate subs.srt --chunks ./chunks --only 3,17

# paid backend with delivery guidance
OPENAI_API_KEY=... uvx srt2speech generate subs.srt \
    --backend openai --voice coral --instructions "calm documentary narration" -o track.wav

# audio description: only descriptive/SDH cues, mixed over the existing audio
uvx srt2speech run movie.mkv subs.srt --mode descriptive --mux-mode mix -o described.mkv

# mux an existing track yourself
uvx srt2speech mux clip.mp4 track.wav -o dubbed.mp4

# list a backend's voices
uvx srt2speech voices --backend openai
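For the mux step, the README does not show the ffmpeg command srt2speech actually runs. The sketch below is one plausible way to build such a command by hand, assuming the video's first input carries the picture and, for mixing, an existing audio stream. `build_mux_command` is a hypothetical helper; only the ffmpeg options themselves (`-map`, `-c:v copy`, the `amix` filter) are standard.

```python
def build_mux_command(video: str, track: str, output: str, mix: bool = False) -> list[str]:
    """Build an ffmpeg argument list (illustrative, not the tool's own command)."""
    cmd = ["ffmpeg", "-y", "-i", video, "-i", track]
    if mix:
        # Blend the original audio with the speech track, keeping the
        # original's duration.
        cmd += [
            "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[aout]",
            "-map", "0:v", "-map", "[aout]",
        ]
    else:
        # Take video from the first input and audio from the speech track only.
        cmd += ["-map", "0:v", "-map", "1:a"]
    # Copy the video stream without re-encoding.
    cmd += ["-c:v", "copy", output]
    return cmd
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to execute it; building a list rather than a shell string avoids quoting problems with file names.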

Docker Compose

Runs a local piper server plus an on-demand CLI; no host Python or ffmpeg needed. Put your video and subtitles in ./data (mounted at /data); pulled voices are cached in ./voices.

# 1. start the piper TTS server (preloads the default voice)
docker compose up -d gopipertts

# 2. run the CLI against files in ./data
docker compose run --rm srt2speech run /data/clip.mp4 /data/subs.srt -o /data/dubbed.mp4

# 3. tear down when done
docker compose down

For the OpenAI backend, put OPENAI_API_KEY=sk-... in a .env file (gitignored) — Compose loads it automatically and passes it through to the CLI container.

Sync strategies (--strategy)

Speech rarely fits a cue's window exactly. The fit engine offers:

  • hybrid (default) — fit into the cue window plus the silent gap before the next cue; only then speed up, capped by --max-speedup (default 1.15).
  • overflow — never speed up; let speech run into following silence (best quality, can drift).
  • precise — fit the exact cue window, speeding up to the cap.
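The three strategies can be sketched as a single decision function. This is a simplified reading of the descriptions above, not the fit engine's real code; `plan_fit` and its parameters are hypothetical names, and how the actual tool handles speech that still does not fit at the cap is not covered here.

```python
def plan_fit(
    speech_ms: float,
    cue_ms: float,
    gap_ms: float,
    strategy: str = "hybrid",
    max_speedup: float = 1.15,
) -> tuple[float, float]:
    """Return (speedup, rendered_ms) for one cue under the given strategy."""
    if strategy == "overflow":
        # Never speed up; speech may run past the cue window.
        return 1.0, speech_ms
    if strategy == "hybrid":
        # The cue window plus the silence before the next cue is available.
        window = cue_ms + gap_ms
    elif strategy == "precise":
        # Only the exact cue window is available.
        window = cue_ms
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if speech_ms <= window:
        return 1.0, speech_ms
    # Speed up just enough to fit, but never beyond the cap.
    speedup = min(speech_ms / window, max_speedup)
    return speedup, speech_ms / speedup
```

For a 3000 ms utterance in a 2500 ms cue followed by 1000 ms of silence, hybrid leaves the speech untouched, while precise hits the 1.15 cap and still overruns the window slightly.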

Modes (--mode)

all (default) · descriptive (SDH/audio-description only) · dialogue (drop sound cues).

Chunked output (--chunks)

Instead of one merged track, generate --chunks DIR writes each piece of speech to its own .wav and a manifest.json mapping every file back to its timing and text — handy for re-importing into a video editor. In this mode -o/--video are ignored.

  • --chunk-by: segment (default) writes merged sentence-sized units (better prosody); cue writes one file per raw subtitle entry.
  • --chunk-audio: fitted (default) is time-shaped to its window per --strategy; raw is the natural synthesis with no time-stretching.

Files are named <index>_<start_ms>ms.wav (e.g. 0003_0012400ms.wav). The manifest records start_ms/end_ms (the cue/segment window), audio_ms (rendered length), and text per chunk.
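When scripting against a chunks directory, helpers like the following may be useful. They are a sketch under assumptions: the zero-padding widths (4 digits for the index, 7 for the start time) are inferred from the single example name above, and the manifest entries are assumed to be dicts carrying the documented start_ms, end_ms, and audio_ms fields. The overall layout of manifest.json is not specified here, so check a real file before relying on this.

```python
import re

CHUNK_NAME = re.compile(r"^(\d+)_(\d+)ms\.wav$")


def chunk_filename(index: int, start_ms: int) -> str:
    """Format a chunk name; padding widths are inferred from the README example."""
    return f"{index:04d}_{start_ms:07d}ms.wav"


def parse_chunk_filename(name: str) -> tuple[int, int]:
    """Recover (index, start_ms) from a chunk file name."""
    match = CHUNK_NAME.match(name)
    if match is None:
        raise ValueError(f"not a chunk file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def overrunning(entries: list[dict]) -> list[dict]:
    """Return entries whose rendered audio is longer than their window."""
    return [e for e in entries if e["audio_ms"] > e["end_ms"] - e["start_ms"]]
```

`overrunning` is a quick way to spot chunks that will overlap the next cue when placed on a timeline.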

  • --only 3,17: re-synthesize just those chunk indexes into an existing chunks dir, overwriting their audio and refreshing their manifest entries. Use it for surgical fixes after editing a cue's text, so you don't pay to re-render everything else.

Ending verification (--verify-endings)

Some TTS models — observed extensively on gpt-4o-mini-tts — occasionally drop a short trailing sentence during synthesis: the audio just ends early, so duration checks pass and the loss is silent. --verify-endings (chunks mode only) closes that hole: after synthesis it transcribes each chunk (OpenAI whisper-1, so OPENAI_API_KEY is required regardless of TTS backend) and checks the chunk's last sentence was actually spoken, comparing content words so hyphenation, numerals, and currency wording don't false-positive. Chunks that lost their ending are re-synthesized and re-checked, up to --verify-rounds (default 3) rounds.

Each pass writes a verify.json verdict ({ok, checked, failed[]}) next to the manifest. If a chunk still fails after all rounds the command exits non-zero and names the cues — the drop is stochastic but text-dependent, so rewrite the cue (fold the short trailing sentence into the prior one) and re-run with --only <index> --verify-endings. --verify-thresh (default 0.5) sets the fraction of last-sentence content words that must be heard.
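The last-sentence check can be illustrated with a small content-word comparison. This is a rough sketch of the idea, not the tool's implementation: the stopword list, the sentence splitting, and the function names are all assumptions, and it leaves out the normalization of hyphenation, numerals, and currency wording that the real check performs.

```python
import re

# A tiny illustrative stopword list; the real tool's list is not documented here.
STOPWORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "it", "that", "this", "for", "with",
})


def content_words(text: str) -> list[str]:
    """Lowercase, tokenize on letters and digits, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def last_sentence(text: str) -> str:
    """Return the final sentence, splitting after ., ! or ?."""
    parts = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p.strip()]
    return parts[-1] if parts else ""


def ending_was_spoken(expected_text: str, transcript: str, thresh: float = 0.5) -> bool:
    """True if enough content words of the last sentence appear in the transcript."""
    wanted = content_words(last_sentence(expected_text))
    if not wanted:
        # Nothing to check, so there is nothing that could have been dropped.
        return True
    heard = set(content_words(transcript))
    hits = sum(1 for word in wanted if word in heard)
    return hits / len(wanted) >= thresh
```

With the default threshold of 0.5, a transcript that contains at least half of the final sentence's content words passes, which tolerates small transcription errors while still catching a dropped sentence.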

Development

From a clone of the repo:

uv sync
uv run srt2speech --help
uv run pytest
uv run ruff check

Download files

Source Distribution

srt2speech-1.6.0.tar.gz (21.4 kB)

Uploaded Source

Built Distribution

srt2speech-1.6.0-py3-none-any.whl (29.9 kB)

Uploaded Python 3

File details

Details for the file srt2speech-1.6.0.tar.gz.

File metadata

  • Download URL: srt2speech-1.6.0.tar.gz
  • Upload date:
  • Size: 21.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for srt2speech-1.6.0.tar.gz
Algorithm Hash digest
SHA256 835a7473d3717f9cfbd5a579187dcfc79c53962a628473b0fdb42664190d6b89
MD5 7453786c169c411c3ff012414ba26878
BLAKE2b-256 5e679ae89de8b1c5e12892fff15036af3a057213a4b3b157de688b5a1a61fa98

Provenance

The following attestation bundles were made for srt2speech-1.6.0.tar.gz:

Publisher: publish.yml on nbr23/srt2speech

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file srt2speech-1.6.0-py3-none-any.whl.

File metadata

  • Download URL: srt2speech-1.6.0-py3-none-any.whl
  • Upload date:
  • Size: 29.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for srt2speech-1.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b0c032cd332dc3b655a852749989ffbaa946f05691f62947bbb1417aaa3ad54d
MD5 2e628d7dc8b172f413eb8064d2cd5633
BLAKE2b-256 7532a580149bc1b6627ced935a212eeb442b6417b11d381e931036aef1a676f8

Provenance

The following attestation bundles were made for srt2speech-1.6.0-py3-none-any.whl:

Publisher: publish.yml on nbr23/srt2speech

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
