Generate VTT subtitles with timestamps snapped to voice onset using WhisperX forced alignment
vtt-synced-voice
Generate VTT subtitles with timestamps precisely snapped to voice onset using WhisperX forced alignment.
Features
- Word-level timestamp alignment via WhisperX (Whisper + wav2vec2 forced alignment)
- FCP-style peak normalization for recording-level-independent silence detection
- Bidirectional onset detection: scans backward when the CTC-aligned start falls inside voiced audio, and forward when it falls in silence
- Guaranteed silence gap between cues (100 ms minimum)
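The bidirectional scan listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual `find_onset()`: the function name `find_onset_sketch`, its parameters, and the per-frame RMS input are assumptions made for the example.

```python
from typing import Optional, Sequence

def find_onset_sketch(rms: Sequence[float], start_frame: int,
                      threshold: float = 0.001) -> Optional[int]:
    """Hypothetical bidirectional scan over per-frame RMS values.

    If the starting frame is voiced, walk backward to the first voiced
    frame of that run. If it is silent, walk forward to the next voiced
    frame. Returns a frame index, or None when no voiced frame follows.
    """
    if not rms:
        return None
    # Clamp the starting point into the valid frame range.
    i = min(max(start_frame, 0), len(rms) - 1)
    if rms[i] >= threshold:
        # Start is inside voice: scan backward to where the voice begins.
        while i > 0 and rms[i - 1] >= threshold:
            i -= 1
        return i
    # Start is in silence: scan forward to the next voiced frame.
    while i < len(rms) and rms[i] < threshold:
        i += 1
    return i if i < len(rms) else None
```

Multiplying the returned frame index by the frame duration gives an onset time in seconds, which is where a margin such as `margin_before` could then be applied.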
Installation
1. Install ffmpeg
macOS
brew install ffmpeg
Windows
winget install ffmpeg
Linux (Debian / Ubuntu)
sudo apt install ffmpeg
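To confirm that ffmpeg is reachable before continuing, a quick check with the Python standard library is enough. The helper name `ffmpeg_available` is hypothetical; it only looks for the executable on PATH and does not verify the ffmpeg version.

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before using vtt-synced-voice")
```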
2. Install PyTorch
WhisperX runs on PyTorch. GPU (CUDA) is significantly faster than CPU for transcription (roughly 10–20x). Install the build that matches your environment.
macOS — CPU only (no CUDA support on macOS)
pip install torch torchaudio
Windows — CUDA 12.8 (recommended for RTX 30xx / 40xx and newer)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
Windows — CUDA 11.8 (for older GPUs such as GTX 10xx / RTX 20xx)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
Windows — CPU only (no NVIDIA GPU)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
Linux — CUDA 12.8 (recommended for RTX 30xx / 40xx and newer)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
Linux — CUDA 11.8 (for older GPUs such as GTX 10xx / RTX 20xx)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
Linux — CPU only (no NVIDIA GPU)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
To check your CUDA version, run nvidia-smi (Windows/Linux). If you don't have an NVIDIA GPU, use the CPU build. For the full list of builds, see the PyTorch installation guide.
3. Install vtt-synced-voice
pip install vtt-synced-voice
Usage
from vtt_synced_voice import transcribe
transcribe(
    audio_file="sample.m4a",
    output_file="output.vtt",
    language="ja",            # "ja" / "en" / etc.
    model="large-v2",         # "small" / "medium" / "large-v2"
    device="cpu",             # "cpu" / "cuda"
    margin_before=0.066,      # seconds to shift start earlier after onset detection
    margin_after=0.0,         # seconds to extend end
    silence_threshold=0.001,  # RMS threshold after peak normalization
    verbose=True,
)
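Rather than hard-coding `device`, the value can be chosen at runtime. The sketch below is one way to do it, assuming PyTorch is installed as described in the installation steps; the helper name `pick_device` is hypothetical and falls back to "cpu" when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing: fall back to the CPU setting.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
# The result can then be passed on, e.g. transcribe(..., device=device)
```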
silence_threshold
After peak normalization, the RMS of complete silence is approximately 0.0 and that of voiced speech falls roughly between 0.05 and 1.0.
The default of 0.001 works well for clean recordings with no background noise.
Use verbose=True to inspect the onset detection results, and raise or lower the threshold if needed.
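To see why normalizing first makes the threshold independent of recording level, consider the sketch below. It is an illustration only, not the package's code: the helper names `peak_normalize` and `frame_rms`, the frame length, and the synthetic signal are all assumptions.

```python
import math

def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    return [s / peak for s in samples]

def frame_rms(samples, frame_len):
    """Return one RMS value per non-overlapping frame of frame_len samples."""
    values = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        values.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return values

# A very quiet recording: 100 silent samples, then a low-amplitude tone.
quiet = [0.0] * 100 + [0.0005 * math.sin(2 * math.pi * i / 20) for i in range(100)]

raw_rms = frame_rms(quiet, 50)                   # tone frames stay below 0.001
norm_rms = frame_rms(peak_normalize(quiet), 50)  # tone frames rise well above it

threshold = 0.001
voiced_raw = [r >= threshold for r in raw_rms]
voiced_norm = [r >= threshold for r in norm_rms]
```

Without normalization the quiet tone is classified as silence at the default threshold; after normalization the same frames are classified as voiced, while the silent frames stay at 0.0.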
Requirements
- Python 3.10+
- ffmpeg (system)
- numpy
- whisperx
Development
Setup
macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Windows
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
Running tests
python -m pytest tests/ -v
33 tests covering vtt_io, onset, and cue_builder modules.
Manual test with a local audio file
Place an audio file in audio_input/, then run:
python test_run.py
Output is written to vtt_output/test_package.vtt.
Project structure
src/vtt_synced_voice/
├── __init__.py # exports transcribe()
├── transcriber.py # transcribe() entry point, ffmpeg conversion, WhisperX calls
├── onset.py # find_onset() — bidirectional voice onset detection
├── cue_builder.py # build_cues_from_segments() — WhisperX result → VttCue
└── vtt_io.py # VttCue dataclass, read_vtt(), write_vtt(), format_timestamp()
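As an illustration of what the `vtt_io` and `cue_builder` pieces above are responsible for, the sketch below formats WebVTT timestamps and enforces the 100 ms minimum gap between cues. All names here (`CueSketch`, `format_timestamp_sketch`, `enforce_min_gap`, `render_vtt`) are hypothetical and the real signatures may differ. Trimming the earlier cue's end, rather than shifting the later cue's start, is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class CueSketch:
    start: float  # seconds
    end: float    # seconds
    text: str

def format_timestamp_sketch(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def enforce_min_gap(cues, min_gap=0.1):
    """Trim each cue's end so the next cue starts at least min_gap later."""
    fixed = []
    for i, cue in enumerate(cues):
        end = cue.end
        if i + 1 < len(cues):
            end = min(end, cues[i + 1].start - min_gap)
        end = max(end, cue.start)  # never let a cue end before it starts
        fixed.append(CueSketch(cue.start, end, cue.text))
    return fixed

def render_vtt(cues) -> str:
    """Render cues as WebVTT text: header, then timing line and payload per cue."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        start = format_timestamp_sketch(cue.start)
        end = format_timestamp_sketch(cue.end)
        lines.append(f"{start} --> {end}")
        lines.append(cue.text)
        lines.append("")
    return "\n".join(lines)
```

For example, `render_vtt(enforce_min_gap(cues))` on two cues that nearly touch would shorten the first cue so that 100 ms of silence separates them in the output.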
File details
Details for the file vtt_synced_voice-0.1.1.tar.gz.
File metadata
- Download URL: vtt_synced_voice-0.1.1.tar.gz
- Upload date:
- Size: 606.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fded1b76b9044af1b38f2bcd385ca1fb10d919a93a34976bef1d5f15ff5cdd31 |
| MD5 | 6be2e01cc7b79466556b961567241543 |
| BLAKE2b-256 | a915956787d71f67780925054c9e13460256b97a4a1bd2afdeec0412037cedf0 |
File details
Details for the file vtt_synced_voice-0.1.1-py3-none-any.whl.
File metadata
- Download URL: vtt_synced_voice-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 42b196037ef26745f9b1cbfe212e289e5a092cf3afdc11ba6fc57a4d7bb3369b |
| MD5 | 3eb29de25e53e2329e853bbf72e3f7a6 |
| BLAKE2b-256 | 261582aaff19e58c61a6d1b61e984f8a2503c7a354365b64fc4b11a48f57ddc4 |