vtt-synced-voice

Generate VTT subtitles with timestamps precisely snapped to voice onset using WhisperX forced alignment.

A Japanese version of this README is available here

Features

  • Word-level timestamp alignment via WhisperX (Whisper + wav2vec2 forced alignment)
  • FCP-style peak normalization for recording-level-independent silence detection
  • Bidirectional onset detection: backward scan when CTC start is inside voice, forward scan when in silence
  • Guaranteed silence gap between cues (100ms minimum)
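The last two features can be illustrated with a minimal sketch. It is not the package's implementation: `find_onset_sketch` and `enforce_min_gap` are hypothetical names, the input is assumed to be a list of per-frame RMS values computed after peak normalization, and trimming the earlier cue's end is only one assumed way to guarantee the 100 ms gap.

```python
def find_onset_sketch(rms, start_frame, threshold=0.001):
    """Return the frame index where voice begins, searching from start_frame.

    Hypothetical helper: `rms` is a list of per-frame RMS values.
    """
    if not rms:
        return start_frame
    voiced = [value > threshold for value in rms]
    i = min(max(start_frame, 0), len(voiced) - 1)
    if voiced[i]:
        # The aligned start landed inside voice: scan backward to its beginning.
        while i > 0 and voiced[i - 1]:
            i -= 1
        return i
    # The aligned start landed in silence: scan forward to the next voiced frame.
    while i < len(voiced) and not voiced[i]:
        i += 1
    # If no voice follows, keep the original position.
    return i if i < len(voiced) else start_frame


def enforce_min_gap(cues, min_gap=0.1):
    """Trim each (start, end) cue so the next cue starts at least min_gap later."""
    fixed = []
    for idx, (start, end) in enumerate(cues):
        if idx + 1 < len(cues):
            next_start = cues[idx + 1][0]
            end = min(end, next_start - min_gap)
        # Never let a cue end before it starts.
        fixed.append((start, max(end, start)))
    return fixed


rms = [0.0, 0.0, 0.2, 0.3, 0.4, 0.0, 0.0]
print(find_onset_sketch(rms, 4))  # start inside voice, scans backward
print(find_onset_sketch(rms, 0))  # start in silence, scans forward
print(enforce_min_gap([(0.0, 1.0), (1.05, 2.0)]))
```

Scanning in both directions matters because a forced-alignment start time can fall either slightly after the true onset (inside the voice) or slightly before it (in the preceding silence).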

Installation

1. Install ffmpeg

macOS

brew install ffmpeg

Windows

winget install ffmpeg

Linux (Debian / Ubuntu)

sudo apt install ffmpeg

2. Install PyTorch

WhisperX runs on PyTorch. Transcription on a CUDA GPU is roughly 10–20x faster than on CPU. Install the build that matches your environment.

macOS — CPU only (no CUDA support on macOS)

pip install torch torchaudio

Windows — CUDA 12.8 (recommended for RTX 30xx / 40xx and newer)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

Windows — CUDA 11.8 (for older GPUs such as GTX 10xx / 20xx)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Windows — CPU only (no NVIDIA GPU)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

Linux — CUDA 12.8 (recommended for RTX 30xx / 40xx and newer)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

Linux — CUDA 11.8 (for older GPUs such as GTX 10xx / 20xx)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Linux — CPU only (no NVIDIA GPU)

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

To check which CUDA version your driver supports, run nvidia-smi (Windows/Linux). If you don't have an NVIDIA GPU, use the CPU build. For the full list of builds, see the PyTorch installation guide.
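Once PyTorch is installed, you can confirm from Python whether the CUDA build actually sees your GPU. The helper below is a small sketch; `pick_device` is a hypothetical name, not part of this package. It falls back to "cpu" when PyTorch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string matches the values accepted by the `device` argument shown in the Usage section.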

3. Install vtt-synced-voice

pip install vtt-synced-voice

Usage

from vtt_synced_voice import transcribe

transcribe(
    audio_file="sample.m4a",
    output_file="output.vtt",
    language="ja",           # "ja" / "en" / etc.
    model="large-v2",        # "small" / "medium" / "large-v2"
    device="cpu",            # "cpu" / "cuda"
    margin_before=0.066,     # seconds to shift start earlier after onset detection
    margin_after=0.0,        # seconds to extend end
    silence_threshold=0.001, # RMS threshold after peak normalization
    verbose=True,
)
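To process several recordings in one run, the call above can be wrapped in a loop. This is a sketch under assumptions: `plan_jobs`, `transcribe_all`, and `AUDIO_SUFFIXES` are hypothetical names, and the list of accepted file extensions is a guess (the package converts input through ffmpeg, but the supported formats are not listed here). Only the `transcribe()` arguments shown above are used.

```python
from pathlib import Path

# Assumed set of input extensions; adjust to whatever your ffmpeg build can read.
AUDIO_SUFFIXES = {".m4a", ".wav", ".mp3"}


def plan_jobs(audio_paths, output_dir):
    """Pair each audio path with a .vtt output path of the same stem."""
    out_dir = Path(output_dir)
    return [
        (Path(p), out_dir / (Path(p).stem + ".vtt"))
        for p in audio_paths
        if Path(p).suffix.lower() in AUDIO_SUFFIXES
    ]


def transcribe_all(audio_paths, output_dir, language="ja", model="large-v2", device="cpu"):
    """Run transcribe() once per audio file, writing one VTT file each."""
    # Imported lazily so plan_jobs() can be used without the package installed.
    from vtt_synced_voice import transcribe

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_jobs(audio_paths, output_dir):
        transcribe(
            audio_file=str(src),
            output_file=str(dst),
            language=language,
            model=model,
            device=device,
        )
```

For example, `transcribe_all(sorted(Path("audio_input").iterdir()), "vtt_output")` would write one `.vtt` file per recognized audio file.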

silence_threshold

After peak normalization, complete silence is ≈ 0.0 and voiced speech falls in roughly 0.05–1.0. The default of 0.001 works well for clean recordings with no background noise. For noisier recordings, set verbose=True to inspect the onset detection results and raise the threshold if background noise is being detected as voice.
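To see why the threshold does not depend on recording level, here is a minimal numpy sketch of peak normalization followed by per-frame RMS. It is an illustration only: `normalized_frame_rms` is a hypothetical name, and the 10 ms frame length is an assumption rather than the package's actual setting.

```python
import numpy as np


def normalized_frame_rms(samples, sample_rate, frame_sec=0.01):
    """Scale the loudest sample to 1.0, then return RMS per non-overlapping frame."""
    samples = np.asarray(samples, dtype=np.float64)
    peak = np.max(np.abs(samples)) if samples.size else 0.0
    if peak > 0:
        samples = samples / peak
    frame_len = max(1, int(sample_rate * frame_sec))
    n_frames = samples.size // frame_len
    # Drop the trailing partial frame so the array reshapes cleanly.
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))


# One second of digital silence followed by one second of a very quiet tone.
sr = 16000
t = np.arange(sr) / sr
quiet_tone = 0.02 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([np.zeros(sr), quiet_tone])

rms = normalized_frame_rms(signal, sr)
print(rms[:100].max(), rms[100:].min())
```

Although the tone's raw amplitude is only 0.02, its normalized RMS sits far above 0.001, while the silent frames stay at 0.0. Real recordings contain some noise in the silent parts, which is why the threshold may need tuning.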

Requirements

  • Python 3.10+
  • ffmpeg (system)
  • numpy
  • whisperx

Development

Setup

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Windows

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Running tests

python -m pytest tests/ -v

The suite contains 33 tests covering the vtt_io, onset, and cue_builder modules.

Manual test with a local audio file

Place an audio file in audio_input/, then run:

python test_run.py

Output is written to vtt_output/test_package.vtt.

Project structure

src/vtt_synced_voice/
├── __init__.py       # exports transcribe()
├── transcriber.py    # transcribe() entry point, ffmpeg conversion, WhisperX calls
├── onset.py          # find_onset() — bidirectional voice onset detection
├── cue_builder.py    # build_cues_from_segments() — WhisperX result → VttCue
└── vtt_io.py         # VttCue dataclass, read_vtt(), write_vtt(), format_timestamp()
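As a rough picture of what the vtt_io layer deals with, the sketch below renders cues in WebVTT's `HH:MM:SS.mmm --> HH:MM:SS.mmm` form. It is not the package's code: `CueSketch`, `format_timestamp_sketch`, and `render_vtt` are hypothetical stand-ins, and the real `VttCue`, `format_timestamp()`, and `write_vtt()` may differ in fields and signatures.

```python
from dataclasses import dataclass


@dataclass
class CueSketch:
    """Hypothetical cue: start and end in seconds, plus the subtitle text."""
    start: float
    end: float
    text: str


def format_timestamp_sketch(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render_vtt(cues):
    """Return WebVTT file content for a list of cues."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        start = format_timestamp_sketch(cue.start)
        end = format_timestamp_sketch(cue.end)
        lines.append(f"{start} --> {end}")
        lines.append(cue.text)
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


print(render_vtt([CueSketch(0.0, 1.5, "Hello"), CueSketch(1.6, 3.0, "World")]))
```

Converting to integer milliseconds before splitting into fields avoids floating-point rounding producing values such as 59.9995 seconds.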
