Transcribe video and audio files to text using OpenAI Whisper with optional speaker diarization
vtt-transcribe
Takes a video file, extracts or splits the audio, and transcribes the audio to text
using OpenAI's Whisper model (via the openai Python client).
This repository provides a small CLI tool (vtt) and a set of helper
functions for handling audio extraction, chunking large audio files, and
formatting verbose JSON transcripts into readable timestamped output.
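As an illustration of the chunking helper mentioned above, here is a minimal sketch of how a chunk length could be chosen. It is not taken from the project's source: the function names are hypothetical, the 25 MB limit is read as 25 MiB, and the file is assumed to have a roughly constant bitrate so that size scales with duration.

```python
import math

# Assumption: the "25MB" API limit is interpreted as 25 MiB.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def pick_chunk_seconds(file_bytes: int, duration_s: float,
                       max_bytes: int = MAX_UPLOAD_BYTES) -> float:
    """Return a chunk length (seconds) that keeps each chunk under max_bytes."""
    if file_bytes <= max_bytes:
        return duration_s  # small enough to send in a single request
    # With a constant bitrate, the longest safe chunk is proportional to the limit.
    longest_safe = duration_s * max_bytes / file_bytes
    if longest_safe >= 60:
        # Prefer minute-aligned chunks: round down to a whole number of minutes.
        return float(math.floor(longest_safe / 60) * 60)
    return longest_safe  # shorter than a minute, so alignment is not possible


def chunk_offsets(duration_s: float, chunk_s: float) -> list[float]:
    """Start time of each chunk, in seconds from the start of the file."""
    count = math.ceil(duration_s / chunk_s)
    return [i * chunk_s for i in range(count)]
```

For a one-hour, 70 MiB file this picks 21-minute chunks. Real encoders vary their bitrate, so a production version would probably leave some headroom below the limit.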
Features
- Extract audio from video files (writes .mp3 by default) or transcribe audio directly (.mp3, .wav, .ogg, .m4a)
- Prefer minute-aligned chunk durations for large audio files exceeding the 25 MB API limit
- Transcribe audio via OpenAI's Whisper API with the verbose_json response format
- Speaker diarization using pyannote.audio to identify and label speakers in transcripts
- Format transcripts into human-friendly lines: [HH:MM:SS - HH:MM:SS] text, with optional speaker labels
- Shift chunk-local timestamps into the absolute timeline when chunking
- Keep or delete intermediate audio/chunk files based on flags
- Interactive speaker review to rename/merge speakers after diarization
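The timestamp formatting and shifting features can be pictured with a short sketch. It assumes each segment is a dict with `start`, `end`, and `text` keys (the shape of the segments in a verbose_json response) plus an optional `speaker` key. The function names and the placement of the speaker label are illustrative assumptions, not the project's actual helpers.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def shift_segments(segments: list[dict], offset_s: float) -> list[dict]:
    """Move chunk-local start/end times onto the absolute timeline."""
    return [
        {**seg, "start": seg["start"] + offset_s, "end": seg["end"] + offset_s}
        for seg in segments
    ]


def format_segment(seg: dict) -> str:
    """Build '[HH:MM:SS - HH:MM:SS] text', adding a speaker label if present."""
    line = f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}]"
    speaker = seg.get("speaker")  # hypothetical key set by diarization
    if speaker:
        line += f" {speaker}:"
    return f"{line} {seg['text'].strip()}"
```

A segment that starts 1.5 seconds into a chunk beginning at 21:00 would be shifted by 1260 seconds and printed with a start time of 00:21:01.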
Dependencies
- Python 3.13+
- ffmpeg (required for video/audio processing via moviepy)
- moviepy (audio/video helpers)
- openai (Whisper API client)
- pyannote.audio (speaker diarization, optional - requires [diarization] extra)
- torch (required for pyannote.audio)
- Dev / test: pytest, mypy, ruff, pre-commit, coverage, python-dotenv
Prerequisites
- ffmpeg must be installed on your system for video/audio processing
- Recommended approach: Use the provided .devcontainer, which includes:
  - Pre-configured ffmpeg installation
  - GPU support for diarization (if host has NVIDIA GPU + drivers)
  - All Python dependencies
  - VS Code extensions and settings
- Manual setup: If not using the devcontainer, ensure ffmpeg is installed:
  - Ubuntu/Debian: sudo apt-get install ffmpeg
  - macOS: brew install ffmpeg
  - Windows: Download from https://ffmpeg.org/download.html
Speaker Diarization
- The speaker diarization feature (--diarize) identifies and labels different speakers in audio
- Requirements:
  - Hugging Face token (set the HF_TOKEN environment variable or use the --hf-token flag)
  - User must accept pyannote model access at https://huggingface.co/pyannote/speaker-diarization-3.1
  - Minimum audio duration: ~10 seconds (shorter files may fail)
- GPU Support (Optional):
  - Can leverage CUDA GPUs for faster processing (10-100x speedup)
  - By default, uses --device auto, which automatically detects and uses CUDA if available
  - To explicitly control device selection, use --device cuda or --device cpu
  - The .devcontainer handles prerequisites for GPU support
  - Prerequisites for GPU support:
    - NVIDIA GPU with CUDA support
    - NVIDIA drivers installed on the host system
    - nvidia-container-toolkit installed on the host (for Docker/devcontainer)
  - If GPU is not available or fails, automatically falls back to CPU
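The device selection rules above can be summarised in a few lines. This is a sketch of the described behaviour rather than the tool's own code: `resolve_device` and `_probe_cuda` are hypothetical names, and the only library call used is `torch.cuda.is_available()`.

```python
from typing import Optional


def _probe_cuda() -> bool:
    """Report whether torch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False  # no torch means no GPU path
    return bool(torch.cuda.is_available())


def resolve_device(requested: str = "auto",
                   cuda_available: Optional[bool] = None) -> str:
    """Map a --device value (auto, cuda/gpu, cpu) to 'cuda' or 'cpu'."""
    choice = requested.lower()
    if choice not in {"auto", "cuda", "gpu", "cpu"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if choice == "cpu":
        return "cpu"
    if cuda_available is None:
        cuda_available = _probe_cuda()
    # auto, cuda and gpu all fall back to CPU when CUDA is unavailable.
    return "cuda" if cuda_available else "cpu"
```

Passing `cuda_available` explicitly keeps the function easy to test on machines without a GPU.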
Quick Start
Option 1: Using devcontainer (Recommended)
- Open project in VS Code
- Install "Dev Containers" extension
- Click "Reopen in Container" when prompted (or use Command Palette: "Dev Containers: Reopen in Container")
- The devcontainer includes ffmpeg, GPU support, and all dependencies pre-configured
Option 2: Manual setup
- Ensure ffmpeg is installed on your system (see Prerequisites above)
Installation
From PyPI (Recommended)
# Basic installation (transcription only)
pip install vtt-transcribe
# OR: With diarization support (the quotes stop shells such as zsh from expanding the brackets)
pip install "vtt-transcribe[diarization]"
# Using uv (faster)
uv pip install vtt-transcribe
uv pip install "vtt-transcribe[diarization]"
Note: Installing with the [diarization] extra adds large dependencies such as PyTorch and pyannote.audio, which significantly increases the download and install size of your environment. The actual diarization model weights are typically downloaded at runtime (e.g., via the Hugging Face cache) on first use, so overall disk usage for diarization (dependencies + cached models) can reach several GB. Only install this extra if you need speaker identification features.
From Source
- Ensure ffmpeg is installed on your system (see Prerequisites above)
- Run the installer, which installs uv and creates the project's virtual environment:
# Basic install (transcription only, no diarization)
make install
# OR: Install with diarization support (includes torch + pyannote.audio)
make install-diarization
Upgrading from 0.2.0
Important: Version 0.3.0 introduces optional dependencies for speaker diarization. If you are upgrading from 0.2.0 and want to use diarization features, you need to explicitly install the [diarization] extra. See the CHANGELOG for detailed upgrade instructions.
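After an upgrade, one quick way to see whether the optional diarization dependencies are present is to look for their modules. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes the extra provides the `torch` and `pyannote.audio` modules listed under Dependencies.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or only partly initialised.
        return False


def diarization_extra_installed() -> bool:
    """Check for the modules the [diarization] extra is expected to provide."""
    return all(module_available(m) for m in ("torch", "pyannote.audio"))


if __name__ == "__main__":
    if diarization_extra_installed():
        print("Diarization dependencies found.")
    else:
        print("Diarization dependencies missing; install the [diarization] extra.")
```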
Setup Environment Variables
You can set environment variables in your shell or create a .env file in your project directory:
Option 1: Shell environment
export OPENAI_API_KEY="your-openai-key"
export HF_TOKEN="your-huggingface-token" # Only needed for --diarize
Option 2: .env file (automatically loaded)
# Create a .env file in your project directory
echo 'OPENAI_API_KEY="your-openai-key"' > .env
echo 'HF_TOKEN="your-huggingface-token"' >> .env
# For publishing to PyPI (developers only)
echo 'TWINE_USERNAME=__token__' >> .env
echo 'TESTPYPI_API_TOKEN=your-testpypi-token' >> .env
echo 'PYPI_API_TOKEN=your-pypi-token' >> .env
The tool will automatically load variables from .env if the file exists.
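To make the file format concrete, here is a small standard-library sketch that reads lines like the ones written above and reports which credentials are still missing. It is a stand-in for illustration only: the tool's actual loader is not shown in this README, and both function names are hypothetical.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_credentials(env: dict[str, str], diarize: bool = False) -> list[str]:
    """List required variables that are unset or empty."""
    required = ["OPENAI_API_KEY"] + (["HF_TOKEN"] if diarize else [])
    return [name for name in required if not env.get(name)]
```

In practice the result of `parse_dotenv` would be merged with `os.environ` before checking, since either source can supply the keys.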
Publishing Environment Variables (Developers Only):
- TWINE_USERNAME: Should always be __token__ for PyPI token authentication
- TESTPYPI_API_TOKEN: Your TestPyPI API token
- PYPI_API_TOKEN: Your PyPI API token
- These are only needed if you're building and publishing packages using make build, make publish-test, or make publish
Usage
Command Line
# Basic transcription
vtt path/to/input.mp4
# With speaker diarization
vtt path/to/input.mp4 --diarize
# Direct audio transcription
vtt path/to/audio.mp3 --diarize
# Using uv run (if installed from source)
uv run vtt path/to/input.mp4
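The same commands can be driven from a script, for example to transcribe a folder of recordings. The sketch below only assembles and runs the CLI invocation. It assumes `vtt` is on the PATH and uses the `--diarize`, `--no-review-speakers`, and `--save-transcript` flags from the CLI options. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_vtt_command(input_path: Path, diarize: bool = False,
                      transcript_path: Optional[Path] = None) -> list[str]:
    """Assemble the argument list for one vtt run."""
    cmd = ["vtt", str(input_path)]
    if diarize:
        # Skip the interactive review so a batch job never waits for input.
        cmd += ["--diarize", "--no-review-speakers"]
    if transcript_path is not None:
        cmd += ["--save-transcript", str(transcript_path)]
    return cmd


def transcribe_folder(folder: Path, diarize: bool = False) -> None:
    """Run vtt on every .mp4 in folder, saving a .txt next to each video."""
    for video in sorted(folder.glob("*.mp4")):
        cmd = build_vtt_command(video, diarize, video.with_suffix(".txt"))
        subprocess.run(cmd, check=True)  # raises if vtt exits with an error
```

Building the command as a list avoids shell quoting problems with paths that contain spaces.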
CLI options
Input/Output:
- input_file: positional path to the input video or audio file (.mp4, .mp3, .wav, .ogg, .m4a)
- -k, --api-key: OpenAI API key (or set the OPENAI_API_KEY env var)
- -o, --output-audio: path for extracted audio file (defaults to input name with .mp3; not allowed if input is already audio)
- -s, --save-transcript: path to save the transcript (will ensure .txt extension)
Processing Options:
- -f, --force: re-extract audio even if it already exists
- --delete-audio: delete audio files after transcription (default: keep them)
- --scan-chunks: when input is a chunk file (e.g., audio_chunk0.mp3), detect and process all sibling chunks in order
Diarization Options:
- --diarize: enable speaker diarization (requires HF_TOKEN and model access)
- --hf-token: Hugging Face token for pyannote models (or set the HF_TOKEN env var)
- --device: device for diarization (auto, cuda/gpu, or cpu; default: auto)
- --diarize-only: run diarization on existing audio without transcription
- --apply-diarization PATH: apply diarization to an existing transcript file
- --no-review-speakers: skip interactive speaker review (default: review is enabled)
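The --scan-chunks behaviour listed under Processing Options could be approximated as follows. The naming pattern `<base>_chunk<N>.<ext>` is inferred from the single example audio_chunk0.mp3 and may not match every file the tool writes. `sibling_chunks` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed pattern: a base name, "_chunk", then a numeric index.
CHUNK_NAME = re.compile(r"^(?P<base>.+)_chunk(?P<index>\d+)$")


def sibling_chunks(chunk_path: Path, candidates: list[Path]) -> list[Path]:
    """Return the chunks sharing chunk_path's base name, in numeric order."""
    match = CHUNK_NAME.match(chunk_path.stem)
    if match is None:
        return [chunk_path]  # not a chunk file, so process it alone
    base = match["base"]
    found = []
    for path in candidates:
        m = CHUNK_NAME.match(path.stem)
        if m and m["base"] == base and path.suffix == chunk_path.suffix:
            found.append((int(m["index"]), path))
    # Sort on the integer index so chunk10 follows chunk2, not chunk1.
    return [path for _, path in sorted(found, key=lambda item: item[0])]
```

A caller would typically pass `list(chunk_path.parent.iterdir())` as the candidates. Sorting on the parsed integer matters because a plain string sort would place audio_chunk10.mp3 before audio_chunk2.mp3.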
Makefile targets
- make install: installs uv and basic dependencies (transcription only, no diarization)
- make install-diarization: installs uv and all dependencies including diarization support
- make test: runs the test suite (pytest)
- make test-integration: runs only integration tests
- make ruff-check: runs ruff check .
- make ruff-fix: runs ruff format . (autoformat where supported)
- make mypy: runs mypy . for static typing checks
- make lint: runs both ruff and mypy (alias for ruff-check mypy)
- make format: runs the automatic ruff-format step (ruff format .)
- make clean: remove compiled python artifacts
- make build: build distribution packages
- make publish-test: publish to TestPyPI (requires TESTPYPI_API_TOKEN in environment)
- make publish: publish to PyPI (requires PYPI_API_TOKEN in environment)
Notes on linting and typing
- ruff is configured in ruff.toml. The rule COM812 is disabled to avoid conflicts with formatters. A per-file ignore exists for tests to allow certain private-member accesses used in unit tests.
- Some tests use light mypy # type: ignore[...] annotations to accommodate test doubles and dynamically injected modules.
Testing
- Run the full test suite with make test. The project includes comprehensive unit tests for audio extraction, chunking, timestamp formatting, and the CLI wiring.
- Note: The project has only been tested on Linux (and WSL2).
Continuous Integration
- The repository includes a GitHub Actions workflow (.github/workflows/ci.yml) that runs make install followed by make lint and make test on pushes and pull requests to main. This mirrors the recommended local make install setup.
Acknowledgements
- This project was developed with test-driven iterations and linting guidance.
- Parts of the implementation and assistance during development were produced with help from GitHub Copilot.
Files of interest
- CHANGELOG.md: version history and upgrade instructions
- main.py: CLI entrypoint and VideoTranscriber implementation
- test_main.py: main test suite (integration + unit tests)
- test_audio_management.py: audio/chunk management tests
- Makefile: convenience commands for dev tooling
- ruff.toml: ruff configuration
- .pre-commit-config.yaml: pre-commit hooks for formatting/linting
Contributing
- Please run make format and make lint before submitting a PR.
- Run make test to ensure all tests pass locally.
- See CONTRIBUTING.md for detailed development setup and workflow.
Building and Publishing (For Maintainers)
The project uses Hatch as the build system. Build artifacts can be created and tested locally:
# Install build dependencies
make install-build
# Build distribution packages (creates dist/*.whl and dist/*.tar.gz)
make build
# Test publishing to TestPyPI
make publish-test
# Production publish to PyPI (via GitHub Actions on release)
# Tag a release: git tag v0.3.0b1 && git push origin v0.3.0b1
# Create GitHub release (triggers automated publish workflow)
For complete build and publish workflow documentation, see CONTRIBUTING.md.
License
- See the LICENSE file in the repository root.