CaptionAlchemy
A Python package for creating intelligent closed captions with face detection and speaker recognition.
Features
- Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
- Speaker Diarization: Identifies different speakers in audio
- Face Recognition: Links speakers to known faces for character identification
- Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
- Voice Activity Detection: Intelligently detects speech vs non-speech segments
- GPU Acceleration: Automatic CUDA support when available
Installation
pip install captionalchemy

Or, for an editable install from a source checkout:

pip install -e .
Prerequisites
- Python 3.10+
- FFmpeg (for video/audio processing)
- CUDA-capable GPU (optional, for acceleration)
- whisper.cpp support (optional, for acceleration on macOS)
Quick Start
1. Set up environment variables (create a .env file):

   HF_AUTH_TOKEN=your_huggingface_token_here

2. Prepare known faces (optional, for speaker identification) by creating known_faces.json:

   [
     {
       "name": "Speaker Name",
       "image_path": "path/to/speaker/photo.jpg"
     }
   ]

3. Generate captions:

   captionalchemy video.mp4 -f srt -o my_captions
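The same run can be driven from a script by invoking the CLI. A minimal sketch (assumes captionalchemy is installed and on PATH; the call itself is left commented out until a real video.mp4 exists):

```python
import subprocess  # used once you uncomment the call below

# The exact command from the Quick Start, as an argument list.
cmd = ["captionalchemy", "video.mp4", "-f", "srt", "-o", "my_captions"]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # check=True raises if the CLI exits non-zero
```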
Usage
Basic Usage
# Generate SRT captions from video file
captionalchemy video.mp4
# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output
# Disable face recognition
captionalchemy video.mp4 --no-face-id
Command Line Options
captionalchemy VIDEO [OPTIONS]
Arguments:
VIDEO Video file path or URL
Options:
-f, --format Caption format: srt, vtt, smi (default: srt)
-o, --output Output file base name (default: output_captions)
--no-face-id Disable face recognition
--known-faces-json Path to known faces JSON (default: example/known_faces.json)
--embed-faces-json Path to face embeddings JSON (default: example/embed_faces.json)
-v, --verbose Enable debug logging
How It Works
- Face Embedding: Pre-processes known faces into embeddings
- Audio Extraction: Extracts audio from video files
- Voice Activity Detection: Identifies speech segments
- Speaker Diarization: Separates different speakers
- Transcription: Converts speech to text using Whisper
- Face Recognition: Matches speakers to known faces (if enabled)
- Caption Generation: Creates timestamped captions with speaker names
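The final caption-generation step can be sketched in a few lines. This is an illustrative reconstruction, not the package's actual code; the timestamp and cue layout follow the SRT sample shown under Output Examples below:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """Build one numbered SRT cue with the speaker name prefixed to the line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}: {text}\n"

print(srt_cue(1, 3.254, 6.890, "John Doe",
              "Welcome to our presentation on quantum computing."))
```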
Configuration
Known Faces Setup
Create a known_faces.json file with speaker information:
[
{
"name": "John Doe",
"image_path": "photos/john_doe.jpg"
},
{
"name": "Jane Smith",
"image_path": "photos/jane_smith.png"
}
]
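A quick sanity check of this file before a long run can save time, since face-recognition errors are often just bad image paths (see Troubleshooting). A minimal sketch, assuming only the two fields shown above:

```python
import json
from pathlib import Path

def check_known_faces(path: str) -> list[str]:
    """Return a list of problems found in a known_faces.json file."""
    problems = []
    entries = json.loads(Path(path).read_text())
    for i, entry in enumerate(entries):
        if "name" not in entry or "image_path" not in entry:
            problems.append(f"entry {i}: missing 'name' or 'image_path'")
        elif not Path(entry["image_path"]).is_file():
            problems.append(f"entry {i}: image not found: {entry['image_path']}")
    return problems
```

Run it against your known_faces.json and fix anything it reports before invoking the CLI.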
Environment Variables
HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
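The CLI expects HF_AUTH_TOKEN in the environment. Projects often use python-dotenv to load a .env file; a dependency-free sketch (assuming simple KEY=value lines, no quoting rules) looks like:

```python
import os

def load_env(path: str = ".env") -> None:
    """Read simple KEY=value lines into os.environ, skipping blanks and comments.

    Existing environment variables are not overwritten (setdefault).
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```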
Output Examples
SRT Format
1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.
2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.
VTT Format
WEBVTT
00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.
00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
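The two samples differ only in timestamp syntax: SRT uses a comma before the milliseconds and always writes the hour field, while the VTT form shown here uses a period and drops the hour when it is zero (WebVTT also permits the full HH:MM:SS.mmm form). A small illustrative converter between the two notations:

```python
def srt_to_vtt_timestamp(ts: str) -> str:
    """Convert an SRT timestamp (HH:MM:SS,mmm) to the short VTT form (MM:SS.mmm)."""
    hms, ms = ts.split(",")
    h, m, s = hms.split(":")
    base = f"{m}:{s}.{ms}"
    return f"{h}:{base}" if int(h) else base  # keep the hour field only when nonzero

print(srt_to_vtt_timestamp("00:00:03,254"))  # → 00:03.254
```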
Development
Setup Development Environment
# Install in development mode
pip install -e .
# Install development dependencies
pip install -r requirements-dev.txt
Running Tests
pytest
Code Quality
# Linting
flake8
# Type checking
mypy src/
Requirements
See requirements.txt for the complete list of dependencies. Key packages include:
- openai-whisper: Speech transcription
- pyannote.audio: Speaker diarization
- opencv-python: Computer vision
- insightface: Face recognition
- torch: Deep learning framework
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
Troubleshooting
Common Issues
- CUDA out of memory: Use CPU-only mode or reduce batch sizes
- Missing models: Ensure whisper.cpp models are downloaded
- Face recognition errors: Verify image paths in known_faces.json
- Audio extraction fails: Check that FFmpeg is installed
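For the CUDA out-of-memory case, falling back to CPU inference is usually the quickest fix. A hedged sketch of the common torch device-selection pattern (the package's own device handling may differ; pick_device is a hypothetical helper, not part of the CLI):

```python
def pick_device(force_cpu: bool = False) -> str:
    """Choose 'cuda' when torch reports a usable GPU, else fall back to 'cpu'."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # listed in requirements.txt
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(pick_device(force_cpu=True))  # → cpu
```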
Getting Help
- Check the logs with the -v flag for detailed error information
- Ensure all dependencies are properly installed
- Verify video file format compatibility
File details
Details for the file captionalchemy-0.1.0.tar.gz.
File metadata
- Download URL: captionalchemy-0.1.0.tar.gz
- Upload date:
- Size: 40.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d5f66538c652017b021a80a11cba7b0aad345b9e19cdcf6feaaae4b9a600897 |
| MD5 | da354c5e607c879279dcb6dd6e09daa8 |
| BLAKE2b-256 | 9a6b7974cae36be20963c7bd86bd0d6c2028a12fe9bd17481617d0d23450cbb8 |
Provenance

The following attestation bundles were made for captionalchemy-0.1.0.tar.gz:

Publisher: ci.yml on benbatman/captionalchemy

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: captionalchemy-0.1.0.tar.gz
- Subject digest: 9d5f66538c652017b021a80a11cba7b0aad345b9e19cdcf6feaaae4b9a600897
- Sigstore transparency entry: 261678347
- Sigstore integration time:
- Permalink: benbatman/captionalchemy@ef9c8c0a614d9b77863d91a8b2a3d81f8cafab97
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/benbatman
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@ef9c8c0a614d9b77863d91a8b2a3d81f8cafab97
- Trigger Event: push
File details
Details for the file captionalchemy-0.1.0-py3-none-any.whl.
File metadata
- Download URL: captionalchemy-0.1.0-py3-none-any.whl
- Upload date:
- Size: 33.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | dd9f80b0dc0b826823903ec4bee6dd788b4fef4ea7356965df9f9e16798e503b |
| MD5 | e33191e80b58c95e97b6af38eff1c532 |
| BLAKE2b-256 | 1bf8ce053ff6ab103ec2ec85899a930a802b8d2942020a3064613df755b78f17 |
Provenance

The following attestation bundles were made for captionalchemy-0.1.0-py3-none-any.whl:

Publisher: ci.yml on benbatman/captionalchemy

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: captionalchemy-0.1.0-py3-none-any.whl
- Subject digest: dd9f80b0dc0b826823903ec4bee6dd788b4fef4ea7356965df9f9e16798e503b
- Sigstore transparency entry: 261678353
- Sigstore integration time:
- Permalink: benbatman/captionalchemy@ef9c8c0a614d9b77863d91a8b2a3d81f8cafab97
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/benbatman
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@ef9c8c0a614d9b77863d91a8b2a3d81f8cafab97
- Trigger Event: push