CaptionAlchemy
A Python package for creating intelligent closed captions with face detection and speaker recognition.
Features
- Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
- Speaker Diarization: Identifies different speakers in audio
- Face Recognition: Links speakers to known faces for character identification
- Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
- Voice Activity Detection: Intelligently detects speech vs non-speech segments
- GPU Acceleration: Automatic CUDA support when available
Installation
pip install captionalchemy
If you have a GPU and want to use hardware acceleration:
pip install captionalchemy[cuda]
Prerequisites
- Python 3.10+
- FFmpeg (for video/audio processing)
- CUDA-capable GPU (optional; highly recommended for diarization)
- whisper.cpp (optional, on macOS)
If using whisper.cpp on macOS, follow the installation instructions [here] and clone the whisper.cpp repo into your working directory.
Quick Start
- Set up environment variables (create a .env file):
HF_AUTH_TOKEN=your_huggingface_token_here
- Prepare known faces (optional, for speaker identification) by creating known_faces.json:
[
  { "name": "Speaker Name", "image_path": "path/to/speaker/photo.jpg" }
]
- Generate captions:
captionalchemy video.mp4 -f srt -o my_captions
Or from a Python script:
from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",  # video URL or local file path
    character_identification=False,  # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",  # name of the output file
    caption_output_path="my_captions/output",  # writes output.srt (or .vtt/.smi)
    caption_format="srt",
)
Usage
Basic Usage
# Generate SRT captions from video file
captionalchemy video.mp4
# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output
# Disable face recognition
captionalchemy video.mp4 --no-face-id
Command Line Options
captionalchemy VIDEO [OPTIONS]
Arguments:
VIDEO Video file path or URL
Options:
-f, --format Caption format: srt, vtt, smi (default: srt)
-o, --output Output file base name (default: output_captions)
--no-face-id Disable face recognition
--known-faces-json Path to known faces JSON (default: example/known_faces.json)
--embed-faces-json Path to face embeddings JSON (default: example/embed_faces.json)
-v, --verbose Enable debug logging
How It Works
- Face Embedding: Pre-processes known faces into embeddings
- Audio Extraction: Extracts audio from video files
- Voice Activity Detection: Identifies speech segments
- Speaker Diarization: Separates different speakers
- Transcription: Converts speech to text using Whisper
- Face Recognition: Matches speakers to known faces (if enabled)
- Caption Generation: Creates timestamped captions with speaker names
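The last two stages — matching diarization labels to names and emitting labeled cues — can be sketched roughly as below. Segment and label_cues are hypothetical names for illustration only, not part of the captionalchemy API:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # diarization label, e.g. "SPEAKER_00"
    text: str      # Whisper transcript for this span

def label_cues(segments, names):
    """Replace diarization labels with known names where a face matched."""
    cues = []
    for seg in segments:
        name = names.get(seg.speaker, seg.speaker)  # fall back to the raw label
        cues.append((seg.start, seg.end, f"{name}: {seg.text}"))
    return cues

segments = [
    Segment(3.254, 6.890, "SPEAKER_00", "Welcome to our presentation."),
    Segment(7.120, 10.456, "SPEAKER_01", "Thanks John."),
]
names = {"SPEAKER_00": "John Doe"}  # SPEAKER_01 was not matched to a known face
print(label_cues(segments, names))
```

Unmatched speakers keep their diarization label, which is why providing good reference photos improves the readability of the final captions.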
Configuration
Known Faces Setup
Create a known_faces.json file with speaker information:
[
{
"name": "John Doe",
"image_path": "photos/john_doe.jpg"
},
{
"name": "Jane Smith",
"image_path": "photos/jane_smith.png"
}
]
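A bad image path in this file only surfaces partway through a run, so it can help to sanity-check it first. A minimal sketch of such a check (a convenience helper, not part of the package):

```python
import json
from pathlib import Path

def check_known_faces(path):
    """Parse known_faces.json and raise if any referenced photo is missing."""
    entries = json.loads(Path(path).read_text())
    for entry in entries:
        if not Path(entry["image_path"]).is_file():
            raise FileNotFoundError(f"Missing photo: {entry['image_path']}")
    return entries
```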
Environment Variables
HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
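Because the pyannote models are gated behind this token, failing early with a clear message beats a mid-pipeline download error. A small sketch using a plain environment lookup (require_hf_token is a hypothetical helper, not part of the package; load_dotenv() from python-dotenv would populate the environment from .env first):

```python
import os

def require_hf_token(env=os.environ):
    """Fail early if the Hugging Face token needed by pyannote is missing."""
    token = env.get("HF_AUTH_TOKEN")
    if not token:
        raise RuntimeError("HF_AUTH_TOKEN is not set; see the .env setup above.")
    return token
```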
Output Examples
SRT Format
1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.
2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.
VTT Format
WEBVTT
00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.
00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
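As the examples show, the two formats differ mainly in timestamp style: SRT uses HH:MM:SS,mmm with a comma before the milliseconds, while WebVTT uses a dot and may omit a zero hours field. A minimal sketch of both conversions from seconds (illustrative helpers, not the package's internal functions):

```python
def srt_timestamp(seconds):
    """Seconds -> 'HH:MM:SS,mmm' (SRT separates milliseconds with a comma)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds):
    """Seconds -> 'MM:SS.mmm' (WebVTT uses a dot; hours optional when zero)."""
    return srt_timestamp(seconds).replace(",", ".").removeprefix("00:")

print(srt_timestamp(3.254))  # 00:00:03,254
print(vtt_timestamp(3.254))  # 00:03.254
```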
Development and Contributing
Setup Development Environment
# Install in development mode
pip install -e ".[dev]"
Running Tests
pytest
Code Quality
# Linting
flake8
# Code formatting
black src/ tests/
Requirements
See requirements.txt for the complete list of dependencies. Key packages include:
- openai-whisper: Speech transcription
- pyannote.audio: Speaker diarization
- opencv-python: Computer vision
- insightface: Face recognition
- torch: Deep learning framework
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
Troubleshooting
Common Issues
- CUDA out of memory: Use CPU-only mode or reduce batch sizes
- Missing models: Ensure whisper.cpp models are downloaded
- Face recognition errors: Verify image paths in known_faces.json
- Audio extraction fails: Check that FFmpeg is installed
Getting Help
- Check the logs with the -v flag for detailed error information
- Ensure all dependencies are properly installed
- Verify video file format compatibility