CaptionAlchemy

A Python package for creating intelligent closed captions with face detection and speaker recognition.

Features

  • Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
  • Speaker Diarization: Identifies different speakers in audio
  • Face Recognition: Links speakers to known faces for character identification
  • Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
  • Voice Activity Detection: Intelligently detects speech vs non-speech segments
  • GPU Acceleration: Automatic CUDA support when available

Installation

pip install captionalchemy

# Or, from a source checkout:
pip install -e .

Prerequisites

  • Python 3.10+
  • FFmpeg (for video/audio processing)
  • CUDA-capable GPU (optional, for acceleration)
  • whisper.cpp (optional, on macOS)

Quick Start

  1. Set up environment variables (create .env file):

    HF_AUTH_TOKEN=your_huggingface_token_here
    
  2. Prepare known faces (optional, for speaker identification): Create known_faces.json:

    [
      {
        "name": "Speaker Name",
        "image_path": "path/to/speaker/photo.jpg"
      }
    ]
    
  3. Generate captions:

    captionalchemy video.mp4 -f srt -o my_captions
    

Usage

Basic Usage

# Generate SRT captions from video file
captionalchemy video.mp4

# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id

Command Line Options

captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging

How It Works

  1. Face Embedding: Pre-processes known faces into embeddings
  2. Audio Extraction: Extracts audio from video files
  3. Voice Activity Detection: Identifies speech segments
  4. Speaker Diarization: Separates different speakers
  5. Transcription: Converts speech to text using Whisper
  6. Face Recognition: Matches speakers to known faces (if enabled)
  7. Caption Generation: Creates timestamped captions with speaker names
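CaptionAlchemy's internals aren't reproduced here, but the core of steps 4 and 5 (combining speaker diarization with transcription) can be illustrated with a simplified, self-contained sketch: label each transcribed segment with the speaker whose diarization turn overlaps it the most. The names `Segment`, `SpeakerTurn`, and `assign_speakers` are hypothetical, not part of the package's API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class SpeakerTurn:
    start: float
    end: float
    speaker: str


def assign_speakers(segments, turns):
    """Label each transcription segment with the speaker whose
    diarization turn overlaps it for the longest duration."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = turn.speaker, overlap
        labeled.append((best_speaker, seg))
    return labeled
```

A segment that straddles a turn boundary is attributed to whichever speaker covers more of it; segments with no overlapping turn fall back to "Unknown".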

Configuration

Known Faces Setup

Create a known_faces.json file with speaker information:

[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]

Environment Variables

  • HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
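Since the pyannote models refuse to load without this token, it is worth checking for it up front. A small sketch (the helper `get_hf_token` is hypothetical, not part of the package):

```python
import os


def get_hf_token():
    """Read the Hugging Face token from the environment, failing early
    with a clear message rather than deep inside the diarization step."""
    token = os.environ.get("HF_AUTH_TOKEN")
    if not token:
        raise RuntimeError("HF_AUTH_TOKEN is not set; pyannote models require it.")
    return token
```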

Output Examples

SRT Format

1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.

VTT Format

WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
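The two formats differ mainly in timestamp notation: SRT uses `HH:MM:SS,mmm` with a comma, while WebVTT uses a dot and may omit the hours field when it is zero, as in the samples above. A small sketch of that conversion (`format_timestamp` is a hypothetical helper, not CaptionAlchemy's actual code):

```python
def format_timestamp(seconds, fmt="srt"):
    """Render a float second count as an SRT (HH:MM:SS,mmm) or
    WebVTT (MM:SS.mmm, hours omitted when zero) timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    if fmt == "srt":
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    if h:
        return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
    return f"{m:02d}:{s:02d}.{ms:03d}"
```

For example, 3.254 seconds renders as `00:00:03,254` in SRT and `00:03.254` in WebVTT, matching the cues shown above.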

Development

Setup Development Environment

# Install in development mode
pip install -e .

# Install development dependencies
pip install -r requirements-dev.txt

Running Tests

pytest

Code Quality

# Linting
flake8

# Type checking
mypy src/

Requirements

See requirements.txt for the complete list of dependencies. Key packages include:

  • openai-whisper: Speech transcription
  • pyannote.audio: Speaker diarization
  • opencv-python: Computer vision
  • insightface: Face recognition
  • torch: Deep learning framework

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

Troubleshooting

Common Issues

  • CUDA out of memory: Use CPU-only mode or reduce batch sizes
  • Missing models: Ensure whisper.cpp models are downloaded
  • Face recognition errors: Verify image paths in known_faces.json
  • Audio extraction fails: Check that FFmpeg is installed

Getting Help

  • Check the logs with -v flag for detailed error information
  • Ensure all dependencies are properly installed
  • Verify video file format compatibility

