
CaptionAlchemy

A Python package for creating intelligent closed captions with face detection and speaker recognition.

Features

  • Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
  • Speaker Diarization: Identifies different speakers in audio
  • Face Recognition: Links speakers to known faces for character identification
  • Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
  • Voice Activity Detection: Intelligently detects speech vs non-speech segments
  • GPU Acceleration: Automatic CUDA support when available

Installation

pip install captionalchemy

If you have a GPU and want to use hardware acceleration:

pip install captionalchemy[cuda]

Prerequisites

  • Python 3.10+
  • FFmpeg (for video/audio processing)
  • CUDA-capable GPU (optional, but highly recommended for speaker diarization)
  • whisper.cpp (optional, on macOS)

If using whisper.cpp on macOS, follow the installation instructions [here], then clone the whisper.cpp repo into your working directory.
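Before running the pipeline it can help to confirm the FFmpeg prerequisite is actually on your PATH. A generic check (not part of captionalchemy's API) might look like:

```python
import shutil


def check_prerequisites() -> list[str]:
    """Return a list of missing external tools."""
    missing = []
    # FFmpeg is required for extracting audio from video files
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    return missing


if __name__ == "__main__":
    missing = check_prerequisites()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All external prerequisites found.")
```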

Quick Start

  1. Set up environment variables (create .env file):

    HF_AUTH_TOKEN=your_huggingface_token_here
    
  2. Prepare known faces (optional, for speaker identification): Create known_faces.json:

    [
      {
        "name": "Speaker Name",
        "image_path": "path/to/speaker/photo.jpg"
      }
    ]
    
  3. Generate captions:

captionalchemy video.mp4 -f srt -o my_captions

Or in a Python script:

from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",         # this can be a video URL or local file
    character_identification=False,                      # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",        # name of the output file
    caption_output_path="my_captions/output",           # will write output to output.srt (or .vtt/.smi)
    caption_format="srt"
)
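To caption many files, `run_pipeline` can be called in a loop. The helper below is hypothetical (not part of the package) and just derives a per-video output base path:

```python
from pathlib import Path


def caption_output_for(video_path: str, out_dir: str = "captions") -> str:
    """Build an output base path like 'captions/<video stem>' for run_pipeline."""
    return str(Path(out_dir) / Path(video_path).stem)


# Sketch of a batch run (assumes captionalchemy is installed and .env is loaded):
# for video in sorted(Path("videos").glob("*.mp4")):
#     caption.run_pipeline(
#         video_url_or_path=str(video),
#         caption_output_path=caption_output_for(str(video)),
#         caption_format="srt",
#     )
```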

Usage

Basic Usage

# Generate SRT captions from video file
captionalchemy video.mp4

# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id

Command Line Options

captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging

How It Works

  1. Face Embedding: Pre-processes known faces into embeddings
  2. Audio Extraction: Extracts audio from video files
  3. Voice Activity Detection: Identifies speech segments
  4. Speaker Diarization: Separates different speakers
  5. Transcription: Converts speech to text using Whisper
  6. Face Recognition: Matches speakers to known faces (if enabled)
  7. Caption Generation: Creates timestamped captions with speaker names
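The stages above can be sketched as a simple orchestration. Every function below is an illustrative stub (the real package uses FFmpeg, pyannote.audio, Whisper, and insightface internally, and pre-computes face embeddings in stage 1 before this flow starts):

```python
def extract_audio(video_path):
    return {"source": video_path}                          # 2. Audio Extraction


def detect_voice_activity(audio):
    return audio                                           # 3. Voice Activity Detection


def diarize(speech):
    return [{"start": 0.0, "end": 2.5, "speaker_id": 0}]   # 4. Speaker Diarization


def transcribe(turn):
    return "(transcribed text)"                            # 5. Transcription


def identify_speaker(turn, known_faces):
    if known_faces:                                        # 6. Face Recognition
        return known_faces[0]["name"]
    return f"Speaker {turn['speaker_id'] + 1}"


def run_captioning(video_path, known_faces=None):
    """Skeleton of the pipeline; returns timestamped caption dicts."""
    audio = extract_audio(video_path)
    speech = detect_voice_activity(audio)
    captions = []
    for turn in diarize(speech):
        captions.append({
            "start": turn["start"],
            "end": turn["end"],
            "speaker": identify_speaker(turn, known_faces),
            "text": transcribe(turn),
        })
    return captions                                        # 7. Caption Generation
```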

Configuration

Known Faces Setup

Create a known_faces.json file with speaker information:

[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
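A small sanity check on this file before running the pipeline can save a failed run. This is a generic validation sketch, not part of the package:

```python
import json
from pathlib import Path


def validate_known_faces(json_path: str) -> list[str]:
    """Return a list of problems found in a known_faces.json file."""
    entries = json.loads(Path(json_path).read_text())
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        if "name" not in entry:
            problems.append(f"entry {i}: missing 'name'")
        image = entry.get("image_path")
        if not image:
            problems.append(f"entry {i}: missing 'image_path'")
        elif not Path(image).is_file():
            problems.append(f"entry {i}: image not found: {image}")
    return problems
```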

Environment Variables

  • HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
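Since the pyannote models refuse to load without this token, it can be worth checking for it early so the run fails fast. A generic pattern (not a captionalchemy function):

```python
import os


def require_hf_token() -> str:
    """Fetch HF_AUTH_TOKEN or raise a clear error before the pipeline starts."""
    token = os.environ.get("HF_AUTH_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_AUTH_TOKEN is not set; pyannote diarization models require "
            "a Hugging Face token (see the .env setup in Quick Start)."
        )
    return token
```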

Output Examples

SRT Format

1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.

VTT Format

WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
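Note the timestamp difference between the two formats: SRT uses `HH:MM:SS,mmm` with a comma before the milliseconds, while WebVTT uses a dot and may omit the hours field when it is zero. A sketch of both conversions from seconds (illustrative, not the package's internal code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp: MM:SS.mmm, hours only if needed."""
    hh, rest = srt_timestamp(seconds).split(":", 1)
    rest = rest.replace(",", ".")
    return rest if hh == "00" else f"{hh}:{rest}"
```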

Development and Contributing

Setup Development Environment

# Install in development mode
pip install -e ".[dev]"

Running Tests

pytest

Code Quality

# Linting
flake8

# Code formatting
black src/ tests/

Requirements

See requirements.txt for the complete list of dependencies. Key packages include:

  • openai-whisper: Speech transcription
  • pyannote.audio: Speaker diarization
  • opencv-python: Computer vision
  • insightface: Face recognition
  • torch: Deep learning framework

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

Troubleshooting

Common Issues

  • CUDA out of memory: Use CPU-only mode or reduce batch sizes
  • Missing models: Ensure whisper.cpp models are downloaded
  • Face recognition errors: Verify image paths in known_faces.json
  • Audio extraction fails: Check that FFmpeg is installed
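For the CUDA out-of-memory case, one common workaround is hiding the GPU from PyTorch before any model loads. This is a generic environment tweak, not a captionalchemy flag:

```python
import os

# Must be set before torch / pyannote are imported, or it has no effect
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# from captionalchemy import caption
# caption.run_pipeline(...)  # will now run on CPU
```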

Getting Help

  • Check the logs with -v flag for detailed error information
  • Ensure all dependencies are properly installed
  • Verify video file format compatibility
