
CaptionAlchemy

A Python package for creating intelligent closed captions with face detection and speaker recognition.

Features

  • Audio Transcription: Powered by OpenAI Whisper for high-quality speech-to-text
  • Speaker Diarization: Identifies different speakers in audio
  • Face Recognition: Links speakers to known faces for character identification
  • Multiple Output Formats: Supports SRT, VTT, and SAMI caption formats
  • Voice Activity Detection: Intelligently detects speech vs non-speech segments
  • GPU Acceleration: Automatic CUDA support when available

Installation

pip install captionalchemy

If you have a GPU and want to use hardware acceleration:

pip install captionalchemy[cuda]

Prerequisites

  • Python 3.10+
  • FFmpeg (for video/audio processing)
  • CUDA-capable GPU (optional, but highly recommended for speaker diarization)
  • whisper.cpp (optional, on macOS)

If using whisper.cpp on macOS, follow the installation instructions [here], then clone the whisper.cpp repo into your working directory.
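Before running the pipeline it can help to confirm the FFmpeg prerequisite is actually on your PATH. A generic check (not part of captionalchemy's API) might look like:

```python
import shutil


def check_prerequisites() -> list[str]:
    """Return a list of missing external tools."""
    missing = []
    # FFmpeg is required for extracting audio from video files
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    return missing


if __name__ == "__main__":
    missing = check_prerequisites()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All external prerequisites found.")
```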

Quick Start

  1. Set up environment variables (create .env file):

    HF_AUTH_TOKEN=your_huggingface_token_here
    
  2. Prepare known faces (optional, for speaker identification): Create known_faces.json:

    [
      {
        "name": "Speaker Name",
        "image_path": "path/to/speaker/photo.jpg"
      }
    ]
    
  3. Generate captions:

captionalchemy video.mp4 -f srt -o my_captions

Or in a Python script:

from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",         # this can be a video URL or local file
    character_identification=False,                      # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",        # name of the output file
    caption_output_path="my_captions/output",           # will write output to output.srt (or .vtt/.smi)
    caption_format="srt"
)
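To caption many files, `run_pipeline` can be called in a loop. The helper below is hypothetical (not part of the package) and just derives a per-video output base path:

```python
from pathlib import Path


def caption_output_for(video_path: str, out_dir: str = "captions") -> str:
    """Build an output base path like 'captions/<video stem>' for run_pipeline."""
    return str(Path(out_dir) / Path(video_path).stem)


# Sketch of a batch run (assumes captionalchemy is installed and .env is loaded):
# for video in sorted(Path("videos").glob("*.mp4")):
#     caption.run_pipeline(
#         video_url_or_path=str(video),
#         caption_output_path=caption_output_for(str(video)),
#         caption_format="srt",
#     )
```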

Usage

Basic Usage

# Generate SRT captions from video file
captionalchemy video.mp4

# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id

Command Line Options

captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging

How It Works

  1. Face Embedding: Pre-processes known faces into embeddings
  2. Audio Extraction: Extracts audio from video files
  3. Voice Activity Detection: Identifies speech segments
  4. Speaker Diarization: Separates different speakers
  5. Transcription: Converts speech to text using Whisper
  6. Face Recognition: Matches speakers to known faces (if enabled)
  7. Caption Generation: Creates timestamped captions with speaker names
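The stages above can be sketched as a simple orchestration. Every function below is an illustrative stub (the real package uses FFmpeg, pyannote.audio, Whisper, and insightface internally, and pre-computes face embeddings in stage 1 before this flow starts):

```python
def extract_audio(video_path):
    return {"source": video_path}                          # 2. Audio Extraction


def detect_voice_activity(audio):
    return audio                                           # 3. Voice Activity Detection


def diarize(speech):
    return [{"start": 0.0, "end": 2.5, "speaker_id": 0}]   # 4. Speaker Diarization


def transcribe(turn):
    return "(transcribed text)"                            # 5. Transcription


def identify_speaker(turn, known_faces):
    if known_faces:                                        # 6. Face Recognition
        return known_faces[0]["name"]
    return f"Speaker {turn['speaker_id'] + 1}"


def run_captioning(video_path, known_faces=None):
    """Skeleton of the pipeline; returns timestamped caption dicts."""
    audio = extract_audio(video_path)
    speech = detect_voice_activity(audio)
    captions = []
    for turn in diarize(speech):
        captions.append({
            "start": turn["start"],
            "end": turn["end"],
            "speaker": identify_speaker(turn, known_faces),
            "text": transcribe(turn),
        })
    return captions                                        # 7. Caption Generation
```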

Configuration

Known Faces Setup

Create a known_faces.json file with speaker information:

[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
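A small sanity check on this file before running the pipeline can save a failed run. This is a generic validation sketch, not part of the package:

```python
import json
from pathlib import Path


def validate_known_faces(json_path: str) -> list[str]:
    """Return a list of problems found in a known_faces.json file."""
    entries = json.loads(Path(json_path).read_text())
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        if "name" not in entry:
            problems.append(f"entry {i}: missing 'name'")
        image = entry.get("image_path")
        if not image:
            problems.append(f"entry {i}: missing 'image_path'")
        elif not Path(image).is_file():
            problems.append(f"entry {i}: image not found: {image}")
    return problems
```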

Environment Variables

  • HF_AUTH_TOKEN: Hugging Face token for accessing pyannote models
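Since the pyannote models refuse to load without this token, it can be worth checking for it early so the run fails fast. A generic pattern (not a captionalchemy function):

```python
import os


def require_hf_token() -> str:
    """Fetch HF_AUTH_TOKEN or raise a clear error before the pipeline starts."""
    token = os.environ.get("HF_AUTH_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_AUTH_TOKEN is not set; pyannote diarization models require "
            "a Hugging Face token (see the .env setup in Quick Start)."
        )
    return token
```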

Output Examples

SRT Format

1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.

VTT Format

WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
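Note the timestamp difference between the two formats: SRT uses `HH:MM:SS,mmm` with a comma before the milliseconds, while WebVTT uses a dot and may omit the hours field when it is zero. A sketch of both conversions from seconds (illustrative, not the package's internal code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp: MM:SS.mmm, hours only if needed."""
    hh, rest = srt_timestamp(seconds).split(":", 1)
    rest = rest.replace(",", ".")
    return rest if hh == "00" else f"{hh}:{rest}"
```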

Development and Contributing

Setup Development Environment

# Install in development mode
pip install -e ".[dev]"

Running Tests

pytest

Code Quality

# Linting
flake8

# Code formatting
black src/ tests/

Requirements

See requirements.txt for the complete list of dependencies. Key packages include:

  • openai-whisper: Speech transcription
  • pyannote.audio: Speaker diarization
  • opencv-python: Computer vision
  • insightface: Face recognition
  • torch: Deep learning framework

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

Troubleshooting

Common Issues

  • CUDA out of memory: Use CPU-only mode or reduce batch sizes
  • Missing models: Ensure whisper.cpp models are downloaded
  • Face recognition errors: Verify image paths in known_faces.json
  • Audio extraction fails: Check that FFmpeg is installed
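For the CUDA out-of-memory case, one common workaround is hiding the GPU from PyTorch before any model loads. This is a generic environment tweak, not a captionalchemy flag:

```python
import os

# Must be set before torch / pyannote are imported, or it has no effect
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# from captionalchemy import caption
# caption.run_pipeline(...)  # will now run on CPU
```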

Getting Help

  • Check the logs with -v flag for detailed error information
  • Ensure all dependencies are properly installed
  • Verify video file format compatibility
