Multimodal video understanding for Claude Code — extract frames, transcribe audio, build timelines from any video
vidclaude — Multimodal Video Understanding
A Python CLI tool that extracts structured evidence from videos (frames, audio transcript, OCR, temporal timeline) for analysis by Claude.
Quick Start
Option 1: npm (easiest)
npm install -g vidclaude
vidclaude your_video.mp4 --extract --mode standard --verbose
Option 2: pip
pip install vidclaude
vidclaude your_video.mp4 --extract --mode standard --verbose
Option 3: From source
git clone <repo-url> && cd claudevid
setup.bat # Windows
bash setup.sh # macOS / Linux
python video_understand.py your_video.mp4 --extract --mode standard --verbose
No API key needed. If you have a Claude Max/Pro plan, the tool works entirely through Claude Code — Claude in your conversation does the reasoning.
How It Works
Video File → ffmpeg extraction → Frames + Audio + Metadata
↓
faster-whisper large-v3 → Transcript (Hindi, English, 90+ languages)
pytesseract → OCR text (optional)
Shot detection → Scene boundaries
↓
Timeline builder → Unified event list
↓
evidence.md + cached frames → Claude reasons over it
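As a rough illustration of the first stage in the diagram above, the following sketch builds ffmpeg command lines for frame and audio extraction. The function names, the output file pattern, and the specific sampling rate and audio settings are assumptions made for this example; the commands vidclaude actually runs are not documented here.

```python
from pathlib import Path


def frame_extract_cmd(video: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that samples JPEG frames at a fixed rate.

    Hypothetical helper: the output pattern and quality value are
    illustrative choices, not vidclaude's real settings.
    """
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    return [
        "ffmpeg", "-hide_banner", "-y",
        "-i", video,
        "-vf", f"fps={fps}",   # keep `fps` frames per second of video
        "-q:v", "2",           # high JPEG quality
        pattern,
    ]


def audio_extract_cmd(video: str, out_wav: str) -> list[str]:
    """Build an ffmpeg command that writes a mono 16 kHz WAV track."""
    return [
        "ffmpeg", "-hide_banner", "-y",
        "-i", video,
        "-vn",            # drop the video stream
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate
        out_wav,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.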
Prerequisites
| Requirement | How to install |
|---|---|
| Python 3.10+ | python.org |
| ffmpeg | Windows: winget install ffmpeg / macOS: brew install ffmpeg / Linux: sudo apt install ffmpeg |
Installation
Option A: One-line setup (recommended)
# Windows
setup.bat
# macOS / Linux
bash setup.sh
Option B: Manual
pip install -r requirements.txt
Optional extras
pip install pytesseract # OCR (also needs Tesseract binary)
pip install anthropic # Only for standalone --api mode
Usage
Inside Claude Code (recommended)
- Copy SKILL.md into your project (or keep it here)
- Ask Claude: "analyze the video at D:/path/to/video.mp4"
- Claude runs the extraction, reads the evidence, and answers
- Ask follow-up questions — the cache is reused instantly
From the command line
# Standard analysis (recommended for most videos)
python video_understand.py video.mp4 --extract --mode standard --verbose
# Quick analysis (fast, fewer frames)
python video_understand.py video.mp4 --extract --mode quick
# Deep analysis (dense frames, full OCR)
python video_understand.py video.mp4 --extract --mode deep --verbose
# Process a folder of videos
python video_understand.py ./videos/ --extract --verbose
# Skip audio / OCR
python video_understand.py video.mp4 --extract --no-audio --no-ocr
# Force fresh extraction (ignore cache)
python video_understand.py video.mp4 --extract --no-cache --verbose
Processing Modes
| Mode | Frames | Audio model | OCR | Best for |
|---|---|---|---|---|
| `quick` | ~20, uniform sampling | whisper base | skip | Fast overview, short clips |
| `standard` | ~60, shot-aware | whisper large-v3 | keyframes | General analysis |
| `deep` | ~150, burst sampling | whisper large-v3 | all frames | Detailed review, long videos |
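One way to picture the modes is as a small table of presets plus a sampling function. The sketch below is an assumption-laden illustration: `ModePreset`, `PRESETS`, and `uniform_timestamps` are hypothetical names, and the frame counts are the approximate figures from the table, not values read from the tool's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePreset:
    max_frames: int      # approximate frame budget from the table
    sampling: str        # sampling strategy label
    whisper_model: str   # faster-whisper model size
    ocr: str             # which frames get OCR


# Hypothetical presets mirroring the Processing Modes table.
PRESETS = {
    "quick": ModePreset(20, "uniform", "base", "skip"),
    "standard": ModePreset(60, "shot-aware", "large-v3", "keyframes"),
    "deep": ModePreset(150, "burst", "large-v3", "all"),
}


def uniform_timestamps(duration_s: float, n: int) -> list[float]:
    """Return n timestamps at the midpoints of n equal slices of the video."""
    if n <= 0:
        raise ValueError("n must be positive")
    step = duration_s / n
    return [round((i + 0.5) * step, 3) for i in range(n)]
```

For a 100-second clip and four frames, `uniform_timestamps(100.0, 4)` yields `[12.5, 37.5, 62.5, 87.5]`. Shot-aware and burst sampling would need shot boundaries as input and are not sketched here.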
Caching
First run extracts everything to .vidcache/<hash>/:
.vidcache/a3f7b2c1/
meta.json # Video metadata
frames/ # Extracted JPEG frames
transcript.json # Timestamped transcript
ocr.json # OCR results
timeline.json # Merged timeline
evidence.md # Human-readable report
Follow-up questions reuse the cache — no re-extraction needed.
Delete .vidcache/ to free disk space.
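The README does not say how the `<hash>` directory name is derived. A minimal sketch of one plausible scheme, assuming the key is a short SHA-256 digest of the file's resolved path, size, and modification time (`cache_key` and `cache_dir` are hypothetical names):

```python
import hashlib
from pathlib import Path


def cache_key(video: Path, length: int = 8) -> str:
    """Derive a short hex key from path, size, and mtime.

    Assumption: the real tool may hash file contents or something else.
    """
    st = video.stat()
    ident = f"{video.resolve()}|{st.st_size}|{st.st_mtime_ns}"
    return hashlib.sha256(ident.encode("utf-8")).hexdigest()[:length]


def cache_dir(video: Path, root: Path = Path(".vidcache")) -> Path:
    """Return the cache directory for a video, e.g. .vidcache/<key>/."""
    return root / cache_key(video)
```

Hashing metadata rather than contents keeps the lookup fast for large files, at the cost of a cache miss whenever the file is moved or touched.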
CLI Reference
| Flag | Default | Description |
|---|---|---|
| `input` | required | Video file or folder path |
| `--extract` | - | Extract only (skill mode, no API key) |
| `-q "..."` | none | Question (for --api mode) |
| `--mode` | standard | quick / standard / deep |
| `-f N` | auto | FPS override |
| `-m N` | auto | Max frames override |
| `--no-audio` | - | Skip transcription |
| `--no-ocr` | - | Skip OCR |
| `--no-cache` | - | Force re-extraction |
| `--verbose` | - | Detailed progress |
| `-o file` | stdout | Output file |
| `--batch-summary` | - | Cross-video summary for folders |
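For readers who want to see how these flags fit together, here is a sketch of an `argparse` parser that mirrors the table. It is not the contents of `cli.py`: the `dest` names (`question`, `fps`, `max_frames`, `output`) and the value types are assumptions, and only flags listed in the table are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI Reference table."""
    p = argparse.ArgumentParser(prog="vidclaude")
    p.add_argument("input", help="Video file or folder path")
    p.add_argument("--extract", action="store_true", help="Extract only")
    p.add_argument("-q", dest="question", default=None, help="Question")
    p.add_argument("--mode", choices=["quick", "standard", "deep"],
                   default="standard")
    # None stands in for the table's "auto" default.
    p.add_argument("-f", dest="fps", type=float, default=None)
    p.add_argument("-m", dest="max_frames", type=int, default=None)
    p.add_argument("--no-audio", action="store_true")
    p.add_argument("--no-ocr", action="store_true")
    p.add_argument("--no-cache", action="store_true")
    p.add_argument("--verbose", action="store_true")
    p.add_argument("-o", dest="output", default=None, help="Output file")
    p.add_argument("--batch-summary", action="store_true")
    return p
```

Parsing `["video.mp4", "--extract", "--mode", "quick", "--no-ocr"]` with this parser gives `mode == "quick"`, `no_ocr == True`, and leaves `fps` and `max_frames` as `None`.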
Project Structure
video_understand.py # CLI entry point
SKILL.md # Claude Code skill definition
setup.bat / setup.sh # One-click setup scripts
requirements.txt # Python dependencies
vidclaude/
cli.py # Argument parsing, orchestration
models.py # Data model (VideoMeta, Frame, Shot, etc.)
ingest.py # Layer A: Video validation + metadata
segment.py # Layer B+C: Shot detection + adaptive sampling
audio.py # Layer D: faster-whisper transcription
ocr.py # Layer E: Text extraction from frames
intent.py # Intent classification (adjusts pipeline)
timeline.py # Layer G: Temporal event merging
memory.py # Layer I: Hierarchical summaries
reason.py # Layer J: Evidence assembly
util.py # Shared helpers
Architecture
Based on claude_video_understanding_architecture.md, this tool implements a multi-layer video understanding pipeline:
- Layer A (Ingestion): Format validation, ffprobe metadata
- Layer B (Segmentation): Shot boundary detection via scene filter
- Layer C (Adaptive Sampling): Content-aware frame selection
- Layer D (Audio): faster-whisper large-v3 ASR with timestamps (90+ languages)
- Layer E (OCR): Text extraction from key frames
- Layer G (Timeline): Unified temporal event list
- Layer I (Memory): Hierarchical summaries for long videos
- Layer J (Reasoning): Evidence assembly for Claude
Claude serves as the reasoning brain — the tool provides structured, time-grounded evidence for Claude to analyze.
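To make the timeline layer concrete, the sketch below merges shot boundaries, transcript segments, and OCR hits into one time-ordered list. The `Event` type, the `build_timeline` function, and the tie-breaking order are all assumptions for illustration; the data model in `models.py` and the logic in `timeline.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float    # seconds from the start of the video
    kind: str   # "shot", "speech", or "ocr"
    text: str


# At equal timestamps, list the shot boundary first (arbitrary choice).
_PRIORITY = {"shot": 0, "speech": 1, "ocr": 2}


def build_timeline(shots, transcript, ocr):
    """Merge three evidence streams into one list sorted by time.

    shots: list of boundary times; transcript and ocr: lists of (time, text).
    """
    events = [Event(t, "shot", f"shot {i} starts")
              for i, t in enumerate(shots, 1)]
    events += [Event(t, "speech", text) for t, text in transcript]
    events += [Event(t, "ocr", text) for t, text in ocr]
    return sorted(events, key=lambda e: (e.t, _PRIORITY[e.kind]))


def fmt_time(t: float) -> str:
    """Format seconds as MM:SS for a human-readable report."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"
```

Each event could then be rendered as a line such as `[00:12] ocr: TITLE` when assembling a report like `evidence.md`.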
Language Support
Transcription uses faster-whisper with the large-v3 model, which supports 90+ languages, including Hindi, English, Spanish, French, German, Chinese, Japanese, and Arabic. The language is auto-detected and reported with a confidence score.
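A minimal sketch of transcription with language detection, using the public faster-whisper API (`WhisperModel`, `transcribe`, `info.language`, `info.language_probability`). The wrapper functions and the record layout are hypothetical and are not taken from `audio.py`; the library is imported lazily so the helper can be used without it installed.

```python
def segments_to_records(segments):
    """Convert segment objects (with start, end, text) to plain dicts."""
    return [
        {"start": round(s.start, 2), "end": round(s.end, 2),
         "text": s.text.strip()}
        for s in segments
    ]


def transcribe(path: str, model_size: str = "large-v3") -> dict:
    """Transcribe an audio file and report the detected language.

    Assumes faster-whisper is installed; the model is downloaded on first use.
    """
    from faster_whisper import WhisperModel  # lazy import

    model = WhisperModel(model_size)
    segments, info = model.transcribe(path)  # segments is a lazy generator
    return {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": segments_to_records(segments),
    }
```

Calling `transcribe("audio.wav")` returns the detected language code, its probability, and a list of timestamped segments.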
File details
Details for the file vidclaude-0.2.0.tar.gz.
File metadata
- Download URL: vidclaude-0.2.0.tar.gz
- Upload date:
- Size: 26.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c5618b92c7d3db88ece7da0cff202ae6d147d32454a9e673fe019c434f9b2a13 |
| MD5 | c95695a8260bfdf2926395024988af52 |
| BLAKE2b-256 | 64216fab2d9139c27f6c92d7fd14237d64b748d8080f635555b46b821024f8c5 |
File details
Details for the file vidclaude-0.2.0-py3-none-any.whl.
File metadata
- Download URL: vidclaude-0.2.0-py3-none-any.whl
- Upload date:
- Size: 28.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52085c710c71cca79b37300b96c2d68c12a0a513cd912b128ec19a8715fc27d2 |
| MD5 | 3e39784bcb3da951754b82b15dda3d48 |
| BLAKE2b-256 | 28b1b5c18692ae1d300c05f2db259454a61c164d4a014d98cc1d5bb643fd2f84 |