Skip to main content

Multimodal video understanding for Claude Code — extract frames, transcribe audio, build timelines from any video

Project description

vidclaude

Multimodal video understanding for Claude Code. Extract frames, transcribe audio in 90+ languages, build temporal timelines — all from a single command. No API key needed.

pip install vidclaude
vidclaude video.mp4 --mode standard --verbose

What it does

Drop a video in, get structured evidence out. Claude in your conversation does the thinking.

Video File
  │
  ├─ ffmpeg ──────────► Frames (adaptive, shot-aware sampling)
  ├─ faster-whisper ──► Transcript with timestamps (large-v3, 90+ languages)
  ├─ pytesseract ─────► On-screen text / OCR (optional)
  ├─ scene detection ─► Shot boundaries
  │
  └─► Timeline ──► evidence.md ──► Claude reasons over it

Works with Hindi, English, Spanish, Japanese, Arabic — any of the 90+ languages Whisper supports. Language is auto-detected.

Prerequisites

Requirement Install
Python 3.10+ python.org
ffmpeg Windows: winget install ffmpeg / macOS: brew install ffmpeg / Linux: sudo apt install ffmpeg

Install

pip install vidclaude

That's it. First run downloads the Whisper model (~3GB, one time).

Usage

With Claude Code (recommended)

Set up the skill once:

vidclaude --install-skill

Then in Claude Code, just say:

"analyze the video at C:/Users/me/Videos/meeting.mp4"

"what does the speaker say about the budget in presentation.mp4?"

"when does the logo appear in intro.mov?"

Claude runs the extraction, reads the evidence report + frames, and answers your question. Your Max/Pro plan covers everything — no API key needed.

Follow-up questions about the same video are instant (cached).

From the command line

# Standard analysis — good for most videos
vidclaude video.mp4 --mode standard --verbose

# Quick — fewer frames, faster
vidclaude video.mp4 --mode quick

# Deep — dense frames, full OCR, detailed
vidclaude video.mp4 --mode deep --verbose

# Batch process a folder
vidclaude ./videos/ --verbose

# Skip audio transcription
vidclaude video.mp4 --no-audio

# Force fresh extraction (ignore cache)
vidclaude video.mp4 --no-cache

Output

Every run creates a .vidcache/ directory next to the video:

.vidcache/a3f7b2c1/
  evidence.md        ← Human-readable report (Claude reads this)
  frames/            ← Extracted JPEG frames
  transcript.json    ← Timestamped transcript
  timeline.json      ← Unified event timeline
  meta.json          ← Video metadata
  shots.json         ← Shot boundaries
  ocr.json           ← On-screen text
  summaries.json     ← Scene/chapter summaries

Modes

Mode Frames Whisper model OCR Best for
quick ~20, uniform base skip Short clips, fast overview
standard ~60, shot-aware large-v3 keyframes General use
deep ~150, burst sampling large-v3 all frames Long videos, detailed analysis

Smart frame budget: If a video is too long for the frame limit, FPS is automatically reduced. Shot boundary frames are always prioritized.

CLI Reference

vidclaude [input] [options]

positional:
  input                    Video file or folder

setup:
  --install-skill          Copy SKILL.md to current dir for Claude Code

processing:
  --mode {quick,standard,deep}   Processing mode (default: standard)
  -f, --fps N              Override frames per second
  -m, --max-frames N       Override max frame count
  --no-audio               Skip audio transcription
  --no-ocr                 Skip OCR extraction
  --no-cache               Force re-extraction

output:
  --extract                Print cache path summary
  -o, --output FILE        Write output to file
  --verbose                Show detailed progress
  --batch-summary          Cross-video summary for folders

How it works

Built on a multi-layer video understanding architecture:

  • Layer A — Ingestion: Validates format (MP4, MOV, MKV, WebM, AVI), extracts metadata via ffprobe
  • Layer B — Segmentation: Detects shot boundaries using ffmpeg's scene change filter
  • Layer C — Adaptive Sampling: Content-aware frame selection — more frames at scene transitions, smart frame budgets per mode
  • Layer D — Audio: faster-whisper with large-v3 for multilingual transcription with word-level timestamps and VAD filtering
  • Layer E — OCR: pytesseract text extraction from key frames (optional)
  • Layer G — Timeline: Merges speech, OCR, and scene events into a single time-sorted list
  • Layer I — Memory: Hierarchical summaries for longer videos (scene → chapter → global)
  • Layer J — Evidence Assembly: Generates evidence.md with frame references, transcript, timeline for Claude to reason over

Intent-aware processing

The tool classifies your question and adjusts the pipeline:

Question type Example What happens
Description "What happens in this video?" Balanced extraction
Moment retrieval "When does the person stand up?" Prioritizes transcript + timeline
Temporal ordering "Does X happen before Y?" Prioritizes timeline events
Counting "How many cars appear?" Denser frame sampling
OCR / text "What text is on the slide?" Prioritizes OCR extraction
Speech "What did they say about revenue?" Prioritizes transcript

Optional extras

# OCR support (also needs Tesseract binary installed)
pip install pytesseract

# Standalone API mode (use outside Claude Code)
pip install anthropic
export ANTHROPIC_API_KEY=sk-...
vidclaude video.mp4 --api -q "What happens in this video?"

Examples

Analyze a meeting recording:

vidclaude meeting.mp4 --mode standard --verbose
# → 55 frames, full transcript, timeline
# → In Claude Code: "summarize the key decisions from this meeting"

Review security footage:

vidclaude ./cameras/ --mode deep --verbose
# → Batch processes all videos in the folder
# → Dense frame extraction catches fast events

Extract text from a lecture:

pip install pytesseract
vidclaude lecture.mp4 --mode standard --verbose
# → Captures slide text via OCR + speaker transcript

Quick check on a short clip:

vidclaude clip.mp4 --mode quick
# → 20 frames, fast whisper, done in seconds

Troubleshooting

Problem Fix
ffmpeg not found Install: winget install ffmpeg (Windows) / brew install ffmpeg (Mac)
No module named 'faster_whisper' pip install faster-whisper
Slow first run Normal — downloading Whisper large-v3 model (~3GB, one time)
Wrong language detected Whisper auto-detects; usually correct. For edge cases, the transcript still captures phonetics
Large .vidcache folder Delete it to free space: rm -rf .vidcache/
Want fresh extraction Use --no-cache flag
OCR not working Install pytesseract + Tesseract binary

Project structure

vidclaude/
  cli.py          Argument parsing, orchestration, caching
  models.py       Data model (VideoMeta, Frame, Shot, TranscriptChunk, etc.)
  ingest.py       Video validation + ffprobe metadata
  segment.py      Shot detection + adaptive frame sampling
  audio.py        faster-whisper transcription (large-v3)
  ocr.py          pytesseract text extraction
  intent.py       Question intent classification
  timeline.py     Temporal event merging
  memory.py       Hierarchical summaries
  reason.py       Evidence assembly + optional API mode
  util.py         ffmpeg helpers, image encoding, caching
  SKILL.md        Claude Code skill definition (bundled)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vidclaude-0.2.2.tar.gz (27.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vidclaude-0.2.2-py3-none-any.whl (29.5 kB view details)

Uploaded Python 3

File details

Details for the file vidclaude-0.2.2.tar.gz.

File metadata

  • Download URL: vidclaude-0.2.2.tar.gz
  • Upload date:
  • Size: 27.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for vidclaude-0.2.2.tar.gz
Algorithm Hash digest
SHA256 0252b1c5377516eac0b4499f26bf84d024ae31fbf639ab0e972ce92e68ffd590
MD5 de440087523ffdb2cbb96e657fc61f96
BLAKE2b-256 d8b0c26ccd4a3bab3b7201881e48ec10198adb2b561046477fc897e280e5bcc7

See more details on using hashes here.

File details

Details for the file vidclaude-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: vidclaude-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 29.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for vidclaude-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4945c5f08e64f28b85a354ba62b49cdf05efa9ef8ef454f2726983a4c8e232a8
MD5 133a026cf0682fb55c2717661028abe4
BLAKE2b-256 23278c5221f4bc84368032bcd72f9eda609c03b96528fe5b9fb5ea017dbcb524

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page