Skip to main content

Let Claude (or any LLM) actually watch a video — scene-aware, deduplicated frames + transcript, from a URL or local file.

Project description

claude-real-video

Let Claude — or any LLM — actually watch a video.

Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.

claude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read. All the processing happens on your own machine — what gets sent anywhere is only the frames/text you choose to paste into an LLM afterwards.

crv "https://www.youtube.com/watch?v=..."
# → crv-out/frames/*.jpg  +  crv-out/transcript.txt  +  crv-out/MANIFEST.txt

Then drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.

New in 0.3.0 — tell it why you're watching, and keep what it finds:

crv "https://youtu.be/..." --why "find the pricing strategy" --kb ~/notes

--why makes the analysis focus on what you care about instead of a generic summary; --kb saves the result as a dated note in your own notes folder, so it doesn't die in crv-out.


Why not just sample frames?

Most "let an LLM watch a video" scripts (and Gemini's own pipeline) grab frames at a fixed interval — e.g. one per second. That over-samples a static screencast and under-samples a fast-cut reel. claude-real-video is smarter:

fixed-interval sampling claude-real-video
Frame selection every N seconds scene-change detection + density floor
Repeated shots (A-B-A cuts) sent again every time sliding-window dedup sends each shot once
Static slide (10 min) ~600 near-identical frames collapses to 1 (dedup)
Fast-cut reel misses frames between samples catches each visual change
Audio often ignored Whisper transcript w/ language detect
Where the processing happens often in someone's cloud on your machine (you choose what to share with an LLM afterwards)
Input usually local file only URL (yt-dlp) or local file

You feed the model fewer, more meaningful frames — cheaper context, better understanding.


Install

pip install claude-real-video              # core (frames + dedup)
pip install "claude-real-video[whisper]"   # + audio transcription

System requirement: ffmpeg

ffmpeg / ffprobe are used for frame extraction and audio, and aren't pip-installable. Install them once:

OS command
macOS brew install ffmpeg
Linux sudo apt install ffmpeg (or your distro's package manager)
Windows winget install Gyan.FFmpeg — or choco install ffmpeg — or download a build and add its bin\ folder to your PATH

Verify it's on your PATH:

ffmpeg -version

Transcription uses the whisper CLI (installed by the [whisper] extra, or pip install openai-whisper). Whisper also relies on ffmpeg.

Works on macOS, Windows, and Linux — Python 3.10+.


Usage

# A YouTube / Instagram / TikTok / ... link
crv "https://www.instagram.com/reel/XXXX/"

# A local file, English transcript, output to ./out
crv lecture.mp4 -o out --lang en

# Frames only, no transcription
crv clip.mp4 --no-transcribe

# A login-gated video (your own / authorised use): pass a Netscape cookie file
crv "https://..." --cookies cookies.txt

python -m claude_real_video ... works as an alias for crv too.

Options

flag default meaning
-o, --out crv-out output directory
--scene 0.30 scene-change sensitivity (lower = more frames)
--fps-floor 1.0 at least one frame every N seconds
--max-frames 150 hard cap on total frames
--lang auto Whisper language (en, zh, auto, ...)
--dedup-threshold 8 % of pixels that must change for a frame to count as new; higher = fewer frames
--dedup-window 4 compare against the last N kept frames — a shot the model already saw doesn't come back after a cutaway (1 = consecutive-only)
--report off keep dropped frames in ./dropped + write report.html visualising every keep/drop decision
--no-transcribe off skip audio
--keep-audio off also save the full soundtrack (audio.m4a) so audio models can hear it
--why why you're watching, e.g. --why "find the pricing strategy" — written into MANIFEST.txt so the model analyses with that lens instead of a generic summary
--kb also save the analysis as a dated markdown note into this folder (your Obsidian vault, notes dir, ...) — so it joins your knowledge base instead of dying in crv-out
--cookies Netscape cookie file for login-gated sources

Use it from Python

from claude_real_video import process

r = process("https://youtu.be/...", "out", lang="en")
print(r.frame_count, r.transcript_path)

How it works

  1. Fetchyt-dlp for URLs (optional cookies), or copy a local file.
  2. Extract — one chronological ffmpeg select pass grabs every scene change plus a density floor (at least one frame every --fps-floor seconds), so fast cuts and slow screencasts are both covered.
  3. Dedup — real pixel difference (downscaled RGB, not a perceptual hash — hashes go blind on flat colours and equal-luma hue changes) against a sliding window of the last --dedup-window kept frames, so an A-B-A cutaway doesn't re-send a shot the model has already seen. --report writes report.html showing every keep/drop decision with its diff %, for tuning.
  4. Text — if the video already has subtitles (a sidecar .srt/.vtt next to a local file, or an embedded subtitle track), those are used as the transcript — faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio (skipped cleanly if there's no audio).
  5. Audio (optional, --keep-audio) — save the full original soundtrack (audio.m4a: music + speech + effects, copied losslessly when possible). The transcript only has the words; the audio file lets a model that can listen (Gemini, GPT-4o, …) actually hear the music and tone.
  6. ManifestMANIFEST.txt summarises everything for the model.

So the model can see (key frames), read (transcript) and — with --keep-audiohear (full soundtrack) the video. The transcript is plain text any model can read; the tool doesn't burn subtitles into the video — burning is a presentation choice, not something needed to make a video AI-readable.


Notes

  • Only download content you have the right to. The --cookies option is for your own, authorised access — don't ship credentials in a repo.
  • Re-running overwrites the output directory.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

claude_real_video-0.4.0.tar.gz (16.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

claude_real_video-0.4.0-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file claude_real_video-0.4.0.tar.gz.

File metadata

  • Download URL: claude_real_video-0.4.0.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for claude_real_video-0.4.0.tar.gz
Algorithm Hash digest
SHA256 ccf0184b38a52a2839c4b4e66d3a507fe4d51571a7462c9a22e6dca46b7a7a9c
MD5 ed15c7c24c5fca9af12bd2fc8f900d2e
BLAKE2b-256 359985f4828ed000aa047ea9fc570374d2a20dc3d888f89104968f190780b422

See more details on using hashes here.

File details

Details for the file claude_real_video-0.4.0-py3-none-any.whl.

File metadata

File hashes

Hashes for claude_real_video-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5c19c7eb8f7c4f892607baf3f4004d9dbb008fff2cebdd8c30401abbeb25cbb5
MD5 b7747bb8d18744aefa0ab69a50d85ad3
BLAKE2b-256 832bf4b628935a403385c0c85c9f45b898db2317a9453bbe8b5f1339ca95adc7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page