
video2ai

Turn any video into AI-ready structured content. Extract frames, transcribe audio, auto-detect key moments — all running locally on your Mac's Neural Engine.

No cloud. No API keys. No PyTorch. Just Apple Silicon doing what it does best.

License: MIT · Python 3.10+


Installation

Prerequisites

  • macOS (Apple Vision framework required)
  • ffmpeg — install with brew install ffmpeg

Option 1: pip (recommended)

pip install video2ai

With Apple Vision support (OCR, embeddings, classification):

pip install "video2ai[vision]"

Option 2: Homebrew

brew tap sameeeeeeep/video2ai https://github.com/sameeeeeeep/video2ai.git
brew install video2ai

Option 3: Download binary

Grab the latest pre-built macOS binary from Releases — no Python required:

curl -L https://github.com/sameeeeeeep/video2ai/releases/latest/download/video2ai -o video2ai
chmod +x video2ai
sudo mv video2ai /usr/local/bin/

Option 4: Install from source

git clone https://github.com/sameeeeeeep/video2ai.git && cd video2ai
pip install -e ".[vision]"

Optional extras

pip install openai-whisper    # transcription (local, base model is fine)
brew install yt-dlp           # URL downloads (YouTube, Vimeo, etc.)

The Problem

You have a video. You need an AI to understand it. But LLMs can't watch videos — they need frames + text. Manually scrubbing through to pick the right frames is tedious. Existing tools are slow, memory-hungry, or require cloud APIs.

The Solution

Video → ffmpeg + Whisper + Apple Vision → structured content in seconds

Drop a video in. Get back:

  • Timestamped transcript — Whisper, fully local
  • Key frames auto-selected per transcript segment — Apple Vision Neural Engine embeddings + cosine similarity
  • Visual theme clusters — k-means on frame embeddings, filter out talking heads, keep product shots
  • Lightweight Markdown export — local image paths, no base64 bloat, AI reads text instantly and loads images on demand
  • Self-contained HTML export — images embedded inline, for human viewing
  • Screen capture — record any tab/screen directly from the browser, bypasses all platform download restrictions
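The theme-clustering bullet above boils down to ordinary k-means over embedding vectors. A minimal pure-Python sketch, assuming nothing about the real pipeline beyond "k-means on frame embeddings" (the actual code clusters Apple Vision's 768-dim feature prints; seeding centroids from the first k points is a simplification):

```python
def kmeans(points, k, iters=10):
    """Naive k-means: assign each point to its nearest centroid,
    then recompute centroids, for a fixed number of iterations."""
    centroids = [list(p) for p in points[:k]]  # seed from the first k points

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # avoid collapsing an empty cluster
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

# Two obvious visual "themes": talking-head-like vs product-shot-like vectors
embeddings = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.2]]
labels = kmeans(embeddings, k=2)
```

Once every frame carries a cluster label, "filter out talking heads" is just deselecting one label.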

Quick Start

Web UI

video2ai --web
# → http://localhost:8910

Three input modes:

  • Upload — drag a video file
  • Paste URL — YouTube, Threads, Vimeo, anything yt-dlp supports
  • Screen Capture — share any browser tab or screen, record at 1fps + audio, process through the same pipeline. Works with Instagram, TikTok, Netflix — anything on screen.

CLI

video2ai video.mp4 -o output/

Claude Code Skill

# Invoke from Claude Code:
/video2ai /path/to/video.mp4

The skill runs the full pipeline and outputs a lightweight Markdown file, with local image paths, that Claude can read directly.
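The exact export layout isn't shown on this page; as a purely hypothetical sketch, a lightweight Markdown export pairing transcript segments with local frame paths might look like:

```
# video.mp4 — transcript + key frames

## [00:03–00:07] "Welcome to the demo..."
![frame](frames/frame_00003.jpg)

## [00:08–00:15] "First, open the settings panel..."
![frame](frames/frame_00010.jpg)
```

Because the images are referenced by path rather than inlined, the file stays small enough for an LLM to read the text immediately and load images only when asked.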

How It Works

Video file / URL / Screen capture
  │
  ├─ ffmpeg ──────────── frames (1/sec, JPEG)
  │
  ├─ Whisper ─────────── transcript segments + timestamps
  │
  ├─ Apple Vision ────── 768-dim embedding per frame (Neural Engine)
  │    │
  │    ├─ per-segment ── cosine distance → visual state changes → key frame suggestions
  │    │
  │    └─ global ─────── k-means clustering → visual theme groups
  │
  ├─ Apple Vision OCR ── optional, on-demand text extraction from key frames
  │
  └─ Apple Intelligence ── on-device OCR summary via FoundationModels (auto-launches server)
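The frame-extraction leg of the diagram corresponds to a standard ffmpeg invocation. A sketch of how such a command might be built (the flags frames.py actually uses are not documented here, but -vf fps=1 is the canonical way to get one JPEG per second, and -q:v 2 requests high JPEG quality):

```python
def frame_extract_cmd(video_path, out_dir, fps=1):
    """Build an ffmpeg argv that writes one JPEG per `fps` seconds."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # sample the video at a fixed rate
        "-q:v", "2",                  # high-quality JPEG output
        f"{out_dir}/frame_%05d.jpg",  # numbered frames
    ]

cmd = frame_extract_cmd("video.mp4", "frames")
# run with: subprocess.run(cmd, check=True)
```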

The key insight: frame selection is a vector math problem, not an LLM problem. Embed every frame, embed (or timestamp-match) every transcript segment, pick the frames with the highest visual distinctiveness per segment. Runs in seconds, not minutes.
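Concretely, "most visually distinct frame per segment" is a few lines of vector math. A pure-Python sketch of one plausible criterion, picking the frame least similar to the segment's mean embedding (the helper names are mine, not the library's):

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def pick_key_frame(frame_embeddings):
    """Return the index of the frame least similar to the segment's
    mean embedding, i.e. the most visually distinctive one."""
    mean = [sum(col) / len(frame_embeddings)
            for col in zip(*frame_embeddings)]
    return min(range(len(frame_embeddings)),
               key=lambda i: cosine(frame_embeddings[i], mean))

# The third embedding points in a different direction, so it is picked
key = pick_key_frame([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])  # → 2
```

No model inference happens here at all, which is why the step runs in seconds.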

Zero ML overhead in Python. VNGenerateImageFeaturePrintRequest runs on the Neural Engine — the Python process just shuffles bytes. No PyTorch, no CLIP, no transformers loaded into RAM.

The Workflow

  1. Upload, paste URL, or screen capture — any input mode
  2. Pipeline runs — probe → extract → transcribe → embed → suggest
  3. Review — transcript sidebar, frame grid per segment, pre-selected key frames
  4. Filter by visual theme — click to deselect/select all frames in a theme, right-click to suppress
  5. OCR (optional) — run Apple Vision OCR on selected key frames, auto-summarized by Apple Intelligence on-device
  6. Export — Markdown (for AI) or HTML (for humans). OCR summary included by default, raw OCR opt-in.

Export Formats

| Format | Mode | Best for |
|---|---|---|
| Markdown | Download for AI | AI consumption — lightweight text + local image paths, ~150 lines vs 170k tokens |
| HTML | Download HTML | Human viewing — self-contained, base64 images, opens in any browser |
| HTML (AI) | ?mode=ai | Compressed thumbnails, still self-contained |

Architecture

| Module | What it does |
|---|---|
| probe.py | ffprobe wrapper — duration, resolution, codecs, audio detection |
| frames.py | ffmpeg frame extraction at configurable intervals |
| transcribe.py | Whisper speech-to-text, returns timed segments |
| clip_match.py | Apple Vision embeddings, visual change detection, k-means clustering |
| vision.py | Apple Vision OCR + image classification + Apple Intelligence summarization |
| llm.py | Ollama LLM analysis — optional, for summaries |
| web.py | Flask web UI — upload, URL, screen capture, review, export |
| embed.py | Bake metadata into video via ffmpeg |

Why Not Just Use CLIP?

We tried. CLIP + PyTorch eats ~2GB RAM and requires loading a 600MB model. Apple Vision's VNGenerateImageFeaturePrintRequest runs on the Neural Engine with near-zero memory overhead — it's already on your machine, already optimized, and produces 768-dim embeddings that work great for frame similarity.

For transcript↔frame matching, we don't even need cross-modal embeddings. The transcript gives us timestamps → we know which frames belong to which segment → we pick the most visually distinct ones within each segment. Simple, fast, accurate.
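The timestamp matching described above is trivial arithmetic once frames are extracted at a known rate. A sketch assuming Whisper-style segment dicts (start and end are the fields Whisper's segments actually carry; the function name is mine):

```python
def frames_for_segment(segment, fps=1.0):
    """Map a transcript segment's time span onto frame indices,
    assuming frames were extracted at a fixed `fps`."""
    first = int(segment["start"] * fps)
    last = int(segment["end"] * fps)
    return list(range(first, last + 1))

seg = {"start": 3.2, "end": 7.9, "text": "..."}
indices = frames_for_segment(seg)  # → [3, 4, 5, 6, 7]
```

With this mapping in hand, the cosine-distinctiveness selection runs independently inside each segment's bucket of frames.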

Contributing

git clone https://github.com/sameeeeeeep/video2ai.git && cd video2ai
make dev

Releasing

make release VERSION=0.2.0

This bumps the version, commits, tags, and pushes. GitHub Actions handles PyPI publishing, binary builds, and Homebrew formula updates automatically.

License

MIT


Built with Claude Code.
