A Lightweight Video RAG Framework for Multimodal Reasoning

VideoChain

Edge-optimized multimodal RAG framework for video understanding — transforms raw footage into a structured, queryable knowledge base.

Overview

VideoChain is a lightweight, modular framework that combines computer vision, speech recognition, and LLM reasoning into a unified late-fusion pipeline. It is designed to run on consumer-grade GPUs (tested on an NVIDIA RTX 3050) and schedules VRAM so that vision and language inference can share a single device, which makes on-device video analysis practical without a cloud dependency.

Core Pipeline

Video Input → Frame Extraction → Vision Inference → Audio Transcription → Fusion Engine → Knowledge Base → LLM Query

Key Capabilities

Adaptive Keyframe Extraction

Gaussian-blurred frame differencing filters transient noise and isolates semantically significant motion events, reducing redundant frame processing by discarding visually similar consecutive frames.
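The idea can be illustrated with a minimal, dependency-free sketch. It is not VideoChain's actual implementation: the 3x3 binomial kernel stands in for a real Gaussian blur (for example OpenCV's cv2.GaussianBlur), and the function names, the mean-absolute-difference metric, and the threshold value are all assumptions made for illustration.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (0-255). Hypothetical type alias.
Frame = List[List[float]]


def blur_3x3(frame: Frame) -> Frame:
    """Smooth a frame with a 3x3 binomial (1-2-1) kernel, a small Gaussian approximation.

    Pixels outside the frame are replaced by the nearest edge pixel.
    """
    height, width = len(frame), len(frame[0])
    weights = (1, 2, 1)
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy, wy in zip((-1, 0, 1), weights):
                yy = min(max(y + dy, 0), height - 1)
                for dx, wx in zip((-1, 0, 1), weights):
                    xx = min(max(x + dx, 0), width - 1)
                    total += wy * wx * frame[yy][xx]
            out[y][x] = total / 16.0  # kernel weights sum to 16
    return out


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    height, width = len(a), len(a[0])
    total = sum(abs(a[y][x] - b[y][x]) for y in range(height) for x in range(width))
    return total / (height * width)


def select_keyframes(frames: Sequence[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    keep = [0]  # the first frame is always kept as the initial reference
    reference = blur_3x3(frames[0])
    for index in range(1, len(frames)):
        candidate = blur_3x3(frames[index])
        if mean_abs_diff(candidate, reference) > threshold:
            keep.append(index)
            reference = candidate
    return keep
```

Blurring before differencing spreads isolated pixel noise over its neighbours, so a single flickering pixel barely moves the mean difference, while a large moving region still crosses the threshold.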

Multimodal Data Alignment

Visual labels from MobileNetV3 and Whisper-generated transcripts are synchronized via timestamp, producing a unified timeline for downstream retrieval and reasoning.
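A rough sketch of timestamp-based alignment follows. The record types, field names, and the align_by_timestamp function are hypothetical and are not taken from the VideoChain codebase; the sketch also assumes transcript segments carry start and end times and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VisualLabel:
    time_s: float  # seconds from the start of the video
    label: str


@dataclass
class TranscriptSegment:
    start_s: float
    end_s: float
    text: str


def align_by_timestamp(
    labels: List[VisualLabel], segments: List[TranscriptSegment]
) -> List[Dict]:
    """Attach to each visual label the transcript segment that covers its timestamp."""
    ordered_segments = sorted(segments, key=lambda s: s.start_s)
    starts = [s.start_s for s in ordered_segments]
    timeline = []
    for item in sorted(labels, key=lambda v: v.time_s):
        # Index of the last segment that starts at or before this label.
        idx = bisect_right(starts, item.time_s) - 1
        audio = None
        if idx >= 0 and item.time_s < ordered_segments[idx].end_s:
            audio = ordered_segments[idx].text
        timeline.append({"time_s": item.time_s, "visual": item.label, "audio": audio})
    return timeline
```

Labels that fall in a silent gap keep audio set to None, so the timeline still records what was seen even when nothing was said.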

Domain-Agnostic Design

Modular loader and processor interfaces allow straightforward adaptation to security, retail analytics, education, and personal content search — without restructuring the core pipeline.

Edge-First Optimization

Vision and LLM inference run concurrently under careful VRAM scheduling. The pipeline is validated on an RTX 3050 (4 GB), and core pipeline execution has no cloud inference dependency.

Installation

# Clone the repository
git clone https://github.com/rahulsiiitm/videochain
cd videochain

# Install in editable mode (recommended for development)
pip install -e .

⚠️ Requirement: NVIDIA drivers and CUDA 12.1 are required for GPU-accelerated inference. CPU-only execution is supported but significantly slower for vision workloads.

Quick Start

1 — Build a knowledge base

Analyze a video file and generate a structured JSON knowledge base:

videochain-analyze --input sample.mp4

Output: knowledge_base.json

2 — Train a custom vision model

Fine-tune the vision classifier on a domain-specific dataset:

videochain-train --epochs 15 --batch-size 16

Place labeled training images under data/train/ before running. Class subdirectory names become label strings in the knowledge base.
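To check which labels a dataset would produce before training, the directory layout can be inspected with a few lines of Python. This is a sketch only: discover_classes is a hypothetical helper, and the sorted name-to-index mapping follows the common convention used by torchvision's ImageFolder. Whether videochain-train assigns indices the same way is an assumption.

```python
from pathlib import Path
from typing import Dict


def discover_classes(train_dir: str = "data/train") -> Dict[str, int]:
    """Map each class subdirectory name to an index, in sorted order.

    Plain files directly under train_dir are ignored.
    """
    root = Path(train_dir)
    names = sorted(p.name for p in root.iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(names)}
```

For a layout with data/train/person/ and data/train/running/, this returns a mapping with the keys "person" and "running".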

3 — Query the knowledge base

Use Ollama (local) or Gemini API (cloud) to issue natural language queries over the generated knowledge base.
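The README does not document a query command, so the following is only a sketch of how events from the knowledge base might be turned into a prompt for a local Ollama server. The helper names and prompt wording are hypothetical. The request targets Ollama's /api/generate endpoint on its default port, and the event fields are assumed to match the Knowledge Base Schema section.

```python
import json
import urllib.request
from typing import Dict, List


def build_prompt(events: List[Dict], question: str) -> str:
    """Render knowledge-base events as plain-text context followed by the question."""
    lines = []
    for event in events:
        visual = ", ".join(event.get("visual", []))
        audio = event.get("audio", "")
        lines.append(f"[{event['timestamp']}] visual: {visual} | audio: {audio}")
    context = "\n".join(lines)
    return (
        "Answer the question using only the video events below.\n\n"
        f"Events:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def build_ollama_request(
    prompt: str, model: str = "llama3", host: str = "http://localhost:11434"
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request for Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With an Ollama server running and the model pulled, the request can be sent with urllib.request.urlopen and the answer read from the "response" field of the returned JSON.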

System Architecture

VideoChain follows a late-fusion architecture — each modality is processed independently before being merged at the knowledge-base level. This decouples model upgrade paths and allows per-modality optimization.

Layer	Component	Responsibility
1	Loaders	Frame extraction (OpenCV), audio separation (MoviePy), format normalization
2	Processors	Vision: MobileNetV3 classification · Audio: Whisper transcription with word-level timestamps
3	Fusion Engine	Timestamp synchronization, confidence-weighted merging of modalities
4	LLM Reasoning	Natural language querying via Ollama (Llama 3, local) or Gemini API (remote)
5	Knowledge Base	Structured JSON output — indexed by timestamp, designed for vector DB integration

Knowledge Base Schema

Each event entry in knowledge_base.json follows this structure:

{
  "timestamp": "00:01:23",
  "visual": ["person", "running"],
  "audio": "Someone is running across the hallway",
  "confidence": 0.91,
  "frame_index": 2490
}
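Entries in this shape can be filtered without an LLM. The sketch below assumes that knowledge_base.json holds a list of such entries at the top level, which the README does not state. The helpers to_seconds and find_events are hypothetical names.

```python
from typing import Dict, List


def to_seconds(timestamp: str) -> int:
    """Convert an HH:MM:SS timestamp into seconds from the start of the video."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def find_events(
    events: List[Dict],
    label: str,
    start: str = "00:00:00",
    end: str = "99:59:59",
    min_confidence: float = 0.0,
) -> List[Dict]:
    """Return events with the given visual label inside the [start, end] time window."""
    lo, hi = to_seconds(start), to_seconds(end)
    return [
        event
        for event in events
        if label in event["visual"]
        and lo <= to_seconds(event["timestamp"]) <= hi
        and event["confidence"] >= min_confidence
    ]
```

For the sample entry, to_seconds("00:01:23") gives 83. If frame_index counts frames from the start of the video, 2490 / 83 = 30, so the sample is consistent with a 30 fps source.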

Project Structure

videochain/
├── core/            # Fusion engine, LLM query interface, KB I/O
├── loaders/         # Video frame extraction, audio separation
├── processors/      # MobileNetV3 vision, Whisper audio inference
├── scripts/         # Training utilities, dataset prep helpers
├── data/
│   └── train/       # Class-labeled training images (one dir per class)
└── pyproject.toml   # Dependencies, CLI entry points, metadata

Tech Stack

Component	Technology
Vision model	MobileNetV3
ASR	OpenAI Whisper
Video I/O	OpenCV + MoviePy
ML framework	PyTorch
LLM (local)	Ollama / Llama 3
LLM (cloud)	Gemini API
Language	Python 3.10+
GPU runtime	CUDA 12.1
Packaging	pyproject.toml

Use Cases

#	Use Case	Description
01	CCTV Surveillance	Query footage for specific events, persons, or time windows in natural language
02	Retail Analytics	Track customer behavior patterns and dwell-time events across store zones
03	Lecture Indexing	Search educational video by spoken content or visual slide transitions
04	Personal Media Search	Find moments in home video archives using natural language descriptions

Roadmap

Real-time streaming pipeline — live ingestion and indexing with low-latency event detection
Vector database integration — FAISS or Chroma backends for semantic similarity search
Advanced temporal reasoning — event co-occurrence detection, causal chain inference, multi-clip reasoning
Query dashboard — browser-based UI for video playback, timeline visualization, and KB exploration

Contributing

Contributions, issues, and feature requests are welcome. Open a GitHub issue or submit a pull request.

Author

Rahul Sharma — B.Tech CSE, IIIT Manipur

License

Distributed under the MIT License.

Project details

This version

0.1.0

Apr 2, 2026

Download files

Source Distribution

videochain-0.1.0.tar.gz (16.8 kB)

Uploaded Apr 2, 2026 Source

Built Distribution

videochain-0.1.0-py3-none-any.whl (17.2 kB)

Uploaded Apr 2, 2026 Python 3

File details

Details for the file videochain-0.1.0.tar.gz.

File metadata

Download URL: videochain-0.1.0.tar.gz
Upload date: Apr 2, 2026
Size: 16.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.8

File hashes

Hashes for videochain-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2898c7201c7b07c5f234df6793a2982e2240d5cfdff311222ca5a006966f2412`
MD5	`33255f30aa7c9194e96f4d2c3ff2256f`
BLAKE2b-256	`d0841ac0a1c8c0c11e5209f08a977f61c5f36c41fab97c8e9e38506cf4ee15d8`

File details

Details for the file videochain-0.1.0-py3-none-any.whl.

File metadata

Download URL: videochain-0.1.0-py3-none-any.whl
Upload date: Apr 2, 2026
Size: 17.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.8

File hashes

Hashes for videochain-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d8fe83e78b64e0e1c6e2c2d576cce5e23db2b988ee1f6a9c909e7d5b7c72031`
MD5	`f0e46070599b0b07ff53f6ac2de0b613`
BLAKE2b-256	`171ebe019f393795ef8850cd4d120e6b8c1b8ada48c0d4ecb5dba4701497e799`
