A Lightweight Video RAG Framework for Multimodal Reasoning
VidChain: Video Intelligence RAG Framework
Edge-optimized multimodal RAG framework for video understanding — transforms raw footage into a structured, queryable knowledge base.
Overview
vidchain v0.2.0 is a lightweight, modular framework that combines computer vision, context-aware OCR, speech recognition, and LLM reasoning into a unified late-fusion pipeline. Designed to run efficiently on consumer-grade GPUs (tested on an NVIDIA RTX 3050), it turns raw footage into human-readable narratives, making on-device video intelligence practical without heavy cloud dependency.
At the heart of the framework is B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation), an elite AI copilot that uses abductive reasoning to translate raw, flickering object/action logs into flowing, conversational narratives.
Core Pipeline
Video Input → Adaptive Keyframes → Dual-Brain Vision (YOLO + MobileNet) + OCR → Audio Transcription → Semantic Chunking → FAISS Vector DB → B.A.B.U.R.A.O. RAG
Key Capabilities
🧠 Dual-Brain Vision Engine
Instead of basic classification, vidchain uses a two-pronged visual approach:
- The "Noun" Engine (YOLOv8): Detects specific objects (e.g., "1 person, 2 laptops").
- The "Verb" Engine (MobileNetV3): Classifies the intent or state of the scene (e.g., NORMAL, SUSPICIOUS, VIOLENCE).
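The fusion of the two engines can be sketched as follows. This is an illustrative, pure-Python mock of how per-frame "noun" detections and a "verb" label might be combined into one timeline entry; `fuse_frame` is a hypothetical helper, not vidchain's actual API, and real detector calls are omitted:

```python
from collections import Counter

# Hypothetical sketch: combine object counts ("nouns") with the
# scene-level action label ("verb") into a single frame description.
def fuse_frame(detections: list[str], action_state: str) -> str:
    counts = Counter(detections)
    subjects = ", ".join(f"{n} {label}" for label, n in sorted(counts.items()))
    return f"Subjects: {subjects or 'none'} | Action State: {action_state}"

print(fuse_frame(["person", "laptop", "laptop"], "NORMAL"))
# → Subjects: 2 laptop, 1 person | Action State: NORMAL
```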
🔤 Context-Aware OCR
Powered by EasyOCR, the system intelligently scans for text only when YOLO detects readable surfaces (monitors, laptops, books, whiteboards), saving massive compute power while capturing ground-truth data (e.g., reading the brand "ASUS Vivobook" off a laptop).
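The gating idea can be illustrated with a few lines of Python. This is not vidchain's actual code; the label set and function name are assumptions for the sketch:

```python
# Illustrative OCR gate: only invoke EasyOCR when the frame's detections
# include a surface likely to carry readable text (label set assumed).
READABLE_SURFACES = {"laptop", "tv", "monitor", "book", "cell phone"}

def should_run_ocr(detected_labels: list[str]) -> bool:
    return any(label in READABLE_SURFACES for label in detected_labels)

should_run_ocr(["person", "laptop"])  # True  -> run OCR on this frame
should_run_ocr(["person", "chair"])   # False -> skip OCR, save compute
```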
B.A.B.U.R.A.O. RAG Engine (Conversational)
Unlike standard RAGs that read out robotic timelines, B.A.B.U.R.A.O. acts as a human copilot:
- Abductive Reasoning: If it sees a "laptop" and a "keyboard", it deduces the scene is a "computer desk."
- Sensor Filtering: Automatically ignores momentary hardware glitches/hallucinations (e.g., a TV briefly misidentified as an oven).
- Natural Translation: Translates raw model labels like VIOLENCE into contextual human behaviors like "the person became visibly frustrated and hit the desk."
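Sensor filtering can be approximated with a temporal debounce: a label is trusted only if it persists across several consecutive frames. This is a hedged sketch, not vidchain's implementation; class name and window size are invented for illustration:

```python
from collections import deque

# Illustrative flicker filter: suppress one-frame hallucinations
# (e.g., a TV briefly misread as an oven) by requiring a label to
# appear in every frame of a sliding window before it is trusted.
class FlickerFilter:
    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, labels: set[str]) -> set[str]:
        self.history.append(labels)
        if len(self.history) < self.history.maxlen:
            return set()  # not enough evidence yet
        return set.intersection(*self.history)

f = FlickerFilter(window=3)
f.update({"tv"})
f.update({"tv", "oven"})   # momentary misdetection
stable = f.update({"tv"})  # → {"tv"}: the glitch never stabilizes
```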
Edge-First GPU Optimization
Engineered to prevent VRAM crashes. Smart memory routing disables PyTorch's buggy layer fusion during YOLO inference and safely manages VRAM across concurrent vision, audio, and language models.
Installation
# 1. Install the core package
pip install vidchain
# 2. IMPORTANT: Install GPU-accelerated PyTorch (CUDA 12.1 recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
⚠️ Requirement: NVIDIA drivers and CUDA are strongly recommended. To verify your hardware is correctly mapped, run the built-in diagnostic script:
python scripts/check_gpu.py
Quick Start
1 — Analyze a Video (Build Knowledge Base)
Analyze a video file, extract multimodal context, and generate a structured JSON timeline:
vidchain-analyze sample.mp4
This command automatically builds a FAISS index and drops you into the interactive B.A.B.U.R.A.O. chat terminal.
2 — Train the Action Engine
Fine-tune the MobileNetV3 "Verb" classifier on your domain-specific dataset:
vidchain-train
Place labeled training images under data/train/ before running.
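A typical layout assumes one sub-folder per action class. The folder names below mirror the action states shown elsewhere in this README and are an assumption; confirm the expected classes against your install:

```
data/train/
├── NORMAL/
│   ├── img_001.jpg
│   └── ...
├── SUSPICIOUS/
└── VIOLENCE/
```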
Knowledge Base Schema
The framework utilizes Semantic Chunking to compress repetitive frames. The knowledge_base.json outputs a clean, fused timeline (excerpt):

```json
[
  {
    "time": 0.97,
    "type": "ocr",
    "content": "ASUS Vivabook"
  },
  {
    "time": 3.87,
    "type": "visual",
    "content": "Duration: [3.87s - 6.77s] | Subjects: 1 laptop, 1 tv | Action State: SUSPICIOUS"
  },
  {
    "time": 19.34,
    "type": "visual",
    "content": "Duration: [19.34s - 19.34s] | Subjects: 1 tv | Action State: VIOLENCE"
  }
]
```
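The chunking step can be sketched as a simple merge of consecutive identical frame descriptions into duration ranges. This is an assumed behavior inferred from the schema above, not vidchain's actual code:

```python
# Illustrative semantic chunking: consecutive frames with identical
# content collapse into one entry spanning [start, end].
def chunk_timeline(frames: list[tuple[float, str]]) -> list[dict]:
    chunks = []
    for t, content in frames:
        if chunks and chunks[-1]["content"] == content:
            chunks[-1]["end"] = t  # extend the current chunk
        else:
            chunks.append({"start": t, "end": t, "content": content})
    return chunks

frames = [(0.0, "1 tv | NORMAL"), (0.5, "1 tv | NORMAL"), (1.0, "1 tv | VIOLENCE")]
chunk_timeline(frames)
# → [{'start': 0.0, 'end': 0.5, 'content': '1 tv | NORMAL'},
#    {'start': 1.0, 'end': 1.0, 'content': '1 tv | VIOLENCE'}]
```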
Tech Stack
| Component | Technology |
|---|---|
| Object Detection (Nouns) | YOLOv8s |
| Intent Classification (Verbs) | MobileNetV3 (Custom fine-tuned) |
| Text Extraction (OCR) | EasyOCR |
| ASR (Audio) | OpenAI Whisper (Base) |
| Vector Database | FAISS + Sentence-Transformers (all-MiniLM-L6-v2) |
| LLM Routing | LiteLLM (gemini-2.5-flash default, Ollama supported) |
| GPU Runtime | CUDA 12.1 (Optimized for 4GB+ VRAM) |
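The retrieval step in the stack above amounts to a nearest-neighbor search over sentence embeddings. The sketch below approximates it with a brute-force cosine search over toy vectors, standing in for FAISS + MiniLM (the real embeddings are 384-dimensional; the three-dimensional vectors and document texts here are fabricated for illustration):

```python
import math

# Toy stand-in for the FAISS + Sentence-Transformers retrieval step:
# brute-force cosine similarity over pre-computed embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query, index, k=2):
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

index = {
    "laptop on desk":   [0.9, 0.1, 0.0],
    "person hits desk": [0.1, 0.9, 0.2],
    "tv playing":       [0.0, 0.2, 0.9],
}
top_k([0.8, 0.2, 0.1], index, k=1)  # → ["laptop on desk"]
```

In the real pipeline, FAISS replaces the linear scan with an optimized index, but the ranking principle is the same.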
Roadmap
- Real-time streaming pipeline — live ingestion and indexing with low-latency event detection.
- Advanced temporal reasoning — multi-clip reasoning and cross-camera subject tracking.
- Interactive Dashboard — PyQt5 HUD for video playback, timeline visualization, and KB exploration.
Contributing
Contributions, issues, and feature requests are highly welcome! Open a GitHub issue or submit a pull request.
Author
Rahul Sharma — B.Tech CSE, IIIT Manipur
License
Distributed under the MIT License.
File details
Details for the file vidchain-0.2.0.tar.gz.
File metadata
- Download URL: vidchain-0.2.0.tar.gz
- Upload date:
- Size: 18.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ec1587154eeacd8734141c1e16efb14f16d8b550e980f906b4056d8fbdfe964b` |
| MD5 | `52b67c9f766eed18e2bd59a930b11bb7` |
| BLAKE2b-256 | `88fdd6d3b5740c8fda556eee45ef2ee560818e7ca132b440962c4846fc977f90` |
File details
Details for the file vidchain-0.2.0-py3-none-any.whl.
File metadata
- Download URL: vidchain-0.2.0-py3-none-any.whl
- Upload date:
- Size: 22.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.8
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `43e65a9a41231ca144959b6aa8ba905363306cbe7e11bba9d744f04c860fed4b` |
| MD5 | `e5da40448171d4c369e0b5fcd6855c38` |
| BLAKE2b-256 | `47bb2300793759e970134f446bc2fa3d5de54f3daba3e5bf60ab3b65a81f380e` |