VidChain: Video Intelligence RAG Framework
Edge-optimized multimodal RAG framework for video understanding — transforms raw footage into a structured, queryable knowledge base.
Overview
VidChain is a lightweight, modular framework that combines computer vision, OCR, speech recognition, emotion analysis, and LLM reasoning into a unified late-fusion pipeline. Designed to run on consumer-grade GPUs (tested on NVIDIA RTX 3050 4GB), it makes on-device video intelligence practical without cloud dependency.
At the heart is B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation) — a conversational AI copilot that translates raw sensor logs into human-readable narratives using abductive reasoning.
Core Pipeline
Video → WAV Extraction → Whisper ASR → Frame Loop →
├── YOLO (Objects)
├── MobileNetV3 (Action)
├── EasyOCR (Screen Text)
├── DeepFace (Emotion, threaded)
└── TemporalTracker (Object Persistence + Camera Motion)
→ Semantic Fusion → ChromaDB → B.A.B.U.R.A.O. RAG
Key Capabilities
🧠 Dual-Brain Vision Engine
- YOLO (Nouns): Detects objects with bounding boxes — "1 person, 1 laptop"
- MobileNetV3 (Verbs): Classifies scene intent — NORMAL / SUSPICIOUS / VIOLENCE / EMERGENCY
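A minimal per-frame sketch of how the two engines could be paired; the checkpoint path, label order, and the describe_frame helper are illustrative, not VidChain internals:

# Dual-brain pass: YOLO names the nouns, MobileNetV3 names the verb.
from collections import Counter

import cv2
import torch
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("yolov8s.pt")                       # nouns: objects + boxes
classifier = models.mobilenet_v3_small(num_classes=4)
classifier.load_state_dict(torch.load("action_engine.pt", map_location="cpu"))  # assumed checkpoint
classifier.eval()
ACTIONS = ["NORMAL", "SUSPICIOUS", "VIOLENCE", "EMERGENCY"]

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def describe_frame(frame_bgr):
    # YOLO: count detected classes -> "1 person, 1 laptop"
    det = detector(frame_bgr, verbose=False)[0]
    counts = Counter(det.names[int(c)] for c in det.boxes.cls)
    objects = ", ".join(f"{n} {name}" for name, n in counts.items())
    # MobileNetV3: classify overall scene intent for the whole frame
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        action = ACTIONS[classifier(preprocess(rgb).unsqueeze(0)).argmax(1).item()]
    return objects, action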
🔤 Context-Aware OCR
EasyOCR runs only when YOLO detects readable surfaces (laptop, monitor, whiteboard) — saves compute while capturing ground-truth text.
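A sketch of that gating logic, assuming a plain EasyOCR reader; the trigger-class set and the maybe_ocr helper are illustrative:

# OCR runs only when the detector reports a surface likely to carry text.
import easyocr

READABLE = {"laptop", "tv", "book", "cell phone"}   # illustrative set of readable surfaces
reader = easyocr.Reader(["en"], gpu=False)          # CPU keeps VRAM free for YOLO

def maybe_ocr(frame_bgr, detected_classes):
    if READABLE.isdisjoint(detected_classes):
        return ""                                   # skip the expensive OCR pass
    results = reader.readtext(frame_bgr, detail=0)  # detail=0 -> plain strings
    return " ".join(results)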
😶 Threaded Emotion Analysis
DeepFace runs on CPU in a background thread so it never competes with YOLO/MobileNet for VRAM.
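A sketch of the worker-thread pattern this describes; the queue sizes, sentinel handling, and emotion_worker name are illustrative rather than VidChain's actual implementation:

# DeepFace stays on a CPU thread so the GPU pipeline never blocks on it.
import queue
import threading

from deepface import DeepFace

frames_in, emotions_out = queue.Queue(maxsize=4), queue.Queue()

def emotion_worker():
    while True:
        ts, frame = frames_in.get()
        if frame is None:                            # sentinel -> shut down
            break
        try:
            result = DeepFace.analyze(frame, actions=["emotion"],
                                      detector_backend="opencv",
                                      enforce_detection=False)
            emotions_out.put((ts, result[0]["dominant_emotion"]))
        except Exception:
            emotions_out.put((ts, "unknown"))

threading.Thread(target=emotion_worker, daemon=True).start()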
📡 Temporal Tracking
- Object Persistence: IoU tracker assigns persistent IDs across frames (person #1 present 12s, moving left)
- Camera Motion: Lucas-Kanade optical flow detects pan, tilt, zoom, static
- Scene Cut Detection: HSV histogram correlation resets trackers on hard cuts
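A sketch of two primitives behind this: IoU overlap for track continuity and HSV-histogram correlation for hard-cut detection. Thresholds and helper names are illustrative:

import cv2

def iou(a, b):
    # boxes as (x1, y1, x2, y2); high IoU means a detection continues an existing track
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def is_scene_cut(prev_bgr, curr_bgr, threshold=0.5):
    hists = []
    for img in (prev_bgr, curr_bgr):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    correlation = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return correlation < threshold        # low correlation -> hard cut, reset trackers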
🗣️ B.A.B.U.R.A.O. RAG Engine
- BGE embedder (BAAI/bge-base-en-v1.5) for domain-specific retrieval
- Cross-encoder reranker for precision before LLM call
- Intent routing — distinguishes video search from conversational follow-ups
- Chat memory — maintains context across multi-turn conversations
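A sketch of the retrieve-then-rerank step using those two models; the collection name, k values, and retrieve helper are illustrative, not B.A.B.U.R.A.O.'s internal API:

# Wide recall from ChromaDB with BGE embeddings, then cross-encoder re-scoring
# so only the best few chunks reach the LLM prompt.
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
collection = chromadb.PersistentClient("./vidchain_storage").get_or_create_collection("timeline")

def retrieve(query, recall_k=20, final_k=5):
    query_vec = embedder.encode(query).tolist()
    hits = collection.query(query_embeddings=[query_vec], n_results=recall_k)
    docs = hits["documents"][0]
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(scores, docs), reverse=True)
    return [d for _, d in ranked[:final_k]]           # context passed to the LLM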
Installation
pip install vidchain
# GPU-accelerated PyTorch (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
Run python scripts/check_gpu.py to verify CUDA is detected.
Quick Start
Python API (Library)
from vidchain import VidChain
# Initialize
vc = VidChain(config={
"llm_provider": "gemini/gemini-2.5-flash", # or "ollama/llama3" for offline
"db_path": "./vidchain_storage" # omit for in-memory (no persistence)
})
# Ingest a video
video_id = vc.ingest("surveillance.mp4")
# Query
print(vc.ask("what happened in the video?"))
print(vc.ask("was anyone acting suspiciously?"))
# Multi-video: scope query to a specific video
vc.ingest("cam1.mp4", video_id="cam1")
vc.ingest("cam2.mp4", video_id="cam2")
print(vc.ask("did anyone enter the room?", video_id="cam1"))
CLI
# Analyze and chat
vidchain-analyze video.mp4
# Single-shot query
vidchain-analyze video.mp4 --query "what happened at the desk?"
# Offline with Ollama
vidchain-analyze video.mp4 --llm ollama/llama3
# Multilingual OCR
vidchain-analyze video.mp4 --ocr-lang en fr
Train Custom Action Engine
# Place labeled images in data/train/<class>/
vidchain-train
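Under the hood this is standard transfer learning on the folder layout above. A rough, hypothetical equivalent with torchvision; the hyperparameters, epoch count, and output filename are illustrative, not what vidchain-train actually does:

# Swap the MobileNetV3 head for the action classes found in data/train/ and train briefly.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("data/train", transform=tf)
loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, len(train_set.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "action_engine.pt")   # illustrative output path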
Knowledge Base Schema
Each fused timeline entry contains all modalities at that moment:
{
"time": 5.8,
"duration": 3.2,
"objects": "1 person, 1 laptop",
"action": "SUSPICIOUS",
"emotion": "visibly agitated",
"ocr": "ASUS Vivobook",
"audio": "I told you this would happen",
"camera": "static",
"tracking": ["person #1 (present 4.8s), moving left", "laptop #2 (present 5.8s)"],
"audio_anomaly": "NORMAL"
}
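Before indexing, an entry like this has to be flattened into a text chunk that the BGE embedder can encode. One plausible template; the exact wording VidChain uses may differ:

def entry_to_chunk(e):
    # Flatten one fused timeline entry into a retrievable sentence for ChromaDB.
    return (
        f"[{e['time']:.1f}s for {e['duration']:.1f}s] "
        f"Objects: {e['objects']}. Action: {e['action']}. "
        f"Emotion: {e['emotion']}. On-screen text: {e['ocr'] or 'none'}. "
        f"Audio: \"{e['audio']}\". Camera: {e['camera']}. "
        f"Tracking: {'; '.join(e['tracking'])}."
    )

# -> "[5.8s for 3.2s] Objects: 1 person, 1 laptop. Action: SUSPICIOUS. ..."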
Tech Stack
| Component | Technology |
|---|---|
| Object Detection | YOLOv8s (Ultralytics) |
| Action Classification | MobileNetV3 (custom fine-tuned) |
| Speech Recognition | OpenAI Whisper (base) |
| OCR | EasyOCR |
| Emotion Analysis | DeepFace (opencv backend) |
| Temporal Tracking | IoU tracker + Lucas-Kanade optical flow |
| Embedder | BAAI/bge-base-en-v1.5 |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Vector Store | ChromaDB (persistent) |
| LLM Routing | LiteLLM (gemini-2.5-flash default, Ollama supported) |
| Scene Understanding | CLIP (openai/clip-vit-base-patch32) |
| GPU Runtime | CUDA 12.1 (4GB+ VRAM, RTX 30-series tested) |
Developer Utilities
# List all indexed videos
vc.list_indexed_videos()
# Generate a narrative summary
vc.summarize_video(video_id, depth="concise") # or "detailed"
# Hot-swap LLM
vc.set_llm("ollama/llama3")
# Purge a specific video
vc.purge_storage(video_id="cam1")
# Purge everything
vc.purge_storage()
Roadmap
- CLIP scene understanding — zero-shot environment classification (v0.3.0)
- Adaptive audio filtering — energy gating, anomaly detection, segment merging (v0.3.0)
- Multi-video scoped queries — vc.ask(query, video_id="cam1") (v0.3.0)
- Graceful degradation — each engine can fail independently without halting the pipeline (v0.3.0)
- Real-time streaming — live camera ingestion with low-latency indexing
- Cross-video subject tracking — link the same person across multiple camera feeds
- Export to CSV — structured timeline export for downstream analysis
Contributing
Contributions, issues, and feature requests are welcome. Open a GitHub issue or submit a pull request.
Author
Rahul Sharma — B.Tech CSE, IIIT Manipur
License
Distributed under the MIT License.