Skip to main content

Edge-optimized multimodal RAG framework for video understanding

Project description

VidChain: Video Intelligence RAG Framework

Edge-optimized multimodal RAG framework for video understanding — transforms raw footage into a structured, queryable knowledge base.

Python CUDA License Status PyPI version


Overview

VidChain v0.2.0 is a lightweight, modular framework that combines computer vision, OCR, speech recognition, emotion analysis, and LLM reasoning into a unified late-fusion pipeline. Designed to run on consumer-grade GPUs (tested on NVIDIA RTX 3050 4GB), it makes on-device video intelligence practical without cloud dependency.

At the heart is B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation) — a conversational AI copilot that translates raw sensor logs into human-readable narratives using abductive reasoning.


Core Pipeline

Video → WAV Extraction → Whisper ASR → Frame Loop →
  ├── YOLO (Objects)
  ├── MobileNetV3 (Action)
  ├── EasyOCR (Screen Text)
  ├── DeepFace (Emotion, threaded)
  └── TemporalTracker (Object Persistence + Camera Motion)
→ Semantic Fusion → ChromaDB → B.A.B.U.R.A.O. RAG

Key Capabilities

🧠 Dual-Brain Vision Engine

  • YOLO (Nouns): Detects objects with bounding boxes — "1 person, 1 laptop"
  • MobileNetV3 (Verbs): Classifies scene intent — NORMAL / SUSPICIOUS / VIOLENCE / EMERGENCY

🔤 Context-Aware OCR

EasyOCR runs only when YOLO detects readable surfaces (laptop, monitor, whiteboard) — saves compute while capturing ground-truth text.

😶 Threaded Emotion Analysis

DeepFace runs on CPU in a background thread so it never competes with YOLO/MobileNet for VRAM.

📡 Temporal Tracking

  • Object Persistence: IoU tracker assigns persistent IDs across frames (person #1 present 12s, moving left)
  • Camera Motion: Lucas-Kanade optical flow detects pan, tilt, zoom, static
  • Scene Cut Detection: HSV histogram correlation resets trackers on hard cuts

🗣️ B.A.B.U.R.A.O. RAG Engine

  • BGE embedder (BAAI/bge-base-en-v1.5) for domain-specific retrieval
  • Cross-encoder reranker for precision before LLM call
  • Intent routing — distinguishes video search from conversational follow-ups
  • Chat memory — maintains context across multi-turn conversations

Installation

pip install vidchain

# GPU-accelerated PyTorch (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

Run python scripts/check_gpu.py to verify CUDA is detected.


Quick Start

Python API (Library)

from vidchain import VidChain

# Initialize
vc = VidChain(config={
    "llm_provider": "gemini/gemini-2.5-flash",  # or "ollama/llama3" for offline
    "db_path": "./vidchain_storage"              # omit for in-memory (no persistence)
})

# Ingest a video
video_id = vc.ingest("surveillance.mp4")

# Query
print(vc.ask("what happened in the video?"))
print(vc.ask("was anyone acting suspiciously?"))

# Multi-video: scope query to a specific video
vc.ingest("cam1.mp4", video_id="cam1")
vc.ingest("cam2.mp4", video_id="cam2")
print(vc.ask("did anyone enter the room?", video_id="cam1"))

CLI

# Analyze and chat
vidchain-analyze video.mp4

# Single-shot query
vidchain-analyze video.mp4 --query "what happened at the desk?"

# Offline with Ollama
vidchain-analyze video.mp4 --llm ollama/llama3

# Multilingual OCR
vidchain-analyze video.mp4 --ocr-lang en fr

Train Custom Action Engine

# Place labeled images in data/train/<class>/
vidchain-train

Knowledge Base Schema

Each fused timeline entry contains all modalities at that moment:

{
    "time": 5.8,
    "duration": 3.2,
    "objects": "1 person, 1 laptop",
    "action": "SUSPICIOUS",
    "emotion": "visibly agitated",
    "ocr": "ASUS Vivobook",
    "audio": "I told you this would happen",
    "camera": "static",
    "tracking": ["person #1 (present 4.8s), moving left", "laptop #2 (present 5.8s)"],
    "audio_anomaly": "NORMAL"
}

Tech Stack

Component Technology
Object Detection YOLOv8s (Ultralytics)
Action Classification MobileNetV3 (custom fine-tuned)
Speech Recognition OpenAI Whisper (base)
OCR EasyOCR
Emotion Analysis DeepFace (opencv backend)
Temporal Tracking IoU tracker + Lucas-Kanade optical flow
Embedder BAAI/bge-base-en-v1.5
Reranker cross-encoder/ms-marco-MiniLM-L-6-v2
Vector Store ChromaDB (persistent)
LLM Routing LiteLLM (gemini-2.5-flash default, Ollama supported)
Scene Understanding CLIP (openai/clip-vit-base-patch32)
GPU Runtime CUDA 12.1 (4GB+ VRAM, RTX 30-series tested)

Developer Utilities

# List all indexed videos
vc.list_indexed_videos()

# Generate a narrative summary
vc.summarize_video(video_id, depth="concise")  # or "detailed"

# Hot-swap LLM
vc.set_llm("ollama/llama3")

# Purge a specific video
vc.purge_storage(video_id="cam1")

# Purge everything
vc.purge_storage()

Roadmap

  • CLIP scene understanding — zero-shot environment classification (v0.3.0)
  • Adaptive audio filtering — energy gating, anomaly detection, segment merging (v0.3.0)
  • Multi-video scoped queriesvc.ask(query, video_id="cam1") (v0.3.0)
  • Graceful degradation — every engine fails independently (v0.3.0)
  • Real-time streaming — live camera ingestion with low-latency indexing
  • Cross-video subject tracking — link the same person across multiple camera feeds
  • Export to CSV — structured timeline export for downstream analysis

Contributing

Contributions, issues, and feature requests are welcome. Open a GitHub issue or submit a pull request.


Author

Rahul Sharma — B.Tech CSE, IIIT Manipur

License

Distributed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vidchain-0.4.0.tar.gz (43.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vidchain-0.4.0-py3-none-any.whl (52.0 kB view details)

Uploaded Python 3

File details

Details for the file vidchain-0.4.0.tar.gz.

File metadata

  • Download URL: vidchain-0.4.0.tar.gz
  • Upload date:
  • Size: 43.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.8

File hashes

Hashes for vidchain-0.4.0.tar.gz
Algorithm Hash digest
SHA256 444e58ca24336564ea2fbc76caa48df04208c8c0c206a81f89bfba171938d3ed
MD5 616be2486d7dc7537504fef4126e0eb2
BLAKE2b-256 80d0d9f29f52e85872450e9fd6adfffbf6eadbf88a6e01002e29b6b54f77bc57

See more details on using hashes here.

File details

Details for the file vidchain-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: vidchain-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 52.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.8

File hashes

Hashes for vidchain-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 afd5a3d51087f9c09a1279587cc862785273860598614d895c87226da48eda8f
MD5 f5a8a84bbf9e7466f257be9cfcfcb209
BLAKE2b-256 57a796d96ab895a264f86427359764e22e62c58efa7c9245492a10f4ca53b456

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page