VidChain: The "LangChain for Videos"

Edge-optimized, local-first multimodal RAG framework for video intelligence — compose modular nodes into custom pipelines, deploy as a microservice, or query with a conversational AI.

Overview

VidChain v0.5.0 is a modular, composable framework for on-device multimodal video understanding. Inspired by LangChain's node-based design, it lets developers snap together processing components — Vision, Audio, OCR, VLM — into custom pipelines that run entirely on your local GPU.

At its heart is B.A.B.U.R.A.O. (Behavioral Analysis & Broadcasting Unit for Real-time Artificial Observation) — a conversational AI copilot that translates raw sensor logs into human-readable narratives using abductive reasoning.


What's New in v0.5.0 🚀

Composable Node Architecture

VidChain now works like LangChain — build your own pipelines by snapping together modular nodes:

from vidchain import VidChain
from vidchain.pipeline import VideoChain
from vidchain.nodes import YoloNode, WhisperNode, OcrNode, AdaptiveKeyframeNode
from vidchain.nodes import LlavaNode  # New: Vision Language Model node

# Build a fully custom pipeline
my_chain = VideoChain(nodes=[
    AdaptiveKeyframeNode(change_threshold=5.0),  # Skip identical frames
    LlavaNode(model_name="moondream"),           # Deep scene captioning
    WhisperNode(),                               # Speech transcription
    OcrNode(),                                   # Screen text extraction
])

vc = VidChain()
video_id = vc.ingest("surveillance.mp4", chain=my_chain)
print(vc.ask("Was anyone at the desk?"))

VLM Vision Node (LlavaNode)

Replace blind YOLO object tags with rich, contextual scene descriptions powered by a local Vision Language Model:

  • Before (YOLO): "1 person, 1 laptop"
  • After (LlavaNode): "A person is typing Python code in VS Code. A terminal window is open showing a running script. The screen displays a file explorer with project files visible."

Supports any Ollama-compatible VLM (recommended: moondream for speed, llava:7b for detail).
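
Switching models is just a constructor argument; for example (assuming the corresponding models have already been pulled in Ollama):

from vidchain.nodes import LlavaNode

# Fast, low-VRAM captioning (the recommended default)
fast_vlm = LlavaNode(model_name="moondream")

# Richer descriptions at a higher VRAM cost
detailed_vlm = LlavaNode(model_name="llava:7b")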

Adaptive Keyframe Firewall

The AdaptiveKeyframeNode acts as a compute firewall. It computes a Gaussian-blurred frame delta to detect visual change — identical frames are instantly rejected before reaching heavy models like YOLO or LLaVA, dramatically reducing GPU load.
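
Conceptually, each incoming frame is compared against the last accepted keyframe. The sketch below illustrates that delta test with OpenCV; it is an approximation of the idea, not VidChain's actual implementation (the blur kernel size and mean-delta thresholding are assumptions):

import cv2

def is_keyframe(prev_gray, frame_bgr, change_threshold=5.0):
    """Return (changed, gray): True when the frame differs enough to keep."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)  # blur suppresses pixel-level noise
    if prev_gray is None:
        return True, gray  # always keep the very first frame
    delta = cv2.absdiff(prev_gray, gray)  # per-pixel absolute change
    return float(delta.mean()) > change_threshold, gray

# Usage inside a frame loop:
#   changed, prev = is_keyframe(prev, frame)
#   if changed: run_heavy_models(frame)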

FastAPI Edge Server (vidchain-serve)

Deploy VidChain as a local microservice accessible from any app or language:

# Terminal 1: Start the Edge Server
vidchain-serve

# Terminal 2: Ingest + query via the REST API (PowerShell shown)
Invoke-RestMethod -Uri "http://localhost:8000/api/ingest" -Method Post -ContentType "application/json" -Body '{"video_source": "sample.mp4"}'
Invoke-RestMethod -Uri "http://localhost:8000/api/query" -Method Post -ContentType "application/json" -Body '{"query": "Summarize the video"}'

An interactive Swagger UI is available at http://localhost:8000/docs.
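
The same endpoints work from any HTTP client. A minimal Python sketch using requests (the response schema is not documented here, so the example just prints the returned JSON):

import requests

BASE = "http://localhost:8000"

# Index a local video file (same JSON body as the PowerShell call above)
requests.post(f"{BASE}/api/ingest", json={"video_source": "sample.mp4"}).raise_for_status()

# Ask a question about the indexed footage
resp = requests.post(f"{BASE}/api/query", json={"query": "Summarize the video"})
print(resp.json())  # exact response fields: see the Swagger UI at /docs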


Installation

pip install vidchain

# GPU-accelerated PyTorch (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --force-reinstall

# For LlavaNode (VLM support)
# Install Ollama: https://ollama.com
ollama pull moondream   # Fast edge VLM (~1.7GB, fits 4GB VRAM)
ollama pull llava       # High quality VLM (~4.7GB, requires 8GB+ VRAM)

Run python scripts/check_gpu.py to verify CUDA is detected.
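
If the repo's script isn't on hand, the same check can be done directly with PyTorch:

import torch

# True means the CUDA build of PyTorch can see your GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))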


Quick Start

Python API (Library)

from vidchain import VidChain

# Initialize
vc = VidChain(config={
    "llm_provider": "ollama/llama3",   # Fully offline
    "db_path": "./vidchain_storage"
})

# Ingest a video (uses the legacy YOLO pipeline by default)
video_id = vc.ingest("surveillance.mp4")

# Query
print(vc.ask("what happened in the video?"))
print(vc.ask("was anyone acting suspiciously?"))

# Multi-video: scope query to a specific video
vc.ingest("cam1.mp4", video_id="cam1")
vc.ingest("cam2.mp4", video_id="cam2")
print(vc.ask("did anyone enter the room?", video_id="cam1"))

Composable Node Pipeline

from vidchain import VidChain
from vidchain.pipeline import VideoChain
from vidchain.nodes import AdaptiveKeyframeNode, LlavaNode, WhisperNode

# Build a VLM-powered pipeline with adaptive keyframing
chain = VideoChain(
    nodes=[
        AdaptiveKeyframeNode(change_threshold=5.0),
        LlavaNode(model_name="moondream"),
        WhisperNode(),
    ],
    frame_skip=15  # sample every 15th frame (~2 FPS on 30 FPS video)
)

vc = VidChain()
vc.ingest("video.mp4", chain=chain)
print(vc.ask("describe what is on the screen"))

CLI

# Analyze and chat
vidchain-analyze video.mp4

# Single-shot query
vidchain-analyze video.mp4 --query "what happened at the desk?"

# Offline with Ollama
vidchain-analyze video.mp4 --llm ollama/llama3

# Start Edge API Server
vidchain-serve

# Train Custom Action Engine
vidchain-train

Available Nodes

  • YoloNode: YOLOv8 object detection — outputs class labels and counts
  • WhisperNode: Whisper speech-to-text transcription
  • OcrNode: EasyOCR screen text extraction (triggered on readable surfaces)
  • ActionNode: MobileNetV3 action intent classification (NORMAL/SUSPICIOUS/VIOLENCE)
  • LlavaNode: Ollama VLM node — deep contextual scene captioning (NEW in v0.5.0)
  • AdaptiveKeyframeNode: Frame-delta firewall — skips visually identical frames (NEW in v0.5.0)
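
Because every node is independent, chains can be tailored to the hardware budget. A sketch of a lighter, detection-oriented chain that skips the VLM entirely (constructor defaults are assumed for YoloNode and OcrNode):

from vidchain import VidChain
from vidchain.pipeline import VideoChain
from vidchain.nodes import AdaptiveKeyframeNode, YoloNode, OcrNode

# Gate compute first, then run only the lightweight models
light_chain = VideoChain(nodes=[
    AdaptiveKeyframeNode(change_threshold=5.0),
    YoloNode(),  # object tags instead of full VLM captions
    OcrNode(),   # on-screen text when readable
])

vc = VidChain()
vc.ingest("meeting.mp4", chain=light_chain)
print(vc.ask("what objects appeared on the desk?"))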

Core Pipeline (Legacy)

Video → WAV Extraction → Whisper ASR → Frame Loop →
  ├── YOLO (Objects)
  ├── MobileNetV3 (Action)
  ├── EasyOCR (Screen Text)
  ├── DeepFace (Emotion, threaded)
  └── TemporalTracker (Object Persistence + Camera Motion)
→ Semantic Fusion → ChromaDB → B.A.B.U.R.A.O. RAG

Tech Stack

  • Object Detection: YOLOv8s (Ultralytics)
  • VLM Vision: LLaVA / Moondream (via Ollama) — NEW
  • Action Classification: MobileNetV3 (custom fine-tuned)
  • Speech Recognition: OpenAI Whisper (base)
  • OCR: EasyOCR
  • Emotion Analysis: DeepFace (opencv backend)
  • Temporal Tracking: IoU tracker + Lucas-Kanade optical flow
  • Embedder: BAAI/bge-base-en-v1.5
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
  • Vector Store: ChromaDB (persistent)
  • LLM Routing: LiteLLM (ollama/llama3 default, Gemini supported)
  • Edge API: FastAPI + Uvicorn — NEW
  • GPU Runtime: CUDA 12.1 (4GB+ VRAM, RTX 30-series tested)

Developer Utilities

# List all indexed videos
vc.list_indexed_videos()

# Generate a narrative summary
vc.summarize_video(video_id, depth="concise")  # or "detailed"

# Hot-swap LLM
vc.set_llm("ollama/llama3")

# Purge a specific video
vc.purge_storage(video_id="cam1")

# Purge everything
vc.purge_storage()
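
Because routing goes through LiteLLM, the Gemini support listed in the tech stack should follow LiteLLM's usual conventions; a hedged sketch (the model id and environment variable are LiteLLM conventions, not confirmed for VidChain):

import os

# LiteLLM convention: gemini/* models read GEMINI_API_KEY from the environment
os.environ["GEMINI_API_KEY"] = "your-key-here"

vc.set_llm("gemini/gemini-1.5-flash")  # model id follows LiteLLM naming (assumption)
print(vc.ask("Summarize the video"))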

Roadmap

  • Dual-Brain Vision Engine — YOLO + MobileNetV3 (v0.2.0)
  • CLIP scene understanding — zero-shot environment classification (v0.3.0)
  • Adaptive audio filtering — energy gating, anomaly detection (v0.3.0)
  • Multi-video scoped queries (v0.3.0)
  • Composable Node Architecture — LangChain-style pipelines (v0.5.0)
  • VLM Node — LLaVA/Moondream contextual captioning (v0.5.0)
  • Adaptive Keyframe Firewall — GPU compute optimization (v0.5.0)
  • FastAPI Edge Microservice — vidchain-serve (v0.5.0)
  • GraphRAG — temporal entity tracking with NetworkX (v0.6.0)
  • VidChain Studio — native desktop application (v0.6.0)
  • Real-time streaming — live camera ingestion

Contributing

Contributions, issues, and feature requests are welcome. Open a GitHub issue or submit a pull request.


Author

Rahul Sharma — B.Tech CSE, IIIT Manipur

License

Distributed under the MIT License.
