Local-first vision-language pipeline inspired by VL-JEPA. Compress images, text, conversations, and RAG documents locally via Ollama before sending to any LLM API. Includes MCP server, FastAPI server, video processing, and more. ~80% token savings.
LatentGate
Process Locally. Send Smart. Pay Less.
A VL-JEPA-inspired pipeline that compresses images, text, conversations, and RAG documents locally via Ollama, then sends only compact semantic payloads to any LLM API — cutting token costs by ~80%.
Quick Start | Python API | REST API | AI Tool Integrations | Benchmarks | Contributing | Community
The Problem
Every time you send an image or long prompt to GPT-4o / Claude / Gemini, you burn 1,000+ tokens on processing that could happen locally for free.
Traditional: Image -> Cloud LLM (1,200 tokens) -> Answer
LatentGate: Image -> Local Ollama (FREE) -> Cloud LLM (200 tokens) -> Answer
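The headline saving follows directly from the two token counts in the comparison above. A minimal sketch of the arithmetic, where `token_savings` is a hypothetical helper for illustration and not part of the package:

```python
def token_savings(baseline_tokens: int, compressed_tokens: int) -> float:
    """Return the fraction of input tokens saved by sending the compressed payload."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - compressed_tokens / baseline_tokens


# Figures from the comparison above: 1,200 tokens for the raw image,
# 200 tokens after local compression.
saving = token_savings(1200, 200)
print(f"{saving:.1%}")  # 83.3%
```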
Features
| Feature | Description |
|---|---|
| Local-First | Vision and text compression runs on Ollama (free, no API key needed) |
| ~80% Token Savings | Send ~200 tokens instead of ~1,200 for image queries |
| MCP Server | Works with Claude Desktop, Cursor, Cline, Continue, Zed |
| Selective Decoding | For video, calls the API only when the scene changes, as judged by cosine similarity (~2.85x fewer calls) |
| Text Compression | Long prompts, conversations, RAG docs compressed locally |
| Speed Optimized | Connection pooling, model preloading, parallel processing |
| Multi-Provider | OpenAI, Anthropic, Google, Groq, DeepSeek, Together, Azure, AWS Bedrock, Ollama, or any OpenAI-compatible endpoint |
| REST API | FastAPI server for web application integration |
| Video Processing | Direct video file input with automatic frame extraction |
| Cost Tracking | Persistent cost tracking with SQLite analytics and exportable reports |
| Async Support | Non-blocking async methods for FastAPI, aiohttp, etc. |
| Streaming Responses | Stream responses from remote LLMs |
| Config Persistence | YAML/TOML config files with environment variable overrides |
| Structured Logging | JSON-formatted logging with rotation and correlation IDs |
| Docker Support | Dockerfile and docker-compose for easy deployment |
| Plugin System | Custom processors for domain-specific compression |
| Multi-Language | Support for 30+ languages with automatic detection |
Quick Start
Install
# Core install
pip install latent-gate
# Extras are quoted so that shells such as zsh do not treat the brackets as a glob pattern
# With MCP server (for Claude Desktop, Cursor, Cline, etc.)
pip install "latent-gate[mcp]"
# With API server (for web applications)
pip install "latent-gate[api]"
# With video processing
pip install "latent-gate[video]"
# With embedding-based similarity (more accurate selective decoding)
pip install "latent-gate[embeddings]"
# With LangChain integration
pip install "latent-gate[langchain]"
# With AWS Bedrock support
pip install "latent-gate[bedrock]"
# With all features
pip install "latent-gate[all]"
Pull Ollama Models
ollama pull llava:7b # Vision model (required for image queries)
ollama pull llama3:8b # Text model (required for text compression & prediction)
CLI Usage
# Image query
python -m latent_gate photo.jpg "What is in this image?" --provider ollama -v
# Text compression
python -m latent_gate --text "Your long prompt here..." --provider ollama -v
# Text from file
python -m latent_gate --text-file prompt.txt --provider openai -v
# Image + Text combined
python -m latent_gate photo.jpg "Analyze" --text "Extra context..." -v
# Full JSON output
python -m latent_gate photo.jpg "Describe" --json -v
# Start API server (requires: pip install latent-gate[api])
latent-gate-api
Python API
Image Query
from latent_gate import LatentGatePipeline, PipelineConfig
config = PipelineConfig(
    vision_model="llava:7b",
    predictor_model="llama3:8b",
    remote_provider="openai",
    remote_model="gpt-4o-mini",
)
with LatentGatePipeline(config) as pipeline:
    result = pipeline.query("photo.jpg", "What is in this image?")
    print(result["answer"])
    print(f"Tokens sent: ~{result['tokens_estimated']}")
    print(f"Timing: {result['timing']}")
Text Compression
# Long prompt compression
result = pipeline.query_text("Your 500-word prompt here...", mode="auto")
# Conversation history compression
messages = [
    {"role": "user", "content": "Help me with Kubernetes setup"},
    {"role": "assistant", "content": "Sure! What's your target configuration?"},
    {"role": "user", "content": "3 nodes, t3.large, us-east-1 with autoscaling"},
]
result = pipeline.query_conversation(messages, "Now give me the setup commands")
# RAG document compression
documents = ["doc1 text...", "doc2 text...", "doc3 text..."]
result = pipeline.query_documents(documents, "How do I implement JWT refresh?")
# Universal (auto-detect input type)
result = pipeline.query_universal(text="Explain this code...", image="screenshot.png")
Batch Processing
# Sequential with selective decoding (skips redundant API calls)
results = pipeline.query_batch(image_paths, "Describe each scene")
# Parallel processing
results = pipeline.query_batch(image_paths, "Describe each scene", parallel=True, max_workers=4)
# Text batch
results = pipeline.query_batch_texts(text_list, question="Summarize each")
Streaming
# Stream image query
for token in pipeline.query_stream("photo.jpg", "Describe this"):
    print(token, end="", flush=True)
# Stream text query
for token in pipeline.query_text_stream("Long prompt...", mode="compress"):
    print(token, end="", flush=True)
REST API
Start Server
# Default (0.0.0.0:8000)
latent-gate-api
# Custom host/port
# Linux/macOS:
LATENTGATE_HOST=127.0.0.1 LATENTGATE_PORT=9000 latent-gate-api
# Windows PowerShell:
$env:LATENTGATE_HOST="127.0.0.1"; $env:LATENTGATE_PORT="9000"; latent-gate-api
# Windows CMD:
set "LATENTGATE_HOST=127.0.0.1" && set "LATENTGATE_PORT=9000" && latent-gate-api
Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check (Ollama connection status) |
| GET | `/stats` | Session usage statistics |
| POST | `/query/image` | Image query |
| POST | `/query/text` | Text compression |
| POST | `/query/conversation` | Conversation compression |
| POST | `/query/documents` | RAG document compression |
| POST | `/query/universal` | Auto-detect input type |
| POST | `/query/image/upload` | Upload image for query |
Example Requests
import requests
# Image query
response = requests.post("http://localhost:8000/query/image", json={
    "image_path": "photo.jpg",
    "question": "What is in this image?"
})
# Text query
response = requests.post("http://localhost:8000/query/text", json={
    "text": "Your long prompt here...",
    "question": "Summarize this",
    "mode": "auto"  # auto | compress | summarize | condense | code
})
# Health check
response = requests.get("http://localhost:8000/health")
print(response.json()) # {"status": "healthy", "ollama_connected": true, ...}
Async Support
import asyncio
from latent_gate import AsyncLatentGatePipeline, PipelineConfig
async def main():
    async with AsyncLatentGatePipeline() as pipeline:
        # Single queries
        result = await pipeline.query("photo.jpg", "What is this?")
        result = await pipeline.query_text("Long prompt...")
        # Concurrent batch processing
        results = await pipeline.query_many_images(
            ["img1.jpg", "img2.jpg", "img3.jpg"],
            "Describe each image",
            max_concurrent=3,
        )
asyncio.run(main())
Video Processing
from latent_gate import LatentGatePipeline, PipelineConfig, VideoProcessor, VideoConfig
config = PipelineConfig(
    vision_model="llava:7b",
    remote_provider="ollama",
    remote_model="llama3:8b",
)
video_config = VideoConfig(
    fps=1.0,           # Extract 1 frame per second
    max_frames=100,    # Max frames to process
    quality=95,        # JPEG quality
    resize_width=640,  # Resize frames (saves processing time)
)
with VideoProcessor(config, video_config) as processor:
    result = processor.process_video("video.mp4", "Describe the action")
    print(f"Frames processed: {result['total_frames']}")
    print(f"Unique scenes: {result['statistics']['unique_scenes']}")
    print(f"Skip rate: {result['statistics']['skip_rate']}")
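The skip rate reported above comes from selective decoding: a frame is sent to the remote model only when it differs enough from what was already decoded. The sketch below illustrates the idea with plain cosine similarity over per-frame vectors. It is not the package's implementation: `select_frames`, the toy vectors, and the choice to compare against the last decoded frame (rather than the immediately preceding one) are assumptions made for illustration. The 0.85 default mirrors the `similarity_threshold` value used in the config example in this README.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(embeddings: Sequence[Sequence[float]], threshold: float = 0.85) -> List[int]:
    """Return indices of frames that differ enough from the last decoded frame."""
    selected: List[int] = []
    last: Optional[Sequence[float]] = None
    for index, embedding in enumerate(embeddings):
        # Decode the first frame, and any frame whose similarity drops below the threshold.
        if last is None or cosine_similarity(last, embedding) < threshold:
            selected.append(index)
            last = embedding
    return selected


# Toy vectors (hypothetical): frames 0-1 show one scene, frames 2-3 another.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
print(select_frames(frames))  # [0, 2]
```

Raising the threshold makes the filter stricter about what counts as "the same scene", so more frames are decoded; lowering it skips more aggressively.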
Configuration
Config File
# latentgate.yaml
ollama_base_url: http://localhost:11434
vision_model: llava:7b
predictor_model: llama3:8b
remote_provider: openai
remote_model: gpt-4o-mini
selective_decoding: true
similarity_threshold: 0.85
use_embeddings: true
enable_caching: true
temperature: 0.1
request_timeout: 120
from latent_gate import get_config, LatentGatePipeline
config = get_config("latentgate.yaml")
with LatentGatePipeline(config) as pipeline:
    result = pipeline.query("photo.jpg", "Describe this")
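The `enable_caching` option in the config file above turns on a content-hash disk cache for repeated images. The sketch below shows one way such a cache could work; the key layout, the JSON file format, and the names `content_key` and `cached_query` are assumptions for illustration, not the package's actual code.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def content_key(image_bytes: bytes, question: str) -> str:
    """Hash the image content and the question so identical inputs share a key."""
    digest = hashlib.sha256()
    digest.update(image_bytes)
    digest.update(b"\x00")  # separator between the image bytes and the question
    digest.update(question.encode("utf-8"))
    return digest.hexdigest()


def cached_query(cache_dir: Path, image_bytes: bytes, question: str,
                 compute: Callable[[], dict]) -> dict:
    """Return the cached result for this input, computing and storing it on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{content_key(image_bytes, question)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))
    result = compute()
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result


# Demo with a stand-in for the expensive local vision call.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = cached_query(cache, b"fake-image", "What is this?", lambda: {"answer": "a cat"})
    second = cached_query(cache, b"fake-image", "What is this?", lambda: {"answer": "unused"})
    print(first == second)  # True
```

Keying on the file's bytes rather than its path means a renamed or copied image still hits the cache, while an edited image does not.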
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | - |
| `ANTHROPIC_API_KEY` | Anthropic API key | - |
| `GOOGLE_API_KEY` | Google API key | - |
| `LATENTGATE_REMOTE_PROVIDER` | Override remote provider | openai |
| `LATENTGATE_REMOTE_MODEL` | Override remote model | gpt-4o-mini |
| `LATENTGATE_VISION_MODEL` | Override vision model | llava:7b |
| `LATENTGATE_LOG_LEVEL` | Log level | INFO |
| `LATENTGATE_LOG_FILE` | Log file path | - |
| `LATENTGATE_LOG_JSON` | JSON log format | false |
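Environment variables override values from the config file. The sketch below illustrates that precedence for the three `LATENTGATE_*` override variables in the table. The mapping from variable names to config keys is inferred from the names, and `apply_env_overrides` is a hypothetical helper, not the package's loader.

```python
import os
from typing import Dict, Mapping, Optional

# Config key -> environment variable (mapping inferred from the table above).
ENV_OVERRIDES = {
    "remote_provider": "LATENTGATE_REMOTE_PROVIDER",
    "remote_model": "LATENTGATE_REMOTE_MODEL",
    "vision_model": "LATENTGATE_VISION_MODEL",
}


def apply_env_overrides(file_values: Mapping[str, str],
                        environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the file values with any non-empty environment overrides applied."""
    environ = os.environ if environ is None else environ
    merged = dict(file_values)
    for key, variable in ENV_OVERRIDES.items():
        value = environ.get(variable)
        if value:  # unset or empty variables leave the file value untouched
            merged[key] = value
    return merged


settings = apply_env_overrides(
    {"remote_provider": "openai", "remote_model": "gpt-4o-mini"},
    environ={"LATENTGATE_REMOTE_PROVIDER": "anthropic"},
)
print(settings)  # {'remote_provider': 'anthropic', 'remote_model': 'gpt-4o-mini'}
```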
Save Config
from latent_gate import PipelineConfig, save_config
config = PipelineConfig(remote_provider="anthropic", remote_model="claude-sonnet-4-20250514")
save_config(config, "my_config.yaml")
Docker
# Start with Docker Compose (includes Ollama)
docker-compose up -d
# Or build and run manually
docker build -t latent-gate .
docker run -p 8000:8000 latent-gate
The docker-compose setup includes:
- latent-gate API server (port 8000)
- Ollama local LLM server (port 11434)
- ollama-init container that auto-pulls required models
AI Coding Tool Integration (MCP)
LatentGate runs as a Model Context Protocol (MCP) server for the AI coding tools listed below. Your AI assistant can then compress images, long prompts, and documents before they reach the remote model, saving ~80% on tokens.
Supported Tools
| Tool | Status | Setup |
|---|---|---|
| VS Code / Copilot | Supported | Extension |
| Claude Desktop | Supported | MCP Config |
| Claude Code (CLI) | Supported | Skill |
| Cursor | Supported | MCP Config |
| Cline (VS Code) | Supported | MCP Config |
| Continue.dev | Supported | MCP Config |
| Zed Editor | Supported | MCP Config |
VS Code Extension
code --install-extension KathanModh259.latent-gate-vscode
Features:
- Right-click any image to compress with LatentGate
- Select text and press Ctrl+Shift+Alt+C to compress
- Cost dashboard in activity bar
- Auto-configures MCP for Copilot Chat
- Status bar showing token savings
MCP Setup
pip install "latent-gate[mcp]"
ollama pull llava:7b
ollama pull llama3:8b
Add to your tool's MCP config:
{
  "mcpServers": {
    "latent-gate": {
      "command": "python",
      "args": ["-m", "latent_gate.mcp_server"]
    }
  }
}
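If your tool already has an MCP config with other servers in it, the entry above needs to be merged in rather than pasted over the file. A minimal sketch of that merge follows, assuming the tool uses the `mcpServers` layout shown above. The config file location differs per tool and is left to the caller, and `add_mcp_server` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# The server entry from the JSON snippet above.
LATENT_GATE_ENTRY = {"command": "python", "args": ["-m", "latent_gate.mcp_server"]}


def add_mcp_server(config_path: Path) -> dict:
    """Add the latent-gate entry to an MCP config file, keeping existing servers."""
    config: dict = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})["latent-gate"] = LATENT_GATE_ENTRY
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo against a throwaway file that already lists another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp_config.json"
    path.write_text('{"mcpServers": {"other": {"command": "node"}}}', encoding="utf-8")
    merged = add_mcp_server(path)
    print(sorted(merged["mcpServers"]))  # ['latent-gate', 'other']
```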
MCP Tools
| Tool | When AI Uses It |
|---|---|
| `compress_image` | Before analyzing any image |
| `compress_text` | For prompts longer than ~500 tokens |
| `compress_conversation` | When chat history is large |
| `compress_documents` | For RAG queries |
| `get_stats` | To check session savings |
See integrations/ folder for detailed setup guides per tool.
Speed Optimizations
| Optimization | What It Does | Impact |
|---|---|---|
| Connection Pooling | Reuses HTTP connections via `requests.Session` | ~30-50% faster per call |
| Model Preloading | Warms up Ollama models on init (`keep_alive`) | Eliminates 5-15s cold start |
| Shorter Prompts | Optimized extraction prompts produce fewer output tokens | ~20% faster generation |
| 3-Tier JSON Parsing | Fast parse, extract from text, LLM fallback | Avoids slow LLM call 90% of time |
| Parallel Processing | Image and text processed simultaneously via ThreadPool | ~40% faster combined queries |
| Content-Hash Caching | Disk cache for repeated images | Instant on cache hit |
| Selective Decoding | Cosine similarity skips redundant API calls | ~2.85x fewer calls |
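The 3-tier JSON parsing row describes a fallback chain: try a direct parse, then extract a JSON object from surrounding text, and only then fall back to a model call. The sketch below shows one way to structure such a chain. It is an illustration under assumptions, not the package's code: `parse_json_tiered` and the `llm_repair` callback are hypothetical names, and the regex-based extraction is one possible choice for the second tier.

```python
import json
import re
from typing import Callable, Optional


def parse_json_tiered(raw: str, llm_repair: Optional[Callable[[str], str]] = None) -> dict:
    """Parse a model reply into a dict, using progressively slower strategies."""
    # Tier 1: the reply is already a valid JSON object.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Tier 2: take the span from the first "{" to the last "}" and parse that,
    # which handles replies wrapped in prose or code fences.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass

    # Tier 3 (slow path): ask a model to rewrite the reply as JSON.
    if llm_repair is not None:
        return json.loads(llm_repair(raw))
    raise ValueError("could not extract a JSON object from the reply")


print(parse_json_tiered('Here you go: {"objects": ["cat"]} Hope it helps'))
# {'objects': ['cat']}
```

Because the first two tiers are pure string handling, the expensive third tier runs only when both of them fail.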
Cost Benchmarks
Image Queries (by provider)
| Provider | Raw Image Tokens | LatentGate Tokens | Savings |
|---|---|---|---|
| OpenAI GPT-4o (high detail) | ~1,105 | ~150 | ~86% |
| Claude 3.5 Sonnet (1MP image) | ~1,334 | ~150 | ~89% |
| Gemini 2.0 Flash | ~258 | ~150 | ~42% |
Text and Other Modes
| Scenario | Traditional | LatentGate | Savings |
|---|---|---|---|
| Long text prompt | ~800 | ~120 | ~85% |
| Conversation (10 turns) | ~2,500 | ~350 | ~86% |
| RAG documents (3 docs) | ~3,000 | ~450 | ~85% |
| Video stream (1 min)* | varies | ~2.85x fewer calls | ~65% |
*With selective decoding
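The ~65% figure in the video row is the "2.85x fewer calls" factor expressed as a percentage. A quick check of that conversion, with `call_reduction` as a hypothetical helper:

```python
def call_reduction(fewer_calls_factor: float) -> float:
    """Convert an 'N times fewer calls' factor into the fraction of calls avoided."""
    if fewer_calls_factor < 1:
        raise ValueError("factor must be at least 1")
    return 1.0 - 1.0 / fewer_calls_factor


print(f"{call_reduction(2.85):.0%}")  # 65%
```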
At Scale (10,000 image queries with gpt-4o-mini)
| Metric | Traditional | LatentGate | Savings |
|---|---|---|---|
| Input tokens | 12,000,000 | 2,000,000 | 10M tokens |
| Cost | $1.80 | $0.30 | $1.50 (83%) |
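The table's figures can be reproduced from the per-query token counts. The sketch below assumes an input price of $0.15 per million tokens, which is the rate implied by the table ($1.80 for 12M tokens). Check your provider's current pricing before relying on it. `input_cost` is a hypothetical helper and counts input tokens only.

```python
# USD per 1M input tokens; rate implied by the table above, not an official quote.
PRICE_PER_MILLION_INPUT_TOKENS = 0.15


def input_cost(queries: int, tokens_per_query: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost in USD for a number of queries of equal size."""
    return queries * tokens_per_query * price_per_million / 1_000_000


traditional = input_cost(10_000, 1_200)  # raw image sent to the remote model
compressed = input_cost(10_000, 200)     # compact payload sent instead
saved = traditional - compressed
print(f"${traditional:.2f} -> ${compressed:.2f} (saves ${saved:.2f}, {saved / traditional:.0%})")
# $1.80 -> $0.30 (saves $1.50, 83%)
```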
Cost Tracking
from latent_gate import CostTracker
tracker = CostTracker()
tracker.record_usage(
    query_type="image",
    provider="openai",
    model="gpt-4o-mini",
    input_tokens=150,
    output_tokens=200,
    tokens_saved=1000,
    compression_ratio=6.7,
    latency_ms=1500,
)
# Session statistics
stats = tracker.get_session_statistics()
print(f"Total cost: ${stats['total_cost']:.4f}")
print(f"Tokens saved: {stats['total_tokens_saved']}")
# Cost projection
projection = tracker.get_cost_projection(
    daily_queries=1000,
    provider="openai",
    model="gpt-4o-mini"
)
print(f"Monthly savings: ${projection['savings']['monthly']:.2f}")
# Export report
tracker.export_report("usage_report.json", fmt="json")
tracker.export_report("usage_report.csv", fmt="csv")
Multi-Language Support
from latent_gate import detect_language, MultiLanguageProcessor
# Detect language
lang = detect_language("Esto es un texto en español")
print(f"Detected: {lang.name} ({lang.confidence:.0%})")
# Process with auto-translation to English
processor = MultiLanguageProcessor()
text, lang_info = processor.process("Texto en español para analizar")
print(f"Language: {lang_info.name}, Translated: {text[:100]}...")
Project Structure
latent-gate/
├── latent_gate/
│ ├── __init__.py # Package exports and version
│ ├── config.py # PipelineConfig dataclass
│ ├── config_loader.py # YAML/TOML/JSON config loading
│ ├── payload.py # SemanticPayload (compact representation)
│ ├── text_processor.py # TextPayload + TextProcessor (local compression)
│ ├── local_processor.py # X-Encoder + Predictor (Ollama vision pipeline)
│ ├── remote_decoder.py # Y-Decoder (OpenAI, Anthropic, Google, Ollama)
│ ├── selective_decoder.py # Cosine/Jaccard similarity for skip decisions
│ ├── fast_client.py # Connection pooling + model preloading
│ ├── cache.py # Content-hash disk cache
│ ├── pipeline.py # LatentGatePipeline (main orchestrator)
│ ├── async_pipeline.py # AsyncLatentGatePipeline
│ ├── video_processor.py # Video frame extraction + batch processing
│ ├── cost_tracker.py # SQLite-based cost analytics
│ ├── mcp_server.py # MCP server (Model Context Protocol)
│ ├── api_server.py # FastAPI REST server
│ ├── cli.py # Command-line interface
│ ├── logging_config.py # Structured logging with rotation
│ ├── plugin_system.py # Custom processor plugins
│ └── multilang.py # Multi-language detection and translation
├── integrations/
│ ├── mcp_server/ # Standalone MCP server
│ ├── claude_code_skill/ # Claude Code skill + scripts
│ ├── cursor/ # Cursor MCP config
│ ├── continue_dev/ # Continue.dev config
│ └── openai_functions/ # OpenAI/Anthropic function schemas
├── examples/
│ ├── basic_usage.py
│ ├── text_compression.py
│ ├── advanced_features.py
│ ├── video_streaming.py
│ └── ...
├── tests/ # 62 tests (unit + integration)
├── vscode-extension/ # VS Code extension source
├── docs/
├── .github/workflows/ # CI + publish workflows
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── requirements.txt
Community
- Discord — Ask questions, share projects, get help
- GitHub Discussions — Feature requests, Q&A, showcases
- Twitter/X — Follow for updates and tips
- Awesome Lists — Found in awesome-mcp, awesome-ollama, awesome-local-ai
Contributing
Contributions welcome! See CONTRIBUTING.md.
Development Setup
git clone https://github.com/KathanModh259/latent-gate.git
cd latent-gate
python -m venv .venv
source .venv/bin/activate # Linux/macOS
.venv\Scripts\Activate.ps1 # Windows
pip install -r requirements.txt
pip install -r requirements-dev.txt
Run Tests
pytest tests/ -v
Priority Areas
- Additional vision model support (Florence-2, InternVL, Qwen-VL)
- Custom similarity plugins for domain-specific use cases
- WebSocket support for real-time streaming
- Advanced cost analytics and optimization suggestions
- Plugin development for specialized industries
- Test coverage improvements
- Documentation and examples
Citation
@software{latentgate2026,
  author = {Kathan Modh},
  title = {LatentGate: Local-First Vision-Language Pipeline Inspired by VL-JEPA},
  year = {2026},
  version = {1.0.0},
  url = {https://github.com/KathanModh259/latent-gate}
}
Inspired by VL-JEPA (Meta FAIR, 2025).
License
MIT License — see LICENSE.
Built by Kathan Modh
Process locally. Send smart. Pay less.