
Local-first vision-language pipeline inspired by VL-JEPA. Compress images, text, conversations, and RAG documents locally via Ollama before sending to any LLM API. Includes MCP server, FastAPI server, video processing, and more. ~80% token savings.


LatentGate

Process Locally. Send Smart. Pay Less.

A VL-JEPA-inspired pipeline that compresses images, text, conversations, and RAG documents locally via Ollama, then sends only compact semantic payloads to any LLM API — cutting token costs by ~80%.


Quick Start | Python API | REST API | AI Tool Integrations | Benchmarks | Contributing | Community


The Problem

Every time you send an image or long prompt to GPT-4o / Claude / Gemini, you burn 1,000+ tokens on processing that could happen locally for free.

Traditional:  Image -> Cloud LLM (1,200 tokens) -> Answer
LatentGate:   Image -> Local Ollama (FREE) -> Cloud LLM (200 tokens) -> Answer

Features

Feature | Description
Local-First | Vision and text compression runs on Ollama (free, no API key needed)
~80% Token Savings | Send ~200 tokens instead of ~1,200 for image queries
MCP Server | Works with Claude Desktop, Cursor, Cline, Continue, Zed
Selective Decoding | For video, calls the API only when the scene changes, as judged by cosine similarity (~2.85x fewer calls)
Text Compression | Long prompts, conversations, and RAG docs are compressed locally
Speed Optimized | Connection pooling, model preloading, parallel processing
Multi-Provider | OpenAI, Anthropic, Google, Groq, DeepSeek, Together, Azure, AWS Bedrock, Ollama, or any OpenAI-compatible endpoint
REST API | FastAPI server for web application integration
Video Processing | Direct video file input with automatic frame extraction
Cost Tracking | Persistent cost tracking with SQLite analytics and exportable reports
Async Support | Non-blocking async methods for FastAPI, aiohttp, etc.
Streaming Responses | Stream responses from remote LLMs
Config Persistence | YAML/TOML config files with environment variable overrides
Structured Logging | JSON-formatted logging with rotation and correlation IDs
Docker Support | Dockerfile and docker-compose for easy deployment
Plugin System | Custom processors for domain-specific compression
Multi-Language | Support for 30+ languages with automatic detection
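
The Selective Decoding row can be illustrated with a minimal sketch. It assumes frame descriptions are compared as bag-of-words vectors and that a frame is skipped when its cosine similarity to the last frame sent reaches 0.85 (the similarity_threshold value in the sample config). The names cosine_similarity and should_skip are hypothetical and are not LatentGate's API; the library's own comparison may use embeddings instead.

```python
import math
from collections import Counter
from typing import Optional


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def should_skip(previous: Optional[str], current: str, threshold: float = 0.85) -> bool:
    """Skip the remote call when the scene looks unchanged (hypothetical helper)."""
    return previous is not None and cosine_similarity(previous, current) >= threshold


descriptions = [
    "a red car parked on a street",
    "a red car parked on a street",   # unchanged scene, should be skipped
    "a dog running on a beach",       # new scene, should be sent
]
last_sent = None
calls = 0
for text in descriptions:
    if not should_skip(last_sent, text):
        calls += 1        # a real pipeline would call the remote API here
        last_sent = text
print(calls)  # 2
```

Comparing against the last frame actually sent, rather than the immediately preceding frame, keeps slow gradual drift from being skipped indefinitely.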

Quick Start

Install

# Core install
pip install latent-gate

# Extras are quoted so that shells such as zsh do not treat [..] as a glob pattern

# With MCP server (for Claude Desktop, Cursor, Cline, etc.)
pip install "latent-gate[mcp]"

# With API server (for web applications)
pip install "latent-gate[api]"

# With video processing
pip install "latent-gate[video]"

# With embedding-based similarity (more accurate selective decoding)
pip install "latent-gate[embeddings]"

# With LangChain integration
pip install "latent-gate[langchain]"

# With AWS Bedrock support
pip install "latent-gate[bedrock]"

# With all features
pip install "latent-gate[all]"

Pull Ollama Models

ollama pull llava:7b      # Vision model (required for image queries)
ollama pull llama3:8b     # Text model (required for text compression & prediction)

CLI Usage

# Image query
python -m latent_gate photo.jpg "What is in this image?" --provider ollama -v

# Text compression
python -m latent_gate --text "Your long prompt here..." --provider ollama -v

# Text from file
python -m latent_gate --text-file prompt.txt --provider openai -v

# Image + Text combined
python -m latent_gate photo.jpg "Analyze" --text "Extra context..." -v

# Full JSON output
python -m latent_gate photo.jpg "Describe" --json -v

# Start API server (requires: pip install "latent-gate[api]")
latent-gate-api

Python API

Image Query

from latent_gate import LatentGatePipeline, PipelineConfig

config = PipelineConfig(
    vision_model="llava:7b",
    predictor_model="llama3:8b",
    remote_provider="openai",
    remote_model="gpt-4o-mini",
)

with LatentGatePipeline(config) as pipeline:
    result = pipeline.query("photo.jpg", "What is in this image?")

    print(result["answer"])
    print(f"Tokens sent: ~{result['tokens_estimated']}")
    print(f"Timing: {result['timing']}")

Text Compression

# Long prompt compression
result = pipeline.query_text("Your 500-word prompt here...", mode="auto")

# Conversation history compression
messages = [
    {"role": "user", "content": "Help me with Kubernetes setup"},
    {"role": "assistant", "content": "Sure! What's your target configuration?"},
    {"role": "user", "content": "3 nodes, t3.large, us-east-1 with autoscaling"},
]
result = pipeline.query_conversation(messages, "Now give me the setup commands")

# RAG document compression
documents = ["doc1 text...", "doc2 text...", "doc3 text..."]
result = pipeline.query_documents(documents, "How do I implement JWT refresh?")

# Universal (auto-detect input type)
result = pipeline.query_universal(text="Explain this code...", image="screenshot.png")

Batch Processing

# Sequential with selective decoding (skips redundant API calls)
results = pipeline.query_batch(image_paths, "Describe each scene")

# Parallel processing
results = pipeline.query_batch(image_paths, "Describe each scene", parallel=True, max_workers=4)

# Text batch
results = pipeline.query_batch_texts(text_list, question="Summarize each")

Streaming

# Stream image query
for token in pipeline.query_stream("photo.jpg", "Describe this"):
    print(token, end="", flush=True)

# Stream text query
for token in pipeline.query_text_stream("Long prompt...", mode="compress"):
    print(token, end="", flush=True)

REST API

Start Server

# Default (0.0.0.0:8000)
latent-gate-api

# Custom host/port
# Linux/macOS:
LATENTGATE_HOST=127.0.0.1 LATENTGATE_PORT=9000 latent-gate-api

# Windows PowerShell:
$env:LATENTGATE_HOST="127.0.0.1"; $env:LATENTGATE_PORT="9000"; latent-gate-api

# Windows CMD:
set "LATENTGATE_HOST=127.0.0.1" && set "LATENTGATE_PORT=9000" && latent-gate-api

Endpoints

Method | Endpoint | Description
GET | /health | Health check (Ollama connection status)
GET | /stats | Session usage statistics
POST | /query/image | Image query
POST | /query/text | Text compression
POST | /query/conversation | Conversation compression
POST | /query/documents | RAG document compression
POST | /query/universal | Auto-detect input type
POST | /query/image/upload | Upload image for query

Example Requests

import requests

# Image query
response = requests.post("http://localhost:8000/query/image", json={
    "image_path": "photo.jpg",
    "question": "What is in this image?"
})

# Text query
response = requests.post("http://localhost:8000/query/text", json={
    "text": "Your long prompt here...",
    "question": "Summarize this",
    "mode": "auto"  # auto | compress | summarize | condense | code
})

# Health check
response = requests.get("http://localhost:8000/health")
print(response.json())  # {"status": "healthy", "ollama_connected": true, ...}
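
When the server is started from a script or container, it can help to wait for /health before sending queries. The sketch below polls the endpoint using only the standard library. It assumes the response carries a "status" field equal to "healthy", as in the comment above; wait_until_healthy and fetch_health are hypothetical helper names, not part of LatentGate.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 5,
    delay: float = 1.0,
    fetch: Callable[[str], dict] = fetch_health,
) -> bool:
    """Poll until the server reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except OSError:
            pass  # server not reachable yet (connection refused, timeout, ...)
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The fetch parameter is injectable so the retry logic can be exercised without a running server.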

Async Support

import asyncio
from latent_gate import AsyncLatentGatePipeline

async def main():
    async with AsyncLatentGatePipeline() as pipeline:
        # Single queries
        result = await pipeline.query("photo.jpg", "What is this?")
        result = await pipeline.query_text("Long prompt...")

        # Concurrent batch processing
        results = await pipeline.query_many_images(
            ["img1.jpg", "img2.jpg", "img3.jpg"],
            "Describe each image",
            max_concurrent=3,
        )

asyncio.run(main())
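
The max_concurrent argument suggests a cap on in-flight requests. One common way to build such a cap is an asyncio.Semaphore, sketched below. This is an illustration of the general technique, not necessarily how LatentGate implements it; run_bounded and fake_query are hypothetical names, and fake_query stands in for a real pipeline call.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent=3):
    """Run worker(item) for every item, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in input order, regardless of completion order
    return await asyncio.gather(*(guarded(item) for item in items))


state = {"in_flight": 0, "peak": 0}


async def fake_query(path):
    """Stand-in for a real pipeline call; records peak concurrency."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return f"described {path}"


paths = [f"img{i}.jpg" for i in range(6)]
results = asyncio.run(run_bounded(paths, fake_query, max_concurrent=3))
print(results[0], state["peak"])
```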

Video Processing

from latent_gate import PipelineConfig, VideoProcessor, VideoConfig

config = PipelineConfig(
    vision_model="llava:7b",
    remote_provider="ollama",
    remote_model="llama3:8b",
)

video_config = VideoConfig(
    fps=1.0,            # Extract 1 frame per second
    max_frames=100,     # Max frames to process
    quality=95,         # JPEG quality
    resize_width=640,   # Resize frames (saves processing time)
)

with VideoProcessor(config, video_config) as processor:
    result = processor.process_video("video.mp4", "Describe the action")

    print(f"Frames processed: {result['total_frames']}")
    print(f"Unique scenes: {result['statistics']['unique_scenes']}")
    print(f"Skip rate: {result['statistics']['skip_rate']}")
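
To show how fps and max_frames might interact, the sketch below computes which source frames would be kept when a clip is sampled down to a target rate. It assumes evenly spaced sampling capped at max_frames; sample_frame_indices is a hypothetical helper and the library's actual extraction logic may differ.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         fps: float = 1.0, max_frames: int = 100) -> list:
    """Indices of the frames to keep when downsampling to `fps` frames per second."""
    if total_frames <= 0 or native_fps <= 0 or fps <= 0:
        return []
    step = max(native_fps / fps, 1.0)   # never sample faster than the source
    indices = []
    position = 0.0
    while int(position) < total_frames and len(indices) < max_frames:
        indices.append(int(position))
        position += step
    return indices


# A 10-second clip recorded at 30 fps, sampled at 1 frame per second
print(sample_frame_indices(300, 30.0))          # [0, 30, 60, ..., 270]
# A long clip hits the max_frames cap
print(len(sample_frame_indices(30_000, 30.0)))  # 100
```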

Configuration

Config File

# latentgate.yaml
ollama_base_url: http://localhost:11434
vision_model: llava:7b
predictor_model: llama3:8b
remote_provider: openai
remote_model: gpt-4o-mini
selective_decoding: true
similarity_threshold: 0.85
use_embeddings: true
enable_caching: true
temperature: 0.1
request_timeout: 120

# Load the config file in Python
from latent_gate import get_config, LatentGatePipeline

config = get_config("latentgate.yaml")
with LatentGatePipeline(config) as pipeline:
    result = pipeline.query("photo.jpg", "Describe this")

Environment Variables

Variable | Description | Default
OPENAI_API_KEY | OpenAI API key | -
ANTHROPIC_API_KEY | Anthropic API key | -
GOOGLE_API_KEY | Google API key | -
LATENTGATE_REMOTE_PROVIDER | Override remote provider | openai
LATENTGATE_REMOTE_MODEL | Override remote model | gpt-4o-mini
LATENTGATE_VISION_MODEL | Override vision model | llava:7b
LATENTGATE_LOG_LEVEL | Log level | INFO
LATENTGATE_LOG_FILE | Log file path | -
LATENTGATE_LOG_JSON | JSON log format | false
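
The override variables imply that values from the environment win over values from the config file. A minimal sketch of that precedence follows. It assumes a plain one-to-one mapping from variable name to config key; apply_env_overrides and ENV_OVERRIDES are hypothetical names and only the three override variables from the table are handled.

```python
import os

# Environment variable -> config key, taken from the table above
ENV_OVERRIDES = {
    "LATENTGATE_REMOTE_PROVIDER": "remote_provider",
    "LATENTGATE_REMOTE_MODEL": "remote_model",
    "LATENTGATE_VISION_MODEL": "vision_model",
}


def apply_env_overrides(file_config: dict, environ=None) -> dict:
    """Return a copy of file_config with any set environment overrides applied."""
    environ = os.environ if environ is None else environ
    merged = dict(file_config)
    for var, key in ENV_OVERRIDES.items():
        value = environ.get(var)
        if value:                      # ignore unset or empty variables
            merged[key] = value
    return merged


config = apply_env_overrides(
    {"remote_provider": "openai", "remote_model": "gpt-4o-mini"},
    environ={"LATENTGATE_REMOTE_PROVIDER": "anthropic"},
)
print(config["remote_provider"], config["remote_model"])  # anthropic gpt-4o-mini
```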

Save Config

from latent_gate import PipelineConfig, save_config

config = PipelineConfig(remote_provider="anthropic", remote_model="claude-sonnet-4-20250514")
save_config(config, "my_config.yaml")

Docker

# Start with Docker Compose (includes Ollama)
docker-compose up -d

# Or build and run manually
docker build -t latent-gate .
docker run -p 8000:8000 latent-gate

The docker-compose setup includes:

  • latent-gate API server (port 8000)
  • Ollama local LLM server (port 11434)
  • ollama-init container that auto-pulls required models

AI Coding Tool Integration (MCP)

LatentGate runs as a Model Context Protocol (MCP) server, so MCP-capable AI coding tools such as those listed below can call it. Your AI assistant then compresses images, long prompts, and documents automatically, saving ~80% on tokens.

Supported Tools

Tool | Status | Setup
VS Code / Copilot | Supported | Extension
Claude Desktop | Supported | MCP Config
Claude Code (CLI) | Supported | Skill
Cursor | Supported | MCP Config
Cline (VS Code) | Supported | MCP Config
Continue.dev | Supported | MCP Config
Zed Editor | Supported | MCP Config

VS Code Extension

code --install-extension KathanModh259.latent-gate-vscode

Features:

  • Right-click any image to compress with LatentGate
  • Select text and press Ctrl+Shift+Alt+C to compress
  • Cost dashboard in activity bar
  • Auto-configures MCP for Copilot Chat
  • Status bar showing token savings

MCP Setup

pip install "latent-gate[mcp]"
ollama pull llava:7b
ollama pull llama3:8b

Add to your tool's MCP config:

{
  "mcpServers": {
    "latent-gate": {
      "command": "python",
      "args": ["-m", "latent_gate.mcp_server"]
    }
  }
}
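
If the tool's config file already defines other MCP servers, the latent-gate entry should be merged in rather than pasted over them. The sketch below does that merge with the standard library. The location of the config file differs per tool and is not given here, so the function takes it as a parameter; add_latent_gate_server is a hypothetical helper name.

```python
import json
from pathlib import Path

# The server entry from the JSON snippet above
LATENT_GATE_ENTRY = {"command": "python", "args": ["-m", "latent_gate.mcp_server"]}


def add_latent_gate_server(config_path: Path) -> dict:
    """Merge the latent-gate entry into an MCP config file, keeping other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["latent-gate"] = LATENT_GATE_ENTRY
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

Back up the original file before running a script like this, since it rewrites the file in place.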

MCP Tools

Tool | When AI Uses It
compress_image | Before analyzing any image
compress_text | For prompts longer than ~500 tokens
compress_conversation | When chat history is large
compress_documents | For RAG queries
get_stats | To check session savings

See the integrations/ folder for detailed setup guides per tool.


Speed Optimizations

Optimization | What It Does | Impact
Connection Pooling | Reuses HTTP connections via requests.Session | ~30-50% faster per call
Model Preloading | Warms up Ollama models on init (keep_alive) | Eliminates 5-15s cold start
Shorter Prompts | Optimized extraction prompts produce fewer output tokens | ~20% faster generation
3-Tier JSON Parsing | Fast parse, then extraction from text, then LLM fallback | Avoids the slow LLM call ~90% of the time
Parallel Processing | Image and text processed simultaneously via ThreadPool | ~40% faster combined queries
Content-Hash Caching | Disk cache for repeated images | Instant on cache hit
Selective Decoding | Cosine similarity skips redundant API calls | ~2.85x fewer calls
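
The 3-Tier JSON Parsing row can be sketched as a function that tries the cheapest strategy first. This is an illustration under assumptions, not LatentGate's implementation: parse_model_json is a hypothetical name, and llm_repair is a hypothetical callback standing in for the slow model call.

```python
import json
import re
from typing import Callable, Optional


def parse_model_json(raw: str, llm_repair: Optional[Callable[[str], str]] = None) -> dict:
    """Parse a JSON object from model output, trying cheap strategies first."""
    # Tier 1: the output is already valid JSON
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Tier 2: pull the outermost {...} span out of surrounding prose
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass

    # Tier 3: ask a model to rewrite the text as JSON (slow, so it comes last)
    if llm_repair is not None:
        return json.loads(llm_repair(raw))
    raise ValueError("no JSON object found in model output")


print(parse_model_json('Sure! Here it is: {"objects": ["car"], "scene": "street"}'))
```

Ordering the tiers by cost means the model is consulted only when both string-level strategies fail.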

Cost Benchmarks

Image Queries (by provider)

Provider | Raw Image Tokens | LatentGate Tokens | Savings
OpenAI GPT-4o (high detail) | ~1,105 | ~150 | ~86%
Claude 3.5 Sonnet (1MP image) | ~1,334 | ~150 | ~89%
Gemini 2.0 Flash | ~258 | ~150 | ~42%

Text and Other Modes

Scenario | Traditional (tokens) | LatentGate (tokens) | Savings
Long text prompt | ~800 | ~120 | ~85%
Conversation (10 turns) | ~2,500 | ~350 | ~86%
RAG documents (3 docs) | ~3,000 | ~450 | ~85%
Video stream (1 min)* | varies | ~2.85x fewer calls | ~65%

*With selective decoding
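
The savings columns follow directly from the token counts, and the video figure follows from the call reduction factor. The short sketch below recomputes three of the rows; the function names are illustrative only.

```python
def savings_percent(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def reduction_from_factor(factor: float) -> float:
    """Convert "N times fewer calls" into a percentage reduction."""
    return (1 - 1 / factor) * 100


print(round(savings_percent(1105, 150)))   # 86  (GPT-4o row)
print(round(savings_percent(258, 150)))    # 42  (Gemini row)
print(round(reduction_from_factor(2.85)))  # 65  (video row)
```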

At Scale (10,000 image queries with gpt-4o-mini)

Metric | Traditional | LatentGate | Savings
Input tokens | 12,000,000 | 2,000,000 | 10M tokens
Cost | $1.80 | $0.30 | $1.50 (83%)
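
The cost row can be reproduced from the token counts. The table implies an input price of $0.15 per million tokens ($1.80 for 12M tokens); the sketch treats that price as an assumption, so check the provider's current pricing before relying on it.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


PRICE = 0.15      # USD per 1M input tokens, implied by the table (assumption)
QUERIES = 10_000

traditional = input_cost_usd(QUERIES * 1_200, PRICE)  # ~1,200 tokens per raw image query
compressed = input_cost_usd(QUERIES * 200, PRICE)     # ~200 tokens per compressed query
saved = traditional - compressed
print(f"${traditional:.2f} -> ${compressed:.2f}, saved ${saved:.2f} ({saved / traditional:.0%})")
# $1.80 -> $0.30, saved $1.50 (83%)
```

Output tokens and the local compute cost of running Ollama are not included in this figure.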

Cost Tracking

from latent_gate import CostTracker

tracker = CostTracker()
tracker.record_usage(
    query_type="image",
    provider="openai",
    model="gpt-4o-mini",
    input_tokens=150,
    output_tokens=200,
    tokens_saved=1000,
    compression_ratio=6.7,
    latency_ms=1500,
)

# Session statistics
stats = tracker.get_session_statistics()
print(f"Total cost: ${stats['total_cost']:.4f}")
print(f"Tokens saved: {stats['total_tokens_saved']}")

# Cost projection
projection = tracker.get_cost_projection(
    daily_queries=1000,
    provider="openai",
    model="gpt-4o-mini"
)
print(f"Monthly savings: ${projection['savings']['monthly']:.2f}")

# Export report
tracker.export_report("usage_report.json", fmt="json")
tracker.export_report("usage_report.csv", fmt="csv")
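
Since cost tracking is described as SQLite-based, the general idea can be shown with the standard sqlite3 module. The table name and columns below are hypothetical and are not LatentGate's actual schema; the sketch only shows how usage rows might be stored and aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a persistent tracker would use a file path
conn.execute(
    """CREATE TABLE usage (
           provider TEXT, model TEXT,
           input_tokens INTEGER, output_tokens INTEGER, tokens_saved INTEGER
       )"""
)
rows = [
    ("openai", "gpt-4o-mini", 150, 200, 1000),
    ("openai", "gpt-4o-mini", 120, 180, 680),
]
conn.executemany("INSERT INTO usage VALUES (?, ?, ?, ?, ?)", rows)

# Aggregate per model, as a session statistics call might
total_input, total_saved = conn.execute(
    "SELECT SUM(input_tokens), SUM(tokens_saved) FROM usage WHERE model = ?",
    ("gpt-4o-mini",),
).fetchone()
print(total_input, total_saved)  # 270 1680
```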

Multi-Language Support

from latent_gate import detect_language, MultiLanguageProcessor

# Detect language
lang = detect_language("Esto es un texto en español")
print(f"Detected: {lang.name} ({lang.confidence:.0%})")

# Process with auto-translation to English
processor = MultiLanguageProcessor()
text, lang_info = processor.process("Texto en español para analizar")
print(f"Language: {lang_info.name}, Translated: {text[:100]}...")

Project Structure

latent-gate/
├── latent_gate/
│   ├── __init__.py           # Package exports and version
│   ├── config.py             # PipelineConfig dataclass
│   ├── config_loader.py      # YAML/TOML/JSON config loading
│   ├── payload.py            # SemanticPayload (compact representation)
│   ├── text_processor.py     # TextPayload + TextProcessor (local compression)
│   ├── local_processor.py    # X-Encoder + Predictor (Ollama vision pipeline)
│   ├── remote_decoder.py     # Y-Decoder (OpenAI, Anthropic, Google, Ollama)
│   ├── selective_decoder.py  # Cosine/Jaccard similarity for skip decisions
│   ├── fast_client.py        # Connection pooling + model preloading
│   ├── cache.py              # Content-hash disk cache
│   ├── pipeline.py           # LatentGatePipeline (main orchestrator)
│   ├── async_pipeline.py     # AsyncLatentGatePipeline
│   ├── video_processor.py    # Video frame extraction + batch processing
│   ├── cost_tracker.py       # SQLite-based cost analytics
│   ├── mcp_server.py         # MCP server (Model Context Protocol)
│   ├── api_server.py         # FastAPI REST server
│   ├── cli.py                # Command-line interface
│   ├── logging_config.py     # Structured logging with rotation
│   ├── plugin_system.py      # Custom processor plugins
│   └── multilang.py          # Multi-language detection and translation
├── integrations/
│   ├── mcp_server/           # Standalone MCP server
│   ├── claude_code_skill/    # Claude Code skill + scripts
│   ├── cursor/               # Cursor MCP config
│   ├── continue_dev/         # Continue.dev config
│   └── openai_functions/     # OpenAI/Anthropic function schemas
├── examples/
│   ├── basic_usage.py
│   ├── text_compression.py
│   ├── advanced_features.py
│   ├── video_streaming.py
│   └── ...
├── tests/                    # 62 tests (unit + integration)
├── vscode-extension/         # VS Code extension source
├── docs/
├── .github/workflows/        # CI + publish workflows
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── requirements.txt

Contributing

Contributions welcome! See CONTRIBUTING.md.

Development Setup

git clone https://github.com/KathanModh259/latent-gate.git
cd latent-gate
python -m venv .venv
source .venv/bin/activate       # Linux/macOS
.venv\Scripts\Activate.ps1      # Windows PowerShell

pip install -r requirements.txt
pip install -r requirements-dev.txt

Run Tests

pytest tests/ -v

Priority Areas

  • Additional vision model support (Florence-2, InternVL, Qwen-VL)
  • Custom similarity plugins for domain-specific use cases
  • WebSocket support for real-time streaming
  • Advanced cost analytics and optimization suggestions
  • Plugin development for specialized industries
  • Test coverage improvements
  • Documentation and examples

Citation

@software{latentgate2026,
  author  = {Kathan Modh},
  title   = {LatentGate: Local-First Vision-Language Pipeline Inspired by VL-JEPA},
  year    = {2026},
  version = {1.0.0},
  url     = {https://github.com/KathanModh259/latent-gate}
}

Inspired by VL-JEPA (Meta FAIR, 2025).


License

MIT License — see LICENSE.


Built by Kathan Modh

