LATTICE
LLM Transport & Efficiency Layer
Make every LLM call cheaper, faster, and safer, without changing your model.
LATTICE is an intelligent transport proxy that sits between your application and any LLM provider. It applies network-layer optimizations — congestion control, binary framing, delta encoding, speculation, batching — plus a safety-gated compression pipeline with 18 transforms. Your app sends standard OpenAI API requests; LATTICE makes them smaller, faster, safer, and cache-friendly.
It is not a router. LATTICE never changes your model, never falls back between providers, never guesses. You route to exactly one provider per request. LATTICE optimizes the transport and execution.
Table of Contents
- Installation
- Quick Start
- Architecture
- Novel Technology
- Compression Pipeline
- Safety
- Observability
- Supported Providers
- CLI Reference
- Agent Integration
- Development
- Documentation
- License
Installation
pip install lattice-transport
Optional dependencies:
pip install "lattice-transport[redis]" # Multi-process session store
pip install "lattice-transport[mcp]" # MCP tool support
pip install "lattice-transport[all]" # Everything
Requirements: Python 3.10+. No external services needed for single-process mode.
Quick Start
# Start the proxy
lattice proxy run --port 8787
# Point any OpenAI SDK at it
export OPENAI_BASE_URL=http://localhost:8787/v1
# Or route an agent through it
lattice lace claude
# Or use the SDK
from lattice import LatticeClient
client = LatticeClient()
response = client.chat.completions.create(
model="openai/gpt-4o",
messages=[{"role": "user", "content": "Explain transport protocols"}],
)
print(response.choices[0].message.content)
Every request is automatically compressed, cached, and optimized. Zero code changes in proxy mode.
Architecture
┌──────────────────────────┐
│ Application / Agent │
│ (Claude, Cursor, Codex, │
│ OpenAI SDK, curl) │
└────────────┬─────────────┘
│ OpenAI API format
┌────────────▼─────────────┐
│ LATTICE PROXY :8787 │
│ │
┌────────────────┼───────────────────────┐ │
│ │ │ │
▼ ▼ ▼ │
┌─────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Session │ │ Transform │ │ Semantic │
│ Manager │ │ Pipeline │ │ Cache │
│ │ │ │ │ │
│ Memory or │ │ 18 transforms │ │ Exact-hash │
│ Redis store │ │ priority-ordered│ │ + approximate │
│ │ │ risk-gated │ │ semantic match │
│ CAS version │ │ expansion-capped│ │ LRU + TTL │
└──────┬──────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────────────┼────────────────────┘
│
┌───────────▼──────────────────┐
│ DirectHTTPProvider │
├──────────────────────────────┤
│ ProviderRegistry (17 adapt) │
│ ConnectionPool (HTTP/2) │
│ StreamStallDetector │
│ TACC Congestion Controller │
└───────────┬──────────────────┘
│
▼
┌───────────────────────┐
│ LLM Provider │
│ (exactly one per req)│
└───────────────────────┘
Request Flow
1. Client sends OpenAI-compatible POST /v1/chat/completions
2. SessionManager creates or retrieves session (with CAS versioning)
3. 18 transforms run in priority order, each gated by:
config → policy → runtime budget → risk gate → expansion guard
4. SemanticCache checks exact hash, then approximate fingerprint
5. [cache miss] Provider adapter serializes → HTTP/2 pool → provider
6. [streaming] StallDetector monitors per-provider tolerance windows
7. TACC controller manages concurrency window (token-based, not request-count)
8. Response deserialized → pipeline reverse pass → OpenAI JSON → client
9. Session updated, response cached, headers attached
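The two-stage cache lookup in step 4 can be sketched as follows. This is illustrative only: the class name and the bag-of-words fingerprint are stand-ins (LATTICE's actual approximate semantic matching is more sophisticated), shown to make the exact-then-approximate ordering concrete.

```python
import hashlib


class TwoStageCache:
    """Exact-hash lookup first, then an approximate fingerprint fallback."""

    def __init__(self):
        self.exact = {}   # sha256(prompt) -> response
        self.approx = {}  # fingerprint -> response

    @staticmethod
    def _fingerprint(prompt: str) -> frozenset:
        # Crude stand-in for a semantic fingerprint: order- and
        # case-insensitive bag of words.
        return frozenset(prompt.lower().split())

    def put(self, prompt: str, response: str):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        self.exact[key] = response
        self.approx[self._fingerprint(prompt)] = response

    def get(self, prompt: str):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:          # stage 1: exact hash
            return self.exact[key]
        return self.approx.get(self._fingerprint(prompt))  # stage 2: approximate
```

A reworded-but-equivalent prompt misses the exact hash yet can still hit the approximate stage, which is the point of running both.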
Novel Technology
LATTICE adapts classical systems techniques for LLM workloads. These are not LLM features — they are transport, network, and execution innovations.
TACC
Token-Aware Congestion Control — AIMD-style adaptive concurrency. Manages per-provider admission using token pressure (not request counts: a 100K-token request uses more provider capacity than a 10-token one). Priority-ordered waiting queue, stall-aware window collapse, cache-aware latency smoothing. → Deep Dive
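The core idea can be sketched as an AIMD loop over a token budget rather than a request count. This is a minimal illustration, not LATTICE's implementation; the class name, window sizes, and increment are assumed example values.

```python
class TokenAIMD:
    """Admit requests by estimated token cost, AIMD-style."""

    def __init__(self, window_tokens: int = 20_000, min_window: int = 2_000):
        self.window = window_tokens  # current token budget for this provider
        self.min_window = min_window
        self.in_flight = 0           # tokens currently admitted

    def try_admit(self, est_tokens: int) -> bool:
        # A 100K-token request consumes far more budget than a 10-token one.
        if self.in_flight + est_tokens <= self.window:
            self.in_flight += est_tokens
            return True
        return False  # caller parks the request in a priority queue

    def on_success(self, est_tokens: int, additive: int = 500):
        self.in_flight -= est_tokens
        self.window += additive  # additive increase on healthy completion

    def on_stall(self, est_tokens: int):
        self.in_flight -= est_tokens
        # multiplicative decrease when the provider shows distress
        self.window = max(self.min_window, self.window // 2)
```

A stall halves the window (floored at the minimum), so one misbehaving provider throttles quickly without starving entirely.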
Binary Framing
15-byte fixed header format. 17 semantic frame types (PING, REQUEST, STREAM_CHUNK, RESUME_TOKEN...). CRC32 per-frame integrity. Semantic boundary flags for sentence/tool/reasoning boundaries. O(1) parsing — no JSON overhead per chunk. → Deep Dive
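A 15-byte fixed header with a per-frame CRC32 can be packed and verified with the standard library alone. The field layout below is an assumed example to show the shape of the idea; LATTICE's actual wire format may order or size fields differently.

```python
import struct
import zlib

# version(1) + type(1) + flags(1) + stream_id(4) + length(4) + crc32(4) = 15 bytes
HEADER = struct.Struct("!BBBIII")
FRAME_STREAM_CHUNK = 0x03  # assumed frame-type code, for illustration


def encode_frame(frame_type: int, flags: int, stream_id: int, payload: bytes) -> bytes:
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(1, frame_type, flags, stream_id, len(payload), crc) + payload


def decode_frame(buf: bytes):
    # O(1) fixed-offset parse: no JSON decoding per chunk.
    version, ftype, flags, sid, length, crc = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: corrupted frame")
    return ftype, flags, sid, payload
```

Because the header is fixed-size, a reader always knows exactly where the payload starts and how many bytes to take, which is what makes per-chunk parsing constant-time.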
Delta Encoding
After turn 1, sends only new messages — server reconstructs full context from session store. CAS-style optimistic concurrency via anchor versioning prevents lost updates. Graceful fallback on version/sequence mismatch. → Deep Dive
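The server-side reconstruction step can be sketched with a CAS check on an anchor version. The names here (`Session`, `apply_delta`) are illustrative, not LATTICE's API; they show why a stale anchor forces a fallback rather than a silent lost update.

```python
class Session:
    def __init__(self):
        self.messages = []
        self.version = 0  # anchor version for optimistic concurrency


def apply_delta(session: Session, new_messages: list, expected_version: int):
    """Append only the new messages if the client's anchor is current."""
    if expected_version != session.version:
        # Version mismatch: the client reconciles by resending full context.
        return None
    session.messages.extend(new_messages)
    session.version += 1
    return session.messages
```

Two clients racing on the same session can both read version 0, but only the first write succeeds; the second gets `None` and falls back gracefully instead of clobbering history.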
Stream Architecture
Per-provider dynamic stall detection with phase-aware tolerance multipliers (first_chunk=1.5×, streaming=1.0×, thinking=2.0×, tool_call=1.2×). Token velocity tracking catches trickle-stalls. Multi-stream multiplex (QUIC-inspired) with independent lifecycle per stream. HMAC-signed resume tokens with circular replay windows. → Deep Dive
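The phase multipliers above compose with a per-provider base tolerance like so. The multipliers come from the text; the base value and function name are assumed for illustration.

```python
PHASE_MULTIPLIER = {
    "first_chunk": 1.5,  # providers are slowest before the first token
    "streaming": 1.0,
    "thinking": 2.0,     # reasoning phases legitimately go quiet longer
    "tool_call": 1.2,
}


def stall_timeout(base_tolerance_s: float, phase: str) -> float:
    """Seconds of silence tolerated before this stream is declared stalled."""
    return base_tolerance_s * PHASE_MULTIPLIER[phase]
```

With a 10-second base tolerance, a stream in a thinking phase gets 20 seconds of quiet before the detector fires, while mid-stream silence is flagged after 10.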
Request Batching
Groups independent requests sharing model/temperature/tools into single provider calls. 30-60% per-request overhead reduction from shared prompts. Streaming requests excluded. Compatibility-keyed grouping. → Deep Dive
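Compatibility-keyed grouping can be sketched as bucketing requests by the parameters a provider call must share. Field names follow the OpenAI request shape; the grouping function itself is an illustrative stand-in for LATTICE's batcher.

```python
import json
from collections import defaultdict


def compat_key(req: dict) -> tuple:
    """Requests are batchable only if model, temperature, and tools match."""
    tools = json.dumps(req.get("tools", []), sort_keys=True)
    return (req["model"], req.get("temperature", 1.0), tools)


def group_requests(reqs: list) -> list:
    groups = defaultdict(list)
    for r in reqs:
        if r.get("stream"):
            # Streaming requests are excluded: each stays in its own group.
            groups[("stream", id(r))].append(r)
        else:
            groups[compat_key(r)].append(r)
    return list(groups.values())
```

Only the first two requests below share a key, so three provider calls go out instead of four; the saved per-request overhead is where the 30-60% reduction comes from.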
Speculative Execution
Sidecar prediction of next-turn content. Rule-based (zero-cost). Runs in parallel with real request — discard if wrong, instant if right. Never blocks the main request. Confidence threshold ≥0.7. → Deep Dive
Compression Pipeline
18 transforms run in priority order; the table below shows a representative selection. Every transform is safety-classified and risk-gated.
| P | Transform | Safety | What It Does |
|---|---|---|---|
| 1 | content_profiler | SAFE | Classifies content type, computes 0-100 risk score |
| 2 | runtime_contract | SAFE | Enforces transform time budget per-complexity tier |
| 9 | cache_arbitrage | SAFE | Reorders for KV-cache alignment, sets provider hints |
| 10 | prefix_optimizer | SAFE | Deduplicates common message prefixes |
| 15 | message_dedup | CONDITIONAL | Removes exact/near-duplicate turns |
| 20 | reference_sub | CONDITIONAL | UUIDs, URLs, paths → <ref_N> short references |
| 22 | rate_distortion | CONDITIONAL | Extractive text compression of long-form content |
| 24 | grammar_compress | CONDITIONAL | Grammar-based structured data compression |
| 25 | dictionary_compress | CONDITIONAL | Learned phrase dictionary (HPACK-style) |
| 25 | format_conversion | CONDITIONAL | Markdown tables, JSON → compact CSV/TSV |
| 30 | tool_filter | SAFE | Strips internal fields from tool output |
| 40 | output_cleanup | SAFE | Normalizes whitespace, trims boilerplate |
Execution transforms (outside main pipeline): batching, speculative execution, delta encoding, auto-continuation.
Safety
Every transform is classified into one of three buckets. A 0-100 semantic risk score (8 dimensions) gates CONDITIONAL and DANGEROUS transforms.
| Risk Score | SAFE | CONDITIONAL | DANGEROUS | Expansion Guard |
|---|---|---|---|---|
| LOW (0-20) | ✓ | ✓ | ✓ | tokens × 1.5 max |
| MEDIUM (20-40) | ✓ | ✓ | ✗ | tokens × 1.5 max |
| HIGH (40-60) | ✓ | ✗ | ✗ | tokens × 1.5 max |
| CRITICAL (>60) | ✓ | ✗ | ✗ | tokens × 1.5 max |
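The gating rule the table expresses can be written down directly. This is a sketch of the policy, not LATTICE's code; the tier boundaries follow the table (treating a score of exactly 20 or 40 as the lower tier), and the 1.5× cap applies uniformly.

```python
def allowed_buckets(risk_score: float) -> set:
    """Which safety classes may run at a given semantic risk score."""
    if risk_score <= 20:   # LOW
        return {"SAFE", "CONDITIONAL", "DANGEROUS"}
    if risk_score <= 40:   # MEDIUM
        return {"SAFE", "CONDITIONAL"}
    return {"SAFE"}        # HIGH and CRITICAL: only provably safe transforms


def expansion_ok(tokens_before: int, tokens_after: int) -> bool:
    """Expansion guard: no transform chain may grow the prompt past 1.5x."""
    return tokens_after <= tokens_before * 1.5
```

Note the asymmetry: risk only ever removes transforms from consideration, it never adds them, so a misjudged score fails toward doing less.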
Observability
Every request returns routing metadata. Full runtime state in /stats.
curl http://localhost:8787/stats | jq
Key surfaces:
- /stats — Full JSON: transforms, sessions, pools, TACC state, maintenance, downgrades, ignored chunks
- /metrics — Prometheus format: counters, gauges, latency histograms per provider
- Response headers — x-lattice-compression, x-lattice-session-id, x-lattice-delta, x-lattice-cost-usd
- Maintenance — Background cleanup every 60s (stale streams, cache expiry), visible in /stats/maintenance
Supported Providers
17 direct adapters. No routing — one provider per request.
| Provider | Prefix | HTTP/2 | Cache | Streaming |
|---|---|---|---|---|
| OpenAI | openai/ | ✅ | AUTO_PREFIX | SSE delta |
| Anthropic | anthropic/, claude- | ✅ | EXPLICIT_BREAKPOINT | SSE |
| Groq | groq/ | ✅ | — | SSE |
| DeepSeek | deepseek/ | ✅ | — | SSE |
| Mistral | mistral/ | ✅ | — | SSE |
| Cohere | cohere/ | ✅ | — | SSE |
| Gemini | gemini/, google/ | ✅ | EXPLICIT_CONTEXT | SSE |
| Vertex AI | vertex/ | ✅ | EXPLICIT_CONTEXT | SSE |
| Azure | azure/ | ✅ | AUTO_PREFIX | SSE |
| Bedrock | bedrock/ | ✅ | EXPLICIT_BREAKPOINT | SSE |
| Ollama | ollama/ | — | — | SSE |
| Ollama Cloud | ollama-cloud/ | ✅ | — | SSE |
| OpenRouter | openrouter/ | ✅ | — | SSE |
| Fireworks | fireworks/ | ✅ | — | SSE |
| Together | together/ | ✅ | — | SSE |
| Perplexity | perplexity/ | ✅ | — | SSE |
| AI21 | ai21/ | ✅ | — | SSE |
CLI Reference
lattice proxy run --port 8787 # Start foreground
lattice proxy start --port 8787 # Start daemon
lattice proxy stop # Graceful shutdown
lattice proxy status # PID, uptime, health
lattice init # Auto-detect + configure agents
lattice lace claude # Route agent through proxy
lattice unlace claude # Restore original config
lattice info # Version, transforms, config
lattice status # Proxy + agent health
lattice health # Connectivity check
lattice doctor # Diagnose routing issues
lattice config # Resolved configuration
Agent Integration
Route coding agents through LATTICE with a single command:
lattice lace claude # Claude Code
lattice lace codex # OpenAI Codex
lattice lace cursor # Cursor
lattice lace opencode # OpenCode
lattice lace copilot # GitHub Copilot
lattice lace starts the proxy, configures the agent's environment, launches the agent, and cleans up on exit. No permanent changes.
For permanent configuration: lattice init patches agent config files. lattice unlace reverses.
Development
git clone https://github.com/Harsh-Daga/lattice
cd lattice
uv sync # Install all deps
uv run pytest # 1584 tests, 7 skipped
# Lint + typecheck
uv run ruff check src/
uv run mypy src/lattice/
# Run benchmarks
uv run python benchmarks/evals/cli.py --suite all \
--providers ollama-cloud \
--provider-model ollama-cloud=kimi-k2.6:cloud \
--iterations 1 --warmup 0 --provider-warmup 0
Documentation
| Section | Documents |
|---|---|
| Getting Started | Quick Start · Installation · CLI |
| Concepts | Architecture · Proxy · SDK · Observability · Safety |
| Novel Tech | TACC · Binary Framing · Delta Encoding · Streaming · Batching & Speculation |
| Compression | Transforms · Caching · Protocol |
| Providers | 17 Providers |
| Evaluation | Benchmarks |
| Operations | Agent Integrations |
License
MIT © Harsh Daga
GitHub · Issues · PyPI · Changelog · Contributing