
LATTICE

LLM Transport & Efficiency Layer
Make every LLM call cheaper, faster, and safer — without changing your model.



LATTICE is an intelligent transport proxy that sits between your application and any LLM provider. It applies network-layer optimizations — congestion control, binary framing, delta encoding, speculation, batching — plus a safety-gated compression pipeline with 18 transforms. Your app sends standard OpenAI API requests; LATTICE makes them smaller, faster, safer, and cache-friendly.

It is not a router. LATTICE never changes your model, never falls back between providers, never guesses. You route to exactly one provider per request. LATTICE optimizes the transport and execution.

Installation

pip install lattice-transport

Optional dependencies:

pip install "lattice-transport[redis]"   # Multi-process session store
pip install "lattice-transport[mcp]"     # MCP tool support
pip install "lattice-transport[all]"     # Everything

Requirements: Python 3.10+. No external services needed for single-process mode.

Quick Start

# Start the proxy
lattice proxy run --port 8787

# Point any OpenAI SDK at it
export OPENAI_BASE_URL=http://localhost:8787/v1

# Or route an agent through it
lattice lace claude

# Or use the SDK
from lattice import LatticeClient

client = LatticeClient()
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain transport protocols"}],
)
print(response.choices[0].message.content)

Every request is automatically compressed, cached, and optimized. Zero code changes in proxy mode.


Architecture

             ┌──────────────────────────┐
             │   Application / Agent    │
             │ (Claude, Cursor, Codex,  │
             │  OpenAI SDK, curl)       │
             └────────────┬─────────────┘
                          │ OpenAI API format
             ┌────────────▼─────────────┐
             │   LATTICE PROXY :8787    │
             └────────────┬─────────────┘
                          │
       ┌──────────────────┼────────────────────┐
       ▼                  ▼                    ▼
┌─────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Session   │  │    Transform    │  │    Semantic     │
│   Manager   │  │    Pipeline     │  │     Cache       │
│             │  │                 │  │                 │
│ Memory or   │  │ 18 transforms   │  │ Exact-hash      │
│ Redis store │  │ priority-ordered│  │ + approximate   │
│             │  │ risk-gated      │  │ semantic match  │
│ CAS version │  │ expansion-capped│  │ LRU + TTL       │
└──────┬──────┘  └────────┬────────┘  └────────┬────────┘
       │                  │                    │
       └──────────────────┼────────────────────┘
                          │
              ┌───────────▼────────────────────┐
              │       DirectHTTPProvider       │
              ├────────────────────────────────┤
              │ ProviderRegistry (17 adapters) │
              │ ConnectionPool (HTTP/2)        │
              │ StreamStallDetector            │
              │ TACC Congestion Controller     │
              └───────────┬────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │     LLM Provider      │
              │ (exactly one per req) │
              └───────────────────────┘

Request Flow

1. Client sends OpenAI-compatible POST /v1/chat/completions
2. SessionManager creates or retrieves session (with CAS versioning)
3. 18 transforms run in priority order, each gated by:
   config → policy → runtime budget → risk gate → expansion guard
4. SemanticCache checks exact hash, then approximate fingerprint
5. [cache miss] Provider adapter serializes → HTTP/2 pool → provider
6. [streaming] StallDetector monitors per-provider tolerance windows
7. TACC controller manages concurrency window (token-based, not request-count)
8. Response deserialized → pipeline reverse pass → OpenAI JSON → client
9. Session updated, response cached, headers attached
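The two-tier lookup in step 4 can be sketched as follows. This is a minimal illustration, not LATTICE's implementation: the fingerprint function, overlap threshold, and class names are assumptions, and eviction here is insertion-order rather than true LRU.

```python
import hashlib
import time
from collections import OrderedDict

class SemanticCache:
    """Exact-hash lookup first, then a cheap approximate match, with TTL."""

    def __init__(self, max_entries=1024, ttl_s=300.0):
        self.entries = OrderedDict()  # hash -> (expires_at, fingerprint, response)
        self.max_entries = max_entries
        self.ttl_s = ttl_s

    @staticmethod
    def _hash(prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    @staticmethod
    def _fingerprint(prompt: str) -> frozenset:
        # Toy fingerprint: the set of significant words.
        return frozenset(w for w in prompt.lower().split() if len(w) > 3)

    def put(self, prompt: str, response: str):
        self.entries[self._hash(prompt)] = (
            time.monotonic() + self.ttl_s, self._fingerprint(prompt), response)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the oldest entry

    def get(self, prompt: str, min_overlap=0.9):
        now = time.monotonic()
        hit = self.entries.get(self._hash(prompt))
        if hit and hit[0] > now:
            return hit[2]  # exact-hash hit
        fp = self._fingerprint(prompt)
        for expires, cached_fp, response in self.entries.values():
            if (expires > now and fp
                    and len(fp & cached_fp) / len(fp | cached_fp) >= min_overlap):
                return response  # approximate fingerprint match
        return None

cache = SemanticCache()
cache.put("Explain transport protocols in detail", "cached response")
assert cache.get("Explain transport protocols in detail") == "cached response"
assert cache.get("completely unrelated question") is None
```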

Novel Technology

LATTICE adapts classical systems techniques for LLM workloads. These are not LLM features — they are transport, network, and execution innovations.

TACC

Token-Aware Congestion Control — AIMD-style adaptive concurrency. Manages per-provider admission using token pressure (not request counts: a 100K-token request uses more provider capacity than a 10-token one). Priority-ordered waiting queue, stall-aware window collapse, cache-aware latency smoothing. → Deep Dive
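The AIMD idea can be sketched in a few lines. This is illustrative only (class and method names are mine, not LATTICE's API); the key point is that the window is measured in tokens, so a 100K-token request consumes far more of it than a 10-token one.

```python
class TokenAIMD:
    """Token-denominated AIMD admission window, per provider."""

    def __init__(self, window_tokens=50_000, min_window=1_000, max_window=500_000):
        self.window = window_tokens   # current admission window, in tokens
        self.min_window = min_window
        self.max_window = max_window
        self.in_flight = 0            # tokens currently admitted

    def try_admit(self, request_tokens: int) -> bool:
        """Admit only if the request's token cost fits in the window."""
        if self.in_flight + request_tokens <= self.window:
            self.in_flight += request_tokens
            return True
        return False                  # caller places the request in the wait queue

    def on_success(self, request_tokens: int, additive_step=1_000):
        """Additive increase: grow the window slowly on clean completions."""
        self.in_flight -= request_tokens
        self.window = min(self.max_window, self.window + additive_step)

    def on_stall(self, request_tokens: int):
        """Multiplicative decrease: halve the window when the provider stalls."""
        self.in_flight -= request_tokens
        self.window = max(self.min_window, self.window // 2)

ctl = TokenAIMD(window_tokens=10_000)
assert ctl.try_admit(8_000)        # fits in the 10K window
assert not ctl.try_admit(4_000)    # would exceed it, so it queues
ctl.on_stall(8_000)                # stall collapses the window to 5_000
assert ctl.window == 5_000
```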

Binary Framing

15-byte fixed header format. 17 semantic frame types (PING, REQUEST, STREAM_CHUNK, RESUME_TOKEN...). CRC32 per-frame integrity. Semantic boundary flags for sentence/tool/reasoning boundaries. O(1) parsing — no JSON overhead per chunk. → Deep Dive
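A 15-byte fixed header makes each frame parseable with a single unpack. The field layout below is my assumption for illustration (LATTICE's documented wire format may order fields differently): version, frame type, flags, stream id, payload length, and a CRC32 over the payload.

```python
import struct
import zlib

HEADER = struct.Struct(">BBBIII")   # 1+1+1+4+4+4 = 15 bytes, big-endian
FRAME_STREAM_CHUNK = 0x03           # illustrative frame-type code

def encode_frame(frame_type: int, stream_id: int, payload: bytes,
                 flags: int = 0) -> bytes:
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(1, frame_type, flags, stream_id, len(payload), crc) + payload

def decode_frame(buf: bytes):
    # O(1) header parse: no JSON decoding per chunk.
    version, ftype, flags, stream_id, length, crc = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch: corrupt frame")
    return ftype, stream_id, flags, payload

frame = encode_frame(FRAME_STREAM_CHUNK, 7, b"Hello, ")
assert len(frame) == 15 + 7
assert decode_frame(frame) == (FRAME_STREAM_CHUNK, 7, 0, b"Hello, ")
```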

Delta Encoding

After turn 1, sends only new messages — server reconstructs full context from session store. CAS-style optimistic concurrency via anchor versioning prevents lost updates. Graceful fallback on version/sequence mismatch. → Deep Dive
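The CAS-style check can be sketched as follows (field and method names are illustrative): a delta only applies if the client's anchor version matches the server's current version; otherwise the client falls back to a full-context resend.

```python
class SessionStore:
    """Server-side session that reconstructs context from deltas."""

    def __init__(self):
        self.messages = []
        self.version = 0

    def apply_delta(self, new_messages, anchor_version):
        if anchor_version != self.version:
            # Version mismatch: reject, caller resends the full context.
            return None
        self.messages.extend(new_messages)
        self.version += 1
        return list(self.messages)

store = SessionStore()
assert store.apply_delta([{"role": "user", "content": "hi"}], anchor_version=0)
# A stale writer using the old version is rejected, not silently merged:
assert store.apply_delta([{"role": "user", "content": "again"}], anchor_version=0) is None
assert store.version == 1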

Stream Architecture

Per-provider dynamic stall detection with phase-aware tolerance multipliers (first_chunk=1.5×, streaming=1.0×, thinking=2.0×, tool_call=1.2×). Token velocity tracking catches trickle-stalls. Multi-stream multiplex (QUIC-inspired) with independent lifecycle per stream. HMAC-signed resume tokens with circular replay windows. → Deep Dive
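The phase multipliers above translate into a simple check: a stream is stalled when the gap since the last chunk exceeds the base tolerance times the current phase's multiplier. The class and field names below are mine, not LATTICE's.

```python
import time

PHASE_MULTIPLIER = {
    "first_chunk": 1.5, "streaming": 1.0, "thinking": 2.0, "tool_call": 1.2,
}

class StallDetector:
    def __init__(self, base_tolerance_s: float):
        self.base = base_tolerance_s
        self.last_chunk_at = time.monotonic()
        self.phase = "first_chunk"

    def on_chunk(self, phase: str):
        self.phase = phase
        self.last_chunk_at = time.monotonic()

    def is_stalled(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        allowed = self.base * PHASE_MULTIPLIER[self.phase]
        return (now - self.last_chunk_at) > allowed

det = StallDetector(base_tolerance_s=10.0)
det.on_chunk("thinking")
# While "thinking", tolerance is 10 × 2.0 = 20s: a 15s gap is not a stall.
assert not det.is_stalled(now=det.last_chunk_at + 15)
assert det.is_stalled(now=det.last_chunk_at + 25)
```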

Request Batching

Groups independent requests sharing model/temperature/tools into single provider calls. 30-60% per-request overhead reduction from shared prompts. Streaming requests excluded. Compatibility-keyed grouping. → Deep Dive
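Compatibility-keyed grouping can be sketched like this (helper names are mine): requests sharing model, temperature, and tool set fall into the same batch, and streaming requests are kept out, matching the rules above.

```python
def compat_key(req: dict):
    """Batch key: requests are compatible iff these fields all match."""
    return (req["model"], req.get("temperature", 1.0),
            tuple(sorted(t["name"] for t in req.get("tools", []))))

def group_for_batching(requests):
    groups = {}
    for req in requests:
        if req.get("stream"):
            groups.setdefault(None, []).append(req)  # streaming never batches
            continue
        groups.setdefault(compat_key(req), []).append(req)
    return groups

reqs = [
    {"model": "openai/gpt-4o", "temperature": 0.2},
    {"model": "openai/gpt-4o", "temperature": 0.2},
    {"model": "openai/gpt-4o", "temperature": 0.2, "stream": True},
    {"model": "openai/gpt-4o", "temperature": 0.7},
]
groups = group_for_batching(reqs)
# The two compatible non-streaming requests share a batch; the streaming
# request and the different-temperature request do not.
assert len(groups[("openai/gpt-4o", 0.2, ())]) == 2
assert len(groups[None]) == 1
```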

Speculative Execution

Sidecar prediction of next-turn content. Rule-based (zero-cost). Runs in parallel with real request — discard if wrong, instant if right. Never blocks the main request. Confidence threshold ≥0.7. → Deep Dive
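The acceptance contract can be sketched as a pure function (names are illustrative): a speculation result is used only when its confidence clears the threshold and the predicted request exactly matches what the client actually sent; anything else is discarded without blocking.

```python
CONFIDENCE_THRESHOLD = 0.7

def use_speculation(prediction, actual_request):
    """Return the speculative response iff it is confident and correct."""
    if prediction is None or prediction["confidence"] < CONFIDENCE_THRESHOLD:
        return None                   # below threshold: ignore, never block
    if prediction["request"] != actual_request:
        return None                   # wrong guess: discard silently
    return prediction["response"]     # right guess: instant response

pred = {"confidence": 0.9, "request": "continue", "response": "cached next turn"}
assert use_speculation(pred, "continue") == "cached next turn"
assert use_speculation(pred, "something else") is None
low = {"confidence": 0.5, "request": "continue", "response": "x"}
assert use_speculation(low, "continue") is None
```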


Compression Pipeline

18 transforms in priority order. Every transform is safety-classified and risk-gated.

P   Transform            Safety       What It Does
1   content_profiler     SAFE         Classifies content type, computes 0-100 risk score
2   runtime_contract     SAFE         Enforces transform time budget per complexity tier
9   cache_arbitrage      SAFE         Reorders for KV-cache alignment, sets provider hints
10  prefix_optimizer     SAFE         Deduplicates common message prefixes
15  message_dedup        CONDITIONAL  Removes exact/near-duplicate turns
20  reference_sub        CONDITIONAL  UUIDs, URLs, paths → <ref_N> short references
22  rate_distortion      CONDITIONAL  Extractive text compression of long-form content
24  grammar_compress     CONDITIONAL  Grammar-based structured data compression
25  dictionary_compress  CONDITIONAL  Learned phrase dictionary (HPACK-style)
25  format_conversion    CONDITIONAL  Markdown tables, JSON → compact CSV/TSV
30  tool_filter          SAFE         Strips internal fields from tool output
40  output_cleanup       SAFE         Normalizes whitespace, trims boilerplate

Execution transforms (outside main pipeline): batching, speculative execution, delta encoding, auto-continuation.
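As one concrete example, the reference_sub idea (transform 20 above) can be sketched as a reversible substitution: long opaque tokens such as UUIDs become short <ref_N> placeholders on the way to the provider, and the reverse pass restores them. The regex and function names here are illustrative, not LATTICE's internals.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def substitute_refs(text: str):
    """Replace each distinct UUID with <ref_N>; return text + reverse table."""
    table = {}
    def repl(m):
        return table.setdefault(m.group(0), f"<ref_{len(table)}>")
    return UUID_RE.sub(repl, text), {v: k for k, v in table.items()}

def restore_refs(text: str, table: dict) -> str:
    """Reverse pass: expand <ref_N> placeholders back to the originals."""
    for ref, original in table.items():
        text = text.replace(ref, original)
    return text

msg = ("Delete row 123e4567-e89b-12d3-a456-426614174000 "
       "then re-check 123e4567-e89b-12d3-a456-426614174000")
compressed, table = substitute_refs(msg)
assert compressed == "Delete row <ref_0> then re-check <ref_0>"
assert restore_refs(compressed, table) == msg
```

Note that the repeated UUID collapses to a single short reference, which is where the token savings come from.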

Full Transform Reference


Safety

Every transform is classified into one of three buckets. A 0-100 semantic risk score (8 dimensions) gates CONDITIONAL and DANGEROUS transforms.

Risk Score      SAFE   CONDITIONAL   DANGEROUS   Expansion Guard
──────────      ────   ───────────   ─────────   ───────────────
LOW (0-20)      ✓      ✓             ✓           tokens × 1.5 max
MEDIUM (20-40)  ✓      ✓             ✗           tokens × 1.5 max
HIGH (40-60)    ✓      ✗             ✗           tokens × 1.5 max
CRITICAL (>60)  ✓      ✗             ✗           tokens × 1.5 max
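The gating table reads directly as code. A minimal sketch, assuming the thresholds above (the function and constant names are mine): a transform runs only if its safety class is allowed at the request's risk level, and its output must stay under 1.5× the input tokens.

```python
ALLOWED = {
    "LOW":      {"SAFE", "CONDITIONAL", "DANGEROUS"},
    "MEDIUM":   {"SAFE", "CONDITIONAL"},
    "HIGH":     {"SAFE"},
    "CRITICAL": {"SAFE"},
}
EXPANSION_CAP = 1.5

def risk_level(score: float) -> str:
    if score > 60: return "CRITICAL"
    if score > 40: return "HIGH"
    if score > 20: return "MEDIUM"
    return "LOW"

def may_apply(safety_class: str, risk_score: float,
              tokens_in: int, tokens_out: int) -> bool:
    return (safety_class in ALLOWED[risk_level(risk_score)]
            and tokens_out <= tokens_in * EXPANSION_CAP)

assert may_apply("CONDITIONAL", risk_score=35, tokens_in=100, tokens_out=90)
assert not may_apply("CONDITIONAL", risk_score=55, tokens_in=100, tokens_out=90)  # HIGH blocks CONDITIONAL
assert not may_apply("SAFE", risk_score=10, tokens_in=100, tokens_out=200)        # expansion guard trips
```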

Safety Deep Dive


Observability

Every request returns routing metadata. Full runtime state in /stats.

curl http://localhost:8787/stats | jq

Key surfaces:

  • /stats — Full JSON: transforms, sessions, pools, TACC state, maintenance, downgrades, ignored chunks
  • /metrics — Prometheus format: counters, gauges, latency histograms per provider
  • Response headers — x-lattice-compression, x-lattice-session-id, x-lattice-delta, x-lattice-cost-usd
  • Maintenance — Background cleanup every 60s (stale streams, cache expiry), visible in /stats/maintenance

Observability Guide


Supported Providers

17 direct adapters. No routing — one provider per request.

Provider      Prefix                Cache                 Streaming
OpenAI        openai/               AUTO_PREFIX           SSE delta
Anthropic     anthropic/, claude-   EXPLICIT_BREAKPOINT   SSE
Groq          groq/                                       SSE
DeepSeek      deepseek/                                   SSE
Mistral       mistral/                                    SSE
Cohere        cohere/                                     SSE
Gemini        gemini/, google/      EXPLICIT_CONTEXT      SSE
Vertex AI     vertex/               EXPLICIT_CONTEXT      SSE
Azure         azure/                AUTO_PREFIX           SSE
Bedrock       bedrock/              EXPLICIT_BREAKPOINT   SSE
Ollama        ollama/                                     SSE
Ollama Cloud  ollama-cloud/                               SSE
OpenRouter    openrouter/                                 SSE
Fireworks     fireworks/                                  SSE
Together      together/                                   SSE
Perplexity    perplexity/                                 SSE
AI21          ai21/                                       SSE

Provider Details


CLI Reference

lattice proxy run --port 8787          # Start foreground
lattice proxy start --port 8787        # Start daemon
lattice proxy stop                     # Graceful shutdown
lattice proxy status                   # PID, uptime, health

lattice init                           # Auto-detect + configure agents
lattice lace claude                    # Route agent through proxy
lattice unlace claude                  # Restore original config

lattice info                           # Version, transforms, config
lattice status                         # Proxy + agent health
lattice health                         # Connectivity check
lattice doctor                         # Diagnose routing issues
lattice config                         # Resolved configuration

Full CLI Reference


Agent Integration

Route coding agents through LATTICE with a single command:

lattice lace claude       # Claude Code
lattice lace codex        # OpenAI Codex  
lattice lace cursor       # Cursor
lattice lace opencode     # OpenCode
lattice lace copilot      # GitHub Copilot

lattice lace starts the proxy, configures the agent's environment, launches the agent, and cleans up on exit. No permanent changes.

For permanent configuration: lattice init patches agent config files. lattice unlace reverses.

Integration Guide


Development

git clone https://github.com/Harsh-Daga/lattice
cd lattice
uv sync          # Install all deps
uv run pytest    # 1584 tests, 7 skipped

# Lint + typecheck
uv run ruff check src/
uv run mypy src/lattice/

# Run benchmarks
uv run python benchmarks/evals/cli.py --suite all \
  --providers ollama-cloud \
  --provider-model ollama-cloud=kimi-k2.6:cloud \
  --iterations 1 --warmup 0 --provider-warmup 0

Documentation

Section Documents
Getting Started Quick Start · Installation · CLI
Concepts Architecture · Proxy · SDK · Observability · Safety
Novel Tech TACC · Binary Framing · Delta Encoding · Streaming · Batching & Speculation
Compression Transforms · Caching · Protocol
Providers 17 Providers
Evaluation Benchmarks
Operations Agent Integrations

Full Documentation Index


License

MIT © Harsh Daga

GitHub · Issues · PyPI · Changelog · Contributing
