
leanctx

Drop-in prompt compression for production LLM applications. Cut your LLM token bill by 40–60% without changing your code.

# before
from openai import OpenAI

# after
from leanctx import OpenAI  # same interface, compressed requests

Compression runs locally on open-source models or through your own configured LLM. No API keys to anyone but your existing provider. Your prompts and user data never leave your infrastructure by default.


Status: v0.3 is feature-complete on main — OTel observability across 12 wrapper paths and a reproducible leanctx bench CLI, on top of v0.1 (drop-in wrappers + LLMLingua-2/SelfLLM compression) and v0.2 (multi-provider SelfLLM + block-aware Lingua). Track progress in the roadmap below.

Who this is for

You're building a production LLM app and your token bill is a line item:

  • RAG apps with large retrieved documents
  • Long-running conversational agents
  • LangChain / LangGraph / CrewAI workflows with growing tool chains
  • Document-processing pipelines
  • Anything where input tokens accumulate and you pay for every one

If your code calls a hosted LLM API in production and input tokens are a meaningful share of the bill, this is for you.

How it works

Three compression modes, one config switch:

  • local — runs Microsoft's open-source LLMLingua-2 on your hardware. Zero marginal cost.
  • self_llm — lets your own configured LLM do the compression. Highest quality.
  • hybrid (default) — routes by content type: code stays verbatim, prose goes through LLMLingua-2, long important spans fall back to self_llm.

Content-aware routing means code blocks, diffs, stack traces, and tool schemas are preserved verbatim — no corrupted syntax.

from leanctx import OpenAI

client = OpenAI(leanctx_config={
    "mode": "on",
    "trigger": {"threshold_tokens": 2000},
    "routing": {
        "code":           "verbatim",   # never touch code
        "error":          "verbatim",   # never touch stack traces
        "prose":          "lingua",     # local LLMLingua-2
        "long_important": "selfllm",    # cheap LLM summarization
    },
    "lingua":  {"ratio": 0.5, "device": "cpu"},
    "selfllm": {"model": "gpt-4o-mini", "api_key": "sk-...", "ratio": 0.3},
})

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=1024,
    messages=[{"role": "user", "content": long_document}],
)

# Compression telemetry attached to the response
print(response.usage.leanctx_tokens_saved)
print(response.usage.leanctx_ratio)

Real compression numbers

Coding-agent workload (the main use case)

A realistic 9-message agent transcript — user question, file reads, grep, log dumps, failed edit, error trace — totaling ~2.1K tokens. Run through leanctx.Anthropic with mode="on" and content-aware routing (code → verbatim, errors → verbatim, prose → Lingua):

Metric   Before   After   Reduction
Tokens   2148     1384    35.6%
Chars    7898     5701    27.8%

Tokens saved per request: 768
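
The wrapper setup behind a run like this is a small variation of the OpenAI example above. A minimal sketch, assuming leanctx.Anthropic accepts the same leanctx_config keys (the model choice and the agent_transcript variable are illustrative):

from leanctx import Anthropic  # same interface as anthropic.Anthropic

client = Anthropic(leanctx_config={
    "mode": "on",
    "routing": {
        "code":  "verbatim",   # source files stay byte-identical
        "error": "verbatim",   # tracebacks stay byte-identical
        "prose": "lingua",     # logs and reasoning go through LLMLingua-2
    },
})

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    messages=agent_transcript,  # the 9-message transcript described above
)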

What got preserved verbatim (asserted programmatically; see the sketch after this list):

  • ✅ A 2 KB Python source file inside a tool_result block — byte-identical
  • ✅ A Python traceback in an is_error tool result — byte-identical
  • ✅ Every tool_use_id and the name / input of every tool_use block — so tool linkage and tool calls don't break
  • ✅ edit_file's new_str argument — so the actual code edit isn't rewritten
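
A rough sketch of what those checks look like, not the project's actual test code: before and after stand for the message list pre- and post-compression, and the tool_use_ids are made up.

def tool_result_content(messages, tool_use_id):
    """Return the content of the tool_result block that answers tool_use_id."""
    for msg in messages:
        for block in msg.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_result" \
                    and block.get("tool_use_id") == tool_use_id:
                return block["content"]
    raise KeyError(tool_use_id)

# The Python source file and the traceback must survive byte-identically.
assert tool_result_content(after, "toolu_read_file") == tool_result_content(before, "toolu_read_file")
assert tool_result_content(after, "toolu_failed_edit") == tool_result_content(before, "toolu_failed_edit")

# Every tool_use block keeps its id, name, and input, so tool linkage stays intact.
def tool_uses(messages):
    return [
        (b["id"], b["name"], b["input"])
        for m in messages
        for b in m.get("content", [])
        if isinstance(b, dict) and b.get("type") == "tool_use"
    ]

assert tool_uses(after) == tool_uses(before)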

What actually compressed:

  • A 3.4 KB log dump shrank to 1.9 KB (45% reduction) — the legitimate compression target
  • A grep result and prose reasoning blocks shrank by 30-50%

Reproducible: python scripts/integration_test_agent_workload.py — runs the real LLMLingua-2 model, takes ~30s on Apple Silicon, no API key required.

SelfLLM cross-provider comparison

The same 1.7 KB SRE-incident document, run through SelfLLM against each provider's cheapest tier:

Provider    Model              Compression   Latency   Cost per call
Anthropic   claude-haiku-4-5   41.6%         3.05s     ~$0.0016
OpenAI      gpt-4o-mini        49.1%         6.42s     ~$0.0003
Gemini      gemini-2.5-flash   48.7%         2.25s     ~$0.0001

All three preserved every timestamp, metric value, and action item with no hallucination. Combined with Lingua (LLMLingua-2 local) hitting 44.7% char reduction on the same document at zero marginal cost, leanctx covers the full speed/cost/quality trade-off space.
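
Because the SelfLLM target lives entirely in leanctx_config, moving between these providers is a config change rather than a code change. A sketch reusing the keys from the earlier example; how the selfllm block identifies the provider for a non-OpenAI model is not spelled out here, so treat the model line as illustrative and check the docs for the exact keys:

from leanctx import OpenAI

client = OpenAI(leanctx_config={
    "mode": "on",
    "routing": {"long_important": "selfllm"},  # only long, important spans hit the LLM
    "selfllm": {"model": "gemini-2.5-flash", "api_key": "...", "ratio": 0.3},
})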

Full methodology, per-provider output samples, cost analysis, and bugs we found in flight: docs/benchmarks/.

Observability (v0.3)

leanctx emits OpenTelemetry spans and metrics for every compression call, opt-in via leanctx_config. The library is API-only: it never owns the OTel SDK or registers providers. The application configures OTel; leanctx emits.

import leanctx

client = leanctx.Anthropic(
    leanctx_config={
        "mode": "on",
        "observability": {"otel": True},
    }
)
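
Because leanctx only talks to the OTel API, nothing is exported until the application installs an SDK. A minimal app-side setup with console exporters, using the standard opentelemetry-sdk packages (illustration only; swap in your real exporters):

from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# The application owns the SDK; leanctx never registers providers itself.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)

meter_provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]
)
metrics.set_meter_provider(meter_provider)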

Every wrapper-routed call produces one root leanctx.compress span with provider, method, input_tokens, output_tokens, cost_usd, and duration_ms, plus per-compressor child spans for granular tracing. Five metrics (4 counters + 1 histogram) are recorded with provider/method/status labels.

See docs/observability.md for the full attribute reference, span lifetime contract for streaming paths, sample app-side OTel SDK setup, and the closed leanctx.method taxonomy.

Reproducible benchmarks (v0.3)

The leanctx bench CLI runs the offline integration scenarios with deterministic input and emits versioned JSON records:

leanctx bench list                                  # registered scenarios
leanctx bench run lingua-local --workload rag       # offline lingua compression
leanctx bench run agent-structural --workload agent # 5 structural-integrity invariants enforced
leanctx bench run anthropic-e2e --workload chat     # full stack, respx-mocked Anthropic
leanctx bench run selfllm-anthropic --workload rag  # live API (requires ANTHROPIC_API_KEY)

Output is one JSON record per run with schema_version: "1" and a documented field set so downstream tooling can consume it.
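
That makes the records easy to post-process without touching leanctx itself. A sketch of such a consumer; apart from schema_version, the field names below are hypothetical placeholders, so defer to the documented schema:

import json
import sys

# Print a one-line summary per bench record passed on the command line.
for path in sys.argv[1:]:
    with open(path) as f:
        record = json.load(f)
    if record.get("schema_version") != "1":
        raise ValueError(f"{path}: unexpected schema_version {record.get('schema_version')!r}")
    # "scenario" and "workload" are illustrative field names, not the real schema.
    print(path, record.get("scenario"), record.get("workload"))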

Roadmap

  • v0.1 — Python SDK, drop-in Anthropic/OpenAI/Gemini wrappers, local (LLMLingua-2) + self_llm (Anthropic), content classifier, router, dedup + purge-errors strategies, LangChain format helpers, Docker image
  • v0.2 — self_llm on OpenAI + Gemini, block-aware compression (tool_use / tool_result preserved through Lingua), Gemini contents normalization (middleware actually runs), LangChain LCEL compress_runnable
  • v0.3.0 PyPI publish — released 2026-04-26 (pypi.org/project/leanctx/0.3.0)
  • v0.3 — OTel observability (API-only spans + metrics on every compression call across 12 wrapper paths), leanctx bench CLI (6 named scenarios + versioned JSON schema)
  • v0.3.x — ghcr.io Docker publish workflow, OpenAI responses-API intercept, multimodal + function-call compression for Gemini, LlamaIndex helpers, TypeScript SDK compression port
  • v0.4 — Helm chart, Kubernetes sidecar proxy deployment, stateful session dedup with explicit session IDs

Install

# From PyPI:
pip install leanctx
pip install 'leanctx[anthropic,openai,gemini]'  # pick your providers
pip install 'leanctx[lingua]'                   # + LLMLingua-2 local compression
pip install 'leanctx[all]'                      # everything

# Or from source (main branch):
pip install git+https://github.com/jia-gao/leanctx.git

Docker images:

docker build -t leanctx:slim .                             # 341 MB, all provider SDKs
docker build -t leanctx:lingua --build-arg LINGUA=true .   # + LLMLingua-2, ~3 GB

Supported providers

Provider    Drop-in client                            Streaming   Compression applied   SelfLLM target
Anthropic   leanctx.Anthropic / AsyncAnthropic
OpenAI      leanctx.OpenAI / AsyncOpenAI
Gemini      leanctx.Gemini (.models + .aio.models)                ✅ *

* Gemini: text-only requests compress fully. Requests that include function_call, function_response, or multimodal (inline_data) parts automatically bail out to passthrough — we never rewrite tool-call payloads, as that would change tool semantics. Multimodal and function-calling compression is planned for v0.3.x (see the roadmap).
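
On the text-only path the wrapper is used like the google-genai client it wraps. A minimal sketch, assuming the import and leanctx_config mirror the other wrappers:

from leanctx import Gemini  # assumed to mirror google-genai's Client surface

client = Gemini(leanctx_config={"mode": "on"})

# Text-only request: compression applies.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=long_report,  # plain text, e.g. a retrieved document
)

# A request containing function_call, function_response, or inline_data parts
# would pass through uncompressed, as described above.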

Architecture

your code
   ↓
leanctx.Anthropic / OpenAI / Gemini
   ↓
Middleware
   ├── Strategies (deterministic, no LLM):
   │     DedupStrategy, PurgeErrorsStrategy
   └── Per-message pipeline:
         classify → router → compressor
   ↓
Compressor:  Verbatim | Lingua (LLMLingua-2) | SelfLLM (your LLM)
   ↓
real Anthropic / OpenAI / Gemini SDK → API

License

MIT. See LICENSE.

