Deterministic, type-aware reduction of agent tool outputs at the source. Cut LLM token cost without making the agent do less.
Trim the tool output your AI agent re-sends every turn. Keep the signal, drop the noise.
AI agents re-send every tool result (logs, JSON, diffs, stack traces, HTML) to the model on every turn, and most of it is redundancy you pay for again and again. LeanContext sits between your agent and the model and reduces those payloads to their signal: deterministically, with a fidelity score on every reduction, and without ever breaking the agent.
```python
from leancontext import reduce

@reduce
def search_logs(query: str) -> str:
    return run_log_search(query)  # ~10k tokens of logs in, ~1k out, error lines kept
```
See it
```console
$ python bench.py
sample                    kind        before   after   saved   fidelity
-----------------------------------------------------------------
log (incident)            log          52642     100    100%       100%
json (RAG chunks)         json          1862    1391     25%       100%
html (web fetch)          html          1672    1093     35%       100%
diff (patch)              diff           639      81     87%       100%
stacktrace                stacktrace     896      94     90%       100%
-----------------------------------------------------------------
TOTAL                                  57711    2759     95%
```
Counts above use the built-in heuristic tokenizer (≈4 characters per token). Install the
tiktoken extra for exact token counts from a real model tokenizer; the ratios come out
similar (~92% on this sample), and the reduced text is identical either way.
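The percentages follow from simple arithmetic on the token counts. A minimal sketch, assuming the heuristic tokenizer divides the character count by four and rounds up (the rounding rule is a guess); `heuristic_tokens` and `saved_fraction` are illustrative names, not part of LeanContext's API.

```python
import math


def heuristic_tokens(text: str) -> int:
    """Approximate token count at ~4 characters per token (rounding up is an assumption)."""
    return math.ceil(len(text) / 4)


def saved_fraction(before: int, after: int) -> float:
    """Fraction of tokens removed by a reduction; 0.0 for an empty input."""
    return 0.0 if before <= 0 else 1 - after / before


print(heuristic_tokens("x" * 4000))            # 1000
print(f"{saved_fraction(57_711, 2_759):.0%}")  # 95%  (TOTAL row)
print(f"{saved_fraction(639, 81):.0%}")        # 87%  (diff row)
```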
A real incident log, before and after:
```text
# before (902 lines)
2026-06-21T09:00:01Z INFO  [gateway] req id=a1 path="/v1/render" status=200 ms=12
... 900 near-identical INFO lines ...
2026-06-21T09:10:43Z FATAL [render] OOM killed worker=7 doc="deck-8842"   root cause

# after
2026-06-21T09:00:01Z INFO  [gateway] req id=a1 path="/v1/render" status=200 ms=12  ⟪×900 similar⟫
2026-06-21T09:10:43Z FATAL [render] OOM killed worker=7 doc="deck-8842"   root cause
```
The redundant lines collapse to a count. The FATAL line that explains the crash is kept intact.
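One plausible way to produce that collapsed output is to key each line on a normalized template and fold consecutive matches into a count. This is a hypothetical sketch, not LeanContext's log reducer: masking digit runs is an assumed similarity rule, the protected log levels are an assumption, and `collapse_log` is an illustrative name.

```python
import re
from typing import List, Optional

# Assumption: lines at these levels are never collapsed.
KEEP = re.compile(r"\b(ERROR|FATAL|WARN(?:ING)?|CRITICAL)\b")
DIGITS = re.compile(r"\d+")


def template(line: str) -> str:
    """Crude similarity key: mask every run of digits (timestamps, ids, latencies)."""
    return DIGITS.sub("#", line)


def collapse_log(text: str) -> str:
    out: List[str] = []
    run_key: Optional[str] = None  # template of the current collapsible run
    run_first = ""                 # first line of the run, kept verbatim
    run_extra = 0                  # number of later lines folded into it

    def flush() -> None:
        if run_key is not None:
            suffix = f" ⟪×{run_extra} similar⟫" if run_extra else ""
            out.append(run_first + suffix)

    for line in text.splitlines():
        if KEEP.search(line):
            flush()
            run_key = None
            out.append(line)  # error-level lines always survive verbatim
            continue
        key = template(line)
        if key == run_key:
            run_extra += 1
        else:
            flush()
            run_key, run_first, run_extra = key, line, 0
    flush()
    return "\n".join(out)


lines = [f"2026-06-21T09:00:0{i}Z INFO [gateway] req id=a{i} status=200" for i in range(1, 6)]
lines.append("2026-06-21T09:10:43Z FATAL [render] OOM killed worker=7")
print(collapse_log("\n".join(lines)))
```

Only consecutive runs are folded, so the order of events survives. Digit masking is deliberately naive: request IDs that differ in letters would not match here, which a production reducer would need to handle.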
Why it works
Model API calls are the bulk of an agent's cost, and most of that spend is input tokens. A tool result added on one turn is re-sent on every later turn, so the bill grows with the length of the conversation, not just with the work done. Those payloads are mostly repetition. LeanContext keeps the errors, anomalies, and identifiers, and collapses the rest.
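A small worked example of why the bill grows with conversation length. If a tool result enters the context before model call k (counting from 0) of n calls, it is billed n - k times, so total input grows roughly quadratically with the number of turns. The sketch assumes no prompt caching and counts only tool results; the 10k and 1k figures echo the decorator example at the top and are illustrative.

```python
from typing import List


def billed_input_tokens(result_tokens: List[int]) -> int:
    """Total input tokens billed when every tool result is re-sent on each later call.

    result_tokens[k] is the size of the tool result added before model call k (0-based).
    Ignores prompt caching, the system prompt, and model output (simplifying assumptions).
    """
    n = len(result_tokens)
    # A result added before call k is part of the input on calls k, k+1, ..., n-1.
    return sum(t * (n - k) for k, t in enumerate(result_tokens))


raw = billed_input_tokens([10_000] * 10)  # ten calls, one 10k-token result each
lean = billed_input_tokens([1_000] * 10)  # the same calls after ~90% reduction
print(raw, lean)  # 550000 55000
```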
How it compares
| | LeanContext | LLM-based compressor | Wire-level proxy |
|---|---|---|---|
| No model in the reduction path | ✓ | ✗ | varies |
| Deterministic | ✓ | ✗ | varies |
| Prompt-cache safe | ✓ | often ✗ | often ✗ |
| Type-aware (keeps error lines) | ✓ | ✗ | ✗ |
| Fidelity score per reduction | ✓ | ✗ | ✗ |
| Added latency / cost | none | a model call | a network hop |
Install
```shell
pip install leancontext                    # core, standard library only
pip install "leancontext[integrations]"    # openai, anthropic, litellm, fastapi adapters
pip install "leancontext[otel]"            # OpenTelemetry metrics
pip install "leancontext[mcp]"             # MCP server
pip install "leancontext[tiktoken]"        # exact token counts (used automatically when present)
```
Use it
Three levels, one core. Every path fails open: if anything goes wrong, you get the original text back.
```python
import leancontext

clean = leancontext.reduce(tool_output).text  # 1) manual

@leancontext.reduce                           # 2) decorator, one line per tool
def search_logs(q: str) -> str:
    ...

tools = leancontext.wrap(tools)               # 3) wrap all tools, or an SDK client
client = leancontext.wrap(openai_client)      #    (wrap_anthropic / wrap_gemini too)
```
Every reduction is inspectable:
```python
r = leancontext.reduce(tool_output)
r.text                         # what to send to the model
r.tokens_before, r.tokens_after
r.ratio                        # fraction saved
r.fidelity                     # 0..1 signal preserved
```
Integrations
| Surface | How |
|---|---|
| Decorator / tools | @leancontext.reduce, leancontext.wrap(tools) |
| OpenAI / Anthropic / Gemini SDK | wrap_openai(c), wrap_anthropic(c), wrap_gemini(c) |
| LiteLLM (proxy) | callbacks: leancontext.integrations.litellm.proxy_handler_instance |
| LiteLLM (SDK) | import leancontext.integrations.litellm as ll; ll.patch() |
| Standalone proxy | from leancontext.integrations.proxy import create_app (OpenAI-compatible, any language) |
| Messages | leancontext.reduce_messages(messages) (OpenAI, Anthropic, Gemini) |
| Telemetry | import leancontext.integrations.otel as o; o.instrument() |
| Anthropic native | wrap_anthropic_native(client, ...) composes with clear_tool_uses context editing |
| Frameworks | LangChain, LangGraph, Agno via wrap(tools); any framework via @reduce on tool functions (sync or async) |
| MCP server | python -m leancontext.integrations.mcp_server (reduce / expand / stats over stdio) |
CI exercises OpenAI (chat + Responses), Anthropic, LiteLLM, the standalone proxy, OpenTelemetry, and the MCP server against the real packages. Message reduction for all formats (including Gemini) is unit-tested directly. The framework adapters (LangChain / LangGraph / Agno) and the SDK-level Gemini client wrapper are provided best-effort and are not yet covered in CI against the live SDKs.
Reducers
| Kind | What it does |
|---|---|
| log | Collapse near-identical lines, keep every error, anomaly, and unique line verbatim |
| json | Factor repeated keys out once, lay values out columnar (near-lossless) |
| diff | Keep all change, hunk, and header lines, collapse unchanged context |
| stacktrace | Keep the exception and boundary frames, collapse the deep middle |
| html | Strip tags, scripts, and styles, keep visible text and links |
| table | Collapse whitespace-aligned command-line tables, keep header and data |
Any other kind of payload passes through unchanged, as does any payload that is below the size threshold or whose reduction misses the saving or fidelity threshold.
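A rough picture of the json row in the table above: when an array holds objects with identical keys, the keys can be written once and the values laid out as rows. This sketch assumes flat objects with identical key sets and is not LeanContext's actual encoding; `to_columnar` and `from_columnar` are hypothetical helpers, and the round trip is how you would check that the rewrite lost nothing.

```python
import json
from typing import Any, Dict, List


def to_columnar(records: Any) -> Any:
    """Write shared keys once: [{k: v, ...}, ...] -> {"columns": [...], "rows": [[...], ...]}.

    Returns the input unchanged unless it is a non-empty list of dicts that all
    share exactly the same keys (assumed condition for a lossless rewrite).
    """
    if not isinstance(records, list) or not records:
        return records
    if not all(isinstance(r, dict) for r in records):
        return records
    columns = list(records[0])
    if any(set(r) != set(columns) for r in records):
        return records
    return {"columns": columns, "rows": [[r[c] for c in columns] for r in records]}


def from_columnar(table: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Inverse of to_columnar, used to check that no value was lost."""
    return [dict(zip(table["columns"], row)) for row in table["rows"]]


chunks = [{"id": i, "source": "kb", "score": 90 - i} for i in range(5)]
packed = to_columnar(chunks)
print(json.dumps(packed))
print(len(json.dumps(chunks)), "->", len(json.dumps(packed)), "characters")
```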
How it works
Each tool output flows through fail-open gates (hash, size check, type detection, the typed reducer, then a saving and fidelity check) and returns either the reduced text or the original. Results are cached by content hash, so a payload re-sent across turns is reduced only once. See docs/ARCHITECTURE.md for diagrams.
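The gate sequence reads naturally as straight-line code. The sketch below is an illustration under assumptions, not LeanContext's internals: `run_pipeline`, `detect_kind`, and `Result` are hypothetical names, type detection is a toy, savings are measured in characters rather than tokens, `min_size=200` is an arbitrary choice, and the other two thresholds simply mirror the values in the Configuration example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A reducer returns (reduced_text, fidelity), with fidelity in 0..1 (assumed shape).
Reducer = Callable[[str], Tuple[str, float]]


@dataclass(frozen=True)
class Result:
    text: str
    reduced: bool


_cache: Dict[str, Result] = {}  # unbounded, keyed on content only (simplification)


def detect_kind(text: str) -> str:
    """Toy type detection: git diffs are 'diff', everything else is treated as 'log'."""
    return "diff" if text.startswith("diff --git") else "log"


def run_pipeline(text: str, reducers: Dict[str, Reducer], min_size: int = 200,
                 min_saving: float = 0.1, min_fidelity: float = 0.85) -> Result:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()  # gate 1: content hash
    if key in _cache:
        return _cache[key]  # a payload re-sent on later turns is reduced only once
    result = Result(text, False)  # default outcome: pass the original through
    try:
        if len(text) >= min_size:  # gate 2: size check
            reducer = reducers.get(detect_kind(text))  # gate 3: type detection
            if reducer is not None:
                reduced, fidelity = reducer(text)  # gate 4: typed reducer
                saving = 1 - len(reduced) / len(text)  # characters as a proxy for tokens
                if saving >= min_saving and fidelity >= min_fidelity:  # gate 5
                    result = Result(reduced, True)
    except Exception:
        pass  # fail open: any error leaves the original text in place
    _cache[key] = result
    return result


def dedupe(text: str) -> Tuple[str, float]:
    """Example reducer: drop exact repeated lines (claims full fidelity for illustration)."""
    return "\n".join(dict.fromkeys(text.splitlines())), 1.0


noisy = "\n".join(["INFO ok"] * 100 + ["FATAL boom"])
print(run_pipeline(noisy, {"log": dedupe}).text)
```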
Cost and telemetry
```python
from leancontext.cost import CostTracker

tracker = CostTracker(model="claude-sonnet-4-6").install()
# ... run your agent ...
tracker.report()  # {tokens_saved, usd_saved, ratio, cache_safe: True}
```
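How `usd_saved` is computed is not documented here; a reasonable reading is tokens saved multiplied by the model's input price. The sketch assumes a flat per-token price and ignores cache-read discounts and output tokens. The $3 per million figure is an assumed price for illustration, not a quote for any particular model, and `usd_saved` here is a standalone helper rather than the tracker's code.

```python
def usd_saved(tokens_saved: int, usd_per_million_input: float) -> float:
    """Dollar value of input tokens that were never sent, at a flat input price."""
    return tokens_saved * usd_per_million_input / 1_000_000


# Benchmark TOTAL row: 57,711 - 2,759 tokens saved per send of that payload.
saved_per_send = 57_711 - 2_759
# Hypothetical: the payload is re-sent on 20 turns, at an assumed $3 per million input tokens.
print(f"${usd_saved(saved_per_send * 20, 3.0):.2f}")  # $3.30
```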
Configuration
```python
leancontext.disable()                                       # global kill switch (or env LEANCONTEXT_DISABLED=1)
leancontext.reduce(x, min_saving=0.1, min_fidelity=0.85)
leancontext.on_reduction(callback)                          # telemetry hook (composable)
leancontext.use_tiktoken("gpt-4o")                          # force a specific model's tokenizer
```
Roadmap
CI-verified LangChain / LlamaIndex / CrewAI / Agno adapters, accurate provider tokenizers by default, and broader Anthropic native interop.
Contributing
Issues and PRs welcome. Run pytest. Reducers are pure functions, str -> (reduced, notes),
and must be deterministic and value-preserving. See AGENTS.md for the design rules.
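As an illustration of that contract (not a reducer from the codebase), here is a toy version of the diff row: a pure function from text to `(reduced, notes)` that keeps every header, hunk marker, and +/- line and trims long runs of unchanged context. `reduce_diff`, the `keep_context` parameter, the list-of-strings shape for `notes`, and the `⟪n unchanged lines⟫` marker are all assumptions.

```python
from typing import List, Tuple


def reduce_diff(text: str, keep_context: int = 1) -> Tuple[str, List[str]]:
    """Pure, deterministic diff reducer: str -> (reduced, notes).

    Keeps headers, hunk markers and +/- lines; a run of unchanged context lines
    longer than 2 * keep_context is trimmed to its edges plus a marker line.
    """
    out: List[str] = []
    notes: List[str] = []
    context: List[str] = []

    def flush() -> None:
        if len(context) > 2 * keep_context:
            dropped = len(context) - 2 * keep_context
            out.extend(context[:keep_context])
            out.append(f" ⟪{dropped} unchanged lines⟫")
            out.extend(context[len(context) - keep_context:])
            notes.append(f"collapsed {dropped} context lines")
        else:
            out.extend(context)
        context.clear()

    for line in text.splitlines():
        if line.startswith(" "):  # unchanged context line in a unified diff
            context.append(line)
        else:                     # header, hunk marker, +/- change, or anything else
            flush()
            out.append(line)
    flush()
    return "\n".join(out), notes


sample = "\n".join([
    "--- a/app.py", "+++ b/app.py", "@@ -1,7 +1,7 @@",
    " import os", " import sys", " import json", " import re",
    "-TIMEOUT = 5", "+TIMEOUT = 30", " run()",
])
reduced, notes = reduce_diff(sample)
print(reduced)
print(notes)  # ['collapsed 2 context lines']
```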
License
Download files
File details
Details for the file leancontext-2.0.7.tar.gz.
File metadata
- Download URL: leancontext-2.0.7.tar.gz
- Upload date:
- Size: 698.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f20ff2ac0b6b692015e86fb35b08ebfb8c85a0bb9f68c9d7b3ef68b8055025ed |
| MD5 | 1ea1a9ad76542d6b7442a20be4a3459d |
| BLAKE2b-256 | cbb5a75b5c71e0db2f04cd96d95ff270a82fcf0064b937c041cf6c5742951d4f |
File details
Details for the file leancontext-2.0.7-py3-none-any.whl.
File metadata
- Download URL: leancontext-2.0.7-py3-none-any.whl
- Upload date:
- Size: 42.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 280abeb47d696401d6f67da0bbaeb931e3c8d96ebf2d24dc049d778d4402ca62 |
| MD5 | 1faf50d14cd61fcc26fb3e426366088d |
| BLAKE2b-256 | 1bc5e5128175e693fbf6456d583f2cdce29507690a345dabaad283fe1ca1cfee |