Drop-in compression for LLM agent tool outputs. Shrink bloated HTML/JSON/log results before they re-enter context — cut tokens, stay on-task, keep full output retrievable.

These details have not been verified by PyPI

Project links

Project description

tooltrim

Drop-in compression for LLM agent tool outputs. Shrink bloated tool results — fetched web pages, paginated JSON, log dumps, CSV exports, long documents — before they re-enter your agent's context window. Keep the facts the model needs, drop the boilerplate, and keep the full output one expand() away.

from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def web_fetch(url: str) -> str:
    ...                      # returns a 3,000-token HTML page
# your agent now receives a compact, on-topic extract instead

Zero dependencies in the core. Pure-stdlib, deterministic, reproducible.
Provider-agnostic. Works with OpenAI, Anthropic, local models, LangChain, LlamaIndex, raw function-calling — anything. It compresses strings, not APIs.
Lossless by reference. Compression is extractive, and the full output stays retrievable via a short ref — so it's compression plus retrieval, not blind truncation.
Content-aware. Separate compressors for HTML, JSON, tabular data, logs, and free text. Optionally query-aware (BM25) to keep what the agent is actually looking for.
Faithfulness-tested. A built-in harness measures whether the model still answers correctly on compressed output (with Wilson 95% CIs) — not just how many tokens you saved.
Deploy as a proxy. An OpenAI-compatible compression proxy trims role:"tool" messages in flight, so any app/language adopts it with zero code changes — just a base_url.

Why

In a real agent loop, the prompt isn't what blows up your context — tool outputs are. A single web_fetch returns thousands of tokens of nav bars and footers; a REST call returns a 300-item paginated array; a log tool dumps 10,000 lines of INFO heartbeat. And because the agent's transcript is replayed on every turn, you pay for that bloat again and again — slower responses, higher bills, and a model that loses the thread.

Routers, caches, and prompt compressors don't touch this. tooltrim targets the tool output directly, at the exact point it enters context.

Benchmark

Realistic tool outputs compressed to a 400-token budget, exact tiktoken (cl100k_base) counts. Each output contains one planted fact ("needle") that the agent needs; tooltrim is given the task as its relevance query. Reproduce with benchmark.py.

Tool output	before	after	saved	needle kept
Web page (HTML)	2,816	13	99.5%	yes
REST response (JSON)	15,119	325	97.9%	yes
Server logs	7,606	390	94.9%	yes
CSV export	7,895	373	95.3%	yes
Long document (text)	6,139	10	99.8%	yes
Total	39,575	1,111	97.2%	5/5

39,575 → 1,111 tokens — a 35.6× smaller context, with the relevant fact kept in every case. (HTML/text collapse to the matching passage when the query pinpoints it; structured types keep a representative, schema-preserving sample.)

Does compression lose information? (it can help)

Throwing away 99% of the tokens is only safe if the model still answers correctly. We measure that directly: for 62 curated (tool output, question, gold answer) cases across all five content types — including multi-fact cases (the answer needs several facts from different parts of the output) and distractor cases (a deprecated value sits next to the current one) — a model is asked the question twice: once on the full output, once on the tooltrim-compressed output. Accuracy is reported with Wilson 95% confidence intervals. Reproduce with run_faithfulness.py — it runs offline by default (no API key) and has adapters for Claude / OpenAI / Groq / Ollama.

On small local models, compression doesn't just preserve accuracy — it improves it, because the model is no longer distracted by thousands of tokens of noise. The effect reproduces across two independent model families:

model	full	@128 (−98.6%)	@256 (−97.3%)	@400 (−96.5%)
`mistral:7b`	13% [7–23%]	84% [73–91%]	81% [69–89%]	82% [71–90%]
`llama3.1:8b`	23% [14–34%]	73% [60–82%]	66% [54–77%]	66% [54–77%]

The compressed intervals don't overlap the full-context intervals — at n=62 this is a significant improvement for both models, not noise. Full provenance, per-case answers, and the cross-model table are saved as citable artifacts under benchmarks/runs/ and benchmarks/COMPARISON.md.

Stated plainly: these are small 7–8B models. A frontier long-context model handles the full context far better, so its baseline is higher and the accuracy uplift shrinks — but the token/cost savings remain. The uplift is largest for smaller/cheaper models and longer contexts. The harness is wired so a frontier run (--model claude) drops a new row into the same table when an API key is available; n=62 is a pilot, which is why the CIs are reported.

Install

pip install tooltrim          # zero-dependency core (heuristic token counts)
pip install tooltrim[tokens]  # add tiktoken for exact token counts

Usage

1. Decorate a tool

from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def read_file(path: str) -> str:
    return open(path).read()

2. Make it query-aware

Pull the relevance query from the call arguments…

@compressed_tool(max_tokens=400, query_from=lambda query, **_: query)
def web_search(query: str) -> str:
    ...

…or set the agent's current goal ambiently, so every tool call this turn keeps what's relevant to it:

from tooltrim import query_scope

with query_scope("find the customer's refund status"):
    result = run_agent_step()   # all @compressed_tool calls inside use this query

3. Imperative API + expand-on-demand

from tooltrim import ToolCompressor

tc = ToolCompressor(max_tokens=400)
res = tc.compress(huge_json_response, query="refund status for customer C-1007")

res.text             # compact text to feed back to the model
res.saved_tokens     # e.g. 14794
res.saved_ratio      # e.g. 0.979
res.ref              # e.g. "a1b2c3d4"

full = tc.expand(res.ref)                    # get the original back
slice_ = tc.expand(res.ref, start=0, length=2000)

By default the compressed output ends with a small footer the model can act on:

…compressed extract…

[tooltrim: compressed 15119->325 tokens (saved 14794); full output ref=a1b2c3d4]

Expose an expand(ref) tool to your agent and it can pull the full output back whenever the extract isn't enough — turning aggressive compression into a safe default. tooltrim hands you both the tool schema and the handler:

tools = my_tools + [tc.expand_tool_spec(style="openai")]   # or style="anthropic"

# when the model calls expand_tool_output(ref=..., start=..., length=...):
result_text = tc.handle_expand(ref, start=start, length=length)   # paged, safe

See examples/04_expand_tool.py for a full wiring. Extractive compressors also keep neighbor context (a line/sentence around each match) so the model gets context, not just the bare matching line.

4. Optional: LLM distillation (any provider)

The deterministic compressors need no LLM. When you want summarization instead of extraction, plug in any model with a one-line completion function — use a small/cheap one; distilling 15k → 300 tokens once saves your expensive model from re-reading the blob every turn.

from tooltrim import LLMDistiller

def complete(prompt: str) -> str:
    # wrap OpenAI / Anthropic / local — your choice
    return my_client.responses(prompt)

distiller = LLMDistiller(complete, max_tokens=300)
summary = distiller.compress(huge_output, query="refund status")

5. Drop into LangChain — one line per tool

Already have LangChain tools? Wrap any of them and you get back a tool with the same name, description, and argument schema, so the agent calls it unchanged — but its (string) output is compressed before it lands in the scratchpad. The relevance query comes from the tool's own arguments.

pip install tooltrim[langchain]

from tooltrim.integrations import compress_langchain_tool, compress_langchain_tools

fetch = compress_langchain_tool(my_tool, max_tokens=400,
                                query_from=lambda query, **_: query)

# or wrap the whole toolset at once (sharing one compressor + expand store):
tools = compress_langchain_tools(my_tools, max_tokens=400)

See examples/03_langchain_tool.py.

6. Or run it as a proxy — zero code changes

Point your client at the tooltrim proxy; every tool result is compressed (using the latest user message as the relevance query) before being forwarded upstream. Both wire formats are understood, routed by request path — you only change base_url.

python run_proxy.py --upstream https://api.openai.com/v1     # OpenAI-compatible
python run_proxy.py --upstream https://api.anthropic.com/v1  # Claude

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8800/v1", api_key="<upstream key>")

from anthropic import Anthropic
client = Anthropic(base_url="http://127.0.0.1:8800")

/v1/chat/completions compresses OpenAI role:"tool" messages; /v1/messages compresses Anthropic tool_result blocks. The proxy is stdlib-only and fails open: if anything goes wrong it forwards the original request untouched, so it never breaks a production call.

Online, it also keeps you under provider rate limits. Against a live hosted model (Groq free tier, 6,000-tokens-per-request cap), 45% of raw tool outputs are rejected (HTTP 413) but 100% of tooltrim-compressed calls fit — a 14,415-token result is compressed to 26 tokens in flight and the call succeeds. See benchmarks/ONLINE_GROQ.md.

7. Scale out — shared expand-store + metrics

The default expand-store is in-process, fine for one worker. To run multiple workers/replicas behind a load balancer, the store must be shared — otherwise a ref minted by one worker can't be expanded by another. Swap in a backend (all are content-addressed, so writes dedup automatically):

from tooltrim import ToolCompressor, FileStore, RedisStore, S3Store

tc = ToolCompressor(store=FileStore("/mnt/shared/tooltrim"))         # zero-dep, shared volume
tc = ToolCompressor(store=RedisStore(url="redis://cache:6379/0",     # pip install tooltrim[redis]
                                     ttl_seconds=86_400))
tc = ToolCompressor(store=S3Store(bucket="my-bucket"))               # pip install tooltrim[s3]

The proxy exposes Prometheus metrics at GET /metrics (tokens in/out/saved, messages compressed, fail-open count, upstream errors, latency) — scrape it to quantify savings fleet-wide:

tooltrim_tokens_saved_total 14389
tooltrim_messages_compressed_total 1
tooltrim_fail_open_total 0

How it works

Pass-through if the output already fits the budget (zero overhead).
Detect the content type (JSON / HTML / tabular / logs / text).
Compress with a type-specific strategy:
- JSON — preserve structure; sample arrays (keeping the key schema), note (+N more items), truncate long strings; tighten until it fits.
- HTML — extract readable text (drop script/style/nav/footer), then fit the budget.
- Tabular — keep the header + a sample of rows + (+N more rows).
- Logs — collapse repeated lines (x42), always keep errors/warnings, fill with head/tail context.
- Text — query-aware extractive selection (BM25), with […] elisions.
Stash the full output under a content-addressed ref for expand().

With a query, every compressor keeps the most relevant parts; without one, it falls back to structure-preserving head/tail selection.

How it's different

Tool class	What it optimizes	tooltrim
Routers (RouteLLM…)	which model gets the call	orthogonal
Semantic caches	repeated identical calls	orthogonal
Prompt compressors (LLMLingua)	the prompt/instructions	different target
Memory frameworks (MemGPT…)	conversation history, as a framework you adopt	tooltrim is a drop-in on the tool boundary

tooltrim targets the tool-output boundary — the largest and most-ignored token sink in agentic apps — and works alongside all of the above.

Status

v0.1 — deterministic zero-dependency core, 79-test suite, reproducible token + faithfulness benchmarks (with Wilson CIs, cross-model), a proxy speaking both OpenAI and Anthropic wire formats with Prometheus /metrics, a LangChain adapter, pluggable File/Redis/S3 expand-stores for horizontal scale, and citable run artifacts under benchmarks/.

Roadmap: PyPI release + tooltrim CLI, frontier-model faithfulness runs, embedding-based relevance, streaming compression, and native LlamaIndex / OpenAI-Agents wrappers.

Contributions and benchmark cases welcome. MIT licensed.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jul 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tooltrim-0.1.0.tar.gz (46.2 kB view details)

Uploaded Jul 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tooltrim-0.1.0-py3-none-any.whl (37.8 kB view details)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file tooltrim-0.1.0.tar.gz.

File metadata

Download URL: tooltrim-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 46.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tooltrim-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`766a5798eee158ff26f9d830e530abd0fcbe81bf652ff383bc179aa7fc1c9162`
MD5	`cac06ebd4af69fc0262041eed47c8ff5`
BLAKE2b-256	`3871c1e4de87aa74880cf4a8543f5fbcf3b798a5cfa017978e7f45e1c9955fb4`

See more details on using hashes here.

File details

Details for the file tooltrim-0.1.0-py3-none-any.whl.

File metadata

Download URL: tooltrim-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 37.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tooltrim-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`05e6383f68c723e642de68bbaa00196d2d4846f120fc33ca20445c303d28756c`
MD5	`0c36b51063c95fd1b2291e6240b35b7d`
BLAKE2b-256	`4a889f2ec766a9616d6f92f8c60905536ee76e420f45c5ed384dc829f27fe4c9`

See more details on using hashes here.

tooltrim 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

tooltrim

Why

Benchmark

Does compression lose information? (it can help)

Install

Usage

1. Decorate a tool

2. Make it query-aware

3. Imperative API + expand-on-demand

4. Optional: LLM distillation (any provider)

5. Drop into LangChain — one line per tool

6. Or run it as a proxy — zero code changes

7. Scale out — shared expand-store + metrics

How it works

How it's different

Status

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes