Skip to main content

Drop-in compression for LLM agent tool outputs. Shrink bloated HTML/JSON/log results before they re-enter context — cut tokens, stay on-task, keep full output retrievable.

Project description

tooltrim

Drop-in compression for LLM agent tool outputs. Shrink bloated tool results — fetched web pages, paginated JSON, log dumps, CSV exports, long documents — before they re-enter your agent's context window. Keep the facts the model needs, drop the boilerplate, and keep the full output one expand() away.

from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def web_fetch(url: str) -> str:
    ...                      # returns a 3,000-token HTML page
# your agent now receives a compact, on-topic extract instead
  • Zero dependencies in the core. Pure-stdlib, deterministic, reproducible.
  • Provider-agnostic. Works with OpenAI, Anthropic, local models, LangChain, LlamaIndex, raw function-calling — anything. It compresses strings, not APIs.
  • Lossless by reference. Compression is extractive, and the full output stays retrievable via a short ref — so it's compression plus retrieval, not blind truncation.
  • Content-aware. Separate compressors for HTML, JSON, tabular data, logs, and free text. Optionally query-aware (BM25) to keep what the agent is actually looking for.
  • Faithfulness-tested. A built-in harness measures whether the model still answers correctly on compressed output (with Wilson 95% CIs) — not just how many tokens you saved.
  • Deploy as a proxy. An OpenAI-compatible compression proxy trims role:"tool" messages in flight, so any app/language adopts it with zero code changes — just a base_url.

Why

In a real agent loop, the prompt isn't what blows up your context — tool outputs are. A single web_fetch returns thousands of tokens of nav bars and footers; a REST call returns a 300-item paginated array; a log tool dumps 10,000 lines of INFO heartbeat. And because the agent's transcript is replayed on every turn, you pay for that bloat again and again — slower responses, higher bills, and a model that loses the thread.

Routers, caches, and prompt compressors don't touch this. tooltrim targets the tool output directly, at the exact point it enters context.

Benchmark

Realistic tool outputs compressed to a 400-token budget, exact tiktoken (cl100k_base) counts. Each output contains one planted fact ("needle") that the agent needs; tooltrim is given the task as its relevance query. Reproduce with benchmark.py.

Tool output before after saved needle kept
Web page (HTML) 2,816 13 99.5% yes
REST response (JSON) 15,119 325 97.9% yes
Server logs 7,606 390 94.9% yes
CSV export 7,895 373 95.3% yes
Long document (text) 6,139 10 99.8% yes
Total 39,575 1,111 97.2% 5/5

39,575 → 1,111 tokens — a 35.6× smaller context, with the relevant fact kept in every case. (HTML/text collapse to the matching passage when the query pinpoints it; structured types keep a representative, schema-preserving sample.)

Does compression lose information? (it can help)

Throwing away 99% of the tokens is only safe if the model still answers correctly. We measure that directly: for 62 curated (tool output, question, gold answer) cases across all five content types — including multi-fact cases (the answer needs several facts from different parts of the output) and distractor cases (a deprecated value sits next to the current one) — a model is asked the question twice: once on the full output, once on the tooltrim-compressed output. Accuracy is reported with Wilson 95% confidence intervals. Reproduce with run_faithfulness.py — it runs offline by default (no API key) and has adapters for Claude / OpenAI / Groq / Ollama.

On small local models, compression doesn't just preserve accuracy — it improves it, because the model is no longer distracted by thousands of tokens of noise. The effect reproduces across two independent model families:

model full @128 (−98.6%) @256 (−97.3%) @400 (−96.5%)
mistral:7b 13% [7–23%] 84% [73–91%] 81% [69–89%] 82% [71–90%]
llama3.1:8b 23% [14–34%] 73% [60–82%] 66% [54–77%] 66% [54–77%]

The compressed intervals don't overlap the full-context intervals — at n=62 this is a significant improvement for both models, not noise. Full provenance, per-case answers, and the cross-model table are saved as citable artifacts under benchmarks/runs/ and benchmarks/COMPARISON.md.

Stated plainly: these are small 7–8B models. A frontier long-context model handles the full context far better, so its baseline is higher and the accuracy uplift shrinks — but the token/cost savings remain. The uplift is largest for smaller/cheaper models and longer contexts. The harness is wired so a frontier run (--model claude) drops a new row into the same table when an API key is available; n=62 is a pilot, which is why the CIs are reported.

Install

pip install tooltrim          # zero-dependency core (heuristic token counts)
pip install tooltrim[tokens]  # add tiktoken for exact token counts

Usage

1. Decorate a tool

from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def read_file(path: str) -> str:
    return open(path).read()

2. Make it query-aware

Pull the relevance query from the call arguments…

@compressed_tool(max_tokens=400, query_from=lambda query, **_: query)
def web_search(query: str) -> str:
    ...

…or set the agent's current goal ambiently, so every tool call this turn keeps what's relevant to it:

from tooltrim import query_scope

with query_scope("find the customer's refund status"):
    result = run_agent_step()   # all @compressed_tool calls inside use this query

3. Imperative API + expand-on-demand

from tooltrim import ToolCompressor

tc = ToolCompressor(max_tokens=400)
res = tc.compress(huge_json_response, query="refund status for customer C-1007")

res.text             # compact text to feed back to the model
res.saved_tokens     # e.g. 14794
res.saved_ratio      # e.g. 0.979
res.ref              # e.g. "a1b2c3d4"

full = tc.expand(res.ref)                    # get the original back
slice_ = tc.expand(res.ref, start=0, length=2000)

By default the compressed output ends with a small footer the model can act on:

…compressed extract…

[tooltrim: compressed 15119->325 tokens (saved 14794); full output ref=a1b2c3d4]

Expose an expand(ref) tool to your agent and it can pull the full output back whenever the extract isn't enough — turning aggressive compression into a safe default. tooltrim hands you both the tool schema and the handler:

tools = my_tools + [tc.expand_tool_spec(style="openai")]   # or style="anthropic"

# when the model calls expand_tool_output(ref=..., start=..., length=...):
result_text = tc.handle_expand(ref, start=start, length=length)   # paged, safe

See examples/04_expand_tool.py for a full wiring. Extractive compressors also keep neighbor context (a line/sentence around each match) so the model gets context, not just the bare matching line.

4. Optional: LLM distillation (any provider)

The deterministic compressors need no LLM. When you want summarization instead of extraction, plug in any model with a one-line completion function — use a small/cheap one; distilling 15k → 300 tokens once saves your expensive model from re-reading the blob every turn.

from tooltrim import LLMDistiller

def complete(prompt: str) -> str:
    # wrap OpenAI / Anthropic / local — your choice
    return my_client.responses(prompt)

distiller = LLMDistiller(complete, max_tokens=300)
summary = distiller.compress(huge_output, query="refund status")

5. Drop into LangChain — one line per tool

Already have LangChain tools? Wrap any of them and you get back a tool with the same name, description, and argument schema, so the agent calls it unchanged — but its (string) output is compressed before it lands in the scratchpad. The relevance query comes from the tool's own arguments.

pip install tooltrim[langchain]
from tooltrim.integrations import compress_langchain_tool, compress_langchain_tools

fetch = compress_langchain_tool(my_tool, max_tokens=400,
                                query_from=lambda query, **_: query)

# or wrap the whole toolset at once (sharing one compressor + expand store):
tools = compress_langchain_tools(my_tools, max_tokens=400)

See examples/03_langchain_tool.py.

6. Or run it as a proxy — zero code changes

Point your client at the tooltrim proxy; every tool result is compressed (using the latest user message as the relevance query) before being forwarded upstream. Both wire formats are understood, routed by request path — you only change base_url.

python run_proxy.py --upstream https://api.openai.com/v1     # OpenAI-compatible
python run_proxy.py --upstream https://api.anthropic.com/v1  # Claude
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8800/v1", api_key="<upstream key>")

from anthropic import Anthropic
client = Anthropic(base_url="http://127.0.0.1:8800")

/v1/chat/completions compresses OpenAI role:"tool" messages; /v1/messages compresses Anthropic tool_result blocks. The proxy is stdlib-only and fails open: if anything goes wrong it forwards the original request untouched, so it never breaks a production call.

Online, it also keeps you under provider rate limits. Against a live hosted model (Groq free tier, 6,000-tokens-per-request cap), 45% of raw tool outputs are rejected (HTTP 413) but 100% of tooltrim-compressed calls fit — a 14,415-token result is compressed to 26 tokens in flight and the call succeeds. See benchmarks/ONLINE_GROQ.md.

7. Scale out — shared expand-store + metrics

The default expand-store is in-process, fine for one worker. To run multiple workers/replicas behind a load balancer, the store must be shared — otherwise a ref minted by one worker can't be expanded by another. Swap in a backend (all are content-addressed, so writes dedup automatically):

from tooltrim import ToolCompressor, FileStore, RedisStore, S3Store

tc = ToolCompressor(store=FileStore("/mnt/shared/tooltrim"))         # zero-dep, shared volume
tc = ToolCompressor(store=RedisStore(url="redis://cache:6379/0",     # pip install tooltrim[redis]
                                     ttl_seconds=86_400))
tc = ToolCompressor(store=S3Store(bucket="my-bucket"))               # pip install tooltrim[s3]

The proxy exposes Prometheus metrics at GET /metrics (tokens in/out/saved, messages compressed, fail-open count, upstream errors, latency) — scrape it to quantify savings fleet-wide:

tooltrim_tokens_saved_total 14389
tooltrim_messages_compressed_total 1
tooltrim_fail_open_total 0

How it works

  1. Pass-through if the output already fits the budget (zero overhead).
  2. Detect the content type (JSON / HTML / tabular / logs / text).
  3. Compress with a type-specific strategy:
    • JSON — preserve structure; sample arrays (keeping the key schema), note (+N more items), truncate long strings; tighten until it fits.
    • HTML — extract readable text (drop script/style/nav/footer), then fit the budget.
    • Tabular — keep the header + a sample of rows + (+N more rows).
    • Logs — collapse repeated lines (x42), always keep errors/warnings, fill with head/tail context.
    • Text — query-aware extractive selection (BM25), with […] elisions.
  4. Stash the full output under a content-addressed ref for expand().

With a query, every compressor keeps the most relevant parts; without one, it falls back to structure-preserving head/tail selection.

How it's different

Tool class What it optimizes tooltrim
Routers (RouteLLM…) which model gets the call orthogonal
Semantic caches repeated identical calls orthogonal
Prompt compressors (LLMLingua) the prompt/instructions different target
Memory frameworks (MemGPT…) conversation history, as a framework you adopt tooltrim is a drop-in on the tool boundary

tooltrim targets the tool-output boundary — the largest and most-ignored token sink in agentic apps — and works alongside all of the above.

Status

v0.1 — deterministic zero-dependency core, 79-test suite, reproducible token + faithfulness benchmarks (with Wilson CIs, cross-model), a proxy speaking both OpenAI and Anthropic wire formats with Prometheus /metrics, a LangChain adapter, pluggable File/Redis/S3 expand-stores for horizontal scale, and citable run artifacts under benchmarks/.

Roadmap: PyPI release + tooltrim CLI, frontier-model faithfulness runs, embedding-based relevance, streaming compression, and native LlamaIndex / OpenAI-Agents wrappers.

Contributions and benchmark cases welcome. MIT licensed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tooltrim-0.1.0.tar.gz (46.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tooltrim-0.1.0-py3-none-any.whl (37.8 kB view details)

Uploaded Python 3

File details

Details for the file tooltrim-0.1.0.tar.gz.

File metadata

  • Download URL: tooltrim-0.1.0.tar.gz
  • Upload date:
  • Size: 46.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tooltrim-0.1.0.tar.gz
Algorithm Hash digest
SHA256 766a5798eee158ff26f9d830e530abd0fcbe81bf652ff383bc179aa7fc1c9162
MD5 cac06ebd4af69fc0262041eed47c8ff5
BLAKE2b-256 3871c1e4de87aa74880cf4a8543f5fbcf3b798a5cfa017978e7f45e1c9955fb4

See more details on using hashes here.

File details

Details for the file tooltrim-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tooltrim-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 37.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for tooltrim-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 05e6383f68c723e642de68bbaa00196d2d4846f120fc33ca20445c303d28756c
MD5 0c36b51063c95fd1b2291e6240b35b7d
BLAKE2b-256 4a889f2ec766a9616d6f92f8c60905536ee76e420f45c5ed384dc829f27fe4c9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page