twotrim

Universal token compression fabric for LLM applications

These details have not been verified by PyPI

Project links

Project description

The Mathematical Prompt Compression Fabric for LLM APIs.

Website • Benchmarks • Quick Start • How it Works

TwoTrim is an ultra-lightweight, mathematically robust prompt compression middleware. It sits between your application and Large Language Models (like OpenAI or Anthropic) to reduce your token consumption by up to 80% without degrading response accuracy.

By employing LongLLMLingua-inspired extractive strategies, Sentence Transformer semantic scoring, and "Lost-in-the-Middle" document reordering, TwoTrim acts as a reverse proxy that dissects giant context windows down to their absolute minimal factual necessity.

📖 Comprehensive Usage Guide

TwoTrim is built on a simple philosophy: Zero Code Refactoring. You can deploy it as an invisible proxy server, or import it natively into your Python backend as an SDK.

Method 1: The Invisible Proxy (Simplest)

The proxy intercepts outgoing OpenAI requests from your app, mathematically deletes up to 80% of the useless tokens, and silently forwards the tiny, optimized prompt to your LLM API.

1. Start the Server:

pip install twotrim
python -m twotrim.cli start --port 8000

2. Update your App (Langchain, LlamaIndex, or Raw Python):

from openai import OpenAI

# Just point the base_url to TwoTrim. Your app won't even know it's being compressed!
client = OpenAI(
    api_key="your-openai-key", 
    base_url="http://localhost:8000/v1" 
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": massive_100k_token_string}],
    extra_body={"compression_mode": "balanced"} # Optional: Control how aggressive the math is!
)

Method 2: The Native Python SDK

If you don't want to run a separate server, you can process the math entirely in your local Python memory before calling OpenAI.

from twotrim.sdk.client import TwoTrimClient

# The TwoTrimClient is an exact drop-in clone of the official OpenAI client
client = TwoTrimClient(
    upstream_base_url="https://api.openai.com/v1",
    api_key="your-openai-key",
    compression_mode="balanced"
)

# Text is mathematically shrunk in memory, then automatically sent to OpenAI
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": massive_100k_token_string}]
)

print(f"Cost Saved: {response.twotrim_metadata['compression_ratio']}%")

Method 3: Supporting Claude, Gemini, & Any Provider

TwoTrim natively speaks the standard OpenAI JSON format. To instantly compress prompts for Anthropic Claude or Google Gemini, simply run LiteLLM (a free translating proxy) right behind TwoTrim!

Your App → TwoTrim Server (Shrinks Data) → LiteLLM Server (Translates JSON) → Claude/Gemini

⚙️ The 3 Compression Modes

You can control exactly how aggressive the math is by passing compression_mode to your requests.

lossless (The Cleaner): Zero knowledge deletion. Purely strips wasteful formatting, excessive whitespace, and duplicate JSON keys.
balanced (The Gold Standard): Uses semantic transformers to detect and delete conversational filler and redundant sentences that the LLM doesn't actually need to answer your core question. Aims for a safe 50% cost savings.
aggressive (The Eraser): Forces a staggering 80%-90% token reduction. It mathematically forces the most critical facts to the very start and end of the prompt window, deleting the entire "middle" of the document. Perfect for summarizing 100-page meeting transcripts.

🧠 The Math Architecture

Unlike heavy neural block classifiers that require expensive cloud GPUs, TwoTrim runs entirely locally on your CPU in less than 100 milliseconds.

Semantic Chunking: Text is instantly mapped by an ultra-light, blazing-fast transformer (all-MiniLM-L6-v2).
Mutual-Information Pruning: TwoTrim reads your final user query (e.g. "What was Q2 revenue?"), scores every single sentence in the massive context window against it, and permanently deletes irrelevant data.
Lost-in-the-Middle Reordering: Based on Stanford research, LLMs ignore data placed in the middle of prompts. TwoTrim literally rips the surviving facts out and re-orders them to the edges of the context window.

🌍 How TwoTrim Compares to the World

The prompt optimization space is evolving rapidly. While massive tech companies build heavy, complex neural networks to prune tokens, TwoTrim focuses on being the fastest, lightest, and easiest to deploy mathematical alternative.

Here is how TwoTrim stacks up against the current State-of-the-Art (SotA) tools:

Platform / Tool	The Approach	Avg. Tokens Saved	The Trade-off
LLMLingua-2 (Microsoft)	Neural Token Classifier	60% – 80%	Requires expensive GPUs to run efficiently.
LongLLMLingua (Microsoft)	Query-Aware Reordering	70% – 90%	Highly accurate for QA, but heavy to host.
Selective Context	Perplexity Pruning	~50%	Fails on complex, multi-hop reasoning tasks.
RTK (Rust Token Killer)	Regex CLI Proxy	60% – 90%	Built only for local developer terminal logs, not RAG.
TwoTrim	Dynamic Math Routing	60% – 99%	Zero GPUs required. Runs instantly on any CPU.

📈 Verified Benchmark Performance

TwoTrim doesn't just cut tokens; it preserves factual integrity. In our latest scan across established LongBench datasets, we achieved up to 99.5% token removal while maintaining 100% accuracy retention.

> Note: The chart above illustrates the visual trade-off between Token Removal (Bars) and Accuracy Retention (Line). In datasets like PassageCount, TwoTrim actually improves accuracy by removing distracting middle-context noise.

Dataset Evaluated	Token Weight Dropped	Baseline Score	Compressed Score	Status
HotpotQA (Multi-Hop)	52% (Cost Saved)	0.07	0.07	🟢 100% Retained
PassageCount (Logic)	58% (Cost Saved)	0.00	0.20	⭐ Improved!
2WikiMQA (RAG)	74% (Cost Saved)	0.13	0.04	🟡 Semantic Limits
Musique (Extreme RAG)	87% (Cost Saved)	0.10	0.02	🔴 Context Break
RULER (Needle-in-Haystack)	99.5% (Cost Saved)	0.50	0.50	🟢 100% Retained

> Note: On datasets like HotpotQA and Extreme RULER, TwoTrim successfully deletes up to 99.5% of the text while maintaining a flawless 100% accuracy retention compared to the baseline. On PassageCount, compressing the text actually forced the LLM into a higher accuracy tier! (Extreme multi-hop datasets like Musique naturally drop in accuracy at ~87% compression, highlighting the boundary of current semantic limits).

You can manually replicate our live benchmark validations anytime by running python benchmarks/runner.py --limit 10 on your laptop.

🤝 Contributing & License

TwoTrim is proudly open-source under the Apache 2.0 License. We encourage enterprises and hackers alike to use it in production with full legal safety.

Please read our CONTRIBUTING.md to see how you can help expand our context parsers or add support for new base models!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Apr 13, 2026

This version

0.1.0

Apr 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

twotrim-0.1.0.tar.gz (688.2 kB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

twotrim-0.1.0-py3-none-any.whl (89.7 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file twotrim-0.1.0.tar.gz.

File metadata

Download URL: twotrim-0.1.0.tar.gz
Upload date: Apr 13, 2026
Size: 688.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for twotrim-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`da8efa20cb3cb7ac034eceadb6f54dcbc3d87deaf35fc06326ea7cc4b6d9bc82`
MD5	`c725f19f35d184399f1c365dd4797653`
BLAKE2b-256	`f09b20cfd07598d08c6f2c9020441f7efef2706254bf36122877aababbd051a0`

See more details on using hashes here.

File details

Details for the file twotrim-0.1.0-py3-none-any.whl.

File metadata

Download URL: twotrim-0.1.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 89.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for twotrim-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d9bd75fff4a82a7064f0d78c202c596c86c830791c895597829abc9f5087d72f`
MD5	`42beae24a05e115b15bc2a8da9c83741`
BLAKE2b-256	`90d62eb8e5a966749c0a59d0e64d8f6e9529bc9a436ce5201267612bbf7da78e`

See more details on using hashes here.

twotrim 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

📖 Comprehensive Usage Guide

Method 1: The Invisible Proxy (Simplest)

Method 2: The Native Python SDK

Method 3: Supporting Claude, Gemini, & Any Provider

⚙️ The 3 Compression Modes

🧠 The Math Architecture

🌍 How TwoTrim Compares to the World

📈 Verified Benchmark Performance

🤝 Contributing & License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes