# token-compressor

mcp-name: io.github.base76-research-lab/token-compressor

An MCP server that compresses prompts by 40–60% using a local LLM plus embedding validation. Semantic prompt compression for LLM workflows: reduce token usage by 40–60% without losing meaning.

Built by Base76 Research Lab — research into epistemic AI architecture.
## What it does
token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:
- LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
- Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.90) — if not, the original is sent unchanged
The result: shorter prompts, lower costs, same intent.
```
Input prompt (300 tokens)
          ↓
     LLM compresses
          ↓
Embedding validates (cosine ≥ 0.90?)
          ↓
Pass → compressed (120 tokens)    Fail → original (300 tokens)
```
Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.
## Requirements

- Python 3.10+
- Ollama running locally
- Two models pulled:

  ```
  ollama pull llama3.2:1b
  ollama pull nomic-embed-text
  ```

- Python dependencies:

  ```
  pip install ollama numpy
  ```
## Quick start

```python
from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)  # compressed (or original if validation failed)
print(result.report())     # MODE / COVERAGE / TOKENS saved
```
Result object:

| Field | Description |
|---|---|
| `output_text` | Text to send to your LLM |
| `mode` | `compressed` / `raw_fallback` / `skipped` |
| `coverage` | Cosine similarity (0.0–1.0) |
| `tokens_in` | Estimated input tokens |
| `tokens_out` | Estimated output tokens |
| `tokens_saved` | Difference (`tokens_in` minus `tokens_out`) |
## CLI usage

```
echo "Your long prompt here..." | python3 cli.py
```
Output: compressed text on stdout, stats on stderr.
## Claude Code hook (recommended setup)

Add to your `~/.claude/settings.json` under `hooks` → `UserPromptSubmit`:

```json
{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}
```
This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.
## MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

```
pip install token-compressor-mcp
```

Tool: `compress_prompt`

- Input: `text` (string)
- Output: compressed text + stats footer
Claude Code MCP config (`~/.claude/settings.json`):

```json
{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}
```
Or from source:

```json
{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}
```
## Configuration

```python
pipeline = LLMCompressEmbedValidate(
    threshold=0.90,       # cosine similarity floor (lower = more aggressive)
    min_tokens=80,        # skip the pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)
```
## How it works

### Stage 1 — LLM compression

The compression prompt instructs the model to:

- Preserve all conditionals (`if`, `only if`, `unless`, `when`, `but only`)
- Preserve all negations
- Remove filler, hedging, and redundancy
- Target 40–60% of the original length
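As an illustration only (the actual Stage-1 prompt ships inside the package and may be worded differently), an instruction of this shape could look like:

```python
# Hypothetical Stage-1 instruction template; the shipped prompt may differ.
COMPRESS_INSTRUCTION = (
    "Rewrite the text below to 40-60% of its length. "
    "Preserve every conditional (if, only if, unless, when, but only) "
    "and every negation exactly as written. "
    "Remove filler, hedging, and redundancy. "
    "Output only the rewritten text.\n\nText:\n{text}"
)

def build_compress_prompt(text: str) -> str:
    """Fill the template with the prompt to be compressed."""
    return COMPRESS_INSTRUCTION.format(text=text)
```

Keeping the "output only the rewritten text" constraint matters: any preamble the local model adds would both waste tokens and distort the Stage-2 embedding comparison.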
### Stage 2 — Embedding validation
Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.
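Stage 2 reduces to a single dot product over the two embedding vectors. A sketch with plain numpy (fetching the vectors from `nomic-embed-text` through the `ollama` client is omitted here):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (range -1.0 to 1.0)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_validation(original_vec, compressed_vec, threshold=0.90):
    """Stage-2 gate: True iff the compressed embedding stays at or
    above the similarity floor relative to the original."""
    return cosine_similarity(original_vec, compressed_vec) >= threshold
```

When `passes_validation` returns `False`, the pipeline discards the compressed text and returns the original, which is what `mode = raw_fallback` reports.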
## Results
Tested across Swedish and English prompts, technical and natural language:
| Input | Tokens in | Tokens out | Saved |
|---|---|---|---|
| Research abstract (EN) | 89 | 38 | 57% |
| Session intent (SV) | 32 | 18 | 44% |
| Technical instruction | 47 | 22 | 53% |
| Short command (<80t) | — | — | skipped |
## Research background
This tool implements the architecture from:
Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535
Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.
## License
MIT — Base76 Research Lab, Sweden
## File details

### token_compressor_mcp-0.1.1.tar.gz

- Size: 13.2 kB
- Tags: Source
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.12.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `13a8d2bf7b580c870c8af01820b1087ef52192ab46d8bd7487ab6e7dc22362d5` |
| MD5 | `6f92bcf5085ece7199098a23413f4a1a` |
| BLAKE2b-256 | `2b3dd720af8706a0fc1025300aa786f74a36ebba3d165cfbeb175c1a1b8d79a0` |
### token_compressor_mcp-0.1.1-py3-none-any.whl

- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/6.2.0 CPython/3.12.3

| Algorithm | Hash digest |
|---|---|
| SHA256 | `03fec7aed566b4c00c63fdf10b649871aaa74d9cb5c423c71366ba6f290f7eef` |
| MD5 | `dde57c7fa3beb19f83f44a5843487535` |
| BLAKE2b-256 | `1049ff9ddfc632418542ad3de5ab95e7636fb07bad47b880c0a669eda31fc200` |