MCP server that compresses prompts 40-60% using local LLM + embedding validation


token-compressor

mcp-name: io.github.base76-research-lab/token-compressor

Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.

License: MIT · Requires: Ollama · MCP-compatible

Built by Base76 Research Lab — research into epistemic AI architecture.


What it does

token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:

  1. LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
  2. Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.90) — if not, the original is sent unchanged

The result: shorter prompts, lower costs, same intent.

Input prompt (300 tokens)
        ↓
  LLM compresses
        ↓
  Embedding validates (cosine ≥ 0.90?)
        ↓
  Pass → compressed (120 tokens)   Fail → original (300 tokens)

Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.
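The gate above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `compress_fn` and `embed_fn` are hypothetical stand-ins for the Ollama calls (llama3.2:1b and nomic-embed-text), and the ~4-characters-per-token estimate is an assumption.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def process(text, compress_fn, embed_fn, threshold=0.90, min_tokens=80):
    if len(text) // 4 < min_tokens:      # too short to be worth compressing
        return text, "skipped"
    compressed = compress_fn(text)       # Stage 1: LLM rewrite
    coverage = cosine(embed_fn(text), embed_fn(compressed))  # Stage 2: validate
    if coverage >= threshold:
        return compressed, "compressed"
    return text, "raw_fallback"          # meaning drifted: send the original
```

Because the fallback path returns the original text, the worst case is zero savings, never a corrupted prompt.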


Requirements

  • Python 3.10+
  • Ollama running locally
  • Two models pulled:

        ollama pull llama3.2:1b
        ollama pull nomic-embed-text

  • Python dependencies:

        pip install ollama numpy

Quick start

from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)   # compressed (or original if validation failed)
print(result.report())      # MODE / COVERAGE / TOKENS saved

Result object:

Field          Description
output_text    Text to send to your LLM
mode           compressed / raw_fallback / skipped
coverage       Cosine similarity between original and compressed (0.0–1.0)
tokens_in      Estimated input tokens
tokens_out     Estimated output tokens
tokens_saved   tokens_in - tokens_out
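For illustration, a hypothetical dataclass mirroring these documented fields, and a caller branching on mode (the real class name and shape in the library may differ):

```python
from dataclasses import dataclass

@dataclass
class CompressionResult:
    output_text: str
    mode: str          # "compressed" | "raw_fallback" | "skipped"
    coverage: float
    tokens_in: int
    tokens_out: int
    tokens_saved: int

def prompt_to_send(result: CompressionResult) -> str:
    # output_text is always safe to forward: the pipeline already fell back
    # to the original when validation failed.
    if result.mode == "compressed":
        print(f"saved {result.tokens_saved} tokens (coverage {result.coverage:.2f})")
    return result.output_text
```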

CLI usage

echo "Your long prompt here..." | python3 cli.py

Output: compressed text on stdout, stats on stderr.


Claude Code hook (recommended setup)

Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:

{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.
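The second stage could be as simple as reading that temp file back. A minimal sketch, assuming the path from the hook above; the function name is hypothetical:

```python
from pathlib import Path

def read_compressed(path: str = "/tmp/compressed_prompt.txt") -> str:
    # Returns an empty string when the hook hasn't run (or compression
    # failed), so the caller can fall back to the raw prompt.
    p = Path(path)
    return p.read_text().strip() if p.exists() else ""
```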


MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

pip install token-compressor-mcp

Tool: compress_prompt

  • Input: text (string)
  • Output: compressed text + stats footer

Claude Code MCP config (~/.claude/settings.json):

{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}

Or from source:

{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}

Configuration

pipeline = LLMCompressEmbedValidate(
    threshold=0.90,          # cosine similarity floor (lower = more aggressive)
    min_tokens=80,           # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)
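The min_tokens cutoff implies a token estimate for raw text. A plausible sketch, assuming ~4 characters per token; the library's actual estimator may differ:

```python
def estimate_tokens(text: str) -> int:
    # Assumed heuristic: ~4 characters per token (not a real tokenizer).
    return max(1, len(text) // 4)

def worth_compressing(text: str, min_tokens: int = 80) -> bool:
    # Below the cutoff, the LLM + embedding round-trips cost more than the
    # handful of tokens compression could save.
    return estimate_tokens(text) >= min_tokens
```

Lowering threshold makes compression more aggressive but accepts more semantic drift; raising min_tokens skips more borderline prompts.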

How it works

Stage 1 — LLM compression

The compression prompt instructs the model to:

  • Preserve all conditionals (if, only if, unless, when, but only)
  • Preserve all negations
  • Remove filler, hedging, redundancy
  • Target 40–60% of original length
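A hypothetical compression prompt implementing these rules; the template actually shipped with token-compressor may be worded differently:

```python
# Hypothetical system prompt; the shipped template may differ.
COMPRESS_PROMPT = """Rewrite the user's text at 40-60% of its original length.
Rules:
- Preserve every conditional (if, only if, unless, when, but only).
- Preserve every negation (not, never, no, don't).
- Remove filler, hedging, and redundancy.
- Output only the rewritten text, nothing else."""

def build_messages(text: str) -> list[dict]:
    # Standard chat-message shape accepted by Ollama-style chat APIs.
    return [
        {"role": "system", "content": COMPRESS_PROMPT},
        {"role": "user", "content": text},
    ]
```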

Stage 2 — Embedding validation

Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.


Results

Tested across Swedish and English prompts, technical and natural language:

Input                     Tokens in   Tokens out   Saved
Research abstract (EN)    89          38           57%
Session intent (SV)       32          18           44%
Technical instruction     47          22           53%
Short command (<80t)      skipped

Research background

This tool implements the architecture from:

Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535

Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.


License

MIT — Base76 Research Lab, Sweden
