
Intelligent token compression + routing MCP server


AK-Primus v0.2.0

Intelligent token compression + routing MCP server with adaptive self-healing.

Every LLM request is first labelled by an 8-class hybrid classifier, which selects the layers of a 7-layer pipeline to run. Each request type gets its own compression and retrieval stack, so nothing is applied blindly. Costs are intended to fall with use, as the cache fills and the adaptive profile tunes compression ratios.

Request → Classifier → Router → [Cache | Memory | Compress | Search | Prompt-Opt] → LLM
                                                                                       ↓
                                                        Quality Score ← Response ←────┘
                                                              ↓
                                                    Adaptive Profile Update

What's new in 0.2.0

| Area | Change |
| --- | --- |
| Compression | Expansion guard: never inflates token count; LLMLingua-1 uses GPT-2 (was 7B Llama) |
| Classifier | ML hybrid (400-example logistic regression + rule fast-path); 88%+ accuracy on 275-scenario suite |
| Cache | 3-level lookup: SHA-256 exact → ChromaDB HNSW ANN → SQLite cosine fallback |
| Memory | L1 working memory + L2 episodic (session consolidation) + L3 semantic (cross-session ChromaDB) |
| DSPy | 4 typed Signatures wired end-to-end; BootstrapFewShot + MIPROv2 with disk cache |
| Quality | ROUGE-L + BERTScore + LLM-as-judge blended score; score_async() variant |
| Transport | AK_PRIMUS_TRANSPORT=http enables a Starlette ASGI app with /health, /ready, /metrics, and SSE |
| Testing | 84 pytest tests + 275-scenario benchmark suite with per-class accuracy CI gate |
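
The expansion guard listed above can be pictured as a wrapper that discards any compressed output that is not actually shorter. The following is a minimal sketch, not the project's implementation: `guarded_compress` and `count_tokens` are hypothetical names, and whitespace splitting stands in for real tiktoken counting.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a stand-in for a real tokenizer.
    return len(text.split())


def guarded_compress(text: str, compressor: Callable[[str], str]) -> str:
    """Return the compressor's output only if it uses fewer tokens than the input."""
    candidate = compressor(text)
    if count_tokens(candidate) >= count_tokens(text):
        # The compressor inflated (or failed to shrink) the text: keep the original.
        return text
    return candidate
```

With a guard of this shape, a misbehaving compressor can cost latency but never tokens.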

Architecture

ak-primus/
├── core/
│   └── ak_primus/
│       ├── classifier.py          # 8-class hybrid (rule + ML logistic regression)
│       ├── router.py              # Stack selection per request type
│       ├── server.py              # MCP server — 7 tools, stdio + HTTP/SSE
│       ├── layers/
│       │   ├── cache.py           # 3-level semantic cache (exact / HNSW / cosine)
│       │   ├── compression.py     # LLMLingua-2, LLMLingua-1, LongLLMLingua, SelectiveContext
│       │   ├── memory.py          # L1 working | L2 episodic | L3 semantic (ChromaDB)
│       │   ├── metrics.py         # tiktoken real token counting + cost accounting
│       │   ├── prompt_opt.py      # DSPy BootstrapFewShot + MIPROv2, OPRO, Medprompt
│       │   ├── quality.py         # ROUGE-L + BERTScore + LLM-as-judge + adaptive profile
│       │   └── search.py          # HyDE, RAPTOR, FLARE, ColBERT retrieval
│       ├── ml/
│       │   └── classifier_ml.py   # Embedding-based logistic regression (400 training examples)
│       └── storage/
│           ├── session_store.py   # SQLite WAL — sessions, cache, profiles, memory_facts
│           └── vector_store.py    # ChromaDB HNSW — semantic_cache + memory_facts
├── extension/
│   └── src/
│       ├── extension.ts           # VS Code extension entry point
│       ├── dashboard.ts           # Real-time metrics webview
│       └── mcp-client.ts          # MCP stdio bridge
├── tests/
│   ├── unit/                      # 70 unit tests (all layers)
│   └── integration/               # 14 integration + benchmark threshold tests
└── benchmarks/
    └── run_1200_scenarios.py      # 275-scenario accuracy + latency suite
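
The tiered cache described for `layers/cache.py` can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than the project's code: `TieredCache`, `toy_embed`, and the 0.9 similarity threshold are hypothetical, the ANN level (ChromaDB HNSW) is left out, and a brute-force cosine scan plays the role of the fallback level.

```python
import hashlib
import math
from typing import Callable, Dict, List, Optional, Tuple


def sha256_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(text: str) -> List[float]:
    # Hypothetical embedding: counts of three letters. A real system would use a model.
    return [float(text.count(ch)) for ch in "abc"]


class TieredCache:
    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # assumed similarity cut-off
        self.exact: Dict[str, str] = {}
        self.vectors: List[Tuple[List[float], str]] = []

    def put(self, prompt: str, response: str) -> None:
        self.exact[sha256_key(prompt)] = response
        self.vectors.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[Tuple[str, str]]:
        # Level 1: exact match on the SHA-256 of the prompt.
        hit = self.exact.get(sha256_key(prompt))
        if hit is not None:
            return ("exact", hit)
        # Level 2 (ANN index) is omitted in this sketch.
        # Level 3: brute-force cosine similarity over stored embeddings.
        query = self.embed(prompt)
        best_score, best = 0.0, None
        for vector, response in self.vectors:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best = score, response
        if best is not None and best_score >= self.threshold:
            return ("cosine", best)
        return None
```

The cheap exact check runs first so that repeated identical prompts never pay for an embedding call.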

7 MCP Tools

| Tool | Purpose |
| --- | --- |
| classify_request | Detect request type + return recommended stack with expected token reduction |
| compress_history | Apply compression stack to message history; returns compressed messages + savings |
| build_context | HyDE / RAPTOR / FLARE retrieval-augmented context building |
| optimize_prompt | DSPy BootstrapFewShot / OPRO / Medprompt prompt optimisation |
| get_token_report | Real tiktoken metrics, cost savings, session stats |
| process_request | Master pipeline: classify → cache → memory → compress → quality → adapt |
| report_quality | Feed quality signal (0–1) back into adaptive compression profile |
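
MCP clients invoke tools through a JSON-RPC 2.0 `tools/call` request. The sketch below builds such an envelope for `report_quality`; the helper name `tool_call` is hypothetical, and the argument names are taken from the `report_quality` example later in this README rather than from a published schema, so treat them as assumptions.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)


def tool_call(name: str, arguments: Dict[str, Any]) -> str:
    """Serialise an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Argument names below are assumed, not confirmed by a tool schema.
message = tool_call("report_quality", {"quality_score": 0.92, "request_type": "code"})
```

In practice an MCP client library would build and send this message; the sketch only shows its shape.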

process_request — master pipeline

{
  "optimized_messages": [...],
  "tokens_before": 1240,
  "tokens_after": 487,
  "tokens_saved": 753,
  "savings_pct": 60.7,
  "request_type": "code",
  "confidence": 0.91,
  "quality_score": 0.876,
  "adapted_ratio": 0.382,
  "cache_hit": false,
  "session_id": "sess-abc123",
  "lifetime_savings": 41250
}
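
The savings fields in the sample response are mutually consistent, which makes them easy to sanity-check on the client side. The sketch below assumes that `tokens_saved` is the plain difference and that `savings_pct` is that difference as a percentage of `tokens_before`, rounded to one decimal; this matches the sample values but is not confirmed as the server's formula.

```python
from typing import Tuple


def savings(tokens_before: int, tokens_after: int) -> Tuple[int, float]:
    """Return (tokens_saved, savings_pct) under the assumed formula."""
    saved = tokens_before - tokens_after
    pct = round(100.0 * saved / tokens_before, 1) if tokens_before else 0.0
    return saved, pct


saved, pct = savings(1240, 487)  # values from the sample response
```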

8 Request Types

| Type | Classifier Trigger | Default Stack |
| --- | --- | --- |
| code | Code keywords + task verbs + C++/C# detection | SelectiveContext → PrefixCache |
| rag_doc | Documents present + QA-style question | HyDE retrieval + LLMLingua-2 |
| agent_session | Multi-turn history + follow-up phrases | WorkingMemory + L3 semantic |
| domain_expert | Legal / medical / finance domain terms | Medprompt + SelectiveContext |
| multi_hop | "relationship", "compare", "trace" patterns | RAPTOR + ChainOfThought |
| math | Equations, proof, calculate keywords | LLMLingua-1 (formula-aware) |
| fixed_template | Long system prompt (>600 tokens) | PrefixCache (no compression) |
| simple_qa | Short conversational question | Light SelectiveContext |
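
A few of the triggers above are simple enough to express as a rule fast-path. The sketch below is illustrative only: the function name, the keyword lists, and the order in which rules are tried are assumptions, and returning `None` stands for deferring to the ML classifier.

```python
import re
from typing import Optional

# Hypothetical keyword patterns loosely based on the trigger descriptions.
MULTI_HOP = re.compile(r"\b(relationship|compare|trace)\b", re.IGNORECASE)
MATH = re.compile(r"\b(proof|prove|calculate)\b", re.IGNORECASE)
CODE = re.compile(r"\b(refactor|function|compile|debug)\b|C\+\+|C#", re.IGNORECASE)


def classify_fast(
    prompt: str,
    system_prompt_tokens: int = 0,
    has_documents: bool = False,
) -> Optional[str]:
    """Return a request type when a rule fires, or None to defer to the ML path."""
    if system_prompt_tokens > 600:
        return "fixed_template"
    if has_documents and prompt.rstrip().endswith("?"):
        return "rag_doc"
    if MULTI_HOP.search(prompt):
        return "multi_hop"
    if MATH.search(prompt):
        return "math"
    if CODE.search(prompt):
        return "code"
    return None
```

Rules like these need no model load, which is consistent with keeping the fast-path cheap; anything they cannot decide falls through to the slower classifier.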

Installation

# Minimal (core MCP server, no ML)
pip install ak-primus

# Full (all optional groups)
pip install "ak-primus[all]"

# From source
git clone https://github.com/ak-primus/ak-primus
cd ak-primus
pip install -e "core[all]"

| Group | Installs | When to use |
| --- | --- | --- |
| compress | LLMLingua, transformers, torch | Token compression |
| retrieval | sentence-transformers, chromadb, scikit-learn | Semantic cache + search |
| optimize | dspy-ai, evaluate | DSPy + ROUGE/BERTScore |
| http | uvicorn, starlette | HTTP/SSE transport |
| dev | pytest, ruff, mypy | Development |

Quick Start

Claude Desktop

{
  "mcpServers": {
    "ak-primus": {
      "command": "ak-primus",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Docker

# stdio (MCP)
docker build -f Dockerfile.akprimus --target runtime -t ak-primus:0.2.0 .
docker run -i --rm -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" -v ak_data:/data ak-primus:0.2.0

# HTTP/SSE with health probes
docker compose --profile http up
curl http://localhost:8080/health

Self-Healing Compression

The adaptive profile tunes compression ratios over time per (workspace, request_type):

quality ≥ 0.85  → compress more aggressively next time
quality < 0.70  → back off compression

After roughly 50 samples per type, the ratio is expected to settle near the best available trade-off between response quality and token savings.
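
The feedback rule above can be sketched as a bounded step update. Only the 0.85 and 0.70 thresholds come from this README; the step size, the bounds, the hold-steady behaviour between the thresholds, and the reading of the ratio as "fraction of tokens kept" are assumptions made for illustration.

```python
def update_ratio(
    ratio: float,
    quality: float,
    step: float = 0.05,    # assumed step size
    floor: float = 0.2,    # assumed lower bound on the kept fraction
    ceiling: float = 1.0,  # 1.0 means no compression
) -> float:
    """Adjust the kept-token ratio from one quality observation."""
    if quality >= 0.85:
        ratio -= step  # quality is high: keep fewer tokens next time
    elif quality < 0.70:
        ratio += step  # quality dropped: back off compression
    # Between the thresholds the ratio is left unchanged (assumption).
    return min(ceiling, max(floor, ratio))
```

Clamping keeps a run of good scores from driving the ratio to zero, and a run of bad scores from pushing it past "no compression".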

# Feed quality signal back after reviewing LLM response
curl -s http://localhost:8080/mcp -H 'Content-Type: application/json' -d '{"tool":"report_quality","quality_score":0.92,"request_type":"code"}'

Running Tests

pytest tests/ -q                              # 84 tests
python benchmarks/run_1200_scenarios.py --quick   # accuracy + latency

Benchmark (v0.2.0, 275 scenarios)

| Class | Accuracy |
| --- | --- |
| math | 96% |
| fixed_template | 100% |
| agent_session | 92% |
| domain_expert | 90% |
| multi_hop | 90% |
| simple_qa | 88% |
| code | 82% |
| rag_doc | 80% |
| Overall | ~88% |

Classifier latency on the rule-based path (no model load): p50 < 1 ms, p95 < 5 ms.
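
A per-class accuracy gate of the kind mentioned for CI can be as small as the sketch below. The accuracies are the figures reported above; the function name and the example thresholds are hypothetical, since the actual gate values are not stated here.

```python
from typing import Dict, List

# Per-class accuracy in whole percent, as reported in the benchmark table.
REPORTED = {
    "math": 96, "fixed_template": 100, "agent_session": 92, "domain_expert": 90,
    "multi_hop": 90, "simple_qa": 88, "code": 82, "rag_doc": 80,
}


def failing_classes(accuracy: Dict[str, int], minimum: int) -> List[str]:
    """Return the classes whose accuracy is below the minimum, sorted by name."""
    return sorted(name for name, value in accuracy.items() if value < minimum)


# With a hypothetical 80% floor every class passes; at 85% two would fail.
assert failing_classes(REPORTED, 80) == []
```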


Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| AK_PRIMUS_MODEL | claude-sonnet-4-6 | LLM model for DSPy / OPRO / judge |
| AK_PRIMUS_TRANSPORT | stdio | stdio or http |
| AK_PRIMUS_HOST | 127.0.0.1 | HTTP bind host |
| AK_PRIMUS_PORT | 8080 | HTTP bind port |
| AK_PRIMUS_DB | ~/.ak_primus/store.db | SQLite path |
| AK_PRIMUS_DSPY_CACHE | ~/.ak_primus/dspy_cache | DSPy program cache |
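
For scripts that wrap the server, the variables above can be read with their documented defaults as follows. This is a sketch of one way to do it, not the package's own loader; `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model: str
    transport: str
    host: str
    port: int
    db_path: Path
    dspy_cache: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read AK_PRIMUS_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    transport = env.get("AK_PRIMUS_TRANSPORT", "stdio")
    if transport not in ("stdio", "http"):
        raise ValueError(f"unsupported transport: {transport!r}")
    return Settings(
        model=env.get("AK_PRIMUS_MODEL", "claude-sonnet-4-6"),
        transport=transport,
        host=env.get("AK_PRIMUS_HOST", "127.0.0.1"),
        port=int(env.get("AK_PRIMUS_PORT", "8080")),
        db_path=Path(env.get("AK_PRIMUS_DB", "~/.ak_primus/store.db")).expanduser(),
        dspy_cache=Path(env.get("AK_PRIMUS_DSPY_CACHE", "~/.ak_primus/dspy_cache")).expanduser(),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.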

License

MIT
