Intelligent token compression + routing MCP server
Project description
AK-Primus v0.2.0
Intelligent token compression + routing MCP server with adaptive self-healing.
Every LLM request passes through a 7-layer pipeline selected by an 8-class hybrid classifier. The right compression and retrieval stack runs automatically for each request type. Nothing is applied blindly. The system gets cheaper the more it is used.
Request → Classifier → Router → [Cache | Memory | Compress | Search | Prompt-Opt] → LLM
↓
Quality Score ← Response ←────┘
↓
Adaptive Profile Update
What's new in 0.2.0
| Area | Change |
|---|---|
| Compression | Expansion guard — never inflates token count; LLMLingua-1 uses GPT-2 (was 7B Llama) |
| Classifier | ML hybrid (400-example logistic regression + rule fast-path); 88%+ accuracy on 275-scenario suite |
| Cache | 3-level lookup: SHA-256 exact → ChromaDB HNSW ANN → SQLite cosine fallback |
| Memory | L1 working memory + L2 episodic (session consolidation) + L3 semantic (cross-session ChromaDB) |
| DSPy | 4 typed Signatures wired end-to-end; BootstrapFewShot + MIPRO2 with disk cache |
| Quality | ROUGE-L + BERTScore + LLM-as-judge blended score; score_async() variant |
| Transport | AK_PRIMUS_TRANSPORT=http — Starlette ASGI with /health, /ready, /metrics, SSE |
| Testing | 84 pytest tests + 275-scenario benchmark suite with per-class accuracy CI gate |
Architecture
ak-primus/
├── core/
│ └── ak_primus/
│ ├── classifier.py # 8-class hybrid (rule + ML logistic regression)
│ ├── router.py # Stack selection per request type
│ ├── server.py # MCP server — 7 tools, stdio + HTTP/SSE
│ ├── layers/
│ │ ├── cache.py # 3-level semantic cache (exact / HNSW / cosine)
│ │ ├── compression.py # LLMLingua-2, LLMLingua-1, LongLLMLingua, SelectiveContext
│ │ ├── memory.py # L1 working | L2 episodic | L3 semantic (ChromaDB)
│ │ ├── metrics.py # tiktoken real token counting + cost accounting
│ │ ├── prompt_opt.py # DSPy BootstrapFewShot + MIPRO2, OPRO, Medprompt
│ │ ├── quality.py # ROUGE-L + BERTScore + LLM-as-judge + adaptive profile
│ │ └── search.py # HyDE, RAPTOR, FLARE, ColBERT retrieval
│ ├── ml/
│ │ └── classifier_ml.py # Embedding-based logistic regression (400 training examples)
│ └── storage/
│ ├── session_store.py # SQLite WAL — sessions, cache, profiles, memory_facts
│ └── vector_store.py # ChromaDB HNSW — semantic_cache + memory_facts
├── extension/
│ └── src/
│ ├── extension.ts # VS Code extension entry point
│ ├── dashboard.ts # Real-time metrics webview
│ └── mcp-client.ts # MCP stdio bridge
├── tests/
│ ├── unit/ # 70 unit tests (all layers)
│ └── integration/ # 14 integration + benchmark threshold tests
└── benchmarks/
└── run_1200_scenarios.py # 275-scenario accuracy + latency suite
7 MCP Tools
| Tool | Purpose |
|---|---|
classify_request |
Detect request type + return recommended stack with expected token reduction |
compress_history |
Apply compression stack to message history; returns compressed messages + savings |
build_context |
HyDE / RAPTOR / FLARE retrieval-augmented context building |
optimize_prompt |
DSPy BootstrapFewShot / OPRO / Medprompt prompt optimisation |
get_token_report |
Real tiktoken metrics, cost savings, session stats |
process_request |
Master pipeline: classify → cache → memory → compress → quality → adapt |
report_quality |
Feed quality signal (0–1) back into adaptive compression profile |
process_request — master pipeline
{
"optimized_messages": [...],
"tokens_before": 1240,
"tokens_after": 487,
"tokens_saved": 753,
"savings_pct": 60.7,
"request_type": "code",
"confidence": 0.91,
"quality_score": 0.876,
"adapted_ratio": 0.382,
"cache_hit": false,
"session_id": "sess-abc123",
"lifetime_savings": 41250
}
8 Request Types
| Type | Classifier Trigger | Default Stack |
|---|---|---|
code |
Code keywords + task verbs + C++/C# detection | SelectiveContext → PrefixCache |
rag_doc |
Documents present + QA-style question | HyDE retrieval + LLMLingua-2 |
agent_session |
Multi-turn history + follow-up phrases | WorkingMemory + L3 semantic |
domain_expert |
Legal / medical / finance domain terms | Medprompt + SelectiveContext |
multi_hop |
"relationship", "compare", "trace" patterns | RAPTOR + ChainOfThought |
math |
Equations, proof, calculate keywords | LLMLingua-1 (formula-aware) |
fixed_template |
Long system prompt (>600 tokens) | PrefixCache (no compression) |
simple_qa |
Short conversational question | Light SelectiveContext |
Installation
# Minimal (core MCP server, no ML)
pip install ak-primus
# Full (all optional groups)
pip install "ak-primus[all]"
# From source
git clone https://github.com/ak-primus/ak-primus
pip install -e "core[all]"
| Group | Installs | When to use |
|---|---|---|
compress |
LLMLingua, transformers, torch | Token compression |
retrieval |
sentence-transformers, chromadb, scikit-learn | Semantic cache + search |
optimize |
dspy-ai, evaluate | DSPy + ROUGE/BERTScore |
http |
uvicorn, starlette | HTTP/SSE transport |
dev |
pytest, ruff, mypy | Development |
Quick Start
Claude Desktop
{
"mcpServers": {
"ak-primus": {
"command": "ak-primus",
"args": ["serve"],
"env": {
"ANTHROPIC_API_KEY": "sk-ant-..."
}
}
}
}
Docker
# stdio (MCP)
docker build -f Dockerfile.akprimus --target runtime -t ak-primus:0.2.0 .
docker run -i --rm -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" -v ak_data:/data ak-primus:0.2.0
# HTTP/SSE with health probes
docker compose --profile http up
curl http://localhost:8080/health
Self-Healing Compression
The adaptive profile tunes compression ratios over time per (workspace, request_type):
quality ≥ 0.85 → compress more aggressively next time
quality < 0.70 → back off compression
After ~50 samples per type, the ratio converges to the Pareto-optimal point.
# Feed quality signal back after reviewing LLM response
curl -s http://localhost:8080/mcp -d '{"tool":"report_quality","quality_score":0.92,"request_type":"code"}'
Running Tests
pytest tests/ -q # 84 tests
python benchmarks/run_1200_scenarios.py --quick # accuracy + latency
Benchmark (v0.2.0, 275 scenarios)
| Class | Accuracy |
|---|---|
math |
96% |
fixed_template |
100% |
agent_session |
92% |
domain_expert |
90% |
multi_hop |
90% |
simple_qa |
88% |
code |
82% |
rag_doc |
80% |
| Overall | ~88% |
Classifier latency: p50 < 1ms, p95 < 5ms (rule-based path, no model load).
Environment Variables
| Variable | Default | Description |
|---|---|---|
AK_PRIMUS_MODEL |
claude-sonnet-4-6 |
LLM model for DSPy / OPRO / judge |
AK_PRIMUS_TRANSPORT |
stdio |
stdio or http |
AK_PRIMUS_HOST |
127.0.0.1 |
HTTP bind host |
AK_PRIMUS_PORT |
8080 |
HTTP bind port |
AK_PRIMUS_DB |
~/.ak_primus/store.db |
SQLite path |
AK_PRIMUS_DSPY_CACHE |
~/.ak_primus/dspy_cache |
DSPy program cache |
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ak_primus-0.2.1.tar.gz.
File metadata
- Download URL: ak_primus-0.2.1.tar.gz
- Upload date:
- Size: 121.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c2271cea0db9e9b6c49d1ea7b92aaaf48648665007e9c8e6ef45affd4abd6cd4
|
|
| MD5 |
512d2976669a54db7ad57b554da6665c
|
|
| BLAKE2b-256 |
057838eb8b71d7d31edc508a3576130c4f9bb274dc4d8e332e14795220469659
|
File details
Details for the file ak_primus-0.2.1-py3-none-any.whl.
File metadata
- Download URL: ak_primus-0.2.1-py3-none-any.whl
- Upload date:
- Size: 127.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6345ba1ba4f33de94ed7681d2c83fb91f86c2ba2e76c8b316550f0d6c2da542a
|
|
| MD5 |
448ced86ce257dc81400c9402bcacaa5
|
|
| BLAKE2b-256 |
d368fa713aca68af94b1411233ab8e02536b55fde1efab55141b7a71e61c121f
|