Intelligent token compression + routing MCP server

AK-Primus v0.2.0

Intelligent token compression + routing MCP server with adaptive self-healing.

Every LLM request passes through a 7-layer pipeline whose stack is selected by an 8-class hybrid classifier, so the compression and retrieval steps that run are the ones suited to that request type. Nothing is applied blindly. Per-request token cost is designed to fall with use, as the cache fills and the adaptive profiles tune compression ratios.

Request → Classifier → Router → [Cache | Memory | Compress | Search | Prompt-Opt] → LLM
                                                                                       ↓
                                                        Quality Score ← Response ←────┘
                                                              ↓
                                                    Adaptive Profile Update

What's new in 0.2.0

| Area | Change |
| --- | --- |
| Compression | Expansion guard: never inflates token count; LLMLingua-1 uses GPT-2 (was 7B Llama) |
| Classifier | ML hybrid (400-example logistic regression + rule fast-path); ~88% accuracy on 275-scenario suite |
| Cache | 3-level lookup: SHA-256 exact → ChromaDB HNSW ANN → SQLite cosine fallback |
| Memory | L1 working memory + L2 episodic (session consolidation) + L3 semantic (cross-session ChromaDB) |
| DSPy | 4 typed Signatures wired end-to-end; BootstrapFewShot + MIPROv2 with disk cache |
| Quality | ROUGE-L + BERTScore + LLM-as-judge blended score; score_async() variant |
| Transport | AK_PRIMUS_TRANSPORT=http: Starlette ASGI with /health, /ready, /metrics, SSE |
| Testing | 84 pytest tests + 275-scenario benchmark suite with per-class accuracy CI gate |
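
To make the 3-level cache lookup concrete, here is a minimal sketch of the exact → ANN → cosine fallback order. It is not the package's implementation: the class name `ThreeLevelCache`, the `ann_lookup` hook (standing in for the ChromaDB HNSW query), the `toy_embed` letter-frequency embedding, and the 0.92 similarity threshold are all assumptions made for illustration.

```python
import hashlib
import math
from typing import Callable, Optional

Vector = list[float]


def toy_embed(text: str) -> Vector:
    """Letter-frequency vector; a stand-in for a real sentence embedding."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ThreeLevelCache:
    """Hypothetical cache: SHA-256 exact match, optional ANN hook, brute-force cosine."""

    def __init__(
        self,
        embed: Callable[[str], Vector] = toy_embed,
        ann_lookup: Optional[Callable[[Vector], Optional[str]]] = None,
        threshold: float = 0.92,  # assumed similarity cut-off
    ) -> None:
        self.embed = embed
        self.ann_lookup = ann_lookup
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.rows: list[tuple[Vector, str]] = []

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.rows.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[tuple[str, str]]:
        # Level 1: exact match on the SHA-256 digest of the prompt.
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return ("exact", hit)
        vec = self.embed(prompt)
        # Level 2: approximate nearest neighbour, if a vector index is attached.
        if self.ann_lookup is not None:
            hit = self.ann_lookup(vec)
            if hit is not None:
                return ("ann", hit)
        # Level 3: brute-force cosine scan over every stored embedding.
        best_score, best = 0.0, None
        for stored_vec, response in self.rows:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best = score, response
        if best is not None and best_score >= self.threshold:
            return ("cosine", best)
        return None
```

The ordering goes from cheapest to most expensive check, so a verbatim repeat never pays for an embedding call, and the cosine scan only runs when no vector index is available or it returns nothing.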

Architecture

ak-primus/
├── core/
│   └── ak_primus/
│       ├── classifier.py          # 8-class hybrid (rule + ML logistic regression)
│       ├── router.py              # Stack selection per request type
│       ├── server.py              # MCP server — 7 tools, stdio + HTTP/SSE
│       ├── layers/
│       │   ├── cache.py           # 3-level semantic cache (exact / HNSW / cosine)
│       │   ├── compression.py     # LLMLingua-2, LLMLingua-1, LongLLMLingua, SelectiveContext
│       │   ├── memory.py          # L1 working | L2 episodic | L3 semantic (ChromaDB)
│       │   ├── metrics.py         # tiktoken real token counting + cost accounting
│       │   ├── prompt_opt.py      # DSPy BootstrapFewShot + MIPROv2, OPRO, Medprompt
│       │   ├── quality.py         # ROUGE-L + BERTScore + LLM-as-judge + adaptive profile
│       │   └── search.py          # HyDE, RAPTOR, FLARE, ColBERT retrieval
│       ├── ml/
│       │   └── classifier_ml.py   # Embedding-based logistic regression (400 training examples)
│       └── storage/
│           ├── session_store.py   # SQLite WAL — sessions, cache, profiles, memory_facts
│           └── vector_store.py    # ChromaDB HNSW — semantic_cache + memory_facts
├── extension/
│   └── src/
│       ├── extension.ts           # VS Code extension entry point
│       ├── dashboard.ts           # Real-time metrics webview
│       └── mcp-client.ts          # MCP stdio bridge
├── tests/
│   ├── unit/                      # 70 unit tests (all layers)
│   └── integration/               # 14 integration + benchmark threshold tests
└── benchmarks/
    └── run_1200_scenarios.py      # 275-scenario accuracy + latency suite

7 MCP Tools

| Tool | Purpose |
| --- | --- |
| classify_request | Detect request type + return recommended stack with expected token reduction |
| compress_history | Apply compression stack to message history; returns compressed messages + savings |
| build_context | HyDE / RAPTOR / FLARE retrieval-augmented context building |
| optimize_prompt | DSPy BootstrapFewShot / OPRO / Medprompt prompt optimisation |
| get_token_report | Real tiktoken metrics, cost savings, session stats |
| process_request | Master pipeline: classify → cache → memory → compress → quality → adapt |
| report_quality | Feed quality signal (0–1) back into adaptive compression profile |

process_request — master pipeline

{
  "optimized_messages": [...],
  "tokens_before": 1240,
  "tokens_after": 487,
  "tokens_saved": 753,
  "savings_pct": 60.7,
  "request_type": "code",
  "confidence": 0.91,
  "quality_score": 0.876,
  "adapted_ratio": 0.382,
  "cache_hit": false,
  "session_id": "sess-abc123",
  "lifetime_savings": 41250
}
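
The savings fields in this response follow directly from the two token counts, and the expansion guard means `tokens_after` should never exceed `tokens_before`. The sketch below shows that arithmetic. The function names `savings_report` and `guarded_compress` are hypothetical, and the whitespace-based token counter is an assumed stand-in for the tiktoken counting the server uses.

```python
from typing import Callable


def savings_report(tokens_before: int, tokens_after: int) -> dict:
    """Derive tokens_saved and savings_pct from the before/after counts."""
    saved = tokens_before - tokens_after
    pct = round(100.0 * saved / tokens_before, 1) if tokens_before else 0.0
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_saved": saved,
        "savings_pct": pct,
    }


def guarded_compress(
    text: str,
    compress: Callable[[str], str],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # assumed counter
) -> tuple[str, dict]:
    """Run a compressor, but keep the original text if the result is not shorter."""
    before = count_tokens(text)
    candidate = compress(text)
    after = count_tokens(candidate)
    if after >= before:
        # Expansion guard: the "compressed" text did not help, so discard it.
        return text, savings_report(before, before)
    return candidate, savings_report(before, after)


# The sample response above: 1240 tokens in, 487 out.
report = savings_report(1240, 487)
print(report["tokens_saved"], report["savings_pct"])  # 753 60.7
```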

8 Request Types

| Type | Classifier Trigger | Default Stack |
| --- | --- | --- |
| code | Code keywords + task verbs + C++/C# detection | SelectiveContext → PrefixCache |
| rag_doc | Documents present + QA-style question | HyDE retrieval + LLMLingua-2 |
| agent_session | Multi-turn history + follow-up phrases | WorkingMemory + L3 semantic |
| domain_expert | Legal / medical / finance domain terms | Medprompt + SelectiveContext |
| multi_hop | "relationship", "compare", "trace" patterns | RAPTOR + ChainOfThought |
| math | Equations, proof, calculate keywords | LLMLingua-1 (formula-aware) |
| fixed_template | Long system prompt (>600 tokens) | PrefixCache (no compression) |
| simple_qa | Short conversational question | Light SelectiveContext |
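
As a rough illustration of how a rule fast-path could map these triggers to a default stack, consider the sketch below. It is an assumption-heavy simplification, not the shipped classifier: the keyword lists, the order in which rules are checked, the `turns > 3` cut-off, and the function signature are all invented for the example, and the real classifier also has an ML path that is not modelled here.

```python
import re

# Default stacks, taken from the table above.
DEFAULT_STACK = {
    "code": ["SelectiveContext", "PrefixCache"],
    "rag_doc": ["HyDE", "LLMLingua-2"],
    "agent_session": ["WorkingMemory", "L3 semantic"],
    "domain_expert": ["Medprompt", "SelectiveContext"],
    "multi_hop": ["RAPTOR", "ChainOfThought"],
    "math": ["LLMLingua-1"],
    "fixed_template": ["PrefixCache"],
    "simple_qa": ["SelectiveContext"],
}

# Hypothetical keyword patterns; the real trigger lists are not published here.
MATH_RE = re.compile(r"\b(prove|proof|calculate|solve)\b|\d\s*[=+*/^]\s*\d")
CODE_RE = re.compile(r"\b(def|class|function|bug|refactor|compile)\b|c\+\+|c#")
HOP_RE = re.compile(r"\b(relationship|compare|trace)\b")
DOMAIN_RE = re.compile(r"\b(contract|diagnosis|liability|portfolio)\b")


def classify(
    prompt: str,
    system_tokens: int = 0,
    has_documents: bool = False,
    turns: int = 1,
) -> str:
    """Return one of the 8 request types using simple ordered rules."""
    text = prompt.lower()
    if system_tokens > 600:
        return "fixed_template"
    if has_documents and text.rstrip().endswith("?"):
        return "rag_doc"
    if MATH_RE.search(text):
        return "math"
    if CODE_RE.search(text):
        return "code"
    if HOP_RE.search(text):
        return "multi_hop"
    if DOMAIN_RE.search(text):
        return "domain_expert"
    if turns > 3:  # assumed threshold for "multi-turn history"
        return "agent_session"
    return "simple_qa"


request_type = classify("Refactor this C++ function")
print(request_type, DEFAULT_STACK[request_type])
```

Pure regex rules like these need no model load, which is consistent with a fast path that is cheap to evaluate; ambiguous prompts are where an ML fallback would earn its keep.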

Installation

# Minimal (core MCP server, no ML)
pip install ak-primus

# Full (all optional groups)
pip install "ak-primus[all]"

# From source
git clone https://github.com/ak-primus/ak-primus
cd ak-primus
pip install -e "core[all]"

| Group | Installs | When to use |
| --- | --- | --- |
| compress | LLMLingua, transformers, torch | Token compression |
| retrieval | sentence-transformers, chromadb, scikit-learn | Semantic cache + search |
| optimize | dspy-ai, evaluate | DSPy + ROUGE/BERTScore |
| http | uvicorn, starlette | HTTP/SSE transport |
| dev | pytest, ruff, mypy | Development |

Quick Start

Claude Desktop

{
  "mcpServers": {
    "ak-primus": {
      "command": "ak-primus",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Docker

# stdio (MCP)
docker build -f Dockerfile.akprimus --target runtime -t ak-primus:0.2.0 .
docker run -i --rm -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" -v ak_data:/data ak-primus:0.2.0

# HTTP/SSE with health probes
docker compose --profile http up
curl http://localhost:8080/health

Self-Healing Compression

The adaptive profile tunes compression ratios over time per (workspace, request_type):

quality ≥ 0.85  → compress more aggressively next time
quality < 0.70  → back off compression

After roughly 50 samples per request type, the ratio is expected to settle near the best available trade-off between compression and quality.
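
The two thresholds above can be sketched as a small update rule keyed by (workspace, request_type). This is an illustration under stated assumptions, not the package's code: the class name `AdaptiveProfile`, the starting ratio, the step size, and the clamping bounds are invented, the ratio is assumed to mean "fraction of tokens kept", and leaving the ratio unchanged for scores between 0.70 and 0.85 is also an assumption, since the source does not say what happens in that band.

```python
from collections import defaultdict


class AdaptiveProfile:
    """Hypothetical per-(workspace, request_type) compression ratio tracker."""

    def __init__(
        self,
        initial: float = 0.5,  # assumed starting fraction of tokens kept
        step: float = 0.05,    # assumed adjustment per quality sample
        floor: float = 0.2,    # assumed lower bound (most aggressive)
        ceiling: float = 1.0,  # 1.0 means no compression
    ) -> None:
        self.step = step
        self.floor = floor
        self.ceiling = ceiling
        self.ratios: dict[tuple[str, str], float] = defaultdict(lambda: initial)

    def update(self, workspace: str, request_type: str, quality: float) -> float:
        """Apply one quality sample and return the new ratio for this key."""
        key = (workspace, request_type)
        ratio = self.ratios[key]
        if quality >= 0.85:
            ratio -= self.step  # quality is high: keep fewer tokens next time
        elif quality < 0.70:
            ratio += self.step  # quality is low: back off compression
        # Scores in [0.70, 0.85) leave the ratio unchanged (assumption).
        ratio = round(min(self.ceiling, max(self.floor, ratio)), 4)
        self.ratios[key] = ratio
        return ratio


profile = AdaptiveProfile()
for score in (0.92, 0.90, 0.60):
    print(profile.update("my-workspace", "code", score))
```

A fixed step with clamping is the simplest rule that matches the described behaviour; a production version might shrink the step as samples accumulate so that the ratio stops oscillating.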

# Feed quality signal back after reviewing LLM response
curl -s http://localhost:8080/mcp -H 'Content-Type: application/json' -d '{"tool":"report_quality","quality_score":0.92,"request_type":"code"}'
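
For callers that prefer Python over curl, the same request can be built with the standard library. This sketch mirrors the URL and JSON body of the curl example. The helper name `build_report_quality_request` is hypothetical, the range check simply reflects the documented 0 to 1 quality signal, and the response format is not assumed.

```python
import json
import urllib.request


def build_report_quality_request(
    quality_score: float,
    request_type: str,
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0 and 1")
    payload = {
        "tool": "report_quality",
        "quality_score": quality_score,
        "request_type": request_type,
    }
    return urllib.request.Request(
        f"{base_url}/mcp",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_report_quality_request(0.92, "code")
# To send it against a running HTTP transport:
#     with urllib.request.urlopen(request, timeout=10) as response:
#         print(response.read().decode("utf-8"))
```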

Running Tests

pytest tests/ -q                              # 84 tests
python benchmarks/run_1200_scenarios.py --quick   # accuracy + latency

Benchmark (v0.2.0, 275 scenarios)

| Class | Accuracy |
| --- | --- |
| math | 96% |
| fixed_template | 100% |
| agent_session | 92% |
| domain_expert | 90% |
| multi_hop | 90% |
| simple_qa | 88% |
| code | 82% |
| rag_doc | 80% |
| Overall | ~88% |

Classifier latency: p50 < 1 ms, p95 < 5 ms (rule-based path, no model load).

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| AK_PRIMUS_MODEL | claude-sonnet-4-6 | LLM model for DSPy / OPRO / judge |
| AK_PRIMUS_TRANSPORT | stdio | stdio or http |
| AK_PRIMUS_HOST | 127.0.0.1 | HTTP bind host |
| AK_PRIMUS_PORT | 8080 | HTTP bind port |
| AK_PRIMUS_DB | ~/.ak_primus/store.db | SQLite path |
| AK_PRIMUS_DSPY_CACHE | ~/.ak_primus/dspy_cache | DSPy program cache |
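
A wrapper script or test harness may want to resolve these variables the same way the server presumably does. The sketch below reads them with the documented defaults. The `Settings` dataclass and `load_settings` function are hypothetical names for illustration, and rejecting transports other than `stdio` or `http` is an assumption based on the description column.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model: str
    transport: str
    host: str
    port: int
    db_path: Path
    dspy_cache: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Resolve AK_PRIMUS_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    transport = env.get("AK_PRIMUS_TRANSPORT", "stdio")
    if transport not in ("stdio", "http"):
        raise ValueError(f"unsupported transport: {transport!r}")
    return Settings(
        model=env.get("AK_PRIMUS_MODEL", "claude-sonnet-4-6"),
        transport=transport,
        host=env.get("AK_PRIMUS_HOST", "127.0.0.1"),
        port=int(env.get("AK_PRIMUS_PORT", "8080")),
        # expanduser() turns the leading "~" into the current user's home directory.
        db_path=Path(env.get("AK_PRIMUS_DB", "~/.ak_primus/store.db")).expanduser(),
        dspy_cache=Path(
            env.get("AK_PRIMUS_DSPY_CACHE", "~/.ak_primus/dspy_cache")
        ).expanduser(),
    )


settings = load_settings({"AK_PRIMUS_TRANSPORT": "http", "AK_PRIMUS_PORT": "9000"})
print(settings.transport, settings.host, settings.port)
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without mutating the process environment.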

License

MIT
