Intelligent token compression + routing MCP server

AK-Primus v0.2.0

Intelligent token compression + routing MCP server with adaptive self-healing.

Every LLM request passes through a 7-layer pipeline whose stack is selected by an 8-class hybrid classifier, so the compression and retrieval steps that run are the ones suited to that request type. Nothing is applied blindly. Per-request cost is designed to fall with use, as the cache fills and the adaptive profile tunes compression ratios.

Request → Classifier → Router → [Cache | Memory | Compress | Search | Prompt-Opt] → LLM
                                                                                       ↓
                                                        Quality Score ← Response ←────┘
                                                              ↓
                                                    Adaptive Profile Update

What's new in 0.2.0

Area	Change
Compression	Expansion guard — never inflates token count; LLMLingua-1 uses GPT-2 (was 7B Llama)
Classifier	ML hybrid (400-example logistic regression + rule fast-path); 88%+ accuracy on 275-scenario suite
Cache	3-level lookup: SHA-256 exact → ChromaDB HNSW ANN → SQLite cosine fallback
Memory	L1 working memory + L2 episodic (session consolidation) + L3 semantic (cross-session ChromaDB)
DSPy	4 typed Signatures wired end-to-end; BootstrapFewShot + MIPROv2 with disk cache
Quality	ROUGE-L + BERTScore + LLM-as-judge blended score; `score_async()` variant
Transport	`AK_PRIMUS_TRANSPORT=http` — Starlette ASGI with `/health`, `/ready`, `/metrics`, SSE
Testing	84 pytest tests + 275-scenario benchmark suite with per-class accuracy CI gate
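The expansion guard in the Compression row can be pictured as a wrapper that keeps the original text whenever a compressor fails to shrink it. The sketch below is illustrative only: `guarded_compress` and `count_tokens` are hypothetical names, and the whitespace token count is a stand-in for the tiktoken counting the project reports using.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Whitespace approximation of a token count (stand-in for a real tokenizer)."""
    return len(text.split())


def guarded_compress(
    text: str,
    compress: Callable[[str], str],
    count: Callable[[str], int] = count_tokens,
) -> str:
    """Run `compress`, but fall back to the original text if nothing was saved."""
    candidate = compress(text)
    if count(candidate) >= count(text):
        # The compressor inflated (or failed to reduce) the input: keep the original.
        return text
    return candidate
```

With a guard like this, a compressor that appends boilerplate or returns its input unchanged can never increase the token count sent to the LLM.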

Architecture

ak-primus/
├── core/
│   └── ak_primus/
│       ├── classifier.py          # 8-class hybrid (rule + ML logistic regression)
│       ├── router.py              # Stack selection per request type
│       ├── server.py              # MCP server — 7 tools, stdio + HTTP/SSE
│       ├── layers/
│       │   ├── cache.py           # 3-level semantic cache (exact / HNSW / cosine)
│       │   ├── compression.py     # LLMLingua-2, LLMLingua-1, LongLLMLingua, SelectiveContext
│       │   ├── memory.py          # L1 working | L2 episodic | L3 semantic (ChromaDB)
│       │   ├── metrics.py         # tiktoken real token counting + cost accounting
│       │   ├── prompt_opt.py      # DSPy BootstrapFewShot + MIPROv2, OPRO, Medprompt
│       │   ├── quality.py         # ROUGE-L + BERTScore + LLM-as-judge + adaptive profile
│       │   └── search.py          # HyDE, RAPTOR, FLARE, ColBERT retrieval
│       ├── ml/
│       │   └── classifier_ml.py   # Embedding-based logistic regression (400 training examples)
│       └── storage/
│           ├── session_store.py   # SQLite WAL — sessions, cache, profiles, memory_facts
│           └── vector_store.py    # ChromaDB HNSW — semantic_cache + memory_facts
├── extension/
│   └── src/
│       ├── extension.ts           # VS Code extension entry point
│       ├── dashboard.ts           # Real-time metrics webview
│       └── mcp-client.ts          # MCP stdio bridge
├── tests/
│   ├── unit/                      # 70 unit tests (all layers)
│   └── integration/               # 14 integration + benchmark threshold tests
└── benchmarks/
    └── run_1200_scenarios.py      # 275-scenario accuracy + latency suite
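
The 3-level lookup described for `layers/cache.py` (exact hash, then ANN, then cosine fallback) might look roughly like the sketch below. This is not the project's code: `ThreeLevelCache`, the `embed` and `ann_lookup` callables, and the 0.9 similarity threshold are all assumptions, and the ANN level is left as a pluggable callable rather than a real ChromaDB query.

```python
import hashlib
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ThreeLevelCache:
    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],
        ann_lookup: Optional[Callable[[List[float]], Optional[str]]] = None,
        threshold: float = 0.9,  # hypothetical similarity cut-off
    ) -> None:
        self.embed = embed
        self.ann_lookup = ann_lookup
        self.threshold = threshold
        self.exact = {}   # SHA-256 hex digest -> cached response
        self.rows = []    # (embedding, response) pairs for the cosine fallback

    @staticmethod
    def key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self.exact[self.key(prompt)] = response
        self.rows.append((list(self.embed(prompt)), response))

    def get(self, prompt: str) -> Tuple[Optional[str], str]:
        # Level 1: exact match on the SHA-256 of the prompt.
        hit = self.exact.get(self.key(prompt))
        if hit is not None:
            return hit, "exact"
        vec = list(self.embed(prompt))
        # Level 2: approximate nearest neighbour index, if one is plugged in.
        if self.ann_lookup is not None:
            hit = self.ann_lookup(vec)
            if hit is not None:
                return hit, "ann"
        # Level 3: brute-force cosine scan over stored embeddings.
        best = max(self.rows, key=lambda row: cosine(vec, row[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1], "cosine"
        return None, "miss"
```

Ordering the levels from cheapest to most expensive means a repeated prompt never pays for an embedding call, and the brute-force scan only runs when the index is unavailable or returns nothing.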

7 MCP Tools

Tool	Purpose
`classify_request`	Detect request type + return recommended stack with expected token reduction
`compress_history`	Apply compression stack to message history; returns compressed messages + savings
`build_context`	HyDE / RAPTOR / FLARE retrieval-augmented context building
`optimize_prompt`	DSPy BootstrapFewShot / OPRO / Medprompt prompt optimisation
`get_token_report`	Real tiktoken metrics, cost savings, session stats
`process_request`	Master pipeline: classify → cache → memory → compress → quality → adapt
`report_quality`	Feed quality signal (0–1) back into adaptive compression profile
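
MCP tools are invoked with JSON-RPC 2.0 `tools/call` requests. As a rough illustration, the helper below builds such a request for any of the seven tools. `build_tool_call` is a hypothetical helper, and the `messages` argument name in the usage note is an assumption, since the tool input schemas are not listed here.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # simple monotonically increasing JSON-RPC ids


def build_tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def to_wire(request: Dict[str, Any]) -> str:
    """Serialise the request as a single line of JSON."""
    return json.dumps(request, separators=(",", ":"))
```

For example, `to_wire(build_tool_call("classify_request", {"messages": []}))` yields a payload that an MCP client would send over stdio or HTTP; in practice an MCP client library normally handles this framing for you.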

`process_request` — master pipeline

Example response:

{
  "optimized_messages": [...],
  "tokens_before": 1240,
  "tokens_after": 487,
  "tokens_saved": 753,
  "savings_pct": 60.7,
  "request_type": "code",
  "confidence": 0.91,
  "quality_score": 0.876,
  "adapted_ratio": 0.382,
  "cache_hit": false,
  "session_id": "sess-abc123",
  "lifetime_savings": 41250
}
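
The savings fields in the payload above follow directly from the two token counts. A minimal sketch of that arithmetic, assuming `savings_pct` is rounded to one decimal place (the function name `savings` is illustrative):

```python
from typing import Tuple


def savings(tokens_before: int, tokens_after: int) -> Tuple[int, float]:
    """Return (tokens_saved, savings_pct) for one request."""
    saved = tokens_before - tokens_after
    # Guard against an empty input to avoid dividing by zero.
    pct = round(100 * saved / tokens_before, 1) if tokens_before else 0.0
    return saved, pct


print(savings(1240, 487))  # (753, 60.7)
```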

8 Request Types

Type	Classifier Trigger	Default Stack
`code`	Code keywords + task verbs + C++/C# detection	SelectiveContext → PrefixCache
`rag_doc`	Documents present + QA-style question	HyDE retrieval + LLMLingua-2
`agent_session`	Multi-turn history + follow-up phrases	WorkingMemory + L3 semantic
`domain_expert`	Legal / medical / finance domain terms	Medprompt + SelectiveContext
`multi_hop`	"relationship", "compare", "trace" patterns	RAPTOR + ChainOfThought
`math`	Equations, proof, calculate keywords	LLMLingua-1 (formula-aware)
`fixed_template`	Long system prompt (>600 tokens)	PrefixCache (no compression)
`simple_qa`	Short conversational question	Light SelectiveContext
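
The Default Stack column above amounts to a lookup from request type to an ordered list of layers. The sketch below encodes the table as a dictionary; `DEFAULT_STACKS`, `select_stack`, and the choice of `simple_qa` as the fallback for unknown types are assumptions, not the router's actual code.

```python
from typing import Dict, List

# Request type -> ordered default stack, transcribed from the table above.
DEFAULT_STACKS: Dict[str, List[str]] = {
    "code": ["SelectiveContext", "PrefixCache"],
    "rag_doc": ["HyDE", "LLMLingua-2"],
    "agent_session": ["WorkingMemory", "L3 semantic"],
    "domain_expert": ["Medprompt", "SelectiveContext"],
    "multi_hop": ["RAPTOR", "ChainOfThought"],
    "math": ["LLMLingua-1"],
    "fixed_template": ["PrefixCache"],
    "simple_qa": ["SelectiveContext"],
}


def select_stack(request_type: str, fallback: str = "simple_qa") -> List[str]:
    """Return the default stack for a request type, or the fallback stack."""
    return DEFAULT_STACKS.get(request_type, DEFAULT_STACKS[fallback])
```

Keeping the mapping as data rather than branching logic makes it straightforward to override a single stack per workspace without touching the classifier.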

Installation

# Minimal (core MCP server, no ML)
pip install ak-primus

# Full (all optional groups)
pip install "ak-primus[all]"

# From source
git clone https://github.com/ak-primus/ak-primus
cd ak-primus
pip install -e "core[all]"

Group	Installs	When to use
`compress`	LLMLingua, transformers, torch	Token compression
`retrieval`	sentence-transformers, chromadb, scikit-learn	Semantic cache + search
`optimize`	dspy-ai, evaluate	DSPy + ROUGE/BERTScore
`http`	uvicorn, starlette	HTTP/SSE transport
`dev`	pytest, ruff, mypy	Development
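
Because the optional groups are independent, it can help to check which ones are importable before enabling a feature. The sketch below uses only the standard library; the import names in `GROUP_MODULES` are assumed from the package names in the table (for example `sklearn` for scikit-learn and `dspy` for dspy-ai), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Dict, List

# Optional group -> top-level import names (assumed from the table above).
GROUP_MODULES: Dict[str, List[str]] = {
    "compress": ["llmlingua", "transformers", "torch"],
    "retrieval": ["sentence_transformers", "chromadb", "sklearn"],
    "optimize": ["dspy", "evaluate"],
    "http": ["uvicorn", "starlette"],
}


def missing_modules(group: str, modules: Dict[str, List[str]] = GROUP_MODULES) -> List[str]:
    """List the modules of an optional group that cannot be found on this interpreter."""
    return [name for name in modules[group] if find_spec(name) is None]


if __name__ == "__main__":
    for group in GROUP_MODULES:
        print(group, "missing:", missing_modules(group))
```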

Quick Start

Claude Desktop

{
  "mcpServers": {
    "ak-primus": {
      "command": "ak-primus",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Docker

# stdio (MCP)
docker build -f Dockerfile.akprimus --target runtime -t ak-primus:0.2.0 .
docker run -i --rm -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" -v ak_data:/data ak-primus:0.2.0

# HTTP/SSE with health probes
docker compose --profile http up
curl http://localhost:8080/health

Self-Healing Compression

The adaptive profile tunes compression ratios over time per (workspace, request_type):

quality ≥ 0.85  → compress more aggressively next time
quality < 0.70  → back off compression

After roughly 50 samples per request type, the ratio is expected to settle near the Pareto-optimal trade-off between compression and quality.
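
One way to picture the update rule is a bounded step on the compression ratio. The sketch below assumes the ratio is the fraction of tokens kept (so lower means more aggressive), that scores between the two thresholds leave it unchanged, and that the step size and bounds are 0.05, 0.1 and 1.0. None of these values or names come from the project.

```python
def update_ratio(
    ratio: float,
    quality: float,
    step: float = 0.05,    # hypothetical step size
    lowest: float = 0.1,   # hypothetical floor: never keep less than 10%
    highest: float = 1.0,  # 1.0 means no compression at all
) -> float:
    """Nudge the kept-token ratio based on one quality score in [0, 1]."""
    if quality >= 0.85:
        ratio -= step  # quality is high: compress more aggressively next time
    elif quality < 0.70:
        ratio += step  # quality is low: back off compression
    # Scores in [0.70, 0.85) leave the ratio where it is (an assumption).
    return min(highest, max(lowest, ratio))
```

Repeated over many samples per (workspace, request_type), a rule of this shape drifts toward the most aggressive ratio that still keeps quality above the lower threshold.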

# Feed quality signal back after reviewing LLM response
curl -s http://localhost:8080/mcp -H 'Content-Type: application/json' -d '{"tool":"report_quality","quality_score":0.92,"request_type":"code"}'

Running Tests

pytest tests/ -q                              # 84 tests
python benchmarks/run_1200_scenarios.py --quick   # accuracy + latency

Benchmark (v0.2.0, 275 scenarios)

Class	Accuracy
`math`	96%
`fixed_template`	100%
`agent_session`	92%
`domain_expert`	90%
`multi_hop`	90%
`simple_qa`	88%
`code`	82%
`rag_doc`	80%
Overall	~88%

Classifier latency: p50 < 1 ms, p95 < 5 ms on the rule-based fast path (no model load).
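
A per-class accuracy gate over numbers like these could be as simple as the sketch below. The accuracies are copied from the table above; `failing_classes` and the 0.80 threshold are hypothetical, since the actual CI threshold is not stated.

```python
from typing import Dict, List

# Per-class accuracy from the v0.2.0 benchmark table.
ACCURACY: Dict[str, float] = {
    "math": 0.96,
    "fixed_template": 1.00,
    "agent_session": 0.92,
    "domain_expert": 0.90,
    "multi_hop": 0.90,
    "simple_qa": 0.88,
    "code": 0.82,
    "rag_doc": 0.80,
}


def failing_classes(accuracy: Dict[str, float], min_accuracy: float) -> List[str]:
    """Return the classes whose accuracy falls below the gate, sorted by name."""
    return sorted(name for name, acc in accuracy.items() if acc < min_accuracy)


if __name__ == "__main__":
    failures = failing_classes(ACCURACY, min_accuracy=0.80)  # hypothetical threshold
    if failures:
        raise SystemExit(f"accuracy gate failed for: {', '.join(failures)}")
```

Gating per class rather than on the overall figure keeps a strong class such as `fixed_template` from masking a regression in `code` or `rag_doc`.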

Environment Variables

Variable	Default	Description
`AK_PRIMUS_MODEL`	`claude-sonnet-4-6`	LLM model for DSPy / OPRO / judge
`AK_PRIMUS_TRANSPORT`	`stdio`	`stdio` or `http`
`AK_PRIMUS_HOST`	`127.0.0.1`	HTTP bind host
`AK_PRIMUS_PORT`	`8080`	HTTP bind port
`AK_PRIMUS_DB`	`~/.ak_primus/store.db`	SQLite path
`AK_PRIMUS_DSPY_CACHE`	`~/.ak_primus/dspy_cache`	DSPy program cache
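
For reference, the variables and defaults above can be read into a typed settings object as sketched below. `Settings` and `load_settings` are illustrative names, not part of the package, and the validation of the transport value is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model: str
    transport: str
    host: str
    port: int
    db: Path
    dspy_cache: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read AK_PRIMUS_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    transport = env.get("AK_PRIMUS_TRANSPORT", "stdio")
    if transport not in ("stdio", "http"):
        raise ValueError(f"unsupported transport: {transport!r}")
    return Settings(
        model=env.get("AK_PRIMUS_MODEL", "claude-sonnet-4-6"),
        transport=transport,
        host=env.get("AK_PRIMUS_HOST", "127.0.0.1"),
        port=int(env.get("AK_PRIMUS_PORT", "8080")),
        db=Path(os.path.expanduser(env.get("AK_PRIMUS_DB", "~/.ak_primus/store.db"))),
        dspy_cache=Path(
            os.path.expanduser(env.get("AK_PRIMUS_DSPY_CACHE", "~/.ak_primus/dspy_cache"))
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to test, for instance `load_settings({"AK_PRIMUS_TRANSPORT": "http"})`.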

License

MIT

Release history

Version	Released
0.2.1	Jul 3, 2026
0.2.0 (this version)	Jul 3, 2026

Download files

Source Distribution

ak_primus-0.2.0.tar.gz (121.5 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

ak_primus-0.2.0-py3-none-any.whl (127.0 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file ak_primus-0.2.0.tar.gz.

File metadata

Download URL: ak_primus-0.2.0.tar.gz
Upload date: Jul 3, 2026
Size: 121.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for ak_primus-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0db3fa2966780e2cb67cd252ba179c55c4f9c47eba4fda025e04ddd67c0e7a7e`
MD5	`3ffcbf677f3648cc5121f2b934712c7e`
BLAKE2b-256	`4d6f84192aba385d455eb13678ed49065865f8073d0ef19e7ee5aa308cea436f`

File details

Details for the file ak_primus-0.2.0-py3-none-any.whl.

File metadata

Download URL: ak_primus-0.2.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 127.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for ak_primus-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2fc5fc088dc6894031be55105e21ced991d14503dd7cefd82236f4818f65745a`
MD5	`6955fc2ec453a8e001348d1e65c257ad`
BLAKE2b-256	`0e7a9ac3e834497a9bf770ffc9fe965a0d192c119716a6eccdb019718349377e`
