HyperRAG
Python SDK for RAG serving optimization: RAGO Pareto scheduler + RAGCache KV cache manager.
KV cache + Pareto scheduling middleware for RAG pipelines. Plugs in between your vector search and your LLM. Built on two systems papers: RAGO (ISCA'25) and RAGCache (TOCS'25).
The problem
Every RAG request re-processes the same documents from scratch. At 70B params, that's ~650ms of wasted prefill before you see a single output token — and you just paid for the same compute last request.
The fix: cache the transformer's KV state per document. When a document appears again, load the cache instead of recomputing. TTFT drops proportionally to hit rate.
This SDK manages that cache and finds the optimal GPU/batch/cache configuration for your workload.
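The caching idea in one sketch. Expected TTFT under caching is roughly `hit_rate × cached_TTFT + (1 − hit_rate) × full_prefill_TTFT`, so the gain tracks the hit rate. The class below is illustrative only; the SDK's real cache is a multi-tier prefix trie with PGDSF eviction (see Source layout):

```python
# Illustrative sketch only -- NOT the SDK's internal data structure.
# Idea: keep each document's prefilled KV state keyed by doc ID and
# reuse it on repeat hits instead of re-running prefill.
from collections import OrderedDict

class DocKVCache:
    """LRU map from doc_id to its prefilled KV state."""
    def __init__(self, capacity: int = 128):
        self._store: OrderedDict[str, object] = OrderedDict()
        self._capacity = capacity

    def get(self, doc_id: str):
        kv = self._store.get(doc_id)
        if kv is not None:
            self._store.move_to_end(doc_id)  # refresh recency on a hit
        return kv  # None => miss: pay the prefill, then put()

    def put(self, doc_id: str, kv) -> None:
        self._store[doc_id] = kv
        self._store.move_to_end(doc_id)
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict least-recently-used
```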
Install
```bash
pip install hyperrag          # core (schedule optimizer + cache planner)
pip install "hyperrag[gpu]"   # + vLLM for real inference
pip install "hyperrag[all]"   # + dev + eval tooling
```
Quickstart
Install, configure, optimize, serve. Four steps.
Pipeline Integration
→ docs/pipeline-integration.md
Drop this into a pipeline you already have (LangChain, LlamaIndex, custom).
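A rough sketch of what the glue can look like with a LangChain retriever. Everything here except the `Query` signature is a hypothetical adapter: `retriever`, `question`, the `"id"` metadata field, and the 4-chars-per-token estimate are placeholders, and `ctrl` is the controller from `build_controller()` shown below. See docs/pipeline-integration.md for the supported adapters.

```python
# Hypothetical LangChain glue. Query takes (query_id, text, doc_ids, doc_tokens).
docs = retriever.get_relevant_documents(question)  # your existing retriever
resp = ctrl.process(Query(
    "q42",
    question,
    doc_ids=[d.metadata["id"] for d in docs],            # assumes an "id" field
    doc_tokens=[len(d.page_content) // 4 for d in docs], # ~4 chars per token
))
```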
How it works
- `optimize()`: Pareto search over GPU counts, batch sizes, and cache hit rates. Returns the config that minimises TTFT (or maximises QPS) for your workload.
- `recommend_cache()`: Sweeps the GPU/host DRAM split. Tells you how to allocate the cache budget.
- `build_controller()`: Returns a live serving controller backed by vLLM. Every request goes through cache lookup → speculative pipelining → KV cache admission.
```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="long_context",
    model=LLMModel.LLAMA_3_1_70B,
    gpu_budget_gb=8.0,
    host_budget_gb=32.0,
))

# Pre-flight: find the best schedule before you commit hardware
result = rago.optimize()
print(result.summary())
# TTFT=3.4ms QPS=12.5 hit_rate=0.82 gpus=4 batch=8

# Production: real vLLM inference with KV caching active
ctrl = rago.build_controller()  # requires NVIDIA GPU + pip install "hyperrag[gpu]"
resp = ctrl.process(Query("q1", "What is transformer attention?", ["d1", "d2"], [512, 256]))
print(f"TTFT={resp.ttft_s*1000:.1f}ms cache_hit={resp.cache_hit}")
```
Model presets
```python
from hyperrag import LLMModel

# LLMs
LLMModel.LLAMA_3_1_8B    LLMModel.LLAMA_3_1_70B    LLMModel.LLAMA_3_1_405B
LLMModel.MISTRAL_7B      LLMModel.MISTRAL_NEMO_12B
LLMModel.GEMMA_2_9B      LLMModel.GEMMA_2_27B
LLMModel.QWEN_2_5_72B    LLMModel.DEEPSEEK_R1_70B

# SLMs
LLMModel.LLAMA_3_2_1B    LLMModel.LLAMA_3_2_3B
LLMModel.PHI_3_5_MINI    LLMModel.GEMMA_2_2B
LLMModel.QWEN_2_5_7B     LLMModel.DEEPSEEK_R1_7B

# Custom
LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                num_layers=32, q_heads=32, kv_heads=8, head_dim=128)
```
All presets: `from hyperrag import ALL_MODELS`.
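A `custom()` spec should drop in anywhere a preset does; a sketch, assuming the config accepts it like any built-in model:

```python
# Use a custom model spec in place of a preset.
my_model = LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                           num_layers=32, q_heads=32, kv_heads=8, head_dim=128)
cfg = RAGOptimizeConfig(paradigm="hyperscale", model=my_model)
```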
RAG paradigms
| `paradigm=` | Default model | Bottleneck | Use when |
|---|---|---|---|
| `"hyperscale"` | 8B | FAISS scan | Standard single-hop RAG |
| `"long_context"` | 70B | LLM prefill | 1M+ token context, no retrieval |
| `"iterative"` | 70B | FAISS × 4 | Multi-hop / agentic retrieval |
| `"rewriter_reranker"` | 70B | Encoder + rewriter | Query rewrite + cross-encoder rerank |
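Picking a row from the table is just a config change; for example, multi-hop retrieval (sketch):

```python
# Multi-hop / agentic retrieval: "iterative" models four FAISS round trips per query.
cfg = RAGOptimizeConfig(paradigm="iterative", model=LLMModel.LLAMA_3_1_70B)
```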
Config reference
```python
RAGOptimizeConfig(
    paradigm="hyperscale",        # see table above
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,            # GPU HBM for KV cache (GB)
    host_budget_gb=16.0,          # host DRAM for KV cache (GB)
    hardware_profile=None,        # path to JSON from scripts/profile_hardware.py
    num_gpus=None,                # override GPU count (or RAGO_NUM_GPUS env var)
    max_ttft_s=None,              # filter: reject schedules with TTFT > this
    min_qps=None,                 # filter: reject schedules with QPS < this
)
```
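The two filters turn the Pareto search into an SLO check; a sketch using only the fields above:

```python
# Only accept schedules meeting an SLO: at most 10 ms TTFT and at least 5 QPS.
cfg = RAGOptimizeConfig(
    paradigm="hyperscale",
    model=LLMModel.LLAMA_3_1_8B,
    max_ttft_s=0.010,
    min_qps=5.0,
)
```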
Key classes
| Class | What it does |
|---|---|
| `RAGOptimize` | Facade: `optimize()`, `recommend_cache()`, `build_controller()` |
| `RAGServeController` | Serving: `process()`, `process_batch()`, `warmup()`, `metrics()`, `reset()` |
| `LLMModel` | Model spec with 17 built-in presets + `custom()` |
| `Query` | `(query_id, text, doc_ids, doc_tokens)` |
| `QueryResult` | `(ttft_s, latency_s, cache_hit, cached_tokens, speculative)` |
| `OptimizeResult` | `(ttft_s, qps_per_chip, cache_hit_rate, pareto_size)` |
| `CacheRecommendation` | `(gpu_gb, host_gb, estimated_hit_rate, reasoning)` |
| `ServeMetrics` | `(hit_rate, avg_ttft_s, gpu_used_mib, host_used_mib, ...)` |
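Serving-side observability in one sketch. Assumes `ctrl` from the Quickstart; that `process_batch()` takes a list of `Query` is our reading of the table above, and the repeated query is there to demonstrate a cache hit:

```python
# Warm the cache, serve a small batch, then read the counters.
ctrl.warmup()
results = ctrl.process_batch([
    Query("q1", "What is attention?", ["d1"], [512]),
    Query("q2", "What is attention?", ["d1"], [512]),  # repeat doc: should hit cache
])
m = ctrl.metrics()
print(f"hit_rate={m.hit_rate:.2f}  avg_ttft={m.avg_ttft_s*1000:.1f} ms")
```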
Exceptions: `RAGOptimizeError` → `ConfigError`, `ScheduleError`, `HardwareError`, `ServeError`.
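Since everything derives from `RAGOptimizeError`, one handler covers all failures. A sketch; we assume the subclasses are importable from `hyperrag`, and an over-tight SLO filter raising `ScheduleError` is our guess at a plausible failure mode:

```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, RAGOptimizeError

try:
    # Impossible SLO on purpose: no schedule can hit 0.1 ms TTFT.
    rago = RAGOptimize(RAGOptimizeConfig(paradigm="hyperscale", max_ttft_s=0.0001))
    rago.optimize()
except RAGOptimizeError as e:  # base class of ConfigError, ScheduleError, ...
    print(f"optimization failed: {e}")
```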
Hardware calibration
For tighter schedule predictions, profile once and pass the result:
```bash
python scripts/profile_hardware.py --output profiles/my_gpu.json
```

```python
RAGOptimizeConfig(model=LLMModel.LLAMA_3_1_8B, hardware_profile="profiles/my_gpu.json")
```
Benchmarks
4× A100-SXM4-40GB, 1000 queries, Zipfian workload, calibrated cost model.
| Paradigm | Baseline | +RAGCache | +RAGO | Speedup |
|---|---|---|---|---|
| Hyperscale 8B | 264.8 ms | 251.6 ms | 243.6 ms | 1.09× |
| Long-context 70B | 30.9 ms | 13.7 ms | 3.4 ms | 9.02× |
| Iterative 70B | 264.8 ms | 251.6 ms | 243.6 ms | 1.09× |
| Rewriter-Reranker 70B | 649.2 ms | 635.9 ms | 339.7 ms | 1.91× |
Tests
```bash
pytest tests/ -m "not gpu" -v   # 100 tests, no GPU needed
pytest tests/ -m gpu -v         # serving tests (requires NVIDIA GPU + vllm)
```
MS-MARCO v2.1 fixture (50 queries) at `tests/fixtures/ms_marco_sample.json`.
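To poke at the fixture directly (sketch; we assume it is a JSON array of 50 query records, so check the file for the exact schema):

```python
import json
from pathlib import Path

records = json.loads(Path("tests/fixtures/ms_marco_sample.json").read_text())
print(f"{len(records)} records")  # 50 queries per the note above
```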
Source layout
```text
src/
├── hyperrag/                SDK (public API)
│   ├── client.py            RAGOptimize facade
│   ├── config.py            RAGOptimizeConfig
│   ├── serve.py             RAGServeController
│   ├── models.py            LLMModel + result types
│   └── exceptions.py
└── engine/                  Engine (RAGO + RAGCache algorithms)
    ├── schema/              RAGSchema workload model
    ├── cost_model/          Roofline / calibrated / adaptive
    ├── knowledge_tree/      Prefix trie for KV cache
    ├── cache/               Multi-tier cache + PGDSF
    ├── request_scheduler/   Cache-aware reorder + speculative pipeline
    ├── scheduler/           Pareto scheduler
    ├── inference/           vLLM backend
    └── serving/             RAGController
```
References
- Wenqi Jiang et al. "RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving." ISCA 2025. arXiv:2503.14649
- Chao Jin et al. "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation." ACM TOCS 2025. arXiv:2404.12457
MIT License