HyperRAG
Python SDK for RAG serving optimization: RAGO Pareto scheduler + RAGCache KV cache manager.
KV cache + Pareto scheduling middleware for RAG pipelines. Plugs in between your vector search and your LLM. Built on two systems papers: RAGO (ISCA'25) and RAGCache (TOCS'25).
The problem
Every RAG request re-processes the same documents from scratch. At 70B params, that's ~650ms of wasted prefill before you see a single output token — and you just paid for the same compute last request.
The fix: cache the transformer's KV state per document. When a document appears again, load the cache instead of recomputing. TTFT drops proportionally to hit rate.
This SDK manages that cache and finds the optimal GPU/batch/cache configuration for your workload.
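The caching idea in one sketch. Expected TTFT under caching is roughly `hit_rate × cached_TTFT + (1 − hit_rate) × full_prefill_TTFT`, so the gain tracks the hit rate. The class below is illustrative only; the SDK's real cache is a multi-tier prefix trie with PGDSF eviction (see Source layout):

```python
# Illustrative sketch only -- NOT the SDK's internal data structure.
# Idea: keep each document's prefilled KV state keyed by doc ID and
# reuse it on repeat hits instead of re-running prefill.
from collections import OrderedDict

class DocKVCache:
    """LRU map from doc_id to its prefilled KV state."""
    def __init__(self, capacity: int = 128):
        self._store: OrderedDict[str, object] = OrderedDict()
        self._capacity = capacity

    def get(self, doc_id: str):
        kv = self._store.get(doc_id)
        if kv is not None:
            self._store.move_to_end(doc_id)  # refresh recency on a hit
        return kv  # None => miss: pay the prefill, then put()

    def put(self, doc_id: str, kv) -> None:
        self._store[doc_id] = kv
        self._store.move_to_end(doc_id)
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict least-recently-used
```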
Install
```bash
pip install hyperrag          # core (schedule optimizer + cache planner)
pip install "hyperrag[gpu]"   # + vLLM for real inference
pip install "hyperrag[all]"   # + dev + eval tooling
```
Quickstart
Install, configure, optimize, serve. Four steps.
Pipeline Integration
→ docs/pipeline-integration.md
Drop this into a pipeline you already have (LangChain, LlamaIndex, custom).
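A rough sketch of what the glue can look like with a LangChain retriever. Everything here except the `Query` signature is a hypothetical adapter: `retriever`, `question`, the `"id"` metadata field, and the 4-chars-per-token estimate are placeholders, and `ctrl` is the controller from `build_controller()` shown below. See docs/pipeline-integration.md for the supported adapters.

```python
# Hypothetical LangChain glue. Query takes (query_id, text, doc_ids, doc_tokens).
docs = retriever.get_relevant_documents(question)  # your existing retriever
resp = ctrl.process(Query(
    "q42",
    question,
    doc_ids=[d.metadata["id"] for d in docs],            # assumes an "id" field
    doc_tokens=[len(d.page_content) // 4 for d in docs], # ~4 chars per token
))
```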
How it works
- `optimize()`: Pareto search over GPU counts, batch sizes, and cache hit rates. Returns the config that minimises TTFT (or maximises QPS) for your workload.
- `recommend_cache()`: Sweeps the GPU/host DRAM split. Tells you how to allocate the cache budget.
- `build_controller()`: Returns a live serving controller backed by vLLM. Every request goes through cache lookup → speculative pipelining → KV cache admission.
```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="long_context",
    model=LLMModel.LLAMA_3_1_70B,
    gpu_budget_gb=8.0,
    host_budget_gb=32.0,
))

# Pre-flight: find the best schedule before you commit hardware
result = rago.optimize()
print(result.summary())
# TTFT=3.4ms QPS=12.5 hit_rate=0.82 gpus=4 batch=8

# Production: real vLLM inference with KV caching active
ctrl = rago.build_controller()  # requires NVIDIA GPU + pip install "hyperrag[gpu]"
resp = ctrl.process(Query("q1", "What is transformer attention?", ["d1", "d2"], [512, 256]))
print(f"TTFT={resp.ttft_s*1000:.1f}ms cache_hit={resp.cache_hit}")
```
Model presets
```python
from hyperrag import LLMModel

# LLMs
LLMModel.LLAMA_3_1_8B    LLMModel.LLAMA_3_1_70B    LLMModel.LLAMA_3_1_405B
LLMModel.MISTRAL_7B      LLMModel.MISTRAL_NEMO_12B
LLMModel.GEMMA_2_9B      LLMModel.GEMMA_2_27B
LLMModel.QWEN_2_5_72B    LLMModel.DEEPSEEK_R1_70B

# SLMs
LLMModel.LLAMA_3_2_1B    LLMModel.LLAMA_3_2_3B
LLMModel.PHI_3_5_MINI    LLMModel.GEMMA_2_2B
LLMModel.QWEN_2_5_7B     LLMModel.DEEPSEEK_R1_7B

# Custom
LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                num_layers=32, q_heads=32, kv_heads=8, head_dim=128)
```
All presets: `from hyperrag import ALL_MODELS`.
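A `custom()` spec should drop in anywhere a preset does; a sketch, assuming the config accepts it like any built-in model:

```python
# Use a custom model spec in place of a preset.
my_model = LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                           num_layers=32, q_heads=32, kv_heads=8, head_dim=128)
cfg = RAGOptimizeConfig(paradigm="hyperscale", model=my_model)
```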
RAG paradigms
| `paradigm=` | Default model | Bottleneck | Use when |
|---|---|---|---|
| `"hyperscale"` | 8B | FAISS scan | Standard single-hop RAG |
| `"long_context"` | 70B | LLM prefill | 1M+ token context, no retrieval |
| `"iterative"` | 70B | FAISS × 4 | Multi-hop / agentic retrieval |
| `"rewriter_reranker"` | 70B | Encoder + rewriter | Query rewrite + cross-encoder rerank |
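Picking a row from the table is just a config change; for example, multi-hop retrieval (sketch):

```python
# Multi-hop / agentic retrieval: "iterative" models four FAISS round trips per query.
cfg = RAGOptimizeConfig(paradigm="iterative", model=LLMModel.LLAMA_3_1_70B)
```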
Config reference
```python
RAGOptimizeConfig(
    paradigm="hyperscale",        # see table above
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,            # GPU HBM for KV cache (GB)
    host_budget_gb=16.0,          # host DRAM for KV cache (GB)
    hardware_profile=None,        # path to JSON from scripts/profile_hardware.py
    num_gpus=None,                # override GPU count (or RAGO_NUM_GPUS env var)
    max_ttft_s=None,              # filter: reject schedules with TTFT > this
    min_qps=None,                 # filter: reject schedules with QPS < this
)
```
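The two filters turn the Pareto search into an SLO check; a sketch using only the fields above:

```python
# Only accept schedules meeting an SLO: at most 10 ms TTFT and at least 5 QPS.
cfg = RAGOptimizeConfig(
    paradigm="hyperscale",
    model=LLMModel.LLAMA_3_1_8B,
    max_ttft_s=0.010,
    min_qps=5.0,
)
```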
Key classes
| Class | What it does |
|---|---|
| `RAGOptimize` | Facade: `optimize()`, `recommend_cache()`, `build_controller()` |
| `RAGServeController` | Serving: `process()`, `process_batch()`, `warmup()`, `metrics()`, `reset()` |
| `LLMModel` | Model spec with 17 built-in presets + `custom()` |
| `Query` | `(query_id, text, doc_ids, doc_tokens)` |
| `QueryResult` | `(ttft_s, latency_s, cache_hit, cached_tokens, speculative)` |
| `OptimizeResult` | `(ttft_s, qps_per_chip, cache_hit_rate, pareto_size)` |
| `CacheRecommendation` | `(gpu_gb, host_gb, estimated_hit_rate, reasoning)` |
| `ServeMetrics` | `(hit_rate, avg_ttft_s, gpu_used_mib, host_used_mib, ...)` |
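Serving-side observability in one sketch. Assumes `ctrl` from the Quickstart; that `process_batch()` takes a list of `Query` is our reading of the table above, and the repeated query is there to demonstrate a cache hit:

```python
# Warm the cache, serve a small batch, then read the counters.
ctrl.warmup()
results = ctrl.process_batch([
    Query("q1", "What is attention?", ["d1"], [512]),
    Query("q2", "What is attention?", ["d1"], [512]),  # repeat doc: should hit cache
])
m = ctrl.metrics()
print(f"hit_rate={m.hit_rate:.2f}  avg_ttft={m.avg_ttft_s*1000:.1f} ms")
```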
Exceptions: `RAGOptimizeError` → `ConfigError`, `ScheduleError`, `HardwareError`, `ServeError`.
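Since everything derives from `RAGOptimizeError`, one handler covers all failures. A sketch; we assume the subclasses are importable from `hyperrag`, and an over-tight SLO filter raising `ScheduleError` is our guess at a plausible failure mode:

```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, RAGOptimizeError

try:
    # Impossible SLO on purpose: no schedule can hit 0.1 ms TTFT.
    rago = RAGOptimize(RAGOptimizeConfig(paradigm="hyperscale", max_ttft_s=0.0001))
    rago.optimize()
except RAGOptimizeError as e:  # base class of ConfigError, ScheduleError, ...
    print(f"optimization failed: {e}")
```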
Hardware calibration
For tighter schedule predictions, profile once and pass the result:
```bash
python scripts/profile_hardware.py --output profiles/my_gpu.json
```

```python
RAGOptimizeConfig(model=LLMModel.LLAMA_3_1_8B, hardware_profile="profiles/my_gpu.json")
```
Benchmarks
4× A100-SXM4-40GB, 1000 queries, Zipfian workload, calibrated cost model.
| Paradigm | Baseline | +RAGCache | +RAGO | Speedup |
|---|---|---|---|---|
| Hyperscale 8B | 264.8 ms | 251.6 ms | 243.6 ms | 1.09× |
| Long-context 70B | 30.9 ms | 13.7 ms | 3.4 ms | 9.02× |
| Iterative 70B | 264.8 ms | 251.6 ms | 243.6 ms | 1.09× |
| Rewriter-Reranker 70B | 649.2 ms | 635.9 ms | 339.7 ms | 1.91× |
Tests
```bash
pytest tests/ -m "not gpu" -v   # 100 tests, no GPU needed
pytest tests/ -m gpu -v         # serving tests (requires NVIDIA GPU + vllm)
```
MS-MARCO v2.1 fixture (50 queries) at `tests/fixtures/ms_marco_sample.json`.
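To poke at the fixture directly (sketch; we assume it is a JSON array of 50 query records, so check the file for the exact schema):

```python
import json
from pathlib import Path

records = json.loads(Path("tests/fixtures/ms_marco_sample.json").read_text())
print(f"{len(records)} records")  # 50 queries per the note above
```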
Source layout
```text
src/
├── hyperrag/                SDK (public API)
│   ├── client.py            RAGOptimize facade
│   ├── config.py            RAGOptimizeConfig
│   ├── serve.py             RAGServeController
│   ├── models.py            LLMModel + result types
│   └── exceptions.py
└── engine/                  Engine (RAGO + RAGCache algorithms)
    ├── schema/              RAGSchema workload model
    ├── cost_model/          Roofline / calibrated / adaptive
    ├── knowledge_tree/      Prefix trie for KV cache
    ├── cache/               Multi-tier cache + PGDSF
    ├── request_scheduler/   Cache-aware reorder + speculative pipeline
    ├── scheduler/           Pareto scheduler
    ├── inference/           vLLM backend
    └── serving/             RAGController
```
References
- Wenqi Jiang et al. "RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving." ISCA 2025. arXiv:2503.14649
- Chao Jin et al. "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation." ACM TOCS 2025. arXiv:2404.12457
MIT License