
HyperRAG

KV cache + Pareto scheduling middleware for RAG pipelines. Plugs in between your vector search and your LLM. Built on two systems papers: RAGO (ISCA'25) and RAGCache (TOCS'25).


The problem

Every RAG request re-processes the same retrieved documents from scratch. At 70B parameters that's ~650 ms of wasted prefill before you see a single output token, repeating compute you already paid for on the previous request.

The fix: cache the transformer's KV state per document. When a document shows up again, load its cache instead of recomputing. TTFT drops roughly in proportion to the hit rate.

This SDK manages that cache and finds the optimal GPU/batch/cache configuration for your workload.
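As a back-of-the-envelope check, expected TTFT under caching is a weighted average of the hit and miss paths. A minimal sketch with illustrative numbers (the ~650 ms figure above and the 0.82 hit rate from the optimizer output below; the residual hit-path cost is an assumption):

# Expected TTFT = hit_rate * hit_path + (1 - hit_rate) * miss_path.
full_prefill_s = 0.650    # prefill from scratch at 70B (estimate above)
cached_prefill_s = 0.050  # assumed residual work on a hit: cache load + attention over new tokens
hit_rate = 0.82           # e.g. the hit rate optimize() reports below

expected_ttft_s = hit_rate * cached_prefill_s + (1 - hit_rate) * full_prefill_s
print(f"expected TTFT ~ {expected_ttft_s * 1000:.0f} ms")  # ~158 ms vs. 650 ms uncached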


Install

pip install hyperrag           # core (schedule optimizer + cache planner)
pip install "hyperrag[gpu]"    # + vLLM for real inference
pip install "hyperrag[all]"    # + dev + eval tooling

Quickstart

quickstart/README.md

Install, configure, optimize, serve. Four steps.


Pipeline Integration

docs/pipeline-integration.md

Drop this into a pipeline you already have (LangChain, LlamaIndex, custom).
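The integration surface is small: map whatever your retriever returns into a Query and hand it to the controller. A sketch under assumptions (my_retriever and its result fields are hypothetical stand-ins for your pipeline; Query and process() are the SDK calls from the quickstart below):

from hyperrag import Query

hits = my_retriever.search("What is transformer attention?", k=2)  # your existing retriever
query = Query(
    "q1",                             # query_id
    "What is transformer attention?",
    [h.doc_id for h in hits],         # doc_ids double as KV cache keys
    [h.num_tokens for h in hits],     # per-document token counts
)
resp = ctrl.process(query)            # ctrl comes from rago.build_controller()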


How it works

  1. optimize() — Pareto search over GPU counts, batch sizes, and cache hit rates. Returns the config that minimizes TTFT (or maximizes QPS) for your workload.
  2. recommend_cache() — Sweeps the GPU/host DRAM split. Tells you how to allocate the cache budget (see the sketch after the code below).
  3. build_controller() — Returns a live serving controller backed by vLLM. Every request goes through cache lookup → speculative pipelining → KV cache admission.

from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="long_context",
    model=LLMModel.LLAMA_3_1_70B,
    gpu_budget_gb=8.0,
    host_budget_gb=32.0,
))

# Pre-flight: find the best schedule before you commit hardware
result = rago.optimize()
print(result.summary())
# TTFT=3.4ms  QPS=12.5  hit_rate=0.82  gpus=4  batch=8

# Production: real vLLM inference with KV caching active
ctrl = rago.build_controller()   # requires NVIDIA GPU + pip install "hyperrag[gpu]"
resp = ctrl.process(Query("q1", "What is transformer attention?", ["d1", "d2"], [512, 256]))
print(f"TTFT={resp.ttft_s*1000:.1f}ms  cache_hit={resp.cache_hit}")

Model presets

from hyperrag import LLMModel

# LLMs
LLMModel.LLAMA_3_1_8B    LLMModel.LLAMA_3_1_70B    LLMModel.LLAMA_3_1_405B
LLMModel.MISTRAL_7B      LLMModel.MISTRAL_NEMO_12B
LLMModel.GEMMA_2_9B      LLMModel.GEMMA_2_27B
LLMModel.QWEN_2_5_72B    LLMModel.DEEPSEEK_R1_70B

# SLMs
LLMModel.LLAMA_3_2_1B    LLMModel.LLAMA_3_2_3B
LLMModel.PHI_3_5_MINI    LLMModel.GEMMA_2_2B
LLMModel.QWEN_2_5_7B     LLMModel.DEEPSEEK_R1_7B

# Custom
LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                num_layers=32, q_heads=32, kv_heads=8, head_dim=128)

All presets: from hyperrag import ALL_MODELS.


RAG paradigms

paradigm=             Default model   Bottleneck           Use when
"hyperscale"          8B              FAISS scan           Standard single-hop RAG
"long_context"        70B             LLM prefill          1M+ token context, no retrieval
"iterative"           70B             FAISS × 4            Multi-hop / agentic retrieval
"rewriter_reranker"   70B             Encoder + rewriter   Query rewrite + cross-encoder rerank

Config reference

RAGOptimizeConfig(
    paradigm="hyperscale",     # see table above
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,         # GPU HBM for KV cache (GB)
    host_budget_gb=16.0,       # host DRAM for KV cache (GB)
    hardware_profile=None,     # path to JSON from scripts/profile_hardware.py
    num_gpus=None,             # override GPU count (or RAGO_NUM_GPUS env)
    max_ttft_s=None,           # filter: reject schedules with TTFT > this
    min_qps=None,              # filter: reject schedules with QPS < this
)

Key classes

Class                 What it does
RAGOptimize           Facade: optimize(), recommend_cache(), build_controller()
RAGServeController    Serving: process(), process_batch(), warmup(), metrics(), reset()
LLMModel              Model spec with 17 built-in presets + custom()
Query                 (query_id, text, doc_ids, doc_tokens)
QueryResult           (ttft_s, latency_s, cache_hit, cached_tokens, speculative)
OptimizeResult        (ttft_s, qps_per_chip, cache_hit_rate, pareto_size)
CacheRecommendation   (gpu_gb, host_gb, estimated_hit_rate, reasoning)
ServeMetrics          (hit_rate, avg_ttft_s, gpu_used_mib, host_used_mib, ...)

Exceptions: RAGOptimizeError, ConfigError, ScheduleError, HardwareError, ServeError.
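A hedged sketch of working with the constraint filters and exceptions; the import path follows the src layout below, and the assumption that ScheduleError is raised when no schedule satisfies the filters is mine, not documented here:

from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, ScheduleError

cfg = RAGOptimizeConfig(
    paradigm="long_context",
    model=LLMModel.LLAMA_3_1_70B,
    max_ttft_s=0.005,  # reject schedules with TTFT above 5 ms
    min_qps=10.0,      # reject schedules below 10 QPS
)
try:
    result = RAGOptimize(cfg).optimize()
except ScheduleError:
    # Assumed failure mode: no Pareto point met both filters; relax one and retry.
    relaxed = RAGOptimizeConfig(paradigm="long_context",
                                model=LLMModel.LLAMA_3_1_70B, min_qps=10.0)
    result = RAGOptimize(relaxed).optimize()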


Hardware calibration

For tighter schedule predictions, profile once and pass the result:

python scripts/profile_hardware.py --output profiles/my_gpu.json

RAGOptimizeConfig(model=LLMModel.LLAMA_3_1_8B, hardware_profile="profiles/my_gpu.json")

Benchmarks

4× A100-SXM4-40GB, 1000 queries, Zipfian workload, calibrated cost model.

Paradigm                Baseline   +RAGCache   +RAGO      Speedup
Hyperscale 8B           264.8 ms   251.6 ms    243.6 ms   1.09×
Long-context 70B        30.9 ms    13.7 ms     3.4 ms     9.02×
Iterative 70B           264.8 ms   251.6 ms    243.6 ms   1.09×
Rewriter-Reranker 70B   649.2 ms   635.9 ms    339.7 ms   1.91×

Tests

pytest tests/ -m "not gpu" -v    # 100 tests, no GPU needed
pytest tests/ -m gpu -v          # serving tests (requires NVIDIA GPU + vllm)

MS-MARCO v2.1 fixture (50 queries) at tests/fixtures/ms_marco_sample.json.


Source layout

src/
├── hyperrag/        SDK (public API)
│   ├── client.py            RAGOptimize facade
│   ├── config.py            RAGOptimizeConfig
│   ├── serve.py             RAGServeController
│   ├── models.py            LLMModel + result types
│   └── exceptions.py
└── engine/                  Engine (RAGO + RAGCache algorithms)
    ├── schema/              RAGSchema workload model
    ├── cost_model/          Roofline / calibrated / adaptive
    ├── knowledge_tree/      Prefix trie for KV cache
    ├── cache/               Multi-tier cache + PGDSF
    ├── request_scheduler/   Cache-aware reorder + speculative pipeline
    ├── scheduler/           Pareto scheduler
    ├── inference/           vLLM backend
    └── serving/             RAGController

References

  1. Wenqi Jiang et al. "RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving." ISCA 2025. arXiv:2503.14649
  2. Chao Jin et al. "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation." ACM TOCS 2025. arXiv:2404.12457

MIT License
