Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
| Documentation | Examples | Benchmarks |
News
- [2026/02] ContextPilot v0.3.2 released, supporting PageIndex and Mem0.
- [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.
- [2025/12] ContextPilot v0.2.0 released.
About
ContextPilot is a fast optimization system at the context-engineering layer for agentic workloads:
- High Throughput & Cache Hit Ratio: Boosting prefill throughput and prefix cache hit ratio with intelligent context reuse.
- Strong Compatibility: Works with popular RAG libraries (PageIndex), agentic memory layers (Mem0), KV-cache optimization engines (LMCache), and inference engines (vLLM and SGLang).
- Negligible Accuracy Loss: Achieving significant performance improvements with minimal to no accuracy degradation across various benchmarks.
- Widely Tested: Tested with a wide range of RAG and Agentic AI applications.
Target Workloads
- Trending Topic QA — Search and generation for breaking news and hot topics beyond model knowledge
- Closed-Domain Long-Context QA — QA over specialized corpora (novels, financial reports, legal documents) with retrieval or in-context search
- Large-Batch Long-Context Execution — High-throughput inference where many requests share overlapping contexts; ContextPilot maximizes prefix reuse regardless of the search method
- Multi-Turn Conversations with Long-Term Memory — Persistent context reuse across turns (e.g. Mem0)
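To build intuition for why these workloads benefit from context reuse, the sketch below (hypothetical helper, not part of the ContextPilot API) counts how many leading documents two contexts share. Only that leading run can hit the prefix KV cache, so moving overlapping documents to the front of each context increases reuse:

```python
def shared_prefix_len(a, b):
    """Number of leading items two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two requests share doc "B"; in retrieval order it sits at different
# positions, so nothing is reusable across them.
naive_1 = ["A", "B"]
naive_2 = ["C", "B"]

# Moving the shared doc to the prefix makes one document's KV cache reusable.
reordered_1 = ["B", "A"]
reordered_2 = ["B", "C"]

print(shared_prefix_len(naive_1, naive_2))          # 0
print(shared_prefix_len(reordered_1, reordered_2))  # 1
```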
Benchmark and Performance
System Performance
ContextPilot (Stateless) on DeepSeek-R1 maintains accuracy compared to SGLang, achieving 64.68% vs 64.15% F1 on MultihopRAG and 41.08% vs 40.20% F1 on NarrativeQA.
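For context, the F1 numbers above follow the token-overlap metric commonly used in QA evaluation. The sketch below shows the standard formulation; the exact text normalization used in these benchmarks may differ:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of overlapping tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the transformer model", "a transformer model"))  # 0.666...
```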
Accuracy on MT-RAG Benchmark (Online Scheduling)
| Method | Qwen3-4B | Llama3.1-8B | Qwen3-30B-A3B |
|---|---|---|---|
| LMCache | 62.56 | 68.46 | 75.12 |
| CacheBlend | 50.33 | 56.52 | X |
| RadixCache | 62.56 | 68.46 | 75.12 |
| ContextPilot | 64.27 | 68.12 | 75.81 |
ContextPilot delivers 4-13x improvements in cache hit rates and 1.5-3.5x reductions in prefill latency for large-batch RAG workloads, while maintaining or improving accuracy.
Furthermore, ContextPilot has been tested to reduce input token costs by around 36% with GPT-5.2.
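A quick back-of-envelope calculation shows what a ~36% input-token reduction means for spend. The token volume and per-token price below are illustrative assumptions, not measured figures:

```python
tokens_in = 1_000_000   # assumed input tokens per day without reuse
price_per_mtok = 1.25   # assumed $ per 1M input tokens
saving = 0.36           # reported input-token reduction

cost_before = tokens_in / 1e6 * price_per_mtok
cost_after = cost_before * (1 - saving)
print(f"${cost_before:.2f} -> ${cost_after:.2f} per day")  # $1.25 -> $0.80
```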
See Benchmarks in the documentation for GPU vs CPU performance analysis and detailed benchmark methodology.
Getting Started
Installation
Requirements: Python >= 3.10
pip install contextpilot
Or install from source:
git clone https://github.com/Edinburgh-AgenticAI/ContextPilot.git
cd ContextPilot
pip install -e .
More detailed installation instructions are available in the docs.
Quick Start
Stateful — ContextPilot tracks cached state across turns so
overlapping documents are moved to the prefix for KV-cache reuse:
from openai import OpenAI
import contextpilot as cp
client = OpenAI(base_url="http://localhost:30000/v1", api_key="...")
cp_live = cp.ContextPilot(use_gpu=False)
# Simulated per-turn memory search (e.g. from mem0)
# Each turn retrieves different but partially overlapping documents
turn_memories = [
    ["Transformers use self-attention", "GPT is based on transformers", "BERT is bidirectional"],
    ["RNNs use hidden states", "GPT is based on transformers", "LSTMs solve vanishing gradients"],
    ["Attention computes QKV", "Transformers use self-attention", "GPT is based on transformers"],
]
queries = ["What are transformers?", "How do RNNs compare?", "Explain attention in detail."]
for turn_idx, (query, mems) in enumerate(zip(queries, turn_memories)):
    # 1. Reorder for prefix sharing (handles cold start & incremental)
    # .reorder() accepts a single list or list-of-lists
    reordered, indices = cp_live.reorder(mems)
    ctx = reordered[0]  # single context per turn
    # Turn 2: "GPT is based on transformers" ← moved to prefix (shared with turn 1)
    # Turn 3: "Transformers …", "GPT …" ← both moved to prefix

    # 2. Generate answer with the reordered context
    docs_section = "\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(ctx))
    importance_ranking = ">".join(
        str(ctx.index(doc) + 1) for doc in mems if doc in ctx
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B",
        messages=[
            {"role": "system", "content": (
                f"Answer the question based on the provided documents.\n\n"
                f"<documents>\n{docs_section}\n</documents>\n\n"
                f"Read the documents in this importance ranking: {importance_ranking}\n"
                f"Prioritize information from higher-ranked documents."
            )},
            {"role": "user", "content": query},
        ],
    )
    print(f"[Turn {turn_idx+1}] Q: {query}")
    print(f"A: {response.choices[0].message.content}\n")
Note: Stateful mode works without eviction sync — ContextPilot tracks the previous ordering and reorders new contexts to maximize prefix cache hits. For production deployments with limited KV-cache capacity, install the SGLang eviction patch to keep the index in sync. See the online usage guide for HTTP server setup.
Offline / Online Stateless — same API, just pass the full batch at once:
from openai import OpenAI
import contextpilot as cp
client = OpenAI(base_url="http://localhost:30000/v1", api_key="...") # Your inference engine URL and API key
cp_batch = cp.ContextPilot(use_gpu=False)
queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
all_contexts = [
    ["Doc about AI", "Doc about ML", "Doc about computing"],
    ["Doc about neural nets", "Doc about deep learning"],
    ["Doc about ML", "Doc about AI", "Doc about deep learning basics"],
]
# One call: builds index, reorders docs for prefix sharing, and schedules execution order
# .reorder() returns (reordered_contexts, original_indices)
reordered_ctx, order = cp_batch.reorder(all_contexts)
# Build all prompts in optimized order
messages_batch = []
for ctx, orig_idx in zip(reordered_ctx, order):
    docs_section = "\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(ctx))
    importance_ranking = ">".join(
        str(ctx.index(doc) + 1) for doc in all_contexts[orig_idx] if doc in ctx
    )
    messages_batch.append({
        "model": "Qwen/Qwen3-4B",
        "messages": [
            {"role": "system", "content": (
                f"Answer the question based on the provided documents.\n\n"
                f"<documents>\n{docs_section}\n</documents>\n\n"
                f"Read the documents in this importance ranking: {importance_ranking}\n"
                f"Prioritize information from higher-ranked documents."
            )},
            {"role": "user", "content": queries[orig_idx]},
        ],
    })
# Send concurrently — inference engine processes them in order for max cache reuse
import asyncio, openai
async def generate_all(batch):
    aclient = openai.AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="...")
    tasks = [aclient.chat.completions.create(**req) for req in batch]
    return await asyncio.gather(*tasks)
responses = asyncio.run(generate_all(messages_batch))
for resp, orig_idx in zip(responses, order):
    print(f"Q: {queries[orig_idx]}\nA: {resp.choices[0].message.content}\n")
For online stateless scheduling via HTTP server, see the online usage guide.
Documentation
Check out the ContextPilot documentation for comprehensive guides.
Examples
Go hands-on with our examples, demonstrating how to address different use cases with ContextPilot.
Contributing
We welcome and value all contributions! Please feel free to submit issues and pull requests.
Citation
We will include the paper citation soon!