
Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse



ContextPilot: Efficient Long Context Inference with Context Reuse



| Documentation | Examples | Benchmarks |

News

  • [2026/02] ContextPilot v0.3.2 released, supporting PageIndex and Mem0.
  • [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.
  • [2025/12] ContextPilot v0.2.0 released.

About

ContextPilot is a fast optimization system at the context-engineering layer for agentic workloads:

  1. High Throughput & Cache Hit Ratio: Boosting prefill throughput and prefix cache hit ratio with intelligent context reuse.
  2. Strong Compatibility: Works with popular RAG libraries (PageIndex), agentic memory layers (Mem0), KV-cache optimization engines (LMCache), and inference engines (vLLM and SGLang).
  3. Negligible Accuracy Loss: Achieving significant performance improvements with minimal to no accuracy degradation across various benchmarks.
  4. Widely Tested: Tested with a wide range of RAG and Agentic AI applications.

Target Workloads

  1. Trending Topic QA — Search and generation for breaking news and hot topics beyond model knowledge
  2. Closed-Domain Long-Context QA — QA over specialized corpora (novels, financial reports, legal documents) with retrieval or in-context search
  3. Large-Batch Long-Context Execution — High-throughput inference where many requests share overlapping contexts; ContextPilot maximizes prefix reuse regardless of the search method
  4. Multi-Turn Conversations with Long-Term Memory — Persistent context reuse across turns (e.g. Mem0)

Benchmark and Performance

System Performance

Benchmark Results

ContextPilot (Stateless) on DeepSeek-R1 maintains accuracy compared to SGLang, achieving 64.68% vs 64.15% F1 on MultihopRAG and 41.08% vs 40.20% F1 on NarrativeQA.

Accuracy on MT-RAG Benchmark (Online Scheduling)

| Method       | Qwen3-4B | Llama3.1-8B | Qwen3-30B-A3B |
|--------------|----------|-------------|---------------|
| LMCache      | 62.56    | 68.46       | 75.12         |
| CacheBlend   | 50.33    | 56.52       | X             |
| RadixCache   | 62.56    | 68.46       | 75.12         |
| ContextPilot | 64.27    | 68.12       | 75.81         |

ContextPilot delivers 4-13x improvements in cache hit rates and 1.5-3.5x reductions in prefill latency for large-batch RAG workloads, while maintaining or improving accuracy.
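As a back-of-the-envelope illustration of why prefix reuse cuts prefill work (the numbers below are invented for the example, not taken from the benchmarks): when a batch of requests shares a long common prefix after scheduling, that prefix is prefilled once instead of once per request.

```python
# Hypothetical illustration of prefill savings from prefix reuse.
# All numbers are made up for the example, not benchmark results.
batch = 8          # requests in the batch
ctx_tokens = 4000  # prompt tokens per request
shared = 3000      # tokens in the shared prefix after scheduling

no_reuse = batch * ctx_tokens                        # every request prefills its full prompt
with_reuse = shared + batch * (ctx_tokens - shared)  # prefix computed once, suffixes per request

print(no_reuse, with_reuse, round(no_reuse / with_reuse, 2))  # 32000 11000 2.91
```

The longer the shared prefix relative to the full prompt, the closer the savings get to a factor of the batch size, which is the regime where large-batch RAG workloads benefit most.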

Furthermore, in our testing ContextPilot reduced input token costs by around 36% with GPT-5.2.

See Benchmarks in the documentation for GPU vs CPU performance analysis and detailed benchmark methodology.

Getting Started

Installation

Requirements: Python >= 3.10

pip install contextpilot

Or install from source:

git clone https://github.com/Edinburgh-AgenticAI/ContextPilot.git
cd ContextPilot
pip install -e .

More detailed installation instructions are available in the docs.

Quick Start

Offline / Online Stateless — build index & schedule in one shot:

from openai import OpenAI
import contextpilot as cp

client = OpenAI(base_url="http://localhost:30000/v1", api_key="...") # Your inference engine URL and API key

queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
all_contexts = [
    ["Doc about AI", "Doc about ML", "Doc about computing"],
    ["Doc about neural nets", "Doc about deep learning"],
    ["Doc about ML", "Doc about AI", "Doc about deep learning basics"],
]

# Build index and schedule for prefix sharing
index = cp.build_context_index(all_contexts, use_gpu=False)
reordered, _, order, _ = cp.InterContextScheduler().schedule_contexts(index)

# Send in optimized order — shared prefixes hit KV cache
for ctx, orig_idx in zip(reordered, order):
    docs_section = "\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(ctx))
    # Importance ranking restores original retrieval order for the model
    importance_ranking = ">".join(
        str(ctx.index(doc) + 1) for doc in all_contexts[orig_idx] if doc in ctx
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B",
        messages=[
            {"role": "system", "content": (
                f"Answer the question based on the provided documents.\n\n"
                f"<documents>\n{docs_section}\n</documents>\n\n"
                f"Read the documents in this importance ranking: {importance_ranking}\n"
                f"Prioritize information from higher-ranked documents."
            )},
            {"role": "user", "content": queries[orig_idx]},
        ],
    )
    print(f"Q: {queries[orig_idx]}\nA: {response.choices[0].message.content}\n")

For online stateless scheduling via HTTP server, see the online usage guide.
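To build intuition for what the scheduler optimizes, here is a minimal, library-free sketch of prefix-sharing scheduling: order requests so that contexts with overlapping leading documents are adjacent, letting consecutive requests hit the same KV-cache prefix. This is illustrative only; ContextPilot's actual `InterContextScheduler` is more sophisticated, and `shared_prefix_len` and `greedy_schedule` are hypothetical helpers, not part of its API.

```python
def shared_prefix_len(a, b):
    """Number of leading documents two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def greedy_schedule(contexts):
    """Greedily order contexts so consecutive ones share long prefixes."""
    remaining = list(range(len(contexts)))
    order = [remaining.pop(0)]
    while remaining:
        last = contexts[order[-1]]
        # Pick the remaining context sharing the longest prefix with the last scheduled one.
        best = max(remaining, key=lambda i: shared_prefix_len(last, contexts[i]))
        remaining.remove(best)
        order.append(best)
    return order

contexts = [
    ["Doc A", "Doc B", "Doc C"],
    ["Doc X", "Doc Y"],
    ["Doc A", "Doc B", "Doc D"],
]
print(greedy_schedule(contexts))  # [0, 2, 1]: contexts 0 and 2 share ["Doc A", "Doc B"]
```

Sending requests in this order means request 2's prompt begins with the same two documents just prefilled for request 0, so those tokens are served from the prefix cache.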

Stateful — LiveContextIndex tracks cached state:

from openai import OpenAI
import contextpilot as cp

client = OpenAI(base_url="http://localhost:30000/v1", api_key="...")
live = cp.LiveContextIndex(use_gpu=False)

# Simulate multi-turn: each turn has batch_size=1
turns = [
    {
        "query": "What is AI?",
        "contexts": [["Doc about AI", "Doc about ML", "Doc about computing"]],
    },
    {
        "query": "Compare supervised and unsupervised learning",
        # 2 of 3 docs overlap with Turn 1 ("Doc about AI", "Doc about ML"), different order + 1 new doc
        "contexts": [["Doc about ML", "Doc about clustering", "Doc about AI"]],
    },
]

for turn_idx, turn in enumerate(turns):
    contexts = turn["contexts"]
    query = turn["query"]

    # build_incremental handles both cold start and incremental turns
    result = live.build_incremental(contexts)
    reordered = result['reordered_contexts']
    # Turn 2: reordered to ["Doc about AI", "Doc about ML", "Doc about clustering"]
    #                        ^— shared prefix from Turn 1 —^    ^— new doc appended

    ctx = reordered[0]
    docs_section = "\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(ctx))
    importance_ranking = ">".join(
        str(ctx.index(doc) + 1) for doc in contexts[0] if doc in ctx
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B",
        messages=[
            {"role": "system", "content": (
                f"Answer the question based on the provided documents.\n\n"
                f"<documents>\n{docs_section}\n</documents>\n\n"
                f"Read the documents in this importance ranking: {importance_ranking}\n"
                f"Prioritize information from higher-ranked documents."
            )},
            {"role": "user", "content": query},
        ],
    )
    print(f"[Turn {turn_idx+1}] Q: {query}")
    print(f"A: {response.choices[0].message.content}\n")

Note: Stateful mode works without eviction sync — LiveContextIndex tracks the previous ordering and reorders new contexts to maximize prefix-cache hits. For production deployments where limited KV-cache capacity may evict entries, install the SGLang eviction patch to keep the index in sync with the engine's cache. See the online usage guide for HTTP server setup.
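The Turn 2 reordering shown in the example above can be sketched without the library: put documents that appeared in the previous turn first, preserving their previous order, then append unseen documents so the prompt's prefix matches what is already in the KV cache. `reorder_for_prefix` below is a hypothetical helper written for illustration, not part of the ContextPilot API.

```python
def reorder_for_prefix(prev_order, new_docs):
    """Reorder new_docs so previously seen documents come first, in their
    previous order, followed by unseen documents -- maximizing the shared
    prefix with the cached prompt from the last turn."""
    prev_pos = {doc: i for i, doc in enumerate(prev_order)}
    seen = sorted((d for d in new_docs if d in prev_pos), key=lambda d: prev_pos[d])
    new = [d for d in new_docs if d not in prev_pos]
    return seen + new

turn1 = ["Doc about AI", "Doc about ML", "Doc about computing"]
turn2 = ["Doc about ML", "Doc about clustering", "Doc about AI"]
print(reorder_for_prefix(turn1, turn2))
# ['Doc about AI', 'Doc about ML', 'Doc about clustering']
```

The two overlapping documents form the prompt prefix in the same order as Turn 1, so their KV entries are reused; only the new clustering document needs a fresh prefill.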

Documentation

Check out the ContextPilot documentation for comprehensive guides.

Examples

Go hands-on with our examples, demonstrating how to address different use cases with ContextPilot.

Contributing

We welcome and value all contributions! Please feel free to submit issues and pull requests.

Citation

We will include the paper citation soon!
