# context-bench
Benchmark any system that transforms LLM context.
Prompt compressors, memory managers, context stuffers, RAG rerankers — if it touches the context window before an LLM sees it, context-bench measures how well it works and what it costs.
## Why context-bench?
You built (or bought) something that modifies LLM context. Now you need to answer:
- Does compression destroy information? Measure quality with F1, exact match, and pass rate against ground-truth QA datasets.
- Is the cost worth it? Track compression ratio and cost-per-successful-completion side by side.
- Which approach wins? Run multiple systems on the same dataset in one call and get a comparison table.
context-bench gives you a single `evaluate()` call that runs your system against a dataset, scores every example, and aggregates the results — no boilerplate, no framework lock-in.
## Quick start

```bash
uv sync
```
Benchmark Headroom with a single `evaluate()` call:
```python
from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import MeanScore, PassRate

# headroom proxy --port 8787
headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")

result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[your_evaluator],
    metrics=[MeanScore(), PassRate()],
    text_fields=["response"],  # count only the proxy output tokens
)
print(result.summary)
```
## How it works

```mermaid
flowchart LR
    D[Dataset\ndicts] --> S[System\n.process]
    S --> E[Evaluator\n.score]
    E --> M[Metric\n.compute]
    S -. output dict .-> S
    E -. scores dict .-> E
    M -. summary dict .-> M
```
- **Dataset** — any `Iterable[dict]`. Must have `"id"` and `"context"` keys.
- **System** — implements `.name` and `.process(example) -> dict`. This is the thing you're benchmarking.
- **Evaluator** — implements `.name` and `.score(original, processed) -> dict[str, float]`. Compares before/after.
- **Metric** — implements `.name` and `.compute(rows) -> dict[str, float]`. Aggregates scores across examples.

All interfaces are `typing.Protocol` — implement the methods, don't subclass anything.
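Because the interfaces are structural, any plain class with the right attributes satisfies them. A minimal sketch — these toy classes are illustrations of the shapes above, not part of context-bench:

```python
class UpperCaseSystem:
    """Toy system: upper-cases the context field."""
    name = "uppercase"

    def process(self, example: dict) -> dict:
        return {**example, "context": example["context"].upper()}


class LengthEvaluator:
    """Toy evaluator: ratio of processed to original context length."""
    name = "length_ratio"

    def score(self, original: dict, processed: dict) -> dict[str, float]:
        return {"score": len(processed["context"]) / max(len(original["context"]), 1)}


class MaxScore:
    """Toy metric: maximum 'score' across all rows."""
    name = "max_score"

    def compute(self, rows: list[dict]) -> dict[str, float]:
        return {"max_score": max(r["score"] for r in rows)}


example = {"id": "1", "context": "hello world"}
scores = LengthEvaluator().score(example, UpperCaseSystem().process(example))
print(scores)  # {'score': 1.0}
```

No base class, no registration — the runner only needs the attributes to exist.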
## Benchmark a proxy
The built-in OpenAIProxy system wraps any OpenAI-compatible endpoint. Point at a URL, get quality and cost metrics back — no HTTP boilerplate needed.
### Headroom

```bash
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
```
```python
from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CostOfPass, MeanScore, PassRate
from context_bench.metrics.quality import f1_score

headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")

class QAEvaluator:
    name = "qa_f1"

    def score(self, original, processed):
        return {"score": f1_score(processed.get("response", ""),
                                  original.get("answer", ""))}

result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)
```
### Compresr
Compresr uses a Python SDK instead of a proxy, so wrap it in a custom system:
```python
from compresr import CompressionClient

class CompresrSystem:
    name = "compresr"

    def __init__(self, api_key):
        self.client = CompressionClient(api_key=api_key)

    def process(self, example):
        compressed = self.client.generate(
            context=example["context"],
            question=example.get("question", ""),
        )
        return {**example, "context": compressed}
```
### Compare Headroom vs Compresr
```python
result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)
```
### Any OpenAI-compatible endpoint
```python
OpenAIProxy(
    base_url="http://localhost:8080",
    model="gpt-4",
    api_key="sk-...",                # or set OPENAI_API_KEY env var
    system_prompt="Be concise.",     # prepended as system message
    extra_body={"temperature": 0},   # any additional request params
)
```
`text_fields=["response"]` — By default the runner counts all string fields for token stats, which would double-count context in output tokens. Pass `text_fields=["response"]` so only the proxy's actual output is measured.
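To see why the double counting matters, here is a toy illustration. A naive whitespace splitter stands in for the real tokenizer (tiktoken's cl100k_base by default), and the field-summing logic is a sketch of the idea, not the runner's actual code:

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())

row = {
    "id": "1",
    "context": "a long passage that was fed into the model " * 3,
    "response": "short answer",
}

# Counting every string field folds the (large) input context
# into the output-token total...
all_fields = sum(count_tokens(v) for v in row.values() if isinstance(v, str))

# ...whereas restricting to "response" measures only what the proxy produced.
response_only = count_tokens(row["response"])

print(all_fields, response_only)  # 30 2
```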
## Compare systems head-to-head
```python
from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CompressionRatio, CostOfPass, MeanScore, PassRate
from context_bench.reporters.markdown import to_markdown

result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
        OpenAIProxy("http://localhost:8080", model="gpt-4", name="baseline_gpt4"),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CompressionRatio(), CostOfPass()],
    text_fields=["response"],
)
print(to_markdown(result))
```
Output:
```
# Evaluation Results

| System        | mean_score | pass_rate | compression_ratio | cost_of_pass |
|---------------|------------|-----------|-------------------|--------------|
| headroom      | 0.9200     | 0.9000    | 0.8760            | 145.4118     |
| compresr      | 0.8800     | 0.8500    | 0.7200            | 185.5556     |
| baseline_gpt4 | 0.9500     | 0.9500    | 0.0000            | 258.0000     |
```
## Export results

```python
result.to_json()                  # JSON string
result.to_dataframe()             # pandas DataFrame (requires pandas)
result.filter(system="headroom")  # filter to one system
```
## Built-in datasets

| Dataset | Domain | Loader | Install |
|---|---|---|---|
| HotpotQA | Multi-hop QA | `datasets.huggingface.hotpotqa()` | `pip install -e ".[datasets]"` |
| GSM8K | Math reasoning | `datasets.huggingface.gsm8k()` | `pip install -e ".[datasets]"` |
| BFCL v3 | Function calling | `datasets.huggingface.bfcl_simple()` | `pip install -e ".[datasets]"` |
| APIGen | Multi-turn tool use | `datasets.agent_traces.apigen_mt()` | `pip install -e ".[datasets]"` |
| SWE-agent | Coding agent traces | `datasets.agent_traces.swe_agent_traces()` | `pip install -e ".[datasets]"` |
| Local JSONL | Any | `datasets.local.load_jsonl(path)` | Core |
Or bring your own — any `list[dict]` with `"id"` and `"context"` keys works.
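For example, a hand-rolled QA dataset. Only `"id"` and `"context"` are required by the runner; the `question`/`answer` keys here are whatever your own evaluator expects:

```python
# Any iterable of dicts works as a dataset. Extra keys pass through
# to the system and evaluator untouched.
dataset = [
    {"id": "ex-1", "context": "Paris is the capital of France.",
     "question": "What is the capital of France?", "answer": "Paris"},
    {"id": "ex-2", "context": "Water boils at 100 C at sea level.",
     "question": "At what temperature does water boil?", "answer": "100 C"},
]

# Sanity-check that every row carries the required keys.
assert all({"id", "context"} <= row.keys() for row in dataset)
```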
## Built-in metrics

| Metric | What it measures |
|---|---|
| `MeanScore` | Average score across all examples |
| `PassRate(threshold)` | Fraction of examples scoring above threshold |
| `CompressionRatio` | `1 - (output_tokens / input_tokens)` |
| `CostOfPass(threshold)` | Tokens spent per successful completion (arXiv:2504.13359) |
| `ParetoRank` | Rank on the quality-vs-cost Pareto frontier |
| `f1_score`, `exact_match`, `recall_score` | SQuAD-standard text comparison utilities |
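A back-of-the-envelope check of the two token-based metrics on made-up rows. This mirrors the formulas stated in the table, not the library's internals — in particular, summing tokens across rows before dividing is an assumption about the aggregation:

```python
rows = [
    {"score": 0.9, "input_tokens": 1000, "output_tokens": 120},
    {"score": 0.4, "input_tokens": 800,  "output_tokens": 100},
]

# CompressionRatio: 1 - (output_tokens / input_tokens)
total_in = sum(r["input_tokens"] for r in rows)
total_out = sum(r["output_tokens"] for r in rows)
compression_ratio = 1 - total_out / total_in

# CostOfPass(threshold=0.5): tokens spent per example clearing the threshold
passed = sum(1 for r in rows if r["score"] >= 0.5)
cost_of_pass = (total_in + total_out) / passed

print(round(compression_ratio, 3), cost_of_pass)  # 0.878 2020.0
```

A system that compresses well but fails often can still lose on `cost_of_pass`, since failed examples burn tokens without adding to the denominator.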
## Installation

```bash
# Core (just tiktoken)
uv sync

# With HuggingFace dataset loaders
uv sync --extra datasets

# Everything
uv sync --all-extras

# Development
uv sync --group dev
```
Requires Python 3.10+ and uv.
## Running tests

```bash
uv run pytest
```
## Project structure

```
src/context_bench/
├── __init__.py       # Public API: evaluate, EvalResult, EvalRow, OpenAIProxy
├── types.py          # Protocol definitions (System, Evaluator, Metric)
├── runner.py         # Core evaluate() orchestration
├── results.py        # EvalRow / EvalResult dataclasses
├── registry.py       # Plugin system for named components
├── systems/          # Built-in systems (OpenAIProxy)
├── datasets/         # Built-in dataset loaders
├── metrics/          # MeanScore, PassRate, CompressionRatio, CostOfPass, ParetoRank
├── reporters/        # Markdown and JSON output formatters
└── utils/tokens.py   # Pluggable tokenizer (default: tiktoken cl100k_base)
```
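A sketch of what "pluggable tokenizer" can mean here. This is a hypothetical shape, not `utils/tokens.py`'s actual API; a whitespace splitter stands in for tiktoken so the snippet has no dependencies:

```python
from typing import Protocol

class Tokenizer(Protocol):
    """Anything with an encode() returning a token list qualifies.

    Real tokenizers (e.g. tiktoken) return token ids rather than
    strings; the protocol only cares that len() is meaningful.
    """
    def encode(self, text: str) -> list: ...

class WhitespaceTokenizer:
    """Dependency-free stand-in for tiktoken's cl100k_base."""
    def encode(self, text: str) -> list:
        return text.split()

def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    return len(tokenizer.encode(text))

print(count_tokens("benchmark any context system", WhitespaceTokenizer()))  # 4
```

Swapping in a different tokenizer only requires matching the `encode()` shape, consistent with the project's protocol-based design.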
## CI/CD

This project uses GitHub Actions for continuous integration:

```yaml
# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv python install ${{ matrix.python-version }}
      - run: uv sync --group dev
      - run: uv run pytest
```
## License