
context-bench


Benchmark any system that transforms LLM context.

Prompt compressors, memory managers, context stuffers, RAG rerankers — if it touches the context window before an LLM sees it, context-bench measures how well it works and what it costs.


Why context-bench?

You built (or bought) something that modifies LLM context. Now you need to answer:

  • Does compression destroy information? Measure quality with F1, exact match, and pass rate against ground-truth QA datasets.
  • Is the cost worth it? Track compression ratio and cost-per-successful-completion side by side.
  • Which approach wins? Run multiple systems on the same dataset in one call and get a comparison table.

context-bench gives you a single evaluate() call that runs your system against a dataset, scores every example, and aggregates the results — no boilerplate, no framework lock-in.

Quick start

uv sync

Benchmark Headroom in a few lines:

from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import MeanScore, PassRate

# headroom proxy --port 8787
headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")
result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[your_evaluator],
    metrics=[MeanScore(), PassRate()],
    text_fields=["response"],   # count only the proxy output tokens
)
print(result.summary)

How it works

flowchart LR
    D[Dataset\ndicts] --> S[System\n.process]
    S --> E[Evaluator\n.score]
    E --> M[Metric\n.compute]
    S -. output dict .-> S
    E -. scores dict .-> E
    M -. summary dict .-> M
  1. Dataset — any Iterable[dict]. Must have "id" and "context" keys.
  2. System — implements .name and .process(example) -> dict. This is the thing you're benchmarking.
  3. Evaluator — implements .name and .score(original, processed) -> dict[str, float]. Compares before/after.
  4. Metric — implements .name and .compute(rows) -> dict[str, float]. Aggregates scores across examples.

All interfaces are typing.Protocol — implement the methods, don't subclass anything.
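As a sketch, here is a trivial system that satisfies the System protocol without subclassing anything. The class name and its truncation behavior are hypothetical, purely for illustration; any object with a `name` attribute and a `process(example) -> dict` method will do:

```python
class TruncateSystem:
    """Hypothetical context transformer: cut "context" to a character budget."""
    name = "truncate_2k"

    def __init__(self, max_chars: int = 2000):
        self.max_chars = max_chars

    def process(self, example: dict) -> dict:
        # Return a new dict: keep every field, shorten only "context".
        return {**example, "context": example["context"][:self.max_chars]}
```

Because the interfaces are structural (`typing.Protocol`), this class is a valid `systems=` entry with no import from context_bench at all.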

Benchmark a proxy

The built-in OpenAIProxy system wraps any OpenAI-compatible endpoint. Point at a URL, get quality and cost metrics back — no HTTP boilerplate needed.

Headroom

pip install "headroom-ai[proxy]"
headroom proxy --port 8787
from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CostOfPass, MeanScore, PassRate
from context_bench.metrics.quality import f1_score

headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")

class QAEvaluator:
    name = "qa_f1"
    def score(self, original, processed):
        return {"score": f1_score(processed.get("response", ""),
                                   original.get("answer", ""))}

result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)

Compresr

Compresr uses a Python SDK instead of a proxy, so wrap it in a custom system:

from compresr import CompressionClient

class CompresrSystem:
    name = "compresr"

    def __init__(self, api_key):
        self.client = CompressionClient(api_key=api_key)

    def process(self, example):
        compressed = self.client.generate(
            context=example["context"],
            question=example.get("question", ""),
        )
        return {**example, "context": compressed}

Compare Headroom vs Compresr

result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)

Any OpenAI-compatible endpoint

OpenAIProxy(
    base_url="http://localhost:8080",
    model="gpt-4",
    api_key="sk-...",              # or set OPENAI_API_KEY env var
    system_prompt="Be concise.",   # prepended as system message
    extra_body={"temperature": 0}, # any additional request params
)

text_fields=["response"] — By default the runner counts all string fields for token stats, which would double-count context in output tokens. Pass text_fields=["response"] so only the proxy's actual output is measured.
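To see why the naive count is misleading, here is a toy illustration (whitespace-split tokens stand in for the real tokenizer): summing tokens over every string field counts the echoed context as if it were output.

```python
example = {
    "id": "1",
    "context": "a very long passage " * 50,  # 200 words of input context
    "response": "short answer",              # the proxy's actual output
}

# Naive: every string field counts toward "output" tokens.
all_string_fields = sum(len(v.split()) for v in example.values() if isinstance(v, str))
# With text_fields=["response"]: only the output is measured.
response_only = len(example["response"].split())

print(all_string_fields, response_only)  # the context dominates the naive count
```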

Compare systems head-to-head

from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CompressionRatio, CostOfPass, MeanScore, PassRate
from context_bench.reporters.markdown import to_markdown

result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
        OpenAIProxy("http://localhost:8080", model="gpt-4", name="baseline_gpt4"),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CompressionRatio(), CostOfPass()],
    text_fields=["response"],
)

print(to_markdown(result))

Output:

# Evaluation Results

| System        | mean_score | pass_rate | compression_ratio | cost_of_pass |
|---------------|------------|-----------|-------------------|--------------|
| headroom      | 0.9200     | 0.9000    | 0.8760            | 145.4118     |
| compresr      | 0.8800     | 0.8500    | 0.7200            | 185.5556     |
| baseline_gpt4 | 0.9500     | 0.9500    | 0.0000            | 258.0000     |

Export results

result.to_json()          # JSON string
result.to_dataframe()     # pandas DataFrame (requires pandas)
result.filter(system="headroom")  # filter to one system

Built-in datasets

| Dataset     | Domain              | Loader                                    | Install                       |
|-------------|---------------------|-------------------------------------------|-------------------------------|
| HotpotQA    | Multi-hop QA        | datasets.huggingface.hotpotqa()           | pip install -e ".[datasets]"  |
| GSM8K       | Math reasoning      | datasets.huggingface.gsm8k()              | pip install -e ".[datasets]"  |
| BFCL v3     | Function calling    | datasets.huggingface.bfcl_simple()        | pip install -e ".[datasets]"  |
| APIGen      | Multi-turn tool use | datasets.agent_traces.apigen_mt()         | pip install -e ".[datasets]"  |
| SWE-agent   | Coding agent traces | datasets.agent_traces.swe_agent_traces()  | pip install -e ".[datasets]"  |
| Local JSONL | Any                 | datasets.local.load_jsonl(path)           | Core                          |

Or bring your own — any list[dict] with "id" and "context" keys works.
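A hand-rolled dataset is just such a list; this toy example is illustrative (the question/answer fields are conventions for your own evaluator, not required keys):

```python
dataset = [
    {
        "id": "ex-1",
        "context": "The Eiffel Tower is located in Paris, France.",
        "question": "Where is the Eiffel Tower?",
        "answer": "Paris",
    },
    {
        "id": "ex-2",
        "context": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil at sea level?",
        "answer": "100 degrees Celsius",
    },
]
```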

Built-in metrics

| Metric                              | What it measures                                          |
|-------------------------------------|-----------------------------------------------------------|
| MeanScore                           | Average score across all examples                         |
| PassRate(threshold)                 | Fraction of examples scoring above threshold              |
| CompressionRatio                    | 1 - (output_tokens / input_tokens)                        |
| CostOfPass(threshold)               | Tokens spent per successful completion (arXiv:2504.13359) |
| ParetoRank                          | Rank on the quality-vs-cost Pareto frontier               |
| f1_score, exact_match, recall_score | SQuAD-standard text comparison utilities                  |
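For intuition, SQuAD-style F1 is token-overlap F1 between prediction and ground truth. A minimal reimplementation (a sketch for explanation, not the library's code, which also applies SQuAD's answer normalization):

```python
from collections import Counter

def f1_token_overlap(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

So a verbose but correct answer is penalized on precision: predicting "the city of paris" against gold "paris" gives recall 1.0 but precision 0.25.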

Installation

# Core (just tiktoken)
uv sync

# With HuggingFace dataset loaders
uv sync --extra datasets

# Everything
uv sync --all-extras

# Development
uv sync --group dev

Requires Python 3.10+ and uv.

Running tests

uv run pytest

Project structure

src/context_bench/
├── __init__.py          # Public API: evaluate, EvalResult, EvalRow, OpenAIProxy
├── types.py             # Protocol definitions (System, Evaluator, Metric)
├── runner.py            # Core evaluate() orchestration
├── results.py           # EvalRow / EvalResult dataclasses
├── registry.py          # Plugin system for named components
├── systems/             # Built-in systems (OpenAIProxy)
├── datasets/            # Built-in dataset loaders
├── metrics/             # MeanScore, PassRate, CompressionRatio, CostOfPass, ParetoRank
├── reporters/           # Markdown and JSON output formatters
└── utils/tokens.py      # Pluggable tokenizer (default: tiktoken cl100k_base)
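Since the tokenizer is pluggable with tiktoken's cl100k_base as the default, a standalone counter might look like this (an illustration of the idea, not the utils/tokens.py API; the whitespace fallback is an assumption for environments without tiktoken):

```python
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        # Real BPE token count, matching the default tokenizer.
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Crude fallback: whitespace-delimited words as a token proxy.
        return len(text.split())
```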

CI/CD

This project uses GitHub Actions for continuous integration:

# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv python install ${{ matrix.python-version }}
      - run: uv sync --group dev
      - run: uv run pytest

License

MIT
