
context-bench


Benchmark any system that transforms LLM context.

Prompt compressors, memory managers, context stuffers, RAG rerankers — if it touches the context window before an LLM sees it, context-bench measures how well it works and what it costs.


Why context-bench?

You built (or bought) something that modifies LLM context. Now you need to answer:

  • Does compression destroy information? Measure quality with F1, exact match, and pass rate against ground-truth QA datasets.
  • Is the cost worth it? Track compression ratio and cost-per-successful-completion side by side.
  • Which approach wins? Run multiple systems on the same dataset in one call and get a comparison table.

context-bench gives you a single evaluate() call that runs your system against a dataset, scores every example, and aggregates the results — no boilerplate, no framework lock-in.

Quick start

uv sync

Benchmark Headroom in a few lines:

from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import MeanScore, PassRate

# headroom proxy --port 8787
headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")
result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[your_evaluator],
    metrics=[MeanScore(), PassRate()],
    text_fields=["response"],   # count only the proxy output tokens
)
print(result.summary)

How it works

flowchart LR
    D[Dataset\ndicts] --> S[System\n.process]
    S --> E[Evaluator\n.score]
    E --> M[Metric\n.compute]
    S -. output dict .-> S
    E -. scores dict .-> E
    M -. summary dict .-> M
  1. Dataset — any Iterable[dict]. Must have "id" and "context" keys.
  2. System — implements .name and .process(example) -> dict. This is the thing you're benchmarking.
  3. Evaluator — implements .name and .score(original, processed) -> dict[str, float]. Compares before/after.
  4. Metric — implements .name and .compute(rows) -> dict[str, float]. Aggregates scores across examples.

All interfaces are typing.Protocol — implement the methods, don't subclass anything.
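As a sketch, here is a trivial system that satisfies the System protocol without subclassing anything. The class name and its truncation behavior are hypothetical, purely for illustration; any object with a `name` attribute and a `process(example) -> dict` method will do:

```python
class TruncateSystem:
    """Hypothetical context transformer: cut "context" to a character budget."""
    name = "truncate_2k"

    def __init__(self, max_chars: int = 2000):
        self.max_chars = max_chars

    def process(self, example: dict) -> dict:
        # Return a new dict: keep every field, shorten only "context".
        return {**example, "context": example["context"][:self.max_chars]}
```

Because the interfaces are structural (`typing.Protocol`), this class is a valid `systems=` entry with no import from context_bench at all.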

Benchmark a proxy

The built-in OpenAIProxy system wraps any OpenAI-compatible endpoint. Point at a URL, get quality and cost metrics back — no HTTP boilerplate needed.

Headroom

pip install "headroom-ai[proxy]"
headroom proxy --port 8787
from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CostOfPass, MeanScore, PassRate
from context_bench.metrics.quality import f1_score

headroom = OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom")

class QAEvaluator:
    name = "qa_f1"
    def score(self, original, processed):
        return {"score": f1_score(processed.get("response", ""),
                                   original.get("answer", ""))}

result = evaluate(
    systems=[headroom],
    dataset=your_dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)

Compresr

Compresr uses a Python SDK instead of a proxy, so wrap it in a custom system:

from compresr import CompressionClient

class CompresrSystem:
    name = "compresr"

    def __init__(self, api_key):
        self.client = CompressionClient(api_key=api_key)

    def process(self, example):
        compressed = self.client.generate(
            context=example["context"],
            question=example.get("question", ""),
        )
        return {**example, "context": compressed}

Compare Headroom vs Compresr

result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CostOfPass()],
    text_fields=["response"],
)

Any OpenAI-compatible endpoint

OpenAIProxy(
    base_url="http://localhost:8080",
    model="gpt-4",
    api_key="sk-...",              # or set OPENAI_API_KEY env var
    system_prompt="Be concise.",   # prepended as system message
    extra_body={"temperature": 0}, # any additional request params
)

text_fields=["response"] — By default the runner counts all string fields for token stats, which would double-count context in output tokens. Pass text_fields=["response"] so only the proxy's actual output is measured.
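To see why the naive count is misleading, here is a toy illustration (whitespace-split tokens stand in for the real tokenizer): summing tokens over every string field counts the echoed context as if it were output.

```python
example = {
    "id": "1",
    "context": "a very long passage " * 50,  # 200 words of input context
    "response": "short answer",              # the proxy's actual output
}

# Naive: every string field counts toward "output" tokens.
all_string_fields = sum(len(v.split()) for v in example.values() if isinstance(v, str))
# With text_fields=["response"]: only the output is measured.
response_only = len(example["response"].split())

print(all_string_fields, response_only)  # the context dominates the naive count
```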

Compare systems head-to-head

from context_bench import OpenAIProxy, evaluate
from context_bench.metrics import CompressionRatio, CostOfPass, MeanScore, PassRate
from context_bench.reporters.markdown import to_markdown

result = evaluate(
    systems=[
        OpenAIProxy("http://localhost:8787", model="claude-sonnet-4-5-20250929", name="headroom"),
        CompresrSystem(api_key="..."),
        OpenAIProxy("http://localhost:8080", model="gpt-4", name="baseline_gpt4"),
    ],
    dataset=dataset,
    evaluators=[QAEvaluator()],
    metrics=[MeanScore(), PassRate(), CompressionRatio(), CostOfPass()],
    text_fields=["response"],
)

print(to_markdown(result))

Output:

# Evaluation Results

| System        | mean_score | pass_rate | compression_ratio | cost_of_pass |
|---------------|------------|-----------|-------------------|--------------|
| headroom      | 0.9200     | 0.9000    | 0.8760            | 145.4118     |
| compresr      | 0.8800     | 0.8500    | 0.7200            | 185.5556     |
| baseline_gpt4 | 0.9500     | 0.9500    | 0.0000            | 258.0000     |

Export results

result.to_json()          # JSON string
result.to_dataframe()     # pandas DataFrame (requires pandas)
result.filter(system="headroom")  # filter to one system

Built-in datasets

| Dataset     | Domain              | Loader                                    | Install                       |
|-------------|---------------------|-------------------------------------------|-------------------------------|
| HotpotQA    | Multi-hop QA        | datasets.huggingface.hotpotqa()           | pip install -e ".[datasets]"  |
| GSM8K       | Math reasoning      | datasets.huggingface.gsm8k()              | pip install -e ".[datasets]"  |
| BFCL v3     | Function calling    | datasets.huggingface.bfcl_simple()        | pip install -e ".[datasets]"  |
| APIGen      | Multi-turn tool use | datasets.agent_traces.apigen_mt()         | pip install -e ".[datasets]"  |
| SWE-agent   | Coding agent traces | datasets.agent_traces.swe_agent_traces()  | pip install -e ".[datasets]"  |
| Local JSONL | Any                 | datasets.local.load_jsonl(path)           | Core                          |

Or bring your own — any list[dict] with "id" and "context" keys works.
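A hand-rolled dataset is just such a list; this toy example is illustrative (the question/answer fields are conventions for your own evaluator, not required keys):

```python
dataset = [
    {
        "id": "ex-1",
        "context": "The Eiffel Tower is located in Paris, France.",
        "question": "Where is the Eiffel Tower?",
        "answer": "Paris",
    },
    {
        "id": "ex-2",
        "context": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil at sea level?",
        "answer": "100 degrees Celsius",
    },
]
```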

Built-in metrics

| Metric                              | What it measures                                          |
|-------------------------------------|-----------------------------------------------------------|
| MeanScore                           | Average score across all examples                         |
| PassRate(threshold)                 | Fraction of examples scoring above threshold              |
| CompressionRatio                    | 1 - (output_tokens / input_tokens)                        |
| CostOfPass(threshold)               | Tokens spent per successful completion (arXiv:2504.13359) |
| ParetoRank                          | Rank on the quality-vs-cost Pareto frontier               |
| f1_score, exact_match, recall_score | SQuAD-standard text comparison utilities                  |
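For intuition, SQuAD-style F1 is token-overlap F1 between prediction and ground truth. A minimal reimplementation (a sketch for explanation, not the library's code, which also applies SQuAD's answer normalization):

```python
from collections import Counter

def f1_token_overlap(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

So a verbose but correct answer is penalized on precision: predicting "the city of paris" against gold "paris" gives recall 1.0 but precision 0.25.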

Installation

# Core (just tiktoken)
uv sync

# With HuggingFace dataset loaders
uv sync --extra datasets

# Everything
uv sync --all-extras

# Development
uv sync --group dev

Requires Python 3.10+ and uv.

Running tests

uv run pytest

Project structure

src/context_bench/
├── __init__.py          # Public API: evaluate, EvalResult, EvalRow, OpenAIProxy
├── types.py             # Protocol definitions (System, Evaluator, Metric)
├── runner.py            # Core evaluate() orchestration
├── results.py           # EvalRow / EvalResult dataclasses
├── registry.py          # Plugin system for named components
├── systems/             # Built-in systems (OpenAIProxy)
├── datasets/            # Built-in dataset loaders
├── metrics/             # MeanScore, PassRate, CompressionRatio, CostOfPass, ParetoRank
├── reporters/           # Markdown and JSON output formatters
└── utils/tokens.py      # Pluggable tokenizer (default: tiktoken cl100k_base)
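Since the tokenizer is pluggable with tiktoken's cl100k_base as the default, a standalone counter might look like this (an illustration of the idea, not the utils/tokens.py API; the whitespace fallback is an assumption for environments without tiktoken):

```python
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        # Real BPE token count, matching the default tokenizer.
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Crude fallback: whitespace-delimited words as a token proxy.
        return len(text.split())
```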

CI/CD

This project uses GitHub Actions for continuous integration:

# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv python install ${{ matrix.python-version }}
      - run: uv sync --group dev
      - run: uv run pytest

License

MIT
