Cost Diff CI — bundlewatch for LLM cost. Replays scenarios in CI, posts cost diff on PRs.
costdiff
Catch LLM cost regressions before they ship.
costdiff runs your LLM scenarios on every pull request, measures the cost,
compares it to main, and posts a diff comment. If the PR makes your app
significantly more expensive, the check fails — just like a failing test.
Think bundlewatch or size-limit, but for tokens and dollars.
scenario         baseline   head      Δ
support_bot      $0.0042    $0.0044   +4.7%    🟢 ok
research_agent   $0.0810    $0.1620   +100.0%  🔴 FAIL
─────────────────────────────────────────────────────────
TOTAL            $0.0852    $0.1664   +95.3%   🔴 FAIL
Why
A single prompt change, a model swap, or a new tool can quietly multiply your bill. By the time monitoring catches it in production, you've already paid.
costdiff shifts that signal left, into the PR review.
- Deterministic. Replays the same scenarios with the same inputs every run.
- Provider-native. Reads token usage from OpenAI and Anthropic SDKs via OpenTelemetry — no scraping, no estimation.
- Noise-aware. Repeats each scenario, computes IQR, and ignores deltas that fall inside the noise floor.
- Drop-in. One YAML file, one GitHub Action.
30-second quickstart
pip install inferenceci
costdiff init # creates costdiff.yaml + scenarios/
export OPENAI_API_KEY=...
costdiff run # writes report.json
That's it locally. Edit scenarios/example_openai.py to call your real code,
add scenarios you care about, and re-run.
Use it in CI
Add this workflow to .github/workflows/costdiff.yml:
name: costdiff
on:
  pull_request:
jobs:
  costdiff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: raghavg27/costdiff-action@v1
        with:
          config-path: costdiff.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
On every PR, the action will:
- Run your scenarios on the PR head.
- Run them again on the merge-base from main.
- Diff the two reports.
- Post or update a sticky comment on the PR.
- Fail the check if cost regression exceeds your thresholds.
The PR comment looks like:
costdiff
Result: 🔴 FAIL — regressions exceed thresholds
| metric | baseline | head | Δ | status |
|---|---|---|---|---|
| cost (USD) | $0.0852 | $0.1664 | +$0.0812 (+95.3%) | 🔴 regression |
| input tokens | 12,000 | 24,000 | +12,000 (+100.0%) | 🔴 regression |
| calls | 14 | 14 | 0 (0.0%) | ⚫ no_change |

Fail reasons:
- total cost regression +95.3% (threshold 15.0%)
Configure (costdiff.yaml)
version: 1
runs_per_scenario: 3              # repeat for noise control
providers:
  openai: { api_key_env: OPENAI_API_KEY }
  anthropic: { api_key_env: ANTHROPIC_API_KEY }
thresholds:
  cost_increase_pct: 15           # fail if total cost up >15%
  scenario_cost_increase_pct: 25  # fail if any scenario up >25%
  ignore_below_usd: 0.001         # ignore changes smaller than this
scenarios:
  - name: support_bot
    entrypoint: scenarios/support.py:run
    input: { query: "How do I reset my password?" }
    timeout_seconds: 60
  - name: research_agent
    entrypoint: scenarios/research.py:run
    input_file: scenarios/inputs/research.json
    timeout_seconds: 300
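To make the three threshold keys concrete, here is a minimal sketch of how they could combine into a pass/fail decision. It is not costdiff's actual implementation: the `Thresholds` class, `pct_change`, and `fail_reasons` are hypothetical names, the per-scenario message format is invented for illustration, and it assumes `ignore_below_usd` is applied to the absolute dollar delta and that both reports contain the same scenarios.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Field names mirror the keys under `thresholds:` in costdiff.yaml.
    cost_increase_pct: float = 15.0
    scenario_cost_increase_pct: float = 25.0
    ignore_below_usd: float = 0.001


def pct_change(baseline: float, head: float) -> float:
    """Relative change of head versus baseline, in percent."""
    if baseline == 0:
        return float("inf") if head > 0 else 0.0
    return (head - baseline) / baseline * 100.0


def fail_reasons(baseline: dict, head: dict, t: Thresholds) -> list:
    """Return one message per threshold that is exceeded (empty list = pass)."""
    reasons = []
    # Check the total first, then each scenario against its own limit.
    checks = [("total", sum(baseline.values()), sum(head.values()), t.cost_increase_pct)]
    checks += [
        (name, cost, head[name], t.scenario_cost_increase_pct)
        for name, cost in baseline.items()
    ]
    for label, base_cost, head_cost, limit in checks:
        # Assumption: deltas smaller than ignore_below_usd never fail the check.
        if head_cost - base_cost < t.ignore_below_usd:
            continue
        pct = pct_change(base_cost, head_cost)
        if pct > limit:
            reasons.append(f"{label} cost regression {pct:+.1f}% (threshold {limit:.1f}%)")
    return reasons


reasons = fail_reasons(
    {"support_bot": 0.0042, "research_agent": 0.0810},
    {"support_bot": 0.0044, "research_agent": 0.1620},
    Thresholds(),
)
```

With the numbers from the example table, `support_bot` is skipped because its $0.0002 delta is under `ignore_below_usd`, while the total and `research_agent` both exceed their limits.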
A scenario is just a Python function:
# scenarios/support.py
from openai import OpenAI


def run(input: dict) -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["query"]}],
    )
    return {"answer": resp.choices[0].message.content}
costdiff doesn't care what the function returns — it only watches the
SDK calls it makes.
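The `entrypoint` strings in costdiff.yaml have the form `path/to/file.py:function`. As a rough illustration of how such a string can be resolved to a callable with the standard library, consider the sketch below. `load_entrypoint` is a hypothetical helper, not part of costdiff, and the real loader may behave differently.

```python
import importlib.util
from pathlib import Path
from typing import Callable


def load_entrypoint(spec: str, root: Path = Path(".")) -> Callable[[dict], dict]:
    """Resolve 'path/to/file.py:function' to the function object."""
    # rpartition splits on the last colon, so the file path may itself contain one.
    file_part, _, func_name = spec.rpartition(":")
    path = root / file_part
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the scenario file's top-level code
    return getattr(module, func_name)
```

Calling `load_entrypoint("scenarios/support.py:run")({"query": "..."})` would then execute the scenario in-process, which is what lets instrumentation observe the SDK calls it makes.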
CLI
costdiff init # scaffold a project
costdiff run [--output report.json] # run scenarios, write a report
costdiff compare baseline.json head.json # diff two reports
costdiff pricing list # show the pricing table
costdiff version
compare exit codes: 0 within thresholds, 1 regression, 2 input error.
compare formats: text (default), markdown (PR comments), json
(machine-readable).
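If you call compare from your own tooling, the exit codes above can be mapped to outcomes as in this small sketch. `describe_exit` and `run_compare` are hypothetical helpers, and the sketch assumes the `costdiff` executable is on PATH.

```python
import subprocess

# Exit codes as documented for `costdiff compare`.
EXIT_MEANINGS = {0: "within thresholds", 1: "regression", 2: "input error"}


def describe_exit(code: int) -> str:
    """Translate a compare exit code into a human-readable outcome."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_compare(baseline_path: str, head_path: str) -> str:
    """Run `costdiff compare` on two report files and describe the result."""
    result = subprocess.run(
        ["costdiff", "compare", baseline_path, head_path],
        capture_output=True,
        text=True,
    )
    return describe_exit(result.returncode)
```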
Tuning thresholds
Three knobs — start with the defaults, tighten as your prompts stabilize.
| knob | default | tighten when… | loosen when… |
|---|---|---|---|
| cost_increase_pct | 15 | prompts are stable, cost is critical | actively iterating |
| scenario_cost_increase_pct | 25 | catching localized blowups matters | long-tail scenarios are noisy |
| ignore_below_usd | 0.001 | every cent counts | scenarios are tiny by design |
If you see false positives, bump runs_per_scenario before touching
thresholds. The IQR-overlap test treats noisy metrics as noise (⚪) and
won't fail on them, so more runs = less flake.
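One plausible reading of the IQR-overlap test is sketched below: compute the interquartile range of the baseline runs and of the head runs, and treat the metric as noise when the two ranges overlap. The function names and the choice of quantile method are assumptions; costdiff's actual test may differ in detail.

```python
import statistics


def iqr_bounds(samples):
    """Return (Q1, Q3) for a list of per-run measurements (needs >= 2 runs)."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, q3


def is_noise(baseline_runs, head_runs):
    """True when the two interquartile ranges overlap."""
    b_lo, b_hi = iqr_bounds(baseline_runs)
    h_lo, h_hi = iqr_bounds(head_runs)
    return b_lo <= h_hi and h_lo <= b_hi


# Three runs each, costs in USD.
jittery = is_noise([0.0040, 0.0042, 0.0046], [0.0041, 0.0044, 0.0045])
doubled = is_noise([0.080, 0.081, 0.082], [0.160, 0.162, 0.165])
```

Here `jittery` is `True` (the ranges overlap, so the delta is treated as noise) and `doubled` is `False` (a clear regression). With only three runs the quartiles are coarse, which is why adding runs tends to reduce flaky results.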
What it tracks
For each scenario:
- cost in USD (per provider, per call, summed per run)
- input / output / cached tokens
- LLM calls + tool calls
- wall-clock latency
For each metric: median, p95, and IQR across runs.
Top-level totals are the sum across scenarios.
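The aggregation can be pictured with the following sketch. It assumes a nearest-rank p95 and assumes the top-level total is the sum of per-scenario medians; neither detail is confirmed here, the helper names are hypothetical, and the per-run costs are made-up sample data.

```python
import math
import statistics


def p95(samples):
    """Nearest-rank 95th percentile (an assumed method, chosen for simplicity)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(samples):
    """Median, p95, and IQR of one metric across runs."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {"median": statistics.median(samples), "p95": p95(samples), "iqr": q3 - q1}


# Illustrative per-run costs in USD for runs_per_scenario: 3.
runs = {
    "support_bot": [0.0042, 0.0040, 0.0046],
    "research_agent": [0.0810, 0.0805, 0.0822],
}
per_scenario = {name: summarize(costs) for name, costs in runs.items()}
total_median = sum(s["median"] for s in per_scenario.values())
```

Note that with three runs the nearest-rank p95 is simply the largest sample, so it only becomes informative at higher run counts.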
Pricing
A pricing table for OpenAI and Anthropic ships with the package. To override:
pricing_file: pricing.yaml

# pricing.yaml
version: 1
last_updated: 2026-05-01
providers:
  openai:
    gpt-4o:
      input_per_1m: 2.50
      output_per_1m: 10.00
      cached_input_per_1m: 1.25
  anthropic:
    claude-opus-4-7:
      input_per_1m: 15.00
      output_per_1m: 75.00
      cache_write_per_1m: 18.75
      cache_read_per_1m: 1.50
If a model isn't in the table, its tokens are still counted, but the report
flags it under warnings and reports cost_median_usd: null for those calls.
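As a worked example of how per-1M rates turn token counts into dollars, here is a minimal sketch using the gpt-4o rates from the sample pricing.yaml. `call_cost_usd` and the `PRICING` dict are hypothetical, and the sketch assumes cached tokens are a subset of the input tokens and are billed at the cached rate instead of the normal input rate.

```python
from typing import Optional

# Rates copied from the sample pricing.yaml above (USD per 1M tokens).
PRICING = {
    ("openai", "gpt-4o"): {
        "input_per_1m": 2.50,
        "output_per_1m": 10.00,
        "cached_input_per_1m": 1.25,
    },
}


def call_cost_usd(provider: str, model: str, input_tokens: int,
                  output_tokens: int, cached_tokens: int = 0) -> Optional[float]:
    """Cost of one call, or None when the model is missing from the table."""
    rates = PRICING.get((provider, model))
    if rates is None:
        return None  # tokens can still be counted; cost is unknown
    uncached = input_tokens - cached_tokens  # assumption: cached is a subset of input
    return (
        uncached * rates["input_per_1m"]
        + cached_tokens * rates["cached_input_per_1m"]
        + output_tokens * rates["output_per_1m"]
    ) / 1_000_000
```

For 12,000 input tokens and 1,000 output tokens this gives (12,000 × 2.50 + 1,000 × 10.00) / 1,000,000 = $0.04; if 4,000 of those input tokens were cached, the cost drops to $0.035.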
How it works
costdiff installs OpenTelemetry GenAI auto-instrumentation for the official
OpenAI and Anthropic SDKs, runs your scenarios in-process, and reads token
counts directly from the spans the instrumentations emit.
No SDK forks. No HTTP scraping. Token counts are the ones the provider returns in its API responses, so nothing is estimated; dollar figures are those counts priced with the pricing table.
Versions are pinned because GenAI semantic conventions are still moving. See
docs/internals.md for the attribute list.
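To illustrate what reading token counts from spans amounts to, the sketch below tallies usage from span attribute dictionaries. Plain dicts stand in for real OpenTelemetry spans, `tally_tokens` is a hypothetical helper, and the attribute names are taken from the OpenTelemetry GenAI semantic conventions as commonly published; the names costdiff pins may differ, so treat docs/internals.md as the authority.

```python
from collections import Counter

# Assumed attribute names; the conventions are still evolving.
MODEL_KEY = "gen_ai.request.model"
INPUT_KEY = "gen_ai.usage.input_tokens"
OUTPUT_KEY = "gen_ai.usage.output_tokens"


def tally_tokens(span_attributes):
    """Sum input/output tokens and call counts per model."""
    totals = {}
    for attrs in span_attributes:
        model = attrs.get(MODEL_KEY, "unknown")
        bucket = totals.setdefault(model, Counter())
        bucket["input"] += attrs.get(INPUT_KEY, 0)
        bucket["output"] += attrs.get(OUTPUT_KEY, 0)
        bucket["calls"] += 1
    return totals


# Two fake spans, shaped like the attributes an instrumented SDK call might emit.
spans = [
    {MODEL_KEY: "gpt-4o-mini", INPUT_KEY: 120, OUTPUT_KEY: 40},
    {MODEL_KEY: "gpt-4o-mini", INPUT_KEY: 300, OUTPUT_KEY: 75},
]
usage = tally_tokens(spans)
```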
Privacy
Prompts are never written to reports by default (redact_prompts: true).
Only token counts, costs, and model names leave your process. API keys are
read from environment variables and never logged.
Frequently asked
Does this call my LLM provider for real?
Yes. Scenarios run real SDK calls, which is how you get real token counts
and real cost. Use small max_tokens and cheap models on the scenarios you
run in CI, or use prompt caching, so the per-PR cost is cents, not dollars.
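A back-of-the-envelope estimate of what one PR costs to check can help when sizing scenarios. The sketch below assumes every scenario is repeated `runs_per_scenario` times on both the PR head and the merge-base; `estimated_pr_cost_usd` is a hypothetical helper and the per-run costs are the ones from the example table.

```python
def estimated_pr_cost_usd(per_run_costs: dict, runs_per_scenario: int, sides: int = 2) -> float:
    """Rough spend per PR: each scenario runs N times on head and on the baseline."""
    return sum(per_run_costs.values()) * runs_per_scenario * sides


estimate = estimated_pr_cost_usd(
    {"support_bot": 0.0042, "research_agent": 0.0810},
    runs_per_scenario=3,
)
```

With these numbers the estimate is about $0.51 per PR (0.0852 × 3 × 2), almost all of it from `research_agent`, which makes that scenario the first candidate for a cheaper model or a smaller max_tokens.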
Can I use it with LangChain / LangGraph / CrewAI?
Yes. Anything that ultimately calls the OpenAI or Anthropic SDKs is
instrumented automatically. Other providers are on the roadmap.
What about quality? Doesn't a cheaper model mean worse output?
costdiff is intentionally not a quality evaluator — it's a cost guard.
Pair it with your existing eval suite.
My runs are noisy.
Bump runs_per_scenario from 3 → 5 or 7. The IQR-overlap test will mark
overlapping metrics as noise instead of failing.
Roadmap
- Google Gemini support
- Cached baselines (skip re-running main)
- OpenTelemetry trace export (send to your APM)
- VS Code inline annotations
Out of scope: live production monitoring, quality evaluation, auto-PRs,
cost-by-team attribution. See PRD.md.
Develop
uv venv --python 3.11 .venv
uv pip install -e ".[dev]"
.venv/bin/pytest -q
.venv/bin/ruff check .
License
Proprietary. © 2026 InferenceLabs.