Cost Diff CI — bundlewatch for LLM cost. Replays scenarios in CI, posts cost diff on PRs.

Maintainers

raghavg27

costdiff

Catch LLM cost regressions before they ship.

costdiff runs your LLM scenarios on every pull request, measures the cost, compares it to main, and posts a diff comment. If the PR makes your app significantly more expensive, the check fails — just like a failing test.

Think bundlewatch or size-limit, but for tokens and dollars.

  scenario           baseline      head         Δ
  support_bot        $0.0042       $0.0044      +4.7%   🟢 ok
  research_agent     $0.0810       $0.1620     +100.0%  🔴 FAIL
  ─────────────────────────────────────────────────────────
  TOTAL              $0.0852       $0.1664      +95.3%  🔴 FAIL

Why

A single prompt change, a model swap, or a new tool can quietly multiply your bill. By the time monitoring catches it in production, you've already paid.

costdiff shifts that signal left, into the PR review.

Deterministic. Replays the same scenarios with the same inputs every run.
Provider-native. Reads token usage from OpenAI and Anthropic SDKs via OpenTelemetry — no scraping, no estimation.
Noise-aware. Repeats each scenario, computes IQR, and ignores deltas that fall inside the noise floor.
Drop-in. One YAML file, one GitHub Action.

30-second quickstart

pip install inferenceci
costdiff init                       # creates costdiff.yaml + scenarios/
export OPENAI_API_KEY=...
costdiff run                        # writes report.json

That's it locally. Edit scenarios/example_openai.py to call your real code, add scenarios you care about, and re-run.

Use it in CI

Add this workflow to .github/workflows/costdiff.yml:

name: costdiff
on:
  pull_request:

jobs:
  costdiff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: raghavg27/costdiff-action@v1
        with:
          config-path: costdiff.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

On every PR, the action will:

Run your scenarios on the PR head.
Run them again on the merge-base with main.
Diff the two reports.
Post or update a sticky comment on the PR.
Fail the check if cost regression exceeds your thresholds.

The PR comment looks like:

costdiff

Result: 🔴 FAIL — regressions exceed thresholds

metric	baseline	head	Δ	status
cost (USD)	$0.0852	$0.1664	+$0.0812 (+95.3%)	🔴 regression
input tokens	12,000	24,000	+12,000 (+100.0%)	🔴 regression
calls	14	14	0 (0.0%)	⚫ no_change

Fail reasons:

total cost regression +95.3% (threshold 15.0%)

Configure (`costdiff.yaml`)

version: 1
runs_per_scenario: 3                # repeat for noise control

providers:
  openai:    { api_key_env: OPENAI_API_KEY }
  anthropic: { api_key_env: ANTHROPIC_API_KEY }

thresholds:
  cost_increase_pct: 15             # fail if total cost up >15%
  scenario_cost_increase_pct: 25    # fail if any scenario up >25%
  ignore_below_usd: 0.001           # ignore changes smaller than this

scenarios:
  - name: support_bot
    entrypoint: scenarios/support.py:run
    input: { query: "How do I reset my password?" }
    timeout_seconds: 60

  - name: research_agent
    entrypoint: scenarios/research.py:run
    input_file: scenarios/inputs/research.json
    timeout_seconds: 300

A scenario is just a Python function:

# scenarios/support.py
from openai import OpenAI

def run(input: dict) -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["query"]}],
    )
    return {"answer": resp.choices[0].message.content}

costdiff doesn't care what the function returns — it only watches the SDK calls it makes.
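The entrypoint strings in the config above have the form file.py:function. As an illustration only, here is a minimal sketch of how such a string could be resolved to a callable with the standard library. load_entrypoint is a hypothetical helper written for this example; it is not costdiff's actual loader, whose internals are not shown here.

```python
import importlib.util
from pathlib import Path


def load_entrypoint(spec: str):
    """Resolve a 'path/to/file.py:function' string to a callable.

    Hypothetical helper for illustration; costdiff's own loader may differ.
    """
    # Split on the last colon so paths that contain colons still work.
    path_str, _, func_name = spec.rpartition(":")
    if not path_str or not func_name:
        raise ValueError(f"expected 'file.py:function', got {spec!r}")

    path = Path(path_str)
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")

    # Execute the file as a module, then look up the named function.
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)
```

With the example config, load_entrypoint("scenarios/support.py:run") would return the run function, which can then be called with the scenario's input dict.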

CLI

costdiff init                                  # scaffold a project
costdiff run [--output report.json]            # run scenarios, write a report
costdiff compare baseline.json head.json       # diff two reports
costdiff pricing list                          # show the pricing table
costdiff version

compare exit codes: 0 within thresholds, 1 regression, 2 input error.

compare formats: text (default), markdown (PR comments), json (machine-readable).
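Because the exit codes are documented, compare is easy to wrap in a script. The sketch below assumes costdiff is installed and on PATH and uses only the compare invocation shown above. describe_exit and run_compare are hypothetical names chosen for this example.

```python
import subprocess

# Exit-code meanings as documented for `costdiff compare`.
EXIT_MEANINGS = {
    0: "within thresholds",
    1: "regression",
    2: "input error",
}


def describe_exit(code: int) -> str:
    """Map a compare exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_compare(baseline: str, head: str) -> int:
    """Run `costdiff compare baseline head` and return its exit code."""
    result = subprocess.run(["costdiff", "compare", baseline, head])
    return result.returncode


# Example (requires costdiff and two report files):
#   code = run_compare("baseline.json", "head.json")
#   print(describe_exit(code))
```

Treating 2 separately from 1 lets a pipeline distinguish a genuine cost regression from a missing or malformed report.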

Tuning thresholds

Three knobs — start with the defaults, tighten as your prompts stabilize.

knob	default	tighten when…	loosen when…
`cost_increase_pct`	15	prompts are stable, cost is critical	actively iterating
`scenario_cost_increase_pct`	25	catching localized blowups matters	long-tail scenarios are noisy
`ignore_below_usd`	0.001	every cent counts	scenarios are tiny by design
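To make the interaction of the three knobs concrete, here is a rough sketch of the threshold logic. It rests on assumptions: percentages are computed relative to the baseline, and ignore_below_usd is applied to the absolute dollar delta. The Thresholds class and fail_reasons function are hypothetical names for this example, not costdiff's internals.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    cost_increase_pct: float = 15.0           # total cost, percent
    scenario_cost_increase_pct: float = 25.0  # any single scenario, percent
    ignore_below_usd: float = 0.001           # absolute delta to ignore


def pct_change(baseline: float, head: float) -> float:
    """Percent change relative to baseline; infinite if baseline is zero."""
    if baseline == 0:
        return 0.0 if head == 0 else float("inf")
    return (head - baseline) / baseline * 100.0


def fail_reasons(baseline: dict[str, float], head: dict[str, float],
                 t: Thresholds) -> list[str]:
    """Return human-readable reasons the check would fail (assumed semantics)."""
    reasons = []
    # Per-scenario check, skipping deltas below the dollar floor.
    for name in sorted(baseline.keys() & head.keys()):
        if abs(head[name] - baseline[name]) < t.ignore_below_usd:
            continue
        pct = pct_change(baseline[name], head[name])
        if pct > t.scenario_cost_increase_pct:
            reasons.append(f"{name} cost regression {pct:+.1f}% "
                           f"(threshold {t.scenario_cost_increase_pct:.1f}%)")
    # Total check across all scenarios.
    total_b, total_h = sum(baseline.values()), sum(head.values())
    if abs(total_h - total_b) >= t.ignore_below_usd:
        pct = pct_change(total_b, total_h)
        if pct > t.cost_increase_pct:
            reasons.append(f"total cost regression {pct:+.1f}% "
                           f"(threshold {t.cost_increase_pct:.1f}%)")
    return reasons
```

Fed the numbers from the example table at the top (support_bot $0.0042 to $0.0044, research_agent $0.0810 to $0.1620), this sketch skips support_bot because its delta is under the dollar floor, and flags both research_agent and the total.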

If you see false positives, bump runs_per_scenario before touching thresholds. The IQR-overlap test treats noisy metrics as noise (⚪) and won't fail on them, so more runs = less flake.
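One plausible reading of the IQR-overlap test is sketched below: a metric counts as noise when the interquartile ranges of the baseline runs and the head runs overlap. This is an assumption about the test's shape, not a description of costdiff's exact implementation. iqr_bounds and is_noise are hypothetical names.

```python
import statistics


def iqr_bounds(samples: list[float]) -> tuple[float, float]:
    """Return (Q1, Q3) for a list of at least two samples."""
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, q3


def is_noise(baseline_runs: list[float], head_runs: list[float]) -> bool:
    """True when the two interquartile ranges overlap (assumed definition)."""
    b_lo, b_hi = iqr_bounds(baseline_runs)
    h_lo, h_hi = iqr_bounds(head_runs)
    # Two closed intervals overlap when each starts before the other ends.
    return b_lo <= h_hi and h_lo <= b_hi
```

Under this reading, more runs per scenario give each side a better estimate of its own spread, so the overlap decision is less likely to flip on a single unlucky run.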

What it tracks

For each scenario:

cost in USD (per provider, per call, summed per run)
input / output / cached tokens
LLM calls + tool calls
wall-clock latency

For each metric: median, p95, and IQR across runs.

Top-level totals are the sum across scenarios.
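As a rough illustration of these statistics, the sketch below computes median, p95, and IQR for one metric across runs, then sums per-scenario values into a total. The nearest-rank percentile and the inclusive quantile method are choices made for this example, and summing medians is an assumption; costdiff may compute these differently. summarize and total_of are hypothetical names.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Median, p95, and IQR for one metric across runs (needs >= 2 runs)."""
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "median": statistics.median(samples),
        "p95": percentile(samples, 95),
        "iqr": q3 - q1,
    }


def total_of(per_scenario: dict[str, dict[str, float]], key: str = "median") -> float:
    """Sum one summary statistic across scenarios (assumed to be the median)."""
    return sum(stats[key] for stats in per_scenario.values())
```

With only three runs per scenario, p95 under nearest-rank is simply the largest observed value, which is one reason to raise runs_per_scenario when tail behavior matters.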

Pricing

A pricing table for OpenAI and Anthropic ships with the package. To override:

pricing_file: pricing.yaml

# pricing.yaml
version: 1
last_updated: 2026-05-01
providers:
  openai:
    gpt-4o:
      input_per_1m: 2.50
      output_per_1m: 10.00
      cached_input_per_1m: 1.25
  anthropic:
    claude-opus-4-7:
      input_per_1m: 15.00
      output_per_1m: 75.00
      cache_write_per_1m: 18.75
      cache_read_per_1m: 1.50

If a model isn't in the table, its tokens are still counted, but the report flags it under warnings and reports cost_median_usd: null for those calls.
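The per-1M prices translate into a per-call cost with simple arithmetic. The sketch below uses the gpt-4o rates from the example pricing.yaml and assumes that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the normal input rate. call_cost_usd is a hypothetical helper, not costdiff's pricing code.

```python
from typing import Optional

# Rates copied from the example pricing.yaml above (USD per 1M tokens).
PRICING = {
    ("openai", "gpt-4o"): {
        "input_per_1m": 2.50,
        "output_per_1m": 10.00,
        "cached_input_per_1m": 1.25,
    },
}


def call_cost_usd(provider: str, model: str, input_tokens: int,
                  output_tokens: int, cached_tokens: int = 0) -> Optional[float]:
    """Cost of one call in USD, or None when the model has no pricing entry."""
    rates = PRICING.get((provider, model))
    if rates is None:
        return None  # tokens are still counted elsewhere; cost is unknown

    # Assumption: cached tokens are included in input_tokens.
    uncached = input_tokens - cached_tokens
    cached_rate = rates.get("cached_input_per_1m", rates["input_per_1m"])
    total = (uncached * rates["input_per_1m"]
             + cached_tokens * cached_rate
             + output_tokens * rates["output_per_1m"])
    return total / 1_000_000
```

For example, 1,000 input tokens and 500 output tokens on gpt-4o come to (1,000 × 2.50 + 500 × 10.00) / 1,000,000 = $0.0075. Returning None for an unknown model mirrors the null cost described above.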

How it works

costdiff installs OpenTelemetry GenAI auto-instrumentation for the official OpenAI and Anthropic SDKs, runs your scenarios in-process, and reads token counts directly from the spans the instrumentations emit.

No SDK forks. No HTTP scraping. No accuracy tradeoffs. The token counts are the same ones your provider bills you for.

Versions are pinned because GenAI semantic conventions are still moving. See docs/internals.md for the attribute list.
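To give a feel for what reading token counts from spans involves, here is a sketch that aggregates usage from span attributes represented as plain dicts. The attribute names follow the OpenTelemetry GenAI semantic conventions as currently published; the exact attributes costdiff reads are listed in docs/internals.md and may differ. tokens_by_model is a hypothetical helper, and real spans would come from an OpenTelemetry span processor or exporter rather than a list of dicts.

```python
# Attribute names from the OpenTelemetry GenAI semantic conventions.
# Assumption: these match what the pinned instrumentations emit.
MODEL_KEY = "gen_ai.request.model"
INPUT_KEY = "gen_ai.usage.input_tokens"
OUTPUT_KEY = "gen_ai.usage.output_tokens"


def tokens_by_model(span_attributes: list[dict]) -> dict[str, dict[str, int]]:
    """Sum input/output tokens and call counts per model across spans."""
    totals: dict[str, dict[str, int]] = {}
    for attrs in span_attributes:
        # Skip spans that carry no token usage (HTTP spans, tool spans, ...).
        if INPUT_KEY not in attrs and OUTPUT_KEY not in attrs:
            continue
        model = attrs.get(MODEL_KEY, "unknown")
        entry = totals.setdefault(model, {"input": 0, "output": 0, "calls": 0})
        entry["input"] += int(attrs.get(INPUT_KEY, 0))
        entry["output"] += int(attrs.get(OUTPUT_KEY, 0))
        entry["calls"] += 1
    return totals
```

Because the conventions are still changing, attribute names like these can be renamed between releases, which is the reason given above for pinning versions.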

Privacy

By default, prompts are not written to reports (redact_prompts: true). Only token counts, costs, and model names leave your process. API keys are read from environment variables and never logged.

Frequently asked

Does this call my LLM provider for real? Yes. Scenarios make real SDK calls, which is how you get real token counts and real cost. Use small max_tokens and cheap models for the scenarios you run in CI, or use prompt caching, so the per-PR cost is cents, not dollars.

Can I use it with LangChain / LangGraph / CrewAI? Yes — anything that ultimately calls the OpenAI or Anthropic SDKs is instrumented automatically. Other providers are on the roadmap.

What about quality? Doesn't a cheaper model mean worse output? costdiff is intentionally not a quality evaluator — it's a cost guard. Pair it with your existing eval suite.

My runs are noisy. Bump runs_per_scenario from 3 → 5 or 7. The IQR-overlap test will mark overlapping metrics as noise instead of failing.

Roadmap

Google Gemini support
Cached baselines (skip re-running main)
OpenTelemetry trace export (send to your APM)
VS Code inline annotations

Out of scope: live production monitoring, quality evaluation, auto-PRs, cost-by-team attribution. See PRD.md.

Develop

uv venv --python 3.11 .venv
uv pip install -e ".[dev]"
.venv/bin/pytest -q
.venv/bin/ruff check .

License

Release history

0.1.0 (May 8, 2026), published from raghavg27/InferenceCI at tag v0.1.0
