# evalcheck
Bundlesize for your LLM evals. A pytest plugin and GitHub App that posts a PR comment showing which evals regressed, improved, or are new — every time you push.
## The pitch
You change a prompt, swap a model, or tweak a RAG pipeline. Did quality go up or down? Today the answer is "run the evals locally, eyeball the numbers, hope you remembered to commit them."
evalcheck makes the answer a PR comment.
## evalcheck — 24 evals run on commit a1b2c3d
| Eval | main | this PR | Δ |
|------|------|---------|--------|
| test_summarization::faithfulness | 0.84 | 0.71 | -0.13 |
| test_qa::relevance | 0.91 | 0.93 | +0.02 |
| test_classifier::accuracy | 0.87 | 0.87 | — |
1 regression, 1 improvement, 22 unchanged.
GitHub Check: failing — faithfulness dropped below 0.75 threshold.
That comment, on every push, is the entire product.
## Install

```
pip install evalcheck
```

```python
# tests/test_summarization.py
from evalcheck import eval, faithfulness

@eval(metric=faithfulness, threshold=0.75)
def test_summary(article):
    return summarize(article)
```

```
pytest
```
Plain pytest. No new runner, no flags, no cloud account. The first run writes a baseline to .evalcheck/snapshots/. Commit it. Subsequent runs diff against it.
Add the GitHub App, push the branch, and the PR comment shows up.
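Since the baseline is plain JSON committed to the repo, it diffs cleanly in git. A minimal sketch of what a snapshot file under `.evalcheck/snapshots/` might contain; the field names here are illustrative assumptions, not the plugin's documented format:

```python
import json

# Hypothetical snapshot contents; the real schema is defined by the plugin.
snapshot = """
{
  "commit": "a1b2c3d",
  "evals": {
    "test_summarization::faithfulness": {"score": 0.84, "threshold": 0.75},
    "test_qa::relevance": {"score": 0.91, "threshold": null}
  }
}
"""

baseline = json.loads(snapshot)
for name, result in baseline["evals"].items():
    # Each entry keys a fully qualified test name to its last committed score.
    print(name, result["score"])
```

Because the file is ordinary text, a score change shows up in `git diff` like any other code change, which is what makes review-time comparison possible at all.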
## How it differs

| | evalcheck | deepeval | pytest-evals | promptfoo | braintrust |
|---|---|---|---|---|---|
| Plain pytest invocation | yes | needs `deepeval test run` | needs `--run-eval` flag | own CLI | SDK-only |
| Ships LLM-as-judge metrics | yes | yes | bring your own | yes | yes |
| Regression baseline in git, not cloud | yes | cloud-tied | none | none | cloud |
| PR comment per push | yes | no | no | no | no |
| GitHub Check API status | yes | no | no | no | no |
| Composes with existing pytest fixtures | yes | adapter required | yes | n/a | n/a |
The wedge is the baseline-in-git, PR-comment, and GitHub Check rows. Nobody else ships them.
## Why this exists
Every meaningful CI signal in modern dev — coverage, bundle size, type errors, accessibility — shows up as a PR comment or a Check status. Engineers triage from the PR view, not from a separate dashboard. LLM eval tooling skipped that pattern. evalcheck closes the gap.
The pytest-native path matters because eval logic shares fixtures with the rest of your test suite: the same DB seeded the same way, the same mocked HTTP client, the same auth context. Running evals as a separate parallel test framework doubles the surface area and rots within a quarter.
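To make the fixture-sharing point concrete, here is a plain-Python sketch (no evalcheck API; all names are illustrative): one seeded fake retriever backs both an ordinary unit test and the function an eval would score, so the dependency is set up exactly once.

```python
class FakeRetriever:
    """Stands in for a vector store seeded identically across the test suite."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        # Trivial substring match; a real test double would be richer.
        return [d for d in self.docs if query.lower() in d.lower()]

# In pytest this would be a shared fixture; one seeded instance serves everything.
retriever = FakeRetriever(["Paris is the capital of France."])

def test_retrieval_unit():
    # Conventional unit test against the shared, seeded dependency.
    assert retriever.search("Paris")

def answer(question):
    # The function an eval would score, using the very same dependency.
    hits = retriever.search(question)
    return hits[0] if hits else "I don't know."

test_retrieval_unit()
print(answer("Paris"))
```

Run the evals through a separate framework and `FakeRetriever` has to be wired up a second time, which is exactly the duplicated surface area the paragraph above describes.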
## Pricing

- OSS plugin (`pip install evalcheck`) — MIT, free forever. Runs locally and in CI. Writes baselines to `.evalcheck/snapshots/`. Full functionality without ever signing up.
- GitHub App — free for public repos and the first 50 evals per private-repo run. $19 per private repo per month above that.
- Hosted dashboard (later, optional) — historical trend lines, model-vs-model side-by-sides, drill-into-failures. $49/team/month. Reads the same JSON snapshots — opting in changes nothing about how the plugin runs.
The OSS plugin is intentionally complete on its own. The GitHub App and dashboard are pure convenience layers; teams that prefer their own CI tooling can render the JSON output however they want.
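Rendering the JSON output yourself is deliberately easy. A sketch of the core diff logic, assuming snapshots reduce to a mapping of eval name to score (an assumption about the schema, not the documented format):

```python
def diff(baseline, current, eps=1e-9):
    """Classify each eval in `current` against `baseline` scores."""
    out = {"regressed": [], "improved": [], "new": [], "unchanged": []}
    for name, score in current.items():
        if name not in baseline:
            out["new"].append(name)
        elif score < baseline[name] - eps:
            out["regressed"].append(name)
        elif score > baseline[name] + eps:
            out["improved"].append(name)
        else:
            out["unchanged"].append(name)
    return out

# Scores mirroring the sample PR comment above, plus one new eval.
base = {"faithfulness": 0.84, "relevance": 0.91, "accuracy": 0.87}
head = {"faithfulness": 0.71, "relevance": 0.93, "accuracy": 0.87, "brevity": 0.80}
print(diff(base, head))
```

A CI job can run something like this and fail the build or post its own comment without ever touching the GitHub App.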
## Built-in metrics (v1)

- `faithfulness` — output grounded in the supplied context (LLM-as-judge)
- `relevance` — output answers the input
- `correctness` — output matches an expected answer (LLM-as-judge with a rubric)
- `exact_match` — deterministic string equality
- `regex_match` — deterministic pattern match
- `custom(fn)` — arbitrary scorer returning a float in [0, 1]
Multi-provider out of the box (OpenAI, Anthropic, local via Ollama). Judge model configurable per metric.
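The `custom(fn)` hook only demands a callable returning a float in [0, 1]. Whether the callable receives `(output, expected)` is an assumption here; a deterministic token-overlap scorer as a sketch of the contract:

```python
def token_overlap(output: str, expected: str) -> float:
    """Fraction of expected tokens that appear in the output; always in [0, 1]."""
    out = set(output.lower().split())
    ref = set(expected.lower().split())
    if not ref:
        # Nothing expected means nothing can be missing.
        return 1.0
    return len(out & ref) / len(ref)

print(token_overlap("The capital of France is Paris", "Paris is the capital"))
```

Presumably this would be registered as `custom(token_overlap)` alongside the built-in metrics; unlike the LLM-as-judge metrics it is fully deterministic, so it never needs a judge model.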
## Roadmap
| Weeks | Phase | Output |
|---|---|---|
| 1–3 | Plugin v1 | pip install evalcheck works. @eval decorator, 6 built-in metrics, JSON snapshot, plain pytest invocation, multi-provider. OSS on GitHub from day one. llms.txt on docs. |
| 4–6 | GitHub App v1 | Webhook receives push, runs evals in a sandboxed runner, posts PR comment, sets Check status. Free tier live. |
| 7–9 | First 10 paying installs | Cookbook PRs into LangChain, LlamaIndex, OpenAI examples. One honest "Show HN" post. Direct outreach to maintainers of public AI repos that already have eval scripts in the repo but no CI integration. No sales calls. |
| 10–12 | Compounding distribution | Programmatic SEO: evalcheck vs deepeval, evalcheck vs pytest-evals, LLM eval CI GitHub Action. Marketplace listing. Badge embed on README. PyPI download counter on landing page. |
## What's not in v1

- Hosted historical dashboard — use `git log` on `.evalcheck/snapshots/`
- Automatic dataset generation
- Slack / Teams / Discord notifications
- SSO, RBAC, multi-repo org views
- Cost / latency tracking (it's a separate concern; observability tools own that)
- A web playground
Keeping the surface this small is the only way one person ships a working product in 12 weekends.
## Kill criterion
Day 90: under 500 weekly PyPI downloads and under 3 paying GitHub App installs. Either alone is salvageable; both together means the bundlesize-for-prompts framing didn't land and the OSS-to-paid funnel isn't compounding. Walk away cleanly — keep the OSS plugin published, take the lessons into mcpguard.
## Honest risks

- deepeval can ship a real pytest plugin in a weekend. They have 15.1k stars of momentum. Mitigation: get the GitHub App and PR-comment surface live fast — that's the moat, not the plugin.
- pytest-evals can ship metrics in a weekend. They have the pytest-native pattern down. Mitigation: same — the comment surface and GitHub Check are non-trivial to add and require infra they don't have.
- The "engineers will adopt LLM evals in CI" bet. Not yet a default behaviour for most teams. evalcheck has to teach the habit while selling the tool. Distribution is the harder half of the work.
## Status
Pre-build. README written first deliberately — if the pitch isn't compelling on this page, no amount of code rescues it. If you read this far and the wedge feels real, that's the green light.
## File details

Details for the file `evalcheck-0.2.0.tar.gz`.

### File metadata

- Download URL: evalcheck-0.2.0.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 998ad2234919e6d0ab8d96ba657a0527a3ccb69ced580707bbef0ecc518b3c42 |
| MD5 | c3f52ac97493c88a3deee451c017c73c |
| BLAKE2b-256 | 2818b448d24f9418c7f5a21336e4f025630daf7f8aa4473c49be1d31e062d721 |
### Provenance

The following attestation bundles were made for `evalcheck-0.2.0.tar.gz`:

Publisher: `release.yml` on `Boiga7/evalcheck`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalcheck-0.2.0.tar.gz
- Subject digest: 998ad2234919e6d0ab8d96ba657a0527a3ccb69ced580707bbef0ecc518b3c42
- Sigstore transparency entry: 1400309676
- Permalink: Boiga7/evalcheck@1a42dee7aa705474539053efb50959ae0bb02420
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Boiga7
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1a42dee7aa705474539053efb50959ae0bb02420
- Trigger Event: push
## File details

Details for the file `evalcheck-0.2.0-py3-none-any.whl`.

### File metadata

- Download URL: evalcheck-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 77fb5d85679844835a3a2fe6b642e7775c8eab2a40ca2f4bfc74b71ef6666384 |
| MD5 | eb466bffa6606f7143cfa709e48c32ac |
| BLAKE2b-256 | 28a33039f43f80e76c092cb14e3650a1ab2e8ba21977d87f5ad05f6961d3125e |
### Provenance

The following attestation bundles were made for `evalcheck-0.2.0-py3-none-any.whl`:

Publisher: `release.yml` on `Boiga7/evalcheck`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalcheck-0.2.0-py3-none-any.whl
- Subject digest: 77fb5d85679844835a3a2fe6b642e7775c8eab2a40ca2f4bfc74b71ef6666384
- Sigstore transparency entry: 1400309711
- Permalink: Boiga7/evalcheck@1a42dee7aa705474539053efb50959ae0bb02420
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Boiga7
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1a42dee7aa705474539053efb50959ae0bb02420
- Trigger Event: push