PR comments for LLM eval regressions. A pytest plugin and GitHub App.

Project description

evalcheck

Bundlesize for your LLM evals. A pytest plugin and GitHub App that posts a PR comment showing which evals regressed, improved, or are new — every time you push.

The pitch

You change a prompt, swap a model, or tweak a RAG pipeline. Did quality go up or down? Today the answer is "run the evals locally, eyeball the numbers, hope you remembered to commit them."

evalcheck makes the answer a PR comment.

## evalcheck — 24 evals run on commit a1b2c3d

| Eval | main | this PR |   Δ    |
|------|------|---------|--------|
| test_summarization::faithfulness | 0.84 | 0.71 | -0.13 |
| test_qa::relevance               | 0.91 | 0.93 | +0.02 |
| test_classifier::accuracy        | 0.87 | 0.87 |   —   |

1 regression, 1 improvement, 22 unchanged.
GitHub Check: failing — faithfulness dropped below 0.75 threshold.

That comment, on every push, is the entire product.
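A minimal sketch of how a comment like the one above could be rendered from per-eval scores. The `EvalRow` type, the `render_comment` function, the `eps` cutoff, and the `=` / `new` markers are illustrative assumptions, not evalcheck's actual internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRow:
    name: str
    base: Optional[float]  # score on main; None when the eval is new in this PR
    head: float            # score on the PR branch


def format_delta(row: EvalRow, eps: float = 0.005) -> str:
    """Format the change column: 'new', '=' for no visible change, else a signed delta."""
    if row.base is None:
        return "new"
    delta = row.head - row.base
    if abs(delta) < eps:
        return "="
    return f"{delta:+.2f}"


def render_comment(rows: list[EvalRow], commit: str) -> str:
    """Build the markdown body of the PR comment (hypothetical layout)."""
    lines = [
        f"## evalcheck: {len(rows)} evals run on commit {commit}",
        "",
        "| Eval | main | this PR | Δ |",
        "|------|------|---------|---|",
    ]
    for row in rows:
        base = "n/a" if row.base is None else f"{row.base:.2f}"
        lines.append(f"| {row.name} | {base} | {row.head:.2f} | {format_delta(row)} |")
    return "\n".join(lines)
```

Rendering to a plain string keeps the formatting step independent of how the comment is posted, so the same text could be printed in a CI log or sent to a PR.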

Install

pip install evalcheck

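# tests/test_summarization.py
from evalcheck import eval, faithfulness

# `article` is an ordinary pytest fixture and `summarize` is your own code under test;
# import or define both as you would for any other test.
@eval(metric=faithfulness, threshold=0.75)
def test_summary(article):
    return summarize(article)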

pytest

Plain pytest. No new runner, no flags, no cloud account. The first run writes a baseline to .evalcheck/snapshots/. Commit it. Subsequent runs diff against it.
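The baseline-then-diff flow described above can be sketched roughly as follows. This assumes a snapshot is a flat JSON object mapping eval names to scores; the file layout, the function names, and the `tolerance` value are hypothetical and may not match what evalcheck actually writes.

```python
import json
from pathlib import Path


def load_snapshot(path: Path) -> dict[str, float]:
    """Return the committed baseline, or {} on the first run when none exists."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_snapshot(path: Path, scores: dict[str, float]) -> None:
    """Write scores with sorted keys so git diffs of the snapshot stay stable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scores, indent=2, sort_keys=True) + "\n")


def diff_scores(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, list[str]]:
    """Classify each current eval as regressed, improved, unchanged, or new."""
    result: dict[str, list[str]] = {
        "regressed": [], "improved": [], "unchanged": [], "new": [],
    }
    for name, score in sorted(current.items()):
        if name not in baseline:
            result["new"].append(name)
            continue
        delta = score - baseline[name]
        if delta < -tolerance:
            result["regressed"].append(name)
        elif delta > tolerance:
            result["improved"].append(name)
        else:
            result["unchanged"].append(name)
    return result
```

A small tolerance is one way to keep noisy LLM-as-judge scores from flagging every run as a change; the right value depends on how stable the metric is.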

Add the GitHub App, push the branch, and the PR comment shows up.

How it differs

| Feature | evalcheck | deepeval | pytest-evals | promptfoo | braintrust |
|---------|-----------|----------|--------------|-----------|------------|
| Plain `pytest` invocation | yes | needs `deepeval test run` | needs `--run-eval` flag | own CLI | SDK-only |
| Ships LLM-as-judge metrics | yes | yes | bring your own | yes | yes |
| Regression baseline in git, not cloud | yes | cloud-tied | none | none | cloud |
| PR comment per push | yes | no | no | no | no |
| GitHub Check API status | yes | no | no | no | no |
| Composes with existing pytest fixtures | yes | adapter required | yes | n/a | n/a |

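The wedge is the three rows where only evalcheck says yes: the regression baseline in git, the PR comment per push, and the GitHub Check API status. Nobody else ships them.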

Why this exists

Every meaningful CI signal in modern dev — coverage, bundle size, type errors, accessibility — shows up as a PR comment or a Check status. Engineers triage from the PR view, not from a separate dashboard. LLM eval tooling skipped that pattern. evalcheck closes the gap.

The pytest-native path matters because eval logic shares fixtures with the rest of your test suite: the same DB seeded the same way, the same mocked HTTP client, the same auth context. Running evals as a separate parallel test framework doubles the surface area and rots within a quarter.

Pricing

- OSS plugin (pip install evalcheck): MIT, free forever. Runs locally and in CI. Writes baselines to .evalcheck/snapshots/. Full functionality without ever signing up.
- GitHub App: free for public repos and for the first 50 evals per private-repo run. $19 per private repo per month above that.
- Hosted dashboard (later, optional): historical trend lines, model-vs-model side-by-sides, drill-into-failures. $49/team/month. Reads the same JSON snapshots, so opting in changes nothing about how the plugin runs.

The OSS plugin is intentionally complete on its own. The GitHub App and dashboard are pure convenience layers; teams that prefer their own CI tooling can render the JSON output however they want.
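For teams wiring the JSON into their own CI, a threshold gate might look like the sketch below. It assumes the snapshot is a flat JSON object of eval name to score and that thresholds are supplied separately; both the shape and the function names are assumptions for illustration, not a documented evalcheck format.

```python
import json


def failing_evals(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of evals that are missing or score below their minimum threshold."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if name not in scores or scores[name] < minimum
    )


def gate(snapshot_json: str, thresholds: dict[str, float]) -> int:
    """Return a process exit code: 1 if any eval fails its threshold, else 0."""
    scores = json.loads(snapshot_json)
    failed = failing_evals(scores, thresholds)
    for name in failed:
        # A missing eval is reported as "missing" rather than given a fake score.
        print(f"FAIL {name}: {scores.get(name, 'missing')} < {thresholds[name]}")
    return 1 if failed else 0
```

A CI step could read the snapshot file, pass its contents to `gate`, and hand the result to `sys.exit` so the job fails when a threshold is breached.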

Built-in metrics (v1)

- faithfulness: output grounded in the supplied context (LLM-as-judge)
- relevance: output answers the input
- correctness: output matches an expected answer (LLM-as-judge with a rubric)
- exact_match: deterministic string equality
- regex_match: deterministic pattern match
- custom(fn): arbitrary scorer returning a float in [0, 1]

Multi-provider out of the box (OpenAI, Anthropic, local via Ollama). Judge model configurable per metric.
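To make the deterministic metrics and the custom-scorer contract concrete, here is a standalone sketch. These functions are stand-ins written for illustration and are not evalcheck's own implementations; in particular, using `re.search` rather than a full match, and the `keyword_coverage` helper, are assumptions.

```python
import re
from typing import Callable


def exact_match_score(output: str, expected: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == expected else 0.0


def regex_match_score(output: str, pattern: str) -> float:
    """1.0 when the pattern is found anywhere in the output, otherwise 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0


def keyword_coverage(keywords: list[str]) -> Callable[[str], float]:
    """Build a custom scorer: the fraction of keywords present in the output."""

    def score(output: str) -> float:
        if not keywords:
            return 1.0  # nothing required, so the output trivially passes
        text = output.lower()
        hits = sum(1 for keyword in keywords if keyword.lower() in text)
        return hits / len(keywords)

    return score
```

A scorer shaped like `keyword_coverage([...])` always returns a float in [0, 1], which is the contract the `custom(fn)` entry describes.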

Roadmap

| Weeks | Phase | Output |
|-------|-------|--------|
| 1–3 | Plugin v1 | `pip install evalcheck` works. `@eval` decorator, 6 built-in metrics, JSON snapshot, plain `pytest` invocation, multi-provider. OSS on GitHub from day one. llms.txt on docs. |
| 4–6 | GitHub App v1 | Webhook receives push, runs evals in a sandboxed runner, posts PR comment, sets Check status. Free tier live. |
| 7–9 | First 10 paying installs | Cookbook PRs into LangChain, LlamaIndex, OpenAI examples. One honest "Show HN" post. Direct outreach to maintainers of public AI repos that already have eval scripts in the repo but no CI integration. No sales calls. |
| 10–12 | Compounding distribution | Programmatic SEO: `evalcheck vs deepeval`, `evalcheck vs pytest-evals`, `LLM eval CI GitHub Action`. Marketplace listing. Badge embed on README. PyPI download counter on landing page. |

What's not in v1

- Hosted historical dashboard (use git log on .evalcheck/snapshots/ instead)
- Automatic dataset generation
- Slack / Teams / Discord notifications
- SSO, RBAC, multi-repo org views
- Cost / latency tracking (it's a separate concern; observability tools own that)
- A web playground

Keeping the surface this small is the only way one person ships a working product in 12 weekends.

Kill criterion

Day 90: under 500 weekly PyPI downloads and under 3 paying GitHub App installs. Either alone is salvageable; both together means the bundlesize-for-prompts framing didn't land and the OSS-to-paid funnel isn't compounding. Walk away cleanly — keep the OSS plugin published, take the lessons into mcpguard.

Honest risks

- deepeval can ship a real pytest plugin in a weekend. They have 15.1k stars of momentum. Mitigation: get the GitHub App and PR-comment surface live fast. That's the moat, not the plugin.
- pytest-evals can ship metrics in a weekend. They have the pytest-native pattern down. Mitigation: the same. The comment surface and GitHub Check are non-trivial to add and require infra they don't have.
- The "engineers will adopt LLM evals in CI" bet. Running evals in CI is not yet a default behaviour for most teams, so evalcheck has to teach the habit while selling the tool. Distribution is the harder half of the work.

Status

Pre-build. README written first deliberately — if the pitch isn't compelling on this page, no amount of code rescues it. If you read this far and the wedge feels real, that's the green light.

Project details

Maintainers

Boiga7

Release history

- 0.2.0 (this version): Apr 29, 2026
- 0.1.0: Apr 29, 2026

Download files

Source Distribution

evalcheck-0.2.0.tar.gz (24.6 kB)

Uploaded Apr 29, 2026 Source

Built Distribution

evalcheck-0.2.0-py3-none-any.whl (19.0 kB)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file evalcheck-0.2.0.tar.gz.

File metadata

Download URL: evalcheck-0.2.0.tar.gz
Upload date: Apr 29, 2026
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalcheck-0.2.0.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `998ad2234919e6d0ab8d96ba657a0527a3ccb69ced580707bbef0ecc518b3c42` |
| MD5 | `c3f52ac97493c88a3deee451c017c73c` |
| BLAKE2b-256 | `2818b448d24f9418c7f5a21336e4f025630daf7f8aa4473c49be1d31e062d721` |

Provenance

The following attestation bundles were made for evalcheck-0.2.0.tar.gz:

Publisher: release.yml on Boiga7/evalcheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalcheck-0.2.0.tar.gz
- Subject digest: 998ad2234919e6d0ab8d96ba657a0527a3ccb69ced580707bbef0ecc518b3c42
- Sigstore transparency entry: 1400309676
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: Boiga7/evalcheck@1a42dee7aa705474539053efb50959ae0bb02420
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Boiga7
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1a42dee7aa705474539053efb50959ae0bb02420
- Trigger Event: push

File details

Details for the file evalcheck-0.2.0-py3-none-any.whl.

File metadata

Download URL: evalcheck-0.2.0-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 19.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalcheck-0.2.0-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `77fb5d85679844835a3a2fe6b642e7775c8eab2a40ca2f4bfc74b71ef6666384` |
| MD5 | `eb466bffa6606f7143cfa709e48c32ac` |
| BLAKE2b-256 | `28a33039f43f80e76c092cb14e3650a1ab2e8ba21977d87f5ad05f6961d3125e` |

Provenance

The following attestation bundles were made for evalcheck-0.2.0-py3-none-any.whl:

Publisher: release.yml on Boiga7/evalcheck

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalcheck-0.2.0-py3-none-any.whl
- Subject digest: 77fb5d85679844835a3a2fe6b642e7775c8eab2a40ca2f4bfc74b71ef6666384
- Sigstore transparency entry: 1400309711
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: Boiga7/evalcheck@1a42dee7aa705474539053efb50959ae0bb02420
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Boiga7
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1a42dee7aa705474539053efb50959ae0bb02420
- Trigger Event: push
