

stability-eval

Stability-first assertions for LLM prompts and agents. A plugin for DeepEval.

LLM evals tell you if your prompt works once. stability-eval tells you if it works every time.

from stability_eval import stable, cross_model_agreement, perturbation_stable

@stable(runs=5, threshold=1.0)
def test_invoice_extraction():
    ...

@cross_model_agreement(models=["gpt-4o-mini", "claude-opus-4-7", "gemini/gemini-2.0-flash"], threshold=0.85)
def test_classifier_prompt():
    ...

@perturbation_stable(n=10, threshold=0.9)
def test_extraction_robustness():
    ...

Why?

DeepEval and Promptfoo are great at answering "is this output correct?", but neither makes stability a first-class assertion. Most agent failures in production are flakiness, not correctness, and pass@N (does any one of N runs pass?) hides exactly the flakiness that pass^N (do all N runs pass?) catches.
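
A quick back-of-the-envelope illustration (plain Python, not part of the library): suppose a flaky prompt passes each independent run with probability 0.8.

# Illustration only: pass@N vs pass^N for a flaky test.
p = 0.8  # assumed per-run pass probability
n = 5    # number of runs

pass_at_n = 1 - (1 - p) ** n  # at least one of N runs passes
pass_all_n = p ** n           # every one of N runs passes

print(f"pass@{n} = {pass_at_n:.3f}")   # ~1.000 -> looks healthy
print(f"pass^{n} = {pass_all_n:.3f}")  # ~0.328 -> exposes the flakiness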

Install

pip install stability-eval

The library uses litellm under the hood to talk to LLM providers, so you can use any model you already have access to. Set your API keys as environment variables before running:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."

Quick example

import litellm
from stability_eval import stable, cross_model_agreement, perturbation_stable


def extract_total(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content


# same prompt, 5 runs, all must pass
@stable(runs=5, threshold=1.0)
def test_extraction_is_deterministic():
    out = extract_total("Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'")
    assert "1,234.56" in out


# GPT, Claude and Gemini must agree on the output
@cross_model_agreement(
    models=["gpt-4o-mini", "claude-haiku-4-5-20251001", "gemini/gemini-2.0-flash"],
    threshold=0.85,
)
def test_extraction_agrees_across_models(model: str):
    return extract_total(
        "Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'",
        model=model,
    )


# reword the prompt 10 ways, output must stay stable
@perturbation_stable(n=10, threshold=0.9)
def test_extraction_robust_to_phrasing(prompt: str):
    return extract_total(prompt)

Run with pytest.

Decorators

@stable(runs=5, threshold=1.0)

Runs the test N times. Passes only if passes / runs >= threshold. At threshold=1.0 that's all-or-nothing — useful when you need to be sure a prompt is truly deterministic. Drop it to something like 0.8 if you're okay with one failure in five.

When it fails:

AssertionError: @stable failed: 3/5 passed (rate=0.60, required>=1.0)
Failures: ["run 1: AssertionError: assert '1,234.56' in 'The total is $1234.56'",
           "run 3: AssertionError: assert '1,234.56' in 'Total amount: 1234.56 USD'"]

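If strict determinism is more than you need, the same decorator takes a looser bar. A minimal sketch, reusing the extract_total helper from the quick example (the test name is illustrative):

# Tolerate at most one failure in five runs.
@stable(runs=5, threshold=0.8)
def test_extraction_mostly_stable():
    out = extract_total("Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'")
    assert "1,234.56" in out
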
@cross_model_agreement(models=[...], threshold=0.85, similarity="embedding")

Calls your function once per model (injects model= as a kwarg), then computes pairwise semantic similarity between outputs. Useful for catching prompts that only happen to work well with one model's output style.

similarity="embedding" uses sentence-transformers locally — fast and no extra API calls. similarity="judge" asks an LLM to score the similarity instead, which handles nuance better but is slower and costs money.
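
For intuition, the embedding comparison boils down to something like the sketch below (not the plugin's exact internals; the sentence-transformers model name is an assumption):

from sentence_transformers import SentenceTransformer, util

# Encode each model's output and compare every pair with cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
outputs = {
    "gpt-4o-mini": "$1,234.56",
    "claude-haiku-4-5-20251001": "The total is $1,234.56",
}
embeddings = encoder.encode(list(outputs.values()), convert_to_tensor=True)
print(util.cos_sim(embeddings, embeddings))  # pairwise similarity matrix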

When it fails you also get which pair disagreed most and what each model returned, so it's usually obvious what went wrong.

@perturbation_stable(n=10, threshold=0.9, judge_model="gpt-4o-mini", prompt_var="prompt")

Rewrites the prompt N times using judge_model, runs your function on each variant, and checks that outputs stay semantically close to the baseline. Good for catching prompts that only work because of a specific phrasing — the kind of thing that breaks when a colleague touches the prompt.

Your function must accept the prompt as a kwarg. The kwarg name defaults to "prompt"; change it with prompt_var if yours is called something else.
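
For example, if your function calls the argument query instead (function and argument names are illustrative):

# The rewritten prompt variants are injected as `query` instead of `prompt`.
@perturbation_stable(n=10, threshold=0.9, judge_model="gpt-4o-mini", prompt_var="query")
def test_extraction_with_renamed_kwarg(query: str):
    return extract_total(query)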

Works with DeepEval

All three decorators are also exposed as BaseMetric subclasses for use inside assert_test:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from stability_eval.metrics import PassNMetric, CrossModelAgreementMetric, PerturbationStabilityMetric

# Wrap the prompt and the output under test in a DeepEval test case.
test_case = LLMTestCase(
    input="Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'",
    actual_output="$1,234.56",
)

assert_test(test_case, [
    PassNMetric(runs=5, threshold=1.0),
    CrossModelAgreementMetric(models=["gpt-4o-mini", "claude-haiku-4-5-20251001"], threshold=0.85),
])

License

MIT
