stability-eval
Stability-first assertions for LLM prompts and agents. A plugin for DeepEval.
LLM evals tell you if your prompt works once. stability-eval tells you if it works every time.
from stability_eval import stable, cross_model_agreement, perturbation_stable

@stable(runs=5, threshold=1.0)
def test_invoice_extraction():
    ...

@cross_model_agreement(models=["gpt-4o-mini", "claude-haiku-4-5-20251001", "gemini/gemini-2.0-flash"], threshold=0.85)
def test_classifier_prompt():
    ...

@perturbation_stable(n=10, threshold=0.9)
def test_extraction_robustness():
    ...
Why?
DeepEval and Promptfoo are great at answering "is this output correct?", but neither makes stability a first-class assertion. Most agent failures in production are flakiness rather than incorrectness, and pass@N (any of N runs passes) hides exactly the flakiness that pass^N (all of N runs pass) catches.
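To see how much pass@N can hide, a back-of-the-envelope calculation (illustrative numbers, not library output):

    p = 0.9                     # probability a single run passes
    pass_at_5 = 1 - (1 - p)**5  # ~0.99999: "any of 5" almost never goes red
    pass_pow_5 = p**5           # ~0.59: "all of 5" fails more often than not

A pass@5 suite would call this prompt reliable; @stable(runs=5, threshold=1.0) asserts pass^5 and flags it.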
Install
pip install stability-eval
The library uses litellm under the hood to talk to LLM providers, so you can use any model you already have access to. Set your API keys as environment variables before running:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
Quick example
import litellm
from stability_eval import stable, cross_model_agreement, perturbation_stable

def extract_total(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# same prompt, 5 runs, all must pass
@stable(runs=5, threshold=1.0)
def test_extraction_is_deterministic():
    out = extract_total("Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'")
    assert "1,234.56" in out

# GPT, Claude, and Gemini must agree on the output
@cross_model_agreement(
    models=["gpt-4o-mini", "claude-haiku-4-5-20251001", "gemini/gemini-2.0-flash"],
    threshold=0.85,
)
def test_extraction_agrees_across_models(model: str):
    return extract_total(
        "Extract just the total from: 'Subtotal $1100, tax $134.56. Total: $1,234.56'",
        model=model,
    )

# reword the prompt 10 ways, output must stay stable
@perturbation_stable(n=10, threshold=0.9)
def test_extraction_robust_to_phrasing(prompt: str):
    return extract_total(prompt)
Run with pytest.
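For example, assuming the tests above live in test_extraction.py:

    pytest test_extraction.py -q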
Decorators
@stable(runs=5, threshold=1.0)
Runs the test N times. Passes only if passes / runs >= threshold. At threshold=1.0 that's all-or-nothing — useful when you need to be sure a prompt is truly deterministic. Drop it to something like 0.8 if you're okay with one failure in five.
When it fails:
    AssertionError: @stable failed: 3/5 passed (rate=0.60, required>=1.0)
    Failures: ["run 1: AssertionError: assert '1,234.56' in 'The total is $1234.56'",
               "run 3: AssertionError: assert '1,234.56' in 'Total amount: 1234.56 USD'"]
@cross_model_agreement(models=[...], threshold=0.85, similarity="embedding")
Calls your function once per model (injects model= as a kwarg), then computes pairwise semantic similarity between outputs. Useful for catching prompts that only happen to work well with one model's output style.
similarity="embedding" uses sentence-transformers locally — fast and no extra API calls. similarity="judge" asks an LLM to score the similarity instead, which handles nuance better but is slower and costs money.
When it fails you also get which pair disagreed most and what each model returned, so it's usually obvious what went wrong.
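A minimal sketch of the judge mode, again reusing extract_total from the quick example:

    # an LLM judge scores pairwise similarity instead of local embeddings
    @cross_model_agreement(
        models=["gpt-4o-mini", "claude-haiku-4-5-20251001"],
        threshold=0.85,
        similarity="judge",
    )
    def test_extraction_agrees_judge(model: str):
        return extract_total("Extract just the total from: 'Total: $1,234.56'", model=model)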
@perturbation_stable(n=10, threshold=0.9, judge_model="gpt-4o-mini", prompt_var="prompt")
Rewrites the prompt N times using judge_model, runs your function on each variant, and checks that outputs stay semantically close to the baseline. Good for catching prompts that only work because of a specific phrasing — the kind of thing that breaks when a colleague touches the prompt.
Your function must accept the prompt as a kwarg. The kwarg name defaults to "prompt"; change it with prompt_var if yours is called something else.
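For instance, if your function takes question instead, a minimal sketch:

    # the prompt arrives as "question", so point prompt_var at it
    @perturbation_stable(n=10, threshold=0.9, prompt_var="question")
    def test_extraction_robust_to_rewording(question: str):
        return extract_total(question)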
Works with DeepEval
All three decorators are also exposed as BaseMetric subclasses for use inside assert_test:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from stability_eval.metrics import PassNMetric, CrossModelAgreementMetric, PerturbationStabilityMetric

# build the test case the metrics will evaluate (extract_total from the quick example)
prompt = "Extract just the total from: 'Total: $1,234.56'"
test_case = LLMTestCase(input=prompt, actual_output=extract_total(prompt))

assert_test(test_case, [
    PassNMetric(runs=5, threshold=1.0),
    CrossModelAgreementMetric(models=["gpt-4o-mini", "claude-haiku-4-5-20251001"], threshold=0.85),
])
License
MIT