CLI-first LLM stability analyzer for measuring output consistency across repeated prompt runs.
ai-stability
ai-stability is a CLI-first LLM stability analyzer for developers who want to measure output consistency, detect prompt variance, and inspect unstable model behavior locally.
It runs the same prompt multiple times against the same model, compares the responses, computes a simple stability score, and saves a local JSON artifact for replay and debugging.
Why It Exists
LLM outputs often vary even when the prompt, model, and calling code stay the same. That makes it harder to:
- evaluate prompt reliability
- spot regressions during model upgrades
- understand whether output drift is minor wording variance or meaningful behavior change
- build confidence in AI-powered developer tooling
ai-stability is intentionally narrow and local-first:
- one prompt file in
- repeated model calls
- simple, explicit similarity scoring
- readable terminal output
- JSON artifact saved locally for replay and debugging
Features
- CLI-first workflow with no database, dashboard, or hosted backend
- repeated prompt execution against the same model
- explicit pairwise similarity and aggregate stability scoring
- run-by-run output review
- inline reference-vs-run diffing for fast variance inspection
- local JSON artifact saving for debugging and replay
- provider abstraction with OpenAI implemented first
Requirements
- Python 3.11+
- An OpenAI API key in OPENAI_API_KEY
Install
python -m venv .venv
.venv\Scripts\activate
python -m pip install -e .[dev]
Configure
Set your API key in the shell:
$env:OPENAI_API_KEY="your_api_key"
You can copy .env.example for reference, but the CLI reads the key from the environment.
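As a rough illustration of what reading the key from the environment can look like, here is a minimal sketch. The helper name read_api_key and its error message are hypothetical and not taken from the package; only the OPENAI_API_KEY variable name comes from this README.

```python
import os


def read_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` is an optional mapping used for testing; it defaults to os.environ.
    This helper is a hypothetical illustration, not the package's own code.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment")
    return key
```

Failing early with a clear message avoids spending time on a run that would fail on its first API call.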
Quick Start
Create a prompt file, for example prompt.txt:
Explain the tradeoffs between unit tests and integration tests in five bullet points.
Run the analyzer:
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
If you want to invoke it through the module instead of the installed script:
python -m ai_stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
Example with a custom JSON output path:
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini --out results\sample-run.json
CLI Command
ai-stability run PROMPT_FILE --n 5 --provider openai --model MODEL_NAME
Current options:
- --n: number of repeated runs, minimum 2
- --provider: currently openai
- --model: target model name
- --temperature: sampling temperature, default 1.0
- --out: optional output file or output directory for the JSON artifact
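The options above could be modeled with a parser along these lines. This is a sketch using argparse from the standard library; the real CLI may be built with a different library. The default of 5 for --n and the choice to make --model required are assumptions, since the README only states the minimum for --n and the default for --temperature.

```python
import argparse


def min_two(value):
    """argparse type check: --n must be an integer of at least 2."""
    n = int(value)
    if n < 2:
        raise argparse.ArgumentTypeError("--n must be at least 2")
    return n


def build_parser():
    """Build a parser mirroring the documented `run` command (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ai-stability")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run a prompt repeatedly and score stability")
    run.add_argument("prompt_file")
    run.add_argument("--n", type=min_two, default=5)  # default is an assumption
    run.add_argument("--provider", choices=["openai"], default="openai")
    run.add_argument("--model", required=True)  # required-ness is an assumption
    run.add_argument("--temperature", type=float, default=1.0)
    run.add_argument("--out", default=None)
    return parser
```

Validating --n in the parser means a value below 2 is rejected before any model call is made, which matters because pairwise similarity needs at least two outputs.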
How Scoring Works
The v1 scoring heuristic is intentionally simple and inspectable:
- normalize whitespace in each output
- compute pairwise text similarity with Python's difflib.SequenceMatcher
- average all pairwise similarity scores
- convert the average to a 0-100 stability score
Stability labels:
- 80-100: High stability
- 50-79: Medium stability
- 0-49: Low stability
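The steps and labels above can be sketched in a few lines of Python. This is an approximation of the described heuristic, not the code in scoring.py: the function names are hypothetical, and rounding the score to the nearest integer is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def pairwise_similarities(outputs):
    """Return the SequenceMatcher ratio for every unordered pair of outputs."""
    cleaned = [normalize(o) for o in outputs]
    return [SequenceMatcher(None, a, b).ratio() for a, b in combinations(cleaned, 2)]


def stability(outputs):
    """Return (score, label), where score is an integer from 0 to 100."""
    sims = pairwise_similarities(outputs)
    if not sims:
        raise ValueError("need at least two outputs to compare")
    score = round(100 * sum(sims) / len(sims))  # rounding is an assumption
    if score >= 80:
        label = "High stability"
    elif score >= 50:
        label = "Medium stability"
    else:
        label = "Low stability"
    return score, label
```

Because SequenceMatcher compares characters, two outputs that say the same thing in different words can still score low. The score measures textual consistency, not semantic agreement.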
What the CLI Prints
- summary first
- average and pairwise similarity
- final stability score and label
- each run output
- a simple reference-vs-run diff for variation review
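A reference-vs-run diff can be produced with the standard library alone. The sketch below assumes the first run serves as the reference and uses a unified diff. Both are assumptions, as the README does not say which run is the reference or which diff format diffing.py emits.

```python
import difflib


def reference_diff(reference, run_output, run_index):
    """Return a unified diff between the reference output and one later run.

    Returns an empty string when the two outputs are identical.
    """
    lines = difflib.unified_diff(
        reference.splitlines(),
        run_output.splitlines(),
        fromfile="run 1 (reference)",
        tofile=f"run {run_index}",
        lineterm="",
    )
    return "\n".join(lines)
```

Calling reference_diff(outputs[0], outputs[i], i + 1) for each later run gives one diff per run, which keeps the review linear in the number of runs rather than quadratic.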
JSON Artifact
By default, results are written to results/ai-stability-YYYYMMDD-HHMMSS.json.
The JSON artifact includes:
- prompt metadata
- provider and model
- all collected outputs
- pairwise similarities
- stability score and label
- human-readable diffs
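The default file name pattern can be reproduced as follows. This is a sketch only: the function names are hypothetical, and the actual artifact schema written by storage.py is not documented here, so any payload keys you pass in are your own.

```python
import json
from datetime import datetime
from pathlib import Path


def default_artifact_path(now=None, directory="results"):
    """Build results/ai-stability-YYYYMMDD-HHMMSS.json from a timestamp."""
    now = now or datetime.now()
    return Path(directory) / f"ai-stability-{now:%Y%m%d-%H%M%S}.json"


def save_artifact(payload, path):
    """Write a JSON-serializable payload to path, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

A second-resolution timestamp keeps artifacts sorted by name, though two runs started within the same second would collide under this simple scheme.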
Example Workflow
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
Use this when you want to compare how stable a model is for a fixed prompt before shipping a prompt change, swapping models, or debugging flaky output behavior.
Run Tests
python -m pytest
Repository Structure
src/ai_stability/
    cli.py
    runner.py
    scoring.py
    diffing.py
    output.py
    storage.py
    providers/
        base.py
        openai_provider.py
tests/
    test_scoring.py
    test_runner.py
Files to Review First
- src/ai_stability/cli.py
- src/ai_stability/runner.py
- src/ai_stability/scoring.py
- src/ai_stability/providers/openai_provider.py
Roadmap Notes
- V1 runs requests sequentially on purpose.
- Only OpenAI is implemented, but the provider boundary is small and ready for Anthropic later.
- The scoring heuristic is intentionally simple and inspectable rather than statistically sophisticated.
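To make the first two notes concrete, here is one way a small provider boundary and a sequential runner could look. Every name in this sketch (Provider, complete, EchoProvider, run_sequentially) is hypothetical and is not taken from providers/base.py or runner.py. The EchoProvider is an offline stand-in so the example runs without an API key.

```python
from typing import Protocol


class Provider(Protocol):
    """Hypothetical provider boundary: one method that returns model text."""

    def complete(self, prompt: str, model: str, temperature: float) -> str: ...


class EchoProvider:
    """Offline stand-in that echoes the prompt; useful for tests only."""

    def __init__(self):
        self.calls = 0

    def complete(self, prompt, model, temperature):
        self.calls += 1
        return f"[{model}] {prompt}"


def run_sequentially(provider, prompt, model, n, temperature=1.0):
    """Call the provider n times, one request after another, and collect outputs."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return [provider.complete(prompt, model, temperature) for _ in range(n)]
```

With a boundary this small, adding another vendor would mean writing one class with a matching complete method, and the runner would not need to change.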