CLI-first LLM stability analyzer for measuring output consistency across repeated prompt runs.

ai-stability

License: MIT

ai-stability is a CLI-first LLM stability analyzer for developers who want to measure output consistency, detect prompt variance, and inspect unstable model behavior locally.

It runs the same prompt multiple times against the same model, compares the responses, computes a simple stability score, and saves a local JSON artifact for replay and debugging.

Why It Exists

LLM outputs often vary even when the prompt, model, and calling code stay the same. That makes it harder to:

  • evaluate prompt reliability
  • spot regressions during model upgrades
  • understand whether output drift is minor wording variance or meaningful behavior change
  • build confidence in AI-powered developer tooling

ai-stability is intentionally narrow and local-first:

  • one prompt file in
  • repeated model calls
  • simple, explicit similarity scoring
  • readable terminal output
  • JSON artifact saved locally for replay and debugging

Features

  • CLI-first workflow with no database, dashboard, or hosted backend
  • repeated prompt execution against the same model
  • explicit pairwise similarity and aggregate stability scoring
  • run-by-run output review
  • inline reference-vs-run diffing for fast variance inspection
  • local JSON artifact saving for debugging and replay
  • provider abstraction with OpenAI implemented first

Requirements

  • Python 3.11+
  • An OpenAI API key in OPENAI_API_KEY

Install

python -m venv .venv
.venv\Scripts\activate
python -m pip install -e .[dev]

The activation command above is the Windows form. On macOS or Linux, activate the environment with source .venv/bin/activate instead.

Configure

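Set your API key in the shell. The example below uses PowerShell syntax; in bash or zsh, use export OPENAI_API_KEY="your_api_key" instead: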

$env:OPENAI_API_KEY="your_api_key"

You can copy .env.example for reference, but the CLI reads the key from the environment.
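
As a rough illustration of reading the key from the environment, here is a minimal sketch. The function name require_api_key and the error message are hypothetical and are not taken from the package source; only the OPENAI_API_KEY variable name comes from this README.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI key from the environment, or fail with a clear message.

    Hypothetical helper: the real CLI may validate the key differently.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment")
    return key
```

Passing the environment as a mapping keeps the check easy to test without touching the real shell environment.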

Quick Start

Create a prompt file, for example prompt.txt:

Explain the tradeoffs between unit tests and integration tests in five bullet points.

Run the analyzer:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

If you want to invoke it through the module instead of the installed script:

python -m ai_stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Example with a custom JSON output path:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini --out results\sample-run.json

CLI Command

ai-stability run PROMPT_FILE --n 5 --provider openai --model MODEL_NAME

Current options:

  • --n: number of repeated runs, minimum 2
  • --provider: currently openai
  • --model: target model name
  • --temperature: sampling temperature, default 1.0
  • --out: optional output file or output directory for the JSON artifact
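
The options above could be modeled with argparse roughly as follows. This is only a sketch: the README does not say which CLI library the project uses, and the default for --n, the default for --provider, and whether --model is required are assumptions made for illustration.

```python
import argparse


def at_least_two(value: str) -> int:
    """Parse --n and enforce the documented minimum of 2 runs."""
    n = int(value)
    if n < 2:
        raise argparse.ArgumentTypeError("--n must be at least 2")
    return n


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ai-stability")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run a prompt repeatedly and score stability")
    run.add_argument("prompt_file")
    # Assumption: 5 mirrors the README examples; the real default is not documented.
    run.add_argument("--n", type=at_least_two, default=5)
    # Assumption: openai is the default because it is the only provider today.
    run.add_argument("--provider", choices=["openai"], default="openai")
    # Assumption: the model must be given explicitly.
    run.add_argument("--model", required=True)
    run.add_argument("--temperature", type=float, default=1.0)
    run.add_argument("--out", default=None, help="output file or directory")
    return parser
```

For example, build_parser().parse_args(["run", "prompt.txt", "--n", "3", "--model", "gpt-4.1-mini"]) yields a namespace with n set to 3 and temperature left at 1.0.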

How Scoring Works

The v1 scoring heuristic is intentionally simple and inspectable:

  1. normalize whitespace in each output
  2. compute pairwise text similarity with Python's difflib.SequenceMatcher
  3. average all pairwise similarity scores
  4. convert the average to a 0-100 stability score
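
The four steps above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and details such as rounding or character-level versus token-level comparison are assumptions; the package's scoring.py may differ.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Step 1: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def pairwise_similarities(outputs: list[str]) -> list[float]:
    """Step 2: SequenceMatcher ratio for every unordered pair of outputs."""
    cleaned = [normalize(o) for o in outputs]
    return [SequenceMatcher(None, a, b).ratio() for a, b in combinations(cleaned, 2)]


def stability_score(outputs: list[str]) -> float:
    """Steps 3 and 4: average the pairwise ratios and scale to 0-100."""
    sims = pairwise_similarities(outputs)
    if not sims:
        raise ValueError("need at least two outputs to compare")
    return 100.0 * sum(sims) / len(sims)
```

With n runs there are n * (n - 1) / 2 pairs, so five runs produce ten similarity values. Outputs that differ only in whitespace score 100 under this sketch.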

Stability labels:

  • 80-100: High stability
  • 50-79: Medium stability
  • 0-49: Low stability
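
A label lookup for these ranges might look like the sketch below. The README lists integer ranges only, so treating 80 and 50 as lower bounds for fractional scores (79.5 would be Medium here) is an assumption, as is the function name.

```python
def stability_label(score: float) -> str:
    """Map a 0-100 stability score to the README's three labels.

    Assumption: thresholds are inclusive lower bounds.
    """
    if score >= 80:
        return "High stability"
    if score >= 50:
        return "Medium stability"
    return "Low stability"
```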

What the CLI Prints

  • summary first
  • average and pairwise similarity
  • final stability score and label
  • each run output
  • a simple reference-vs-run diff for variation review
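
One way to produce a reference-vs-run diff is with difflib.unified_diff, as sketched here. Both the choice of the first run as the reference and the unified diff format are assumptions; the package's diffing.py may pick a different reference or render the diff inline.

```python
from difflib import unified_diff


def reference_diffs(outputs: list[str]) -> list[str]:
    """Diff every later run against the first run, line by line.

    Assumption: run 1 is the reference. Returns one diff string per
    later run; an empty string means the run matched the reference.
    """
    reference = outputs[0].splitlines()
    diffs = []
    for index, output in enumerate(outputs[1:], start=2):
        lines = unified_diff(
            reference,
            output.splitlines(),
            fromfile="run 1 (reference)",
            tofile=f"run {index}",
            lineterm="",
        )
        diffs.append("\n".join(lines))
    return diffs
```

Lines prefixed with - appear only in the reference and lines prefixed with + appear only in the compared run, which makes wording drift easy to scan.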

JSON Artifact

By default, results are written to results/ai-stability-YYYYMMDD-HHMMSS.json.

The JSON artifact includes:

  • prompt metadata
  • provider and model
  • all collected outputs
  • pairwise similarities
  • stability score and label
  • human-readable diffs
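
The default file name pattern and a basic save step could be sketched as follows. The helper names are hypothetical, and the artifact's key names are not documented in this README, so the sketch accepts any JSON-serializable dictionary rather than guessing a schema.

```python
import json
from datetime import datetime
from pathlib import Path


def default_artifact_path(now: datetime, directory: str = "results") -> Path:
    """Build results/ai-stability-YYYYMMDD-HHMMSS.json from a timestamp."""
    return Path(directory) / f"ai-stability-{now:%Y%m%d-%H%M%S}.json"


def save_artifact(path: Path, artifact: dict) -> Path:
    """Write the artifact as indented JSON, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(artifact, indent=2), encoding="utf-8")
    return path
```

Because the artifact is plain JSON, a saved run can be reloaded with json.loads(path.read_text(encoding="utf-8")) for later inspection.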

Example Workflow

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Use this when you want to compare how stable a model is for a fixed prompt before shipping a prompt change, swapping models, or debugging flaky output behavior.

Run Tests

python -m pytest

Repository Structure

src/ai_stability/
  cli.py
  runner.py
  scoring.py
  diffing.py
  output.py
  storage.py
  providers/
    base.py
    openai_provider.py
tests/
  test_scoring.py
  test_runner.py

Files to Review First

  • src/ai_stability/cli.py
  • src/ai_stability/runner.py
  • src/ai_stability/scoring.py
  • src/ai_stability/providers/openai_provider.py

Roadmap Notes

  • V1 runs requests sequentially on purpose.
  • Only OpenAI is implemented, but the provider boundary is small and ready for Anthropic later.
  • The scoring heuristic is intentionally simple and inspectable rather than statistically sophisticated.
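
To make the first two notes concrete, here is a hedged sketch of what a small provider boundary and a sequential run loop could look like. The Provider protocol, its generate signature, and the EchoProvider stand-in are all hypothetical; the actual interface in providers/base.py is not shown in this README.

```python
from typing import Protocol


class Provider(Protocol):
    """Hypothetical provider boundary: one method that returns model text."""

    def generate(self, prompt: str, model: str, temperature: float) -> str: ...


class EchoProvider:
    """Offline stand-in for illustration only; it makes no network calls."""

    def __init__(self) -> None:
        self.calls = 0

    def generate(self, prompt: str, model: str, temperature: float) -> str:
        self.calls += 1
        return f"{prompt} (call {self.calls})"


def collect_outputs(
    provider: Provider,
    prompt: str,
    model: str,
    n: int,
    temperature: float = 1.0,
) -> list[str]:
    """Call the provider n times, one request after another."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return [provider.generate(prompt, model, temperature) for _ in range(n)]
```

Under this design, adding another provider would mean writing one more class with a matching generate method, while the run loop and scoring stay unchanged.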
