Local-first evaluation and failure-analysis toolkit for LLM and RAG systems.

Model Failure Lab

Model Failure Lab is a local-first evaluation and failure-analysis toolkit for LLM and RAG systems. It helps teams run prompt datasets, classify failures, compare model versions, and turn regressions into reusable test cases.

What It Is

Model Failure Lab focuses on one production loop:

failure -> report -> compare -> harvest -> promote -> rerun

The primary value is not just executing evals but preserving a deterministic artifact history, so that teams can turn regressions into durable datasets and governance decisions.

Quickstart

Use Python 3.11 or newer.

From a local clone:

git clone <repo-url>
cd model-failure-lab
make install
make demo

Useful command shortcuts:

make help
make check
make smoke

Equivalent direct install command:

python3 -m pip install .

Then run the canonical workflow manually:

failure-lab run --dataset reasoning-failures-v1 --model demo
failure-lab report --run <run-id>
failure-lab run --dataset reasoning-failures-v1 --model ollama:llama3.2
failure-lab compare <baseline-run-id> <candidate-run-id>
failure-lab harvest --comparison <comparison-id> --delta regression --out datasets/harvested/regression-pack.json
failure-lab dataset promote datasets/harvested/regression-pack.json --dataset-id reasoning-regressions-v1
failure-lab run --dataset reasoning-regressions-v1 --model demo
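For scripting, the same sequence can be held as data before anything is executed. The sketch below only builds the argument lists for the commands shown above; `workflow_commands` and its parameter names are hypothetical helpers, not part of the toolkit. Run and comparison IDs are left as the same placeholders used above, because they are only known once the earlier steps have finished.

```python
import shlex


def workflow_commands(dataset, baseline_model, candidate_model,
                      harvest_out, promoted_dataset_id):
    """Return the canonical workflow as a list of argv lists.

    Hypothetical helper: it only assembles the commands listed in this
    README and never runs them.
    """
    return [
        ["failure-lab", "run", "--dataset", dataset, "--model", baseline_model],
        ["failure-lab", "report", "--run", "<run-id>"],
        ["failure-lab", "run", "--dataset", dataset, "--model", candidate_model],
        ["failure-lab", "compare", "<baseline-run-id>", "<candidate-run-id>"],
        ["failure-lab", "harvest", "--comparison", "<comparison-id>",
         "--delta", "regression", "--out", harvest_out],
        ["failure-lab", "dataset", "promote", harvest_out,
         "--dataset-id", promoted_dataset_id],
        ["failure-lab", "run", "--dataset", promoted_dataset_id,
         "--model", baseline_model],
    ]


commands = workflow_commands(
    dataset="reasoning-failures-v1",
    baseline_model="demo",
    candidate_model="ollama:llama3.2",
    harvest_out="datasets/harvested/regression-pack.json",
    promoted_dataset_id="reasoning-regressions-v1",
)

# Print each step as a shell-quoted line for review.
for argv in commands:
    print(shlex.join(argv))
```

Keeping the steps as argv lists makes it straightforward to substitute real IDs and pass each list to `subprocess.run` once the previous step has produced them.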

If your shell does not expose the console script on PATH, use:

python3 -m model_failure_lab demo

Example Output

Prompt case:

"What is 37 * 48?"

Run result:

model output: incorrect
failure type: reasoning_error
classification confidence: high

Comparison summary:

regression rate: +12%
new failure clusters: arithmetic carry errors
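For reference, the correct answer to the prompt case is 37 * 48 = 1776. The sketch below shows one way such a case could be checked by hand; it is an illustration only and makes no claim about how the toolkit classifies outputs. The function `is_correct` and the rule "take the last integer in the output as the final answer" are assumptions.

```python
import re

EXPECTED = 37 * 48  # 1776


def is_correct(model_output: str, expected: int = EXPECTED) -> bool:
    """Treat the last integer in the output as the model's final answer.

    Hypothetical check: thousands separators are dropped before matching.
    """
    numbers = re.findall(r"-?\d+", model_output.replace(",", ""))
    return bool(numbers) and int(numbers[-1]) == expected


print(is_correct("37 * 48 = 1776"))
print(is_correct("37 * 48 = 1676"))  # a made-up wrong answer
```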

CLI transcript (abbreviated):

$ failure-lab run --dataset reasoning-failures-v1 --model demo
Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
Run ID: 20260427_192110_266368_reasoning_failures_v1_demo_...

$ failure-lab report --run 20260427_192110_266368_reasoning_failures_v1_demo_...
Failure Lab Report
Status: completed
Failure types: reasoning=62.5% (5)

$ failure-lab compare <baseline-run-id> <candidate-run-id>
Failure Lab Compare
Status: improved
Compatible: True
Case changes: improvements=1
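The numbers in the transcript are consistent with each other: 5 failures out of 8 attempted cases is 62.5%. The sketch below reproduces that arithmetic and pulls the counts out of the `Cases:` line. It assumes the failure rate is failures divided by attempted cases and that the human-readable line keeps the `key=value` shape shown above; neither is a documented contract, and `parse_cases_line` and `failure_rate` are hypothetical names.

```python
import re


def parse_cases_line(line: str) -> dict:
    """Parse 'Cases: attempted=8 classified=8 errors=0' into integer counts."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def failure_rate(failed: int, attempted: int) -> float:
    """Failure rate as a percentage of attempted cases (assumed definition)."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return 100.0 * failed / attempted


counts = parse_cases_line("Cases: attempted=8 classified=8 errors=0")
print(counts)
print(f"{failure_rate(5, counts['attempted']):.1f}%")  # 5 of 8 -> 62.5%
```

For anything beyond a quick check, the persisted artifacts are likely a better source than terminal output; see docs/artifact-model.md.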

Screenshots

Screenshots are supported and strongly recommended for product clarity.

Place assets under docs/screens/:

run-summary.png
failure-inventory.png
comparison-view.png
harvest-replay-workflow.gif

When those files exist, embed them with:

![Run summary](docs/screens/run-summary.png)
![Failure inventory](docs/screens/failure-inventory.png)
![Comparison view](docs/screens/comparison-view.png)
![Harvest replay](docs/screens/harvest-replay-workflow.gif)

Reference wiring and naming live in docs/product-screens.md and docs/screens/README.md.

Core Workflow

failure-lab writes artifact folders under the active root (default: current working directory):

datasets/
runs/
reports/

Comparison outputs are persisted as report artifacts under reports/.

Use --root on commands to target a specific workspace.

For detailed artifact contracts and examples, see docs/artifact-model.md.
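As a quick orientation aid, the following sketch maps a workspace root to the three folders named above and reports which of them do not exist yet. It relies only on the folder names listed in this section; `artifact_layout` and `missing_dirs` are hypothetical helpers, and the internal structure of each folder is not assumed.

```python
import tempfile
from pathlib import Path

# Folder names as listed in this README.
ARTIFACT_DIRS = ("datasets", "runs", "reports")


def artifact_layout(root=".") -> dict:
    """Map each artifact folder name to its path under the given root."""
    root_path = Path(root)
    return {name: root_path / name for name in ARTIFACT_DIRS}


def missing_dirs(root=".") -> list:
    """Return the artifact folder names that are not present under root."""
    return [name for name, path in artifact_layout(root).items()
            if not path.is_dir()]


# Demonstration on a throwaway workspace that only contains runs/.
with tempfile.TemporaryDirectory() as workspace:
    (Path(workspace) / "runs").mkdir()
    demo_missing = missing_dirs(workspace)

print(demo_missing)
```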

Model Adapters

failure-lab run --model supports:

demo for deterministic local execution
customer-support-failures-v1 bundled flagship support-policy pack
ollama:<model>
anthropic:<model> (after installing optional dependencies)
OpenAI model names (after installing optional dependencies)
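To make the accepted value shapes concrete, here is a minimal sketch that splits a `--model` value into an adapter name and a model name. It is not the toolkit's resolution logic, which this README does not describe. `ModelSpec` and `parse_model_spec` are hypothetical, the rule "any unprefixed name is an OpenAI model name" is an assumption drawn from the list above, `gpt-4o-mini` is only an example value, and the bundled customer-support-failures-v1 entry is left out.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    adapter: str
    model: Optional[str]


# Prefixes that appear in the list above in the form "<prefix>:<model>".
PREFIXED_ADAPTERS = ("ollama", "anthropic")


def parse_model_spec(value: str) -> ModelSpec:
    """Split a --model value into (adapter, model). Illustrative only."""
    if value == "demo":
        return ModelSpec("demo", None)
    prefix, sep, name = value.partition(":")
    if sep and prefix in PREFIXED_ADAPTERS:
        if not name:
            raise ValueError(f"missing model name after '{prefix}:'")
        return ModelSpec(prefix, name)
    # Assumption: unprefixed values are OpenAI model names.
    return ModelSpec("openai", value)


for value in ("demo", "ollama:llama3.2", "gpt-4o-mini"):
    print(value, "->", parse_model_spec(value))
```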

Optional extras:

python3 -m pip install '.[anthropic]'
python3 -m pip install '.[openai]'
python3 -m pip install '.[dev]'
python3 -m pip install '.[legacy]'  # legacy-only surfaces
python3 -m pip install '.[ui]'      # legacy Streamlit UI

When installing from a published distribution rather than a local clone, the equivalent forms are model-failure-lab[anthropic], model-failure-lab[openai], model-failure-lab[legacy], and model-failure-lab[ui].

React Debugger

The React debugger reads existing artifact workspaces from the path set in this environment variable:

FAILURE_LAB_ARTIFACT_ROOT

Example:

export FAILURE_LAB_ARTIFACT_ROOT=/path/to/failure-lab-workspace
npm --prefix frontend run dev
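Before starting the dev server, it can help to confirm that the variable points at a real directory. The sketch below is a hypothetical pre-flight check, not part of the debugger; `resolve_artifact_root` is an invented name, and no fallback is assumed when the variable is unset, since this README does not describe one.

```python
import os
from pathlib import Path


def resolve_artifact_root(environ=None) -> Path:
    """Return the workspace path from FAILURE_LAB_ARTIFACT_ROOT or raise."""
    env = os.environ if environ is None else environ
    raw = env.get("FAILURE_LAB_ARTIFACT_ROOT")
    if not raw:
        raise RuntimeError("FAILURE_LAB_ARTIFACT_ROOT is not set")
    root = Path(raw).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"artifact root is not a directory: {root}")
    return root


# Example with an explicit mapping instead of the real environment.
print(resolve_artifact_root({"FAILURE_LAB_ARTIFACT_ROOT": "."}))
```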

Development

make install-dev
make check

Versioning

Before v1.0, this project follows semantic versioning in a practical sense:

patch: bug fixes and docs
minor: CLI-compatible feature additions
breaking: CLI or artifact schema changes

Legacy Surfaces

Legacy surfaces are retained for reference only and are not part of the supported production workflow.

See:

docs/legacy.md
docs/ui_parity.md
docs/v1_4_closeout.md

Documentation

Detailed documentation lives outside this README:

Harvest replay: docs/harvest-replay.md
Legacy surfaces: docs/legacy.md
Fixture workspace: docs/fixture-workspace.md
Artifact schema/model: docs/artifact-model.md
Adapter extension guide: docs/adapter-extension-guide.md
Architecture overview: docs/architecture.md
CI governance and waivers: docs/ci-governance.md
Contributor code map: docs/code-map.md
5-minute operator quickstart: docs/getting-started-operator.md
Release and PyPI guide: docs/release-and-pypi.md

License

This project is licensed under the MIT License. See LICENSE.

Project details

This version

0.1.0

Apr 27, 2026

Download files

Source Distribution

model_failure_lab-0.1.0.tar.gz (244.5 kB view details)

Uploaded Apr 27, 2026 Source

Built Distribution

model_failure_lab-0.1.0-py3-none-any.whl (316.0 kB view details)

Uploaded Apr 27, 2026 Python 3

File details

Details for the file model_failure_lab-0.1.0.tar.gz.

File metadata

Download URL: model_failure_lab-0.1.0.tar.gz
Upload date: Apr 27, 2026
Size: 244.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for model_failure_lab-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b72df7b041f3c7b45e017cc71306f30548e24e9eb53ecd46ba7cfedbfdf0a0a7`
MD5	`523a134ff3b9f5babeb3007533ec1319`
BLAKE2b-256	`4a8873566c709e0d49bbef49f678921e3c2ed9dc6edb31f98da9950a305ca2cc`
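A downloaded archive can be checked against the SHA256 digest in the table above. The sketch below uses only the standard library; `sha256_of` and `matches_expected` are hypothetical helper names, and the file is assumed to sit in the current directory under its published name.

```python
import hashlib

# SHA256 digest for model_failure_lab-0.1.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "b72df7b041f3c7b45e017cc71306f30548e24e9eb53ecd46ba7cfedbfdf0a0a7"


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()
```

Usage, once the file has been downloaded: `matches_expected("model_failure_lab-0.1.0.tar.gz")`.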

File details

Details for the file model_failure_lab-0.1.0-py3-none-any.whl.

File metadata

Download URL: model_failure_lab-0.1.0-py3-none-any.whl
Upload date: Apr 27, 2026
Size: 316.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for model_failure_lab-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9ff3b1fb36e10bf50ca5c4ed76fa2d759c03b49fe6d3638134ecf58d64630fa5`
MD5	`edfd9c331b8785c039adda1d563a7cf5`
BLAKE2b-256	`b71f4e678cdd290264eaee97da7219c01c9230c62767c24da974abc75cbae38a`
