Local-first evaluation and failure-analysis toolkit for LLM and RAG systems.
Model Failure Lab
Model Failure Lab is a local-first evaluation and failure-analysis toolkit for LLM and RAG systems. It helps teams run prompt datasets, classify failures, compare model versions, and turn regressions into reusable test cases.
What It Is
Model Failure Lab focuses on one production loop:
failure -> report -> compare -> harvest -> promote -> rerun
The primary value lies not only in executing evals but in preserving a deterministic artifact history, so teams can turn regressions into durable datasets and governance decisions.
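The compare and harvest steps of this loop can be pictured with a small sketch. It is not the tool's implementation: the per-case outcome maps, the helper names (`classify_delta`, `compare_runs`, `harvest`), and every label other than `regression` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical per-case outcomes: True means the case failed in that run.
Outcomes = Dict[str, bool]


def classify_delta(baseline_failed: bool, candidate_failed: bool) -> str:
    """Label how one case changed between a baseline and a candidate run."""
    if baseline_failed and not candidate_failed:
        return "improvement"
    if candidate_failed and not baseline_failed:
        return "regression"
    return "unchanged"


def compare_runs(baseline: Outcomes, candidate: Outcomes) -> Dict[str, str]:
    """Compare only the cases that appear in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        case_id: classify_delta(baseline[case_id], candidate[case_id])
        for case_id in shared
    }


def harvest(deltas: Dict[str, str], delta: str = "regression") -> List[str]:
    """Collect the case ids with the requested delta so they can be rerun."""
    return [case_id for case_id, label in deltas.items() if label == delta]


# Illustrative data only; these case ids are made up.
baseline = {"case-1": False, "case-2": True, "case-3": False}
candidate = {"case-1": True, "case-2": False, "case-3": False}

deltas = compare_runs(baseline, candidate)
regression_pack = harvest(deltas)
print(regression_pack)
```

Promoting the harvested ids into a dataset and rerunning them closes the loop: a regression seen once becomes a permanent test case.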
Quickstart
Use Python 3.11 or newer.
From a local clone:
git clone <repo-url>
cd model-failure-lab
make install
make demo
Useful command shortcuts:
make help
make check
make smoke
Equivalent direct install command:
python3 -m pip install .
Then run the canonical workflow manually:
failure-lab run --dataset reasoning-failures-v1 --model demo
failure-lab report --run <run-id>
failure-lab run --dataset reasoning-failures-v1 --model ollama:llama3.2
failure-lab compare <baseline-run-id> <candidate-run-id>
failure-lab harvest --comparison <comparison-id> --delta regression --out datasets/harvested/regression-pack.json
failure-lab dataset promote datasets/harvested/regression-pack.json --dataset-id reasoning-regressions-v1
failure-lab run --dataset reasoning-regressions-v1 --model demo
If your shell does not expose the console script on PATH, use:
python3 -m model_failure_lab demo
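For scripted use, the later steps of the manual workflow above could be assembled as argument lists. This is a minimal sketch: the `workflow_commands` helper and its parameter names are hypothetical, while the subcommands and flags are copied from the commands listed above. The run and comparison ids are left as placeholders because they have to be copied from the CLI's own output.

```python
import shlex
from typing import List


def workflow_commands(
    baseline_run_id: str,
    candidate_run_id: str,
    comparison_id: str,
    pack_path: str = "datasets/harvested/regression-pack.json",
    promoted_dataset_id: str = "reasoning-regressions-v1",
) -> List[List[str]]:
    """Return the compare -> harvest -> promote -> rerun commands as argv lists."""
    return [
        ["failure-lab", "compare", baseline_run_id, candidate_run_id],
        ["failure-lab", "harvest", "--comparison", comparison_id,
         "--delta", "regression", "--out", pack_path],
        ["failure-lab", "dataset", "promote", pack_path,
         "--dataset-id", promoted_dataset_id],
        ["failure-lab", "run", "--dataset", promoted_dataset_id, "--model", "demo"],
    ]


commands = workflow_commands(
    "<baseline-run-id>", "<candidate-run-id>", "<comparison-id>"
)
for argv in commands:
    # shlex.join quotes each argument, so the printed line is safe to paste into a shell.
    print(shlex.join(argv))
```

Once real ids are substituted, each list can be passed to `subprocess.run(argv, check=True)` so that a failing step stops the sequence.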
Example Output
Prompt case:
"What is 37 * 48?"
Run result:
- model output: incorrect
- failure type: reasoning_error
- classification confidence: high
Comparison summary:
- regression rate: +12%
- new failure clusters: arithmetic carry errors
CLI transcript (abbreviated):
$ failure-lab run --dataset reasoning-failures-v1 --model demo
Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
Run ID: 20260427_192110_266368_reasoning_failures_v1_demo_...
$ failure-lab report --run 20260427_192110_266368_reasoning_failures_v1_demo_...
Failure Lab Report
Status: completed
Failure types: reasoning=62.5% (5)
$ failure-lab compare <baseline-run-id> <candidate-run-id>
Failure Lab Compare
Status: improved
Compatible: True
Case changes: improvements=1
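The numbers in the transcript are consistent with each other: 5 failed cases out of 8 attempted gives the reported 62.5%. The sketch below parses the `Cases:` line and recomputes that rate. The helper names are hypothetical, and using `attempted` as the denominator is an assumption (the transcript does not say whether the tool divides by attempted or classified cases; both are 8 here).

```python
import re
from typing import Dict

CASES_LINE = re.compile(r"^Cases:\s+(.*)$")


def parse_case_counts(line: str) -> Dict[str, int]:
    """Parse a line such as 'Cases: attempted=8 classified=8 errors=0'."""
    match = CASES_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a Cases line: {line!r}")
    counts: Dict[str, int] = {}
    for pair in match.group(1).split():
        key, _, value = pair.partition("=")
        counts[key] = int(value)
    return counts


def failure_rate_percent(failed: int, attempted: int) -> float:
    """Failure rate as a percentage; an empty run counts as 0%."""
    if attempted == 0:
        return 0.0
    return 100.0 * failed / attempted


counts = parse_case_counts("Cases: attempted=8 classified=8 errors=0")
# The report line 'Failure types: reasoning=62.5% (5)' gives 5 failed cases.
rate = failure_rate_percent(failed=5, attempted=counts["attempted"])
print(counts, f"{rate:.1f}%")
```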
Screenshots
Screenshots are supported and strongly recommended for product clarity.
Place assets under docs/screens/:
- run-summary.png
- failure-inventory.png
- comparison-view.png
- harvest-replay-workflow.gif
When those files exist, embed them in the README with Markdown image links that point at the files under docs/screens/.
Reference wiring and naming live in docs/product-screens.md and docs/screens/README.md.
Core Workflow
failure-lab writes artifact folders under the active root (default: current working directory):
- datasets/
- runs/
- reports/
Comparison outputs are persisted as report artifacts under reports/.
Use --root on commands to target a specific workspace.
For detailed artifact contracts and examples, see docs/artifact-model.md.
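The folder layout described above can be sketched with `pathlib`. This only illustrates the documented convention (three folders under a root that defaults to the current working directory); the `artifact_paths` helper is hypothetical and is not part of the package's API.

```python
from pathlib import Path
from typing import Dict, Optional

# Folder names taken from the layout described above.
ARTIFACT_FOLDERS = ("datasets", "runs", "reports")


def artifact_paths(root: Optional[str] = None) -> Dict[str, Path]:
    """Map each artifact folder name to its path under the active root."""
    base = Path(root) if root is not None else Path.cwd()
    return {name: base / name for name in ARTIFACT_FOLDERS}


paths = artifact_paths("/path/to/failure-lab-workspace")
for name, path in paths.items():
    print(f"{name}: {path} (exists: {path.is_dir()})")
```

Calling `artifact_paths()` with no argument mirrors the default behavior, while passing a path corresponds to using `--root`.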
Model Adapters
failure-lab run --model supports:
- demo for deterministic local execution
- customer-support-failures-v1 bundled flagship support-policy pack
- ollama:<model>
- anthropic:<model> (after installing optional dependencies)
- OpenAI model names (after installing optional dependencies)
Optional extras:
- python3 -m pip install '.[anthropic]'
- python3 -m pip install '.[openai]'
- python3 -m pip install '.[dev]'
- python3 -m pip install '.[legacy]' (legacy-only surfaces)
- python3 -m pip install '.[ui]' (legacy Streamlit UI)
If installing from a published distribution in the future, the equivalent form is
model-failure-lab[anthropic], model-failure-lab[openai], model-failure-lab[legacy],
and model-failure-lab[ui].
React Debugger
The React debugger reads existing artifact workspaces via:
FAILURE_LAB_ARTIFACT_ROOT
Example:
export FAILURE_LAB_ARTIFACT_ROOT=/path/to/failure-lab-workspace
npm --prefix frontend run dev
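Before starting the dev server, it can help to confirm that the variable points at a real workspace. The check below is a hedged sketch: both helper names are hypothetical, and it assumes the debugger expects the same datasets/, runs/, and reports/ layout described under Core Workflow.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Assumption: the debugger reads the same folders that failure-lab writes.
EXPECTED_FOLDERS = ("datasets", "runs", "reports")


def resolve_artifact_root(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the workspace path from FAILURE_LAB_ARTIFACT_ROOT, or None if unset."""
    value = env.get("FAILURE_LAB_ARTIFACT_ROOT", "").strip()
    return Path(value).expanduser() if value else None


def missing_folders(root: Path) -> List[str]:
    """List the expected artifact folders that do not exist under root."""
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


root = resolve_artifact_root()
if root is None:
    print("FAILURE_LAB_ARTIFACT_ROOT is not set")
else:
    missing = missing_folders(root)
    if missing:
        print(f"{root}: missing {missing}")
    else:
        print(f"{root}: looks like a workspace")
```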
Development
make install-dev
make check
Versioning
Before v1.0, this project follows semantic versioning in a practical sense:
- patch: bug fixes and docs
- minor: CLI-compatible feature additions
- breaking: CLI or artifact schema changes
Legacy Surfaces
Legacy surfaces are retained for reference only and are not part of the supported production workflow.
See:
- docs/legacy.md
- docs/ui_parity.md
- docs/v1_4_closeout.md
Documentation
Detailed docs moved out of this README:
- Harvest replay: docs/harvest-replay.md
- Legacy surfaces: docs/legacy.md
- Fixture workspace: docs/fixture-workspace.md
- Artifact schema/model: docs/artifact-model.md
- Adapter extension guide: docs/adapter-extension-guide.md
- Architecture overview: docs/architecture.md
- CI governance and waivers: docs/ci-governance.md
- Contributor code map: docs/code-map.md
- 5-minute operator quickstart: docs/getting-started-operator.md
- Release and PyPI guide: docs/release-and-pypi.md
License
This project is licensed under the MIT License. See LICENSE.
File details
Details for the file model_failure_lab-0.1.0.tar.gz.
File metadata
- Download URL: model_failure_lab-0.1.0.tar.gz
- Upload date:
- Size: 244.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b72df7b041f3c7b45e017cc71306f30548e24e9eb53ecd46ba7cfedbfdf0a0a7 |
| MD5 | 523a134ff3b9f5babeb3007533ec1319 |
| BLAKE2b-256 | 4a8873566c709e0d49bbef49f678921e3c2ed9dc6edb31f98da9950a305ca2cc |
File details
Details for the file model_failure_lab-0.1.0-py3-none-any.whl.
File metadata
- Download URL: model_failure_lab-0.1.0-py3-none-any.whl
- Upload date:
- Size: 316.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9ff3b1fb36e10bf50ca5c4ed76fa2d759c03b49fe6d3638134ecf58d64630fa5 |
| MD5 | edfd9c331b8785c039adda1d563a7cf5 |
| BLAKE2b-256 | b71f4e678cdd290264eaee97da7219c01c9230c62767c24da974abc75cbae38a |
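The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the files sit in the current directory under their published names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_sha256: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_sha256.strip().lower()


# SHA256 digests copied from the file details listed above.
EXPECTED = {
    "model_failure_lab-0.1.0.tar.gz":
        "b72df7b041f3c7b45e017cc71306f30548e24e9eb53ecd46ba7cfedbfdf0a0a7",
    "model_failure_lab-0.1.0-py3-none-any.whl":
        "9ff3b1fb36e10bf50ca5c4ed76fa2d759c03b49fe6d3638134ecf58d64630fa5",
}

for name, expected in EXPECTED.items():
    path = Path(name)
    if path.exists():
        print(name, "OK" if verify(path, expected) else "MISMATCH")
    else:
        print(name, "not downloaded")
```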