Local-first evaluation and failure-analysis toolkit for LLM and RAG systems.

Model Failure Lab

Model Failure Lab is a local-first evaluation and failure-analysis toolkit for LLM and RAG systems. It helps teams run prompt datasets, classify failures, compare model versions, and turn regressions into reusable test cases.

What It Is

Model Failure Lab focuses on one production loop:

failure -> report -> compare -> harvest -> promote -> rerun

The primary value is not just executing evals but preserving a deterministic artifact history, so that teams can turn regressions into durable datasets and governance decisions.

Quickstart

Use Python 3.11 or newer.

From a local clone:

git clone <repo-url>
cd model-failure-lab
make install
make demo

Useful command shortcuts:

make help
make check
make smoke

Equivalent direct install command:

python3 -m pip install .

Then run the canonical workflow manually:

failure-lab run --dataset reasoning-failures-v1 --model demo
failure-lab report --run <run-id>
failure-lab run --dataset reasoning-failures-v1 --model ollama:llama3.2
failure-lab compare <baseline-run-id> <candidate-run-id>
failure-lab harvest --comparison <comparison-id> --delta regression --out datasets/harvested/regression-pack.json
failure-lab dataset promote datasets/harvested/regression-pack.json --dataset-id reasoning-regressions-v1
failure-lab run --dataset reasoning-regressions-v1 --model demo
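When scripting the later half of this workflow, it can help to assemble the commands in one place and review them before running anything. The following is a minimal dry-run sketch that only prints command lines. The function name `regression_loop_commands` and its parameters are hypothetical; the subcommands and flags are copied from the commands above, and the IDs are placeholders that the caller must take from earlier command output.

```python
import shlex


def regression_loop_commands(
    baseline_run,
    candidate_run,
    comparison_id,
    pack_path="datasets/harvested/regression-pack.json",
    dataset_id="reasoning-regressions-v1",
):
    """Return the compare, harvest, promote, and rerun steps as argv lists.

    Hypothetical helper: flags mirror the README commands; IDs come from the caller.
    """
    return [
        ["failure-lab", "compare", baseline_run, candidate_run],
        ["failure-lab", "harvest", "--comparison", comparison_id,
         "--delta", "regression", "--out", pack_path],
        ["failure-lab", "dataset", "promote", pack_path,
         "--dataset-id", dataset_id],
        ["failure-lab", "run", "--dataset", dataset_id, "--model", "demo"],
    ]


# Print the commands for review; nothing is executed here.
for argv in regression_loop_commands("BASELINE_ID", "CANDIDATE_ID", "COMPARISON_ID"):
    print(shlex.join(argv))
```

Keeping each step as an argv list means the same values can later be passed to `subprocess.run` without shell quoting concerns.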

If your shell does not expose the console script on PATH, use:

python3 -m model_failure_lab demo

Example Output

Prompt case:

"What is 37 * 48?"

Run result:

  • model output: incorrect
  • failure type: reasoning_error
  • classification confidence: high
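For reference, the expected answer to the prompt case is 37 * 48 = 1776. The sketch below shows one simple way such a case could be checked by exact match. It is an illustration only: `is_correct` is a hypothetical helper and does not describe how Model Failure Lab classifies outputs.

```python
def is_correct(model_output: str, expected: int) -> bool:
    """Hypothetical exact-match check: True only if the output parses to the expected integer."""
    try:
        return int(model_output.strip()) == expected
    except ValueError:
        # Non-numeric output counts as incorrect in this sketch.
        return False


expected = 37 * 48  # 1776
print(is_correct("1776", expected))
print(is_correct("1756", expected))
```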

Comparison summary:

  • regression rate: +12%
  • new failure clusters: arithmetic carry errors

CLI transcript (abbreviated):

$ failure-lab run --dataset reasoning-failures-v1 --model demo
Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
Run ID: 20260427_192110_266368_reasoning_failures_v1_demo_...

$ failure-lab report --run 20260427_192110_266368_reasoning_failures_v1_demo_...
Failure Lab Report
Status: completed
Failure types: reasoning=62.5% (5)

$ failure-lab compare <baseline-run-id> <candidate-run-id>
Failure Lab Compare
Status: improved
Compatible: True
Case changes: improvements=1
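To chain `run` into `report`, a script needs the run ID from the first command's output. The sketch below parses the `Key: value` lines shown in the abbreviated transcript and cross-checks the reported failure rate against the counts (5 failing cases out of 8 classified). The output format is assumed from that transcript and may differ in practice; `parse_run_summary` and the placeholder `EXAMPLE_RUN_ID` are hypothetical.

```python
import re


def parse_run_summary(stdout: str) -> dict:
    """Extract run ID, case counts, and failure rate from `failure-lab run` output.

    Assumes the 'Key: value' layout shown in the abbreviated transcript.
    """
    fields = {}
    for line in stdout.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the title line
            fields[key.strip()] = value.strip()
    counts = {
        name: int(number)
        for name, number in re.findall(r"(\w+)=(\d+)", fields.get("Cases", ""))
    }
    return {
        "run_id": fields.get("Run ID"),
        "cases": counts,
        "failure_rate": fields.get("Failure rate"),
    }


sample = """Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
Run ID: EXAMPLE_RUN_ID
"""

summary = parse_run_summary(sample)
print(summary["run_id"])
# 5 failures (from the report step) over 8 classified cases.
print(f"{5 / summary['cases']['classified']:.1%}")
```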

Screenshots

Screenshots are supported and strongly recommended for product clarity.

Place assets under docs/screens/:

  • run-summary.png
  • failure-inventory.png
  • comparison-view.png
  • harvest-replay-workflow.gif

When those files exist, embed them with:

![Run summary](docs/screens/run-summary.png)
![Failure inventory](docs/screens/failure-inventory.png)
![Comparison view](docs/screens/comparison-view.png)
![Harvest replay](docs/screens/harvest-replay-workflow.gif)

Wiring and naming conventions are documented in docs/product-screens.md and docs/screens/README.md.

Core Workflow

failure-lab writes artifact folders under the active root (default: current working directory):

  • datasets/
  • runs/
  • reports/

Comparison outputs are persisted as report artifacts under reports/.

Use --root on commands to target a specific workspace.

For detailed artifact contracts and examples, see docs/artifact-model.md.
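Before pointing `--root` at a directory, a quick check of which artifact folders already exist there can be useful. The sketch below relies only on the three top-level folder names listed above. `missing_artifact_dirs` is a hypothetical helper, and it makes no assumption about when failure-lab creates each folder, so a missing folder may simply mean that step has not been run yet.

```python
from pathlib import Path

# Top-level artifact folders named in this README.
ARTIFACT_DIRS = ("datasets", "runs", "reports")


def missing_artifact_dirs(root: str = ".") -> list[str]:
    """Return the documented artifact folders that do not exist under root."""
    base = Path(root)
    return [name for name in ARTIFACT_DIRS if not (base / name).is_dir()]


# Defaults to the current working directory, like the active root.
print(missing_artifact_dirs("."))
```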

Model Adapters

failure-lab run --model supports:

  • demo for deterministic local execution
  • customer-support-failures-v1 (bundled flagship support-policy pack)
  • ollama:<model>
  • anthropic:<model> (after installing optional dependencies)
  • OpenAI model names (after installing optional dependencies)

Optional extras:

  • python3 -m pip install '.[anthropic]'
  • python3 -m pip install '.[openai]'
  • python3 -m pip install '.[dev]'
  • python3 -m pip install '.[legacy]' (legacy-only surfaces)
  • python3 -m pip install '.[ui]' (legacy Streamlit UI)

If you install from a published distribution, the equivalent extras are model-failure-lab[anthropic], model-failure-lab[openai], model-failure-lab[legacy], and model-failure-lab[ui].

React Debugger

The React debugger reads existing artifact workspaces via:

  • FAILURE_LAB_ARTIFACT_ROOT

Example:

export FAILURE_LAB_ARTIFACT_ROOT=/path/to/failure-lab-workspace
npm --prefix frontend run dev
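A small preflight check can catch an unset or mistyped FAILURE_LAB_ARTIFACT_ROOT before the dev server starts. The sketch below is an assumption-laden illustration: `resolve_artifact_root` is a hypothetical helper, and how the debugger itself behaves when the variable is missing is not described here.

```python
import os
from pathlib import Path


def resolve_artifact_root(env=None) -> Path:
    """Return FAILURE_LAB_ARTIFACT_ROOT as a Path, or raise if unset or not a directory."""
    env = os.environ if env is None else env
    raw = env.get("FAILURE_LAB_ARTIFACT_ROOT")
    if not raw:
        raise RuntimeError("FAILURE_LAB_ARTIFACT_ROOT is not set")
    root = Path(raw).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"{root} is not a directory")
    return root


try:
    print(resolve_artifact_root())
except RuntimeError as exc:
    print(f"preflight failed: {exc}")
```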

Development

make install-dev
make check

Versioning

Before v1.0, this project follows semantic versioning in a practical sense:

  • patch: bug fixes and docs
  • minor: CLI-compatible feature additions
  • breaking: CLI or artifact schema changes

Legacy Surfaces

Legacy surfaces are retained for reference only and are not part of the supported production workflow.

See:

  • docs/legacy.md
  • docs/ui_parity.md
  • docs/v1_4_closeout.md

Documentation

Detailed documentation lives outside this README:

  • Harvest replay: docs/harvest-replay.md
  • Legacy surfaces: docs/legacy.md
  • Fixture workspace: docs/fixture-workspace.md
  • Artifact schema/model: docs/artifact-model.md
  • Adapter extension guide: docs/adapter-extension-guide.md
  • Architecture overview: docs/architecture.md
  • CI governance and waivers: docs/ci-governance.md
  • Contributor code map: docs/code-map.md
  • 5-minute operator quickstart: docs/getting-started-operator.md
  • Release and PyPI guide: docs/release-and-pypi.md

License

This project is licensed under the MIT License. See LICENSE.

