pytest for LLMs - catch prompt regressions before they reach production

Project description

> evalflow

pytest for LLMs


You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
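
The offline cache is what makes repeated checks deterministic and cheap. The sketch below illustrates the general idea, not evalflow's actual implementation: responses are keyed by a hash of model name and prompt, so a re-run with identical inputs never hits the network (all names here are hypothetical).

```python
import hashlib
import json

class OfflineCache:
    """Content-addressed cache: identical (model, prompt) pairs reuse the stored response."""

    def __init__(self):
        self._store = {}

    def key(self, model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call):
        k = self.key(model, prompt)
        if k not in self._store:
            self._store[k] = call(model, prompt)  # provider call only on a cache miss
        return self._store[k]

calls = []
def fake_llm(model, prompt):
    calls.append(prompt)
    return f"response to {prompt!r}"

cache = OfflineCache()
first = cache.get_or_call("gpt-4o-mini", "Summarize X", fake_llm)
second = cache.get_or_call("gpt-4o-mini", "Summarize X", fake_llm)
# second call is served from the cache; fake_llm ran only once
```

In a real tool the store would be persisted (e.g. in SQLite) rather than held in memory, which is what makes CI re-runs both fast and reproducible.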

Terminal Screenshot

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: PASS
Failures: 1
Run ID: 20240315-a3f9c2d81b4e
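
Note that a run can report a failing case and still pass the gate. One plausible gating policy (an assumption for illustration, not necessarily evalflow's rule) is a threshold on the mean score, with per-case failures reported separately:

```python
def quality_gate(scores, gate_threshold=0.8, case_threshold=0.7):
    """Gate on the mean score; list individual cases below their own threshold."""
    mean = sum(scores.values()) / len(scores)
    failures = [name for name, s in scores.items() if s < case_threshold]
    return ("PASS" if mean >= gate_threshold else "FAIL", failures)

scores = {
    "summarize_short_article": 0.91,
    "classify_sentiment": 1.00,
    "extract_entities": 0.87,
    "answer_with_context": 0.61,
    "rewrite_formal": 0.93,
}
status, failures = quality_gate(scores)
# the mean is 0.864, so the gate passes even though one case scored below 0.7
```

The thresholds above are hypothetical; the point is that gate status and per-case failures are independent signals.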

Why evalflow

Traditional unit tests do not tell you when a prompt tweak quietly degrades a task. evalflow gives you a small local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
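
If the offline cache lives in the workspace (the `.evalflow/` path below is an assumption based on where run history is stored), it could be persisted across CI runs with `actions/cache` to keep checks fast and repeatable:

```yaml
      # add before the `evalflow eval` step
      - uses: actions/cache@v4
        with:
          path: .evalflow/
          key: evalflow-cache-${{ hashFiles('prompts/**', 'evals/**') }}
```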

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • exact match, embedding, consistency, and LLM judge methods
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeated and CI-safe checks

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
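
Reading keys from the environment keeps secrets out of tracked files. A minimal version of that pattern (the variable name matches the CI workflow above; the error message is illustrative):

```python
import os

def require_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Fetch a key from the environment, failing loudly instead of falling back to a file."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it in your shell or CI secrets")
    return key
```

Failing loudly on a missing key is deliberate: a silent fallback (to a config file, or to an empty string) is how secrets end up committed or errors end up misattributed to the provider.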

Reporting Security Issues

Please do not open public GitHub issues for security vulnerabilities. Open a private GitHub Security Advisory.

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT

Download files

Download the file for your platform.

Source Distribution

evalflow-0.1.4.tar.gz (50.2 kB)

Uploaded Source

Built Distribution

evalflow-0.1.4-py3-none-any.whl (53.0 kB)

Uploaded Python 3

File details

Details for the file evalflow-0.1.4.tar.gz.

File metadata

  • Download URL: evalflow-0.1.4.tar.gz
  • Upload date:
  • Size: 50.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalflow-0.1.4.tar.gz

  • SHA256: 7828dbb9fbd34c24da4b842812e0f525daf2e6a14379d8a8d4b0f6885c5a24bd
  • MD5: 66a89ca5bf4c4d8d15e5662a7ff326d2
  • BLAKE2b-256: 2ccf287f1fc5a589e399c3a6c8f850c1e4c7bf029bf317bff55ae35b16ffbf96


Provenance

The following attestation bundles were made for evalflow-0.1.4.tar.gz:

Publisher: publish.yml on emartai/evalflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file evalflow-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: evalflow-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 53.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalflow-0.1.4-py3-none-any.whl

  • SHA256: c4301be0e479e566d19c9a4ede9392c6a49bf1742039ae26d8e31e8d26db4d27
  • MD5: 1992150b71128ffcea022a87dc224b80
  • BLAKE2b-256: d403adb8b89bc5fec9143dec4d553560e620fe71f8b4751dc7490fcd7c23da24


Provenance

The following attestation bundles were made for evalflow-0.1.4-py3-none-any.whl:

Publisher: publish.yml on emartai/evalflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
