> evalflow

pytest for LLMs: catch prompt regressions before they reach production

You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
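The exact files that `evalflow init` scaffolds are not shown here; as a hypothetical sketch of the kind of config involved (every key name below is invented, so consult the generated files for the real schema):

```yaml
# evalflow.yaml -- hypothetical sketch; run `evalflow init` for the real layout
model: gpt-4o-mini
api_key_env: OPENAI_API_KEY   # name of the env var, never the secret itself
threshold: 0.8                # minimum score for the quality gate
prompts_dir: prompts/         # versioned prompt YAML files
dataset: evals/cases.yaml
```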

Terminal Screenshot

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: PASS
Failures: 1
Run ID: 20240315-a3f9c2d81b4e

Why evalflow

Traditional unit tests do not tell you when a prompt tweak quietly degrades a task. evalflow gives you a small local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop
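The regression-gate idea can be sketched in a few lines of Python. This is an illustration of the concept, not evalflow's internal code; the function name and tolerance value are made up:

```python
def gate(scores: dict[str, float], baseline: dict[str, float],
         tolerance: float = 0.05) -> list[str]:
    """Return the names of test cases that regressed past the tolerance.

    A case regresses when its new score drops more than `tolerance`
    below the baseline snapshot, even if it is still above an
    absolute threshold -- which is how silent breakage gets caught.
    """
    return [
        name
        for name, old in baseline.items()
        if scores.get(name, 0.0) < old - tolerance
    ]

baseline = {"summarize": 0.90, "classify": 1.00}
current = {"summarize": 0.91, "classify": 0.72}  # classification silently broke
print(gate(current, baseline))  # → ['classify']
```

Comparing against a snapshot rather than a fixed threshold is what separates "this score is low" from "this score got worse", and the latter is the signal a merge gate needs.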

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • four scoring methods: exact match, embedding similarity, consistency, and LLM-as-judge
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeated and CI-safe checks
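As an illustration of the cheapest of these scoring methods, an exact-match check is usually a normalized string comparison rather than raw equality. This is a generic sketch; evalflow's actual normalization rules may differ:

```python
def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 on a match after trivial normalization, else 0.0.

    Lower-casing and collapsing whitespace keeps the metric strict
    about content while forgiving formatting noise, which is usually
    what you want for classification-style outputs.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if norm(expected) == norm(actual) else 0.0

print(exact_match("Positive", "  positive "))  # → 1.0
print(exact_match("Positive", "negative"))     # → 0.0
```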

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list
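Because run history is plain SQLite under `.evalflow/`, it can also be queried with the standard library. The table and column names below are invented for illustration (inspect the real file under `.evalflow/` for the actual schema); the demo builds its own toy database:

```python
import sqlite3

# Toy stand-in for the run history: schema and values are invented for
# illustration; evalflow's real tables under .evalflow/ will differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (run_id TEXT, case_name TEXT, score REAL)")
db.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("20240315-a3f9", "summarize_short_article", 0.91),
        ("20240315-a3f9", "answer_with_context", 0.61),
    ],
)

# Average score per run -- the kind of question `evalflow runs` or
# `evalflow compare` answers from the CLI.
for run_id, avg in db.execute(
    "SELECT run_id, AVG(score) FROM runs GROUP BY run_id"
):
    print(run_id, round(avg, 2))  # → 20240315-a3f9 0.76
```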

Documentation

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
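Reading keys from the environment rather than from config can be as simple as the pattern below. This is a generic sketch, not evalflow's actual loading code; the default variable name follows the OpenAI convention:

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Fetch a provider key from the environment, failing loudly if absent.

    Keeping only the *name* of the variable in config (as evalflow.yaml
    does) means the secret itself never touches version control.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or put it in an untracked .env file"
        )
    return key
```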

Reporting Security Issues

Please do not open public GitHub issues for security vulnerabilities. Open a private GitHub Security Advisory.

Examples

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT
