
> evalflow

pytest for LLMs

You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
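
What those local files might look like: the fragment below is a guess at the scaffolded shape (file name, fields, and values are all illustrative; inspect what `evalflow init` actually writes in your version):

```yaml
# evals/summarize.yaml -- illustrative only; check the files init generates
prompt: summarize_short_article
model: gpt-4o-mini
method: embedding
threshold: 0.8
cases:
  - input: "Full article text goes here."
    expected: "A one-sentence summary."
```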

Terminal Screenshot

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: PASS
Failures: 1
Run ID: 20240315-a3f9c2d81b4e

Why evalflow

Traditional unit tests cannot tell you when a prompt tweak quietly degrades a task: the code still runs and the tests still pass, but output quality drops. evalflow gives you a small, local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop
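
Running "the same gate in CI and on a laptop" comes down to the process exit code. A minimal sketch of wrapping that in a script (the exit-code meanings are the ones evalflow documents; `run_quality_gate` is an illustrative name, not part of evalflow):

```python
import subprocess

# Exit codes evalflow documents: 0 = pass, 1 = fail, 2 = error.
EXIT_MEANING = {0: "pass", 1: "fail", 2: "error"}

def run_quality_gate() -> str:
    """Run evalflow and translate its exit code into a label."""
    result = subprocess.run(["evalflow", "eval"])
    return EXIT_MEANING.get(result.returncode, "unknown")
```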

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
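
To make repeated CI runs cheap, the offline cache can be persisted between workflow runs with `actions/cache`. The step below is a sketch that assumes the cache lives under `.evalflow/` (the run history does; verify the cache path against the docs):

```yaml
      # Optional: persist the evalflow cache between workflow runs.
      # Assumes the cache lives under .evalflow/ -- verify against the docs.
      - uses: actions/cache@v4
        with:
          path: .evalflow/
          key: evalflow-${{ hashFiles('prompts/**', 'evals/**') }}
```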

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • exact match, embedding, consistency, and LLM judge methods
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeatable, CI-safe checks
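
The baseline-snapshot bullet is the key difference from a plain threshold: a score can clear the absolute gate yet still be a drop from the last known-good run. A hedged sketch of that logic (function name and tolerance are illustrative, not evalflow's API):

```python
def is_regression(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    # Flag a drop of more than `tolerance` below the baseline snapshot,
    # even when `current` would still clear an absolute threshold.
    return current < baseline - tolerance

# 0.90 -> 0.85 is a regression even though 0.85 might pass a 0.8 gate.
```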

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
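
Per the second bullet, the config points at a key by name only. An assumed fragment (not evalflow's documented schema) of what that looks like:

```yaml
# evalflow.yaml -- illustrative fragment: names the env var, never the secret
provider: openai
api_key_env: OPENAI_API_KEY
```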

Reporting Security Issues

Please do not report security vulnerabilities through public GitHub issues. Instead, open a private GitHub Security Advisory.

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT
