
pytest for LLMs - catch prompt regressions before they reach production

Project description

> evalflow

pytest for LLMs


You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
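The scaffold produced by `evalflow init` could look something like the sketch below. The file name, keys, and values here are illustrative assumptions, not the tool's documented schema; run `evalflow init` to see the real layout.

```yaml
# evals/summarize.yaml (hypothetical layout)
cases:
  - id: summarize_short_article
    prompt: summarize_short_article   # name of a prompt in the YAML registry
    input:
      article: "Text of a short article to summarize..."
    method: llm_judge                 # one of the scoring methods listed below
    threshold: 0.8                    # minimum score for this case to pass
```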

Terminal Screenshot

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: FAIL
Failures: 1
Run ID: 20240315-a3f9c2d81b4e

Why evalflow

Traditional unit tests do not tell you when a prompt tweak quietly degrades a task. evalflow gives you a small local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • exact match, embedding, consistency, and LLM judge methods
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeated and CI-safe checks
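Because the exit codes mirror pytest's, a custom gate script can branch on them directly. This is a minimal illustrative wrapper, not part of evalflow itself; the function names are ours.

```python
import subprocess

# pytest-style exit codes, as documented above: 0=pass, 1=fail, 2=error
MESSAGES = {
    0: "quality gate passed",
    1: "regressions detected",
    2: "eval run errored",
}

def describe_exit(code: int) -> str:
    """Map an evalflow exit code to a human-readable gate message."""
    return MESSAGES.get(code, f"unexpected exit code {code}")

def run_gate() -> int:
    """Run `evalflow eval` and report the outcome (requires evalflow on PATH)."""
    result = subprocess.run(["evalflow", "eval"])
    print(describe_exit(result.returncode))
    return result.returncode
```

A CI job can simply let the exit code propagate; a wrapper like this is only useful when you want custom reporting around the gate.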

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list

Documentation

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
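A config along these lines keeps only the environment variable's name under version control, never its value. The keys shown are an illustrative sketch; see docs/dev-doc/security.md for the real schema.

```yaml
# evalflow.yaml (illustrative): store the env var name, not the secret
provider: openai
model: gpt-4o-mini
api_key_env: OPENAI_API_KEY   # resolved from the environment at runtime
```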

Reporting Security Issues

Please do not open public GitHub issues for security vulnerabilities. Open a private GitHub Security Advisory.

Examples

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

evalflow-0.1.7.tar.gz (50.5 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

evalflow-0.1.7-py3-none-any.whl (53.4 kB)

File details

Details for the file evalflow-0.1.7.tar.gz.

File metadata

  • Download URL: evalflow-0.1.7.tar.gz
  • Upload date:
  • Size: 50.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalflow-0.1.7.tar.gz:

  • SHA256: 6f703c7d7c8e8463f7c7f05c186268fb9a34d8e0fbc3b1fa90882c853261038f
  • MD5: 11e159200a1f8bdbcdfe73472d054e50
  • BLAKE2b-256: 847667da7d9087015566a1e93ee3ebe2504df94471496bbb32d16dfc462d31db

See more details on using hashes here.
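To check a downloaded artifact against the published digest, something like the helper below works with only the standard library (the function name is ours):

```python
import hashlib

def sha256_hex(path: str) -> str:
    """Stream a file through SHA256 and return the hex digest, as listed on PyPI."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# compare against the published digest, e.g.:
# sha256_hex("evalflow-0.1.7.tar.gz") == "6f703c7d7c8e8463f7c7f05c186268fb9a34d8e0fbc3b1fa90882c853261038f"
```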

Provenance

The following attestation bundles were made for evalflow-0.1.7.tar.gz:

Publisher: publish.yml on emartai/evalflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file evalflow-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: evalflow-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 53.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalflow-0.1.7-py3-none-any.whl:

  • SHA256: 327b516b56f711687ed9bb9774462332b7a62946739b0a872e1f6b923083bb1f
  • MD5: 0fb625d9caa6ed0514adc12f412ca527
  • BLAKE2b-256: 686d5aaea71d3fc69721eb55548107c4594e0cd48cee8845b6a95af8e0678b2d

See more details on using hashes here.

Provenance

The following attestation bundles were made for evalflow-0.1.7-py3-none-any.whl:

Publisher: publish.yml on emartai/evalflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
