reval

Evaluate eval regressions.

reval correlates your Langfuse eval sessions with your git history and uses a multi-agent LLM pipeline to pinpoint which code changes caused which metric regressions. It produces a report with explanations, evidence, and suggested fixes.

Installation

From PyPI:

pip install reval-cli

From source:

git clone https://github.com/calebevans/reval.git
cd reval
pip install .

For development (includes pytest, mypy, ruff, pre-commit):

pip install ".[dev]"

Requires Python 3.10+.

Quick Start

  1. Generate a starter config:

     reval init

  2. Set your Langfuse credentials (or add them to reval.yaml):

     export LANGFUSE_BASE_URL="https://cloud.langfuse.com"
     export LANGFUSE_PUBLIC_KEY="pk-..."
     export LANGFUSE_SECRET_KEY="sk-..."

  3. Run an analysis against a Langfuse eval session:

     reval analyze --eval-results <session-id>

  4. Compare two sessions (current vs. baseline) and correlate regressions with code changes:

     reval analyze \
       --eval-results <current-session-id> \
       --eval-baseline <baseline-session-id> \
       --base main

Configuration

reval is configured through a reval.yaml file in your project root. Every field has a sensible default, so the file is optional for simple use cases.

langfuse:
  api_url: https://cloud.langfuse.com
  public_key: pk-...
  secret_key: sk-...
  project_id: ""                  # auto-detected if omitted
  current_session_id: ""          # or use --eval-results
  baseline_session_id: ""         # or use --eval-baseline
  publish: false                  # post results back to Langfuse

metrics:
  - name: answer_relevancy
    threshold: 0.05               # flag if score drops by more than this
  - name: faithfulness
    threshold: 0.05

relevance:
  include_patterns: []            # empty = include all non-ignored files
  ignore_patterns:
    - "**/tests/**"
    - "**/__pycache__/**"
    - "*.md"
    - "*.lock"
  category_mappings:
    prompt:
      - "**/prompts/**"
      - "**/*.prompt"
    model_config:
      - "**/config/model*"
      - "**/*llm_config*"
    retrieval:
      - "**/retrieval/**"
      - "**/rag/**"
    tool_definition:
      - "**/tools/**"
      - "**/functions/**"
    output_parsing:
      - "**/parsers/**"
      - "**/schema*"
    eval_config:
      - "**/eval*"

llm:
  model: openai/gpt-4o            # any LiteLLM model identifier
  temperature: 0.2
  max_tokens: 4096
  context_window: null             # override the model's default context window
  diff_model: null                 # use a different model for diff analysis
  eval_model: null                 # use a different model for eval analysis
  synthesis_model: null            # use a different model for synthesis

git:
  base: HEAD                       # base commit ref
  head: working                    # "working" = uncommitted changes

Configuration Sections

langfuse - Connection settings for your Langfuse instance. Credentials can also be set through environment variables (see below). Set publish: true to write analysis results back to Langfuse as comments.

metrics - List of metric names and their regression thresholds. A metric is flagged as regressed when current_score - baseline_score falls below -threshold. The threshold defaults to 0.05 when not specified.
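
In code terms, the flagging rule is equivalent to the following check (an illustrative sketch, not reval's internal implementation):

def is_regressed(current_score: float, baseline_score: float, threshold: float = 0.05) -> bool:
    # A metric regresses when its score drops by more than the threshold.
    return (current_score - baseline_score) < -threshold

# Example: a drop from 0.78 to 0.70 exceeds the default 0.05 threshold.
is_regressed(0.70, 0.78)  # True (delta = -0.08)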

relevance - Controls which files from the git diff are included in analysis. Files matching ignore_patterns are excluded. If include_patterns is non-empty, only files matching at least one include pattern (and no ignore pattern) are kept. The category_mappings section maps glob patterns to semantic categories (prompt, model_config, retrieval, etc.) so the analysis agents understand the role of each changed file.
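
The described include/ignore semantics behave roughly like this sketch (hypothetical, using Python's fnmatch as a stand-in for reval's actual glob matching):

from fnmatch import fnmatch

def is_relevant(path: str, include_patterns: list[str], ignore_patterns: list[str]) -> bool:
    # Files matching any ignore pattern are always excluded.
    if any(fnmatch(path, pattern) for pattern in ignore_patterns):
        return False
    # An empty include list keeps every non-ignored file.
    if not include_patterns:
        return True
    # Otherwise, the file must match at least one include pattern.
    return any(fnmatch(path, pattern) for pattern in include_patterns)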

llm - Model configuration. The model field accepts any LiteLLM model identifier (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-20250514, vertex_ai/gemini-2.0-flash). You can assign different models to each analysis agent using diff_model, eval_model, and synthesis_model.
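
Since these are plain LiteLLM identifiers, you can verify that a given model name and your credentials work before running an analysis (a minimal standalone sketch; reval's own calls may differ):

from litellm import completion

response = completion(
    model="openai/gpt-4o",  # the same identifier you would put in reval.yaml
    messages=[{"role": "user", "content": "ping"}],
    temperature=0.2,
    max_tokens=16,
)
print(response.choices[0].message.content)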

git - The commit refs to diff. Set head to working to diff uncommitted changes against base, or set both to commit SHAs/branch names.
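
For example, to compare a tagged release against a feature branch instead of uncommitted changes (the ref names here are placeholders):

git:
  base: v1.2.0
  head: feature/improve-retrieval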

Environment Variables

Langfuse credentials can be provided through environment variables instead of (or in addition to) reval.yaml. Environment variables take precedence when the corresponding config field is left empty.

Variable              Config equivalent     Description
LANGFUSE_BASE_URL     langfuse.api_url      Langfuse API URL
LANGFUSE_PUBLIC_KEY   langfuse.public_key   Langfuse public key
LANGFUSE_SECRET_KEY   langfuse.secret_key   Langfuse secret key
LANGFUSE_PROJECT_ID   langfuse.project_id   Langfuse project ID (auto-detected if omitted)
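
The precedence rule amounts to a config-first, env-fallback lookup, roughly like this (hypothetical sketch, not reval's actual code):

import os

def resolve_credential(config_value: str, env_var: str) -> str:
    # The config field wins; the environment variable fills in
    # only when the field is left empty.
    return config_value or os.environ.get(env_var, "")

# e.g. with langfuse.secret_key left empty in reval.yaml:
secret_key = resolve_credential("", "LANGFUSE_SECRET_KEY")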

CLI Reference

reval init

Generate a starter reval.yaml with interactive prompts.

reval init [--output PATH]
Option     Default      Description
--output   reval.yaml   Path for the generated config file

reval analyze

Run the analysis pipeline. This is the main command.

reval analyze [OPTIONS]
Option                      Default                     Description
--eval-results              (required)                  Langfuse session ID for the current eval run
--eval-baseline             (none)                      Langfuse session ID for the baseline run; omit for single-session mode
--base                      from config, else HEAD      Base commit ref
--head                      from config, else working   Head ref ("working" = uncommitted changes)
--config                    reval.yaml                  Path to config file
--output                    terminal                    Output format: terminal, json, or markdown
--output-file               (stdout)                    Write the report to a file instead of stdout
--threshold                 0.05                        Global regression threshold (overrides per-metric config)
--model                     from config                 LLM model to use (overrides config)
--publish / --no-publish    from config                 Publish results back to Langfuse
--verbose                   false                       Show debug information
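
A fuller invocation combining several of these flags might look like this (the session IDs are placeholders):

reval analyze \
  --eval-results sess-current \
  --eval-baseline sess-baseline \
  --base main \
  --head working \
  --output markdown \
  --output-file regression-report.md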

reval report

Re-render a previously saved JSON report in a different format.

reval report REPORT_FILE [OPTIONS]
Option          Default    Description
--output        terminal   Output format: terminal, json, or markdown
--output-file   (stdout)   Write the report to a file instead of stdout

Example: save a JSON report, then render it as markdown later:

reval analyze --eval-results sess-123 --output json --output-file report.json
reval report report.json --output markdown

Analysis Modes

Compare mode

Activated when you provide both --eval-results and --eval-baseline. reval fetches both sessions from Langfuse, diffs the git history between --base and --head, and runs three agents:

  1. Diff agent examines code changes in isolation and forms hypotheses about their potential eval impact.
  2. Eval agent investigates each regressed test case by comparing outputs, scores, and evaluator reasoning between current and baseline runs.
  3. Synthesis agent correlates the diff and eval findings into a final report with explanations and suggested fixes.

Single-session mode

Activated when you omit --eval-baseline. reval analyzes a single eval session without a baseline comparison. It loads source files matching your relevance patterns, runs the eval agent on any test cases that score below the threshold, and produces findings about what may be going wrong.

Output Formats

Format     Flag                Description
Terminal   --output terminal   Rich tables and panels with color-coded diffs (default)
JSON       --output json       Machine-readable output, can be re-rendered with reval report
Markdown   --output markdown   Tables and fenced diff blocks, suitable for PRs or documentation

All formats can be written to a file with --output-file PATH.

Publishing to Langfuse

When --publish is passed (or langfuse.publish is set to true in config), reval posts its analysis results back to Langfuse:

  • A session comment with the full markdown report is added to the current session.
  • A trace comment with relevant findings is added to each failed trace.

This makes it easy to review reval's analysis directly in the Langfuse UI alongside your eval results.
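
For example, to run a comparison and post the findings back to Langfuse in one step (the session IDs are placeholders):

reval analyze \
  --eval-results sess-current \
  --eval-baseline sess-baseline \
  --publish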
