
LEF - LangSmith Evaluation Framework

A plug-and-play evaluation system for LangChain, LangGraph, and LangSmith projects. LEF wraps langsmith, openevals, and agentevals into a unified framework with built-in QA/CI support.

20+ pre-built evaluators | Local datasets (no LangSmith required) | CI/CD gating | 3 lines to get started


What's New in v0.2.0

v0.2.0 is a major release that adds production-readiness features, CI/CD integration, and adversarial testing. Here is a summary of what's new:

Feature | Description | Section
Result Export | JSON, CSV, JUnit XML export for CI artifacts | Result Export
Baseline Comparison | Save/compare baselines, detect regressions | Baseline Comparison
CI/CD Integration | GitHub and Azure DevOps PR comments, JUnit XML | CI/CD Integration
Watch Mode | Re-run evals on file changes during development | Watch Mode
Result Caching | Cache target outputs to skip re-invocation | Result Caching
QA Testing | Test deployed endpoints from the CLI | QA Testing
Red-Team Testing | Adversarial evaluation across 6 attack categories | Red-Team Testing
Synthetic Data | Generate datasets from docs, traces, or seeds | Synthetic Data Generation
Production Monitoring | Continuous evaluation daemon for live projects | Production Monitoring
Pytest Plugin | lef_eval fixture and @pytest.mark.lef marker | Pytest Plugin
Remote Targets | Evaluate HTTP endpoints without wrapper code | Remote Targets
Dataset Management | Pull, push, diff, generate from the CLI | CLI Reference
Git Context | Auto-tag experiments with branch/commit metadata | Git Context
7 CLI Subcommands | run, compare, baseline, qa, monitor, redteam, dataset | CLI Reference

Installation

pip install lefx
# Or with uv
uv add lefx

Note: The PyPI package name is lefx, but the import remains import lef.

Optional extras:

pip install "lefx[langgraph]"    # LangGraph support (langgraph>=0.2.0)
pip install "lefx[agents]"       # Agent trajectory evaluators (agentevals>=0.0.9)
pip install "lefx[remote]"       # Remote HTTP target support (httpx>=0.27.0)
pip install "lefx[all]"          # Everything: LangGraph + agents + remote + OpenAI + Anthropic SDKs
Extra | What it adds | When you need it
langgraph | langgraph>=0.2.0 | Evaluating compiled StateGraph agents
agents | agentevals>=0.0.9 | Trajectory evaluators (create_trajectory_evaluator, create_trajectory_judge)
remote | httpx>=0.27.0 | create_remote_target(), lef qa, remote HTTP evaluation
all | All of the above + langchain-openai, langchain-anthropic | Full-featured setup

Requires Python 3.11+.

Quick Start

1. Evaluate with built-in scorers (no API keys needed)

from lef import run_eval, exact_match, contains

results = run_eval(
    target=my_chain.invoke,
    data="my-langsmith-dataset",
    evaluators=[exact_match, contains],
)

2. Add LLM judges (needs OPENAI_API_KEY or ANTHROPIC_API_KEY)

from lef import run_eval, correctness_judge, safety_judge, exact_match

results = run_eval(
    target=my_chain.invoke,
    data="my-dataset",
    evaluators=[correctness_judge(), safety_judge(), exact_match],
)

3. Gate CI with thresholds

from lef import assert_scores

assert_scores(results, {
    "correctness": 0.8,
    "safety": 0.95,
    "exact_match": 0.7,
})
# Raises EvalAssertionError if any threshold fails

4. Use local data files (no LangSmith account needed)

# tests/eval_data/examples.yaml
- inputs:
    question: "What is the capital of France?"
  outputs:
    answer: "Paris"
- inputs:
    question: "What is 2+2?"
  outputs:
    answer: "4"

from lef import load_examples, run_eval, exact_match

examples = load_examples("tests/eval_data/examples.yaml")
results = run_eval(
    target=my_app,
    data=examples,
    evaluators=[exact_match],
    upload_results=False,  # Fully offline
)

Custom Scorers

Decorator (simplest)

from lef import scorer

@scorer(key="word_count")
def word_count(*, inputs, outputs, **kwargs):
    count = len(outputs.get("answer", "").split())
    return min(count / 100, 1.0)  # Return float 0-1

@scorer(key="has_answer")
def has_answer(*, inputs, outputs, **kwargs):
    return bool(outputs.get("answer"))  # Return bool (True=1.0, False=0.0)

Async scorer

@scorer(key="api_check")
async def api_check(*, inputs, outputs, **kwargs):
    result = await some_async_validation(outputs["answer"])
    return result  # bool, float, int, dict, or EvalResult
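
Returning a dict or EvalResult

A scorer does not have to return a bare number -- it can return a dict or an EvalResult to attach an explanation. A minimal sketch, assuming the dict keys mirror the EvalResult fields (key, score, comment) shown in EvalResult Anatomy below:

from lef import scorer

@scorer(key="answer_length")
def answer_length(*, inputs, outputs, **kwargs):
    # Hypothetical metric: reward longer answers, capped at 200 characters
    answer = outputs.get("answer", "")
    score = min(len(answer) / 200, 1.0)
    return {"key": "answer_length", "score": score, "comment": f"{len(answer)} characters"}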

Factory (dynamic creation)

from lef import create_scorer

def my_logic(*, inputs, outputs, **kwargs):
    return len(outputs.get("answer", "")) > 10

length_check = create_scorer("min_length", my_logic)

Class-based (most flexible)

from lef import BaseEvaluator, EvalResult

class MyEvaluator(BaseEvaluator):
    key = "my_metric"

    def evaluate(self, *, inputs, outputs, reference_outputs=None, **kwargs):
        score = 1.0 if "expected" in outputs.get("answer", "") else 0.0
        return EvalResult(key=self.key, score=score, comment="Checked for keyword")

Composite scorer (combine multiple)

from lef import create_composite_scorer, exact_match, contains

quality = create_composite_scorer(
    "quality",
    [exact_match, contains],
    aggregation="mean",  # Also: "min", "max", "all_pass"
)

Custom LLM Judges

from lef import create_judge

# Custom prompt with {inputs}, {outputs}, {reference_outputs} placeholders
tone_judge = create_judge(
    prompt="""Evaluate whether the response has a professional tone.

    User input: {inputs}
    Response: {outputs}

    Return true if professional, false otherwise.""",
    model="openai:gpt-4o",
    feedback_key="tone",
)

Pre-Built Evaluators

Scorers (rule-based, no API keys)

Scorer What it does
exact_match Exact string match (whitespace-trimmed)
contains Case-insensitive substring check
regex_match Regex pattern matching (reference_outputs["pattern"])
json_match Field-by-field JSON comparison, returns 0.0-1.0
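
For example, regex_match reads its pattern from each example's reference outputs. A minimal sketch using a local example list (which output field the pattern is matched against is an assumption here):

from lef import run_eval, regex_match

examples = [
    {"inputs": {"question": "What year was Python first released?"},
     "outputs": {"pattern": r"\b1991\b"}},   # becomes reference_outputs["pattern"]
]

results = run_eval(
    target=my_app,          # your target callable
    data=examples,
    evaluators=[regex_match],
    upload_results=False,
)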

LLM Judges (need API keys)

Judge What it evaluates
correctness_judge() Output correctness vs reference
conciseness_judge() Response conciseness
hallucination_judge() Hallucinations beyond inputs/context
answer_relevance_judge() Answer relevance to the question
faithfulness_judge() Faithfulness to source context
response_quality_judge() Overall quality (correctness, completeness, clarity)
safety_judge() Harmful, biased, or dangerous content
toxicity_judge() Toxic or offensive content
tool_selection_judge() Agent tool selection accuracy
code_correctness_judge() Code correctness
plan_adherence_judge() Adherence to a specified plan

RAG Judges

Judge What it evaluates
rag_groundedness_judge() Response grounded in retrieved context
rag_helpfulness_judge() RAG response helpfulness
rag_retrieval_relevance_judge() Retrieved document relevance
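
The RAG judges follow the same factory pattern as the judges above. A sketch of wiring them into run_eval; the "context" input field that the judges are assumed to read the retrieved documents from is an assumption, not a documented requirement:

from lef import run_eval, rag_groundedness_judge, rag_retrieval_relevance_judge

examples = [
    {"inputs": {"question": "What does LEF wrap?",
                "context": "LEF wraps langsmith, openevals, and agentevals."},  # assumed field name
     "outputs": {"answer": "langsmith, openevals, and agentevals"}},
]

results = run_eval(
    target=my_rag_chain.invoke,   # your retrieval-augmented pipeline
    data=examples,
    evaluators=[rag_groundedness_judge(), rag_retrieval_relevance_judge()],
)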

Agent Trajectory

from lef import create_trajectory_evaluator, create_trajectory_judge

# Match-based (needs reference trajectory)
traj_eval = create_trajectory_evaluator(match_mode="superset")
# Options: "strict", "unordered", "subset", "superset"

# LLM-based (no reference needed)
traj_judge = create_trajectory_judge(model="openai:gpt-4o")

Choosing a judge model

All judges accept a model parameter:

from lef import JudgeModel, correctness_judge

judge = correctness_judge(model=JudgeModel.GPT_4O)         # Default
judge = correctness_judge(model=JudgeModel.GPT_4O_MINI)    # Faster/cheaper
judge = correctness_judge(model=JudgeModel.CLAUDE_SONNET)   # Anthropic
judge = correctness_judge(model=JudgeModel.CLAUDE_HAIKU)    # Fast Anthropic
judge = correctness_judge(model="openai:gpt-4.1")           # Any string
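
To avoid repeating the model on every judge, a project-wide default can be set through LefConfig (see Configuration below). A sketch, assuming judges fall back to default_judge_model when no model argument is passed:

from lef import LefConfig, JudgeModel, correctness_judge, safety_judge

config = LefConfig(default_judge_model=JudgeModel.CLAUDE_SONNET)
config.apply()

# Assumed behavior: with no explicit model, these judges use the configured default
evaluators = [correctness_judge(), safety_judge()]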

Dataset & Runner Patterns

Fluent runner

from lef import EvalRunner, correctness_judge, exact_match

runner = EvalRunner(
    dataset="qa-examples",          # LangSmith dataset name or list of dicts
    experiment_prefix="v1",
    description="Baseline evaluation",
    num_repetitions=3,              # For statistical significance
    upload_results=True,
)
runner.add_evaluators([correctness_judge(), exact_match])
results = runner.run(target=my_chain.invoke)

# Async
results = await runner.arun(target=my_chain.ainvoke)

Create a LangSmith dataset programmatically

from lef import create_dataset

create_dataset("qa-examples", examples=[
    {"inputs": {"question": "Capital of France?"}, "outputs": {"answer": "Paris"}},
    {"inputs": {"question": "What is 2+2?"},       "outputs": {"answer": "4"}},
])

Load local files (YAML, JSON, CSV)

from lef import load_examples

# YAML / JSON (list of {inputs, outputs} dicts)
examples = load_examples("tests/data/examples.yaml")

# CSV with column splitting
examples = load_examples(
    "tests/data/cases.csv",
    input_keys=["question"],
    output_keys=["answer"],
)

A/B comparison

from lef import run_comparative_eval

results = run_comparative_eval(
    experiments=["v1-gpt4o", "v2-claude"],
    evaluators=[my_preference_judge],
)

LangChain Integration

from lef import evaluate_chain, aevaluate_chain, correctness_judge

results = evaluate_chain(
    my_chain,
    data="qa-dataset",
    evaluators=[correctness_judge()],
    output_mapper=lambda x: {"answer": x.content},  # Optional
)

# Async
results = await aevaluate_chain(my_chain, data="qa-dataset", evaluators=[...])

LangGraph Integration

from lef import evaluate_graph, aevaluate_graph, correctness_judge

results = evaluate_graph(
    app,  # Compiled StateGraph
    data="agent-dataset",
    evaluators=[correctness_judge()],
    input_mapper=lambda x: {"messages": [("user", x["question"])]},
    output_mapper=lambda x: {"answer": x["messages"][-1].content},
)

# Async
results = await aevaluate_graph(app, data="agent-dataset", evaluators=[...])

Online / Production Monitoring

from lef import OnlineEvaluator, evaluate_run, safety_judge, response_quality_judge

# Evaluate a single run by ID
results = evaluate_run("run-uuid-here", evaluators=[safety_judge()])

# Monitor a project
online = OnlineEvaluator(project_name="my-chatbot-production")
online.add_evaluator(safety_judge())
online.add_evaluator(response_quality_judge())
results = online.evaluate_recent(limit=50, run_type="chain")

Result Export

Export evaluation results to JSON, CSV, or JUnit XML for CI artifact consumption or downstream analysis.

from lef import run_eval, export_json, export_csv, export_junit_xml, export_results

results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])

# Export to specific formats
export_json(results, "results.json")
export_csv(results, "results.csv")
export_junit_xml(results, "results.xml")

# Auto-detect format from file extension
export_results(results, "results.json")   # JSON
export_results(results, "results.csv")    # CSV
export_results(results, "results.xml")    # JUnit XML

From the CLI:

lef run eval_suite.yaml --output results.json
lef run eval_suite.yaml --output results.xml   # JUnit XML for CI dashboards

You can also format results as a table for terminal output:

from lef import format_results_table

print(format_results_table(results))

Baseline Comparison

Save evaluation results as named baselines, then compare against them to detect regressions across branches or releases.

Save a baseline

from lef import run_eval, save_baseline

results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
save_baseline("main", results, metadata={"branch": "main", "version": "1.0"})
# Saved to .lef/baselines/main.json

Compare against a baseline

from lef import compare_results, load_baseline

baseline = load_baseline("main")
current_results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])

report = compare_results(baseline, current_results, tolerance=0.05)
# report.regressions -> list of metrics that dropped by more than the tolerance
# report.improvements -> list of metrics that improved
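
A sketch of gating CI on the comparison report -- only the report.regressions and report.improvements attributes from above are assumed:

import sys

if report.regressions:
    print("Regressions detected against baseline 'main':")
    for regression in report.regressions:
        print(f"  {regression}")
    sys.exit(1)   # non-zero exit fails the CI step

print(f"No regressions; {len(report.improvements)} metric(s) improved.")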

Manage baselines

from lef import list_baselines, delete_baseline

baselines = list_baselines()       # List all saved baselines
delete_baseline("old-baseline")    # Remove a saved baseline

From the CLI:

# Save results as a baseline
lef run eval_suite.yaml --save-baseline main

# Compare against a baseline
lef run eval_suite.yaml --baseline main

# List and manage baselines
lef baseline list
lef baseline delete old-baseline

# Compare two baselines directly
lef compare --baseline main --current feature-branch --tolerance 0.05

CI/CD Integration

LEF integrates with CI/CD pipelines to post evaluation results as PR comments and generate JUnit XML reports.

GitHub Actions

# .github/workflows/eval.yaml
name: Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "lefx[all]"

      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          lef run eval_suite.yaml \
            --output results.xml \
            --save-baseline ${{ github.head_ref }} \
            --baseline main \
            --github-comment \
            --threshold correctness=0.8

      - name: Publish JUnit results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Eval Results
          path: results.xml
          reporter: java-junit

Azure DevOps

# azure-pipelines.yaml
steps:
  - script: |
      pip install "lefx[all]"
      lef run eval_suite.yaml \
        --output results.xml \
        --azdo-comment \
        --threshold correctness=0.8
    env:
      OPENAI_API_KEY: $(OPENAI_API_KEY)
  - task: PublishTestResults@2
    inputs:
      testResultsFiles: results.xml
      testRunTitle: "LEF Evaluations"

Programmatic PR comments

from lef import post_github_comment, post_azdo_comment, format_results_table

body = format_results_table(results)

# GitHub (auto-detects repo/PR from GITHUB_REPOSITORY and GITHUB_REF)
post_github_comment(body, update_existing=True)

# Azure DevOps
post_azdo_comment(body)

Watch Mode

Re-run evaluations automatically when source files, config files, or datasets change. Useful for iterative development.

lef run eval_suite.yaml --watch

Watch mode monitors .py, .yaml, .yml, .json, and .csv files in the project directory. When a change is detected, the evaluation suite re-runs automatically.

Programmatic usage:

from lef.watch import watch_and_run

watch_and_run(
    run_fn=my_eval_function,
    watch_paths=["src/", "evals/", "eval_suite.yaml"],
)

Combine with --cache to avoid re-calling the target when only evaluators change:

lef run eval_suite.yaml --watch --cache

Result Caching

Cache target function outputs on disk to avoid expensive re-invocations when iterating on evaluators. Uses content-addressable hashing of inputs for cache keys.

from lef import ResultCache, run_eval

cache = ResultCache(ttl_seconds=3600)  # Cache expires after 1 hour
cached_target = cache.wrap(my_chain.invoke)

# First run: calls my_chain.invoke for each input
run_eval(target=cached_target, data=examples, evaluators=[judge1])

# Second run: uses cached outputs, only re-evaluates
run_eval(target=cached_target, data=examples, evaluators=[judge2])

Cache is stored in .lef/cache/ by default. From the CLI:

lef run eval_suite.yaml --cache

QA Testing

Test deployed HTTP endpoints directly from the CLI without writing Python code. Requires the [remote] extra (pip install "lefx[remote]").

# Test a deployed endpoint with a dataset
lef qa https://my-api.example.com/invoke \
    --data tests/eval_data/examples.yaml \
    --evaluators correctness safety \
    --threshold correctness=0.8 \
    --threshold safety=0.95

# Add custom headers (e.g., for authentication)
lef qa https://my-api.example.com/invoke \
    --data my-langsmith-dataset \
    -H "Authorization: Bearer $API_KEY" \
    --timeout 120

# Export results and skip LangSmith upload
lef qa https://my-api.example.com/invoke \
    --data examples.yaml \
    --output results.json \
    --no-upload

Red-Team Testing

Run adversarial evaluations across 6 attack categories to test your system's safety and robustness.

Attack categories

Category Description
prompt_injection Attempts to override system instructions
jailbreak Attempts to bypass safety guardrails
pii_extraction Attempts to extract private or sensitive information
hallucination_inducement Inputs designed to induce hallucinated responses
toxicity Tests whether the system generates toxic content
bias Tests for biased responses across demographics

From the CLI

# Run all categories against a target
lef redteam --target myapp.chain:invoke

# Test specific categories with more examples
lef redteam --target myapp.chain:invoke \
    --categories prompt_injection,jailbreak,pii_extraction \
    --count 10

# Use seed examples only (no LLM generation)
lef redteam --target myapp.chain:invoke --seed-only

# Use a config file for target definition
lef redteam eval_config.yaml --categories toxicity,bias

Programmatic usage

from lef import run_redteam

report = run_redteam(
    target=my_chain.invoke,
    categories=["prompt_injection", "jailbreak", "pii_extraction"],
    count_per_category=10,
    model="openai:gpt-4o",
    upload_results=False,
)
# report contains per-category scores and detailed results

Built-in red-team scorers

from lef import injection_resistance_check, pii_leak_check, refusal_check

# Use individually as evaluators
results = run_eval(
    target=my_chain.invoke,
    data=adversarial_examples,
    evaluators=[injection_resistance_check, pii_leak_check, refusal_check],
)

Synthetic Data Generation

Generate evaluation datasets from documents, production traces, or seed examples using LLM-powered synthesis.

From documents

from lef import generate_from_docs

examples = generate_from_docs(
    "docs/product_guide.md",
    count=10,
    style="factual",       # "factual", "reasoning", or "conversational"
    model="openai:gpt-4o",
)
# Returns a list of {"inputs": {"question": ...}, "outputs": {"answer": ...}} dicts

From production traces

from lef import generate_from_traces

examples = generate_from_traces(
    project_name="my-chatbot",
    limit=100,
    model="openai:gpt-4o",
)

Generate adversarial examples

from lef import generate_adversarial

adversarial = generate_adversarial(
    description="A customer support chatbot for a SaaS product",
    seed_examples=[{"question": "How do I reset my password?"}],
    count=20,
)

Diversify an existing dataset

from lef import diversify_dataset

expanded = diversify_dataset(
    existing_examples,
    count=50,
    model="openai:gpt-4o",
)

From the CLI

# Generate from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml

# Generate from a directory of documents
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml

Production Monitoring

Run a long-lived daemon that continuously polls a LangSmith project for new runs and evaluates them. Useful for production monitoring and alerting.

from lef import MonitorDaemon
from lef.judges import safety_judge, correctness_judge

monitor = MonitorDaemon(
    project_name="my-chatbot",
    evaluators=[safety_judge(), correctness_judge()],
    thresholds={"safety": 0.9, "correctness": 0.7},
    poll_interval=60,         # seconds between polls
    batch_size=20,            # runs per poll
    run_type="chain",         # filter by run type
)
monitor.add_alert_handler(lambda alert: print(f"ALERT: {alert}"))
monitor.run()  # Blocks until interrupted (Ctrl+C)

From the CLI:

lef monitor \
    --project my-chatbot \
    --evaluators safety correctness \
    --threshold safety=0.9 \
    --threshold correctness=0.7 \
    --interval 60 \
    --batch-size 20 \
    --run-type chain

Pytest Plugin

Run LEF evaluations as pytest test cases. The plugin provides a lef_eval fixture and a @pytest.mark.lef marker.

Using the lef_eval fixture

# tests/test_evals.py
from lef import exact_match, correctness_judge

def test_qa_correctness(lef_eval):
    results = lef_eval(
        target=my_chain.invoke,
        data="tests/eval_data/examples.yaml",
        evaluators=[exact_match, correctness_judge()],
        thresholds={"correctness": 0.8, "exact_match": 0.9},
    )
    # Thresholds are automatically asserted -- test fails if any threshold is not met

Using the @pytest.mark.lef marker

import pytest

@pytest.mark.lef(config="eval_suite.yaml")
def test_my_eval():
    pass  # Eval is run automatically by the marker

Running config files as test cases

# Run eval configs as pytest test cases
pytest --lef-config eval_suite.yaml --lef-config another_suite.yaml

# Disable LangSmith upload during test runs
pytest --lef-config eval_suite.yaml --lef-no-upload

# Override experiment prefix
pytest --lef-config eval_suite.yaml --lef-prefix "ci-test"

Remote Targets

Evaluate any HTTP endpoint -- LangServe, LangGraph Platform, or plain REST APIs -- without writing wrapper code. Requires the [remote] extra.

from lef import create_remote_target, create_async_remote_target, run_eval, arun_eval

# Basic REST endpoint
target = create_remote_target("https://my-api.example.com/invoke")

# LangGraph Platform deployment with custom mappers
target = create_remote_target(
    "https://my-assistant.langsmith.dev/runs/stream",
    headers={"x-api-key": "..."},
    input_mapper=lambda inputs: {
        "input": {"messages": [{"role": "user", "content": inputs["question"]}]},
    },
    output_mapper=lambda resp: {
        "answer": resp["output"]["messages"][-1]["content"],
    },
    timeout=120.0,
)

results = run_eval(target=target, data=examples, evaluators=[...])

# Async version
async_target = create_async_remote_target("https://my-api.example.com/invoke")
results = await arun_eval(target=async_target, data=examples, evaluators=[...])

Git Context

LEF automatically detects git branch, commit SHA, author, and other metadata to tag evaluation experiments. This enables branch comparison workflows and traceability in LangSmith.

from lef import get_git_context, build_experiment_metadata

# Get current git context
ctx = get_git_context()
# {'branch': 'feature/new-prompt', 'commit_sha': 'abc123...', 'author': '...', ...}

# Build experiment metadata (includes git context + CI detection)
metadata = build_experiment_metadata()
# Automatically used by run_eval when upload_results=True

Git context is auto-detected from the git repository and from CI environment variables (GitHub Actions, Azure DevOps, GitLab CI, Jenkins).
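
One way to use this is to name baselines after the current branch and tag them with the commit they came from. A short sketch combining get_git_context with save_baseline (both shown above):

from lef import exact_match, get_git_context, run_eval, save_baseline

ctx = get_git_context()
results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[exact_match])

# Per-branch baseline, tagged with the full git context for traceability
save_baseline(ctx.get("branch", "local"), results, metadata=ctx)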

CLI Reference

LEF provides 7 CLI subcommands:

lef run        Run evaluation suite from config file(s)
lef compare    Compare two baselines or experiments
lef baseline   Manage saved baselines (list, delete)
lef qa         QA test a deployed HTTP endpoint
lef monitor    Continuously monitor production runs
lef redteam    Run adversarial red-team evaluation
lef dataset    Dataset management (pull, push, diff, generate)

lef run

Run an evaluation suite defined in a YAML config file.

# Basic usage
lef run eval_suite.yaml

# Override prefix and thresholds
lef run eval_suite.yaml --prefix "v2.1" --threshold correctness=0.85

# Export results, cache outputs, and compare against a baseline
lef run eval_suite.yaml \
    --output results.xml \
    --cache \
    --baseline main \
    --save-baseline feature-branch

# Watch mode with caching
lef run eval_suite.yaml --watch --cache

# Post results as a GitHub PR comment
lef run eval_suite.yaml --github-comment

# Post results as an Azure DevOps PR comment
lef run eval_suite.yaml --azdo-comment

# Local-only (no LangSmith upload)
lef run eval_suite.yaml --no-upload

# Merge multiple config files
lef run base_config.yaml override_config.yaml

Config file format:

# eval_suite.yaml
target: myapp.chain:invoke          # Dotted import path to your target
dataset: tests/eval_data/examples.yaml  # Local file or LangSmith dataset name
evaluators:
  - exact_match                     # Built-in scorer
  - correctness_judge               # Built-in judge (auto-instantiated)
  - myapp.evals:custom_scorer       # Custom import path
experiment_prefix: "regression-test"
thresholds:
  correctness: 0.8
  safety: 0.95
  exact_match: 0.7
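
The myapp.evals:custom_scorer entry points at an ordinary Python callable. A sketch of what that module might contain -- the module path and the check itself are just placeholders matching the config above:

# myapp/evals.py
from lef import scorer

@scorer(key="custom_scorer")
def custom_scorer(*, inputs, outputs, **kwargs):
    # Hypothetical project-specific check: the answer must mention a URL
    return "http" in outputs.get("answer", "")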

lef compare

Compare two baselines or experiments to detect regressions.

lef compare --baseline main --current feature-branch
lef compare --baseline main --current feature-branch --tolerance 0.05
lef compare --baseline main --current feature-branch --output comparison.json

lef baseline

Manage saved baselines.

lef baseline list              # List all saved baselines
lef baseline delete my-baseline  # Delete a saved baseline

lef qa

Test a deployed HTTP endpoint against a dataset. See QA Testing for details.

lef qa https://my-api.example.com/invoke \
    --data examples.yaml \
    --evaluators correctness safety \
    --threshold correctness=0.8 \
    -H "Authorization: Bearer $TOKEN" \
    --timeout 120 \
    --output results.json

lef monitor

Continuously monitor production runs. See Production Monitoring for details.

lef monitor --project my-chatbot \
    --evaluators safety correctness \
    --threshold safety=0.9 \
    --interval 60 --batch-size 20 --run-type chain

lef redteam

Run adversarial red-team evaluations. See Red-Team Testing for details.

lef redteam --target myapp.chain:invoke \
    --categories prompt_injection,jailbreak \
    --count 10 --seed-only --no-upload

lef dataset

Dataset management commands.

# Pull a LangSmith dataset to a local file
lef dataset pull my-dataset --output my-dataset.yaml

# Push a local file to LangSmith
lef dataset push examples.yaml --name "my-dataset" --description "QA examples"

# Diff two local dataset files
lef dataset diff examples_v1.yaml examples_v2.yaml

# Generate a synthetic dataset from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml

Generation styles: factual (fact-based Q&A), reasoning (multi-step), conversational (natural dialogue).

Threshold Assertions

Raise on failure (CI/CD)

from lef import assert_scores, EvalAssertionError

try:
    assert_scores(results, {
        "correctness": 0.8,
        "safety": 0.95,
    })
except EvalAssertionError as e:
    print(f"Failed: {e}")
    print(f"Details: {e.failures}")  # List of {key, actual, threshold}

Non-raising check

from lef import check_scores

report = check_scores(results, {"correctness": 0.8, "safety": 0.95})
for key, info in report.items():
    status = "PASS" if info["passed"] else "FAIL"
    print(f"  {key}: {status} ({info['actual']:.2f} vs {info['threshold']:.2f})")

Configuration

from lef import LefConfig, JudgeModel

# From environment (recommended)
config = LefConfig.from_env()

# Or explicit
config = LefConfig(
    langsmith_api_key="lsv2_...",
    langsmith_project="my-project",
    default_judge_model=JudgeModel.CLAUDE_SONNET,
    max_concurrency=10,
)
config.apply()  # Sets environment variables

Environment variables:

export LANGCHAIN_API_KEY=lsv2_...           # LangSmith API key
export LANGCHAIN_PROJECT=my-project         # LangSmith project name
export LANGCHAIN_TRACING_V2=true            # Enable tracing
export OPENAI_API_KEY=sk-...                # For OpenAI judges
export ANTHROPIC_API_KEY=sk-ant-...         # For Anthropic judges

Walkthrough: QA-ing a Prompt Change

Scenario: You changed a prompt in your LangGraph agent and need to verify nothing broke.

Step 1: Create test data

# evals/golden_set.yaml
- inputs:
    question: "Summarize the key points of this document"
    context: "The report shows Q3 revenue grew 15% YoY..."
  outputs:
    answer: "Q3 revenue grew 15% year-over-year"
- inputs:
    question: "What action items came out of the meeting?"
    context: "Action items: 1) Update the roadmap 2) Schedule design review"
  outputs:
    answer: "Update the roadmap and schedule a design review"

Step 2: Write the eval script

# evals/test_prompt_change.py
from lef import load_examples, run_eval, assert_scores, exact_match, correctness_judge, scorer
from my_project.agent import app

@scorer(key="mentions_key_facts")
def mentions_key_facts(*, inputs, outputs, reference_outputs, **kwargs):
    ref = reference_outputs.get("answer", "").lower()
    out = outputs.get("answer", "").lower()
    keywords = [w for w in ref.split() if len(w) > 4]
    if not keywords:
        return True
    return sum(1 for kw in keywords if kw in out) / len(keywords)

def target(inputs):
    result = app.invoke({
        "messages": [("user", inputs["question"])],
        "context": inputs.get("context", ""),
    })
    return {"answer": result["messages"][-1].content}

examples = load_examples("evals/golden_set.yaml")
results = run_eval(
    target=target,
    data=examples,
    evaluators=[exact_match, mentions_key_facts, correctness_judge()],
    upload_results=False,
    experiment_prefix="prompt-v2",
)

assert_scores(results, {
    "exact_match": 0.5,
    "mentions_key_facts": 0.7,
    "correctness": 0.8,
})
print("All QA checks passed!")

Step 3: Run it

# Local
python evals/test_prompt_change.py

# Or via CLI
lef run evals/eval_suite.yaml --no-upload

# CI: exits non-zero on failure
lef run evals/eval_suite.yaml --threshold correctness=0.8

Step 4: Add to CI

# .github/workflows/eval.yaml
- name: Run prompt regression tests
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    pip install lefx
    lef run evals/eval_suite.yaml --threshold correctness=0.8

EvalResult Anatomy

All evaluators return EvalResult, a dict subclass compatible with LangSmith:

from lef import EvalResult

result = EvalResult(
    key="my_metric",          # Metric name
    score=0.85,               # float (0-1) or bool
    comment="Looks good",     # Optional explanation
    metadata={"details": {}}, # Optional metadata
)

# Dict-compatible (LangSmith requires this)
result["key"]       # "my_metric"
result["score"]     # 0.85

# Property access
result.key          # "my_metric"
result.score        # 0.85
result.comment      # "Looks good"
result.metadata     # {"details": {}}

API Reference

Full public API (84 exports)
Category Export Type
Core EvalResult Class (dict subclass)
EvalResultBatch Class (Pydantic model)
BaseEvaluator Abstract class
AsyncBaseEvaluator Abstract class
JudgeModel Enum (GPT_4O, GPT_4O_MINI, CLAUDE_SONNET, CLAUDE_HAIKU)
scorer Decorator
evaluator Decorator (alias for scorer)
Scorers exact_match Callable
contains Callable
regex_match Callable
json_match Callable
mean_score Callable
pass_rate Callable
create_scorer Factory function
create_composite_scorer Factory function
Judges correctness_judge Factory -> Callable
conciseness_judge Factory -> Callable
hallucination_judge Factory -> Callable
answer_relevance_judge Factory -> Callable
faithfulness_judge Factory -> Callable
response_quality_judge Factory -> Callable
safety_judge Factory -> Callable
toxicity_judge Factory -> Callable
tool_selection_judge Factory -> Callable
code_correctness_judge Factory -> Callable
plan_adherence_judge Factory -> Callable
create_judge Factory function
RAG rag_groundedness_judge Factory -> Callable
rag_helpfulness_judge Factory -> Callable
rag_retrieval_relevance_judge Factory -> Callable
Trajectory create_trajectory_evaluator Factory function
create_trajectory_judge Factory function
Datasets run_eval Function
arun_eval Async function
run_comparative_eval Function
EvalRunner Class (fluent builder)
create_dataset Function
load_examples Function
Online evaluate_run Function
create_rule Function
create_rule_config Function
OnlineEvaluator Class
Integrations evaluate_chain Function
aevaluate_chain Async function
create_chain_target Function
create_async_chain_target Function
evaluate_graph Function
aevaluate_graph Async function
create_graph_target Function
create_async_graph_target Function
create_remote_target Function
create_async_remote_target Async function
Config LefConfig Class (Pydantic model)
Assertions assert_scores Function (raises EvalAssertionError)
check_scores Function (returns report dict)
EvalAssertionError Exception class
Export export_json Function
export_csv Function
export_junit_xml Function
export_results Function (auto-detects format)
format_results_table Function
Git Context get_git_context Function
build_experiment_metadata Function
Baselines save_baseline Function
load_baseline Function
list_baselines Function
delete_baseline Function
compare_results Function
compare_experiments Function
ComparisonReport Class
Cache ResultCache Class
CI post_github_comment Function
post_azdo_comment Function
Monitor MonitorDaemon Class
Red-Team run_redteam Function
injection_resistance_check Callable
pii_leak_check Callable
refusal_check Callable
Watch watch_and_run Function
Synthetic generate_from_docs Function
generate_from_traces Function
generate_adversarial Function
diversify_dataset Function

Examples

See the examples/ directory in the repository.

Development

git clone https://github.com/bogware/lef.git
cd lef
uv sync --extra dev --extra all

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

License

MIT

