LangSmith Evaluation Framework - A plug-and-play evaluation system for LangChain, LangGraph, and LangSmith projects
Project description
LEF - LangSmith Evaluation Framework
A plug-and-play evaluation system for LangChain, LangGraph, and LangSmith projects. LEF wraps langsmith, openevals, and agentevals into a unified framework with built-in QA/CI support.
20+ pre-built evaluators | Local datasets (no LangSmith required) | CI/CD gating | 3 lines to get started
The PyPI package name is lefx. Install with pip install lefx and import with import lef.
Quick Reference (AI-Friendly)
Copy-paste patterns for Claude Code, Cursor, or any AI coding assistant:
# Install: pip install lefx[all]

# --- Evaluate a function against a local dataset ---
from lef import run_eval, exact_match, correctness_judge, check_scores

results = run_eval(
    target=lambda inputs: {"answer": my_chain.invoke(inputs)},
    data="path/to/dataset.yaml",   # or a LangSmith dataset name
    evaluators=[exact_match, correctness_judge()],
    upload_results=False,          # True to upload to LangSmith
)
report = check_scores(results, {"correctness": 0.8})
print("PASS" if all(v["passed"] for v in report.values()) else "FAIL")

# --- Test a deployed HTTP endpoint ---
from lef import create_remote_target, run_eval, export_markdown, correctness_judge

target = create_remote_target(
    "https://my-endpoint.com/invoke",
    headers={"Authorization": "Bearer ..."},
    input_mapper=lambda inputs: {"query": inputs["question"]},
    output_mapper=lambda resp: {"answer": resp["response"]},
)
results = run_eval(target, data="qa_data.yaml", evaluators=[correctness_judge()])
export_markdown(results, "report.md", thresholds={"correctness": 0.8})

# --- Create a custom scorer ---
from lef import scorer

@scorer(key="is_polite")
def is_polite(*, outputs, **kwargs) -> bool:
    return any(w in outputs.get("answer", "").lower() for w in ["please", "thank", "sorry"])

# --- CLI equivalents ---
# lef run eval_config.yaml --output report.md --threshold correctness=0.8
# lef qa https://endpoint/invoke --data data.yaml --output report.md
YAML dataset format
# dataset.yaml — each entry has inputs + expected outputs
- inputs:
    question: "What is Python?"
  outputs:
    answer: "A programming language"
- inputs:
    question: "What is LangChain?"
  outputs:
    answer: "A framework for LLM applications"
Feature Highlights
Production-readiness features, CI/CD integration, Markdown reports, and adversarial testing:
| Feature | Description | Section |
|---|---|---|
| Result Export | JSON, CSV, JUnit XML export for CI artifacts | Result Export |
| Baseline Comparison | Save/compare baselines, detect regressions | Baseline Comparison |
| CI/CD Integration | GitHub and Azure DevOps PR comments, JUnit XML | CI/CD Integration |
| Watch Mode | Re-run evals on file changes during development | Watch Mode |
| Result Caching | Cache target outputs to skip re-invocation | Result Caching |
| QA Testing | Test deployed endpoints from the CLI | QA Testing |
| Red-Team Testing | Adversarial evaluation across 6 attack categories | Red-Team Testing |
| Synthetic Data | Generate datasets from docs, traces, or seeds | Synthetic Data Generation |
| Production Monitoring | Continuous evaluation daemon for live projects | Production Monitoring |
| Pytest Plugin | lef_eval fixture and @pytest.mark.lef marker | Pytest Plugin |
| Remote Targets | Evaluate HTTP endpoints without wrapper code | Remote Targets |
| Dataset Management | Pull, push, diff, generate from the CLI | CLI Reference |
| Git Context | Auto-tag experiments with branch/commit metadata | Git Context |
| 7 CLI Subcommands | run, compare, baseline, qa, monitor, redteam, dataset | CLI Reference |
Installation
pip install lefx
# Or with uv
uv add lefx
Note: The PyPI package name is lefx, but the import remains import lef.
Optional extras:
pip install "lefx[langgraph]" # LangGraph support (langgraph>=0.2.0)
pip install "lefx[agents]" # Agent trajectory evaluators (agentevals>=0.0.9)
pip install "lefx[remote]" # Remote HTTP target support (httpx>=0.27.0)
pip install "lefx[all]" # Everything: LangGraph + agents + remote + OpenAI + Anthropic SDKs
| Extra | What it adds | When you need it |
|---|---|---|
| langgraph | langgraph>=0.2.0 | Evaluating compiled StateGraph agents |
| agents | agentevals>=0.0.9 | Trajectory evaluators (create_trajectory_evaluator, create_trajectory_judge) |
| remote | httpx>=0.27.0 | create_remote_target(), lef qa, remote HTTP evaluation |
| all | All of the above + langchain-openai, langchain-anthropic | Full-featured setup |
Requires Python 3.11+.
Quick Start
1. Evaluate with built-in scorers (no API keys needed)
from lef import run_eval, exact_match, contains
results = run_eval(
    target=my_chain.invoke,
    data="my-langsmith-dataset",
    evaluators=[exact_match, contains],
)
2. Add LLM judges (needs OPENAI_API_KEY or ANTHROPIC_API_KEY)
from lef import run_eval, correctness_judge, safety_judge, exact_match
results = run_eval(
    target=my_chain.invoke,
    data="my-dataset",
    evaluators=[correctness_judge(), safety_judge(), exact_match],
)
3. Gate CI with thresholds
from lef import assert_scores
assert_scores(results, {
    "correctness": 0.8,
    "safety": 0.95,
    "exact_match": 0.7,
})
# Raises EvalAssertionError if any threshold fails
4. Use local data files (no LangSmith account needed)
# tests/eval_data/examples.yaml
- inputs:
    question: "What is the capital of France?"
  outputs:
    answer: "Paris"
- inputs:
    question: "What is 2+2?"
  outputs:
    answer: "4"
from lef import load_examples, run_eval, exact_match
examples = load_examples("tests/eval_data/examples.yaml")
results = run_eval(
    target=my_app,
    data=examples,
    evaluators=[exact_match],
    upload_results=False,  # Fully offline
)
Custom Scorers
Decorator (simplest)
from lef import scorer
@scorer(key="word_count")
def word_count(*, inputs, outputs, **kwargs):
count = len(outputs.get("answer", "").split())
return min(count / 100, 1.0) # Return float 0-1
@scorer(key="has_answer")
def has_answer(*, inputs, outputs, **kwargs):
return bool(outputs.get("answer")) # Return bool (True=1.0, False=0.0)
Async scorer
@scorer(key="api_check")
async def api_check(*, inputs, outputs, **kwargs):
result = await some_async_validation(outputs["answer"])
return result # bool, float, int, dict, or EvalResult
Factory (dynamic creation)
from lef import create_scorer
def my_logic(*, inputs, outputs, **kwargs):
    return len(outputs.get("answer", "")) > 10

length_check = create_scorer("min_length", my_logic)
Class-based (most flexible)
from lef import BaseEvaluator, EvalResult
class MyEvaluator(BaseEvaluator):
    key = "my_metric"

    def evaluate(self, *, inputs, outputs, reference_outputs=None, **kwargs):
        score = 1.0 if "expected" in outputs.get("answer", "") else 0.0
        return EvalResult(key=self.key, score=score, comment="Checked for keyword")
Composite scorer (combine multiple)
from lef import create_composite_scorer, exact_match, contains
quality = create_composite_scorer(
    "quality",
    [exact_match, contains],
    aggregation="mean",  # Also: "min", "max", "all_pass"
)
Custom LLM Judges
from lef import create_judge
# Custom prompt with {inputs}, {outputs}, {reference_outputs} placeholders
tone_judge = create_judge(
    prompt="""Evaluate whether the response has a professional tone.

User input: {inputs}
Response: {outputs}

Return true if professional, false otherwise.""",
    model="openai:gpt-4o",
    feedback_key="tone",
)
Pre-Built Evaluators
Scorers (rule-based, no API keys)
| Scorer | What it does |
|---|---|
| exact_match | Exact string match (whitespace-trimmed) |
| contains | Case-insensitive substring check |
| regex_match | Regex pattern matching (reference_outputs["pattern"]) |
| json_match | Field-by-field JSON comparison, returns 0.0-1.0 |
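For example, here is a minimal sketch of running a rule-based scorer completely offline against an inline dataset. The my_bot target and the email regex are illustrative placeholders; the reference_outputs["pattern"] convention for regex_match comes from the table above.

from lef import run_eval, regex_match

# Hypothetical stand-in target; substitute your own chain or function.
def my_bot(inputs):
    return {"answer": "You can reach support at support@example.com"}

examples = [
    {
        "inputs": {"question": "How do I contact support?"},
        # regex_match reads the expected pattern from reference_outputs["pattern"]
        "outputs": {"pattern": r"[\w.+-]+@[\w.-]+"},
    },
]

results = run_eval(
    target=my_bot,
    data=examples,
    evaluators=[regex_match],
    upload_results=False,  # rule-based scorers run fully offline
)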
LLM Judges (need API keys)
| Judge | What it evaluates |
|---|---|
| correctness_judge() | Output correctness vs reference |
| conciseness_judge() | Response conciseness |
| hallucination_judge() | Hallucinations beyond inputs/context |
| answer_relevance_judge() | Answer relevance to the question |
| faithfulness_judge() | Faithfulness to source context |
| response_quality_judge() | Overall quality (correctness, completeness, clarity) |
| safety_judge() | Harmful, biased, or dangerous content |
| toxicity_judge() | Toxic or offensive content |
| tool_selection_judge() | Agent tool selection accuracy |
| code_correctness_judge() | Code correctness |
| plan_adherence_judge() | Adherence to a specified plan |
RAG Judges
| Judge | What it evaluates |
|---|---|
| rag_groundedness_judge() | Response grounded in retrieved context |
| rag_helpfulness_judge() | RAG response helpfulness |
| rag_retrieval_relevance_judge() | Retrieved document relevance |
Agent Trajectory
from lef import create_trajectory_evaluator, create_trajectory_judge
# Match-based (needs reference trajectory)
traj_eval = create_trajectory_evaluator(match_mode="superset")
# Options: "strict", "unordered", "subset", "superset"
# LLM-based (no reference needed)
traj_judge = create_trajectory_judge(model="openai:gpt-4o")
Choosing a judge model
All judges accept a model parameter:
from lef import JudgeModel, correctness_judge
judge = correctness_judge(model=JudgeModel.GPT_4O) # Default
judge = correctness_judge(model=JudgeModel.GPT_4O_MINI) # Faster/cheaper
judge = correctness_judge(model=JudgeModel.CLAUDE_SONNET) # Anthropic
judge = correctness_judge(model=JudgeModel.CLAUDE_HAIKU) # Fast Anthropic
judge = correctness_judge(model="openai:gpt-4.1") # Any string
Dataset & Runner Patterns
Fluent runner
from lef import EvalRunner, correctness_judge, exact_match
runner = EvalRunner(
    dataset="qa-examples",     # LangSmith dataset name or list of dicts
    experiment_prefix="v1",
    description="Baseline evaluation",
    num_repetitions=3,         # For statistical significance
    upload_results=True,
)
runner.add_evaluators([correctness_judge(), exact_match])
results = runner.run(target=my_chain.invoke)
# Async
results = await runner.arun(target=my_chain.ainvoke)
Create a LangSmith dataset programmatically
from lef import create_dataset
create_dataset("qa-examples", examples=[
    {"inputs": {"question": "Capital of France?"}, "outputs": {"answer": "Paris"}},
    {"inputs": {"question": "What is 2+2?"}, "outputs": {"answer": "4"}},
])
Load local files (YAML, JSON, CSV)
from lef import load_examples
# YAML / JSON (list of {inputs, outputs} dicts)
examples = load_examples("tests/data/examples.yaml")
# CSV with column splitting
examples = load_examples(
    "tests/data/cases.csv",
    input_keys=["question"],
    output_keys=["answer"],
)
A/B comparison
from lef import run_comparative_eval
results = run_comparative_eval(
    experiments=["v1-gpt4o", "v2-claude"],
    evaluators=[my_preference_judge],
)
LangChain Integration
from lef import evaluate_chain, aevaluate_chain, correctness_judge

results = evaluate_chain(
    my_chain,
    data="qa-dataset",
    evaluators=[correctness_judge()],
    output_mapper=lambda x: {"answer": x.content},  # Optional
)
# Async
results = await aevaluate_chain(my_chain, data="qa-dataset", evaluators=[...])
LangGraph Integration
from lef import evaluate_graph, aevaluate_graph, correctness_judge

results = evaluate_graph(
    app,  # Compiled StateGraph
    data="agent-dataset",
    evaluators=[correctness_judge()],
    input_mapper=lambda x: {"messages": [("user", x["question"])]},
    output_mapper=lambda x: {"answer": x["messages"][-1].content},
)
# Async
results = await aevaluate_graph(app, data="agent-dataset", evaluators=[...])
Online / Production Monitoring
from lef import OnlineEvaluator, evaluate_run, safety_judge, response_quality_judge
# Evaluate a single run by ID
results = evaluate_run("run-uuid-here", evaluators=[safety_judge()])
# Monitor a project
online = OnlineEvaluator(project_name="my-chatbot-production")
online.add_evaluator(safety_judge())
online.add_evaluator(response_quality_judge())
results = online.evaluate_recent(limit=50, run_type="chain")
Result Export
Export evaluation results to JSON, CSV, JUnit XML, or Markdown for CI artifacts, reports, or sharing.
from lef import run_eval, export_json, export_csv, export_junit_xml, export_markdown, export_results
results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
# Export to specific formats
export_json(results, "results.json")
export_csv(results, "results.csv")
export_junit_xml(results, "results.xml")
export_markdown(results, "results.md", thresholds={"correctness": 0.8}, metadata={"env": "staging"})
# Auto-detect format from file extension
export_results(results, "results.json") # JSON
export_results(results, "results.csv") # CSV
export_results(results, "results.xml") # JUnit XML
export_results(results, "results.md") # Markdown report
From the CLI:
lef run eval_suite.yaml --output results.json
lef run eval_suite.yaml --output results.xml # JUnit XML for CI dashboards
lef run eval_suite.yaml --output results.md # Markdown report
You can also format results as a table for terminal output:
from lef import format_results_table
print(format_results_table(results))
Baseline Comparison
Save evaluation results as named baselines, then compare against them to detect regressions across branches or releases.
Save a baseline
from lef import run_eval, save_baseline
results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
save_baseline("main", results, metadata={"branch": "main", "version": "1.0"})
# Saved to .lef/baselines/main.json
Compare against a baseline
from lef import compare_results, load_baseline
baseline = load_baseline("main")
current_results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
report = compare_results(baseline, current_results, tolerance=0.05)
# report.regressions -> list of metrics that dropped by more than the tolerance
# report.improvements -> list of metrics that improved
Manage baselines
from lef import list_baselines, delete_baseline
baselines = list_baselines() # List all saved baselines
delete_baseline("old-baseline") # Remove a saved baseline
From the CLI:
# Save results as a baseline
lef run eval_suite.yaml --save-baseline main
# Compare against a baseline
lef run eval_suite.yaml --baseline main
# List and manage baselines
lef baseline list
lef baseline delete old-baseline
# Compare two baselines directly
lef compare --baseline main --current feature-branch --tolerance 0.05
CI/CD Integration
LEF integrates with CI/CD pipelines to post evaluation results as PR comments and generate JUnit XML reports.
GitHub Actions
# .github/workflows/eval.yaml
name: Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "lefx[all]"
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          lef run eval_suite.yaml \
            --output results.xml \
            --save-baseline ${{ github.head_ref }} \
            --baseline main \
            --github-comment \
            --threshold correctness=0.8
      - name: Publish JUnit results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Eval Results
          path: results.xml
          reporter: java-junit
Azure DevOps
# azure-pipelines.yaml
steps:
  - script: |
      pip install "lefx[all]"
      lef run eval_suite.yaml \
        --output results.xml \
        --azdo-comment \
        --threshold correctness=0.8
    env:
      OPENAI_API_KEY: $(OPENAI_API_KEY)
  - task: PublishTestResults@2
    inputs:
      testResultsFiles: results.xml
      testRunTitle: "LEF Evaluations"
Programmatic PR comments
from lef import post_github_comment, post_azdo_comment, format_results_table
body = format_results_table(results)
# GitHub (auto-detects repo/PR from GITHUB_REPOSITORY and GITHUB_REF)
post_github_comment(body, update_existing=True)
# Azure DevOps
post_azdo_comment(body)
Watch Mode
Re-run evaluations automatically when source files, config files, or datasets change. Useful for iterative development.
lef run eval_suite.yaml --watch
Watch mode monitors .py, .yaml, .yml, .json, and .csv files in the project directory. When a change is detected, the evaluation suite re-runs automatically.
Programmatic usage:
from lef.watch import watch_and_run
watch_and_run(
    run_fn=my_eval_function,
    watch_paths=["src/", "evals/", "eval_suite.yaml"],
)
Combine with --cache to avoid re-calling the target when only evaluators change:
lef run eval_suite.yaml --watch --cache
Result Caching
Cache target function outputs on disk to avoid expensive re-invocations when iterating on evaluators. Uses content-addressable hashing of inputs for cache keys.
from lef import ResultCache, run_eval
cache = ResultCache(ttl_seconds=3600) # Cache expires after 1 hour
cached_target = cache.wrap(my_chain.invoke)
# First run: calls my_chain.invoke for each input
run_eval(target=cached_target, data=examples, evaluators=[judge1])
# Second run: uses cached outputs, only re-evaluates
run_eval(target=cached_target, data=examples, evaluators=[judge2])
Cache is stored in .lef/cache/ by default. From the CLI:
lef run eval_suite.yaml --cache
QA Testing
Test deployed HTTP endpoints against datasets with pass/fail gating. Works with LangSmith Deployments, LangGraph Platform, LangServe, or any REST API.
From the CLI
# Test a deployed endpoint with a dataset
lef qa https://my-api.example.com/invoke \
--data tests/eval_data/examples.yaml \
--evaluators correctness_judge safety_judge \
--threshold correctness=0.8 \
--threshold safety=0.95
# Generate a Markdown report
lef qa https://my-api.example.com/invoke \
--data my-langsmith-dataset \
--output results/qa_report.md
# Add custom headers (e.g., for authentication)
lef qa https://my-api.example.com/invoke \
--data my-langsmith-dataset \
-H "Authorization: Bearer $API_KEY" \
--timeout 120
# Export results and skip LangSmith upload
lef qa https://my-api.example.com/invoke \
--data examples.yaml \
--output results.json \
--no-upload
From Python
from lef import (
    create_remote_target, run_eval, export_markdown,
    check_scores, correctness_judge, safety_judge,
)

# Point at your deployed endpoint
target = create_remote_target(
    "https://my-api.example.com/invoke",
    headers={"Authorization": f"Bearer {API_KEY}"},
    input_mapper=lambda inputs: {"query": inputs["question"]},
    output_mapper=lambda resp: {"answer": resp["response"]},
)

# Run evaluation
results = run_eval(
    target,
    data="my-qa-dataset",  # LangSmith dataset or local file path
    evaluators=[correctness_judge(), safety_judge()],
    upload_results=False,
)

# Generate Markdown report with pass/fail
export_markdown(results, "qa_report.md", thresholds={"correctness": 0.8, "safety": 0.95})

# Check thresholds programmatically
report = check_scores(results, {"correctness": 0.8, "safety": 0.95})
all_passed = all(v["passed"] for v in report.values())
See examples/qa_endpoint_eval.py for a complete working example.
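In a CI job, a small follow-up sketch (plain Python, not a LEF API) can turn the threshold report into the process exit code so the pipeline step fails when any metric misses its target:

import sys

# Exit non-zero so the CI step is marked failed when a threshold is missed.
sys.exit(0 if all_passed else 1)

Alternatively, assert_scores (see Threshold Assertions) raises EvalAssertionError, which also produces a non-zero exit when left uncaught.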
Red-Team Testing
Run adversarial evaluations across 6 attack categories to test your system's safety and robustness.
Attack categories
| Category | Description |
|---|---|
| prompt_injection | Attempts to override system instructions |
| jailbreak | Attempts to bypass safety guardrails |
| pii_extraction | Attempts to extract private or sensitive information |
| hallucination_inducement | Inputs designed to induce hallucinated responses |
| toxicity | Tests whether the system generates toxic content |
| bias | Tests for biased responses across demographics |
From the CLI
# Run all categories against a target
lef redteam --target myapp.chain:invoke
# Test specific categories with more examples
lef redteam --target myapp.chain:invoke \
--categories prompt_injection,jailbreak,pii_extraction \
--count 10
# Use seed examples only (no LLM generation)
lef redteam --target myapp.chain:invoke --seed-only
# Use a config file for target definition
lef redteam eval_config.yaml --categories toxicity,bias
Programmatic usage
from lef import run_redteam
report = run_redteam(
    target=my_chain.invoke,
    categories=["prompt_injection", "jailbreak", "pii_extraction"],
    count_per_category=10,
    model="openai:gpt-4o",
    upload_results=False,
)
# report contains per-category scores and detailed results
Built-in red-team scorers
from lef import injection_resistance_check, pii_leak_check, refusal_check
# Use individually as evaluators
results = run_eval(
    target=my_chain.invoke,
    data=adversarial_examples,
    evaluators=[injection_resistance_check, pii_leak_check, refusal_check],
)
Synthetic Data Generation
Generate evaluation datasets from documents, production traces, or seed examples using LLM-powered synthesis.
From documents
from lef import generate_from_docs
examples = generate_from_docs(
    "docs/product_guide.md",
    count=10,
    style="factual",  # "factual", "reasoning", or "conversational"
    model="openai:gpt-4o",
)
# Returns a list of {"inputs": {"question": ...}, "outputs": {"answer": ...}} dicts
From production traces
from lef import generate_from_traces
examples = generate_from_traces(
    project_name="my-chatbot",
    limit=100,
    model="openai:gpt-4o",
)
Generate adversarial examples
from lef import generate_adversarial
adversarial = generate_adversarial(
    description="A customer support chatbot for a SaaS product",
    seed_examples=[{"question": "How do I reset my password?"}],
    count=20,
)
Diversify an existing dataset
from lef import diversify_dataset
expanded = diversify_dataset(
    existing_examples,
    count=50,
    model="openai:gpt-4o",
)
From the CLI
# Generate from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml
# Generate from a directory of documents
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml
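If you generate examples from Python rather than the CLI, a small sketch like the following persists them in the same inputs/outputs format that load_examples and run_eval accept (PyYAML is assumed available, since LEF datasets are YAML files; the file paths are illustrative):

import yaml  # PyYAML, assumed installed

from lef import generate_from_docs, load_examples

examples = generate_from_docs("docs/product_guide.md", count=10, style="factual")

# Write the generated examples as a reusable dataset file.
with open("tests/eval_data/generated.yaml", "w") as f:
    yaml.safe_dump(examples, f, sort_keys=False)

# Later (e.g. in CI), reload them like any hand-written dataset.
reloaded = load_examples("tests/eval_data/generated.yaml")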
Production Monitoring
Run a long-lived daemon that continuously polls a LangSmith project for new runs and evaluates them. Useful for production monitoring and alerting.
from lef import MonitorDaemon
from lef.judges import safety_judge, correctness_judge
monitor = MonitorDaemon(
    project_name="my-chatbot",
    evaluators=[safety_judge(), correctness_judge()],
    thresholds={"safety": 0.9, "correctness": 0.7},
    poll_interval=60,   # seconds between polls
    batch_size=20,      # runs per poll
    run_type="chain",   # filter by run type
)
monitor.add_alert_handler(lambda alert: print(f"ALERT: {alert}"))
monitor.run()  # Blocks until interrupted (Ctrl+C)
From the CLI:
lef monitor \
--project my-chatbot \
--evaluators safety correctness \
--threshold safety=0.9 \
--threshold correctness=0.7 \
--interval 60 \
--batch-size 20 \
--run-type chain
Pytest Plugin
Run LEF evaluations as pytest test cases. The plugin provides a lef_eval fixture and a @pytest.mark.lef marker.
Using the lef_eval fixture
# tests/test_evals.py
from lef import exact_match, correctness_judge
def test_qa_correctness(lef_eval):
    results = lef_eval(
        target=my_chain.invoke,
        data="tests/eval_data/examples.yaml",
        evaluators=[exact_match, correctness_judge()],
        thresholds={"correctness": 0.8, "exact_match": 0.9},
    )
    # Thresholds are automatically asserted -- test fails if any threshold is not met
Using the @pytest.mark.lef marker
import pytest
@pytest.mark.lef(config="eval_suite.yaml")
def test_my_eval():
    pass  # Eval is run automatically by the marker
Running config files as test cases
# Run eval configs as pytest test cases
pytest --lef-config eval_suite.yaml --lef-config another_suite.yaml
# Disable LangSmith upload during test runs
pytest --lef-config eval_suite.yaml --lef-no-upload
# Override experiment prefix
pytest --lef-config eval_suite.yaml --lef-prefix "ci-test"
Remote Targets
Evaluate any HTTP endpoint -- LangServe, LangGraph Platform, or plain REST APIs -- without writing wrapper code. Requires the [remote] extra.
from lef import create_remote_target, create_async_remote_target, run_eval, arun_eval
# Basic REST endpoint
target = create_remote_target("https://my-api.example.com/invoke")
# LangGraph Platform deployment with custom mappers
target = create_remote_target(
    "https://my-assistant.langsmith.dev/runs/stream",
    headers={"x-api-key": "..."},
    input_mapper=lambda inputs: {
        "input": {"messages": [{"role": "user", "content": inputs["question"]}]},
    },
    output_mapper=lambda resp: {
        "answer": resp["output"]["messages"][-1]["content"],
    },
    timeout=120.0,
)
results = run_eval(target=target, data=examples, evaluators=[...])
# Async version
async_target = create_async_remote_target("https://my-api.example.com/invoke")
results = await arun_eval(target=async_target, data=examples, evaluators=[...])
Git Context
LEF automatically detects git branch, commit SHA, author, and other metadata to tag evaluation experiments. This enables branch comparison workflows and traceability in LangSmith.
from lef import get_git_context, build_experiment_metadata
# Get current git context
ctx = get_git_context()
# {'branch': 'feature/new-prompt', 'commit_sha': 'abc123...', 'author': '...', ...}
# Build experiment metadata (includes git context + CI detection)
metadata = build_experiment_metadata()
# Automatically used by run_eval when upload_results=True
Git context is auto-detected from the git repository and from CI environment variables (GitHub Actions, Azure DevOps, GitLab CI, Jenkins).
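As a small sketch, the auto-detected context can also be attached to a saved baseline so regressions stay traceable to a branch and commit. It combines the save_baseline API from Baseline Comparison with build_experiment_metadata; my_chain and the dataset path are placeholders from earlier sections.

from lef import build_experiment_metadata, exact_match, run_eval, save_baseline

results = run_eval(
    target=my_chain.invoke,
    data="tests/eval_data/examples.yaml",
    evaluators=[exact_match],
    upload_results=False,
)

# Attach branch/commit/CI metadata to the baseline for later comparison.
save_baseline("feature-branch", results, metadata=build_experiment_metadata())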
CLI Reference
LEF provides 7 CLI subcommands:
lef run Run evaluation suite from config file(s)
lef compare Compare two baselines or experiments
lef baseline Manage saved baselines (list, delete)
lef qa QA test a deployed HTTP endpoint
lef monitor Continuously monitor production runs
lef redteam Run adversarial red-team evaluation
lef dataset Dataset management (pull, push, diff, generate)
lef run
Run an evaluation suite defined in a YAML config file.
# Basic usage
lef run eval_suite.yaml
# Override prefix and thresholds
lef run eval_suite.yaml --prefix "v2.1" --threshold correctness=0.85
# Export results, cache outputs, and compare against a baseline
lef run eval_suite.yaml \
--output results.xml \
--cache \
--baseline main \
--save-baseline feature-branch
# Watch mode with caching
lef run eval_suite.yaml --watch --cache
# Post results as a GitHub PR comment
lef run eval_suite.yaml --github-comment
# Post results as an Azure DevOps PR comment
lef run eval_suite.yaml --azdo-comment
# Local-only (no LangSmith upload)
lef run eval_suite.yaml --no-upload
# Merge multiple config files
lef run base_config.yaml override_config.yaml
Config file format:
# eval_suite.yaml
target: myapp.chain:invoke               # Dotted import path to your target
dataset: tests/eval_data/examples.yaml   # Local file or LangSmith dataset name
evaluators:
  - exact_match                          # Built-in scorer
  - correctness_judge                    # Built-in judge (auto-instantiated)
  - myapp.evals:custom_scorer            # Custom import path
experiment_prefix: "regression-test"
thresholds:
  correctness: 0.8
  safety: 0.95
  exact_match: 0.7
lef compare
Compare two baselines or experiments to detect regressions.
lef compare --baseline main --current feature-branch
lef compare --baseline main --current feature-branch --tolerance 0.05
lef compare --baseline main --current feature-branch --output comparison.json
lef baseline
Manage saved baselines.
lef baseline list # List all saved baselines
lef baseline delete my-baseline # Delete a saved baseline
lef qa
Test a deployed HTTP endpoint against a dataset. See QA Testing for details.
lef qa https://my-api.example.com/invoke \
--data examples.yaml \
--evaluators correctness safety \
--threshold correctness=0.8 \
-H "Authorization: Bearer $TOKEN" \
--timeout 120 \
--output results.json
lef monitor
Continuously monitor production runs. See Production Monitoring for details.
lef monitor --project my-chatbot \
--evaluators safety correctness \
--threshold safety=0.9 \
--interval 60 --batch-size 20 --run-type chain
lef redteam
Run adversarial red-team evaluations. See Red-Team Testing for details.
lef redteam --target myapp.chain:invoke \
--categories prompt_injection,jailbreak \
--count 10 --seed-only --no-upload
lef dataset
Dataset management commands.
# Pull a LangSmith dataset to a local file
lef dataset pull my-dataset --output my-dataset.yaml
# Push a local file to LangSmith
lef dataset push examples.yaml --name "my-dataset" --description "QA examples"
# Diff two local dataset files
lef dataset diff examples_v1.yaml examples_v2.yaml
# Generate a synthetic dataset from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml
Generation styles: factual (fact-based Q&A), reasoning (multi-step), conversational (natural dialogue).
Threshold Assertions
Raise on failure (CI/CD)
from lef import assert_scores, EvalAssertionError
try:
    assert_scores(results, {
        "correctness": 0.8,
        "safety": 0.95,
    })
except EvalAssertionError as e:
    print(f"Failed: {e}")
    print(f"Details: {e.failures}")  # List of {key, actual, threshold}
Non-raising check
from lef import check_scores
report = check_scores(results, {"correctness": 0.8, "safety": 0.95})
for key, info in report.items():
    status = "PASS" if info["passed"] else "FAIL"
    print(f"  {key}: {status} ({info['actual']:.2f} vs {info['threshold']:.2f})")
Configuration
from lef import LefConfig, JudgeModel
# From environment (recommended)
config = LefConfig.from_env()
# Or explicit
config = LefConfig(
    langsmith_api_key="lsv2_...",
    langsmith_project="my-project",
    default_judge_model=JudgeModel.CLAUDE_SONNET,
    max_concurrency=10,
)
config.apply()  # Sets environment variables
Environment variables:
export LANGCHAIN_API_KEY=lsv2_... # LangSmith API key
export LANGCHAIN_PROJECT=my-project # LangSmith project name
export LANGCHAIN_TRACING_V2=true # Enable tracing
export OPENAI_API_KEY=sk-... # For OpenAI judges
export ANTHROPIC_API_KEY=sk-ant-... # For Anthropic judges
Walkthrough: QA-ing a Prompt Change
Scenario: You changed a prompt in your LangGraph agent and need to verify nothing broke.
Step 1: Create test data
# evals/golden_set.yaml
- inputs:
    question: "Summarize the key points of this document"
    context: "The report shows Q3 revenue grew 15% YoY..."
  outputs:
    answer: "Q3 revenue grew 15% year-over-year"
- inputs:
    question: "What action items came out of the meeting?"
    context: "Action items: 1) Update the roadmap 2) Schedule design review"
  outputs:
    answer: "Update the roadmap and schedule a design review"
Step 2: Write the eval script
# evals/test_prompt_change.py
from lef import load_examples, run_eval, assert_scores, exact_match, correctness_judge, scorer
from my_project.agent import app
@scorer(key="mentions_key_facts")
def mentions_key_facts(*, inputs, outputs, reference_outputs, **kwargs):
ref = reference_outputs.get("answer", "").lower()
out = outputs.get("answer", "").lower()
keywords = [w for w in ref.split() if len(w) > 4]
if not keywords:
return True
return sum(1 for kw in keywords if kw in out) / len(keywords)
def target(inputs):
result = app.invoke({
"messages": [("user", inputs["question"])],
"context": inputs.get("context", ""),
})
return {"answer": result["messages"][-1].content}
examples = load_examples("evals/golden_set.yaml")
results = run_eval(
target=target,
data=examples,
evaluators=[exact_match, mentions_key_facts, correctness_judge()],
upload_results=False,
experiment_prefix="prompt-v2",
)
assert_scores(results, {
"exact_match": 0.5,
"mentions_key_facts": 0.7,
"correctness": 0.8,
})
print("All QA checks passed!")
Step 3: Run it
# Local
python evals/test_prompt_change.py
# Or via CLI
lef run evals/eval_suite.yaml --no-upload
# CI: exits non-zero on failure
lef run evals/eval_suite.yaml --threshold correctness=0.8
Step 4: Add to CI
# .github/workflows/eval.yaml
- name: Run prompt regression tests
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    pip install lefx
    lef run evals/eval_suite.yaml --threshold correctness=0.8
EvalResult Anatomy
All evaluators return EvalResult, a dict subclass compatible with LangSmith:
from lef import EvalResult
result = EvalResult(
    key="my_metric",           # Metric name
    score=0.85,                # float (0-1) or bool
    comment="Looks good",      # Optional explanation
    metadata={"details": {}},  # Optional metadata
)
# Dict-compatible (LangSmith requires this)
result["key"] # "my_metric"
result["score"] # 0.85
# Property access
result.key # "my_metric"
result.score # 0.85
result.comment # "Looks good"
result.metadata # {"details": {}}
API Reference
Full public API (85 exports)
| Category | Export | Type |
|---|---|---|
| Core | EvalResult | Class (dict subclass) |
| | EvalResultBatch | Class (Pydantic model) |
| | BaseEvaluator | Abstract class |
| | AsyncBaseEvaluator | Abstract class |
| | JudgeModel | Enum (GPT_4O, GPT_4O_MINI, CLAUDE_SONNET, CLAUDE_HAIKU) |
| | scorer | Decorator |
| | evaluator | Decorator (alias for scorer) |
| Scorers | exact_match | Callable |
| | contains | Callable |
| | regex_match | Callable |
| | json_match | Callable |
| | mean_score | Callable |
| | pass_rate | Callable |
| | create_scorer | Factory function |
| | create_composite_scorer | Factory function |
| Judges | correctness_judge | Factory -> Callable |
| | conciseness_judge | Factory -> Callable |
| | hallucination_judge | Factory -> Callable |
| | answer_relevance_judge | Factory -> Callable |
| | faithfulness_judge | Factory -> Callable |
| | response_quality_judge | Factory -> Callable |
| | safety_judge | Factory -> Callable |
| | toxicity_judge | Factory -> Callable |
| | tool_selection_judge | Factory -> Callable |
| | code_correctness_judge | Factory -> Callable |
| | plan_adherence_judge | Factory -> Callable |
| | create_judge | Factory function |
| RAG | rag_groundedness_judge | Factory -> Callable |
| | rag_helpfulness_judge | Factory -> Callable |
| | rag_retrieval_relevance_judge | Factory -> Callable |
| Trajectory | create_trajectory_evaluator | Factory function |
| | create_trajectory_judge | Factory function |
| Datasets | run_eval | Function |
| | arun_eval | Async function |
| | run_comparative_eval | Function |
| | EvalRunner | Class (fluent builder) |
| | create_dataset | Function |
| | load_examples | Function |
| Online | evaluate_run | Function |
| | create_rule | Function |
| | create_rule_config | Function |
| | OnlineEvaluator | Class |
| Integrations | evaluate_chain | Function |
| | aevaluate_chain | Async function |
| | create_chain_target | Function |
| | create_async_chain_target | Function |
| | evaluate_graph | Function |
| | aevaluate_graph | Async function |
| | create_graph_target | Function |
| | create_async_graph_target | Function |
| | create_remote_target | Function |
| | create_async_remote_target | Async function |
| Config | LefConfig | Class (Pydantic model) |
| Assertions | assert_scores | Function (raises EvalAssertionError) |
| | check_scores | Function (returns report dict) |
| | EvalAssertionError | Exception class |
| Export | export_json | Function |
| | export_csv | Function |
| | export_junit_xml | Function |
| | export_markdown | Function |
| | export_results | Function (auto-detects format) |
| | format_results_table | Function |
| Git Context | get_git_context | Function |
| | build_experiment_metadata | Function |
| Baselines | save_baseline | Function |
| | load_baseline | Function |
| | list_baselines | Function |
| | delete_baseline | Function |
| | compare_results | Function |
| | compare_experiments | Function |
| | ComparisonReport | Class |
| Cache | ResultCache | Class |
| CI | post_github_comment | Function |
| | post_azdo_comment | Function |
| Monitor | MonitorDaemon | Class |
| Red-Team | run_redteam | Function |
| | injection_resistance_check | Callable |
| | pii_leak_check | Callable |
| | refusal_check | Callable |
| Watch | watch_and_run | Function |
| Synthetic | generate_from_docs | Function |
| | generate_from_traces | Function |
| | generate_adversarial | Function |
| | diversify_dataset | Function |
Examples
See the examples/ directory:
- quickstart.py — Get running in minutes
- llm_as_judge.py — LLM-as-Judge patterns
- dataset_eval.py — Dataset-driven evaluation
- custom_scorer.py — Custom scorer patterns
- langgraph_eval.py — LangGraph agent evaluation
- online_eval.py — Production monitoring
Development
git clone https://github.com/bogware/lef.git
cd lef
uv sync --extra dev --extra all
# Run tests
uv run pytest
# Lint
uv run ruff check src/ tests/
License
MIT
Download files
File details
Details for the file lefx-0.3.1.tar.gz.
File metadata
- Download URL: lefx-0.3.1.tar.gz
- Upload date:
- Size: 118.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b57dbc48ca91b99c0329f4943bbabcd6d3511f03449b311c69cfecad1f983924 |
| MD5 | ffc56bd96977960ba357ea1e100c5fd5 |
| BLAKE2b-256 | cb6182573fb7794aaf1cc3c691b217543b543c30bdf20f56f5c5eabe90ebd62a |
Provenance
The following attestation bundles were made for lefx-0.3.1.tar.gz:
Publisher: release.yml on bogware/lef

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lefx-0.3.1.tar.gz
- Subject digest: b57dbc48ca91b99c0329f4943bbabcd6d3511f03449b311c69cfecad1f983924
- Sigstore transparency entry: 1204376540
- Sigstore integration time:
- Permalink: bogware/lef@25830614523babe0354323384344a55d4e3e7c92
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/bogware
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yml@25830614523babe0354323384344a55d4e3e7c92
- Trigger Event: push
File details
Details for the file lefx-0.3.1-py3-none-any.whl.
File metadata
- Download URL: lefx-0.3.1-py3-none-any.whl
- Upload date:
- Size: 81.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 622a6923ab172fb84dadf0172a2a8b8690d5183e2957e97f5fb5444caa2c280a |
| MD5 | e56fe2277f7f2ca66f7781438d38cd3f |
| BLAKE2b-256 | accfaf71fd63e2802c77f5d3cd600e4f9445bf7ffdcb4aed3712c2c629278db9 |
Provenance
The following attestation bundles were made for lefx-0.3.1-py3-none-any.whl:
Publisher: release.yml on bogware/lef

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lefx-0.3.1-py3-none-any.whl
- Subject digest: 622a6923ab172fb84dadf0172a2a8b8690d5183e2957e97f5fb5444caa2c280a
- Sigstore transparency entry: 1204376544
- Sigstore integration time:
- Permalink: bogware/lef@25830614523babe0354323384344a55d4e3e7c92
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/bogware
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yml@25830614523babe0354323384344a55d4e3e7c92
- Trigger Event: push