LangSmith Evaluation Framework - A plug-and-play evaluation system for LangChain, LangGraph, and LangSmith projects
Project description
LEF - LangSmith Evaluation Framework
A plug-and-play evaluation system for LangChain, LangGraph, and LangSmith projects. LEF wraps langsmith, openevals, and agentevals into a unified framework with built-in QA/CI support.
20+ pre-built evaluators | Local datasets (no LangSmith required) | CI/CD gating | 3 lines to get started
The PyPI package name is lefx. Install with pip install lefx and import with import lef.
Quick Reference (AI-Friendly)
Copy-paste patterns for Claude Code, Cursor, or any AI coding assistant:
# Install: pip install lefx[all]

# --- Evaluate a function against a local dataset ---
from lef import run_eval, exact_match, correctness_judge, check_scores

results = run_eval(
    target=lambda inputs: {"answer": my_chain.invoke(inputs)},
    data="path/to/dataset.yaml",   # or a LangSmith dataset name
    evaluators=[exact_match, correctness_judge()],
    upload_results=False,          # True to upload to LangSmith
)
report = check_scores(results, {"correctness": 0.8})
print("PASS" if all(v["passed"] for v in report.values()) else "FAIL")

# --- Test a deployed HTTP endpoint ---
from lef import create_remote_target, run_eval, export_markdown, correctness_judge

target = create_remote_target(
    "https://my-endpoint.com/invoke",
    headers={"Authorization": "Bearer ..."},
    input_mapper=lambda inputs: {"query": inputs["question"]},
    output_mapper=lambda resp: {"answer": resp["response"]},
)
results = run_eval(target, data="qa_data.yaml", evaluators=[correctness_judge()])
export_markdown(results, "report.md", thresholds={"correctness": 0.8})

# --- Create a custom scorer ---
from lef import scorer

@scorer(key="is_polite")
def is_polite(*, outputs, **kwargs) -> bool:
    return any(w in outputs.get("answer", "").lower() for w in ["please", "thank", "sorry"])

# --- CLI equivalents ---
# lef run eval_config.yaml --output report.md --threshold correctness=0.8
# lef qa https://endpoint/invoke --data data.yaml --output report.md
YAML dataset format
# dataset.yaml — each entry has inputs + expected outputs
- inputs:
    question: "What is Python?"
  outputs:
    answer: "A programming language"
- inputs:
    question: "What is LangChain?"
  outputs:
    answer: "A framework for LLM applications"
Feature Highlights
Production-readiness features, CI/CD integration, Markdown reports, and adversarial testing:
| Feature | Description | Section |
|---|---|---|
| Result Export | JSON, CSV, JUnit XML export for CI artifacts | Result Export |
| Baseline Comparison | Save/compare baselines, detect regressions | Baseline Comparison |
| CI/CD Integration | GitHub and Azure DevOps PR comments, JUnit XML | CI/CD Integration |
| Watch Mode | Re-run evals on file changes during development | Watch Mode |
| Result Caching | Cache target outputs to skip re-invocation | Result Caching |
| QA Testing | Test deployed endpoints from the CLI | QA Testing |
| Red-Team Testing | Adversarial evaluation across 6 attack categories | Red-Team Testing |
| Synthetic Data | Generate datasets from docs, traces, or seeds | Synthetic Data Generation |
| Production Monitoring | Continuous evaluation daemon for live projects | Production Monitoring |
| Pytest Plugin | lef_eval fixture and @pytest.mark.lef marker | Pytest Plugin |
| Remote Targets | Evaluate HTTP endpoints without wrapper code | Remote Targets |
| Dataset Management | Pull, push, diff, generate from the CLI | CLI Reference |
| Git Context | Auto-tag experiments with branch/commit metadata | Git Context |
| 7 CLI Subcommands | run, compare, baseline, qa, monitor, redteam, dataset | CLI Reference |
Installation
pip install lefx
# Or with uv
uv add lefx
Note: The PyPI package name is lefx, but the import remains import lef.
Optional extras:
pip install "lefx[langgraph]" # LangGraph support (langgraph>=0.2.0)
pip install "lefx[agents]" # Agent trajectory evaluators (agentevals>=0.0.9)
pip install "lefx[remote]" # Remote HTTP target support (httpx>=0.27.0)
pip install "lefx[all]" # Everything: LangGraph + agents + remote + OpenAI + Anthropic SDKs
| Extra | What it adds | When you need it |
|---|---|---|
| langgraph | langgraph>=0.2.0 | Evaluating compiled StateGraph agents |
| agents | agentevals>=0.0.9 | Trajectory evaluators (create_trajectory_evaluator, create_trajectory_judge) |
| remote | httpx>=0.27.0 | create_remote_target(), lef qa, remote HTTP evaluation |
| all | All of the above + langchain-openai, langchain-anthropic | Full-featured setup |
Requires Python 3.11+.
Quick Start
1. Evaluate with built-in scorers (no API keys needed)
from lef import run_eval, exact_match, contains
results = run_eval(
    target=my_chain.invoke,
    data="my-langsmith-dataset",
    evaluators=[exact_match, contains],
)
2. Add LLM judges (needs OPENAI_API_KEY or ANTHROPIC_API_KEY)
from lef import run_eval, correctness_judge, safety_judge, exact_match
results = run_eval(
    target=my_chain.invoke,
    data="my-dataset",
    evaluators=[correctness_judge(), safety_judge(), exact_match],
)
3. Gate CI with thresholds
from lef import assert_scores
assert_scores(results, {
    "correctness": 0.8,
    "safety": 0.95,
    "exact_match": 0.7,
})
# Raises EvalAssertionError if any threshold fails
4. Use local data files (no LangSmith account needed)
# tests/eval_data/examples.yaml
- inputs:
    question: "What is the capital of France?"
  outputs:
    answer: "Paris"
- inputs:
    question: "What is 2+2?"
  outputs:
    answer: "4"
from lef import load_examples, run_eval, exact_match
examples = load_examples("tests/eval_data/examples.yaml")
results = run_eval(
    target=my_app,
    data=examples,
    evaluators=[exact_match],
    upload_results=False,  # Fully offline
)
Custom Scorers
Decorator (simplest)
from lef import scorer
@scorer(key="word_count")
def word_count(*, inputs, outputs, **kwargs):
count = len(outputs.get("answer", "").split())
return min(count / 100, 1.0) # Return float 0-1
@scorer(key="has_answer")
def has_answer(*, inputs, outputs, **kwargs):
return bool(outputs.get("answer")) # Return bool (True=1.0, False=0.0)
Async scorer
@scorer(key="api_check")
async def api_check(*, inputs, outputs, **kwargs):
result = await some_async_validation(outputs["answer"])
return result # bool, float, int, dict, or EvalResult
Factory (dynamic creation)
from lef import create_scorer
def my_logic(*, inputs, outputs, **kwargs):
    return len(outputs.get("answer", "")) > 10

length_check = create_scorer("min_length", my_logic)
Class-based (most flexible)
from lef import BaseEvaluator, EvalResult
class MyEvaluator(BaseEvaluator):
    key = "my_metric"

    def evaluate(self, *, inputs, outputs, reference_outputs=None, **kwargs):
        score = 1.0 if "expected" in outputs.get("answer", "") else 0.0
        return EvalResult(key=self.key, score=score, comment="Checked for keyword")
Composite scorer (combine multiple)
from lef import create_composite_scorer, exact_match, contains
quality = create_composite_scorer(
    "quality",
    [exact_match, contains],
    aggregation="mean",  # Also: "min", "max", "all_pass"
)
Custom LLM Judges
from lef import create_judge
# Custom prompt with {inputs}, {outputs}, {reference_outputs} placeholders
tone_judge = create_judge(
    prompt="""Evaluate whether the response has a professional tone.

User input: {inputs}
Response: {outputs}

Return true if professional, false otherwise.""",
    model="openai:gpt-4o",
    feedback_key="tone",
)
Pre-Built Evaluators
Scorers (rule-based, no API keys)
| Scorer | What it does |
|---|---|
| exact_match | Exact string match (whitespace-trimmed) |
| contains | Case-insensitive substring check |
| regex_match | Regex pattern matching (reference_outputs["pattern"]) |
| json_match | Field-by-field JSON comparison, returns 0.0-1.0 |
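For example, here is a minimal sketch of running a rule-based scorer completely offline against an inline dataset. The my_bot target and the email regex are illustrative placeholders; the reference_outputs["pattern"] convention for regex_match comes from the table above.

from lef import run_eval, regex_match

# Hypothetical stand-in target; substitute your own chain or function.
def my_bot(inputs):
    return {"answer": "You can reach support at support@example.com"}

examples = [
    {
        "inputs": {"question": "How do I contact support?"},
        # regex_match reads the expected pattern from reference_outputs["pattern"]
        "outputs": {"pattern": r"[\w.+-]+@[\w.-]+"},
    },
]

results = run_eval(
    target=my_bot,
    data=examples,
    evaluators=[regex_match],
    upload_results=False,  # rule-based scorers run fully offline
)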
LLM Judges (need API keys)
| Judge | What it evaluates |
|---|---|
| correctness_judge() | Output correctness vs reference |
| conciseness_judge() | Response conciseness |
| hallucination_judge() | Hallucinations beyond inputs/context |
| answer_relevance_judge() | Answer relevance to the question |
| faithfulness_judge() | Faithfulness to source context |
| response_quality_judge() | Overall quality (correctness, completeness, clarity) |
| safety_judge() | Harmful, biased, or dangerous content |
| toxicity_judge() | Toxic or offensive content |
| tool_selection_judge() | Agent tool selection accuracy |
| code_correctness_judge() | Code correctness |
| plan_adherence_judge() | Adherence to a specified plan |
RAG Judges
| Judge | What it evaluates |
|---|---|
| rag_groundedness_judge() | Response grounded in retrieved context |
| rag_helpfulness_judge() | RAG response helpfulness |
| rag_retrieval_relevance_judge() | Retrieved document relevance |
Agent Trajectory
from lef import create_trajectory_evaluator, create_trajectory_judge
# Match-based (needs reference trajectory)
traj_eval = create_trajectory_evaluator(match_mode="superset")
# Options: "strict", "unordered", "subset", "superset"
# LLM-based (no reference needed)
traj_judge = create_trajectory_judge(model="openai:gpt-4o")
Choosing a judge model
All judges accept a model parameter:
from lef import JudgeModel, correctness_judge
judge = correctness_judge(model=JudgeModel.GPT_4O) # Default
judge = correctness_judge(model=JudgeModel.GPT_4O_MINI) # Faster/cheaper
judge = correctness_judge(model=JudgeModel.CLAUDE_SONNET) # Anthropic
judge = correctness_judge(model=JudgeModel.CLAUDE_HAIKU) # Fast Anthropic
judge = correctness_judge(model="openai:gpt-4.1") # Any string
Dataset & Runner Patterns
Fluent runner
from lef import EvalRunner, correctness_judge, exact_match
runner = EvalRunner(
    dataset="qa-examples",     # LangSmith dataset name or list of dicts
    experiment_prefix="v1",
    description="Baseline evaluation",
    num_repetitions=3,         # For statistical significance
    upload_results=True,
)
runner.add_evaluators([correctness_judge(), exact_match])
results = runner.run(target=my_chain.invoke)
# Async
results = await runner.arun(target=my_chain.ainvoke)
Create a LangSmith dataset programmatically
from lef import create_dataset
create_dataset("qa-examples", examples=[
    {"inputs": {"question": "Capital of France?"}, "outputs": {"answer": "Paris"}},
    {"inputs": {"question": "What is 2+2?"}, "outputs": {"answer": "4"}},
])
Load local files (YAML, JSON, CSV)
from lef import load_examples
# YAML / JSON (list of {inputs, outputs} dicts)
examples = load_examples("tests/data/examples.yaml")
# CSV with column splitting
examples = load_examples(
    "tests/data/cases.csv",
    input_keys=["question"],
    output_keys=["answer"],
)
A/B comparison
from lef import run_comparative_eval
results = run_comparative_eval(
    experiments=["v1-gpt4o", "v2-claude"],
    evaluators=[my_preference_judge],
)
LangChain Integration
from lef import evaluate_chain, aevaluate_chain, correctness_judge

results = evaluate_chain(
    my_chain,
    data="qa-dataset",
    evaluators=[correctness_judge()],
    output_mapper=lambda x: {"answer": x.content},  # Optional
)
# Async
results = await aevaluate_chain(my_chain, data="qa-dataset", evaluators=[...])
LangGraph Integration
from lef import evaluate_graph, aevaluate_graph, correctness_judge

results = evaluate_graph(
    app,  # Compiled StateGraph
    data="agent-dataset",
    evaluators=[correctness_judge()],
    input_mapper=lambda x: {"messages": [("user", x["question"])]},
    output_mapper=lambda x: {"answer": x["messages"][-1].content},
)
# Async
results = await aevaluate_graph(app, data="agent-dataset", evaluators=[...])
Online / Production Monitoring
from lef import OnlineEvaluator, evaluate_run, safety_judge, response_quality_judge
# Evaluate a single run by ID
results = evaluate_run("run-uuid-here", evaluators=[safety_judge()])
# Monitor a project
online = OnlineEvaluator(project_name="my-chatbot-production")
online.add_evaluator(safety_judge())
online.add_evaluator(response_quality_judge())
results = online.evaluate_recent(limit=50, run_type="chain")
Result Export
Export evaluation results to JSON, CSV, JUnit XML, or Markdown for CI artifacts, reports, or sharing.
from lef import run_eval, export_json, export_csv, export_junit_xml, export_markdown, export_results
results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
# Export to specific formats
export_json(results, "results.json")
export_csv(results, "results.csv")
export_junit_xml(results, "results.xml")
export_markdown(results, "results.md", thresholds={"correctness": 0.8}, metadata={"env": "staging"})
# Auto-detect format from file extension
export_results(results, "results.json") # JSON
export_results(results, "results.csv") # CSV
export_results(results, "results.xml") # JUnit XML
export_results(results, "results.md") # Markdown report
From the CLI:
lef run eval_suite.yaml --output results.json
lef run eval_suite.yaml --output results.xml # JUnit XML for CI dashboards
lef run eval_suite.yaml --output results.md # Markdown report
You can also format results as a table for terminal output:
from lef import format_results_table
print(format_results_table(results))
Baseline Comparison
Save evaluation results as named baselines, then compare against them to detect regressions across branches or releases.
Save a baseline
from lef import run_eval, save_baseline
results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
save_baseline("main", results, metadata={"branch": "main", "version": "1.0"})
# Saved to .lef/baselines/main.json
Compare against a baseline
from lef import compare_results, load_baseline
baseline = load_baseline("main")
current_results = run_eval(target=my_chain.invoke, data="my-dataset", evaluators=[...])
report = compare_results(baseline, current_results, tolerance=0.05)
# report.regressions -> list of metrics that dropped by more than the tolerance
# report.improvements -> list of metrics that improved
Manage baselines
from lef import list_baselines, delete_baseline
baselines = list_baselines() # List all saved baselines
delete_baseline("old-baseline") # Remove a saved baseline
From the CLI:
# Save results as a baseline
lef run eval_suite.yaml --save-baseline main
# Compare against a baseline
lef run eval_suite.yaml --baseline main
# List and manage baselines
lef baseline list
lef baseline delete old-baseline
# Compare two baselines directly
lef compare --baseline main --current feature-branch --tolerance 0.05
CI/CD Integration
LEF integrates with CI/CD pipelines to post evaluation results as PR comments and generate JUnit XML reports.
GitHub Actions
# .github/workflows/eval.yaml
name: Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "lefx[all]"
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          lef run eval_suite.yaml \
            --output results.xml \
            --save-baseline ${{ github.head_ref }} \
            --baseline main \
            --github-comment \
            --threshold correctness=0.8
      - name: Publish JUnit results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Eval Results
          path: results.xml
          reporter: java-junit
Azure DevOps
# azure-pipelines.yaml
steps:
  - script: |
      pip install "lefx[all]"
      lef run eval_suite.yaml \
        --output results.xml \
        --azdo-comment \
        --threshold correctness=0.8
    env:
      OPENAI_API_KEY: $(OPENAI_API_KEY)
  - task: PublishTestResults@2
    inputs:
      testResultsFiles: results.xml
      testRunTitle: "LEF Evaluations"
Programmatic PR comments
from lef import post_github_comment, post_azdo_comment, format_results_table
body = format_results_table(results)
# GitHub (auto-detects repo/PR from GITHUB_REPOSITORY and GITHUB_REF)
post_github_comment(body, update_existing=True)
# Azure DevOps
post_azdo_comment(body)
Watch Mode
Re-run evaluations automatically when source files, config files, or datasets change. Useful for iterative development.
lef run eval_suite.yaml --watch
Watch mode monitors .py, .yaml, .yml, .json, and .csv files in the project directory. When a change is detected, the evaluation suite re-runs automatically.
Programmatic usage:
from lef.watch import watch_and_run
watch_and_run(
    run_fn=my_eval_function,
    watch_paths=["src/", "evals/", "eval_suite.yaml"],
)
Combine with --cache to avoid re-calling the target when only evaluators change:
lef run eval_suite.yaml --watch --cache
Result Caching
Cache target function outputs on disk to avoid expensive re-invocations when iterating on evaluators. Uses content-addressable hashing of inputs for cache keys.
from lef import ResultCache, run_eval
cache = ResultCache(ttl_seconds=3600) # Cache expires after 1 hour
cached_target = cache.wrap(my_chain.invoke)
# First run: calls my_chain.invoke for each input
run_eval(target=cached_target, data=examples, evaluators=[judge1])
# Second run: uses cached outputs, only re-evaluates
run_eval(target=cached_target, data=examples, evaluators=[judge2])
Cache is stored in .lef/cache/ by default. From the CLI:
lef run eval_suite.yaml --cache
QA Testing
Test deployed HTTP endpoints against datasets with pass/fail gating. Works with LangSmith Deployments, LangGraph Platform, LangServe, or any REST API.
From the CLI
# Test a deployed endpoint with a dataset
lef qa https://my-api.example.com/invoke \
--data tests/eval_data/examples.yaml \
--evaluators correctness_judge safety_judge \
--threshold correctness=0.8 \
--threshold safety=0.95
# Generate a Markdown report
lef qa https://my-api.example.com/invoke \
--data my-langsmith-dataset \
--output results/qa_report.md
# Add custom headers (e.g., for authentication)
lef qa https://my-api.example.com/invoke \
--data my-langsmith-dataset \
-H "Authorization: Bearer $API_KEY" \
--timeout 120
# Export results and skip LangSmith upload
lef qa https://my-api.example.com/invoke \
--data examples.yaml \
--output results.json \
--no-upload
From Python
from lef import (
    create_remote_target, run_eval, export_markdown,
    check_scores, correctness_judge, safety_judge,
)

# Point at your deployed endpoint
target = create_remote_target(
    "https://my-api.example.com/invoke",
    headers={"Authorization": f"Bearer {API_KEY}"},
    input_mapper=lambda inputs: {"query": inputs["question"]},
    output_mapper=lambda resp: {"answer": resp["response"]},
)

# Run evaluation
results = run_eval(
    target,
    data="my-qa-dataset",  # LangSmith dataset or local file path
    evaluators=[correctness_judge(), safety_judge()],
    upload_results=False,
)

# Generate Markdown report with pass/fail
export_markdown(results, "qa_report.md", thresholds={"correctness": 0.8, "safety": 0.95})

# Check thresholds programmatically
report = check_scores(results, {"correctness": 0.8, "safety": 0.95})
all_passed = all(v["passed"] for v in report.values())
See examples/qa_endpoint_eval.py for a complete working example.
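In a CI job, a small follow-up sketch (plain Python, not a LEF API) can turn the threshold report into the process exit code so the pipeline step fails when any metric misses its target:

import sys

# Exit non-zero so the CI step is marked failed when a threshold is missed.
sys.exit(0 if all_passed else 1)

Alternatively, assert_scores (see Threshold Assertions) raises EvalAssertionError, which also produces a non-zero exit when left uncaught.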
Red-Team Testing
Run adversarial evaluations across 6 attack categories to test your system's safety and robustness.
Attack categories
| Category | Description |
|---|---|
| prompt_injection | Attempts to override system instructions |
| jailbreak | Attempts to bypass safety guardrails |
| pii_extraction | Attempts to extract private or sensitive information |
| hallucination_inducement | Inputs designed to induce hallucinated responses |
| toxicity | Tests whether the system generates toxic content |
| bias | Tests for biased responses across demographics |
From the CLI
# Run all categories against a target
lef redteam --target myapp.chain:invoke
# Test specific categories with more examples
lef redteam --target myapp.chain:invoke \
--categories prompt_injection,jailbreak,pii_extraction \
--count 10
# Use seed examples only (no LLM generation)
lef redteam --target myapp.chain:invoke --seed-only
# Use a config file for target definition
lef redteam eval_config.yaml --categories toxicity,bias
Programmatic usage
from lef import run_redteam
report = run_redteam(
    target=my_chain.invoke,
    categories=["prompt_injection", "jailbreak", "pii_extraction"],
    count_per_category=10,
    model="openai:gpt-4o",
    upload_results=False,
)
# report contains per-category scores and detailed results
Built-in red-team scorers
from lef import injection_resistance_check, pii_leak_check, refusal_check
# Use individually as evaluators
results = run_eval(
    target=my_chain.invoke,
    data=adversarial_examples,
    evaluators=[injection_resistance_check, pii_leak_check, refusal_check],
)
Synthetic Data Generation
Generate evaluation datasets from documents, production traces, or seed examples using LLM-powered synthesis.
From documents
from lef import generate_from_docs
examples = generate_from_docs(
    "docs/product_guide.md",
    count=10,
    style="factual",  # "factual", "reasoning", or "conversational"
    model="openai:gpt-4o",
)
# Returns a list of {"inputs": {"question": ...}, "outputs": {"answer": ...}} dicts
From production traces
from lef import generate_from_traces
examples = generate_from_traces(
    project_name="my-chatbot",
    limit=100,
    model="openai:gpt-4o",
)
Generate adversarial examples
from lef import generate_adversarial
adversarial = generate_adversarial(
    description="A customer support chatbot for a SaaS product",
    seed_examples=[{"question": "How do I reset my password?"}],
    count=20,
)
Diversify an existing dataset
from lef import diversify_dataset
expanded = diversify_dataset(
    existing_examples,
    count=50,
    model="openai:gpt-4o",
)
From the CLI
# Generate from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml
# Generate from a directory of documents
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml
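If you generate examples from Python rather than the CLI, a small sketch like the following persists them in the same inputs/outputs format that load_examples and run_eval accept (PyYAML is assumed available, since LEF datasets are YAML files; the file paths are illustrative):

import yaml  # PyYAML, assumed installed

from lef import generate_from_docs, load_examples

examples = generate_from_docs("docs/product_guide.md", count=10, style="factual")

# Write the generated examples as a reusable dataset file.
with open("tests/eval_data/generated.yaml", "w") as f:
    yaml.safe_dump(examples, f, sort_keys=False)

# Later (e.g. in CI), reload them like any hand-written dataset.
reloaded = load_examples("tests/eval_data/generated.yaml")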
Production Monitoring
Run a long-lived daemon that continuously polls a LangSmith project for new runs and evaluates them. Useful for production monitoring and alerting.
from lef import MonitorDaemon
from lef.judges import safety_judge, correctness_judge
monitor = MonitorDaemon(
    project_name="my-chatbot",
    evaluators=[safety_judge(), correctness_judge()],
    thresholds={"safety": 0.9, "correctness": 0.7},
    poll_interval=60,   # seconds between polls
    batch_size=20,      # runs per poll
    run_type="chain",   # filter by run type
)
monitor.add_alert_handler(lambda alert: print(f"ALERT: {alert}"))
monitor.run()  # Blocks until interrupted (Ctrl+C)
From the CLI:
lef monitor \
--project my-chatbot \
--evaluators safety correctness \
--threshold safety=0.9 \
--threshold correctness=0.7 \
--interval 60 \
--batch-size 20 \
--run-type chain
Pytest Plugin
Run LEF evaluations as pytest test cases. The plugin provides a lef_eval fixture and a @pytest.mark.lef marker.
Using the lef_eval fixture
# tests/test_evals.py
from lef import exact_match, correctness_judge
def test_qa_correctness(lef_eval):
    results = lef_eval(
        target=my_chain.invoke,
        data="tests/eval_data/examples.yaml",
        evaluators=[exact_match, correctness_judge()],
        thresholds={"correctness": 0.8, "exact_match": 0.9},
    )
    # Thresholds are automatically asserted -- test fails if any threshold is not met
Using the @pytest.mark.lef marker
import pytest
@pytest.mark.lef(config="eval_suite.yaml")
def test_my_eval():
    pass  # Eval is run automatically by the marker
Running config files as test cases
# Run eval configs as pytest test cases
pytest --lef-config eval_suite.yaml --lef-config another_suite.yaml
# Disable LangSmith upload during test runs
pytest --lef-config eval_suite.yaml --lef-no-upload
# Override experiment prefix
pytest --lef-config eval_suite.yaml --lef-prefix "ci-test"
Remote Targets
Evaluate any HTTP endpoint -- LangServe, LangGraph Platform, or plain REST APIs -- without writing wrapper code. Requires the [remote] extra.
from lef import create_remote_target, create_async_remote_target, run_eval, arun_eval
# Basic REST endpoint
target = create_remote_target("https://my-api.example.com/invoke")
# LangGraph Platform deployment with custom mappers
target = create_remote_target(
    "https://my-assistant.langsmith.dev/runs/stream",
    headers={"x-api-key": "..."},
    input_mapper=lambda inputs: {
        "input": {"messages": [{"role": "user", "content": inputs["question"]}]},
    },
    output_mapper=lambda resp: {
        "answer": resp["output"]["messages"][-1]["content"],
    },
    timeout=120.0,
)
results = run_eval(target=target, data=examples, evaluators=[...])
# Async version
async_target = create_async_remote_target("https://my-api.example.com/invoke")
results = await arun_eval(target=async_target, data=examples, evaluators=[...])
Git Context
LEF automatically detects git branch, commit SHA, author, and other metadata to tag evaluation experiments. This enables branch comparison workflows and traceability in LangSmith.
from lef import get_git_context, build_experiment_metadata
# Get current git context
ctx = get_git_context()
# {'branch': 'feature/new-prompt', 'commit_sha': 'abc123...', 'author': '...', ...}
# Build experiment metadata (includes git context + CI detection)
metadata = build_experiment_metadata()
# Automatically used by run_eval when upload_results=True
Git context is auto-detected from the git repository and from CI environment variables (GitHub Actions, Azure DevOps, GitLab CI, Jenkins).
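As a small sketch, the auto-detected context can also be attached to a saved baseline so regressions stay traceable to a branch and commit. It combines the save_baseline API from Baseline Comparison with build_experiment_metadata; my_chain and the dataset path are placeholders from earlier sections.

from lef import build_experiment_metadata, exact_match, run_eval, save_baseline

results = run_eval(
    target=my_chain.invoke,
    data="tests/eval_data/examples.yaml",
    evaluators=[exact_match],
    upload_results=False,
)

# Attach branch/commit/CI metadata to the baseline for later comparison.
save_baseline("feature-branch", results, metadata=build_experiment_metadata())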
CLI Reference
LEF provides 7 CLI subcommands:
lef run Run evaluation suite from config file(s)
lef compare Compare two baselines or experiments
lef baseline Manage saved baselines (list, delete)
lef qa QA test a deployed HTTP endpoint
lef monitor Continuously monitor production runs
lef redteam Run adversarial red-team evaluation
lef dataset Dataset management (pull, push, diff, generate)
lef run
Run an evaluation suite defined in a YAML config file.
# Basic usage
lef run eval_suite.yaml
# Override prefix and thresholds
lef run eval_suite.yaml --prefix "v2.1" --threshold correctness=0.85
# Export results, cache outputs, and compare against a baseline
lef run eval_suite.yaml \
--output results.xml \
--cache \
--baseline main \
--save-baseline feature-branch
# Watch mode with caching
lef run eval_suite.yaml --watch --cache
# Post results as a GitHub PR comment
lef run eval_suite.yaml --github-comment
# Post results as an Azure DevOps PR comment
lef run eval_suite.yaml --azdo-comment
# Local-only (no LangSmith upload)
lef run eval_suite.yaml --no-upload
# Merge multiple config files
lef run base_config.yaml override_config.yaml
Config file format:
# eval_suite.yaml
target: myapp.chain:invoke               # Dotted import path to your target
dataset: tests/eval_data/examples.yaml   # Local file or LangSmith dataset name
evaluators:
  - exact_match                          # Built-in scorer
  - correctness_judge                    # Built-in judge (auto-instantiated)
  - myapp.evals:custom_scorer            # Custom import path
experiment_prefix: "regression-test"
thresholds:
  correctness: 0.8
  safety: 0.95
  exact_match: 0.7
lef compare
Compare two baselines or experiments to detect regressions.
lef compare --baseline main --current feature-branch
lef compare --baseline main --current feature-branch --tolerance 0.05
lef compare --baseline main --current feature-branch --output comparison.json
lef baseline
Manage saved baselines.
lef baseline list # List all saved baselines
lef baseline delete my-baseline # Delete a saved baseline
lef qa
Test a deployed HTTP endpoint against a dataset. See QA Testing for details.
lef qa https://my-api.example.com/invoke \
--data examples.yaml \
--evaluators correctness safety \
--threshold correctness=0.8 \
-H "Authorization: Bearer $TOKEN" \
--timeout 120 \
--output results.json
lef monitor
Continuously monitor production runs. See Production Monitoring for details.
lef monitor --project my-chatbot \
--evaluators safety correctness \
--threshold safety=0.9 \
--interval 60 --batch-size 20 --run-type chain
lef redteam
Run adversarial red-team evaluations. See Red-Team Testing for details.
lef redteam --target myapp.chain:invoke \
--categories prompt_injection,jailbreak \
--count 10 --seed-only --no-upload
lef dataset
Dataset management commands.
# Pull a LangSmith dataset to a local file
lef dataset pull my-dataset --output my-dataset.yaml
# Push a local file to LangSmith
lef dataset push examples.yaml --name "my-dataset" --description "QA examples"
# Diff two local dataset files
lef dataset diff examples_v1.yaml examples_v2.yaml
# Generate a synthetic dataset from documents
lef dataset generate docs/guide.md --count 10 --style factual --output eval_data.yaml
lef dataset generate docs/ --count 5 --style reasoning --output eval_data.yaml
Generation styles: factual (fact-based Q&A), reasoning (multi-step), conversational (natural dialogue).
Threshold Assertions
Raise on failure (CI/CD)
from lef import assert_scores, EvalAssertionError
try:
    assert_scores(results, {
        "correctness": 0.8,
        "safety": 0.95,
    })
except EvalAssertionError as e:
    print(f"Failed: {e}")
    print(f"Details: {e.failures}")  # List of {key, actual, threshold}
Non-raising check
from lef import check_scores
report = check_scores(results, {"correctness": 0.8, "safety": 0.95})
for key, info in report.items():
    status = "PASS" if info["passed"] else "FAIL"
    print(f"  {key}: {status} ({info['actual']:.2f} vs {info['threshold']:.2f})")
Configuration
from lef import LefConfig, JudgeModel
# From environment (recommended)
config = LefConfig.from_env()
# Or explicit
config = LefConfig(
    langsmith_api_key="lsv2_...",
    langsmith_project="my-project",
    default_judge_model=JudgeModel.CLAUDE_SONNET,
    max_concurrency=10,
)
config.apply()  # Sets environment variables
Environment variables:
export LANGCHAIN_API_KEY=lsv2_... # LangSmith API key
export LANGCHAIN_PROJECT=my-project # LangSmith project name
export LANGCHAIN_TRACING_V2=true # Enable tracing
export OPENAI_API_KEY=sk-... # For OpenAI judges
export ANTHROPIC_API_KEY=sk-ant-... # For Anthropic judges
Walkthrough: QA-ing a Prompt Change
Scenario: You changed a prompt in your LangGraph agent and need to verify nothing broke.
Step 1: Create test data
# evals/golden_set.yaml
- inputs:
    question: "Summarize the key points of this document"
    context: "The report shows Q3 revenue grew 15% YoY..."
  outputs:
    answer: "Q3 revenue grew 15% year-over-year"
- inputs:
    question: "What action items came out of the meeting?"
    context: "Action items: 1) Update the roadmap 2) Schedule design review"
  outputs:
    answer: "Update the roadmap and schedule a design review"
Step 2: Write the eval script
# evals/test_prompt_change.py
from lef import load_examples, run_eval, assert_scores, exact_match, correctness_judge, scorer
from my_project.agent import app
@scorer(key="mentions_key_facts")
def mentions_key_facts(*, inputs, outputs, reference_outputs, **kwargs):
ref = reference_outputs.get("answer", "").lower()
out = outputs.get("answer", "").lower()
keywords = [w for w in ref.split() if len(w) > 4]
if not keywords:
return True
return sum(1 for kw in keywords if kw in out) / len(keywords)
def target(inputs):
result = app.invoke({
"messages": [("user", inputs["question"])],
"context": inputs.get("context", ""),
})
return {"answer": result["messages"][-1].content}
examples = load_examples("evals/golden_set.yaml")
results = run_eval(
target=target,
data=examples,
evaluators=[exact_match, mentions_key_facts, correctness_judge()],
upload_results=False,
experiment_prefix="prompt-v2",
)
assert_scores(results, {
"exact_match": 0.5,
"mentions_key_facts": 0.7,
"correctness": 0.8,
})
print("All QA checks passed!")
Step 3: Run it
# Local
python evals/test_prompt_change.py
# Or via CLI
lef run evals/eval_suite.yaml --no-upload
# CI: exits non-zero on failure
lef run evals/eval_suite.yaml --threshold correctness=0.8
Step 4: Add to CI
# .github/workflows/eval.yaml
- name: Run prompt regression tests
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    pip install lefx
    lef run evals/eval_suite.yaml --threshold correctness=0.8
EvalResult Anatomy
All evaluators return EvalResult, a dict subclass compatible with LangSmith:
from lef import EvalResult
result = EvalResult(
    key="my_metric",           # Metric name
    score=0.85,                # float (0-1) or bool
    comment="Looks good",      # Optional explanation
    metadata={"details": {}},  # Optional metadata
)
# Dict-compatible (LangSmith requires this)
result["key"] # "my_metric"
result["score"] # 0.85
# Property access
result.key # "my_metric"
result.score # 0.85
result.comment # "Looks good"
result.metadata # {"details": {}}
API Reference
Full public API (85 exports)
| Category | Export | Type |
|---|---|---|
| Core | EvalResult | Class (dict subclass) |
| | EvalResultBatch | Class (Pydantic model) |
| | BaseEvaluator | Abstract class |
| | AsyncBaseEvaluator | Abstract class |
| | JudgeModel | Enum (GPT_4O, GPT_4O_MINI, CLAUDE_SONNET, CLAUDE_HAIKU) |
| | scorer | Decorator |
| | evaluator | Decorator (alias for scorer) |
| Scorers | exact_match | Callable |
| | contains | Callable |
| | regex_match | Callable |
| | json_match | Callable |
| | mean_score | Callable |
| | pass_rate | Callable |
| | create_scorer | Factory function |
| | create_composite_scorer | Factory function |
| Judges | correctness_judge | Factory -> Callable |
| | conciseness_judge | Factory -> Callable |
| | hallucination_judge | Factory -> Callable |
| | answer_relevance_judge | Factory -> Callable |
| | faithfulness_judge | Factory -> Callable |
| | response_quality_judge | Factory -> Callable |
| | safety_judge | Factory -> Callable |
| | toxicity_judge | Factory -> Callable |
| | tool_selection_judge | Factory -> Callable |
| | code_correctness_judge | Factory -> Callable |
| | plan_adherence_judge | Factory -> Callable |
| | create_judge | Factory function |
| RAG | rag_groundedness_judge | Factory -> Callable |
| | rag_helpfulness_judge | Factory -> Callable |
| | rag_retrieval_relevance_judge | Factory -> Callable |
| Trajectory | create_trajectory_evaluator | Factory function |
| | create_trajectory_judge | Factory function |
| Datasets | run_eval | Function |
| | arun_eval | Async function |
| | run_comparative_eval | Function |
| | EvalRunner | Class (fluent builder) |
| | create_dataset | Function |
| | load_examples | Function |
| Online | evaluate_run | Function |
| | create_rule | Function |
| | create_rule_config | Function |
| | OnlineEvaluator | Class |
| Integrations | evaluate_chain | Function |
| | aevaluate_chain | Async function |
| | create_chain_target | Function |
| | create_async_chain_target | Function |
| | evaluate_graph | Function |
| | aevaluate_graph | Async function |
| | create_graph_target | Function |
| | create_async_graph_target | Function |
| | create_remote_target | Function |
| | create_async_remote_target | Async function |
| Config | LefConfig | Class (Pydantic model) |
| Assertions | assert_scores | Function (raises EvalAssertionError) |
| | check_scores | Function (returns report dict) |
| | EvalAssertionError | Exception class |
| Export | export_json | Function |
| | export_csv | Function |
| | export_junit_xml | Function |
| | export_markdown | Function |
| | export_results | Function (auto-detects format) |
| | format_results_table | Function |
| Git Context | get_git_context | Function |
| | build_experiment_metadata | Function |
| Baselines | save_baseline | Function |
| | load_baseline | Function |
| | list_baselines | Function |
| | delete_baseline | Function |
| | compare_results | Function |
| | compare_experiments | Function |
| | ComparisonReport | Class |
| Cache | ResultCache | Class |
| CI | post_github_comment | Function |
| | post_azdo_comment | Function |
| Monitor | MonitorDaemon | Class |
| Red-Team | run_redteam | Function |
| | injection_resistance_check | Callable |
| | pii_leak_check | Callable |
| | refusal_check | Callable |
| Watch | watch_and_run | Function |
| Synthetic | generate_from_docs | Function |
| | generate_from_traces | Function |
| | generate_adversarial | Function |
| | diversify_dataset | Function |
Examples
See the examples/ directory:
- quickstart.py — Get running in minutes
- llm_as_judge.py — LLM-as-Judge patterns
- dataset_eval.py — Dataset-driven evaluation
- custom_scorer.py — Custom scorer patterns
- langgraph_eval.py — LangGraph agent evaluation
- online_eval.py — Production monitoring
Development
git clone https://github.com/bogware/lef.git
cd lef
uv sync --extra dev --extra all
# Run tests
uv run pytest
# Lint
uv run ruff check src/ tests/
License
MIT
Download files
File details
Details for the file lefx-0.3.1.tar.gz.
File metadata
- Download URL: lefx-0.3.1.tar.gz
- Upload date:
- Size: 118.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b57dbc48ca91b99c0329f4943bbabcd6d3511f03449b311c69cfecad1f983924 |
| MD5 | ffc56bd96977960ba357ea1e100c5fd5 |
| BLAKE2b-256 | cb6182573fb7794aaf1cc3c691b217543b543c30bdf20f56f5c5eabe90ebd62a |
Provenance
The following attestation bundles were made for lefx-0.3.1.tar.gz:
Publisher: release.yml on bogware/lef

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lefx-0.3.1.tar.gz
- Subject digest: b57dbc48ca91b99c0329f4943bbabcd6d3511f03449b311c69cfecad1f983924
- Sigstore transparency entry: 1204376540
- Sigstore integration time:
- Permalink: bogware/lef@25830614523babe0354323384344a55d4e3e7c92
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/bogware
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yml@25830614523babe0354323384344a55d4e3e7c92
- Trigger Event: push
File details
Details for the file lefx-0.3.1-py3-none-any.whl.
File metadata
- Download URL: lefx-0.3.1-py3-none-any.whl
- Upload date:
- Size: 81.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 622a6923ab172fb84dadf0172a2a8b8690d5183e2957e97f5fb5444caa2c280a |
| MD5 | e56fe2277f7f2ca66f7781438d38cd3f |
| BLAKE2b-256 | accfaf71fd63e2802c77f5d3cd600e4f9445bf7ffdcb4aed3712c2c629278db9 |
Provenance
The following attestation bundles were made for lefx-0.3.1-py3-none-any.whl:
Publisher: release.yml on bogware/lef

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lefx-0.3.1-py3-none-any.whl
- Subject digest: 622a6923ab172fb84dadf0172a2a8b8690d5183e2957e97f5fb5444caa2c280a
- Sigstore transparency entry: 1204376544
- Sigstore integration time:
- Permalink: bogware/lef@25830614523babe0354323384344a55d4e3e7c92
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/bogware
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yml@25830614523babe0354323384344a55d4e3e7c92
- Trigger Event: push