
EvalGate Python SDK — CI for AI behavior. Traces, evaluations, assertions, and regression gates for LLM apps.


pauly4010-evalgate-sdk

Build a living golden suite for AI behavior. 🚀

No infra. No lock-in. Remove anytime.

EvalGate is a full suite for AI quality in Python. Discover overlap, cluster failures, build golden datasets, run automated regression gates, and guide optimization before changes reach production.



The Full EvalGate Workflow

EvalGate is no longer just a pass/fail gate at the end of CI. The current workflow is a full loop:

discover -> cluster -> label/analyze -> synthesize -> gate/auto
  • Discover overlap before adding more tests with evalgate discover --manifest
  • Cluster failures by pattern with evalgate cluster --run .evalgate/runs/latest.json
  • Build a labeled golden dataset with evalgate label and evalgate analyze
  • Draft broader golden cases with evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
  • Block regressions or run guided optimization with evalgate gate, evalgate ci, and evalgate auto

The Python SDK ships the same closed-loop workflow primitives as the platform: assertions, spec execution, tracing, clustering, golden-dataset analysis, synthesis, replay decision, and guided auto iterations.


Install

pip install pauly4010-evalgate-sdk                        # Core
pip install "pauly4010-evalgate-sdk[openai]"              # + OpenAI tracing and async assertions
pip install "pauly4010-evalgate-sdk[anthropic]"           # + Anthropic tracing and async assertions
pip install "pauly4010-evalgate-sdk[all]"                 # Everything

Quickstart

No API key needed for local assertions:

from evalgate_sdk import AIEvalClient, expect
from evalgate_sdk.types import CreateTraceParams

# Local assertions — no API key needed
result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)  # True

# Platform: trace and evaluate with API key
client = AIEvalClient(api_key="sk-...")
trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
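
The `await` in the quickstart needs a running event loop, for example a notebook or an async function. In a plain script, one option is to wrap the calls in `asyncio.run`. The sketch below shows the shape of that wrapper using stand-in classes (`StubClient`, `StubTraces`, and `StubTrace` are hypothetical and not part of the SDK) so it runs without an API key; in real code you would use `AIEvalClient` and `CreateTraceParams` as shown above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubTrace:
    """Hypothetical stand-in for the object the platform returns."""
    name: str


class StubTraces:
    async def create(self, name: str) -> StubTrace:
        # A real client would send an HTTP request here.
        await asyncio.sleep(0)
        return StubTrace(name=name)


class StubClient:
    """Hypothetical stand-in for AIEvalClient; not part of the SDK."""

    def __init__(self) -> None:
        self.traces = StubTraces()


async def main() -> StubTrace:
    client = StubClient()
    # Every awaited SDK call lives inside one coroutine.
    return await client.traces.create("chat-quality")


# asyncio.run creates the event loop, runs main(), and closes the loop.
trace = asyncio.run(main())
print(trace.name)  # chat-quality
```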

Same CI gate, same quality checks. Python supports the same core loop as TypeScript: assertions, test suites, OpenAI/Anthropic tracing, LangChain/CrewAI/AutoGen integrations, golden dataset workflow commands, and regression gates.

Python CLI: pip install "pauly4010-evalgate-sdk[cli]" installs the evalgate command with these subcommands: evalgate init, evalgate run, evalgate check, evalgate gate, evalgate ci, evalgate discover, evalgate cluster, evalgate label, evalgate analyze, evalgate synthesize, evalgate replay-decision, evalgate explain, evalgate doctor, evalgate auto.

Context helpers are importable from the package root:

from evalgate_sdk import ContextMetadata, create_context

ctx: ContextMetadata = {"run_id": "test-run"}
token = create_context(ctx)

Why EvalGate?

LLMs don't fail like traditional software — they drift silently. EvalGate turns evaluations into CI gates so regressions never reach production.

| What you get | How it works |
| --- | --- |
| 30+ assertions | expect(output).to_contain("Paris"), .to_not_contain_pii(), .to_have_no_profanity() |
| DSL spec system | define_eval("name", executor) with .skip and .only support |
| Test suites | Define cases with retries, seed, strict mode, and stop-on-failure |
| Workflow tracing | Multi-agent handoffs, decisions, costs, with offline mode |
| OpenAI / Anthropic | Drop-in tracing wrappers + LangChain, CrewAI, AutoGen |
| Regression gates | Block deploys when eval scores drop, with baseline tamper detection |
| Snapshot testing | Save, compare, and diff outputs over time |
| Impact analysis | evalgate discover → manifest → impact analysis → run only what changed |
| CLI | evalgate run, evalgate check, evalgate gate, evalgate ci, evalgate discover, evalgate cluster, evalgate label, evalgate analyze, evalgate synthesize, evalgate replay-decision, evalgate explain, evalgate doctor, evalgate auto |

Assertions

30+ built-in checks for LLM output quality, safety, and structure. All return AssertionResult with .passed, .message, .expected, .actual.

Fluent API (expect)

from evalgate_sdk import expect

# `output` in the examples below stands for the model response under test
 
# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("draft output").not_.to_contain("final answer")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()
expect("Clean output").to_have_no_profanity()
 
# Sentiment
expect("Great product!").to_have_sentiment("positive")
 
# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect('{"name": "Alice"}').to_match_json({"type": "object"})
expect('payload={"name": "Alice"}').to_match_json({"required": ["name"]})
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)
expect(output).to_contain_keywords(["gravity", "force"])
 
# Comparison
expect(42).to_be_greater_than(10)
expect(42).to_be_less_than(100)
expect(True).to_be_truthy()
 
# Code
expect("def hello(): pass").to_contain_code()
 
# Hallucination
expect(output).to_not_hallucinate(["Paris is the capital of France"])
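
To make the fluent pattern concrete, here is a toy sketch of how a chained expectation with a `not_` inverter and a structured result could be built. It is not the SDK's implementation: `ToyResult`, `ToyExpectation`, and `toy_expect` are hypothetical names, and only the four result fields (`passed`, `message`, `expected`, `actual`) are taken from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyResult:
    """Mirrors the four documented result fields; everything else is assumed."""
    passed: bool
    message: str
    expected: Any = None
    actual: Any = None


class ToyExpectation:
    def __init__(self, value: Any, negate: bool = False) -> None:
        self._value = value
        self._negate = negate

    @property
    def not_(self) -> "ToyExpectation":
        # Return a new expectation whose checks are inverted.
        return ToyExpectation(self._value, negate=not self._negate)

    def _result(self, ok: bool, description: str, expected: Any) -> ToyResult:
        passed = ok != self._negate  # flip the outcome when negated
        prefix = "not " if self._negate else ""
        status = "ok" if passed else "failed"
        message = f"{status}: expected value {prefix}{description}"
        return ToyResult(passed, message, expected, self._value)

    def to_contain(self, needle: str) -> ToyResult:
        found = needle in str(self._value)
        return self._result(found, f"to contain {needle!r}", needle)


def toy_expect(value: Any) -> ToyExpectation:
    return ToyExpectation(value)


print(toy_expect("The capital of France is Paris.").to_contain("Paris").passed)  # True
print(toy_expect("draft output").not_.to_contain("final answer").passed)         # True
print(toy_expect("draft output").not_.to_contain("draft").passed)                # False
```

Returning a result object instead of raising lets callers collect many checks and decide later how to report failures.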

Standalone Functions

from evalgate_sdk import (
    contains_keywords, has_no_toxicity, has_sentiment, similar_to,
    contains_json, has_readability_score, has_factual_accuracy,
    has_valid_code_syntax, has_sentiment_with_score, matches_pattern,
    matches_schema, responded_within_duration, responded_within_time_since,
    run_assertions, expect,
)
 
# Sync standalone assertion helpers return AssertionResult
result = has_no_toxicity("Thank you for your help.")
print(result.passed, result.message)
 
result = has_valid_code_syntax("def hello():\n    return 'hi'", "python")
print(result.passed)  # True — uses ast.parse for Python

result = matches_schema('payload={"status": "ok"}', {"required": ["status"]})
print(result.passed, result.actual)
 
# Batch assertions (`output` is the model response under test)
results = run_assertions([
    lambda: expect(output).to_contain("Paris"),
    lambda: expect(output).to_have_sentiment("positive"),
    lambda: expect(output).to_have_length(min=10),
    lambda: True,  # legacy bools are coerced into AssertionResult
])
all_passed = all(r.passed for r in results)

Compatibility helpers such as has_pii(), async semantic checks like has_sentiment_async(), and score-style utilities such as has_consistency() still return booleans or dictionaries where documented.
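
The comment above notes that Python syntax validation goes through `ast.parse`. As a rough illustration of that technique using only the standard library (the helper name `is_valid_python` is hypothetical, and the SDK's own return type is `AssertionResult`, not a tuple):

```python
import ast


def is_valid_python(source: str) -> tuple[bool, str]:
    """Return (ok, detail) by asking the real Python parser."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError) as exc:
        # ValueError covers inputs such as null bytes on some Python versions.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


print(is_valid_python("def hello():\n    return 'hi'"))  # (True, 'ok')
print(is_valid_python("def hello(:\n    pass")[0])       # False
```

Parsing checks syntax only; it does not execute the code or verify that names resolve.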

LLM-Backed Assertions (Async)

For context-aware checking beyond heuristics. Install the matching optional extra first, for example pip install "pauly4010-evalgate-sdk[openai]" when using the default OpenAI provider.

from evalgate_sdk import configure_assertions
from evalgate_sdk import has_sentiment_async, has_no_toxicity_async

configure_assertions(
    provider="openai",             # or "anthropic"
    api_key="sk-...",
    model="gpt-4o-mini",
    timeout_ms=30_000,              # 30s default, prevents hung calls
)

matches = await has_sentiment_async("subtle irony...", "negative")
is_safe = await has_no_toxicity_async("borderline text")

You can also keep using configure_assertions(AssertionLLMConfig(...)) when you prefer an explicit config object.
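
The release notes state that `timeout_ms` is enforced via `asyncio.wait_for`. A minimal sketch of that pattern follows; `call_with_timeout` and `slow_llm_call` are hypothetical names used only for illustration, and the sleep stands in for a slow provider call.

```python
import asyncio


async def call_with_timeout(coro, timeout_ms: int = 30_000):
    """Await `coro`, cancelling it if it runs longer than timeout_ms."""
    return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)


async def slow_llm_call() -> str:
    # Stand-in for a provider request that takes 200 ms.
    await asyncio.sleep(0.2)
    return "positive"


async def demo() -> str:
    try:
        return await call_with_timeout(slow_llm_call(), timeout_ms=50)
    except asyncio.TimeoutError:
        return "timed out"


print(asyncio.run(demo()))  # timed out
```

With a generous budget, for example `timeout_ms=1000`, the same wrapper returns the call's result unchanged.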


DSL Spec System

Define evaluation specs with the define_eval DSL — the same API as the TypeScript SDK:

from evalgate_sdk import define_eval, create_result
 
define_eval("Math Operations", async_executor)
 
# Object form with metadata
define_eval({
    "name": "String check",
    "tags": ["basic"],
    "executor": async_executor,
})
 
# Skip / Only (matches TS defineEval.skip / defineEval.only)
define_eval.skip("Skipped spec", async_executor)
define_eval.only("Focus spec", async_executor)

Test Suites

from evalgate_sdk import create_test_suite
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig
 
suite = create_test_suite("safety-checks", TestSuiteConfig(
    evaluator=my_llm_function,
    test_cases=[
        TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
        TestSuiteCase(name="pii-check", input="Describe yourself",
                      assertions=[{"type": "not_contains_pii"}]),
    ],
    retries=3,                # Retry failed cases (default: 0)
    retry_delay_ms=1000,      # Delay between retries
    retry_jitter=True,        # Add jitter to retry delay
    seed=42,                  # Deterministic ordering
    strict=True,              # Fail on warnings
    stop_on_failure=True,     # Abort on first failure
))
 
result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
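
For intuition about the `retries`, `retry_delay_ms`, and `retry_jitter` options, the loop below sketches one common way to retry a failing case. The parameter names come from the config above; the function `run_with_retries` and the jitter formula (a random factor between 0.5x and 1.5x of the delay) are assumptions, since the SDK's actual strategy is not documented here.

```python
import asyncio
import random


async def run_with_retries(case, retries=0, retry_delay_ms=1000,
                           retry_jitter=False, rng=None):
    """Run an async `case` until it returns True or retries are exhausted.

    Returns (passed, attempts).
    """
    rng = rng or random.Random()
    attempts = 0
    while True:
        attempts += 1
        if await case():
            return True, attempts
        if attempts > retries:
            return False, attempts
        delay = retry_delay_ms / 1000
        if retry_jitter:
            # Assumed jitter: spread retries so they do not fire in lockstep.
            delay *= rng.uniform(0.5, 1.5)
        await asyncio.sleep(delay)


calls = {"n": 0}


async def flaky() -> bool:
    # Fails twice, then passes on the third call.
    calls["n"] += 1
    return calls["n"] >= 3


passed, attempts = asyncio.run(
    run_with_retries(flaky, retries=3, retry_delay_ms=1,
                     retry_jitter=True, rng=random.Random(42))
)
print(passed, attempts)  # True 3
```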

OpenAI Integration

from openai import AsyncOpenAI
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.openai import trace_openai
 
traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain gravity"}]
)
# Automatically traced with latency, tokens, and output

Batch eval with built-in assertions:

from evalgate_sdk import openai_chat_eval, OpenAIChatEvalCase
 
result = await openai_chat_eval(
    name="chat-quality",
    model="gpt-4",
    cases=[
        OpenAIChatEvalCase(
            input="Explain gravity in one sentence.",
            assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
        ),
    ],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")

Anthropic Integration

from anthropic import AsyncAnthropic
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.anthropic import trace_anthropic
 
traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain gravity"}]
)

Also available: trace_langchain, trace_crewai, trace_autogen.


Workflow Tracing

Track multi-agent systems end-to-end — handoffs, decisions, and cost:

from evalgate_sdk import AIEvalClient, WorkflowTracer
from evalgate_sdk.types import HandoffType, CostCategory, RecordCostParams
 
client = AIEvalClient.init()
tracer = WorkflowTracer(client, name="research-pipeline")
 
ctx = await tracer.start_workflow()
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})
 
await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
    agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))
 
await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")

Offline Mode

Run workflow tracing locally without an API connection:

tracer = WorkflowTracer(None, name="local-test", offline=True)
ctx = await tracer.start_workflow()  # No API calls, no crash

You can also omit the client entirely when you want local-only workflow tracing:

from evalgate_sdk import create_workflow_tracer

tracer = create_workflow_tracer(name="local-test")
ctx = await tracer.start_workflow()
assert ctx.trace_id is None

Batch Processing

batch_process(items, processor, concurrency=...) expects an async callable for processor and returns results in input order.

from evalgate_sdk import batch_process

async def double(value: int) -> int:
    return value * 2

results = await batch_process([1, 2, 3], double, concurrency=2)

If you pass a synchronous function, the SDK raises TypeError immediately instead of failing later with a generic await error.
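
The behavior described here (bounded concurrency, results in input order, early `TypeError` for sync functions) can be approximated with the standard library alone. This is a sketch of the general technique, not the SDK's code; `bounded_map` is a hypothetical name.

```python
import asyncio
import inspect


async def bounded_map(items, processor, concurrency=5):
    """Apply an async `processor` to each item with at most `concurrency` in flight."""
    if not inspect.iscoroutinefunction(processor):
        # Fail early with a clear message instead of a late "can't await" error.
        raise TypeError("processor must be an async callable (async def)")

    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with semaphore:
            return await processor(item)

    # gather returns results in the order the awaitables were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(*(worker(item) for item in items))


async def double(value: int) -> int:
    return value * 2


print(asyncio.run(bounded_map([1, 2, 3], double, concurrency=2)))  # [2, 4, 6]
```

Note that `inspect.iscoroutinefunction` does not recognize every awaitable-returning callable (for example, an object with an async `__call__`), so a production check may need to be more lenient.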

Snapshot Testing

Snapshots are stored in .snapshots by default, relative to the current working directory.

from evalgate_sdk import compare_with_snapshot, snapshot

snapshot("Hello there", "support-reply")
comparison = compare_with_snapshot("support-reply", "Hello there")
print(comparison.matches)

Override the directory when you want snapshots under a project-specific path:

snapshot("Hello there", "support-reply", directory=".evalgate/snapshots")

Add .snapshots/ to your .gitignore unless you intentionally want snapshot files committed.
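
Conceptually, snapshot testing is "write the output to a file once, then compare later runs against it". The sketch below shows that idea with the standard library. The helper names and the on-disk JSON layout are assumptions made for illustration; the SDK's real file format is not described in this document.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(output: str, name: str, directory: str = ".snapshots") -> Path:
    """Store `output` under <directory>/<name>.json (assumed layout)."""
    path = Path(directory) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"output": output}), encoding="utf-8")
    return path


def matches_snapshot(name: str, current: str, directory: str = ".snapshots") -> bool:
    """Compare `current` with the stored output for `name`."""
    path = Path(directory) / f"{name}.json"
    stored = json.loads(path.read_text(encoding="utf-8"))["output"]
    return stored == current


# Use a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    save_snapshot("Hello there", "support-reply", directory=tmp)
    print(matches_snapshot("support-reply", "Hello there", directory=tmp))    # True
    print(matches_snapshot("support-reply", "Hello, there!", directory=tmp))  # False
```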


Regression Gates

Block deployments when eval scores drop:

from evalgate_sdk import evaluate_regression, to_pass_gate
 
report = evaluate_regression(current_results, baseline)
assert to_pass_gate(report), f"Regression detected: {report.summary}"
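
At its core, a regression gate compares current scores with a baseline and fails when any score drops too far. The following sketch illustrates that comparison in plain Python. The function `find_regressions`, the `tolerance` parameter, and the choice to treat a missing score as a regression are all assumptions; they are not taken from `evaluate_regression`.

```python
def find_regressions(current: dict, baseline: dict, tolerance: float = 0.0) -> dict:
    """Return {eval_name: (baseline_score, current_score)} for every drop.

    A score regresses when it falls more than `tolerance` below the baseline.
    An eval missing from `current` is treated as a regression (assumption).
    """
    regressions = {}
    for name, base_score in baseline.items():
        score = current.get(name)
        if score is None or score < base_score - tolerance:
            regressions[name] = (base_score, score)
    return regressions


baseline_scores = {"chat-quality": 0.95, "safety": 0.99}
current_scores = {"chat-quality": 0.91, "safety": 0.99}

regressions = find_regressions(current_scores, baseline_scores, tolerance=0.02)
print(regressions)  # {'chat-quality': (0.95, 0.91)}

gate_passed = not regressions
print(gate_passed)  # False
```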

Baseline Tamper Detection

from evalgate_sdk import compute_baseline_checksum, verify_baseline_checksum, Baseline
 
baseline = Baseline(scores={"chat-quality": 0.95, "safety": 0.99})
checksum = compute_baseline_checksum(baseline)
 
# Later — verify integrity before gating
assert verify_baseline_checksum(baseline, checksum), "Baseline tampered!"
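
The release notes describe these helpers as SHA-256 tamper detection. A minimal sketch of the idea, using `hashlib` over a canonical JSON encoding, is shown below. The function names and the canonicalization choice (sorted keys, compact separators) are assumptions; the SDK may serialize the baseline differently.

```python
import hashlib
import hmac
import json


def checksum_scores(scores: dict) -> str:
    """SHA-256 hex digest over a canonical JSON encoding of the scores."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_scores(scores: dict, expected_checksum: str) -> bool:
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(checksum_scores(scores), expected_checksum)


scores = {"chat-quality": 0.95, "safety": 0.99}
stored_checksum = checksum_scores(scores)

tampered = {**scores, "safety": 0.5}
print(verify_scores(scores, stored_checksum))    # True
print(verify_scores(tampered, stored_checksum))  # False
```

A checksum only helps if it is stored somewhere the baseline editor cannot also rewrite, such as a protected CI variable.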

CLI

evalgate init                          # Scaffold eval config
evalgate discover                      # Find eval spec files
evalgate discover --manifest           # Generate stable manifest
evalgate run --write-results           # Run with artifact retention
evalgate gate                          # Regression gate
evalgate ci                            # Run + gate (CI mode)
evalgate ci --base main --format github # CI with PR summary
evalgate cluster --run .evalgate/runs/latest.json
evalgate label --run .evalgate/runs/latest.json
evalgate analyze --dataset .evalgate/golden/labeled.jsonl
evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
evalgate replay-decision --previous .evalgate/runs/run-prev.json --current .evalgate/runs/run-latest.json
evalgate auto run --objective "reduce hallucination" --baseline-run previous.json --candidate-run current.json
evalgate auto daemon --objective "reduce hallucination" --cycles 3
evalgate compare --base a.json --head b.json  # Side-by-side diff
evalgate doctor                        # Preflight checklist
evalgate explain                       # Root cause analysis on last failure
evalgate impact-analysis --base main   # Run only impacted specs

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Pass: no regression |
| 1 | Regression detected |
| 2 | Infra error (baseline missing, tests crashed) |
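
If you drive the gate from a Python script rather than a shell step, the exit codes above can be mapped to actions. The sketch below assumes only the three documented codes and the `evalgate gate` command; `GateExit`, `classify`, and `run_gate` are hypothetical names, and `run_gate` is defined but not called so the example runs without the CLI installed.

```python
import subprocess
from enum import IntEnum


class GateExit(IntEnum):
    PASS = 0
    REGRESSION = 1
    INFRA_ERROR = 2


def classify(returncode: int) -> str:
    """Translate a gate exit code into a human-readable action."""
    try:
        status = GateExit(returncode)
    except ValueError:
        return f"unexpected exit code {returncode}"
    if status is GateExit.PASS:
        return "pass: safe to deploy"
    if status is GateExit.REGRESSION:
        return "regression: block the deploy"
    return "infra error: fix the pipeline, then rerun"


def run_gate() -> str:
    # check=False so a non-zero exit does not raise; we inspect it ourselves.
    completed = subprocess.run(["evalgate", "gate"], check=False)
    return classify(completed.returncode)


print(classify(1))  # regression: block the deploy
```

Separating code 1 from code 2 matters in CI: a regression should block the merge, while an infra error usually calls for a rerun or a pipeline fix.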

Data Export & Import

from evalgate_sdk import export_data, import_data, ExportOptions, export_to_file
 
# Export
data = await export_data(client, ExportOptions(format="json"))
export_to_file(data, "backup.json")
 
# Import (2-arg API — client is optional keyword arg)
from evalgate_sdk import import_from_file
data = import_from_file("backup.json")
result = await import_data(data, client=client)
 
# LangSmith migration
from evalgate_sdk import import_from_langsmith
data = import_from_langsmith(langsmith_export)

Reliability

| Feature | Detail |
| --- | --- |
| Python | 3.10, 3.11, 3.12, 3.13 |
| Dependencies | Only httpx + pydantic |
| Async | Native async/await throughout; sync wrappers available |
| Type hints | Full py.typed; works with mypy and Pyright |
| Errors | Structured: RateLimitError, AuthenticationError, NetworkError, ValidationError; all have .message |
| Rate handling | Built-in RateLimiter with configurable tiers |
| Batching | batch_process() with concurrency control |
| Pagination | Async PaginatedIterator with cursor support |
| Timeouts | 30s default on all HTTP clients and LLM assertion calls |
| Offline | WorkflowTracer(offline=True), LocalStorage for file-based dev |

API Reference

| Module | Methods |
| --- | --- |
| client.traces | create, list, get, update, delete, create_span, list_spans |
| client.evaluations | create, get, list, update, delete, create_test_case, list_test_cases, create_run, list_runs, get_run |
| client.llm_judge | evaluate, create_config, list_configs, list_results, get_alignment |
| client.annotations | create, list, tasks.create, tasks.list, tasks.get, tasks.items.create, tasks.items.list |
| client.developer | get_usage, get_usage_summary, api_keys.*, webhooks.* |

Release Notes

v3.2.x

Highlights

  1. Full EvalGate loop: discover → cluster → label/analyze → synthesize → gate/auto
  2. Golden dataset workflow: canonical labeled dataset, analysis summaries, synthetic case generation, and replay decision helpers
  3. Guided optimization: evalgate auto run, evalgate auto daemon, and auto history/report helpers
  4. CLI parity improvements: Python CLI covers clustering, labeling, analysis, synthesis, replay-decision, and bounded auto workflows
  5. Tracing + workflow integrations: OpenAI, Anthropic, LangChain, CrewAI, and AutoGen remain first-class Python surfaces

Changelog

  1. Correctness fixes (parity with TypeScript SDK):
    • Assertion return types: sync helpers now normalize to AssertionResult, including contains_keywords, has_sentiment, has_readability_score, similar_to, contains_json, has_no_toxicity, matches_schema, has_valid_code_syntax, follows_instructions, and contains_all_required_fields
    • Toxicity blocklist: expanded from 9 → 95 terms across 8 categories; uses \b word-boundary regex (no substring false positives)
    • has_valid_code_syntax: Python uses ast.parse (real syntax validation); other languages use structural regex
    • has_factual_accuracy: entity-aware word-overlap check instead of raw substring matching
    • Expectation parity: expect(...).not_ now inverts fluent assertions and to_match_json() accepts JSON strings or embedded JSON snippets
    • Batch compatibility: run_assertions() now coerces legacy boolean and mapping results into AssertionResult
    • has_sentiment_with_score: confidence gradient scales with margin × magnitude; single-word inputs no longer return 1.0
    • WorkflowTracer: accepts name and offline kwargs; offline mode skips all API calls
    • import_data: 2-arg (data, options) signature matching TypeScript; client is keyword-only
    • Logger.child: uses : separator matching TypeScript (was .)
    • define_eval.skip / .only: attached as methods on define_eval
    • ValidationError.message: .message property on all error classes
    • AssertionLLMConfig.timeout_ms: 30s default, enforced via asyncio.wait_for
    • compute_baseline_checksum / verify_baseline_checksum: SHA-256 tamper detection
    • TestSuiteConfig: added retries, retry_delay_ms, retry_jitter, seed, strict, stop_on_failure
    • to_have_no_profanity: new method on Expectation matching TypeScript toHaveNoProfanity
    • RequestCache: removed from public exports (internal only)
  2. Production hardening:
    • 30s default timeout on all httpx.AsyncClient calls
    • API key validation before sending requests
    • URL-encoded query params in fetch_quality_latest
    • Graceful error handling in report_trace and OTel exporter (no more crashes on network errors)
    • run_report correctly sets success=False on test failures
    • GitHub Actions formatter uses GITHUB_OUTPUT (deprecated ::set-output removed)
    • Config parse errors logged as warnings instead of silently swallowed
    • save_trace / save_evaluation no longer mutate caller's dict
    • Subprocess timeout handling in regression gate
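
The toxicity item in the changelog above mentions switching to `\b` word-boundary matching to avoid substring false positives. The difference is easy to see with a two-term list. The terms and helper names below are illustrative only and are not the SDK's blocklist.

```python
import re

BLOCKLIST = ("hell", "damn")  # tiny illustrative list, not the SDK's

# One compiled pattern: any blocklisted term as a whole word, case-insensitive.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)


def substring_hit(text: str) -> bool:
    """Naive check: flags any text that merely contains a term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def word_boundary_hit(text: str) -> bool:
    """Whole-word check: a term must stand alone to count."""
    return PATTERN.search(text) is not None


print(substring_hit("Hello, Michelle"))      # True (false positive)
print(word_boundary_hit("Hello, Michelle"))  # False
print(word_boundary_hit("What the hell?"))   # True
```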

507 tests passing.


Examples

See examples/python/:

  • OpenAI Eval — Trace and evaluate OpenAI chat completions
  • RAG Eval — Evaluate retrieval-augmented generation pipelines
  • Agent Eval — Test and trace multi-agent workflows

No Lock-in

rm .evalgate/config.json

Your local assertions keep working. No account cancellation. No data export required.


Links

Platform · GitHub · TypeScript SDK

License

MIT
