EvalGate Python SDK — CI for AI behavior. Traces, evaluations, assertions, and regression gates for LLM apps.
pauly4010-evalgate-sdk
Build a living golden suite for AI behavior. 🚀
No infra. No lock-in. Remove anytime.
EvalGate = the full suite for AI quality in Python. Discover overlap, cluster failures, build golden datasets, run automated regression gates, and guide optimization before changes reach production.
The Full EvalGate Workflow
EvalGate is no longer just a pass/fail gate at the end of CI. The current workflow is a full loop:
discover -> cluster -> label/analyze -> synthesize -> gate/auto
- Discover overlap before adding more tests with evalgate discover --manifest
- Cluster failures by pattern with evalgate cluster --run .evalgate/runs/latest.json
- Build a labeled golden dataset with evalgate label and evalgate analyze
- Draft broader golden cases with evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
- Block regressions or run guided optimization with evalgate gate, evalgate ci, and evalgate auto
The Python SDK ships the same closed-loop workflow primitives as the platform: assertions, spec execution, tracing, clustering, golden-dataset analysis, synthesis, replay decision, and guided auto iterations.
Install
pip install pauly4010-evalgate-sdk # Core
pip install "pauly4010-evalgate-sdk[openai]" # + OpenAI tracing and async assertions
pip install "pauly4010-evalgate-sdk[anthropic]" # + Anthropic tracing and async assertions
pip install "pauly4010-evalgate-sdk[all]" # Everything
Quickstart
No API key needed for local assertions:
from evalgate_sdk import AIEvalClient, expect
from evalgate_sdk.types import CreateTraceParams
# Local assertions — no API key needed
result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed) # True
# Platform: trace and evaluate with API key
client = AIEvalClient(api_key="sk-...")
trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
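The platform call above uses await, which Python only accepts inside a coroutine (or an async-aware REPL). A minimal sketch of driving such a call from a plain script follows; create_trace is a hypothetical stand-in for an awaitable SDK call so the example runs without an API key.

```python
import asyncio


async def create_trace(name: str) -> dict:
    """Hypothetical stand-in for an awaitable SDK call (not the real client)."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return {"name": name}


async def main() -> dict:
    # Awaitable SDK calls belong inside a coroutine like this one.
    return await create_trace("chat-quality")


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result["name"])  # chat-quality
```

The same wrapper pattern applies to the other await-based snippets in this README.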
Same CI gate, same quality checks. Python supports the same core loop as TypeScript: assertions, test suites, OpenAI/Anthropic tracing, LangChain/CrewAI/AutoGen integrations, golden dataset workflow commands, and regression gates.
Python CLI: pip install "pauly4010-evalgate-sdk[cli]" → evalgate init, evalgate run, evalgate check, evalgate gate, evalgate ci, evalgate discover, evalgate cluster, evalgate label, evalgate analyze, evalgate synthesize, evalgate replay-decision, evalgate explain, evalgate doctor, evalgate auto.
Context helpers are importable from the package root:
from evalgate_sdk import ContextMetadata, create_context
ctx: ContextMetadata = {"run_id": "test-run"}
token = create_context(ctx)
Why EvalGate?
LLMs don't fail like traditional software: they drift silently. EvalGate turns evaluations into CI gates so regressions are caught before they reach production.
| What you get | How it works |
|---|---|
| 30+ assertions | expect(output).to_contain("Paris"), .to_not_contain_pii(), .to_have_no_profanity() |
| DSL spec system | define_eval("name", executor) with .skip and .only support |
| Test suites | Define cases with retries, seed, strict mode, and stop-on-failure |
| Workflow tracing | Multi-agent handoffs, decisions, costs — with offline mode |
| OpenAI / Anthropic | Drop-in tracing wrappers + LangChain, CrewAI, AutoGen |
| Regression gates | Block deploys when eval scores drop, with baseline tamper detection |
| Snapshot testing | Save, compare, and diff outputs over time |
| Impact analysis | evalgate discover → manifest → impact analysis → run only what changed |
| CLI | evalgate run, evalgate check, evalgate gate, evalgate ci, evalgate discover, evalgate cluster, evalgate label, evalgate analyze, evalgate synthesize, evalgate replay-decision, evalgate explain, evalgate doctor, evalgate auto |
Assertions
30+ built-in checks for LLM output quality, safety, and structure. All return AssertionResult with .passed, .message, .expected, .actual.
Fluent API (expect)
from evalgate_sdk import expect
# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("draft output").not_.to_contain("final answer")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()
expect("Clean output").to_have_no_profanity()
# Sentiment
expect("Great product!").to_have_sentiment("positive")
# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect('{"name": "Alice"}').to_match_json({"type": "object"})
expect('payload={"name": "Alice"}').to_match_json({"required": ["name"]})
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)
expect(output).to_contain_keywords(["gravity", "force"])
# Comparison
expect(42).to_be_greater_than(10)
expect(42).to_be_less_than(100)
expect(True).to_be_truthy()
# Code
expect("def hello(): pass").to_contain_code()
# Hallucination
expect(output).to_not_hallucinate(["Paris is the capital of France"])
Standalone Functions
from evalgate_sdk import (
contains_keywords, has_no_toxicity, has_sentiment, similar_to,
contains_json, has_readability_score, has_factual_accuracy,
has_valid_code_syntax, has_sentiment_with_score, matches_pattern,
matches_schema, responded_within_duration, responded_within_time_since,
run_assertions,
)
# Sync standalone assertion helpers return AssertionResult
result = has_no_toxicity("Thank you for your help.")
print(result.passed, result.message)
result = has_valid_code_syntax("def hello():\n return 'hi'", "python")
print(result.passed) # True — uses ast.parse for Python
result = matches_schema('payload={"status": "ok"}', {"required": ["status"]})
print(result.passed, result.actual)
# Batch assertions
results = run_assertions([
lambda: expect(output).to_contain("Paris"),
lambda: expect(output).to_have_sentiment("positive"),
lambda: expect(output).to_have_length(min=10),
lambda: True, # legacy bools are coerced into AssertionResult
])
all_passed = all(r.passed for r in results)
Compatibility helpers such as has_pii(), async semantic checks like has_sentiment_async(), and score-style utilities such as has_consistency() still return booleans or dictionaries where documented.
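To illustrate what coercing mixed return values into one result type can look like, here is a rough sketch. Result, coerce, and run_checks are illustrative names invented for this example; they are not the SDK's AssertionResult or run_assertions, and the SDK's actual coercion rules may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Result:
    """Illustrative stand-in for a structured assertion result."""
    passed: bool
    message: str = ""


def coerce(value: Any) -> Result:
    """Normalize a Result, a bool, or a mapping with a 'passed' key."""
    if isinstance(value, Result):
        return value
    if isinstance(value, bool):
        return Result(passed=value, message="coerced from bool")
    if isinstance(value, dict) and "passed" in value:
        return Result(bool(value["passed"]), str(value.get("message", "")))
    raise TypeError(f"cannot coerce {type(value).__name__} into a Result")


def run_checks(checks: Iterable[Callable[[], Any]]) -> list[Result]:
    """Call each zero-argument check and normalize whatever it returns."""
    return [coerce(check()) for check in checks]


results = run_checks([
    lambda: Result(True, "contains Paris"),
    lambda: True,                                   # legacy bool
    lambda: {"passed": False, "message": "too short"},  # legacy mapping
])
print([r.passed for r in results])  # [True, True, False]
```

Normalizing at the batch boundary means downstream code can rely on .passed and .message regardless of which style of helper produced the value.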
LLM-Backed Assertions (Async)
For context-aware checking beyond heuristics. Install the matching optional extra first, for example pip install "pauly4010-evalgate-sdk[openai]" when using the default OpenAI provider.
from evalgate_sdk import configure_assertions
from evalgate_sdk import has_sentiment_async, has_no_toxicity_async
configure_assertions(
provider="openai", # or "anthropic"
api_key="sk-...",
model="gpt-4o-mini",
timeout_ms=30_000, # 30s default, prevents hung calls
)
matches = await has_sentiment_async("subtle irony...", "negative")
is_safe = await has_no_toxicity_async("borderline text")
You can also keep using configure_assertions(AssertionLLMConfig(...)) when you prefer an explicit config object.
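The release notes state that timeout_ms is enforced via asyncio.wait_for. A minimal sketch of that general pattern is shown here; with_timeout and slow_judge are hypothetical names for this example and do not reflect the SDK's internals.

```python
import asyncio


async def with_timeout(coro, timeout_ms: int = 30_000):
    """Await coro, raising asyncio.TimeoutError once timeout_ms has elapsed."""
    return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)


async def slow_judge() -> bool:
    """Hypothetical LLM-backed check that takes longer than we will wait."""
    await asyncio.sleep(1.0)
    return True


async def main() -> str:
    try:
        await with_timeout(slow_judge(), timeout_ms=50)
        return "completed"
    except asyncio.TimeoutError:
        # wait_for cancels the pending coroutine before raising.
        return "timed out"


outcome = asyncio.run(main())
print(outcome)  # timed out
```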
DSL Spec System
Define evaluation specs with the define_eval DSL — the same API as the TypeScript SDK:
from evalgate_sdk import define_eval, create_result
define_eval("Math Operations", async_executor)
# Object form with metadata
define_eval({
"name": "String check",
"tags": ["basic"],
"executor": async_executor,
})
# Skip / Only (matches TS defineEval.skip / defineEval.only)
define_eval.skip("Skipped spec", async_executor)
define_eval.only("Focus spec", async_executor)
Test Suites
from evalgate_sdk import create_test_suite
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig
suite = create_test_suite("safety-checks", TestSuiteConfig(
evaluator=my_llm_function,
test_cases=[
TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
TestSuiteCase(name="pii-check", input="Describe yourself",
assertions=[{"type": "not_contains_pii"}]),
],
retries=3, # Retry failed cases (default: 0)
retry_delay_ms=1000, # Delay between retries
retry_jitter=True, # Add jitter to retry delay
seed=42, # Deterministic ordering
strict=True, # Fail on warnings
stop_on_failure=True, # Abort on first failure
))
result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
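The retry options above can be pictured with a small stdlib sketch. The jitter policy used here (a uniform factor between 0.5 and 1.5 of the base delay) is an assumption chosen for illustration, and run_with_retries is a hypothetical helper, not the suite runner itself.

```python
import asyncio
import random


def retry_delay_s(base_ms: int, jitter: bool, rng: random.Random) -> float:
    """Delay before the next attempt, in seconds (assumed jitter policy)."""
    factor = rng.uniform(0.5, 1.5) if jitter else 1.0
    return base_ms * factor / 1000


async def run_with_retries(case, retries: int, base_ms: int, jitter: bool, seed: int):
    """Run an async case up to 1 + retries times; return (passed, attempts)."""
    rng = random.Random(seed)  # seeded so delays are reproducible
    for attempt in range(1, retries + 2):
        if await case():
            return True, attempt
        if attempt <= retries:
            await asyncio.sleep(retry_delay_s(base_ms, jitter, rng))
    return False, retries + 1


calls = {"n": 0}


async def flaky_case() -> bool:
    """Fails twice, then passes."""
    calls["n"] += 1
    return calls["n"] >= 3


passed, attempts = asyncio.run(
    run_with_retries(flaky_case, retries=3, base_ms=1, jitter=True, seed=42)
)
print(passed, attempts)  # True 3
```

Jitter spreads retries out so that many failing cases do not all hit a rate-limited provider at the same instant.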
OpenAI Integration
from openai import AsyncOpenAI
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.openai import trace_openai
traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Explain gravity"}]
)
# Automatically traced with latency, tokens, and output
Batch eval with built-in assertions:
from evalgate_sdk import openai_chat_eval, OpenAIChatEvalCase
result = await openai_chat_eval(
name="chat-quality",
model="gpt-4",
cases=[
OpenAIChatEvalCase(
input="Explain gravity in one sentence.",
assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
),
],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")
Anthropic Integration
from anthropic import AsyncAnthropic
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.anthropic import trace_anthropic
traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain gravity"}]
)
Also available: trace_langchain, trace_crewai, trace_autogen.
Workflow Tracing
Track multi-agent systems end-to-end — handoffs, decisions, and cost:
from evalgate_sdk import AIEvalClient, WorkflowTracer
from evalgate_sdk.types import HandoffType, CostCategory, RecordCostParams
client = AIEvalClient.init()
tracer = WorkflowTracer(client, name="research-pipeline")
ctx = await tracer.start_workflow()
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})
await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))
await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")
Offline Mode
Run workflow tracing locally without an API connection:
tracer = WorkflowTracer(None, name="local-test", offline=True)
ctx = await tracer.start_workflow() # No API calls, no crash
You can also omit the client entirely when you want local-only workflow tracing:
from evalgate_sdk import create_workflow_tracer
tracer = create_workflow_tracer(name="local-test")
ctx = await tracer.start_workflow()
assert ctx.trace_id is None
Batch Processing
batch_process(items, processor, concurrency=...) expects an async callable for processor and returns results in input order.
from evalgate_sdk import batch_process
async def double(value: int) -> int:
    return value * 2
results = await batch_process([1, 2, 3], double, concurrency=2)
If you pass a synchronous function, the SDK raises TypeError immediately instead of failing later with a generic await error.
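For readers curious how ordered, bounded-concurrency processing with an early type check can be built, here is a sketch using only asyncio. bounded_map is a hypothetical name, and this is one plausible approach rather than the SDK's implementation.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    items: Sequence[T],
    processor: Callable[[T], Awaitable[R]],
    concurrency: int = 5,
) -> list[R]:
    """Apply an async processor to items, at most `concurrency` at a time."""
    if not inspect.iscoroutinefunction(processor):
        # Fail fast with a clear message instead of a later await error.
        raise TypeError("processor must be an async callable")
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await processor(item)

    # gather returns results in argument order, regardless of finish order.
    return list(await asyncio.gather(*(run_one(item) for item in items)))


async def double(value: int) -> int:
    await asyncio.sleep(0)
    return value * 2


results = asyncio.run(bounded_map([1, 2, 3], double, concurrency=2))
print(results)  # [2, 4, 6]
```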
Snapshot Testing
Snapshots are stored in .snapshots by default, relative to the current working directory.
from evalgate_sdk import compare_with_snapshot, snapshot
snapshot("Hello there", "support-reply")
comparison = compare_with_snapshot("support-reply", "Hello there")
print(comparison.matches)
Override the directory when you want snapshots under a project-specific path:
snapshot("Hello there", "support-reply", directory=".evalgate/snapshots")
Add .snapshots/ to your .gitignore unless you intentionally want snapshot files committed.
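As a rough picture of what saving and comparing a snapshot involves, the sketch below writes one text file per snapshot and diffs it with difflib. The file layout (name.txt) and the helper names are assumptions for illustration; the SDK's on-disk format may differ.

```python
import difflib
import tempfile
from pathlib import Path


def save_snapshot(output: str, name: str, directory: Path) -> Path:
    """Write output to <directory>/<name>.txt, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.txt"
    path.write_text(output, encoding="utf-8")
    return path


def diff_against_snapshot(name: str, output: str, directory: Path) -> list[str]:
    """Return unified-diff lines; an empty list means the output matches."""
    saved = (directory / f"{name}.txt").read_text(encoding="utf-8")
    return list(difflib.unified_diff(
        saved.splitlines(), output.splitlines(),
        "snapshot", "current", lineterm="",
    ))


# A temporary directory keeps the example from touching the working tree.
with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp) / ".snapshots"
    save_snapshot("Hello there", "support-reply", snapshots)
    same = diff_against_snapshot("support-reply", "Hello there", snapshots)
    changed = diff_against_snapshot("support-reply", "Hello, friend", snapshots)

print(same == [])        # True
print(len(changed) > 0)  # True
```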
Regression Gates
Block deployments when eval scores drop:
from evalgate_sdk import evaluate_regression, to_pass_gate
report = evaluate_regression(current_results, baseline)
assert to_pass_gate(report), f"Regression detected: {report.summary}"
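Conceptually, a regression gate compares current scores against a stored baseline and fails when any metric drops too far. The sketch below shows that idea with plain dictionaries; find_regressions and its tolerance parameter are hypothetical and are not the evaluate_regression API.

```python
def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell more than `tolerance`.

    A metric missing from `current` is treated as a score of 0.0.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline = {"chat-quality": 0.95, "safety": 0.99}
current = {"chat-quality": 0.90, "safety": 0.99}

regressions = find_regressions(current, baseline, tolerance=0.01)
print(regressions)  # {'chat-quality': 0.05}
```

A small tolerance keeps ordinary run-to-run noise in LLM scores from blocking every deploy.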
Baseline Tamper Detection
from evalgate_sdk import compute_baseline_checksum, verify_baseline_checksum, Baseline
baseline = Baseline(scores={"chat-quality": 0.95, "safety": 0.99})
checksum = compute_baseline_checksum(baseline)
# Later — verify integrity before gating
assert verify_baseline_checksum(baseline, checksum), "Baseline tampered!"
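The release notes describe the checksum as SHA-256 based. A minimal sketch of the general technique, hashing a canonical JSON serialization so key order does not affect the digest, is shown here; the serialization details and function names are assumptions, not the SDK's exact scheme.

```python
import hashlib
import hmac
import json


def checksum(scores: dict[str, float]) -> str:
    """SHA-256 hex digest of a canonical (sorted-key, compact) JSON form."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(scores: dict[str, float], expected: str) -> bool:
    """Constant-time comparison of the recomputed digest with the stored one."""
    return hmac.compare_digest(checksum(scores), expected)


baseline = {"chat-quality": 0.95, "safety": 0.99}
stored = checksum(baseline)

reordered = {"safety": 0.99, "chat-quality": 0.95}
tampered = {"chat-quality": 0.50, "safety": 0.99}

print(verify(reordered, stored))  # True
print(verify(tampered, stored))   # False
```

A plain hash detects accidental or casual edits; it does not stop someone who can rewrite both the baseline and the stored checksum.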
CLI
evalgate init # Scaffold eval config
evalgate discover # Find eval spec files
evalgate discover --manifest # Generate stable manifest
evalgate run --write-results # Run with artifact retention
evalgate gate # Regression gate
evalgate ci # Run + gate (CI mode)
evalgate ci --base main --format github # CI with PR summary
evalgate cluster --run .evalgate/runs/latest.json
evalgate label --run .evalgate/runs/latest.json
evalgate analyze --dataset .evalgate/golden/labeled.jsonl
evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
evalgate replay-decision --previous .evalgate/runs/run-prev.json --current .evalgate/runs/run-latest.json
evalgate auto run --objective "reduce hallucination" --baseline-run previous.json --candidate-run current.json
evalgate auto daemon --objective "reduce hallucination" --cycles 3
evalgate compare --base a.json --head b.json # Side-by-side diff
evalgate doctor # Preflight checklist
evalgate explain # Root cause analysis on last failure
evalgate impact-analysis --base main # Run only impacted specs
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Pass — no regression |
| 1 | Regression detected |
| 2 | Infra error (baseline missing, tests crashed) |
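In a custom CI script, the exit codes in the table can be mapped to readable statuses. The sketch below uses a stand-in child process that exits with code 1 so it runs anywhere; in practice the command list would be the evalgate invocation you use. describe_exit and run_and_describe are hypothetical helpers.

```python
import subprocess
import sys

# Meanings taken from the exit code table.
EXIT_MEANINGS = {0: "pass", 1: "regression", 2: "infra error"}


def describe_exit(code: int) -> str:
    """Translate a process exit code into a short status string."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_describe(command: list[str]) -> str:
    """Run a command without raising on non-zero exit and describe the result."""
    completed = subprocess.run(command, check=False)
    return describe_exit(completed.returncode)


# Stand-in command: a Python child process that exits with code 1.
status = run_and_describe([sys.executable, "-c", "raise SystemExit(1)"])
print(status)  # regression
```

Distinguishing code 2 from code 1 lets a pipeline retry on infrastructure failures while still blocking on real regressions.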
Data Export & Import
from evalgate_sdk import export_data, import_data, ExportOptions, export_to_file
# Export
data = await export_data(client, ExportOptions(format="json"))
export_to_file(data, "backup.json")
# Import (2-arg API — client is optional keyword arg)
from evalgate_sdk import import_from_file
data = import_from_file("backup.json")
result = await import_data(data, client=client)
# LangSmith migration
from evalgate_sdk import import_from_langsmith
data = import_from_langsmith(langsmith_export)
Reliability
| Feature | Detail |
|---|---|
| Python | 3.10, 3.11, 3.12, 3.13 |
| Dependencies | Only httpx + pydantic |
| Async | Native async/await throughout; sync wrappers available |
| Type hints | Full py.typed — works with mypy and Pyright |
| Errors | Structured: RateLimitError, AuthenticationError, NetworkError, ValidationError — all have .message |
| Rate handling | Built-in RateLimiter with configurable tiers |
| Batching | batch_process() with concurrency control |
| Pagination | Async PaginatedIterator with cursor support |
| Timeouts | 30s default on all HTTP clients and LLM assertion calls |
| Offline | WorkflowTracer(offline=True), LocalStorage for file-based dev |
API Reference
| Module | Methods |
|---|---|
| client.traces | create, list, get, update, delete, create_span, list_spans |
| client.evaluations | create, get, list, update, delete, create_test_case, list_test_cases, create_run, list_runs, get_run |
| client.llm_judge | evaluate, create_config, list_configs, list_results, get_alignment |
| client.annotations | create, list, tasks.create, tasks.list, tasks.get, tasks.items.create, tasks.items.list |
| client.developer | get_usage, get_usage_summary, api_keys.*, webhooks.* |
Release Notes
v3.2.x
Highlights
- Full EvalGate loop: discover → cluster → label/analyze → synthesize → gate/auto
- Golden dataset workflow: canonical labeled dataset, analysis summaries, synthetic case generation, and replay decision helpers
- Guided optimization: evalgate auto run, evalgate auto daemon, and auto history/report helpers
- CLI parity improvements: Python CLI covers clustering, labeling, analysis, synthesis, replay-decision, and bounded auto workflows
- Tracing + workflow integrations: OpenAI, Anthropic, LangChain, CrewAI, and AutoGen remain first-class Python surfaces
Changelog
- Correctness fixes (parity with TypeScript SDK):
  - Assertion return types: sync helpers now normalize to AssertionResult, including contains_keywords, has_sentiment, has_readability_score, similar_to, contains_json, has_no_toxicity, matches_schema, has_valid_code_syntax, follows_instructions, and contains_all_required_fields
  - Toxicity blocklist: expanded from 9 → 95 terms across 8 categories; uses \b word-boundary regex (no substring false positives)
  - has_valid_code_syntax: Python uses ast.parse (real syntax validation); other languages use structural regex
  - has_factual_accuracy: entity-aware word-overlap check instead of raw substring matching
  - Expectation parity: expect(...).not_ now inverts fluent assertions and to_match_json() accepts JSON strings or embedded JSON snippets
  - Batch compatibility: run_assertions() now coerces legacy boolean and mapping results into AssertionResult
  - has_sentiment_with_score: confidence gradient scales with margin × magnitude; single-word inputs no longer return 1.0
  - WorkflowTracer: accepts name and offline kwargs; offline mode skips all API calls
  - import_data: 2-arg (data, options) signature matching TypeScript; client is keyword-only
  - Logger.child: uses : separator matching TypeScript (was .)
  - define_eval.skip / .only: attached as methods on define_eval
  - ValidationError.message: .message property on all error classes
  - AssertionLLMConfig.timeout_ms: 30s default, enforced via asyncio.wait_for
  - compute_baseline_checksum / verify_baseline_checksum: SHA-256 tamper detection
  - TestSuiteConfig: added retries, retry_delay_ms, retry_jitter, seed, strict, stop_on_failure
  - to_have_no_profanity: new method on Expectation matching TypeScript toHaveNoProfanity
  - RequestCache: removed from public exports (internal only)
- Production hardening:
  - 30s default timeout on all httpx.AsyncClient calls
  - API key validation before sending requests
  - URL-encoded query params in fetch_quality_latest
  - Graceful error handling in report_trace and OTel exporter (no more crashes on network errors)
  - run_report correctly sets success=False on test failures
  - GitHub Actions formatter uses GITHUB_OUTPUT (deprecated ::set-output removed)
  - Config parse errors logged as warnings instead of silently swallowed
  - save_trace / save_evaluation no longer mutate caller's dict
  - Subprocess timeout handling in regression gate
507 tests passing.
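The changelog's note on word-boundary matching can be illustrated with a small comparison of substring search and a \b-anchored regex. The two-term blocklist here is a harmless placeholder invented for the example and has nothing to do with the SDK's actual list.

```python
import re

# Placeholder terms for illustration only.
BLOCKLIST = ["scam", "spam"]

# One alternation, anchored on word boundaries, case-insensitive.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)


def substring_hits(text: str) -> list[str]:
    """Naive check: flags any blocklisted term appearing inside another word."""
    lowered = text.lower()
    return [term for term in BLOCKLIST if term in lowered]


def word_hits(text: str) -> list[str]:
    """Word-boundary check: flags only whole-word matches."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


menu = "The scampi was excellent."
print(substring_hits(menu))  # ['scam']  (false positive)
print(word_hits(menu))       # []

warning = "This offer is a SCAM."
print(word_hits(warning))    # ['scam']
```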
Examples
See examples/python/:
- OpenAI Eval — Trace and evaluate OpenAI chat completions
- RAG Eval — Evaluate retrieval-augmented generation pipelines
- Agent Eval — Test and trace multi-agent workflows
No Lock-in
rm .evalgate/config.json
Your local assertions keep working. No account cancellation. No data export required.
Links
Platform · GitHub · TypeScript SDK
License
MIT