EvalGate Python SDK — CI for AI behavior. Traces, evaluations, assertions, and regression gates for LLM apps.

pauly4010-evalgate-sdk

Build a living golden suite for AI behavior. 🚀

No infra. No lock-in. Remove anytime.

EvalGate = the full suite for AI quality in Python. Discover overlap, cluster failures, build golden datasets, run automated regression gates, and guide optimization before changes reach production.

The Full EvalGate Workflow

EvalGate is no longer just a pass/fail gate at the end of CI. The current workflow is a full loop:

discover -> cluster -> label/analyze -> synthesize -> gate/auto

Discover overlap before adding more tests with evalgate discover --manifest
Cluster failures by pattern with evalgate cluster --run .evalgate/runs/latest.json
Build a labeled golden dataset with evalgate label and evalgate analyze
Draft broader golden cases with evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
Block regressions or run guided optimization with evalgate gate, evalgate ci, and evalgate auto

The Python SDK ships the same closed-loop workflow primitives as the platform: assertions, spec execution, tracing, clustering, golden-dataset analysis, synthesis, replay decision, and guided auto iterations.

Install

pip install pauly4010-evalgate-sdk                        # Core
pip install "pauly4010-evalgate-sdk[openai]"              # + OpenAI tracing and async assertions
pip install "pauly4010-evalgate-sdk[anthropic]"           # + Anthropic tracing and async assertions
pip install "pauly4010-evalgate-sdk[all]"                 # Everything

Quickstart

No API key needed for local assertions:

import asyncio

from evalgate_sdk import AIEvalClient, expect
from evalgate_sdk.types import CreateTraceParams

# Local assertions: no API key needed
result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)  # True

# Platform: trace and evaluate with an API key (await needs an async context)
async def main() -> None:
    client = AIEvalClient(api_key="sk-...")
    trace = await client.traces.create(CreateTraceParams(name="chat-quality"))

asyncio.run(main())

Same CI gate, same quality checks. Python supports the same core loop as TypeScript: assertions, test suites, OpenAI/Anthropic tracing, LangChain/CrewAI/AutoGen integrations, golden dataset workflow commands, and regression gates.

Python CLI: pip install "pauly4010-evalgate-sdk[cli]" → evalgate init, evalgate run, evalgate check, evalgate gate, evalgate ci, evalgate discover, evalgate cluster, evalgate label, evalgate analyze, evalgate synthesize, evalgate replay-decision, evalgate explain, evalgate doctor, evalgate auto.

Context helpers are importable from the package root:

from evalgate_sdk import ContextMetadata, create_context

ctx: ContextMetadata = {"run_id": "test-run"}
token = create_context(ctx)

Why EvalGate?

LLMs don't fail like traditional software; they drift silently. EvalGate turns evaluations into CI gates so regressions are caught before they reach production.

What you get	How it works
30+ assertions	`expect(output).to_contain("Paris")`, `.to_not_contain_pii()`, `.to_have_no_profanity()`
DSL spec system	`define_eval("name", executor)` with `.skip` and `.only` support
Test suites	Define cases with retries, seed, strict mode, and stop-on-failure
Workflow tracing	Multi-agent handoffs, decisions, costs — with offline mode
OpenAI / Anthropic	Drop-in tracing wrappers + LangChain, CrewAI, AutoGen
Regression gates	Block deploys when eval scores drop, with baseline tamper detection
Snapshot testing	Save, compare, and diff outputs over time
Impact analysis	`evalgate discover` → manifest → impact analysis → run only what changed
CLI	`evalgate run`, `evalgate check`, `evalgate gate`, `evalgate ci`, `evalgate discover`, `evalgate cluster`, `evalgate label`, `evalgate analyze`, `evalgate synthesize`, `evalgate replay-decision`, `evalgate explain`, `evalgate doctor`, `evalgate auto`

Assertions

30+ built-in checks for LLM output quality, safety, and structure. All return AssertionResult with .passed, .message, .expected, .actual.

Fluent API (`expect`)

from evalgate_sdk import expect
 
# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("draft output").not_.to_contain("final answer")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()
expect("Clean output").to_have_no_profanity()
 
# Sentiment
expect("Great product!").to_have_sentiment("positive")
 
# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect('{"name": "Alice"}').to_match_json({"type": "object"})
expect('payload={"name": "Alice"}').to_match_json({"required": ["name"]})
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)
expect("Gravity is a force between masses.").to_contain_keywords(["gravity", "force"])
 
# Comparison
expect(42).to_be_greater_than(10)
expect(42).to_be_less_than(100)
expect(True).to_be_truthy()
 
# Code
expect("def hello(): pass").to_contain_code()
 
# Hallucination
expect("Paris is the capital of France.").to_not_hallucinate(["Paris is the capital of France"])

Standalone Functions

from evalgate_sdk import (
    contains_keywords, has_no_toxicity, has_sentiment, similar_to,
    contains_json, has_readability_score, has_factual_accuracy,
    has_valid_code_syntax, has_sentiment_with_score, matches_pattern,
    matches_schema, responded_within_duration, responded_within_time_since,
    run_assertions,
)
 
# Sync standalone assertion helpers return AssertionResult
result = has_no_toxicity("Thank you for your help.")
print(result.passed, result.message)
 
result = has_valid_code_syntax("def hello():\n    return 'hi'", "python")
print(result.passed)  # True — uses ast.parse for Python

result = matches_schema('payload={"status": "ok"}', {"required": ["status"]})
print(result.passed, result.actual)
 
# Batch assertions
output = "Paris is a wonderful city to visit."  # sample model output
results = run_assertions([
    lambda: expect(output).to_contain("Paris"),
    lambda: expect(output).to_have_sentiment("positive"),
    lambda: expect(output).to_have_length(min=10),
    lambda: True,  # legacy bools are coerced into AssertionResult
])
all_passed = all(r.passed for r in results)

Compatibility helpers such as has_pii(), async semantic checks like has_sentiment_async(), and score-style utilities such as has_consistency() still return booleans or dictionaries where documented.
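The comment on has_valid_code_syntax above says Python input is checked with ast.parse. As a rough illustration of what such a check involves, here is a minimal standard-library sketch. The helper name python_syntax_ok is hypothetical and is not part of evalgate_sdk; the SDK's own function returns an AssertionResult, not a bool.

```python
import ast


def python_syntax_ok(source: str) -> bool:
    """Return True if `source` parses as Python (hypothetical helper, not SDK code)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


valid = python_syntax_ok("def hello():\n    return 'hi'")
invalid = python_syntax_ok("def hello(:\n    return 'hi'")
print(valid, invalid)  # True False
```

Parsing checks syntax only; it does not execute the code or confirm that names resolve.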

LLM-Backed Assertions (Async)

Use these for context-aware checks that go beyond heuristics. Install the matching optional extra first, for example pip install "pauly4010-evalgate-sdk[openai]" when using the default OpenAI provider.

from evalgate_sdk import configure_assertions
from evalgate_sdk import has_sentiment_async, has_no_toxicity_async

configure_assertions(
    provider="openai",             # or "anthropic"
    api_key="sk-...",
    model="gpt-4o-mini",
    timeout_ms=30_000,              # 30s default, prevents hung calls
)

matches = await has_sentiment_async("subtle irony...", "negative")
is_safe = await has_no_toxicity_async("borderline text")

You can also keep using configure_assertions(AssertionLLMConfig(...)) when you prefer an explicit config object.
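The changelog notes that timeout_ms is enforced via asyncio.wait_for. The sketch below shows that general pattern with the standard library only. call_with_timeout and slow_judge are hypothetical names used for illustration and are not part of evalgate_sdk.

```python
import asyncio


async def call_with_timeout(coro, timeout_ms: int = 30_000):
    """Await `coro`, raising asyncio.TimeoutError after `timeout_ms` milliseconds."""
    return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)


async def slow_judge(delay_s: float) -> bool:
    """Hypothetical stand-in for an LLM-backed check that takes `delay_s` seconds."""
    await asyncio.sleep(delay_s)
    return True


async def main() -> tuple[bool, bool]:
    # Finishes well inside the limit.
    fast = await call_with_timeout(slow_judge(0.01), timeout_ms=500)
    # Exceeds the limit, so wait_for cancels it and raises.
    try:
        await call_with_timeout(slow_judge(1.0), timeout_ms=50)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return fast, timed_out


fast, timed_out = asyncio.run(main())
print(fast, timed_out)  # True True
```

With this pattern a hung provider call surfaces as an exception the caller can handle, instead of blocking the eval run indefinitely.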

DSL Spec System

Define evaluation specs with the define_eval DSL — the same API as the TypeScript SDK:

from evalgate_sdk import define_eval, create_result
 
define_eval("Math Operations", async_executor)
 
# Object form with metadata
define_eval({
    "name": "String check",
    "tags": ["basic"],
    "executor": async_executor,
})
 
# Skip / Only (matches TS defineEval.skip / defineEval.only)
define_eval.skip("Skipped spec", async_executor)
define_eval.only("Focus spec", async_executor)

Test Suites

from evalgate_sdk import create_test_suite
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig
 
suite = create_test_suite("safety-checks", TestSuiteConfig(
    evaluator=my_llm_function,
    test_cases=[
        TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
        TestSuiteCase(name="pii-check", input="Describe yourself",
                      assertions=[{"type": "not_contains_pii"}]),
    ],
    retries=3,                # Retry failed cases (default: 0)
    retry_delay_ms=1000,      # Delay between retries
    retry_jitter=True,        # Add jitter to retry delay
    seed=42,                  # Deterministic ordering
    strict=True,              # Fail on warnings
    stop_on_failure=True,     # Abort on first failure
))
 
result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
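To make the retry options above concrete, here is a minimal sketch of how retries, a retry delay, and jitter could interact. It is an illustration under stated assumptions, not the SDK's implementation: run_with_retries and flaky_case are hypothetical, and the jitter formula (scaling the delay by a random factor between 0.5 and 1.5) is an assumption.

```python
import asyncio
import random


async def run_with_retries(case, retries=0, retry_delay_ms=1000, retry_jitter=False, rng=None):
    """Call async `case()` until it returns True or retries run out.

    Returns (passed, attempts). Hypothetical helper for illustration.
    """
    rng = rng or random.Random()
    attempts = 0
    for attempt in range(retries + 1):
        attempts += 1
        if await case():
            return True, attempts
        if attempt < retries:
            delay_s = retry_delay_ms / 1000
            if retry_jitter:
                # Assumed jitter: scale the delay by a factor in [0.5, 1.5).
                delay_s *= 0.5 + rng.random()
            await asyncio.sleep(delay_s)
    return False, attempts


calls = {"n": 0}


async def flaky_case() -> bool:
    """Fails on the first two calls, then passes."""
    calls["n"] += 1
    return calls["n"] >= 3


passed, attempts = asyncio.run(
    run_with_retries(flaky_case, retries=3, retry_delay_ms=10, retry_jitter=True, rng=random.Random(42))
)
print(passed, attempts)  # True 3
```

Jitter spreads retries out so that many failing cases do not hit a rate-limited provider at the same instant.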

OpenAI Integration

from openai import AsyncOpenAI
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.openai import trace_openai
 
traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain gravity"}]
)
# Automatically traced with latency, tokens, and output

Batch eval with built-in assertions:

from evalgate_sdk import openai_chat_eval, OpenAIChatEvalCase
 
result = await openai_chat_eval(
    name="chat-quality",
    model="gpt-4",
    cases=[
        OpenAIChatEvalCase(
            input="Explain gravity in one sentence.",
            assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
        ),
    ],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")

Anthropic Integration

from anthropic import AsyncAnthropic
from evalgate_sdk import AIEvalClient
from evalgate_sdk.integrations.anthropic import trace_anthropic
 
traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain gravity"}]
)

Also available: trace_langchain, trace_crewai, trace_autogen.

Workflow Tracing

Track multi-agent systems end-to-end — handoffs, decisions, and cost:

from evalgate_sdk import AIEvalClient, WorkflowTracer
from evalgate_sdk.types import HandoffType, CostCategory, RecordCostParams
 
client = AIEvalClient.init()
tracer = WorkflowTracer(client, name="research-pipeline")
 
ctx = await tracer.start_workflow()
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})
 
await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
    agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))
 
await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")

Offline Mode

Run workflow tracing locally without an API connection:

tracer = WorkflowTracer(None, name="local-test", offline=True)
ctx = await tracer.start_workflow()  # No API calls, no crash

You can also omit the client entirely when you want local-only workflow tracing:

from evalgate_sdk import create_workflow_tracer

tracer = create_workflow_tracer(name="local-test")
ctx = await tracer.start_workflow()
assert ctx.trace_id is None

Batch Processing

batch_process(items, processor, concurrency=...) expects an async callable for processor and returns results in input order.

from evalgate_sdk import batch_process

async def double(value: int) -> int:
    return value * 2

results = await batch_process([1, 2, 3], double, concurrency=2)

If you pass a synchronous function, the SDK raises TypeError immediately instead of failing later with a generic await error.
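The behavior described above (bounded concurrency, results in input order, an immediate TypeError for synchronous processors) can be sketched with asyncio alone. bounded_map below is a hypothetical stand-in written for illustration; it is not the SDK's batch_process and may differ from it in detail.

```python
import asyncio
import inspect


async def bounded_map(items, processor, concurrency=2):
    """Apply async `processor` to each item with at most `concurrency` in flight."""
    if not inspect.iscoroutinefunction(processor):
        raise TypeError("processor must be an async callable")
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await processor(item)

    # asyncio.gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(guarded(item) for item in items))


async def double(value: int) -> int:
    await asyncio.sleep(0.01 * (4 - value))  # later items take less time
    return value * 2


results = asyncio.run(bounded_map([1, 2, 3], double, concurrency=2))
print(results)  # [2, 4, 6]
```

Even though later items complete sooner here, the output order follows the input order because gather preserves argument order.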

Snapshot Testing

Snapshots are stored in .snapshots by default, relative to the current working directory.

from evalgate_sdk import compare_with_snapshot, snapshot

snapshot("Hello there", "support-reply")
comparison = compare_with_snapshot("support-reply", "Hello there")
print(comparison.matches)

Override the directory when you want snapshots under a project-specific path:

snapshot("Hello there", "support-reply", directory=".evalgate/snapshots")

Add .snapshots/ to your .gitignore unless you intentionally want snapshot files committed.
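For readers who want a feel for what snapshot comparison involves, here is a small file-based sketch using only the standard library. The helper names and the one-JSON-file-per-snapshot layout are assumptions made for illustration; the SDK's actual storage format is not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(output: str, name: str, directory: str) -> Path:
    """Write `output` to <directory>/<name>.json and return the file path."""
    path = Path(directory) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"name": name, "output": output}), encoding="utf-8")
    return path


def matches_snapshot(name: str, output: str, directory: str) -> bool:
    """Return True if `output` equals the stored snapshot for `name`."""
    stored = json.loads((Path(directory) / f"{name}.json").read_text(encoding="utf-8"))
    return stored["output"] == output


with tempfile.TemporaryDirectory() as tmp:
    save_snapshot("Hello there", "support-reply", directory=tmp)
    same = matches_snapshot("support-reply", "Hello there", directory=tmp)
    changed = matches_snapshot("support-reply", "Hi there", directory=tmp)

print(same, changed)  # True False
```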

Regression Gates

Block deployments when eval scores drop:

from evalgate_sdk import evaluate_regression, to_pass_gate
 
report = evaluate_regression(current_results, baseline)
assert to_pass_gate(report), f"Regression detected: {report.summary}"
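The idea behind a regression gate can be shown without the SDK: compare current scores to a baseline and flag any metric that dropped by more than a tolerance. The function below is a hypothetical sketch; the tolerance parameter and the treatment of missing metrics are assumptions, and evaluate_regression may apply different rules.

```python
def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {metric: drop} for baseline metrics that fell by more than `tolerance`.

    A metric missing from `current` counts as 0.0 (an assumption of this sketch).
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline = {"chat-quality": 0.95, "safety": 0.99}
current = {"chat-quality": 0.90, "safety": 0.99}

regressions = find_regressions(current, baseline, tolerance=0.01)
print(regressions)  # {'chat-quality': 0.05}
```

A small tolerance keeps ordinary run-to-run noise in LLM scores from failing the build.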

Baseline Tamper Detection

from evalgate_sdk import compute_baseline_checksum, verify_baseline_checksum, Baseline
 
baseline = Baseline(scores={"chat-quality": 0.95, "safety": 0.99})
checksum = compute_baseline_checksum(baseline)
 
# Later — verify integrity before gating
assert verify_baseline_checksum(baseline, checksum), "Baseline tampered!"
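The changelog describes these helpers as SHA-256 tamper detection. A minimal sketch of that idea follows; the canonical JSON serialization (sorted keys, compact separators) and the function names are assumptions for illustration, so digests produced here should not be expected to match the SDK's.

```python
import hashlib
import hmac
import json


def checksum_scores(scores: dict[str, float]) -> str:
    """SHA-256 hex digest of the scores, serialized as JSON with sorted keys."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_scores(scores: dict[str, float], expected: str) -> bool:
    """Recompute the digest and compare it to `expected` in constant time."""
    return hmac.compare_digest(checksum_scores(scores), expected)


scores = {"chat-quality": 0.95, "safety": 0.99}
checksum = checksum_scores(scores)

reordered = {"safety": 0.99, "chat-quality": 0.95}   # same data, different key order
tampered = {"chat-quality": 0.99, "safety": 0.99}    # score edited after the fact

print(verify_scores(reordered, checksum), verify_scores(tampered, checksum))  # True False
```

Sorting keys before hashing means a harmless reordering of the baseline file does not trip the check, while any change to a score does.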

CLI

evalgate init                          # Scaffold eval config
evalgate discover                      # Find eval spec files
evalgate discover --manifest           # Generate stable manifest
evalgate run --write-results           # Run with artifact retention
evalgate gate                          # Regression gate
evalgate ci                            # Run + gate (CI mode)
evalgate ci --base main --format github # CI with PR summary
evalgate cluster --run .evalgate/runs/latest.json
evalgate label --run .evalgate/runs/latest.json
evalgate analyze --dataset .evalgate/golden/labeled.jsonl
evalgate synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
evalgate replay-decision --previous .evalgate/runs/run-prev.json --current .evalgate/runs/run-latest.json
evalgate auto run --objective "reduce hallucination" --baseline-run previous.json --candidate-run current.json
evalgate auto daemon --objective "reduce hallucination" --cycles 3
evalgate compare --base a.json --head b.json  # Side-by-side diff
evalgate doctor                        # Preflight checklist
evalgate explain                       # Root cause analysis on last failure
evalgate impact-analysis --base main   # Run only impacted specs

Exit Codes

Code	Meaning
0	Pass — no regression
1	Regression detected
2	Infra error (baseline missing, tests crashed)
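A CI wrapper script can branch on these exit codes. The sketch below maps each code from the table to a label. classify_exit and run_gate are hypothetical helpers, and the demo runs a stand-in Python command that exits with code 1 so the example works without the evalgate CLI installed.

```python
import subprocess
import sys

# Labels follow the exit-code table above.
EXIT_MEANINGS = {0: "pass", 1: "regression", 2: "infra-error"}


def classify_exit(code: int) -> str:
    """Translate a gate exit code into a short label."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_gate(command: list[str]) -> str:
    """Run `command` and classify its exit code without raising on failure."""
    completed = subprocess.run(command, check=False)
    return classify_exit(completed.returncode)


# Stand-in command; in CI this would be something like ["evalgate", "gate"].
status = run_gate([sys.executable, "-c", "import sys; sys.exit(1)"])
print(status)  # regression
```

Separating code 1 from code 2 lets a pipeline block the deploy on a real regression but retry or alert on an infrastructure failure.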

Data Export & Import

from evalgate_sdk import export_data, import_data, ExportOptions, export_to_file
 
# Export
data = await export_data(client, ExportOptions(format="json"))
export_to_file(data, "backup.json")
 
# Import (2-arg API — client is optional keyword arg)
from evalgate_sdk import import_from_file
data = import_from_file("backup.json")
result = await import_data(data, client=client)
 
# LangSmith migration
from evalgate_sdk import import_from_langsmith
data = import_from_langsmith(langsmith_export)

Reliability

Feature	Detail
Python	3.10, 3.11, 3.12, 3.13
Dependencies	Only `httpx` + `pydantic`
Async	Native `async/await` throughout; sync wrappers available
Type hints	Full `py.typed` — works with mypy and Pyright
Errors	Structured: `RateLimitError`, `AuthenticationError`, `NetworkError`, `ValidationError` — all have `.message`
Rate handling	Built-in `RateLimiter` with configurable tiers
Batching	`batch_process()` with concurrency control
Pagination	Async `PaginatedIterator` with cursor support
Timeouts	30s default on all HTTP clients and LLM assertion calls
Offline	`WorkflowTracer(offline=True)`, `LocalStorage` for file-based dev

API Reference

Module	Methods
`client.traces`	`create`, `list`, `get`, `update`, `delete`, `create_span`, `list_spans`
`client.evaluations`	`create`, `get`, `list`, `update`, `delete`, `create_test_case`, `list_test_cases`, `create_run`, `list_runs`, `get_run`
`client.llm_judge`	`evaluate`, `create_config`, `list_configs`, `list_results`, `get_alignment`
`client.annotations`	`create`, `list`, `tasks.create`, `tasks.list`, `tasks.get`, `tasks.items.create`, `tasks.items.list`
`client.developer`	`get_usage`, `get_usage_summary`, `api_keys.*`, `webhooks.*`

Release Notes

v3.2.x

Highlights

Full EvalGate loop: discover → cluster → label/analyze → synthesize → gate/auto
Golden dataset workflow: canonical labeled dataset, analysis summaries, synthetic case generation, and replay decision helpers
Guided optimization: evalgate auto run, evalgate auto daemon, and auto history/report helpers
CLI parity improvements: Python CLI covers clustering, labeling, analysis, synthesis, replay-decision, and bounded auto workflows
Tracing + workflow integrations: OpenAI, Anthropic, LangChain, CrewAI, and AutoGen remain first-class Python surfaces

Changelog

Correctness fixes (parity with TypeScript SDK):
- Assertion return types: sync helpers now normalize to AssertionResult, including contains_keywords, has_sentiment, has_readability_score, similar_to, contains_json, has_no_toxicity, matches_schema, has_valid_code_syntax, follows_instructions, and contains_all_required_fields
- Toxicity blocklist: expanded from 9 → 95 terms across 8 categories; uses \b word-boundary regex (no substring false positives)
- has_valid_code_syntax: Python uses ast.parse (real syntax validation); other languages use structural regex
- has_factual_accuracy: entity-aware word-overlap check instead of raw substring matching
- Expectation parity: expect(...).not_ now inverts fluent assertions and to_match_json() accepts JSON strings or embedded JSON snippets
- Batch compatibility: run_assertions() now coerces legacy boolean and mapping results into AssertionResult
- has_sentiment_with_score: confidence gradient scales with margin × magnitude; single-word inputs no longer return 1.0
- WorkflowTracer: accepts name and offline kwargs; offline mode skips all API calls
- import_data: 2-arg (data, options) signature matching TypeScript; client is keyword-only
- Logger.child: uses : separator matching TypeScript (was .)
- define_eval.skip / .only: attached as methods on define_eval
- ValidationError.message: .message property on all error classes
- AssertionLLMConfig.timeout_ms: 30s default, enforced via asyncio.wait_for
- compute_baseline_checksum / verify_baseline_checksum: SHA-256 tamper detection
- TestSuiteConfig: added retries, retry_delay_ms, retry_jitter, seed, strict, stop_on_failure
- to_have_no_profanity: new method on Expectation matching TypeScript toHaveNoProfanity
- RequestCache: removed from public exports (internal only)
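The toxicity fix above mentions \b word-boundary matching to avoid substring false positives. The difference is easy to see in a short sketch. The two blocklist terms are neutral placeholders chosen for illustration and the helper is hypothetical; the SDK's real blocklist and matching code are not reproduced here.

```python
import re

# Placeholder terms for illustration only.
BLOCKLIST = ["scam", "spam"]

# \b anchors each term to word boundaries so it cannot match inside a longer word.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)


def flagged_terms(text: str) -> list[str]:
    """Return blocklisted terms that appear in `text` as whole words."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


print("scam" in "We ordered scampi.")               # True: naive substring false positive
print(flagged_terms("We ordered scampi."))          # []
print(flagged_terms("This is a SCAM and spam."))    # ['scam', 'spam']
```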
Production hardening:
- 30s default timeout on all httpx.AsyncClient calls
- API key validation before sending requests
- URL-encoded query params in fetch_quality_latest
- Graceful error handling in report_trace and OTel exporter (no more crashes on network errors)
- run_report correctly sets success=False on test failures
- GitHub Actions formatter uses GITHUB_OUTPUT (deprecated ::set-output removed)
- Config parse errors logged as warnings instead of silently swallowed
- save_trace / save_evaluation no longer mutate caller's dict
- Subprocess timeout handling in regression gate

507 tests passing.

Examples

See examples/python/:

OpenAI Eval — Trace and evaluate OpenAI chat completions
RAG Eval — Evaluate retrieval-augmented generation pipelines
Agent Eval — Test and trace multi-agent workflows

No Lock-in

rm .evalgate/config.json

Your local assertions keep working. No account cancellation. No data export required.

Links

Platform · GitHub · TypeScript SDK

License

MIT

Release history

Version	Released
3.3.0 (this version)	Mar 23, 2026
3.2.7	Mar 22, 2026
3.2.6	Mar 20, 2026
3.2.5	Mar 20, 2026
3.0.1	Mar 6, 2026
3.0.0	Mar 5, 2026
2.2.2	Mar 3, 2026
2.2.1	Mar 3, 2026
2.2.0	Mar 3, 2026
2.1.3	Mar 3, 2026
2.1.2	Mar 2, 2026
2.1.1	Mar 2, 2026
2.1.0	Mar 3, 2026
2.0.0	Mar 2, 2026

Download files

Source Distribution

pauly4010_evalgate_sdk-3.3.0.tar.gz (461.1 kB)

Uploaded Mar 23, 2026 Source

Built Distribution

pauly4010_evalgate_sdk-3.3.0-py3-none-any.whl (150.6 kB)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file pauly4010_evalgate_sdk-3.3.0.tar.gz.

File metadata

Download URL: pauly4010_evalgate_sdk-3.3.0.tar.gz
Upload date: Mar 23, 2026
Size: 461.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for pauly4010_evalgate_sdk-3.3.0.tar.gz
Algorithm	Hash digest
SHA256	`69f79622a75ad8a6fc0f25ebc0007282ec221823c49900f6921dd4e0c44598e8`
MD5	`3277ca67fafcf29c8908c945ea0403ea`
BLAKE2b-256	`355b7e729618e91f7bfb3b0df4515eb0fbebb452924a8b318adad28d81d806f5`

File details

Details for the file pauly4010_evalgate_sdk-3.3.0-py3-none-any.whl.

File metadata

Download URL: pauly4010_evalgate_sdk-3.3.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 150.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for pauly4010_evalgate_sdk-3.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7529f166a54b79d334e931c60b2cc3dd05f642e8cd709823b2c4fde2b29ec0c5`
MD5	`ff92a6cccff16a3316f7bf4e149d89a8`
BLAKE2b-256	`ce657a2e30f1ad8c673f754d77fc1dd32463cd2d93fa571c96e74108b2cc0e28`
