AI Evaluation Platform SDK — traces, evaluations, assertions, and workflow tracing for LLM apps
The maintainers of this project have marked this project as archived. No new releases are expected.
pauly4010-evalai-sdk
Evaluation infrastructure for AI systems. Trace, test, and judge every LLM call — in five lines of Python.
Quickstart (30 seconds)
pip install pauly4010-evalai-sdk
from evalai_sdk import expect
result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed) # True
That's it. No API key needed for local assertions. When you're ready to send traces to the platform:
from evalai_sdk import AIEvalClient, CreateTraceParams
client = AIEvalClient(api_key="sk-...")
trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
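The `await` in the snippet above, and in the later examples, only works inside an async context such as a notebook or a coroutine. A minimal sketch of running such a call from a plain script follows; `create_trace` is a hypothetical stand-in for the SDK call so the example runs without an API key, and it is not part of the SDK.

```python
import asyncio


async def create_trace(name: str) -> dict:
    """Hypothetical stand-in for an awaited SDK call such as client.traces.create."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return {"name": name, "status": "created"}


async def main() -> dict:
    # Any `await` from the README examples would go inside a coroutine like this.
    return await create_trace("chat-quality")


# asyncio.run starts an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result)
```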
Why EvalAI?
| What you get | How it works |
|---|---|
| 20+ assertions | expect(output).to_contain("Paris"), .to_not_contain_pii(), .to_have_sentiment("positive") |
| Test suites | Define cases, run them, get pass/fail + scores |
| Workflow tracing | Track multi-agent handoffs, decisions, and costs |
| OpenAI / Anthropic | Drop-in tracing wrappers — one line to instrument |
| Regression gates | Block deploys when eval scores drop |
| Snapshot testing | Save and compare outputs over time |
| CLI | evalai run, evalai gate, evalai ci |
Install
pip install pauly4010-evalai-sdk # Core
pip install "pauly4010-evalai-sdk[openai]" # + OpenAI tracing
pip install "pauly4010-evalai-sdk[anthropic]" # + Anthropic tracing
pip install "pauly4010-evalai-sdk[all]" # Everything
Assertions
20+ built-in checks for LLM output quality, safety, and structure:
from evalai_sdk import expect
# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()
# Sentiment
expect("Great product!").to_have_sentiment("positive")
# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)
# Safety
expect("Clean response here").to_not_contain_pii()
Standalone functions work too:
from evalai_sdk import contains_keywords, has_no_toxicity, matches_pattern
assert contains_keywords("quick brown fox", ["quick", "fox"])
assert has_no_toxicity("Thank you for your help.")
assert matches_pattern("abc-123", r"\w+-\d+")
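To make the fluent style above concrete, here is a minimal sketch of how an `expect`-like object can return a result with a `passed` flag. `MiniExpect`, `CheckResult`, and `to_match` are illustrative names invented for this sketch; this is not the SDK's implementation, and the real checks may behave differently.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    message: str


class MiniExpect:
    """Illustrative stand-in for a fluent assertion object (not the SDK's `expect`)."""

    def __init__(self, value):
        self.value = value

    def to_contain(self, needle: str) -> CheckResult:
        ok = needle in str(self.value)
        return CheckResult(ok, f"expected output to contain {needle!r}")

    def to_be_valid_json(self) -> CheckResult:
        try:
            json.loads(self.value)
            return CheckResult(True, "valid JSON")
        except (TypeError, ValueError) as exc:
            # json.JSONDecodeError is a subclass of ValueError
            return CheckResult(False, f"invalid JSON: {exc}")

    def to_match(self, pattern: str) -> CheckResult:
        ok = re.search(pattern, str(self.value)) is not None
        return CheckResult(ok, f"expected output to match {pattern!r}")


print(MiniExpect("The capital of France is Paris.").to_contain("Paris").passed)
print(MiniExpect("not json").to_be_valid_json().passed)
```

Returning a result object instead of raising lets a caller collect every failure in a run before deciding whether to fail the build.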
Test Suites
from evalai_sdk import create_test_suite
from evalai_sdk.types import TestSuiteCase, TestSuiteConfig
suite = create_test_suite("safety-checks", TestSuiteConfig(
    evaluator=my_llm_function,
    test_cases=[
        TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
        TestSuiteCase(name="pii-check", input="Describe yourself",
                      assertions=[{"type": "not_contains_pii"}]),
    ],
))
result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
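The example above leaves `my_llm_function` undefined. As a rough sketch of the idea, the stdlib-only code below runs an async evaluator over a list of cases and counts passes. `Case`, `SuiteResult`, `run_suite`, and `echo_model` are hypothetical names for illustration, and the evaluator signature (prompt in, text out) is an assumption rather than the SDK's documented contract.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


@dataclass
class Case:
    name: str
    input: str
    expected_output: Optional[str] = None


@dataclass
class SuiteResult:
    passed_count: int
    total: int


async def echo_model(prompt: str) -> str:
    """Stub evaluator; a real one would call an LLM."""
    return "Hi there!" if prompt == "Hello" else "I am an assistant."


async def run_suite(evaluator: Callable[[str], Awaitable[str]],
                    cases: List[Case]) -> SuiteResult:
    passed = 0
    for case in cases:
        output = await evaluator(case.input)
        # A case with no expected output passes as long as the evaluator returns.
        if case.expected_output is None or output == case.expected_output:
            passed += 1
    return SuiteResult(passed, len(cases))


cases = [
    Case("greeting", "Hello", "Hi there!"),
    Case("wrong-answer", "Describe yourself", "I am a robot."),
]
result = asyncio.run(run_suite(echo_model, cases))
print(f"{result.passed_count}/{result.total} passed")  # 1/2 passed
```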
OpenAI Integration
One line to trace every OpenAI call:
from openai import AsyncOpenAI
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.openai import trace_openai
traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain gravity"}]
)
# ^ Automatically traced with latency, tokens, and output
Or evaluate a batch of prompts with built-in assertions:
from evalai_sdk import openai_chat_eval, OpenAIChatEvalCase
result = await openai_chat_eval(
    name="chat-quality",
    model="gpt-4",
    cases=[
        OpenAIChatEvalCase(
            input="Explain gravity in one sentence.",
            assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
        ),
    ],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")
Anthropic Integration
from anthropic import AsyncAnthropic
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.anthropic import trace_anthropic
traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain gravity"}]
)
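Both integrations above wrap a client so that each call is recorded. The general pattern can be sketched with the standard library alone: wrap an async callable, time it, and append a record to a sink. `traced`, `fake_completion`, and the record fields are hypothetical; the SDK's wrappers may capture different data and send it elsewhere.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, List


def traced(fn: Callable[..., Awaitable[Any]], sink: List[Dict[str, Any]]):
    """Wrap an async callable and append one record per call to `sink`."""

    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            result = await fn(*args, **kwargs)
            sink.append({"ok": True, "latency_s": time.perf_counter() - start,
                         "output": result})
            return result
        except Exception as exc:
            # Record the failure, then re-raise so callers see the original error.
            sink.append({"ok": False, "latency_s": time.perf_counter() - start,
                         "error": repr(exc)})
            raise

    return wrapper


async def fake_completion(prompt: str) -> str:
    """Stand-in for a provider call."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


records: List[Dict[str, Any]] = []
reply = asyncio.run(traced(fake_completion, records)("Explain gravity"))
print(reply, records[0]["ok"])
```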
Workflow Tracing
Track multi-agent systems with handoffs, decisions, and cost:
from evalai_sdk import AIEvalClient, WorkflowTracer
from evalai_sdk.types import HandoffType, CostCategory, RecordCostParams
client = AIEvalClient.init()
tracer = WorkflowTracer(client)
await tracer.start_workflow("research-pipeline")
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})
await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
    agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))
await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")
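For intuition about what cost recording adds up to, here is a small stdlib sketch of a ledger that totals costs overall and per agent. `CostRecord` and `CostLedger` are hypothetical names, the `amount` unit is assumed to be US dollars, and nothing here reflects how `WorkflowTracer` stores data internally.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CostRecord:
    agent_name: str
    category: str
    amount: float  # assumed to be in US dollars
    tokens: int


class CostLedger:
    def __init__(self) -> None:
        self.records: List[CostRecord] = []

    def record(self, rec: CostRecord) -> None:
        self.records.append(rec)

    def total(self) -> float:
        return sum(r.amount for r in self.records)

    def by_agent(self) -> Dict[str, float]:
        totals: Dict[str, float] = defaultdict(float)
        for r in self.records:
            totals[r.agent_name] += r.amount
        return dict(totals)


ledger = CostLedger()
ledger.record(CostRecord("researcher", "llm_input", 0.05, 1500))
ledger.record(CostRecord("writer", "llm_output", 0.25, 800))
print(f"Total cost: ${ledger.total():.2f}")  # Total cost: $0.30
```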
Regression Gates
Block deployments when eval scores drop:
from evalai_sdk import evaluate_regression, to_pass_gate
report = evaluate_regression(current_results, baseline)
assert to_pass_gate(report), f"Regression detected: {report.summary}"
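The snippet above does not show what `current_results` and `baseline` contain. One plausible shape is a mapping from metric name to score; under that assumption, a gate can be sketched as a comparison with a tolerance. `compare_scores`, `passes_gate`, `RegressionReport`, and the 0.02 tolerance are illustrative choices, not the SDK's behavior.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegressionReport:
    regressions: List[str]

    @property
    def summary(self) -> str:
        if not self.regressions:
            return "no regressions"
        return "regressed: " + ", ".join(self.regressions)


def compare_scores(current: Dict[str, float], baseline: Dict[str, float],
                   tolerance: float = 0.02) -> RegressionReport:
    regressed = [
        name
        for name, base in sorted(baseline.items())
        # A metric missing from the current run counts as a regression.
        if current.get(name, float("-inf")) < base - tolerance
    ]
    return RegressionReport(regressed)


def passes_gate(report: RegressionReport) -> bool:
    return not report.regressions


baseline = {"accuracy": 0.90, "safety": 0.99}
current = {"accuracy": 0.91, "safety": 0.95}
report = compare_scores(current, baseline)
print(passes_gate(report), report.summary)  # False regressed: safety
```

The tolerance keeps small run-to-run noise in LLM scores from blocking a deploy; set it to 0 for a strict gate.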
CLI
evalai init # Scaffold eval config
evalai run --dir ./evals # Run all evaluations
evalai gate --baseline b.json # Regression gate
evalai ci # Run + gate (CI mode)
evalai doctor # Check setup
evalai discover # Find eval files
Reliability
| Feature | Detail |
|---|---|
| Python | 3.9, 3.10, 3.11, 3.12, 3.13 |
| Dependencies | Only httpx + pydantic (2 packages) |
| Async | Native async/await throughout, sync wrappers available |
| Type hints | Full py.typed — works with mypy and Pyright |
| Errors | Structured errors: RateLimitError, AuthenticationError, NetworkError, ValidationError |
| Rate handling | Built-in RateLimiter with configurable tiers |
| Caching | RequestCache with TTL and LRU eviction |
| Batching | batch_process() with concurrency control |
| Pagination | Async PaginatedIterator with cursor support |
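As an illustration of the "batching with concurrency control" row, the sketch below caps the number of in-flight coroutines with `asyncio.Semaphore`. `bounded_map` and `fake_eval` are hypothetical names; the signature and behavior of the SDK's `batch_process()` are not shown in this README and may differ.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(items: List[T], worker: Callable[[T], Awaitable[R]],
                      concurrency: int = 4) -> List[R]:
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:  # at most `concurrency` workers run at once
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_eval(prompt: str) -> int:
    """Stand-in for a per-item evaluation call."""
    await asyncio.sleep(0.01)
    return len(prompt)


lengths = asyncio.run(bounded_map(["a", "bb", "ccc"], fake_eval, concurrency=2))
print(lengths)  # [1, 2, 3]
```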
API Reference
| Module | Methods |
|---|---|
| client.traces | create, list, get, update, delete, create_span, list_spans |
| client.evaluations | create, get, list, update, delete, create_test_case, list_test_cases, create_run, list_runs, get_run |
| client.llm_judge | evaluate, create_config, list_configs, list_results, get_alignment |
| client.annotations | create, list, tasks.create, tasks.list, tasks.get, tasks.items.create, tasks.items.list |
| client.developer | get_usage, get_usage_summary, api_keys.*, webhooks.* |
Examples
See the examples/python/ directory for runnable scripts and Jupyter notebooks:
- OpenAI Eval — Trace and evaluate OpenAI chat completions
- RAG Eval — Evaluate retrieval-augmented generation pipelines
- Agent Eval — Test and trace multi-agent workflows
License
MIT
File details
Details for the file pauly4010_evalai_sdk-1.9.0.tar.gz.
File metadata
- Download URL: pauly4010_evalai_sdk-1.9.0.tar.gz
- Upload date:
- Size: 56.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fb13d89e66ae1e94576c4f48244e6aa701addf48714ab4de6b861520833469e1 |
| MD5 | 6ca8e873379996fc967162fae1b1c7c5 |
| BLAKE2b-256 | 951118f9fb6a684273b16513ba4847b71e21733fc9e2b431dea6a3fe836da3f3 |
File details
Details for the file pauly4010_evalai_sdk-1.9.0-py3-none-any.whl.
File metadata
- Download URL: pauly4010_evalai_sdk-1.9.0-py3-none-any.whl
- Upload date:
- Size: 59.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a641102441575216d8270ad5fc32834624bf0355fcc911f2757814f66614d2e4 |
| MD5 | 649fe38844453cddfc11f86c6cba6868 |
| BLAKE2b-256 | 22002e29c840f325fb2662d118f5df42785a78b555f8e96a33d21f98527b6541 |