AI Evaluation Platform SDK — traces, evaluations, assertions, and workflow tracing for LLM apps

This project has been archived by its maintainers. No new releases are expected.

pauly4010-evalai-sdk

Evaluation infrastructure for AI systems. Trace, test, and judge every LLM call — in five lines of Python.

Quickstart (30 seconds)

pip install pauly4010-evalai-sdk

from evalai_sdk import expect

result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)  # True

That's it. No API key needed for local assertions. When you're ready to send traces to the platform:

import asyncio
from evalai_sdk import AIEvalClient, CreateTraceParams

async def main():
    client = AIEvalClient(api_key="sk-...")
    trace = await client.traces.create(CreateTraceParams(name="chat-quality"))

asyncio.run(main())  # `await` needs a running event loop

Why EvalAI?

| What you get | How it works |
| --- | --- |
| 20+ assertions | expect(output).to_contain("Paris"), .to_not_contain_pii(), .to_have_sentiment("positive") |
| Test suites | Define cases, run them, get pass/fail + scores |
| Workflow tracing | Track multi-agent handoffs, decisions, and costs |
| OpenAI / Anthropic | Drop-in tracing wrappers, one line to instrument |
| Regression gates | Block deploys when eval scores drop |
| Snapshot testing | Save and compare outputs over time |
| CLI | evalai run, evalai gate, evalai ci |

Install

pip install pauly4010-evalai-sdk                        # Core
pip install "pauly4010-evalai-sdk[openai]"              # + OpenAI tracing
pip install "pauly4010-evalai-sdk[anthropic]"           # + Anthropic tracing
pip install "pauly4010-evalai-sdk[all]"                 # Everything

Assertions

20+ built-in checks for LLM output quality, safety, and structure:

from evalai_sdk import expect

# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()

# Sentiment & similarity
expect("Great product!").to_have_sentiment("positive")

# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)

# Safety
expect("Clean response here").to_not_contain_pii()

Standalone functions work too:

from evalai_sdk import contains_keywords, has_no_toxicity, matches_pattern

assert contains_keywords("quick brown fox", ["quick", "fox"])
assert has_no_toxicity("Thank you for your help.")
assert matches_pattern("abc-123", r"\w+-\d+")
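
The `expect(...)` calls above return a result object with a `passed` flag instead of raising on failure. To illustrate that style, here is a minimal standalone sketch using only the standard library. It is not the SDK's implementation: `MiniExpect` and `AssertionResult` are hypothetical names invented for this example, and the real checks may behave differently.

```python
import json
from dataclasses import dataclass


@dataclass
class AssertionResult:
    """Outcome of one check: a pass/fail flag plus a human-readable message."""
    passed: bool
    message: str


class MiniExpect:
    """Hypothetical stand-in for a fluent `expect(value)` wrapper (not the SDK class)."""

    def __init__(self, value):
        self.value = value

    def to_contain(self, substring: str) -> AssertionResult:
        # Plain substring test; returns a result instead of raising.
        ok = substring in self.value
        return AssertionResult(ok, f"expected text to contain {substring!r}")

    def to_be_valid_json(self) -> AssertionResult:
        try:
            json.loads(self.value)
        except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
            return AssertionResult(False, "value is not valid JSON")
        return AssertionResult(True, "value is valid JSON")


result = MiniExpect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)
```

Returning a result object lets a caller collect many outcomes and score them together, which is harder to do when each check raises on the first failure.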

Test Suites

from evalai_sdk import create_test_suite
from evalai_sdk.types import TestSuiteCase, TestSuiteConfig

# my_llm_function is a placeholder: supply your own function that calls the model under test.
suite = create_test_suite("safety-checks", TestSuiteConfig(
    evaluator=my_llm_function,
    test_cases=[
        TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
        TestSuiteCase(name="pii-check", input="Describe yourself",
                      assertions=[{"type": "not_contains_pii"}]),
    ],
))

result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
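
The suite above passes `my_llm_function` as the evaluator without defining it. The sketch below shows one possible offline stand-in. It assumes the evaluator is an async callable that takes the case input as a string and returns the model output as a string; that signature is an assumption, not something this README states, so check the SDK's types before relying on it.

```python
import asyncio

# Canned replies so the function can be exercised without calling a real model.
CANNED_REPLIES = {"Hello": "Hi there!"}


async def my_llm_function(prompt: str) -> str:
    """Hypothetical evaluator stub. ASSUMPTION: str in, str out, awaited by the caller."""
    return CANNED_REPLIES.get(prompt, "I am an AI assistant.")


reply = asyncio.run(my_llm_function("Hello"))
print(reply)
```

In a real project the body would call your model or application code; the canned dictionary only keeps the example deterministic.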

OpenAI Integration

One line to trace every OpenAI call:

from openai import AsyncOpenAI
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.openai import trace_openai

traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain gravity"}]
)
# ^ Automatically traced with latency, tokens, and output

Or evaluate a batch of prompts with built-in assertions:

from evalai_sdk import openai_chat_eval, OpenAIChatEvalCase

result = await openai_chat_eval(
    name="chat-quality",
    model="gpt-4",
    cases=[
        OpenAIChatEvalCase(
            input="Explain gravity in one sentence.",
            assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
        ),
    ],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")

Anthropic Integration

from anthropic import AsyncAnthropic
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.anthropic import trace_anthropic

traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain gravity"}]
)
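
The OpenAI and Anthropic wrappers above are described as recording latency, tokens, and output for each call. As a rough illustration of the wrapping idea only, the sketch below times an async call and appends a record to a list. It does not reflect how `trace_openai` or `trace_anthropic` are actually implemented; `traced`, `fake_completion`, and the record fields are hypothetical, and token counting is omitted.

```python
import asyncio
import time
from typing import Any, Dict, List


def traced(call, sink: List[Dict[str, Any]]):
    """Wrap an async callable so each invocation appends a record to `sink`."""

    async def wrapper(*args, **kwargs):
        started = time.perf_counter()
        output = await call(*args, **kwargs)
        sink.append({
            "latency_ms": (time.perf_counter() - started) * 1000,
            "input": kwargs.get("messages"),
            "output": output,
        })
        return output

    return wrapper


async def fake_completion(*, model: str, messages):
    """Hypothetical stand-in for a model client call; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"[{model}] reply to: {messages[-1]['content']}"


records: List[Dict[str, Any]] = []
traced_completion = traced(fake_completion, records)
reply = asyncio.run(traced_completion(
    model="demo-model",
    messages=[{"role": "user", "content": "Explain gravity"}],
))
print(reply)
print(len(records), records[0]["latency_ms"] > 0)
```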

Workflow Tracing

Track multi-agent systems with handoffs, decisions, and cost:

from evalai_sdk import AIEvalClient, WorkflowTracer
from evalai_sdk.types import HandoffType, CostCategory, RecordCostParams

client = AIEvalClient.init()
tracer = WorkflowTracer(client)

await tracer.start_workflow("research-pipeline")
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})

await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
    agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))

await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")

Regression Gates

Block deployments when eval scores drop:

from evalai_sdk import evaluate_regression, to_pass_gate

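# current_results and baseline are placeholders for your latest eval results and a stored baseline.
report = evaluate_regression(current_results, baseline)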
assert to_pass_gate(report), f"Regression detected: {report.summary}"
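
Conceptually, a regression gate compares current scores against a stored baseline and fails when any metric drops by more than an allowed margin. The standalone sketch below shows that comparison under simple assumptions: scores are floats keyed by metric name, and a fixed `tolerance` applies to every metric. `RegressionReport` and `compare_to_baseline` are hypothetical names; this is not the logic behind `evaluate_regression` or `to_pass_gate`.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RegressionReport:
    # metric name -> (baseline score, current score) for each metric that dropped
    regressions: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return not self.regressions


def compare_to_baseline(current: Dict[str, float],
                        baseline: Dict[str, float],
                        tolerance: float = 0.02) -> RegressionReport:
    """Flag every baseline metric whose current score fell by more than `tolerance`."""
    report = RegressionReport()
    for metric, base_score in baseline.items():
        score = current.get(metric, 0.0)  # a missing metric is treated as a score of zero
        if score < base_score - tolerance:
            report.regressions[metric] = (base_score, score)
    return report


baseline = {"accuracy": 0.90, "safety": 1.00}
current = {"accuracy": 0.91, "safety": 0.95}
report = compare_to_baseline(current, baseline)
print(report.passed, report.regressions)
```

In this example the `safety` score falls from 1.00 to 0.95, which exceeds the 0.02 tolerance, so the gate fails even though `accuracy` improved.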

CLI

evalai init                    # Scaffold eval config
evalai run --dir ./evals       # Run all evaluations
evalai gate --baseline b.json  # Regression gate
evalai ci                      # Run + gate (CI mode)
evalai doctor                  # Check setup
evalai discover                # Find eval files

Reliability

| Feature | Detail |
| --- | --- |
| Python | 3.9, 3.10, 3.11, 3.12, 3.13 |
| Dependencies | Only httpx + pydantic (2 packages) |
| Async | Native async/await throughout, sync wrappers available |
| Type hints | Full py.typed; works with mypy and Pyright |
| Errors | Structured errors: RateLimitError, AuthenticationError, NetworkError, ValidationError |
| Rate handling | Built-in RateLimiter with configurable tiers |
| Caching | RequestCache with TTL and LRU eviction |
| Batching | batch_process() with concurrency control |
| Pagination | Async PaginatedIterator with cursor support |
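
The batching entry above mentions concurrency control. One common way to bound concurrency in asyncio code is a semaphore around each task, sketched below with the standard library only. This is an illustration of the general technique, not the SDK's `batch_process()`; `bounded_batch` and `fake_score` are hypothetical names, and the SDK's actual signature may differ.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_batch(items: Sequence[T],
                        worker: Callable[[T], Awaitable[R]],
                        concurrency: int = 5) -> List[R]:
    """Run `worker` over `items` with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:  # blocks while `concurrency` workers are already running
            return await worker(item)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(run_one(item) for item in items)))


async def fake_score(text: str) -> int:
    """Hypothetical stand-in for an API call; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return len(text)


results = asyncio.run(bounded_batch(["a", "bb", "ccc"], fake_score, concurrency=2))
print(results)  # [1, 2, 3]
```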

API Reference

| Module | Methods |
| --- | --- |
| client.traces | create, list, get, update, delete, create_span, list_spans |
| client.evaluations | create, get, list, update, delete, create_test_case, list_test_cases, create_run, list_runs, get_run |
| client.llm_judge | evaluate, create_config, list_configs, list_results, get_alignment |
| client.annotations | create, list, tasks.create, tasks.list, tasks.get, tasks.items.create, tasks.items.list |
| client.developer | get_usage, get_usage_summary, api_keys.*, webhooks.* |

Examples

See the examples/python/ directory for runnable scripts and Jupyter notebooks:

  • OpenAI Eval — Trace and evaluate OpenAI chat completions
  • RAG Eval — Evaluate retrieval-augmented generation pipelines
  • Agent Eval — Test and trace multi-agent workflows

License

MIT
