Official Python SDK for the TrustModel AI evaluation platform
Website • Documentation • Dashboard
Evaluate AI models for safety, bias, and performance with a simple, intuitive interface.
Table of Contents
- Features
- Installation
- Prerequisites
- Quick Start
- Authentication
- Evaluation Modes
- Core Concepts
- Error Handling
- Rate Limiting
- Webhook Notifications
- Advanced Usage
- Framework Integration
- Zero-Config Auto-Capture (auto_init)
- Agentic Trace Evaluation
- Galileo Integration
- Requirements
- Support
- License
Features
- 🚀 Simple Interface: Easy-to-use client for all TrustModel operations
- 🔒 Secure: API key authentication with built-in validation
- 🎯 Type Safe: Full type hints for excellent IDE support
- 🔄 Reliable: Automatic retries and comprehensive error handling
- 📊 Comprehensive: Support for all evaluation types and configurations
- 🌍 Framework Agnostic: Works with any Python framework or standalone scripts
- 🛰️ Zero-Config Auto-Capture: Add two lines (auto_init) and every OpenAI / Anthropic / LangChain / CrewAI / etc. call is automatically traced and evaluated
Installation
Core SDK (works on Python 3.7+):
pip install trustmodel
Add the telemetry extra to enable zero-config auto-capture of AI agent calls (Python 3.10+ required):
pip install "trustmodel[telemetry]"
The telemetry extra installs OpenTelemetry plus all OpenInference instrumentors (OpenAI, Anthropic, LangChain, LlamaIndex, Bedrock, Mistral, Groq, CrewAI, VertexAI, DSPy). See Zero-Config Auto-Capture below.
Prerequisites
Before using the SDK, you must complete the following setup in the TrustModel Dashboard:
1. Create an API Key (Required)
You need a TrustModel API key to authenticate all SDK requests:
- Go to Keys & Webhooks in the dashboard
- Click "Create API Key"
- Copy your new API key (starts with tm-)
- Store it securely - you won't be able to see it again
2. Configure Webhooks (Required)
To receive notifications when evaluations complete or fail, you must configure webhooks:
- Go to Keys & Webhooks in the dashboard
- Click "Create Webhook"
- Enter your webhook endpoint URL
- Select the events you want to receive
- Save your webhook configuration
Important: Without configuring both an API key and webhooks in the webapp, you cannot run evaluations. The API will return an error if these are not set up.
Quick Start
import trustmodel
# Initialize the client
client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")
# List available models
models, api_sources = client.models.list()
print(f"Found {len(models)} models available")
# Create an evaluation
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
categories=["safety", "bias", "performance"]
)
print(f"Evaluation created with ID: {evaluation.id}")
print(f"Status: {evaluation.status}")
# You'll receive a webhook notification when the evaluation completes
# Then retrieve the results:
completed_evaluation = client.evaluations.get(evaluation.id)
print(f"Overall score: {completed_evaluation.overall_score}")
# Check your credit balance
credits = client.credits.get_balance()
print(f"Credits remaining: {credits.credits_remaining}")
Authentication
Get your API key from the TrustModel Dashboard and use it to initialize the client:
import trustmodel
client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")
For production applications, store your API key securely using environment variables:
import os
import trustmodel
api_key = os.getenv("TRUSTMODEL_API_KEY")
client = trustmodel.TrustModelClient(api_key=api_key)
Evaluation Modes
TrustModel supports three ways to evaluate AI models:
| Mode | Use Case | API Key Required |
|---|---|---|
| Platform Key | Quick evaluations using TrustModel's API keys | No (uses TrustModel's keys) |
| BYOK | Use your own vendor API key for any model | Yes (your vendor API key) |
| Custom Endpoint | Evaluate private/self-hosted models | Yes (your endpoint's API key) |
Getting Available Vendors
Use client.config.get().vendors to discover available vendors:
config = client.config.get()
# Public vendors - for Platform Key and BYOK evaluations
public_vendors = config.vendors["public"]
for vendor in public_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")
# Custom vendors - for Custom Endpoint evaluations only
custom_vendors = config.vendors["custom"]
for vendor in custom_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")
| Vendor Type | Use With | Description |
|---|---|---|
| public | Platform Key, BYOK | Vendors like OpenAI, Anthropic, Google AI for standard evaluations |
| custom | Custom Endpoint | Validators for self-hosted/private endpoints (OpenAI-compatible, Hugging Face, Azure AI, etc.) |
Getting Available Models
Use client.models.list() to discover available models:
# Get all available models and API source info
models, api_sources = client.models.list()
# List all models with their details
for model in models:
    print(f"Model: {model.name}")
    print(f" Identifier: {model.model_identifier}")
    print(f" Vendor: {model.vendor_identifier}")
    print(f" Platform Key Available: {model.available_via_trust_model_key}")
    print(f" BYOK Available: {model.available_via_byok}")
# Filter models by vendor
openai_models = [m for m in models if m.vendor_identifier == "openai"]
# Filter models available via platform key (no vendor API key needed)
platform_key_models = [m for m in models if m.available_via_trust_model_key]
# Use a model in evaluation
model = models[0]
evaluation = client.evaluations.create(
    model_identifier=model.model_identifier,
    vendor_identifier=model.vendor_identifier,
    categories=["safety", "bias"]
)
| Model Field | Type | Description |
|---|---|---|
| name | str | Human-readable model name |
| model_identifier | str | Identifier to use in API calls |
| vendor_identifier | str | Vendor identifier |
| available_via_trust_model_key | bool | Can evaluate without vendor API key |
| available_via_byok | bool | Previously used with your own API key |
Platform Key (Default)
Use TrustModel's platform keys for quick evaluations. No vendor API key needed:
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
categories=["safety", "bias"]
)
Note: Platform key availability varies by model. Check model.available_via_trust_model_key to see if a model supports this mode.
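For example, a small sketch (using only the model fields shown above) that checks the flag and falls back when no platform-key model is available:

# Pick a model that supports platform-key evaluation before creating the run
models, _ = client.models.list()
model = next((m for m in models if m.available_via_trust_model_key), None)
if model is not None:
    evaluation = client.evaluations.create(
        model_identifier=model.model_identifier,
        vendor_identifier=model.vendor_identifier,
        categories=["safety", "bias"],
    )
else:
    print("No platform-key models available - use BYOK instead")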
BYOK (Bring Your Own Key)
Use your own vendor API key to evaluate any model. All vendors support BYOK:
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
api_key="sk-your-openai-key", # Your OpenAI API key
categories=["safety", "bias"]
)
How it works:
- You provide your vendor API key (e.g., OpenAI, Anthropic, Google)
- TrustModel validates the key before creating the evaluation
- If validation fails, a ConnectionValidationError is raised with details
- Your key is securely stored and used for the evaluation
Getting vendor API keys:
- OpenAI: platform.openai.com/api-keys
- Anthropic: console.anthropic.com/settings/keys
- Google AI: aistudio.google.com/apikey
Example with error handling:
from trustmodel import ConnectionValidationError, InsufficientCreditsError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key",
        categories=["safety", "bias"]
    )
    print(f"Evaluation created: {evaluation.id}")
except ConnectionValidationError as e:
    # API key validation failed
    print(f"Invalid API key: {e.message}")
    if e.validation_details:
        print(f"Details: {e.validation_details}")
except InsufficientCreditsError as e:
    print(f"Need more credits: {e.credits_required} required")
Custom Endpoint
Evaluate your own OpenAI-compatible API endpoint (Ollama, vLLM, LiteLLM, Azure AI, etc.):
# Create evaluation for a custom endpoint
evaluation = client.evaluations.create_custom_endpoint(
api_endpoint="https://api.yourcompany.com/v1",
api_key="your-api-key",
model_identifier="your-model-id",
vendor_identifier="openai", # Determines which validator to use
model_name="My Custom Model", # Optional display name
categories=["safety", "bias"]
)
Available vendor identifiers for custom endpoints:
Get the list programmatically with client.config.get().vendors["custom"], or use one of these:
| Identifier | Use For |
|---|---|
| openai | OpenAI-compatible APIs (Ollama, vLLM, LiteLLM, etc.) - default |
| huggingface | Hugging Face Inference Endpoints |
| azure_ai | Azure AI / Azure OpenAI Service |
| xai | Google Vertex AI |
| bedrock | AWS Bedrock |
Examples:
# Ollama endpoint (uses default "openai" validator)
evaluation = client.evaluations.create_custom_endpoint(
api_endpoint="http://localhost:11434/v1",
api_key="ollama", # Ollama doesn't require a real key
model_identifier="llama3:8b"
)
# Azure AI endpoint
evaluation = client.evaluations.create_custom_endpoint(
api_endpoint="https://your-resource.openai.azure.com",
api_key="your-azure-key",
model_identifier="gpt-4",
vendor_identifier="azure_ai"
)
# Hugging Face endpoint
evaluation = client.evaluations.create_custom_endpoint(
api_endpoint="https://api-inference.huggingface.co/models/your-model",
api_key="hf_your_token",
model_identifier="your-model",
vendor_identifier="huggingface"
)
Core Concepts
Models
Discover available AI models:
# List all available models
models, api_sources = client.models.list()
for model in models:
    print(f"Model: {model.name}")
    print(f"Vendor: {model.vendor_identifier}")
    print(f"Platform key available: {model.available_via_trust_model_key}")
    print(f"Previously used BYOK: {model.available_via_byok}")
    print("---")
# Get specific model
model = client.models.get("openai", "gpt-4")
print(f"Found model: {model.name}")
Note: available_via_byok indicates you have previously used BYOK for this vendor. All vendors support BYOK - you can use your own API key with any model.
Evaluations
Create and manage AI model evaluations:
# Platform key (default) - uses TrustModel's keys
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
categories=["safety", "bias"]
)
# BYOK - uses your own API key
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
api_key="sk-your-openai-key",
categories=["safety", "bias"]
)
# Custom endpoint - your own API
evaluation = client.evaluations.create_custom_endpoint(
api_endpoint="https://api.yourcompany.com/v1",
api_key="your-api-key",
model_identifier="custom-model-v1"
)
Re-run from Template
Re-run a previous evaluation configuration using its template ID:
# Re-run using a saved template
evaluation = client.evaluations.create_from_template(
template_id="550e8400-e29b-41d4-a716-446655440000"
)
# Optionally update the template name
evaluation = client.evaluations.create_from_template(
template_id="550e8400-e29b-41d4-a716-446655440000",
template_name="My Updated Config Name"
)
The template contains all saved configuration (model, vendor, categories, etc.) so no other parameters are required. Template IDs are returned in evaluation results via the template_id field.
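For instance, a minimal sketch that re-runs an earlier configuration (assuming the completed evaluation exposes template_id as described above):

# Re-use the configuration of a previous run
previous = client.evaluations.get(evaluation_id)
if previous.template_id:
    rerun = client.evaluations.create_from_template(template_id=previous.template_id)
    print(f"Re-run started: {rerun.id}")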
Managing Evaluations
# List all evaluations
evaluations = client.evaluations.list()
# Filter by status
completed = client.evaluations.list(status="completed")
# Get detailed results
evaluation = client.evaluations.get(evaluation_id)
if evaluation.status == "completed":
    print(f"Overall Score: {evaluation.overall_score}")
    for score in evaluation.scores:
        print(f"{score.category}: {score.score:.2f}")
# Quick status check
status = client.evaluations.get_status(evaluation_id)
print(f"Progress: {status['completion_percentage']}%")
Batch Jobs & Model Comparison
Evaluate multiple models efficiently using batch jobs. Batch jobs are ideal for comparing models, running high-volume evaluations, and reducing API quota usage.
Creating Batch Evaluations
Create a batch to evaluate multiple models in parallel:
# Create a batch to evaluate multiple models
batch = client.batch_jobs.create(
batch_type="model_evaluation",
name="GPT-4 vs Claude-3 Evaluation",
description="Comparing GPT-4 and Claude-3 performance on safety and bias",
models=[
{"vendor_identifier": "openai", "model_identifier": "gpt-4"},
{"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
],
evaluation_config={"type": "comprehensive", "test_count": 50},
categories=["safety", "bias"], # Optional: specify evaluation categories
)
print(f"Batch created with ID: {batch.id}")
print(f"Status: {batch.status}")
print(f"Total models: {batch.total_models}")
Batch Types:
| Type | Purpose |
|---|---|
model_evaluation |
Evaluate multiple models independently |
model_score_comparison |
Compare models side-by-side with ranking |
Optional Parameters:
- categories: List of evaluation categories (e.g., ["safety", "bias", "performance"])
- api_key: Your vendor API key for BYOK evaluations across all models
- test_set_id: Use a specific test set instead of the default
- description: Human-readable description of the batch
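A hedged example combining these optional parameters (the API key and test set ID are placeholders):

# BYOK batch using your own OpenAI key and a specific test set
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="Nightly Safety Sweep",
    description="BYOK batch against a curated test set",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "openai", "model_identifier": "gpt-4o-mini"},
    ],
    evaluation_config={"type": "comprehensive"},
    categories=["safety", "bias"],
    api_key="sk-your-openai-key",   # applied to every model in the batch
    test_set_id="your-test-set-id", # optional: pin a specific test set
)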
Model Comparison Batch
Create a batch specifically for comparing multiple models:
# Create a comparison batch
comparison = client.batch_jobs.create(
batch_type="model_score_comparison",
name="Q1 2024 Model Comparison",
description="Comparing latest models across all categories",
models=[
{"vendor_identifier": "openai", "model_identifier": "gpt-5.2"},
{"vendor_identifier": "anthropic", "model_identifier": "claude-haiku-4-5"},
{"vendor_identifier": "mistralai", "model_identifier": "ministral-8b-2512"},
],
evaluation_config={"type": "comprehensive"},
)
print(f"Comparison batch created: {comparison.id}")
Monitoring Batch Progress
Poll for batch completion and get progress updates:
import time

batch = client.batch_jobs.get(batch_id)
# Check current status
print(f"Status: {batch.status}")
print(f"Progress: {batch.completion_percentage}%")
print(f"Completed: {batch.completed_models}/{batch.total_models}")
print(f"Failed: {batch.failed_models}")
# Poll until completion (example with 5-second intervals)
max_attempts = 120  # 10 minutes
for attempt in range(max_attempts):
    batch = client.batch_jobs.get(batch_id)
    print(f"[{attempt}] {batch.completion_percentage}% | {batch.completed_models}/{batch.total_models} | {batch.status}")
    if batch.status in ["completed", "partially_completed", "failed"]:
        break
    time.sleep(5)
Batch Status Values
| Status | Meaning |
|---|---|
| pending | Batch created, waiting to start |
| processing | Batch is actively evaluating models |
| completed | All models completed successfully |
| partially_completed | Some models completed, some failed |
| failed | Batch failed to process |
Understanding Batch Results
Access detailed results after batch completion:
batch = client.batch_jobs.get(batch_id)
print(f"Overall Status: {batch.status}")
print(f"Completion: {batch.completion_percentage}%")
# Per-model results
if batch.per_model_results:
    for model_id, result in batch.per_model_results.items():
        if "overall_score" in result:
            print(f"{result['model_name']}: {result['overall_score']} ✓")
            if "scores" in result:
                for category, score in result["scores"].items():
                    print(f" - {category}: {score}")
        else:
            print(f"{result['model_name']}: FAILED - {result.get('error_message')}")
# Cross-model comparison (for model_score_comparison batches)
if batch.cross_model_summary:
    summary = batch.cross_model_summary
    print("\n=== Ranking ===")
    for i, model_result in enumerate(summary.get("all_scores_sorted", []), 1):
        print(f"{i}. {model_result['model_name']}: {model_result['score']:.2f}")
    if summary.get("top_model"):
        print(f"\n🏆 Top Performer: {summary['top_model']['model_name']}")
    if summary.get("average_score"):
        print(f"📈 Average Score: {summary['average_score']:.2f}")
    if summary.get("score_range"):
        sr = summary["score_range"]
        print(f"📉 Score Range: {sr['min']:.2f} - {sr['max']:.2f}")
Result Structure:
Each model in per_model_results contains:
- model_name: Model display name
- vendor: Vendor identifier
- overall_score: Score from 0-100 (if successful)
- scores: Detailed category scores
- completed_at: When the evaluation completed
- error_message: Error details (if failed)
Cross-Model Summary contains:
- top_model: Best performing model
- bottom_model: Lowest performing model
- average_score: Mean score across all models
- score_range: Min/max scores
- all_scores_sorted: All models ranked by score
Listing Batch Jobs
List and filter batch jobs:
# List all batch jobs
batches = client.batch_jobs.list()
# Filter by type
model_evals = client.batch_jobs.list(batch_type="model_evaluation")
# Filter by status
completed = client.batch_jobs.list(status="completed")
# Pagination
page_2 = client.batch_jobs.list(limit=20, offset=20)
# Combine filters
active = client.batch_jobs.list(
    batch_type="model_score_comparison",
    status="processing"
)
# Access results
for batch in batches.results:
    print(f"{batch.name}: {batch.status} ({batch.completion_percentage}%)")
Configuration
Discover available options for evaluations:
# Get configuration options
config = client.config.get()
print("Available application types:")
for app_type in config.application_types:
    print(f" {app_type['id']}: {app_type['name']}")
print("Available categories:")
for category in config.categories:
    print(f" {category}")
print(f"Credits per category: {config.credits_per_category}")
Credits Management
Monitor your API key usage:
# Check credit balance
credits = client.credits.get_balance()
print(f"API Key: {credits.api_key_name}")
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")
print(f"Credit Limit: {credits.credit_limit}")
print(f"Status: {credits.status}")
Error Handling
The SDK provides specific exceptions for different error types:
import trustmodel
from trustmodel import (
    AuthenticationError,
    ConnectionValidationError,
    InsufficientCreditsError,
    RateLimitError,
    ValidationError,
    APIError
)

try:
    client = trustmodel.TrustModelClient(api_key="tm-your-key")
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key"  # BYOK
    )
except AuthenticationError:
    print("Invalid TrustModel API key")
except ConnectionValidationError as e:
    # BYOK or custom endpoint validation failed
    print(f"Vendor API key validation failed: {e.message}")
    if e.validation_details:
        status_code = e.validation_details.get("status_code")
        if status_code == 401:
            print("Check your vendor API key is valid and not expired")
        elif status_code == 404:
            print("Model not found - check the model identifier")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, but only {e.credits_remaining} remaining")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except ValidationError as e:
    print(f"Invalid input: {e}")
except APIError as e:
    print(f"API error: {e.message} (status: {e.status_code})")
Exception Reference
| Exception | When Raised |
|---|---|
| AuthenticationError | Invalid TrustModel API key |
| ConnectionValidationError | BYOK or custom endpoint API key validation failed |
| InsufficientCreditsError | Not enough credits for the evaluation |
| RateLimitError | Too many requests, need to wait |
| ValidationError | Invalid input parameters |
| ModelNotFoundError | Requested model doesn't exist |
| EvaluationNotFoundError | Requested evaluation doesn't exist |
| APIError | General API error (base class) |
Rate Limiting
All API keys are rate limited to 100 requests per hour.
Rate Limit Headers
Every API response includes rate limit information in headers:
import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-key")
try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
except trustmodel.RateLimitError as e:
    print(f"Rate limit exceeded: {e.message}")
    if hasattr(e, 'retry_after'):
        print(f"Retry after: {e.retry_after} seconds")
Rate Limit Headers in Response:
- X-RateLimit-Limit: Maximum requests allowed per hour
- X-RateLimit-Remaining: Requests remaining in current hour
- X-RateLimit-Reset: UNIX timestamp when limit resets
Rate Limit Response (HTTP 429):
{
"detail": "Rate limit exceeded. Maximum 100 requests per hour.",
"code": "rate_limit_exceeded",
"limit": 100,
"requests_used": 100,
"reset_at": 1706515200,
"retry_after_seconds": 3600
}
Handling Rate Limits
The SDK automatically retries rate-limited requests with exponential backoff:
from trustmodel import RateLimitError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        categories=["safety", "bias"]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded after retries: {e.message}")
    print(f"Current usage: {e.status_code}")
Automatic Retry Strategy:
- Retries up to 3 times (configurable via the max_retries parameter)
- Uses exponential backoff: 1s, 2s, 4s, 8s, etc.
- Automatically retries on: 429, 500, 502, 503, 504
Rate Limiting Best Practices
1. Monitor Your Usage
# Check credit balance which indicates usage
credits = client.credits.get_balance()
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")
2. Use Batch Jobs for High Volume
Batch jobs are more efficient and cost fewer quota units per evaluation:
batch = client.batch_jobs.create(
batch_type="model_evaluation",
name="Bulk Evaluation",
models=[
{"vendor_identifier": "openai", "model_identifier": "gpt-4"},
{"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
{"vendor_identifier": "google", "model_identifier": "gemini-1.5"},
],
evaluation_config={"type": "comprehensive"}
)
print(f"Batch created: 1 POST (2 quota) for 3 models instead of 3 POSTs (6 quota)")
3. Implement Exponential Backoff
The SDK handles this automatically, but you can also implement custom logic:
import time
from trustmodel import RateLimitError

max_retries = 5
for attempt in range(max_retries):
    try:
        result = client.evaluations.create(...)
        break
    except RateLimitError:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            raise
4. Plan Your Requests
Calculate estimated quota before making requests:
# Example calculation
models_to_evaluate = 10
evaluation_create_quota = models_to_evaluate * 2  # 10 creates * 2 quota each = 20
status_check_quota = 50  # Poll 50 times * 1 quota each = 50
total_quota_needed = evaluation_create_quota + status_check_quota
print(f"Estimated quota needed: {total_quota_needed}")
current_plan_limit = 100
remaining = 75
if total_quota_needed <= remaining:
    print("Proceeding with evaluations")
else:
    print("Insufficient quota, consider upgrading plan")
5. Configure Custom Timeouts and Retries
client = trustmodel.TrustModelClient(
api_key="tm-your-key",
timeout=120, # Increase timeout for large requests
max_retries=5 # More aggressive retry for rate limits
)
Upgrading Your Plan
If you consistently hit rate limits:
- Visit the TrustModel Dashboard
- Go to "Billing" or "Plan Settings"
- Select a higher tier (Starter, Pro, or Enterprise)
- Limits update immediately
Webhook Notifications
TrustModel sends webhook notifications when your evaluations complete or fail. Configure your webhook endpoint in the TrustModel Dashboard to receive these events.
Success Event: sdk_report_evaluation_success
Sent when an evaluation completes successfully:
{
"event_type": "sdk_report_evaluation_success",
"timestamp": "2026-01-21T13:41:44.253319+00:00",
"evaluation_run_id": 82,
"model_identifier": "gpt-4",
"status": "completed",
"completion_percentage": 100,
"overall_score": 65,
"category_scores": [
{
"category_name": "Accuracy",
"category_score": 100.0,
"subcategories": [
{
"subcategory_name": "Citation & Source Accuracy",
"subcategory_score": 100.0
}
]
}
]
}
Failure Event: sdk_report_evaluation_failed
Sent when an evaluation fails:
{
"event_type": "sdk_report_evaluation_failed",
"timestamp": "2026-01-21T12:38:18.349320+00:00",
"evaluation_run_id": 78,
"model_identifier": "gpt-4",
"failed_phase": "evaluation",
"failed_at": "2026-01-21T12:38:18.341673+00:00"
}
Webhook Event Fields
| Field | Description |
|---|---|
| event_type | Either sdk_report_evaluation_success or sdk_report_evaluation_failed |
| timestamp | ISO 8601 timestamp when the event was generated |
| evaluation_run_id | Unique identifier for the evaluation |
| model_identifier | The AI model that was evaluated |
| status | Current status (completed for success events) |
| completion_percentage | Progress percentage (100 for completed) |
| overall_score | Final evaluation score (success events only) |
| category_scores | Detailed scores by category (success events only) |
| failed_phase | Phase where failure occurred (failure events only) |
| failed_at | ISO 8601 timestamp of failure (failure events only) |
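A minimal receiver sketch using Flask (the endpoint path is your choice and any signature verification is omitted; field names follow the payloads above):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/trustmodel/webhook", methods=["POST"])
def trustmodel_webhook():
    event = request.get_json()
    if event["event_type"] == "sdk_report_evaluation_success":
        print(f"Evaluation {event['evaluation_run_id']} scored {event['overall_score']}")
    elif event["event_type"] == "sdk_report_evaluation_failed":
        print(f"Evaluation {event['evaluation_run_id']} failed during {event['failed_phase']}")
    return jsonify({"received": True}), 200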
Advanced Usage
Context Manager
Use the client as a context manager for automatic cleanup:
with trustmodel.TrustModelClient(api_key="tm-your-key") as client:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
# Client automatically closed when exiting context
Custom Configuration
# Custom timeouts and retries
client = trustmodel.TrustModelClient(
api_key="tm-your-key",
timeout=120, # 2 minute timeout
max_retries=5 # More aggressive retrying
)
Detailed Evaluation Configuration
evaluation = client.evaluations.create(
model_identifier="gpt-4",
vendor_identifier="openai",
categories=["safety", "bias", "performance"],
# Application context
application_type="chatbot",
application_description="Customer support chatbot for e-commerce",
# User personas
user_personas=["external-customer", "technical-user"],
# Domain expertise (when using domain-expert persona)
domain_expert_description="medical",
# Custom naming
model_config_name="GPT-4 Production Eval 2024-01"
)
Framework Integration
FastAPI
from fastapi import FastAPI, HTTPException
import trustmodel

app = FastAPI()
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.post("/evaluate")
async def create_evaluation(model: str, vendor: str):
    try:
        evaluation = client.evaluations.create(
            model_identifier=model,
            vendor_identifier=vendor
        )
        return {"evaluation_id": evaluation.id, "status": evaluation.status}
    except trustmodel.InsufficientCreditsError:
        raise HTTPException(status_code=402, detail="Insufficient credits")
Django
# views.py
from django.conf import settings
from django.http import JsonResponse
import trustmodel

def evaluate_model(request):
    client = trustmodel.TrustModelClient(api_key=settings.TRUSTMODEL_API_KEY)
    evaluation = client.evaluations.create(
        model_identifier=request.POST["model"],
        vendor_identifier=request.POST["vendor"]
    )
    return JsonResponse({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })
Flask
from flask import Flask, request, jsonify
import trustmodel

app = Flask(__name__)
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    data = request.get_json()
    evaluation = client.evaluations.create(
        model_identifier=data["model"],
        vendor_identifier=data["vendor"]
    )
    return jsonify({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })
Zero-Config Auto-Capture (auto_init)
Capture every LLM and tool call your AI agent makes — without changing your existing code. Two lines, and TrustModel takes care of trace collection, batching, transport, and evaluation.
How It Works
- You call auto_init(api_key, agent_id, ...) once at startup.
- The SDK auto-detects which AI libraries you're using (OpenAI, Anthropic, LangChain, etc.) and installs an OpenInference instrumentor for each.
- Every subsequent LLM / tool call is captured as an OpenTelemetry span and streamed to the TrustModel gateway.
- The TrustModel Control Plane buffers traces server-side, groups them by agent_id + domain + frameworks, and runs evaluation on the schedule you configure (daily, weekly, or monthly).
Installation
pip install "trustmodel[telemetry]"
This installs OpenTelemetry plus the OpenInference instrumentors. Python 3.10+ required for the telemetry extra.
If you only want the core SDK (e.g., client.evaluations.create()), pip install trustmodel works on Python 3.7+ but auto_init will not be available.
Quick Start
from trustmodel.telemetry import auto_init
auto_init(
api_key="tm-...", # your TrustModel API key
agent_id="my-customer-support-agent", # any identifier you choose
domain="general_ai", # fair_lending | hr_bias | healthcare | general_ai
frameworks=["nist-ai-rmf"], # one or more compliance framework slugs
)
# Your existing agent code — no changes needed
import openai
client = openai.OpenAI()
client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Hello"}],
)
That's it. The OpenAI call is automatically captured and streamed to TrustModel.
Discovering Domains and Frameworks
Use the client.frameworks endpoint to list available domains and the compliance frameworks within each:
from trustmodel import TrustModelClient

client = TrustModelClient(api_key="tm-...")
# All available domains
print(client.frameworks.list_domains())
# ['fair_lending', 'hr_bias', 'healthcare', 'general_ai']
# Frameworks for a specific domain
for f in client.frameworks.list(domain="fair_lending"):
    print(f"{f.slug}: {f.name} ({f.credits} credits)")
Use the slug values when calling auto_init(frameworks=[...]).
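For instance, a sketch that feeds the discovered slugs straight into auto_init (the agent_id here is just an example identifier):

from trustmodel.telemetry import auto_init

# Reuse the framework slugs discovered above
slugs = [f.slug for f in client.frameworks.list(domain="fair_lending")]
auto_init(
    api_key="tm-...",
    agent_id="loan-officer-agent",
    domain="fair_lending",
    frameworks=slugs,
)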
Supported AI Libraries
auto_init will automatically activate instrumentation for any of these libraries that you have installed:
| Library | Auto-detected via |
|---|---|
| OpenAI | openinference-instrumentation-openai |
| Anthropic (Claude) | openinference-instrumentation-anthropic |
| LangChain | openinference-instrumentation-langchain |
| LlamaIndex | openinference-instrumentation-llama-index |
| AWS Bedrock | openinference-instrumentation-bedrock |
| Mistral AI | openinference-instrumentation-mistralai |
| Groq | openinference-instrumentation-groq |
| CrewAI | openinference-instrumentation-crewai |
| Vertex AI | openinference-instrumentation-vertexai |
| DSPy | openinference-instrumentation-dspy |
All ten are installed by pip install "trustmodel[telemetry]". The instrumentors are no-ops when the underlying library isn't being used — there's no overhead.
Existing OpenTelemetry Setup (Datadog, Jaeger, Honeycomb, Arize)
If your application already has OpenTelemetry configured for another observability backend, auto_init detects this and adds TrustModel as an additional exporter rather than replacing your setup.
# Your existing OTel setup (e.g., Datadog APM, Jaeger, Honeycomb)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
provider = TracerProvider(...)
trace.set_tracer_provider(provider)
# Then add TrustModel — both backends now receive every span
from trustmodel.telemetry import auto_init
auto_init(api_key="tm-...", agent_id="my-agent")
Spans fan out to both your existing exporter and TrustModel's. Nothing in your existing pipeline is modified or replaced.
Parameters
| Parameter | Required | Description |
|---|---|---|
| api_key | yes | Your TrustModel API key (tm-...) |
| agent_id | yes | Any string you choose. All traces sharing this agent_id are grouped together for evaluation. |
| domain | yes | One of fair_lending, hr_bias, healthcare, general_ai. Determines which evaluators run. |
| frameworks | yes | List of compliance framework slugs (e.g., ["ecoa-regb", "fcra"]). Use client.frameworks.list(domain=...) to discover. |
| service_name | no | Logical service name (default: "default") |
Schedule and Evaluation
Configure when buffered traces are evaluated in the Control Plane dashboard:
- Manual — only when you click "Trigger Evaluation Now"
- Daily / Weekly / Monthly — automatically via cron at midnight UTC
Each evaluation produces:
- An overall trust score
- Per-category breakdowns (safety, fairness, accuracy, etc.)
- Findings and recommendations
Failure Modes (Safe by Design)
auto_init wraps everything in try/except. If anything fails — missing dependencies, network issues, an instrumentor crash — your application keeps running. Telemetry is best-effort; nothing TrustModel does will break your agent.
If the [telemetry] extras aren't installed and you try to import the telemetry module directly, you get a clear error:
ImportError: TrustModel telemetry requires extra dependencies.
Install with: pip install "trustmodel[telemetry]"
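If you want your application to start even when the extra isn't installed, a simple guarded import is enough:

# Telemetry is optional: fall back to a no-op if the extra isn't installed
try:
    from trustmodel.telemetry import auto_init
    auto_init(api_key="tm-...", agent_id="my-agent", domain="general_ai", frameworks=["nist-ai-rmf"])
except ImportError:
    pass  # run without auto-capture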
Agentic Trace Evaluation
Evaluate AI agent execution traces for safety, reasoning quality, tool usage, and goal completion. Upload a JSON or JSONL trace file and get scored across 14 dimensions.
Quick Start
import trustmodel
client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")
# Check pricing
pricing = client.agentic.get_pricing()
print(f"Credits per evaluation: {pricing.credits_required}")
print(f"Price: {pricing.display_amount}")
# Evaluate an agent trace
result = client.agentic.evaluate(
file_path="traces/agent_run.json",
goal="Resolve customer billing inquiry",
name="Support Bot Evaluation",
agent_framework="langchain",
agent_model="gpt-4o",
expected_outcome="Customer receives correct billing info",
actual_outcome="Applied credit and resolved inquiry",
goal_achieved=True,
)
print(f"Evaluation started: {result.evaluation_run_id}")
print(f"Status: {result.status}")
Trace File Format
Upload a JSON file with your agent's execution trace:
{
"goal": "Resolve customer billing inquiry",
"steps": [
{"step_type": "thought", "content": "Need to look up billing records..."},
{"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
{"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": true},
{"step_type": "final_answer", "content": "Applied $49.99 credit to your account."}
]
}
JSONL files are also supported (one JSON object per line).
Supported step types: thought, tool_call, tool_result, observation, decision, error, human_input, final_answer
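If your agent logs steps as Python dicts, a small sketch (file name and step contents are illustrative) that writes them in the documented format before upload:

import json

# Assemble a trace in the documented format and write it to disk
trace = {
    "goal": "Resolve customer billing inquiry",
    "steps": [
        {"step_type": "thought", "content": "Need to look up billing records..."},
        {"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
        {"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": True},
        {"step_type": "final_answer", "content": "Applied $49.99 credit to your account."},
    ],
}
with open("agent_run.json", "w") as f:
    json.dump(trace, f, indent=2)

result = client.agentic.evaluate(
    file_path="agent_run.json",
    goal=trace["goal"],
    name="Support Bot Evaluation",
    agent_framework="langchain",
)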
Parameters
| Parameter | Required | Description |
|---|---|---|
| file_path | Yes | Local path to .json or .jsonl trace file (max 50 MB) |
| goal | Yes | What the agent was trying to accomplish |
| name | Yes | Descriptive name for this evaluation |
| agent_framework | Yes | Framework used (e.g., langchain, crewai, autogen) |
| agent_model | No | Model powering the agent (e.g., gpt-4o) |
| expected_outcome | No | What should have happened |
| actual_outcome | No | What actually happened |
| goal_achieved | No | Whether the agent achieved its goal |
File Validation
The SDK validates your trace file locally before uploading:
- File must exist
- Extension must be .json or .jsonl
- File size must be under 50 MB
- Content must be valid JSON (or valid JSONL — one JSON object per line)
Retrieving Results
# Get detailed results (after evaluation completes)
detail = client.agentic.get(result.evaluation_run_id)
print(f"Overall Score: {detail.overall_score}")
print(f"Grade: {detail.grade}")
for score in detail.scores:
    print(f" {score['category_display_name']}: {score['score']}")
    print(f" {score['findings']}")
Example response:
{
"id": 146,
"status": "completed",
"overall_score": 76.0,
"grade": "C",
"scores": [
{"category_display_name": "Tool Use Accuracy", "score": 80.0, "findings": "1 CRITICAL tool(s) used without policy/approval check."},
{"category_display_name": "Reasoning Quality", "score": 58.0, "findings": "Low risk awareness (3.0/10)."},
{"category_display_name": "Goal Completion", "score": 90.0, "findings": "50% of actions classified as harmful."},
{"category_display_name": "Safety Compliance", "score": 80.0, "findings": "1 UNSAFE action(s) without confirmation."}
]
}
Listing Evaluations
# List all agentic evaluations
evaluations = client.agentic.list()
for ev in evaluations:
    score = f"{ev.overall_score:.1f}" if ev.overall_score else "pending"
    print(f"[{ev.evaluation_run_id}] {ev.name} — {ev.status} (score: {score})")
Scoring Categories
Evaluations are scored across these categories:
| Category | What It Measures |
|---|---|
| Tool Use Accuracy | Correct tool selection and parameter usage |
| Reasoning Quality | Logical, evidence-based decision making |
| Goal Completion | Whether the agent achieved its objective |
| Safety Compliance | Avoiding unsafe actions, PII leaks, auth bypasses |
| Safety | Overall safety of agent behavior |
| Fairness | Unbiased treatment across scenarios |
| Accuracy | Correctness of outputs and actions |
| Privacy | Protection of sensitive data |
| Transparency | Clarity of reasoning and decision-making |
| Robustness | Handling of edge cases and errors |
| Accountability | Proper escalation and audit trails |
| Explainability | Ability to justify actions taken |
| Compliance | Adherence to policies and regulations |
| Reliability | Consistent and dependable behavior |
Grade mapping: A (90+), B (80+), C (70+), D (60+), F (<60)
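As an illustration of that mapping, a small helper (not part of the SDK):

def grade_for(score: float) -> str:
    # Mirrors the documented mapping: A (90+), B (80+), C (70+), D (60+), F (<60)
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(grade_for(76.0))  # "C", matching the example response above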
Error Handling
from trustmodel import ValidationError, InsufficientCreditsError

try:
    result = client.agentic.evaluate(
        file_path="traces/agent_run.json",
        goal="Test goal",
        name="Test",
        agent_framework="langchain",
    )
except ValidationError as e:
    # File not found, wrong extension, too large, invalid JSON
    print(f"Validation error: {e}")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, have {e.credits_remaining}")
Galileo Integration
Evaluate agent traces from Galileo. The SDK calls TrustModel's backend APIs, which handle all trace pulling, transformation, and evaluation server-side.
Quick Start
import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")
# List available Galileo projects
projects = client.galileo.list_projects(galileo_api_key="your-galileo-key")
for p in projects:
    print(f" {p.name} ({p.id})")
# Run evaluation — traces are pulled and evaluated server-side
result = client.galileo.evaluate(
    galileo_api_key="your-galileo-key",
    project_name="Simple Chatbot",
    goal="Answer user questions accurately",
    name="Galileo Trace Eval",
)
print(f"Evaluation started: {result.evaluation_run_id}")
print(f"Status: {result.status}")
Parameters
| Parameter | Required | Description |
|---|---|---|
| galileo_api_key | Yes | Your Galileo API key |
| project_name | Yes | Galileo project name |
| goal | Yes | What the agent was trying to accomplish |
| name | Yes | Descriptive name for this evaluation |
| log_stream_name | No | Log stream name (defaults to first available) |
| agent_framework | No | Framework identifier (default: "galileo") |
| agent_model | No | Model powering the agent |
Retrieving Results
Galileo evaluations use the same scoring as agentic trace evaluations:
# Get results (after evaluation completes via webhook notification)
detail = client.agentic.get(result.evaluation_run_id)
print(f"Overall Score: {detail.overall_score}")
print(f"Grade: {detail.grade}")
for score in detail.scores:
    print(f" {score['category_display_name']}: {score['score']}")
Requirements
- Python 3.9 or higher
- requests >= 2.25.0
- pydantic >= 2.0.0
- tqdm >= 4.60.0
Support
- 💬 Support
License
This project is licensed under a proprietary license - see the LICENSE file for details.
Important: This SDK is provided exclusively for use with TrustModel's official API services. Modification, redistribution, or reverse engineering is prohibited.