
TrustModel

Official Python SDK for the TrustModel AI evaluation platform

Website | Documentation | Dashboard


Evaluate AI models for safety, bias, and performance with a simple, intuitive interface.

Features

  • 🚀 Simple Interface: Easy-to-use client for all TrustModel operations
  • 🔒 Secure: API key authentication with built-in validation
  • 🎯 Type Safe: Full type hints for excellent IDE support
  • 🔄 Reliable: Automatic retries and comprehensive error handling
  • 📊 Comprehensive: Support for all evaluation types and configurations
  • 🌍 Framework Agnostic: Works with any Python framework or standalone scripts

Installation

pip install trustmodel

Prerequisites

Before using the SDK, you must complete the following setup in the TrustModel Dashboard:

1. Create an API Key (Required)

You need a TrustModel API key to authenticate all SDK requests:

  1. Go to Keys & Webhooks in the dashboard
  2. Click "Create API Key"
  3. Copy your new API key (starts with tm-)
  4. Store it securely - you won't be able to see it again

2. Configure Webhooks (Required)

To receive notifications when evaluations complete or fail, you must configure webhooks:

  1. Go to Keys & Webhooks in the dashboard
  2. Click "Create Webhook"
  3. Enter your webhook endpoint URL
  4. Select the events you want to receive
  5. Save your webhook configuration

Important: Without configuring both an API key and webhooks in the webapp, you cannot run evaluations. The API will return an error if these are not set up.
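
Before running your first evaluation, it can help to verify that your API key is accepted. A minimal sketch using the credits endpoint documented below (webhook configuration still has to be done in the dashboard and is not checked here):

import os
import trustmodel
from trustmodel import AuthenticationError

# Startup check: confirms the TrustModel API key works before you run evaluations
api_key = os.getenv("TRUSTMODEL_API_KEY")

try:
    client = trustmodel.TrustModelClient(api_key=api_key)
    credits = client.credits.get_balance()
    print(f"API key OK - {credits.credits_remaining} credits remaining")
except AuthenticationError:
    print("API key rejected - create or re-copy it under Keys & Webhooks")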

Quick Start

import trustmodel

# Initialize the client
client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")

# List available models
models, api_sources = client.models.list()
print(f"Found {len(models)} models available")

# Create an evaluation
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias", "performance"]
)

print(f"Evaluation created with ID: {evaluation.id}")
print(f"Status: {evaluation.status}")

# You'll receive a webhook notification when the evaluation completes
# Then retrieve the results:
completed_evaluation = client.evaluations.get(evaluation.id)
print(f"Overall score: {completed_evaluation.overall_score}")

# Check your credit balance
credits = client.credits.get_balance()
print(f"Credits remaining: {credits.credits_remaining}")

Authentication

Get your API key from the TrustModel Dashboard and use it to initialize the client:

import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")

For production applications, store your API key securely using environment variables:

import os
import trustmodel

api_key = os.getenv("TRUSTMODEL_API_KEY")
client = trustmodel.TrustModelClient(api_key=api_key)

Evaluation Modes

TrustModel supports three ways to evaluate AI models:

Mode              Use Case                                        API Key Required
Platform Key      Quick evaluations using TrustModel's API keys   No (uses TrustModel's keys)
BYOK              Use your own vendor API key for any model       Yes (your vendor API key)
Custom Endpoint   Evaluate private/self-hosted models             Yes (your endpoint's API key)

Getting Available Vendors

Use client.config.get().vendors to discover available vendors:

config = client.config.get()

# Public vendors - for Platform Key and BYOK evaluations
public_vendors = config.vendors["public"]
for vendor in public_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")

# Custom vendors - for Custom Endpoint evaluations only
custom_vendors = config.vendors["custom"]
for vendor in custom_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")

Vendor Type   Use With             Description
public        Platform Key, BYOK   Vendors like OpenAI, Anthropic, Google AI for standard evaluations
custom        Custom Endpoint      Validators for self-hosted/private endpoints (OpenAI-compatible, Hugging Face, Azure AI, etc.)

Getting Available Models

Use client.models.list() to discover available models:

# Get all available models and API source info
models, api_sources = client.models.list()

# List all models with their details
for model in models:
    print(f"Model: {model.name}")
    print(f"  Identifier: {model.model_identifier}")
    print(f"  Vendor: {model.vendor_identifier}")
    print(f"  Platform Key Available: {model.available_via_trust_model_key}")
    print(f"  BYOK Available: {model.available_via_byok}")

# Filter models by vendor
openai_models = [m for m in models if m.vendor_identifier == "openai"]

# Filter models available via platform key (no vendor API key needed)
platform_key_models = [m for m in models if m.available_via_trust_model_key]

# Use a model in evaluation
model = models[0]
evaluation = client.evaluations.create(
    model_identifier=model.model_identifier,
    vendor_identifier=model.vendor_identifier,
    categories=["safety", "bias"]
)

Model Field                     Type   Description
name                            str    Human-readable model name
model_identifier                str    Identifier to use in API calls
vendor_identifier               str    Vendor identifier
available_via_trust_model_key   bool   Can evaluate without vendor API key
available_via_byok              bool   Previously used with your own API key

Platform Key (Default)

Use TrustModel's platform keys for quick evaluations. No vendor API key needed:

evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias"]
)

Note: Platform key availability varies by model. Check model.available_via_trust_model_key to see if a model supports this mode.

BYOK (Bring Your Own Key)

Use your own vendor API key to evaluate any model. All vendors support BYOK:

evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    api_key="sk-your-openai-key",  # Your OpenAI API key
    categories=["safety", "bias"]
)

How it works:

  1. You provide your vendor API key (e.g., OpenAI, Anthropic, Google)
  2. TrustModel validates the key before creating the evaluation
  3. If validation fails, a ConnectionValidationError is raised with details
  4. Your key is securely stored and used for the evaluation

Get vendor API keys directly from each provider's developer console (for example OpenAI, Anthropic, or Google AI).

Example with error handling:

from trustmodel import ConnectionValidationError, InsufficientCreditsError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key",
        categories=["safety", "bias"]
    )
    print(f"Evaluation created: {evaluation.id}")
except ConnectionValidationError as e:
    # API key validation failed
    print(f"Invalid API key: {e.message}")
    if e.validation_details:
        print(f"Details: {e.validation_details}")
except InsufficientCreditsError as e:
    print(f"Need more credits: {e.credits_required} required")

Custom Endpoint

Evaluate your own OpenAI-compatible API endpoint (Ollama, vLLM, LiteLLM, Azure AI, etc.):

# Create evaluation for a custom endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api.yourcompany.com/v1",
    api_key="your-api-key",
    model_identifier="your-model-id",
    vendor_identifier="openai",  # Determines which validator to use
    model_name="My Custom Model",  # Optional display name
    categories=["safety", "bias"]
)

Available vendor identifiers for custom endpoints:

Get the list programmatically with client.config.get().vendors["custom"], or use one of these:

Identifier    Use For
openai        OpenAI-compatible APIs (Ollama, vLLM, LiteLLM, etc.) - default
huggingface   Hugging Face Inference Endpoints
azure_ai      Azure AI / Azure OpenAI Service
xai           Google Vertex AI
bedrock       AWS Bedrock

Examples:

# Ollama endpoint (uses default "openai" validator)
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require a real key
    model_identifier="llama3:8b"
)

# Azure AI endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://your-resource.openai.azure.com",
    api_key="your-azure-key",
    model_identifier="gpt-4",
    vendor_identifier="azure_ai"
)

# Hugging Face endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api-inference.huggingface.co/models/your-model",
    api_key="hf_your_token",
    model_identifier="your-model",
    vendor_identifier="huggingface"
)

Core Concepts

Models

Discover available AI models:

# List all available models
models, api_sources = client.models.list()

for model in models:
    print(f"Model: {model.name}")
    print(f"Vendor: {model.vendor_identifier}")
    print(f"Platform key available: {model.available_via_trust_model_key}")
    print(f"Previously used BYOK: {model.available_via_byok}")
    print("---")

# Get specific model
model = client.models.get("openai", "gpt-4")
print(f"Found model: {model.name}")

Note: available_via_byok indicates you have previously used BYOK for this vendor. All vendors support BYOK - you can use your own API key with any model.

Evaluations

Create and manage AI model evaluations:

# Platform key (default) - uses TrustModel's keys
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias"]
)

# BYOK - uses your own API key
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    api_key="sk-your-openai-key",
    categories=["safety", "bias"]
)

# Custom endpoint - your own API
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api.yourcompany.com/v1",
    api_key="your-api-key",
    model_identifier="custom-model-v1"
)

Re-run from Template

Re-run a previous evaluation configuration using its template ID:

# Re-run using a saved template
evaluation = client.evaluations.create_from_template(
    template_id="550e8400-e29b-41d4-a716-446655440000"
)

# Optionally update the template name
evaluation = client.evaluations.create_from_template(
    template_id="550e8400-e29b-41d4-a716-446655440000",
    template_name="My Updated Config Name"
)

The template contains all saved configuration (model, vendor, categories, etc.) so no other parameters are required. Template IDs are returned in evaluation results via the template_id field.
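
For example, you can read the template_id off a previous evaluation and re-run the same configuration. A minimal sketch, assuming evaluation_id refers to an evaluation you created earlier:

# Re-run the configuration of a previous evaluation
previous = client.evaluations.get(evaluation_id)

if previous.template_id:
    rerun = client.evaluations.create_from_template(template_id=previous.template_id)
    print(f"Re-run started: {rerun.id}")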

Managing Evaluations

# List all evaluations
evaluations = client.evaluations.list()

# Filter by status
completed = client.evaluations.list(status="completed")

# Get detailed results
evaluation = client.evaluations.get(evaluation_id)
if evaluation.status == "completed":
    print(f"Overall Score: {evaluation.overall_score}")
    for score in evaluation.scores:
        print(f"{score.category}: {score.score:.2f}")

# Quick status check
status = client.evaluations.get_status(evaluation_id)
print(f"Progress: {status['completion_percentage']}%")

Batch Jobs & Model Comparison

Evaluate multiple models efficiently using batch jobs. Batch jobs are ideal for comparing models, running high-volume evaluations, and reducing API quota usage.

Creating Batch Evaluations

Create a batch to evaluate multiple models in parallel:

# Create a batch to evaluate multiple models
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="GPT-4 vs Claude-3 Evaluation",
    description="Comparing GPT-4 and Claude-3 performance on safety and bias",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
    ],
    evaluation_config={"type": "comprehensive", "test_count": 50},
    categories=["safety", "bias"],  # Optional: specify evaluation categories
)

print(f"Batch created with ID: {batch.id}")
print(f"Status: {batch.status}")
print(f"Total models: {batch.total_models}")

Batch Types:

Type                     Purpose
model_evaluation         Evaluate multiple models independently
model_score_comparison   Compare models side-by-side with ranking

Optional Parameters (see the combined example after this list):

  • categories: List of evaluation categories (e.g., ["safety", "bias", "performance"])
  • api_key: Your vendor API key for BYOK evaluations across all models
  • test_set_id: Use a specific test set instead of the default
  • description: Human-readable description of the batch
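
Here is a sketch combining these parameters in one call; the test set ID is a hypothetical placeholder for a value from your own account:

# Batch using your own OpenAI key (BYOK) and a specific test set
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="BYOK Batch",
    description="Batch evaluation using our own OpenAI key",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
    ],
    evaluation_config={"type": "comprehensive"},
    categories=["safety", "bias", "performance"],
    api_key="sk-your-openai-key",     # applied to all models in the batch
    test_set_id="your-test-set-id",   # hypothetical placeholder
)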

Model Comparison Batch

Create a batch specifically for comparing multiple models:

# Create a comparison batch
comparison = client.batch_jobs.create(
    batch_type="model_score_comparison",
    name="Q1 2024 Model Comparison",
    description="Comparing latest models across all categories",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-5.2"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-haiku-4-5"},
        {"vendor_identifier": "mistralai", "model_identifier": "ministral-8b-2512"},
    ],
    evaluation_config={"type": "comprehensive"},
)

print(f"Comparison batch created: {comparison.id}")

Monitoring Batch Progress

Poll for batch completion and get progress updates:

import time

# batch_id comes from a previously created batch (e.g., batch.id above)
batch = client.batch_jobs.get(batch_id)

# Check current status
print(f"Status: {batch.status}")
print(f"Progress: {batch.completion_percentage}%")
print(f"Completed: {batch.completed_models}/{batch.total_models}")
print(f"Failed: {batch.failed_models}")

# Poll until completion (example with 5-second intervals)
max_attempts = 120  # 10 minutes
for attempt in range(max_attempts):
    batch = client.batch_jobs.get(batch_id)

    print(f"[{attempt}] {batch.completion_percentage}% | {batch.completed_models}/{batch.total_models} | {batch.status}")

    if batch.status in ["completed", "partially_completed", "failed"]:
        break

    time.sleep(5)

Batch Status Values

Status                Meaning
pending               Batch created, waiting to start
processing            Batch is actively evaluating models
completed             All models completed successfully
partially_completed   Some models completed, some failed
failed                Batch failed to process

Understanding Batch Results

Access detailed results after batch completion:

batch = client.batch_jobs.get(batch_id)

print(f"Overall Status: {batch.status}")
print(f"Completion: {batch.completion_percentage}%")

# Per-model results
if batch.per_model_results:
    for model_id, result in batch.per_model_results.items():
        if "overall_score" in result:
            print(f"{result['model_name']}: {result['overall_score']} ✓")
            if "scores" in result:
                for category, score in result["scores"].items():
                    print(f"  - {category}: {score}")
        else:
            print(f"{result['model_name']}: FAILED - {result.get('error_message')}")

# Cross-model comparison (for model_score_comparison batches)
if batch.cross_model_summary:
    summary = batch.cross_model_summary

    print("\n=== Ranking ===")
    for i, model_result in enumerate(summary.get("all_scores_sorted", []), 1):
        print(f"{i}. {model_result['model_name']}: {model_result['score']:.2f}")

    if summary.get("top_model"):
        print(f"\n🏆 Top Performer: {summary['top_model']['model_name']}")

    if summary.get("average_score"):
        print(f"📈 Average Score: {summary['average_score']:.2f}")

    if summary.get("score_range"):
        sr = summary["score_range"]
        print(f"📉 Score Range: {sr['min']:.2f} - {sr['max']:.2f}")

Result Structure:

Each model in per_model_results contains:

  • model_name: Model display name
  • vendor: Vendor identifier
  • overall_score: Score from 0-100 (if successful)
  • scores: Detailed category scores
  • completed_at: When the evaluation completed
  • error_message: Error details (if failed)

Cross-Model Summary contains:

  • top_model: Best performing model
  • bottom_model: Lowest performing model
  • average_score: Mean score across all models
  • score_range: Min/max scores
  • all_scores_sorted: All models ranked by score

Listing Batch Jobs

List and filter batch jobs:

# List all batch jobs
batches = client.batch_jobs.list()

# Filter by type
model_evals = client.batch_jobs.list(batch_type="model_evaluation")

# Filter by status
completed = client.batch_jobs.list(status="completed")

# Pagination
page_2 = client.batch_jobs.list(limit=20, offset=20)

# Combine filters
active = client.batch_jobs.list(
    batch_type="model_score_comparison",
    status="processing"
)

# Access results
for batch in batches.results:
    print(f"{batch.name}: {batch.status} ({batch.completion_percentage}%)")

Configuration

Discover available options for evaluations:

# Get configuration options
config = client.config.get()

print("Available application types:")
for app_type in config.application_types:
    print(f"  {app_type['id']}: {app_type['name']}")

print("Available categories:")
for category in config.categories:
    print(f"  {category}")

print(f"Credits per category: {config.credits_per_category}")

Credits Management

Monitor your API key usage:

# Check credit balance
credits = client.credits.get_balance()

print(f"API Key: {credits.api_key_name}")
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")
print(f"Credit Limit: {credits.credit_limit}")
print(f"Status: {credits.status}")

Error Handling

The SDK provides specific exceptions for different error types:

import trustmodel
from trustmodel import (
    AuthenticationError,
    ConnectionValidationError,
    InsufficientCreditsError,
    RateLimitError,
    ValidationError,
    APIError
)

try:
    client = trustmodel.TrustModelClient(api_key="tm-your-key")
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key"  # BYOK
    )
except AuthenticationError:
    print("Invalid TrustModel API key")
except ConnectionValidationError as e:
    # BYOK or custom endpoint validation failed
    print(f"Vendor API key validation failed: {e.message}")
    if e.validation_details:
        status_code = e.validation_details.get("status_code")
        if status_code == 401:
            print("Check your vendor API key is valid and not expired")
        elif status_code == 404:
            print("Model not found - check the model identifier")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, but only {e.credits_remaining} remaining")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except ValidationError as e:
    print(f"Invalid input: {e}")
except APIError as e:
    print(f"API error: {e.message} (status: {e.status_code})")

Exception Reference

Exception                   When Raised
AuthenticationError         Invalid TrustModel API key
ConnectionValidationError   BYOK or custom endpoint API key validation failed
InsufficientCreditsError    Not enough credits for the evaluation
RateLimitError              Too many requests, need to wait
ValidationError             Invalid input parameters
ModelNotFoundError          Requested model doesn't exist
EvaluationNotFoundError     Requested evaluation doesn't exist
APIError                    General API error (base class)

Rate Limiting

All API keys are rate limited to 100 requests per hour.

Rate Limit Headers

Every API response includes rate limit information in headers:

import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-key")

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
except trustmodel.RateLimitError as e:
    print(f"Rate limit exceeded: {e.message}")
    if hasattr(e, 'retry_after'):
        print(f"Retry after: {e.retry_after} seconds")

Rate Limit Headers in Response:

  • X-RateLimit-Limit: Maximum requests allowed per hour
  • X-RateLimit-Remaining: Requests remaining in current hour
  • X-RateLimit-Reset: UNIX timestamp when limit resets

Rate Limit Response (HTTP 429):

{
  "detail": "Rate limit exceeded. Maximum 100 requests per hour.",
  "code": "rate_limit_exceeded",
  "limit": 100,
  "requests_used": 100,
  "reset_at": 1706515200,
  "retry_after_seconds": 3600
}

Handling Rate Limits

The SDK automatically retries rate-limited requests with exponential backoff:

from trustmodel import RateLimitError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        categories=["safety", "bias"]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded after retries: {e.message}")
    print(f"Current usage: {e.status_code}")

Automatic Retry Strategy:

  • Retries up to 3 times (configurable via max_retries parameter)
  • Uses exponential backoff: 1s, 2s, 4s, 8s, etc.
  • Automatically retries on: 429, 500, 502, 503, 504

Rate Limiting Best Practices

1. Monitor Your Usage

# Check credit balance which indicates usage
credits = client.credits.get_balance()
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")

2. Use Batch Jobs for High Volume

Batch jobs are more efficient and cost fewer quota units per evaluation:

batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="Bulk Evaluation",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
        {"vendor_identifier": "google", "model_identifier": "gemini-1.5"},
    ],
    evaluation_config={"type": "comprehensive"}
)

print(f"Batch created: 1 POST (2 quota) for 3 models instead of 3 POSTs (6 quota)")

3. Implement Exponential Backoff

The SDK handles this automatically, but you can also implement custom logic:

import time
from trustmodel import RateLimitError

max_retries = 5
for attempt in range(max_retries):
    try:
        result = client.evaluations.create(...)
        break
    except RateLimitError:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            raise

4. Plan Your Requests

Calculate estimated quota before making requests:

# Example calculation
models_to_evaluate = 10
evaluation_creates = models_to_evaluate * 2  # 10 creates * 2 quota each = 20
status_checks = 50 * 1                       # poll 50 times * 1 quota each = 50
total_quota_needed = evaluation_creates + status_checks
print(f"Estimated quota needed: {total_quota_needed}")  # 70

current_plan_limit = 100
remaining = 75

if total_quota_needed <= remaining:
    print("Proceeding with evaluations")
else:
    print("Insufficient quota, consider upgrading plan")

5. Configure Custom Timeouts and Retries

client = trustmodel.TrustModelClient(
    api_key="tm-your-key",
    timeout=120,  # Increase timeout for large requests
    max_retries=5  # More aggressive retry for rate limits
)

Upgrading Your Plan

If you consistently hit rate limits:

  1. Visit the TrustModel Dashboard
  2. Go to "Billing" or "Plan Settings"
  3. Select a higher tier (Starter, Pro, or Enterprise)
  4. Limits update immediately

Webhook Notifications

TrustModel sends webhook notifications when your evaluations complete or fail. Configure your webhook endpoint in the TrustModel Dashboard to receive these events.

Success Event: sdk_report_evaluation_success

Sent when an evaluation completes successfully:

{
  "event_type": "sdk_report_evaluation_success",
  "timestamp": "2026-01-21T13:41:44.253319+00:00",
  "evaluation_run_id": 82,
  "model_identifier": "gpt-4",
  "status": "completed",
  "completion_percentage": 100,
  "overall_score": 65,
  "category_scores": [
    {
      "category_name": "Accuracy",
      "category_score": 100.0,
      "subcategories": [
        {
          "subcategory_name": "Citation & Source Accuracy",
          "subcategory_score": 100.0
        }
      ]
    }
  ]
}

Failure Event: sdk_report_evaluation_failed

Sent when an evaluation fails:

{
  "event_type": "sdk_report_evaluation_failed",
  "timestamp": "2026-01-21T12:38:18.349320+00:00",
  "evaluation_run_id": 78,
  "model_identifier": "gpt-4",
  "failed_phase": "evaluation",
  "failed_at": "2026-01-21T12:38:18.341673+00:00"
}

Webhook Event Fields

Field                   Description
event_type              Either sdk_report_evaluation_success or sdk_report_evaluation_failed
timestamp               ISO 8601 timestamp when the event was generated
evaluation_run_id       Unique identifier for the evaluation
model_identifier        The AI model that was evaluated
status                  Current status (completed for success events)
completion_percentage   Progress percentage (100 for completed)
overall_score           Final evaluation score (success events only)
category_scores         Detailed scores by category (success events only)
failed_phase            Phase where failure occurred (failure events only)
failed_at               ISO 8601 timestamp of failure (failure events only)
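
On the receiving side, a minimal webhook endpoint only needs to branch on event_type and read the fields above. A sketch using Flask (see the Flask integration example later in this README); request authentication is omitted:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/trustmodel/webhook", methods=["POST"])  # use the URL you registered in the dashboard
def trustmodel_webhook():
    event = request.get_json()

    if event["event_type"] == "sdk_report_evaluation_success":
        print(f"Evaluation {event['evaluation_run_id']} scored {event['overall_score']}")
    elif event["event_type"] == "sdk_report_evaluation_failed":
        print(f"Evaluation {event['evaluation_run_id']} failed during {event['failed_phase']}")

    return jsonify({"received": True}), 200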

Advanced Usage

Context Manager

Use the client as a context manager for automatic cleanup:

with trustmodel.TrustModelClient(api_key="tm-your-key") as client:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
    # Client automatically closed when exiting context

Custom Configuration

# Custom timeouts and retries
client = trustmodel.TrustModelClient(
    api_key="tm-your-key",
    timeout=120,  # 2 minute timeout
    max_retries=5  # More aggressive retrying
)

Detailed Evaluation Configuration

evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias", "performance"],

    # Application context
    application_type="chatbot",
    application_description="Customer support chatbot for e-commerce",

    # User personas
    user_personas=["external-customer", "technical-user"],

    # Domain expertise (when using domain-expert persona)
    domain_expert_description="medical",

    # Custom naming
    model_config_name="GPT-4 Production Eval 2024-01"
)

Framework Integration

FastAPI

from fastapi import FastAPI, HTTPException
import trustmodel

app = FastAPI()
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.post("/evaluate")
async def create_evaluation(model: str, vendor: str):
    try:
        evaluation = client.evaluations.create(
            model_identifier=model,
            vendor_identifier=vendor
        )
        return {"evaluation_id": evaluation.id, "status": evaluation.status}
    except trustmodel.InsufficientCreditsError:
        raise HTTPException(status_code=402, detail="Insufficient credits")

Django

# views.py
from django.conf import settings
from django.http import JsonResponse
import trustmodel

def evaluate_model(request):
    client = trustmodel.TrustModelClient(api_key=settings.TRUSTMODEL_API_KEY)

    evaluation = client.evaluations.create(
        model_identifier=request.POST["model"],
        vendor_identifier=request.POST["vendor"]
    )

    return JsonResponse({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })

Flask

from flask import Flask, request, jsonify
import trustmodel

app = Flask(__name__)
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    data = request.get_json()

    evaluation = client.evaluations.create(
        model_identifier=data["model"],
        vendor_identifier=data["vendor"]
    )

    return jsonify({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })

Agentic Trace Evaluation

Evaluate AI agent execution traces for safety, reasoning quality, tool usage, and goal completion. Upload a JSON or JSONL trace file and get scored across 14 dimensions.

Quick Start

import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")

# Check pricing
pricing = client.agentic.get_pricing()
print(f"Credits per evaluation: {pricing.credits_required}")
print(f"Price: {pricing.display_amount}")

# Evaluate an agent trace
result = client.agentic.evaluate(
    file_path="traces/agent_run.json",
    goal="Resolve customer billing inquiry",
    name="Support Bot Evaluation",
    agent_framework="langchain",
    agent_model="gpt-4o",
    expected_outcome="Customer receives correct billing info",
    actual_outcome="Applied credit and resolved inquiry",
    goal_achieved=True,
)

print(f"Evaluation started: {result.evaluation_run_id}")
print(f"Status: {result.status}")

Trace File Format

Upload a JSON file with your agent's execution trace:

{
  "goal": "Resolve customer billing inquiry",
  "steps": [
    {"step_type": "thought", "content": "Need to look up billing records..."},
    {"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
    {"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": true},
    {"step_type": "final_answer", "content": "Applied $49.99 credit to your account."}
  ]
}

JSONL files are also supported (one JSON object per line).

Supported step types: thought, tool_call, tool_result, observation, decision, error, human_input, final_answer
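
If your agent framework doesn't emit this format directly, you can assemble the trace yourself. A minimal sketch that writes the JSON structure shown above:

import json

# Build a trace file from in-memory steps using the documented fields
trace = {
    "goal": "Resolve customer billing inquiry",
    "steps": [
        {"step_type": "thought", "content": "Need to look up billing records..."},
        {"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
        {"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": True},
        {"step_type": "final_answer", "content": "Applied $49.99 credit to your account."},
    ],
}

with open("traces/agent_run.json", "w", encoding="utf-8") as f:
    json.dump(trace, f, indent=2)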

Parameters

Parameter          Required   Description
file_path          Yes        Local path to .json or .jsonl trace file (max 50 MB)
goal               Yes        What the agent was trying to accomplish
name               Yes        Descriptive name for this evaluation
agent_framework    Yes        Framework used (e.g., langchain, crewai, autogen)
agent_model        No         Model powering the agent (e.g., gpt-4o)
expected_outcome   No         What should have happened
actual_outcome     No         What actually happened
goal_achieved      No         Whether the agent achieved its goal

File Validation

The SDK validates your trace file locally before uploading:

  • File must exist
  • Extension must be .json or .jsonl
  • File size must be under 50 MB
  • Content must be valid JSON (or valid JSONL — one JSON object per line)

Retrieving Results

# Get detailed results (after evaluation completes)
detail = client.agentic.get(result.evaluation_run_id)

print(f"Overall Score: {detail.overall_score}")
print(f"Grade: {detail.grade}")

for score in detail.scores:
    print(f"  {score['category_display_name']}: {score['score']}")
    print(f"    {score['findings']}")

Example response:

{
  "id": 146,
  "status": "completed",
  "overall_score": 76.0,
  "grade": "C",
  "scores": [
    {"category_display_name": "Tool Use Accuracy", "score": 80.0, "findings": "1 CRITICAL tool(s) used without policy/approval check."},
    {"category_display_name": "Reasoning Quality", "score": 58.0, "findings": "Low risk awareness (3.0/10)."},
    {"category_display_name": "Goal Completion", "score": 90.0, "findings": "50% of actions classified as harmful."},
    {"category_display_name": "Safety Compliance", "score": 80.0, "findings": "1 UNSAFE action(s) without confirmation."}
  ]
}

Listing Evaluations

# List all agentic evaluations
evaluations = client.agentic.list()

for ev in evaluations:
    score = f"{ev.overall_score:.1f}" if ev.overall_score is not None else "pending"
    print(f"[{ev.evaluation_run_id}] {ev.name} - {ev.status} (score: {score})")

Scoring Categories

Evaluations are scored across these categories:

Category            What It Measures
Tool Use Accuracy   Correct tool selection and parameter usage
Reasoning Quality   Logical, evidence-based decision making
Goal Completion     Whether the agent achieved its objective
Safety Compliance   Avoiding unsafe actions, PII leaks, auth bypasses
Safety              Overall safety of agent behavior
Fairness            Unbiased treatment across scenarios
Accuracy            Correctness of outputs and actions
Privacy             Protection of sensitive data
Transparency        Clarity of reasoning and decision-making
Robustness          Handling of edge cases and errors
Accountability      Proper escalation and audit trails
Explainability      Ability to justify actions taken
Compliance          Adherence to policies and regulations
Reliability         Consistent and dependable behavior

Grade mapping: A (90+), B (80+), C (70+), D (60+), F (<60)
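
The API returns the grade for you, but if you want to reproduce the mapping locally (for example when aggregating batch results), a direct translation of the thresholds above:

def grade_from_score(score: float) -> str:
    """Map an overall score (0-100) to a letter grade using the thresholds above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(grade_from_score(76.0))  # "C", matching the example response above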

Error Handling

from trustmodel import ValidationError, InsufficientCreditsError

try:
    result = client.agentic.evaluate(
        file_path="traces/agent_run.json",
        goal="Test goal",
        name="Test",
        agent_framework="langchain",
    )
except ValidationError as e:
    # File not found, wrong extension, too large, invalid JSON
    print(f"Validation error: {e}")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, have {e.credits_remaining}")

Requirements

  • Python 3.9 or higher
  • requests >= 2.25.0
  • pydantic >= 2.0.0
  • tqdm >= 4.60.0

License

This project is licensed under a proprietary license - see the LICENSE file for details.

Important: This SDK is provided exclusively for use with TrustModel's official API services. Modification, redistribution, or reverse engineering is prohibited.
