Standalone evaluation engine for LLM applications

cat-experiments

Cat Experiments

A flexible, DataFrame-compatible evaluation system that works standalone or integrates with cat-cafe server infrastructure.

Features

  • Flexible Data Models: Support any dataset structure with dictionary-based input/output
  • Deterministic Preview Runs: Limit execution to an exact number of examples with preview_examples and preview_seed
  • Explicit Repetitions: Run each example multiple times and track repetition metadata end-to-end
  • Comprehensive Evaluators: Built-in evaluators for tool call correctness and more
  • Modern Python: Targets Python 3.12+ with modern typing features
  • Async Support: Full async/await support for evaluation pipelines
  • Tool Call Evaluation: Advanced matching algorithms for tool call correctness
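
As a rough illustration of what tool call correctness can mean, here is a small, library-independent sketch. It is not the matching algorithm cat-experiments ships (this README does not describe it); ToolCall and tool_call_score are hypothetical names, and the scoring rule (fraction of expected calls found among the actual calls, ignoring order) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical stand-in for a recorded tool call."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def tool_call_score(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Return the fraction of expected calls matched by name and arguments.

    Order is ignored, and each actual call can satisfy at most one expected call.
    """
    if not expected:
        # Nothing was expected: perfect only if nothing was called.
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for want in expected:
        for got in remaining:
            if got == want:  # dataclass equality compares name and args
                remaining.remove(got)
                matched += 1
                break
    return matched / len(expected)


expected = [
    ToolCall("search", {"query": "refund policy"}),
    ToolCall("send_email", {"to": "a@example.com"}),
]
actual = [
    ToolCall("send_email", {"to": "a@example.com"}),
    ToolCall("search", {"query": "refunds"}),  # same tool, different arguments
]
score = tool_call_score(expected, actual)
print(score)  # 0.5: only send_email matches exactly
```

A stricter or looser rule (ordered matching, partial argument matching) would slot into the same shape by changing the comparison inside the loop.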

Quick Start

from cat.experiments import (
    DatasetExample,
    TestCase,
    ExperimentConfig,
    ExperimentRunner,
    basic_tool_correctness_evaluator,
)

# Describe your dataset
dataset = [
    DatasetExample(
        input={"messages": [{"role": "user", "content": "Hello"}]},
        output={"messages": [{"role": "assistant", "content": "Hi there!"}]},
    )
]

# Define the system under test
def my_llm_function(example: DatasetExample) -> str:
    return "Hi there!"

# Execute a small preview with two repetitions per example
runner = ExperimentRunner()
summary = runner.run(
    dataset=dataset,
    task=my_llm_function,
    evaluators=[basic_tool_correctness_evaluator],
    config=ExperimentConfig(
        name="Smoke Test",
        preview_examples=1,
        repetitions=2,
    ),
)

print(summary.total_examples)  # => 2 example runs (1 example × 2 repetitions)
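
To picture how preview_examples, preview_seed, and repetitions could combine into a run plan, consider the sketch below. It does not use cat-experiments at all: plan_runs is a hypothetical helper, and seeded random sampling is an assumption about the selection strategy, which this README does not specify.

```python
import random


def plan_runs(example_ids, preview_examples=None, preview_seed=0, repetitions=1):
    """Return (example_id, repetition_number) pairs for a preview run."""
    ids = list(example_ids)
    if preview_examples is not None and preview_examples < len(ids):
        # A dedicated Random instance keeps the choice reproducible for a given seed.
        ids = random.Random(preview_seed).sample(ids, preview_examples)
    # Repetition numbers start at 1 in this sketch.
    return [(ex, rep) for ex in ids for rep in range(1, repetitions + 1)]


pairs = plan_runs(["ex-1", "ex-2", "ex-3"], preview_examples=1, preview_seed=42, repetitions=2)
print(len(pairs))  # 2 runs: 1 example x 2 repetitions
```

Calling plan_runs again with the same seed yields the same pairs, which is the property a deterministic preview needs.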

If you prefer the lower-level APIs, generate() now accepts TestCase objects so you can decide exactly which example/repetition pairs to process:

from cat.experiments import TestCase, generate, evaluate

# Reuses dataset, my_llm_function, and basic_tool_correctness_evaluator
# from the Quick Start snippet above.
runs = [TestCase(example=dataset[0], repetition_number=1)]
contexts = generate(runs, my_llm_function)
results = evaluate(contexts, [basic_tool_correctness_evaluator])

Phoenix Integration

To mirror the Phoenix “Run Experiments” tutorial while remaining offline-friendly, plug the PhoenixExperimentListener into the cat-experiments runner. Because Phoenix support depends on the optional phoenix-client, import it explicitly:

from cat.experiments.adapters.phoenix import PhoenixExperimentListener, PhoenixSyncConfig

Configure phoenix-client per its docs (set env vars, config files, etc.), then run the example script from your shell:

export CAT_EVALS_DATASET=support-ticket-demo
python packages/cat-experiments/examples/phoenix_experiment_example.py

The script in packages/cat-experiments/examples/phoenix_experiment_example.py shows how to:

  • Fetch a dataset with phoenix-client
  • Convert it to DatasetExample objects
  • Run a cat.experiments experiment (task + evaluator)
  • Stream runs/evaluations back to Phoenix using the PhoenixExperimentListener

If the named dataset does not exist, the script will automatically create a sample support-ticket dataset so you can get started immediately.

CAT Cafe Integration

CAT Cafe users can mirror the server-side experiment records directly from cat-experiments by attaching CatCafeExperimentListener. A minimal setup:

from cat_cafe.sdk.client import CATCafeClient
from cat.experiments.adapters import CatCafeExperimentListener, CatCafeSyncConfig
from cat.experiments import ExperimentRunner, ExperimentConfig

client = CATCafeClient(base_url="http://localhost:8000")
listener = CatCafeExperimentListener(client, config=CatCafeSyncConfig(submission_mode="on_completion"))

runner = ExperimentRunner()
runner.listeners.append(listener)
runner.run(dataset=examples, task=my_task, evaluators=[my_evaluator],
           config=ExperimentConfig(name="My CAT experiment", dataset_id="dataset-123"))

Each completed example is transformed into CAT Cafe's experiment result schema, and the listener automatically calls start_experiment, submit_results, and complete_experiment so the run appears in the CAT Cafe UI.
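
The call sequence described above can be sketched without a running server. In the block below, RecordingClient and BufferingListener are hypothetical stand-ins, not the cat_cafe or cat-experiments classes; the hook names (on_start, on_result, on_complete) and the method signatures are assumptions. Only the three client method names come from this README.

```python
class RecordingClient:
    """Stand-in for a server client; it only records which calls were made."""

    def __init__(self):
        self.calls = []

    def start_experiment(self, name):
        self.calls.append(("start_experiment", name))

    def submit_results(self, results):
        self.calls.append(("submit_results", len(results)))

    def complete_experiment(self):
        self.calls.append(("complete_experiment", None))


class BufferingListener:
    """Hypothetical listener that submits all results once the run finishes."""

    def __init__(self, client):
        self.client = client
        self.buffer = []

    def on_start(self, name):
        self.client.start_experiment(name)

    def on_result(self, result):
        self.buffer.append(result)  # hold results until completion

    def on_complete(self):
        self.client.submit_results(self.buffer)
        self.client.complete_experiment()


client = RecordingClient()
listener = BufferingListener(client)
listener.on_start("My CAT experiment")
for output in ["a", "b"]:
    listener.on_result({"output": output})
listener.on_complete()
print(client.calls)
```

Buffering until completion is one plausible reading of submission_mode="on_completion"; a streaming mode would call submit_results from on_result instead.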

To see a full working example that seeds a dataset and streams a run to CAT Cafe, run:

export CAT_BASE_URL=http://localhost:8000
export CAT_DATASET=cat-experiments-support-demo
uv run packages/cat-experiments/examples/cat_cafe_experiment_example.py

The script follows the same offline-friendly pattern as the Phoenix example, automatically creating a sample dataset if the name is not found.

Runner Builders

If you prefer not to wire listeners manually, use the builder helpers:

from cat.experiments import (
    build_local_runner,
    build_phoenix_runner,
    build_cat_cafe_runner,
)

local_runner = build_local_runner()
cat_runner = build_cat_cafe_runner()
phoenix_runner = build_phoenix_runner()

Each factory returns an ExperimentRunner with the matching adapter configured plus the local storage adapter, so you can immediately call runner.run(...) without additional plumbing.

Resume Cached Experiments

When runs are cached locally, you can resume unfinished repetitions without touching Phoenix or CAT Cafe:

from cat.experiments.adapters import LocalCacheResumeCoordinator

coordinator = LocalCacheResumeCoordinator()
plan = coordinator.build_task_resume_plan("exp_123")

if plan.has_work:
    coordinator.resume_task_runs(
        experiment_id="exp_123",
        task=test_function,
        evaluators=[my_evaluator],
    )

The local storage adapter captures config.json, examples.jsonl, and runs.jsonl per experiment so the resume coordinator can replay only the pending (example, repetition) pairs.
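
The idea of replaying only pending pairs can be sketched as a set difference between the planned pairs and those already recorded. This is an illustration, not the coordinator's implementation: pending_pairs is a hypothetical helper, and the field names example_id and repetition_number are assumptions about the runs.jsonl record layout.

```python
import json


def pending_pairs(example_ids, repetitions, runs_jsonl):
    """Return planned (example_id, repetition_number) pairs with no recorded run."""
    done = set()
    for line in runs_jsonl.splitlines():
        if line.strip():  # tolerate blank lines in the JSONL text
            record = json.loads(line)
            done.add((record["example_id"], record["repetition_number"]))
    planned = [(ex, rep) for ex in example_ids for rep in range(1, repetitions + 1)]
    return [pair for pair in planned if pair not in done]


# Simulated runs.jsonl content: ex-2 is missing its second repetition.
runs_jsonl = "\n".join([
    json.dumps({"example_id": "ex-1", "repetition_number": 1}),
    json.dumps({"example_id": "ex-1", "repetition_number": 2}),
    json.dumps({"example_id": "ex-2", "repetition_number": 1}),
])
todo = pending_pairs(["ex-1", "ex-2"], repetitions=2, runs_jsonl=runs_jsonl)
print(todo)  # [('ex-2', 2)]
```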

For an end-to-end walkthrough that stays entirely on disk, run the local storage example:

uv run packages/cat-experiments/examples/local_storage_evaluator_example.py

It writes runs via LocalStorageExperimentListener, then uses LocalEvaluationCoordinator plus ExperimentRunner.rerun_evaluators() to append a new evaluator without re-running the task phase.

Re-run Evaluators Later

To mirror Phoenix's "persist first, evaluate later" flow, the local cache, CAT Cafe, and Phoenix adapters expose evaluation coordinators that rehydrate recorded runs before executing new evaluators.

from cat.experiments import ExperimentRunner
from cat.experiments.adapters import (
    LocalEvaluationCoordinator,
    CatCafeEvaluationCoordinator,
    PhoenixEvaluationCoordinator,
)
from cat_cafe.sdk.client import CATCafeClient
from phoenix.client import Client as PhoenixClient

local_eval = LocalEvaluationCoordinator()
local_eval.run_evaluators(
    experiment_id="exp_123",
    evaluators=[accuracy_evaluator, safety_check],
)

cat_eval = CatCafeEvaluationCoordinator(CATCafeClient())
cat_eval.run_evaluators(
    experiment_id="exp_456",
    evaluators=[hallucination_score],
)

runner = ExperimentRunner()
runner.rerun_evaluators(
    experiment_id="exp_123",
    evaluators=[latency_grade],
    backend=local_eval,
)

phoenix_eval = PhoenixEvaluationCoordinator(PhoenixClient())
phoenix_eval.run_evaluators(
    experiment_id="exp_789",
    evaluators=[cost_score],
)

LocalEvaluationCoordinator updates the cached runs.jsonl with the new metrics, while CatCafeEvaluationCoordinator automatically resubmits the enriched results to CAT Cafe so the UI can display the added evaluators without rerunning any tasks. ExperimentRunner.rerun_evaluators centralizes the evaluate-only flow so you can plug in any backend that knows how to fetch/persist runs.
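
The evaluate-only flow boils down to running new evaluators over stored outputs and merging the resulting metrics into each record. The sketch below shows that idea with plain dictionaries; add_metrics and exact_match are hypothetical, and the record keys (output, expected, metrics) are assumptions rather than the library's schema.

```python
def add_metrics(runs, evaluators):
    """Attach new metrics to recorded runs without calling the task again."""
    enriched = []
    for run in runs:
        metrics = dict(run.get("metrics", {}))  # keep metrics from earlier passes
        for evaluator in evaluators:
            metrics[evaluator.__name__] = evaluator(run)
        enriched.append({**run, "metrics": metrics})
    return enriched


def exact_match(run):
    """Score 1.0 when the stored output equals the expected output."""
    return 1.0 if run["output"] == run["expected"] else 0.0


recorded = [
    {"output": "Hi there!", "expected": "Hi there!", "metrics": {"latency_grade": 1.0}},
    {"output": "Hello", "expected": "Hi there!"},
]
updated = add_metrics(recorded, [exact_match])
print([run["metrics"] for run in updated])
```

Because the function returns new records instead of mutating the inputs, a backend can decide separately how to persist them (rewrite a local file, resubmit to a server, and so on).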

Core Components

  • DatasetExample – Flexible dataset storage
  • TestCase – Execution plan objects that pair an example with a repetition_number before running
  • EvaluationContext – Rich evaluation context with tool call support
  • EvaluationMetric – Structured evaluation results
  • generate() / evaluate() – Core evaluation pipeline functions
  • ExperimentRunner / AsyncExperimentRunner – High-level orchestration with preview + repetition controls
  • Built-in evaluators for common evaluation tasks

Architecture

This package is designed to be standalone and framework-agnostic, focusing purely on evaluation logic without server dependencies.

Tracing & Instrumentation

cat-experiments ships OpenTelemetry helpers (install with pip install "cat-experiments[tracing]"; the quotes keep shells such as zsh from expanding the brackets) such as capture_agent_trace() and ExperimentTraceCapture, but they do not activate OpenInference instrumentors for you. Configure any instrumentation you need (for example openinference.instrumentation.openai.OpenAIInstrumentor().instrument()) before entering the capture context:

from openinference.instrumentation.openai import OpenAIInstrumentor
from cat.experiments.tracing import capture_experiment_trace

OpenAIInstrumentor().instrument()

with capture_experiment_trace(example_id="ex-1", experiment_id="exp-123") as (root_span, capture):
    ...

This keeps cat-experiments lightweight while ensuring clients stay in control of which SDKs are instrumented.

Enabling the OTEL run observer

Tracing is now wired through a generic observer plugin system. After installing the tracing extra, importing cat.experiments.tracing automatically registers the OTEL observer so tool calls and trace identifiers are captured for each run.

You can build your own observers by implementing cat.experiments.observers.RunObserver and calling register_observer().
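
The observer pattern behind this can be sketched generically. The block below deliberately avoids the real cat.experiments.observers module, because this README does not list the RunObserver hooks: RunObserverSketch, register, execute, and the hook names on_run_start / on_run_end are all hypothetical.

```python
from typing import Callable, Protocol


class RunObserverSketch(Protocol):
    """Hypothetical observer interface; the real RunObserver hooks may differ."""

    def on_run_start(self, run_id: str) -> None: ...

    def on_run_end(self, run_id: str, output: str) -> None: ...


_observers: list[RunObserverSketch] = []


def register(observer: RunObserverSketch) -> None:
    """Add an observer to the module-level registry."""
    _observers.append(observer)


def execute(run_id: str, task: Callable[[], str]) -> str:
    """Run a task, notifying every registered observer before and after."""
    for obs in _observers:
        obs.on_run_start(run_id)
    output = task()
    for obs in _observers:
        obs.on_run_end(run_id, output)
    return output


class EventLogObserver:
    """Collects events in memory, e.g. to forward them to a tracer later."""

    def __init__(self) -> None:
        self.events: list[tuple] = []

    def on_run_start(self, run_id: str) -> None:
        self.events.append(("start", run_id))

    def on_run_end(self, run_id: str, output: str) -> None:
        self.events.append(("end", run_id, output))


observer = EventLogObserver()
register(observer)
result = execute("run-1", lambda: "Hi there!")
print(observer.events)
```

When writing a real observer, check the RunObserver definition in the installed package for the actual method names and arguments.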
