Standalone evaluation engine for LLM applications

These details have not been verified by PyPI


cat-experiments

Backend-agnostic experiment runner for LLM applications that you can take to any server stack.

Cat Experiments

Most experiment frameworks are glued to a specific hosted platform, forcing you to swap libraries when you switch servers. cat-experiments keeps the core experiment loop (data model, runner, evaluators) identical whether you are running locally or wiring into Phoenix, CAT Cafe, or another backend. That gives teams a common starting point for new projects while still letting them plug into whichever server platform fits the deployment.

A flexible, DataFrame-compatible evaluation system that runs locally by default and plugs into common server infrastructures like CAT Cafe or Phoenix when you need them.

Features

Flexible Data Models: Support any dataset structure with dictionary-based input/output
Deterministic Preview Runs: Limit execution to an exact number of examples with preview_examples and preview_seed
Explicit Repetitions: Run each example multiple times and track repetition metadata end-to-end
Comprehensive Evaluators: Built-in evaluators for tool call correctness and more
Modern Python: Targets Python 3.12+ with modern typing features
Async Support: Full async/await support for evaluation pipelines
Tool Call Evaluation: Advanced matching algorithms for tool call correctness
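
The matching algorithms themselves are not described in this README, so the following is only a minimal sketch of one way tool call correctness could be scored: an expected call counts as matched when some actual call has the same name and identical arguments, regardless of order. The `ToolCall` shape (`name`, `args`) and the function `tool_call_score` are assumptions made for illustration, not part of the cat-experiments API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical stand-in for a recorded tool call."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def tool_call_score(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls that have an order-independent exact match."""
    if not expected:
        # Nothing was expected: perfect only if nothing was called.
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for want in expected:
        for got in remaining:
            if got.name == want.name and got.args == want.args:
                remaining.remove(got)  # each actual call satisfies one expectation
                matched += 1
                break
    return matched / len(expected)


expected = [
    ToolCall("search", {"q": "reset password"}),
    ToolCall("send_email", {"to": "user"}),
]
actual = [
    ToolCall("send_email", {"to": "user"}),
    ToolCall("search", {"q": "password"}),  # same tool, different arguments
]
score = tool_call_score(expected, actual)
print(score)  # 0.5: only send_email matches exactly
```

A stricter or looser matcher (ordered matching, partial argument overlap) would change only the inner comparison.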

Install

# from PyPI
pip install cat-experiments              # core package
pip install "cat-experiments[cat-cafe]"  # add extras for CAT Cafe examples
pip install "cat-experiments[phoenix]"   # add extras for Phoenix examples

Build and Run an Experiment

The flow mirrors the Phoenix “Run Experiments” tutorial: load data, write a task, attach evaluators, and run. Use preview_examples while iterating, then lift it for full runs.

from cat.experiments import (
    DatasetExample,
    ExperimentConfig,
    ExperimentRunner,
    EvaluationContext,
    EvaluationMetric,
    basic_tool_correctness_evaluator,
)

# 1) Shape your dataset
dataset = [
    DatasetExample(
        input={"question": "How do I reset my password?"},
        output={"answer": "Visit the reset page and follow the emailed verification link."},
        metadata={"category": "support"},
    ),
]

# 2) Implement the system under test (sync or async)
def task(example: DatasetExample) -> dict[str, str]:
    return {"answer": example.input["question"].upper()}

# 3) Add evaluators (use built-ins or custom)
def exact_match(context: EvaluationContext) -> EvaluationMetric:
    expected = context.output.get("answer")
    actual = context.actual_output.get("answer")
    score = float(expected == actual)
    return EvaluationMetric(
        name="exact_match",
        score=score,
        label="match" if score == 1 else "mismatch",
        metadata={"expected": expected, "actual": actual},
    )

# 4) Configure and run
runner = ExperimentRunner()  # swap in build_local_runner() or build_phoenix_runner() as needed
summary = runner.run(
    dataset=dataset,
    task=task,
    evaluators=[exact_match, basic_tool_correctness_evaluator],
    config=ExperimentConfig(
        name="Support Q&A smoke test",
        description="Walkthrough of the core experiment loop.",
        preview_examples=1,  # deterministic subset for quick debugging
        repetitions=2,       # run each example multiple times
    ),
)

print(summary.total_examples)           # => 2 example runs (1 example × 2 repetitions)
print(summary.average_scores["exact_match"])
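
How the runner picks its preview subset is not specified above. As a rough sketch, assuming a seeded `random.Random` sample stands in for `preview_examples` and `preview_seed`, the run plan for a configuration like the one above could be built as follows. `plan_runs` is a hypothetical helper, not a library function.

```python
import random
from itertools import product


def plan_runs(example_ids, preview_examples=None, preview_seed=0, repetitions=1):
    """Return (example_id, repetition_number) pairs for a hypothetical run plan."""
    ids = list(example_ids)
    if preview_examples is not None and preview_examples < len(ids):
        # A dedicated Random instance makes the subset depend only on the seed.
        ids = random.Random(preview_seed).sample(ids, preview_examples)
    # Repetition numbers start at 1 in this sketch.
    return list(product(ids, range(1, repetitions + 1)))


plan = plan_runs(
    ["ex-1", "ex-2", "ex-3"],
    preview_examples=1,
    preview_seed=42,
    repetitions=2,
)
print(len(plan))  # 2 runs: 1 example x 2 repetitions
```

Calling `plan_runs` again with the same seed yields the same subset, which is the property a deterministic preview run relies on.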

Lower-level APIs let you drive the pipeline with explicit (example, repetition) pairs:

from cat.experiments import TestCase, generate, evaluate

runs = [TestCase(example=dataset[0], repetition_number=1)]
contexts = generate(runs, task)
results = evaluate(contexts, [exact_match])
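
To make the two phases concrete, here is a minimal self-contained sketch of a generate-then-evaluate loop with per-metric averaging. It uses plain dictionaries and the hypothetical helpers `run_tasks`, `score_runs`, and `average_scores`; the real `generate()` and `evaluate()` work with the library's own types and may behave differently.

```python
from collections import defaultdict
from statistics import mean


def run_tasks(pairs, task):
    """Phase 1: call the task once per (example, repetition) pair."""
    return [
        {"example": example, "repetition": rep, "actual": task(example)}
        for example, rep in pairs
    ]


def score_runs(runs, evaluators):
    """Phase 2: apply every evaluator to every recorded run."""
    return [
        {**run, "scores": {name: fn(run) for name, fn in evaluators.items()}}
        for run in runs
    ]


def average_scores(results):
    """Average each metric across all runs."""
    buckets = defaultdict(list)
    for result in results:
        for name, score in result["scores"].items():
            buckets[name].append(score)
    return {name: mean(scores) for name, scores in buckets.items()}


example = {"input": {"question": "hi"}, "output": {"answer": "HI"}}
pairs = [(example, 1), (example, 2)]
runs = run_tasks(pairs, lambda ex: {"answer": ex["input"]["question"].upper()})
results = score_runs(
    runs,
    {"exact_match": lambda run: float(run["actual"] == run["example"]["output"])},
)
averages = average_scores(results)
print(averages)  # {'exact_match': 1.0}
```

Keeping the phases separate means stored task outputs can be scored again later without calling the task a second time.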

CAT Cafe Integration

examples/cat_cafe_experiment_example.py mirrors the Phoenix flow but targets CAT Cafe:

Connect with cat-cafe-client (env CAT_BASE_URL and optional auth)
Load or create a dataset (CAT_DATASET defaults to cat-experiments-support-demo)
Convert dataset rows to DatasetExample
Define the task + evaluators
Stream runs to CAT Cafe via CatCafeExperimentListener

export CAT_BASE_URL=http://localhost:8000
export CAT_DATASET=cat-experiments-support-demo
uv run examples/cat_cafe_experiment_example.py

To resume only unfinished work in CAT Cafe, use the evaluation/resume helpers inside the example script (or wire them yourself with CatCafeEvaluationCoordinator).

Phoenix Integration

examples/phoenix_experiment_example.py reproduces the Phoenix “Run Experiments” tutorial using cat-experiments while staying offline-friendly:

Connect with phoenix-client (env vars like PHOENIX_BASE_URL, PHOENIX_API_KEY)
Load or create a dataset (CAT_EVALS_DATASET defaults to support-ticket-demo)
Convert Phoenix examples to DatasetExample
Define the task + evaluators
Stream runs back to Phoenix via PhoenixExperimentListener

uv run examples/phoenix_experiment_example.py

# Resume unfinished Phoenix experiment without re-running completed work
uv run examples/phoenix_experiment_example.py --resume exp_123

The important hooks if you are wiring this yourself:

from cat.experiments.runner_builders import build_phoenix_runner
from cat.experiments.adapters.phoenix import PhoenixResumeCoordinator

runner = build_phoenix_runner(client=phoenix_client)
summary = runner.run(dataset=examples, task=task, evaluators=[exact_match], config=config)

coordinator = PhoenixResumeCoordinator(phoenix_client)
coordinator.resume_task_runs(experiment_id="exp_123", task=task, evaluators=[exact_match], runner=runner)

Runner Builders

If you prefer not to wire listeners manually, use the builder helpers:

from cat.experiments import (
    build_local_runner,
    build_phoenix_runner,
    build_cat_cafe_runner,
)

local_runner = build_local_runner()
cat_runner = build_cat_cafe_runner()
phoenix_runner = build_phoenix_runner()

Each factory returns an ExperimentRunner with the matching adapter configured plus the local storage adapter, so you can immediately call runner.run(...) without additional plumbing.

Resume Cached Experiments

When runs are cached locally, you can resume unfinished repetitions without touching Phoenix or CAT Cafe:

from cat.experiments.adapters import LocalCacheResumeCoordinator

coordinator = LocalCacheResumeCoordinator()
plan = coordinator.build_task_resume_plan("exp_123")

if plan.has_work:
    coordinator.resume_task_runs(
        experiment_id="exp_123",
        task=test_function,
        evaluators=[my_evaluator],
    )

The local storage adapter captures config.json, examples.jsonl, and runs.jsonl per experiment so the resume coordinator can replay only the pending (example, repetition) pairs.
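
The on-disk schema is not documented here, so the following is only a sketch of the resume idea: read a `runs.jsonl`-style file, collect the (example, repetition) pairs already recorded, and subtract them from the full plan. The field names `example_id` and `repetition_number` and the helper `pending_pairs` are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def pending_pairs(runs_path, example_ids, repetitions):
    """Return planned (example_id, repetition) pairs with no recorded run."""
    done = set()
    if runs_path.exists():
        with runs_path.open() as handle:
            for line in handle:
                if line.strip():  # tolerate blank lines
                    record = json.loads(line)
                    done.add((record["example_id"], record["repetition_number"]))
    planned = [(ex, rep) for ex in example_ids for rep in range(1, repetitions + 1)]
    return [pair for pair in planned if pair not in done]


with tempfile.TemporaryDirectory() as tmp:
    runs_file = Path(tmp) / "runs.jsonl"
    # Simulate an interrupted experiment: only repetition 1 was recorded.
    runs_file.write_text(
        json.dumps({"example_id": "ex-1", "repetition_number": 1}) + "\n"
        + json.dumps({"example_id": "ex-2", "repetition_number": 1}) + "\n"
    )
    todo = pending_pairs(runs_file, ["ex-1", "ex-2"], repetitions=2)

print(todo)  # [('ex-1', 2), ('ex-2', 2)]
```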

For an end-to-end walkthrough that stays entirely on disk, run the local storage example:

uv run examples/local_storage_evaluator_example.py

It writes runs via LocalStorageExperimentListener, then uses LocalEvaluationCoordinator plus ExperimentRunner.rerun_evaluators() to append a new evaluator without re-running the task phase.

Re-run Evaluators Later

To mirror Phoenix's "persist first, evaluate later" flow, the local cache, CAT Cafe, and Phoenix adapters each expose an evaluation coordinator that rehydrates recorded runs before executing new evaluators.

from cat.experiments import ExperimentRunner
from cat.experiments.adapters import (
    LocalEvaluationCoordinator,
    CatCafeEvaluationCoordinator,
    PhoenixEvaluationCoordinator,
)
from cat.cafe.client import CATCafeClient
from phoenix.client import Client as PhoenixClient

local_eval = LocalEvaluationCoordinator()
local_eval.run_evaluators(
    experiment_id="exp_123",
    evaluators=[accuracy_evaluator, safety_check],
)
local_plan = local_eval.fetch_experiment("exp_123")  # inspect cached runs without re-running

cat_eval = CatCafeEvaluationCoordinator(CATCafeClient())
cat_eval.run_evaluators(
    experiment_id="exp_456",
    evaluators=[hallucination_score],
)

plan = cat_eval.fetch_experiment(experiment_id="exp_456")  # plan.results holds recorded runs

runner = ExperimentRunner()
runner.rerun_evaluators(
    experiment_id="exp_123",
    evaluators=[latency_grade],
    backend=local_eval,
)

phoenix_eval = PhoenixEvaluationCoordinator(PhoenixClient())
phoenix_eval.run_evaluators(
    experiment_id="exp_789",
    evaluators=[cost_score],
)
fetched = phoenix_eval.fetch_experiment("exp_789")  # includes config/examples/results/summary

LocalEvaluationCoordinator updates the cached runs.jsonl with the new metrics, while CatCafeEvaluationCoordinator automatically resubmits the enriched results to CAT Cafe so the UI can display the added evaluators without rerunning any tasks. ExperimentRunner.rerun_evaluators centralizes the evaluate-only flow so you can plug in any backend that knows how to fetch/persist runs.
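
As an illustration of the evaluate-only flow, the sketch below adds a new metric to already recorded runs without calling the task again. The record layout (`actual_output`, `metrics`) and the helper `add_metrics` are hypothetical, chosen only to show the idea; the coordinators' real storage format may differ.

```python
import copy


def add_metrics(recorded_runs, evaluators):
    """Return copies of recorded runs with extra metrics merged in.

    The task is never invoked: evaluators only see the stored output.
    """
    enriched = []
    for run in recorded_runs:
        updated = copy.deepcopy(run)  # leave the caller's records untouched
        for name, evaluator in evaluators.items():
            updated["metrics"][name] = evaluator(updated["actual_output"])
        enriched.append(updated)
    return enriched


recorded = [
    {"example_id": "ex-1", "actual_output": {"answer": "ok"}, "metrics": {"exact_match": 1.0}},
    {"example_id": "ex-2", "actual_output": {"answer": ""}, "metrics": {"exact_match": 0.0}},
]
enriched = add_metrics(recorded, {"non_empty": lambda out: float(bool(out["answer"]))})
print(enriched[1]["metrics"])  # {'exact_match': 0.0, 'non_empty': 0.0}
```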

Core Components

DatasetExample – Flexible dataset storage
TestCase – Execution plan objects that pair an example with a repetition_number before running
EvaluationContext – Rich evaluation context with tool call support
EvaluationMetric – Structured evaluation results
generate() / evaluate() – Core evaluation pipeline functions
ExperimentRunner / AsyncExperimentRunner – High-level orchestration with preview + repetition controls
Built-in evaluators for common evaluation tasks

Architecture

This package is designed to be standalone and framework-agnostic, focusing purely on evaluation logic without server dependencies.

Tracing & Instrumentation

cat-experiments ships OpenTelemetry helpers (install with pip install "cat-experiments[tracing]") such as capture_agent_trace() and ExperimentTraceCapture, but they do not activate OpenInference instrumentors for you. Configure any instrumentation you need (for example openinference.instrumentation.openai.OpenAIInstrumentor().instrument()) before entering the capture context:

from openinference.instrumentation.openai import OpenAIInstrumentor
from cat.experiments.tracing import capture_experiment_trace

OpenAIInstrumentor().instrument()

with capture_experiment_trace(example_id="ex-1", experiment_id="exp-123") as (root_span, capture):
    ...

This keeps cat-experiments lightweight while ensuring clients stay in control of which SDKs are instrumented.

Enabling the OTEL run observer

Tracing is now wired through a generic observer plugin system. After installing the tracing extra, importing cat.experiments.tracing automatically registers the OTEL observer so tool calls and trace identifiers are captured for each run.

You can build your own observers by implementing cat.experiments.observers.RunObserver and calling register_observer().
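
The `RunObserver` interface is not shown in this README, so the following only sketches the general observer-registry pattern that this section describes. Everything in it is a stand-in defined locally: the `Observer` protocol, its `on_run_complete` method, and the module-level `register` and `notify` functions are hypothetical and do not mirror the library's actual signatures.

```python
from typing import Any, Protocol


class Observer(Protocol):
    """Hypothetical observer interface with a single callback."""

    def on_run_complete(self, run: dict[str, Any]) -> None: ...


_observers: list[Observer] = []


def register(observer: Observer) -> None:
    """Add an observer to the module-level registry."""
    _observers.append(observer)


def notify(run: dict[str, Any]) -> None:
    """Fan a finished run out to every registered observer."""
    for observer in _observers:
        observer.on_run_complete(run)


class TraceIdCollector:
    """Example observer that records the trace id of each run."""

    def __init__(self) -> None:
        self.trace_ids: list[str] = []

    def on_run_complete(self, run: dict[str, Any]) -> None:
        self.trace_ids.append(run["trace_id"])


collector = TraceIdCollector()
register(collector)
notify({"example_id": "ex-1", "trace_id": "abc123"})
print(collector.trace_ids)  # ['abc123']
```

Check the definition of cat.experiments.observers.RunObserver for the callbacks a real observer must implement.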

Project details


Release history

0.0.12

Jan 19, 2026

0.0.11

Jan 15, 2026

0.0.10

Jan 14, 2026

0.0.9

Jan 11, 2026

0.0.8

Jan 9, 2026

0.0.7

Jan 9, 2026

0.0.6

Jan 7, 2026

0.0.5

Jan 6, 2026

0.0.4

Jan 5, 2026

This version

0.0.3

Dec 22, 2025

0.0.2

Nov 20, 2025

0.0.1

Nov 19, 2025

Download files


Source Distribution

cat_experiments-0.0.3.tar.gz (192.6 kB view details)

Uploaded Dec 22, 2025 Source

Built Distribution


cat_experiments-0.0.3-py3-none-any.whl (71.9 kB view details)

Uploaded Dec 22, 2025 Python 3

File details

Details for the file cat_experiments-0.0.3.tar.gz.

File metadata

Download URL: cat_experiments-0.0.3.tar.gz
Upload date: Dec 22, 2025
Size: 192.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cat_experiments-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`91af29dfa8655ab1f7b7a21c651fc202962e7cc9ac4e81ba74b26601c2caf8e1`
MD5	`089d691470f09b2a05e8abad93483325`
BLAKE2b-256	`d805b509a6e0d6cbb6f576f512a9cf82cfecae79fe9eb380a23f0e6f93a31768`


File details

Details for the file cat_experiments-0.0.3-py3-none-any.whl.

File metadata

Download URL: cat_experiments-0.0.3-py3-none-any.whl
Upload date: Dec 22, 2025
Size: 71.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cat_experiments-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6294480b5b5559928b867b5a192ca528efe36793662596b5b10e484307175ea4`
MD5	`3dd21ae3532095a44f218d4c2e9da883`
BLAKE2b-256	`3686792385d31dab7620c846572502431c834e98a9b8d9bbcfad909995cad402`


cat-experiments 0.0.3
