Standalone evaluation engine for LLM applications

These details have not been verified by PyPI

Project description

cat-experiments

Agnostic experiment runner for LLM applications that you can take to any server stack.

Cat Experiments

Most experiment frameworks are glued to a specific hosted platform, forcing you to swap libraries when you switch servers. cat-experiments keeps the core experiment loop (data model, runner, evaluators) identical whether you are running locally or wiring into Phoenix, CAT Cafe, or another backend. That gives teams a common starting point for new projects while still letting them plug into whichever server platform fits the deployment.

Features

CLI-First Design: Define experiments as Python files and run them with cat-experiments run
Flexible Data Models: Support any dataset structure with dictionary-based input/output
Deterministic Preview Runs: Limit execution to an exact number of examples with --dry-run
Explicit Repetitions: Run each example multiple times with --repetitions
Comprehensive Evaluators: Built-in evaluators for tool call correctness and more
Modern Python: Targets Python 3.12+ with modern typing features
Async Support: Full async/await support for evaluation pipelines
Tool Call Evaluation: Advanced matching algorithms for tool call correctness

Install

# from PyPI
pip install cat-experiments              # core package
pip install "cat-experiments[cat-cafe]"  # add extras for CAT Cafe
pip install "cat-experiments[phoenix]"   # add extras for Phoenix

Quick Start

Create an experiment file:

# my_experiment.py
from cat.experiments.protocol import TaskInput, TaskOutput, EvalInput, EvalOutput
from cat.experiments.sdk import task, evaluator

@task
async def my_task(input: TaskInput) -> TaskOutput:
    """The system under test."""
    question = input.input.get("question", "")
    # In a real experiment, you'd call your LLM here
    return TaskOutput(output={"answer": question.upper()})

@evaluator
def exact_match(input: EvalInput) -> EvalOutput:
    """Check if actual matches expected."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""
    score = 1.0 if expected == actual else 0.0
    return EvalOutput(
        score=score,
        label="match" if score == 1.0 else "mismatch",
    )

Run it:

# With a local JSON/JSONL dataset
cat-experiments run my_experiment.py --dataset data.jsonl

# With dry-run mode (run only N examples, no persistence)
cat-experiments run my_experiment.py --dataset data.jsonl --dry-run 5

# With multiple repetitions
cat-experiments run my_experiment.py --dataset data.jsonl --repetitions 3

# With parallel workers
cat-experiments run my_experiment.py --dataset data.jsonl --max-workers 10

# Stream results to Phoenix
cat-experiments run my_experiment.py --dataset data.jsonl --storage phoenix

# Stream results to CAT Cafe
cat-experiments run my_experiment.py --dataset data.jsonl --storage cat-cafe

Dataset Format

Datasets are JSON or JSONL files with examples containing input, output, and optional metadata:

{"input": {"question": "How do I reset my password?"}, "output": {"answer": "Visit settings."}, "metadata": {"category": "support"}}
{"input": {"question": "Can I upgrade mid-cycle?"}, "output": {"answer": "Yes, prorated."}, "metadata": {"category": "billing"}}

Writing Experiments

Task Function

The @task decorator marks your system under test. It receives a TaskInput with the example data and returns the output:

from cat.experiments.protocol import TaskInput, TaskOutput
from cat.experiments.sdk import task

@task
async def my_llm_task(input: TaskInput) -> TaskOutput:
    """Call your LLM or agent here."""
    question = input.input.get("question", "")
    params = input.params  # Access experiment params (e.g., model name)
    
    # Your LLM call here
    response = await call_my_llm(question, model=params.get("model"))
    
    return TaskOutput(
        output={"answer": response},
        metadata={"tokens": 150},  # Optional metadata
    )

# Module-level config (optional)
params = {"model": "gpt-4o-mini"}
name = "My Experiment"

Evaluator Functions

The @evaluator decorator marks evaluation functions. They receive an EvalInput with both expected and actual outputs:

from cat.experiments.protocol import EvalInput, EvalOutput
from cat.experiments.sdk import evaluator

@evaluator
def accuracy(input: EvalInput) -> EvalOutput:
    """Check if the answer is correct."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""
    
    score = 1.0 if expected.lower() == actual.lower() else 0.0
    return EvalOutput(score=score, label="correct" if score else "incorrect")

@evaluator 
async def llm_judge(input: EvalInput) -> EvalOutput:
    """Use an LLM to judge quality (async evaluators supported)."""
    # Your LLM judge call here
    judgment = await call_judge_llm(input.actual_output)
    return EvalOutput(score=judgment.score, metadata={"reasoning": judgment.reason})

Storage Backends

Local Storage (default)

Results are stored locally:

cat-experiments run my_experiment.py --dataset data.jsonl

Phoenix Integration

Stream results to Phoenix:

# Set Phoenix connection (or use PHOENIX_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage phoenix \
    --storage-url http://localhost:6006

CAT Cafe Integration

Stream results to CAT Cafe:

# Set CAT Cafe connection (or use CAT_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage cat-cafe \
    --storage-url http://localhost:8000

CLI Options

cat-experiments run <experiment.py> [OPTIONS]

Dataset Options:
  --dataset PATH          Local dataset file (JSON/JSONL)
  --dataset-name NAME     Remote dataset name
  --dataset-id ID         Remote dataset ID

Storage Options:
  --storage TYPE          Storage backend: local, phoenix, cat-cafe (default: local)
  --storage-url URL       URL for remote storage

Experiment Options:
  --param KEY=VALUE       Override params (can be repeated)
  --max-workers N         Parallel workers (default: 5)
  --repetitions N         Repetitions per example (default: 1)
  --dry-run [N]           Run N examples without persistence (default: 1)
  --resume EXPERIMENT_ID  Resume a previous experiment

Output Options:
  --output FORMAT         Output format: text, json (default: text)

Built-in Tool Call Evaluators

The package includes tool call correctness evaluators for agent evaluation:

from cat.experiments.sdk import match_tool_calls, ToolCallMatch, ToolCallMatchingResult

# Match expected vs actual tool calls
result: ToolCallMatchingResult = match_tool_calls(
    expected_calls=[{"name": "search", "arguments": {"query": "weather"}}],
    actual_calls=[{"name": "search", "arguments": {"query": "weather today"}}],
)

print(result.precision, result.recall, result.f1_score)

Architecture

┌─────────────────────────────────────────────────────────┐
│                    CLI / Orchestrator                   │
│  - Dataset loading                                      │
│  - Parallel execution (windowed dispatch)               │
│  - Storage backends (Local, Phoenix, Cat Cafe)          │
│  - Progress reporting                                   │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      Executor                           │
│  - Loads experiment file                                │
│  - Runs @task and @evaluator functions                  │
│  - Returns results                                      │
└─────────────────────────────────────────────────────────┘

The package uses an executor protocol that separates orchestration from execution, enabling future support for:

Multi-language experiments (TypeScript, etc.)
Distributed execution
Custom execution backends

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.0.12

Jan 19, 2026

0.0.11

Jan 15, 2026

0.0.10

Jan 14, 2026

0.0.9

Jan 11, 2026

0.0.8

Jan 9, 2026

0.0.7

Jan 9, 2026

0.0.6

Jan 7, 2026

This version

0.0.5

Jan 6, 2026

0.0.4

Jan 5, 2026

0.0.3

Dec 22, 2025

0.0.2

Nov 20, 2025

0.0.1

Nov 19, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cat_experiments-0.0.5.tar.gz (196.6 kB view details)

Uploaded Jan 6, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cat_experiments-0.0.5-py3-none-any.whl (69.4 kB view details)

Uploaded Jan 6, 2026 Python 3

File details

Details for the file cat_experiments-0.0.5.tar.gz.

File metadata

Download URL: cat_experiments-0.0.5.tar.gz
Upload date: Jan 6, 2026
Size: 196.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cat_experiments-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`32968d6d1d0c8ec0cf2f693a5a29fcaea8db66a22436cee9b62f65790770ef88`
MD5	`7d7d135ee0d2f52df3fb92f46dee861c`
BLAKE2b-256	`44a5c897bc88978eab29d8ceba8236e21ef8a8da576f405300aa9aef72640d8c`

See more details on using hashes here.

File details

Details for the file cat_experiments-0.0.5-py3-none-any.whl.

File metadata

Download URL: cat_experiments-0.0.5-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 69.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.21 {"installer":{"name":"uv","version":"0.9.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cat_experiments-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`71ce1a024c0313dc0dc7941ce0065aa555c8cdf2d7e24629d7d122616435dfd0`
MD5	`79cd8a512b7c7147ac4162de9ef19306`
BLAKE2b-256	`5db91f19a6bfd51b7abc03426b22c36ead6ed8b9da5f7f59f0acf49be8439a09`

See more details on using hashes here.

cat-experiments 0.0.5

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

cat-experiments

Features

Install

Quick Start

Dataset Format

Writing Experiments

Task Function

Evaluator Functions

Storage Backends

Local Storage (default)

Phoenix Integration

CAT Cafe Integration

CLI Options

Built-in Tool Call Evaluators

Architecture

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes