Standalone evaluation engine for LLM applications

cat-experiments

Backend-agnostic experiment runner for LLM applications that you can take to any server stack.

Cat Experiments

Most experiment frameworks are glued to a specific hosted platform, forcing you to swap libraries when you switch servers. cat-experiments keeps the core experiment loop (data model, runner, evaluators) identical whether you are running locally or wiring into Phoenix, CAT Cafe, or another backend. That gives teams a common starting point for new projects while still letting them plug into whichever server platform fits the deployment.

Features

  • CLI-First Design: Define experiments as Python files and run them with cat-experiments run
  • Flexible Data Models: Support any dataset structure with dictionary-based input/output
  • Deterministic Preview Runs: Limit execution to an exact number of examples with --dry-run
  • Explicit Repetitions: Run each example multiple times with --repetitions
  • Comprehensive Evaluators: Built-in evaluators for tool call correctness and more
  • Modern Python: Targets Python 3.12+ with modern typing features
  • Async Support: Full async/await support for evaluation pipelines
  • Tool Call Evaluation: Advanced matching algorithms for tool call correctness

Install

# from PyPI
pip install cat-experiments              # core package
pip install "cat-experiments[cat-cafe]"  # add extras for CAT Cafe
pip install "cat-experiments[phoenix]"   # add extras for Phoenix

Quick Start

Create an experiment file:

# my_experiment.py
from cat.experiments.protocol import TaskInput, TaskOutput, EvalInput, EvalOutput
from cat.experiments.sdk import task, evaluator

@task
async def my_task(input: TaskInput) -> TaskOutput:
    """The system under test."""
    question = input.input.get("question", "")
    # In a real experiment, you'd call your LLM here
    return TaskOutput(output={"answer": question.upper()})

@evaluator
def exact_match(input: EvalInput) -> EvalOutput:
    """Check if actual matches expected."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""
    score = 1.0 if expected == actual else 0.0
    return EvalOutput(
        score=score,
        label="match" if score == 1.0 else "mismatch",
    )

Run it:

# With a local JSON/JSONL dataset
cat-experiments run my_experiment.py --dataset data.jsonl

# With dry-run mode (run only N examples, no persistence)
cat-experiments run my_experiment.py --dataset data.jsonl --dry-run 5

# With multiple repetitions
cat-experiments run my_experiment.py --dataset data.jsonl --repetitions 3

# With parallel workers
cat-experiments run my_experiment.py --dataset data.jsonl --max-workers 10

# Stream results to Phoenix
cat-experiments run my_experiment.py --dataset data.jsonl --storage phoenix

# Stream results to CAT Cafe
cat-experiments run my_experiment.py --dataset data.jsonl --storage cat-cafe

Dataset Format

Datasets are JSON or JSONL files. Each example contains an input, an expected output, and optional metadata:

{"input": {"question": "How do I reset my password?"}, "output": {"answer": "Visit settings."}, "metadata": {"category": "support"}}
{"input": {"question": "Can I upgrade mid-cycle?"}, "output": {"answer": "Yes, prorated."}, "metadata": {"category": "billing"}}

Writing Experiments

Task Function

The @task decorator marks your system under test. It receives a TaskInput with the example data and returns the output:

from cat.experiments.protocol import TaskInput, TaskOutput
from cat.experiments.sdk import task

@task
async def my_llm_task(input: TaskInput) -> TaskOutput:
    """Call your LLM or agent here."""
    question = input.input.get("question", "")
    params = input.params  # Access experiment params (e.g., model name)

    # Your LLM call here. call_my_llm is a placeholder for your own async client function.
    response = await call_my_llm(question, model=params.get("model"))

    return TaskOutput(
        output={"answer": response},
        metadata={"tokens": 150},  # Optional metadata
    )

# Module-level config (optional)
params = {"model": "gpt-4o-mini"}
name = "My Experiment"

Evaluator Functions

The @evaluator decorator marks evaluation functions. They receive an EvalInput with both expected and actual outputs:

from cat.experiments.protocol import EvalInput, EvalOutput
from cat.experiments.sdk import evaluator

@evaluator
def accuracy(input: EvalInput) -> EvalOutput:
    """Check if the answer is correct."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""

    score = 1.0 if expected.lower() == actual.lower() else 0.0
    return EvalOutput(score=score, label="correct" if score else "incorrect")

@evaluator
async def llm_judge(input: EvalInput) -> EvalOutput:
    """Use an LLM to judge quality (async evaluators supported)."""
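    # Your LLM judge call here. call_judge_llm is a placeholder for your own async
    # function; it is assumed to return an object with .score and .reason attributes.
    judgment = await call_judge_llm(input.actual_output)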
    return EvalOutput(score=judgment.score, metadata={"reasoning": judgment.reason})
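
Exact or case-insensitive matching can be too strict for free-text answers. The sketch below shows one softer scoring rule, token-overlap F1, as plain Python with no dependency on the package. The names token_f1 and label_for and the 0.5 threshold are hypothetical. To use it in an experiment, you would call it from an @evaluator function and pass the result to EvalOutput(score=..., label=...) as in the examples above.

```python
from collections import Counter


def token_f1(expected: str, actual: str) -> float:
    """Token-overlap F1 between two strings, case-insensitive."""
    expected_tokens = expected.lower().split()
    actual_tokens = actual.lower().split()
    if not expected_tokens or not actual_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return 1.0 if expected_tokens == actual_tokens else 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(expected_tokens) & Counter(actual_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual_tokens)
    recall = overlap / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)


def label_for(score: float, threshold: float = 0.5) -> str:
    """Map a score to a label; the threshold is an arbitrary illustrative choice."""
    return "correct" if score >= threshold else "incorrect"


score = token_f1("Visit settings.", "visit settings.")
print(score, label_for(score))
```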

Storage Backends

Local Storage (default)

Results are stored locally:

cat-experiments run my_experiment.py --dataset data.jsonl

Phoenix Integration

Stream results to Phoenix:

# Set Phoenix connection (or use PHOENIX_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage phoenix \
    --storage-url http://localhost:6006

CAT Cafe Integration

Stream results to CAT Cafe:

# Set CAT Cafe connection (or use CAT_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage cat-cafe \
    --storage-url http://localhost:8000

CLI Options

cat-experiments run <experiment.py> [OPTIONS]

Dataset Options:
  --dataset PATH          Local dataset file (JSON/JSONL)
  --dataset-name NAME     Remote dataset name
  --dataset-id ID         Remote dataset ID

Storage Options:
  --storage TYPE          Storage backend: local, phoenix, cat-cafe (default: local)
  --storage-url URL       URL for remote storage

Experiment Options:
  --param KEY=VALUE       Override params (can be repeated)
  --max-workers N         Parallel workers (default: 5)
  --repetitions N         Repetitions per example (default: 1)
  --dry-run [N]           Run N examples without persistence (default: 1)
  --resume EXPERIMENT_ID  Resume a previous experiment

Output Options:
  --output FORMAT         Output format: text, json (default: text)
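
To illustrate what repeated --param KEY=VALUE flags express, here is a sketch of one plausible way to merge such overrides over the module-level params dict from an experiment file. It is not the package's parser: parse_param_overrides is a hypothetical helper, the override value my-other-model is made up, and keeping values as strings is an assumption, since type coercion is not described here.

```python
import argparse


def parse_param_overrides(pairs: list[str]) -> dict[str, str]:
    """Turn ["k=v", ...] into a dict, splitting on the first "=" only."""
    overrides: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        overrides[key] = value  # later flags win over earlier ones
    return overrides


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of the flag into a list.
parser.add_argument("--param", action="append", default=[], metavar="KEY=VALUE")
args = parser.parse_args(["--param", "model=my-other-model", "--param", "temperature=0.2"])

module_params = {"model": "gpt-4o-mini"}  # as defined in the experiment file
effective = {**module_params, **parse_param_overrides(args.param)}
print(effective)
```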

Built-in Tool Call Evaluators

The package includes tool call correctness evaluators for agent evaluation:

from cat.experiments.sdk import match_tool_calls, ToolCallMatch, ToolCallMatchingResult

# Match expected vs actual tool calls
result: ToolCallMatchingResult = match_tool_calls(
    expected_calls=[{"name": "search", "arguments": {"query": "weather"}}],
    actual_calls=[{"name": "search", "arguments": {"query": "weather today"}}],
)

print(result.precision, result.recall, result.f1_score)
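
For intuition about the three metrics, the sketch below computes precision, recall, and F1 under the simplest possible rule: a call matches only if its name and arguments are identical. This is an illustration, not the package's matching algorithm, which is described as more advanced. The function exact_match_metrics is hypothetical.

```python
import json
from collections import Counter


def _key(call: dict) -> str:
    """Canonical string for a tool call, so argument key order does not matter."""
    canonical = {"name": call["name"], "arguments": call.get("arguments", {})}
    return json.dumps(canonical, sort_keys=True)


def exact_match_metrics(expected_calls: list[dict], actual_calls: list[dict]) -> dict[str, float]:
    """Precision, recall, and F1 where only identical calls count as matches."""
    expected = Counter(_key(c) for c in expected_calls)
    actual = Counter(_key(c) for c in actual_calls)
    matched = sum((expected & actual).values())
    precision = matched / sum(actual.values()) if actual else 0.0
    recall = matched / sum(expected.values()) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = exact_match_metrics(
    expected_calls=[{"name": "search", "arguments": {"query": "weather"}}],
    actual_calls=[{"name": "search", "arguments": {"query": "weather today"}}],
)
print(metrics)
```

Under this strict rule the differing query strings count as a complete miss, which is the kind of case a more tolerant matcher is meant to handle.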

Architecture

┌─────────────────────────────────────────────────────────┐
│                    CLI / Orchestrator                   │
│  - Dataset loading                                      │
│  - Parallel execution (windowed dispatch)               │
│  - Storage backends (Local, Phoenix, CAT Cafe)          │
│  - Progress reporting                                   │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      Executor                           │
│  - Loads experiment file                                │
│  - Runs @task and @evaluator functions                  │
│  - Returns results                                      │
└─────────────────────────────────────────────────────────┘

The package uses an executor protocol that separates orchestration from execution, enabling future support for:

  • Multi-language experiments (TypeScript, etc.)
  • Distributed execution
  • Custom execution backends
