Standalone evaluation engine for LLM applications
cat-experiments
Backend-agnostic experiment runner for LLM applications that you can take to any server stack.
Most experiment frameworks are glued to a specific hosted platform, forcing you to swap libraries when you switch servers. cat-experiments keeps the core experiment loop (data model, runner, evaluators) identical whether you are running locally or wiring into Phoenix, CAT Cafe, or another backend. That gives teams a common starting point for new projects while still letting them plug into whichever server platform fits the deployment.
Features
- CLI-First Design: Define experiments as Python files and run them with cat-experiments run
- Flexible Data Models: Support any dataset structure with dictionary-based input/output
- Deterministic Preview Runs: Limit execution to an exact number of examples with --dry-run
- Explicit Repetitions: Run each example multiple times with --repetitions
- Comprehensive Evaluators: Built-in evaluators for tool call correctness and more
- Modern Python: Targets Python 3.12+ with modern typing features
- Async Support: Full async/await support for evaluation pipelines
- Tool Call Evaluation: Advanced matching algorithms for tool call correctness
Install
# from PyPI
pip install cat-experiments # core package
pip install "cat-experiments[cat-cafe]" # add extras for CAT Cafe
pip install "cat-experiments[phoenix]" # add extras for Phoenix
Quick Start
Create an experiment file:
# my_experiment.py
from cat.experiments.protocol import TaskInput, TaskOutput, EvalInput, EvalOutput
from cat.experiments.sdk import task, evaluator


@task
async def my_task(input: TaskInput) -> TaskOutput:
    """The system under test."""
    question = input.input.get("question", "")
    # In a real experiment, you'd call your LLM here
    return TaskOutput(output={"answer": question.upper()})


@evaluator
def exact_match(input: EvalInput) -> EvalOutput:
    """Check if actual matches expected."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""
    score = 1.0 if expected == actual else 0.0
    return EvalOutput(
        score=score,
        label="match" if score == 1.0 else "mismatch",
    )
Run it:
# With a local JSON/JSONL dataset
cat-experiments run my_experiment.py --dataset data.jsonl
# With dry-run mode (run only N examples, no persistence)
cat-experiments run my_experiment.py --dataset data.jsonl --dry-run 5
# With multiple repetitions
cat-experiments run my_experiment.py --dataset data.jsonl --repetitions 3
# With parallel workers
cat-experiments run my_experiment.py --dataset data.jsonl --max-workers 10
# Stream results to Phoenix
cat-experiments run my_experiment.py --dataset data.jsonl --storage phoenix
# Stream results to CAT Cafe
cat-experiments run my_experiment.py --dataset data.jsonl --storage cat-cafe
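The text above does not spell out how --dry-run and --repetitions interact when both are given. The sketch below shows one plausible reading, assuming dry-run keeps only the first N examples and each kept example is then repeated. `plan_runs` and its return shape are hypothetical and not part of cat-experiments.

```python
from itertools import islice


def plan_runs(examples, repetitions=1, dry_run=None):
    """Return (example_index, repetition) pairs in execution order.

    Assumption: dry_run keeps only the first N examples, and every kept
    example is run `repetitions` times. The real CLI may behave differently.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be >= 1")
    if dry_run is not None and dry_run < 1:
        raise ValueError("dry_run must be >= 1")
    # islice(..., None) keeps everything, so dry_run=None means a full run
    kept = list(islice(enumerate(examples), dry_run))
    return [(idx, rep) for idx, _ in kept for rep in range(1, repetitions + 1)]


dataset = [{"input": {"question": f"q{i}"}} for i in range(10)]
plan = plan_runs(dataset, repetitions=3, dry_run=2)
print(plan)
```

Under this reading, a dry run of 2 examples with 3 repetitions produces 6 task invocations.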
Dataset Format
Datasets are JSON or JSONL files with examples containing input, output, and optional metadata:
{"input": {"question": "How do I reset my password?"}, "output": {"answer": "Visit settings."}, "metadata": {"category": "support"}}
{"input": {"question": "Can I upgrade mid-cycle?"}, "output": {"answer": "Yes, prorated."}, "metadata": {"category": "billing"}}
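Checking a dataset before a long run can catch malformed lines early. The following is a minimal sketch using only the standard library; it assumes `input` and `output` are required JSON objects and `metadata` is optional, which is a stricter reading of the format than the package may enforce. `parse_jsonl` and `check_example` are hypothetical helpers.

```python
import json

REQUIRED_KEYS = ("input", "output")  # assumed mandatory; metadata is optional


def check_example(example, line_no):
    """Raise ValueError if an example lacks the documented keys."""
    if not isinstance(example, dict):
        raise ValueError(f"line {line_no}: expected a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(example.get(key), dict):
            raise ValueError(f"line {line_no}: '{key}' must be an object")
    if not isinstance(example.get("metadata", {}), dict):
        raise ValueError(f"line {line_no}: 'metadata' must be an object")


def parse_jsonl(text):
    """Parse JSONL text into a list of example dicts, skipping blank lines."""
    examples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        example = json.loads(line)
        check_example(example, line_no)
        examples.append(example)
    return examples


sample = "\n".join([
    '{"input": {"question": "How do I reset my password?"}, '
    '"output": {"answer": "Visit settings."}, "metadata": {"category": "support"}}',
    '{"input": {"question": "Can I upgrade mid-cycle?"}, "output": {"answer": "Yes, prorated."}}',
])
examples = parse_jsonl(sample)
print(len(examples), examples[0]["metadata"]["category"])
```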
Writing Experiments
Task Function
The @task decorator marks your system under test. It receives a TaskInput with the example data and returns the output:
from cat.experiments.protocol import TaskInput, TaskOutput
from cat.experiments.sdk import task


@task
async def my_llm_task(input: TaskInput) -> TaskOutput:
    """Call your LLM or agent here."""
    question = input.input.get("question", "")
    params = input.params  # Access experiment params (e.g., model name)

    # Your LLM call here
    response = await call_my_llm(question, model=params.get("model"))

    return TaskOutput(
        output={"answer": response},
        metadata={"tokens": 150},  # Optional metadata
    )


# Module-level config (optional)
params = {"model": "gpt-4o-mini"}
name = "My Experiment"
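`call_my_llm` in the example above is a placeholder for your own client code, not something the package provides. To exercise a task offline, a stand-in along these lines could be used; the signature and the canned reply are assumptions made for illustration.

```python
import asyncio
from typing import Optional


async def call_my_llm(question: str, model: Optional[str] = None) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return f"[{model or 'default-model'}] echo: {question}"


reply = asyncio.run(call_my_llm("How do I reset my password?", model="gpt-4o-mini"))
print(reply)
```

Swapping the stub for a real client later should only require keeping the same async signature.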
Evaluator Functions
The @evaluator decorator marks evaluation functions. They receive an EvalInput with both expected and actual outputs:
from cat.experiments.protocol import EvalInput, EvalOutput
from cat.experiments.sdk import evaluator


@evaluator
def accuracy(input: EvalInput) -> EvalOutput:
    """Check if the answer is correct."""
    expected = input.expected_output.get("answer", "") if input.expected_output else ""
    actual = input.actual_output.get("answer", "") if input.actual_output else ""
    score = 1.0 if expected.lower() == actual.lower() else 0.0
    return EvalOutput(score=score, label="correct" if score else "incorrect")


@evaluator
async def llm_judge(input: EvalInput) -> EvalOutput:
    """Use an LLM to judge quality (async evaluators supported)."""
    # Your LLM judge call here
    judgment = await call_judge_llm(input.actual_output)
    return EvalOutput(score=judgment.score, metadata={"reasoning": judgment.reason})
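Comparison logic is easier to unit-test when it lives in a plain function that an evaluator then calls. The sketch below is one possible looser variant of the case-insensitive check above, ignoring ASCII punctuation and extra whitespace as well. `normalize` and `match_score` are hypothetical helpers, not part of cat-experiments.

```python
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lower-case, drop ASCII punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(_PUNCT)).strip()


def match_score(expected, actual, key="answer"):
    """Return 1.0 when both answers normalize to the same string, else 0.0.

    Missing dicts or keys are treated as empty strings, mirroring the
    `... if input.expected_output else ""` guards in the evaluators above.
    """
    exp = (expected or {}).get(key, "")
    act = (actual or {}).get(key, "")
    return 1.0 if normalize(str(exp)) == normalize(str(act)) else 0.0


print(match_score({"answer": "Visit settings."}, {"answer": "  visit   SETTINGS"}))
print(match_score({"answer": "Yes, prorated."}, None))
```

Inside an evaluator, the returned float could be passed straight to `EvalOutput(score=...)`.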
Storage Backends
Local Storage (default)
Results are stored locally:
cat-experiments run my_experiment.py --dataset data.jsonl
Phoenix Integration
Stream results to Phoenix:
# Set Phoenix connection (or use PHOENIX_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage phoenix \
    --storage-url http://localhost:6006
CAT Cafe Integration
Stream results to CAT Cafe:
# Set CAT Cafe connection (or use CAT_BASE_URL env var)
cat-experiments run my_experiment.py --dataset data.jsonl \
    --storage cat-cafe \
    --storage-url http://localhost:8000
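Each remote backend can take its URL from either the --storage-url flag or an environment variable. The sketch below illustrates one way that lookup could work. The precedence (flag first, then environment variable) and the `resolve_storage_url` helper are assumptions; only the variable names come from the comments above.

```python
import os

# Variable names are taken from the comments above; the local backend needs no URL.
ENV_VARS = {"phoenix": "PHOENIX_BASE_URL", "cat-cafe": "CAT_BASE_URL"}


def resolve_storage_url(storage, cli_url=None, env=None):
    """Pick the URL for a storage backend.

    Assumed precedence: explicit --storage-url, then the backend's env var.
    """
    env = os.environ if env is None else env
    if storage == "local":
        return None
    if storage not in ENV_VARS:
        raise ValueError(f"unknown storage backend: {storage!r}")
    url = cli_url or env.get(ENV_VARS[storage])
    if not url:
        raise ValueError(f"{storage} needs --storage-url or {ENV_VARS[storage]}")
    return url.rstrip("/")


print(resolve_storage_url("phoenix", env={"PHOENIX_BASE_URL": "http://localhost:6006/"}))
print(resolve_storage_url("cat-cafe", cli_url="http://localhost:8000", env={}))
```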
CLI Options
cat-experiments run <experiment.py> [OPTIONS]
Dataset Options:
  --dataset PATH          Local dataset file (JSON/JSONL)
  --dataset-name NAME     Remote dataset name
  --dataset-id ID         Remote dataset ID

Storage Options:
  --storage TYPE          Storage backend: local, phoenix, cat-cafe (default: local)
  --storage-url URL       URL for remote storage

Experiment Options:
  --param KEY=VALUE       Override params (can be repeated)
  --max-workers N         Parallel workers (default: 5)
  --repetitions N         Repetitions per example (default: 1)
  --dry-run [N]           Run N examples without persistence (default: 1)
  --resume EXPERIMENT_ID  Resume a previous experiment

Output Options:
  --output FORMAT         Output format: text, json (default: text)
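Repeated --param KEY=VALUE flags override the module-level params dict. How the CLI parses values is not documented here, so the sketch below is only an illustration: it assumes values are tried as JSON (so `0.7` becomes a float) and fall back to plain strings. `parse_param` and `apply_overrides` are hypothetical names.

```python
import json


def parse_param(item):
    """Split 'KEY=VALUE' at the first '='; try JSON for the value, else keep the raw string."""
    key, sep, raw = item.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {item!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw
    return key, value


def apply_overrides(defaults, items):
    """Return a new dict with overrides applied on top of module-level params."""
    merged = dict(defaults)
    for item in items:
        key, value = parse_param(item)
        merged[key] = value
    return merged


params = apply_overrides(
    {"model": "gpt-4o-mini", "temperature": 0.0},
    ["model=gpt-4o", "temperature=0.7", "tags=a=b"],
)
print(params)
```

Splitting only at the first `=` keeps values that themselves contain `=` intact.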
Built-in Tool Call Evaluators
The package includes tool call correctness evaluators for agent evaluation:
from cat.experiments.sdk import match_tool_calls, ToolCallMatch, ToolCallMatchingResult

# Match expected vs actual tool calls
result: ToolCallMatchingResult = match_tool_calls(
    expected_calls=[{"name": "search", "arguments": {"query": "weather"}}],
    actual_calls=[{"name": "search", "arguments": {"query": "weather today"}}],
)
print(result.precision, result.recall, result.f1_score)
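To build intuition for precision, recall, and F1 over tool calls, here is a minimal baseline that counts a call as matched only when its name and arguments are identical. This is a deliberately simple stand-in, not the matching algorithm used by `match_tool_calls`, which this document does not describe. `exact_match_scores` is a hypothetical helper.

```python
import json
from collections import Counter


def _key(call):
    """Canonical, hashable form of a tool call: name plus sorted-key JSON arguments."""
    return call["name"], json.dumps(call.get("arguments", {}), sort_keys=True)


def exact_match_scores(expected_calls, actual_calls):
    """Precision, recall, and F1 under exact name+arguments matching (multiset overlap)."""
    expected = Counter(_key(c) for c in expected_calls)
    actual = Counter(_key(c) for c in actual_calls)
    matched = sum((expected & actual).values())
    precision = matched / sum(actual.values()) if actual else 0.0
    recall = matched / sum(expected.values()) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


expected = [
    {"name": "search", "arguments": {"query": "weather"}},
    {"name": "get_forecast", "arguments": {"city": "Paris", "days": 3}},
]
actual = [
    {"name": "search", "arguments": {"query": "weather today"}},
    {"name": "get_forecast", "arguments": {"days": 3, "city": "Paris"}},
]
print(exact_match_scores(expected, actual))
```

Here the `get_forecast` call matches despite the different key order, while the `search` call does not because its query differs, so all three scores come out at 0.5. A fuzzier matcher might award partial credit for the near-miss.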
Architecture
┌─────────────────────────────────────────────────────────┐
│                   CLI / Orchestrator                    │
│  - Dataset loading                                      │
│  - Parallel execution (windowed dispatch)               │
│  - Storage backends (Local, Phoenix, Cat Cafe)          │
│  - Progress reporting                                   │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                        Executor                         │
│  - Loads experiment file                                │
│  - Runs @task and @evaluator functions                  │
│  - Returns results                                      │
└─────────────────────────────────────────────────────────┘
The package uses an executor protocol that separates orchestration from execution, enabling future support for:
- Multi-language experiments (TypeScript, etc.)
- Distributed execution
- Custom execution backends
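The split between orchestration and execution can be pictured with a small interface. The sketch below is an assumption-laden illustration, not the package's actual protocol: `Executor`, `run_example`, `UppercaseExecutor`, and `orchestrate` are hypothetical names, and a semaphore is used as one simple way to approximate bounded parallel dispatch.

```python
import asyncio
from typing import Any, Protocol


class Executor(Protocol):
    """Hypothetical interface: the orchestrator only needs this one method."""

    async def run_example(self, example: dict[str, Any]) -> dict[str, Any]: ...


class UppercaseExecutor:
    """Toy in-process executor standing in for one that loads an experiment file."""

    async def run_example(self, example):
        await asyncio.sleep(0)  # pretend to do asynchronous work
        return {"answer": example["input"]["question"].upper()}


async def orchestrate(executor: Executor, examples, max_workers=5):
    """Run examples with at most `max_workers` in flight, preserving input order."""
    window = asyncio.Semaphore(max_workers)

    async def bounded(example):
        async with window:
            return await executor.run_example(example)

    return await asyncio.gather(*(bounded(e) for e in examples))


examples = [{"input": {"question": q}} for q in ("a", "b", "c")]
results = asyncio.run(orchestrate(UppercaseExecutor(), examples, max_workers=2))
print(results)
```

Because the orchestrator depends only on the interface, an executor that talks to another process or another language runtime could be swapped in without changing the dispatch loop.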