
PacaBench

A local-first benchmark harness for LLM agents

Stop playing script whack-a-mole with your benchmarks and start looking at reproducible results.


Give a Star

Live run TUI with distributions and rolling failures


The Problem

Benchmarking LLM agents should be simple. In reality, it usually looks like this:

  • A long run fails at 13% after ~5 hours because of an API hiccup.
  • You restart from scratch.
  • Some cases silently succeed while others crash your scripts.
  • You copy JSON blobs around trying to recover partial results and write one-off scripts to juggle them.
  • You don't know how many tokens were actually used or how long responses truly took.

What should be a "start it, walk away, come back for results" evaluation turns into a multi-day slog of brittle scripts, half-finished results, and unreliable metrics.

Benchmarks shouldn't be harder than building the agent.

You don't need an enterprise platform that takes weeks to integrate. You need a tool that works.

What is PacaBench?

PacaBench is a harness built for the reality of agentic LLM development. It handles the messy parts of benchmarking so you can focus on your agents.

  • It doesn't crash. Agents run in isolated processes. If one crashes, the harness records the failure and keeps moving.
  • It remembers where it left off. State is saved after every single case. If you kill the process or your machine restarts, you resume exactly where you stopped.
  • It handles the retry loop. Run the suite, let it finish, then retry failures with a single command.
  • It measures reality. A built-in proxy sits between your agent and the LLM provider to track exact latency and token usage. No more guessing or relying on self-reported metrics.

Examples | Issues


Quick Start

Installation

pip install pacabench

Or run directly without installing:

uvx pacabench@latest --help

Usage

Initialize a new project:

pacabench init

Run a quick test:

pacabench run --limit 10


Live run summary with distributions and a rolling failures log.

See all runs:

pacabench show


Run list overview.

Drill into a specific run:

pacabench show <run-id>
pacabench show <run-id> --cases
pacabench show <run-id> --failures


Run summary with distributions, cost breakdowns, and failures.

Retry failures:

pacabench retry <run-id>

Export for analysis:

pacabench export <run-id> > results.json
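If you want to poke at an export from Python, a minimal loader is enough to start. The export schema isn't documented above, so treat the parsed value as opaque JSON until you've inspected a real file:

```python
import json

def load_results(path):
    """Load a run exported with `pacabench export <run-id> > results.json`.

    The export schema isn't documented in this README, so the parsed
    value is returned untouched; inspect it before building analysis
    on top of specific field names.
    """
    with open(path) as f:
        return json.load(f)
```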

CLI Reference

| Command | Description |
| --- | --- |
| `pacabench show` | List all runs |
| `pacabench show <run>` | Show run details |
| `pacabench show <run> --cases` | Show individual case results |
| `pacabench show <run> --failures` | Show only failed cases |
| `pacabench run` | Start a benchmark run |
| `pacabench run --limit N` | Run with limited cases (for testing) |
| `pacabench run -a agent1,agent2` | Run only specific agents |
| `pacabench retry <run>` | Retry failed cases from a run |
| `pacabench export <run>` | Export results to JSON |
| `pacabench export <run> --format md` | Export results to Markdown |
| `pacabench show-config` | Show parsed configuration |
| `pacabench init` | Create a new project |

Partial run IDs work - just type enough to uniquely match (e.g., pacabench show 120358).
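Partial matching presumably resolves a prefix against the known run IDs and insists on a unique hit. A sketch of the likely semantics (not PacaBench's actual resolution code):

```python
def resolve_run(prefix, run_ids):
    """Resolve a partial run ID the way the CLI tip above describes:
    the prefix must match exactly one known run. This is a sketch of
    the likely semantics, not the harness's own implementation.
    """
    matches = [r for r in run_ids if r.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matches {len(matches)} runs, need exactly 1")
    return matches[0]
```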


Configuration

Define your entire benchmark in one pacabench.yaml file. Configure it once, run it forever.

name: memory-benchmark
description: Evaluating long-term memory capabilities
version: "1.0.0"

config:
  concurrency: 4
  timeout_seconds: 60

agents:
  - name: "mem0-agent"
    command: "python agents/mem0_agent.py"

datasets:
  - name: "membench"
    source: "git:https://github.com/import-myself/Membench.git"
    prepare: "python scripts/prepare_membench.py"
    input_map:
      input: "question"
      expected: "ground_truth"
    evaluator:
      type: "llm_judge"
      model: "gpt-4o-mini"

output:
  directory: "./runs"
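The `input_map` block above renames dataset columns into the fields the harness hands to your agent (`question` becomes `input`, `ground_truth` becomes `expected`). A rough sketch of that mapping in Python (the harness's actual implementation may differ):

```python
def map_case(row, input_map):
    """Apply a pacabench-style input_map: each case field is read from
    the dataset column named in the map. This sketches the semantics
    implied by the YAML above, not the harness's own code.
    """
    return {field: row[column] for field, column in input_map.items()}

row = {"question": "Capital of France?", "ground_truth": "Paris"}
case = map_case(row, {"input": "question", "expected": "ground_truth"})
# case == {"input": "Capital of France?", "expected": "Paris"}
```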

Why YAML?

Because you should be able to describe a benchmark, not build a bespoke system for every new test suite.


Agent Interface

Your agent needs to read JSON from stdin and write JSON to stdout. No new SDK to learn here.

| Input (STDIN) | Output (STDOUT) |
| --- | --- |
| `{"case_id": "1", "input": "Hi"}` | `{"output": "Hello!", "error": null}` |

Write your agent as a hook or a standalone script in Python, Go, Rust, Node, or whatever you fancy.
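A minimal Python agent satisfying this contract might look like the sketch below. Whether the harness sends one case per process or streams line-delimited JSON isn't specified above, so the loop assumes one JSON object per line:

```python
import json
import sys

def handle_case(case):
    """Produce the output record for one case; the record shapes match
    the Input/Output table above. Swap the echo stub for your agent."""
    try:
        answer = f"Echo: {case['input']}"  # your agent logic goes here
        return {"output": answer, "error": None}
    except Exception as exc:
        return {"output": None, "error": str(exc)}

if __name__ == "__main__" and not sys.stdin.isatty():
    # Assumed framing: one JSON case per line in, one result per line out.
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(handle_case(json.loads(line))), flush=True)
```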


Why?

Because I was sick of my own benchmarks blowing up. I tried running serious agent benchmarks locally and kept hitting the same wall:

  • Runs would fail at 60% or 20% because of one bad response.
  • I ended up with script spaghetti just to get through a single dataset.
  • Re-running failures meant copy/pasting JSON blobs and praying nothing broke.
  • I didn't want a heavyweight enterprise system like Arize. I wanted something that just works.
  • I wanted a tool I could configure once, leave overnight, then run and re-run locally without thinking.

Benchmarking agents became a game of whack-a-mole:

run → isolate failures → rerun → inspect → repeat → rage

PacaBench exists because I wanted to stop fighting my tools and start getting actual signal from my agents.

Architecture

PacaBench is built in Rust for failure isolation and reliability, while remaining easy to install via pip/uvx. It isolates your code from the harness.

graph LR
    H[Harness] -->|1. Spawn| R[Runner Process]
    R -->|2. Input JSON| A[Your Agent]
    A -->|3. API Call| P[Metrics Proxy]
    P -->|4. Forward| O[OpenAI/LLM Provider]
    O -->|5. Response| P
    P -->|6. Record Metrics| H
    P -->|7. Return| A
    A -->|8. Output JSON| R
    R -->|9. Result| H

Key Components

  1. Harness: Manages the run loop, persistence, and retries.
  2. Proxy: Intercepts API calls to provide ground-truth metrics (OPENAI_BASE_URL injection).
  3. Runners: Worker processes that ensure a bad agent doesn't kill the benchmark.
  4. Evaluator: Flexible scoring (LLM judges, regex, F1, exact match, etc.).
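The base-URL injection in point 2 means an agent needs no PacaBench-specific code: any OpenAI-compatible client that honors `OPENAI_BASE_URL` is automatically routed through the metrics proxy. A sketch of the agent side of that contract:

```python
import os

def api_endpoint(env=None):
    """Resolve the chat endpoint the agent should call. Per the README,
    PacaBench injects OPENAI_BASE_URL to route traffic through its
    metrics proxy; the fallback below is the standard OpenAI base URL.
    """
    env = os.environ if env is None else env
    base = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return f"{base.rstrip('/')}/chat/completions"
```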

Contributing

We welcome contributions. See Contributing Guidelines.


License

Apache 2.0 - see LICENSE
