
AgentTester

⚠️ Experimental — This project is under active development. APIs, config format, and CLI flags may change without notice.

Send a single prompt to multiple coding agents running in parallel and compare the results. Each agent works in its own git worktree on a separate branch so they never interfere with each other. Optionally, configure LLM evaluators to review each agent's diff and drive an iterative refinement loop.

Install

uv pip install agenttester

From a source checkout, install in editable mode with dev extras:

uv pip install -e ".[dev]"

Quick Start

# List built-in agents
agent-tester agents

# Run two agents on the same prompt
agent-tester run "Add unit tests for the auth module" --agents claude,aider

# Give the run a descriptive name (used in branch and report filenames)
agent-tester run "Refactor auth module" --agents claude,aider --name auth-refactor

# Use a prompt file
agent-tester run --prompt-file task.md --agents claude,codex,aider

# Keep worktrees for manual inspection
agent-tester run "Refactor logging" --agents claude,aider --keep-worktrees

How It Works

  1. You provide a prompt and select agents
  2. AgentTester creates a git worktree + branch for each agent from the current HEAD
  3. All agents run concurrently, each in its own worktree
  4. Agent output streams to the terminal with colored prefixes
  5. A markdown comparison report is generated with diff stats and timing
  6. Worktrees are cleaned up (branches are preserved for git diff)
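Conceptually, steps 2 and 6 amount to the following git sequence per agent (the worktree path and run name here are illustrative; AgentTester manages this for you):

git worktree add -b agenttester/claude/auth-refactor /tmp/agenttester/claude HEAD
# ... agent runs inside /tmp/agenttester/claude ...
git worktree remove /tmp/agenttester/claude   # the branch survives for diffing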

Branches are named agenttester/<agent-name>/<run-name> so you can compare results:

git diff agenttester/claude/auth-refactor agenttester/aider/auth-refactor

When no --name is given, a slug is derived from the first six words of the prompt plus a short hash (e.g. add-unit-tests-for-the-auth-a3f2c1).

Configuration

Copy config.example.yaml to agent-tester.yaml (or agent-tester.yml) in your target repo to customize agents. Built-in presets are available for claude, aider, and codex.

Config file discovery

Auto-detected local config files must use a .yml or .yaml extension. The following names are checked in order:

agent-tester.yaml
agent-tester.yml
.agent-tester.yaml
.agent-tester.yml

You can also pass a config file explicitly — no extension required:

agent-tester run "Fix the bug" --agents claude --config /path/to/myconfig

A global config at ~/.config/agenttester/config.yml or ~/.config/agenttester/config.yaml is merged automatically. Local project config takes precedence over global, which takes precedence over built-in presets.
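For example, if both configs set a field for the same agent, the local value wins. (The top-level agents: key below is an assumption for illustration; config.example.yaml shows the exact schema.)

# ~/.config/agenttester/config.yml (global)
agents:
  claude:
    timeout: 600

# agent-tester.yaml (local: wins on conflict)
agents:
  claude:
    timeout: 900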

Reports

Reports are written to ~/.config/agenttester/projects/<repo-name>/ by default. You can override this per-project:

Local config (agent-tester.yaml in your repo):

reports_dir: ~/my-reports/myproject

Global config (~/.config/agenttester/config.yml), per named project:

projects:
  myproject:
    reports_dir: ~/my-reports/myproject

Local config takes priority over the global projects: setting.

Command Placeholders

  • {prompt} — replaced with the shell-escaped prompt text
  • {prompt_file} — replaced with a path to a temp file containing the prompt
  • If neither placeholder is present, the prompt is piped to the agent via stdin
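One template per style might look like this (the agent commands and flags are illustrative, not the shipped presets):

command: claude -p {prompt}                   # prompt inlined as a shell-escaped argument
command: aider --message-file {prompt_file}   # prompt written to a temp file first
command: my-agent --auto-commit               # no placeholder, so the prompt is piped via stdin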

Agent Settings

Field          Description                                             Default
command        Shell command template (required)                       -
commit_style   auto (agent commits) or manual (agenttester commits)    auto
timeout        Max seconds before the agent is killed                  600
env            Extra environment variables (key-value map)             {}
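Putting the fields together, a custom agent entry might look like this (the top-level agents: key and the command line are assumptions for illustration; config.example.yaml shows the real layout):

agents:
  my-agent:
    command: my-agent --prompt {prompt}
    commit_style: manual    # AgentTester makes the commit instead of the agent
    timeout: 900            # kill the agent after 15 minutes
    env:
      MY_AGENT_LOG_LEVEL: debug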

Skills

Skills are markdown instruction files prepended to every agent prompt. They tell agents what they are allowed to do and how to behave. AgentTester ships with four built-in skills:

Skill        Description
editing.md   Permission to read and edit files freely; look for reusable code before writing new code; prioritise readability
testing.md   Run the test suite and linter after making changes; don't mark a task complete until tests pass
git.md       Permitted git operations (branch, commit, push, pull, rebase); never push to the default branch
bash.md      Permitted bash operations scoped to code editing and testing; no system-level changes outside the worktree

Overriding or extending skills

You can override any built-in skill or add new ones at two levels:

Global (~/.config/agenttester/skills/): applies to all projects.

Local (.agent-tester/skills/ inside your repo): applies to this project only.

A skill file with the same name as a built-in replaces it entirely. New filenames add additional instructions. Skills are always output in priority order — built-ins first, global skills second, local skills last — so user-defined instructions appear closest to the prompt and carry the most weight with the model.

~/.config/agenttester/skills/testing.md   # overrides built-in testing skill globally
your-repo/.agent-tester/skills/testing.md # overrides for this project only
your-repo/.agent-tester/skills/style.md   # adds a new skill for this project
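A skill file is plain markdown instructions. A hypothetical style.md for the example above might read:

# Style
- Match the existing formatting of any file you touch.
- Prefer small, focused functions and descriptive names.
- Do not reformat code unrelated to the task.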

LLM-Based Code Evaluation

Configure one or more LLM evaluators to review each agent's diff after it runs. Multiple independent reviewers reduce the risk of hallucinated assessments, and an aggregate report is synthesized from all of them.

Add an evaluators block to your agent-tester.yaml:

evaluators:
  - name: claude
    api: anthropic          # uses ANTHROPIC_API_KEY
    model: claude-opus-4-7

  - name: llama3
    endpoint: http://localhost:8004   # any OpenAI-compatible endpoint
    model: meta-llama/Meta-Llama-3-70B-Instruct

evaluation:
  inject_raw_reports: false   # true → send raw reports instead of aggregate
  max_aggregate_tokens: 2000  # aggregate is summarized before injection if too long

After each iteration, each evaluator independently critiques every agent's diff for:

  • Accuracy — does the code implement what was asked?
  • Readability — is it clear and well-named?
  • Code smells — duplication, dead code, poor design
  • Correctness — bugs, missed edge cases, unsafe patterns

An aggregate assessment is then synthesized across evaluators. The terminal shows the aggregate; raw per-evaluator reports are preserved in the markdown report.

Iterative Refinement

When evaluators are configured, AgentTester enters a refinement loop:

  1. Agents run and commit their changes (iter-1 commit message)
  2. Evaluators review each agent's diff
  3. You select which agents to re-run (1–all, or press Enter to stop)
  4. Selected agents re-run with the aggregate feedback injected into their prompt
  5. New commits are appended to the same branch (iter-2, iter-3, …)
  6. New evaluator reports are generated for each iteration

All iterations land on the same branch — use git log to see the progression.
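For example, after two refinement rounds (hashes and message wording illustrative):

git log --oneline agenttester/claude/auth-refactor
9b81e02 iter-2: address evaluator feedback
a3f2c1d iter-1: add unit tests for the auth module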

Interactive Model REPL

For comparing responses from vLLM model servers interactively, with persistent conversation history within a session:

agent-tester repl                        # auto-discovers agent-tester.yaml, falls back to global config
agent-tester repl --config custom.yaml   # explicit config path

The REPL discovers any agent in your config whose command matches the agenttester query pattern, fans out each prompt to all of them in parallel, and maintains separate conversation history per model. Use /reset to clear history or exit to quit.

Config resolution follows the same priority as run: global config first, then local (or explicit) config, with local taking precedence on conflicts. Models defined only in the global config are available in the REPL even when a local config is present.
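As a rough sketch, a vLLM entry pairs a model with an agenttester query command (the exact argument syntax below is a placeholder, not the documented form):

agents:
  llama3:
    command: agenttester query http://localhost:8004 meta-llama/Meta-Llama-3-70B-Instruct {prompt}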

See config.example.yaml for example vLLM agent entries.

Development

uv pip install -e ".[dev]"
ruff check src/ tests/
ruff format src/ tests/
pytest

Docker

# Run against the current directory
docker compose run --rm agent-tester run "Fix the bug" --agents claude

# Run against a different repo
REPO_PATH=/path/to/repo docker compose run --rm agent-tester run "Add tests" --agents claude,aider

Library Usage

import asyncio
from pathlib import Path
from rich.console import Console
from agenttester import Orchestrator, load_config
from agenttester.config import get_reports_dir

async def main():
    repo = Path(".").resolve()                      # repository the agents will work on
    agents = load_config()                          # built-in presets merged with global and local config
    selected = [agents["claude"], agents["aider"]]
    orch = Orchestrator(repo, Console(), get_reports_dir(repo))
    results = await orch.run("Add unit tests", selected, run_name="add-tests")
    for r in results:
        print(f"{r.agent_name}: exit={r.exit_code} duration={r.duration:.1f}s")

asyncio.run(main())
