Skip to main content

Multi-model consensus via Cursor Cloud Agents

Project description

Agentic Arena

Multi-model consensus via Cursor Cloud Agents.

Frontier models solve a task through iterative rounds of independent work, anonymized critique, and verified consensus. The orchestrator is a Python CLI that communicates with the Cursor Cloud Agents API over HTTP.

generate --> evaluate
               |-- CONSENSUS (score >= 9) --> done
               |-- CONTINUE  (score < 9)  --> generate (next round)
               '-- max rounds reached     --> done

Getting started

Prerequisites

  • Python 3.13+ (or uv / pixi which install it automatically).
  • A Cursor API key (see below).
  • A GitHub repository connected to your Cursor account.

Obtaining a Cursor API key

The arena uses the Cloud Agents API to launch and manage agents. You need a User API key (not a BYOK key for third-party providers).

  1. Sign in at cursor.com/dashboard.
  2. Go to the Integrations tab (direct link).
  3. Click Create New API Key, give it a name, and copy the generated key. You will not be able to see the key again after leaving the page.
  4. Export the key in your shell or put it in a .env file in your working directory:
# Option A: environment variable
export CURSOR_API_KEY="key_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Option B: .env file
echo 'CURSOR_API_KEY=key_xxx...' > .env

Note: Free-plan API keys do not support the Cloud Agents API. You need a paid Cursor plan (Pro, Business, or Enterprise).

Quick start

# With uvx (recommended)
uvx agentic-arena --help

# Or with pipx / pip
pipx install agentic-arena
arena --help

Requires Python 3.13+. The arena and agentic-arena commands are equivalent. All examples below use arena; substitute uvx agentic-arena if you haven't installed the package.

Developer install

# Install pixi if you don't have it
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc

# Clone and install dependencies (pixi handles everything)
git clone https://github.com/maresb/agentic-arena.git
cd agentic-arena
pixi install

All dependencies (Python 3.13, requests, pydantic, typer, pytest, mypy, ruff) are declared in pixi.toml and resolved via conda-forge. The project is also installed as an editable package via pyproject.toml, which provides the arena console entrypoint.

Verify the install

pixi run test       # unit tests
pixi run lint       # ruff
pixi run format     # ruff format
pixi run typecheck  # mypy

Design

  • Why Cloud Agents: each agent runs in an isolated VM with its own branch, exposes structured conversations via API, and removes local tmux/worktree orchestration complexity.
  • Anonymization: model identities are mapped to randomized aliases per run (for example, agent_a -> opus) and prompt ordering is shuffled to reduce positional bias.
  • Non-goals: real-time interaction, correctness guarantees from consensus, and building a general-purpose agent framework.

Usage

The CLI has five commands: init, run, step, status, and add-comment.

Initialize an arena

arena init \
  --task "Review the authentication module for security issues" \
  --repo owner/repo \
  --base-branch main \
  --max-rounds 3

This creates arenas/0001/state.yaml with a randomized alias-to-model mapping and sets the phase to generate. By default all three models (Claude Opus, GPT, Gemini) are used; use --models to select a subset.

CLI flags

Flag Default Description
--task placeholder Task description for the agents to solve (edit state.yaml before running)
--repo git remote GitHub repository (owner/repo format); auto-detected from origin
--base-branch main Branch the agents work from
--max-rounds 3 Cap on generate-evaluate cycles (1-10)
--models all Comma-separated model list (e.g. opus,gpt)
--verify-commands none Comma-separated commands to run on consensus (e.g. "pixi run pytest,pixi run mypy .")
--verify-mode advisory advisory (log failures) or gating (override consensus on failure)
--arena-dir auto Next sequentially-numbered directory under arenas/

Run the orchestrator

export CURSOR_API_KEY="your-key-here"
arena run

The orchestrator loops through phases until consensus is reached or max rounds are exhausted. It is fully resumable -- kill it at any point and restart; previously completed work is never re-done.

Progress is logged to both stderr and arenas/NNNN/orchestrator.log. Add -v for DEBUG-level output.

During polling, the orchestrator prints dots (.) to stderr so you know it is still working. These dots are suppressed when verbose logging is enabled.

Single-step mode

arena step

Executes exactly one phase transition (e.g. generate → evaluate) and exits. Useful for debugging or running phases manually.

Check status

arena status

Shows the current phase, round, alias mapping, agent IDs, and per-agent progress.


Configuration

Model selection

By default, the arena uses all three models: opus, gpt, and gemini. Use --models to select a subset:

# Two-model arena
arena init --task "..." --repo owner/repo --models opus,gpt

# Single-model smoke test
arena init --task "..." --repo owner/repo --models opus --max-rounds 1

The alias list (agent_a, agent_b, ...) is automatically sized to match the number of models.

Verify commands

Verify commands run after the judge declares consensus. They let you gate consensus on passing tests:

arena init \
  --task "Fix the login bug" \
  --repo owner/repo \
  --verify-commands "pytest,mypy ." \
  --verify-mode gating
  • advisory (default): Log verify failures but accept the consensus.
  • gating: Override consensus to CONTINUE if any verify command fails, forcing another generate-evaluate round.

Inject operator comments

Use add-comment to inject a message into running agent conversations:

# Interactive mode (walks through delivery, targets, framing)
arena add-comment

# Non-interactive: deliver immediately to all agents
arena add-comment --message "Focus on error handling" --immediate

# Queue for next phase start
arena add-comment --message "Ignore the failing lint rule" --queue

Comments can target specific agents with --targets agent_a,agent_b and can include file contents via --file path/to/context.md.


Crash recovery and restart semantics

The orchestrator is designed to survive crashes at any point:

  • Atomic state writes. State is written to a temp file and renamed, so a crash during write never leaves a corrupt state.yaml.
  • Idempotent phases. Each agent's progress is tracked individually (pending → sent → done). On restart, only unfinished agents are re-processed.
  • Crash-safe follow-ups. Before sending a follow-up, the message count is persisted. On restart, the orchestrator compares the current message count to the saved count to detect whether the follow-up was actually delivered, preventing duplicate prompts.
  • Vote state is persisted. Per-agent vote progress is saved before each verdict prompt, so a crash won't lose already-collected votes.

To resume after a crash, simply re-run the same command:

arena run

Output layout

Each arena run produces:

arenas/0001/
  state.yaml                  Main state file (file: references to artifacts)
  orchestrator.log            Full debug log
  report.md                   Rolling summary report (updated each phase)
  winning-solution.md         Winner's final solution (on completion)
  artifacts/                  Externalized large text from state
    solutions_agent_a.md
    critiques_agent_a.md
    final_verdict.md
    ...
  00-1-generate-opus-solution-a1b2c3.md   Round 0, generate phase archive
  00-2-evaluate-gpt-critique-d4e5f6.md    Round 0, evaluate phase archive
  00-2-evaluate-gpt-verdict-789abc.json   Round 0, verdict archive
  ...

Archive naming: {round:02d}-{phase_num}-{phase}-{model}-{artifact}-{uid}.{ext} where uid is a content-addressed SHA-256 prefix. Files are deduplicated -- restarting the orchestrator does not create duplicate archives.

Artifact externalization: Large text fields (solutions, critiques, verify results, final verdict) are stored as separate .md files under artifacts/. The YAML state file stores file: references that are resolved transparently on load. Old inline state files (without file: references) are still loaded correctly.


Project structure

arena/
  __init__.py        Package root (version)
  __main__.py        Typer CLI: init, run, step, status, add-comment
  api.py             Cursor Cloud Agents HTTP client with retry/backoff
  extraction.py      JSON verdict parsing, VoteVerdict model, fallback heuristics
  git.py             Git remote URL parsing
  orchestrator.py    Main loop, round archival, report generation
  phases.py          Phase functions: generate, evaluate
  prompts.py         Prompt templates, model name mapping, branch hints
  state.py           Pydantic models (ArenaConfig, ArenaState), persistence

tests/
  test_api.py          API client tests
  test_cli.py          CLI commands via Typer CliRunner
  test_extraction.py   JSON verdict parsing, fallback heuristics
  test_git.py          Git remote URL parsing tests
  test_integration.py  Live API tests (requires CURSOR_API_KEY)
  test_orchestrator.py Report generation, archive deduplication
  test_phases.py       Phase control flow with mock API
  test_prompts.py      Prompt template content, branch hints
  test_state.py        Pydantic models, serialization, externalization

.github/workflows/
  ci.yml                   CI pipeline: test, lint, format, typecheck
  pypi.yaml                Trusted publishing to PyPI on release
pyproject.toml             Package metadata, console_scripts entrypoint
pixi.toml                  Dependencies and task definitions

Key types

Type Module Purpose
ArenaConfig state.py Frozen config: task, repo, branch, rounds, models, verify
ArenaState state.py Full mutable state persisted to state.yaml
Phase state.py StrEnum: generate, evaluate, done
ProgressStatus state.py StrEnum: pending, sent, done
DEFAULT_MODELS state.py Default model short names: opus, gpt, gemini
Verdict extraction.py Parsed judge verdict with decision, score, etc.
CursorCloudAPI api.py HTTP client for the Cursor Cloud Agents endpoints

Testing

Unit tests (no API key needed)

The test suite mocks all API calls and validates control flow, state transitions, extraction logic, prompt construction, and serialization:

pixi run test

Integration tests (requires API key)

Live API tests are in tests/test_integration.py. They are skipped by default and require an explicit opt-in (they launch real agents and cost real money):

RUN_INTEGRATION_TESTS=1 CURSOR_API_KEY=... pixi run pytest tests/test_integration.py -v

These tests verify authentication, model listing, repository listing, and agent launch/stop against the real Cursor Cloud API.

CI

The GitHub Actions pipeline (.github/workflows/ci.yml) runs on every push and PR to main:

  • pixi run test — unit tests
  • pixi run lint — ruff linter
  • pixi run format-check — ruff format check
  • pixi run typecheck — mypy

Troubleshooting

CURSOR_API_KEY environment variable is not set

Export your key or create a .env file. See Obtaining a Cursor API key.

Agent stuck in RUNNING / CREATING

The orchestrator polls agents with exponential backoff. If an agent appears stuck, check the Cursor dashboard for the agent's status. The orchestrator will wait indefinitely by default; kill and restart it if needed — it will resume from where it left off.

No arena state found

Run arena init ... first to create the state file.

Verify commands fail in gating mode

In --verify-mode gating, failing verify commands override consensus and force another round. Check the verify command output in the report or in arenas/NNNN/artifacts/verify_results_*.md. Common causes:

  • Tests that depend on the local environment (missing dependencies, wrong Python version).
  • Tests that are unrelated to the task and were already failing before the arena run.

Rate limiting on /repositories endpoint

The Cursor API may rate-limit repository listing requests. The API client retries with exponential backoff (up to 5 attempts). If you hit persistent rate limits, wait a few minutes before retrying.

Verdict parsing failures

When an agent's evaluate response cannot be parsed as a valid JSON verdict, the orchestrator logs a warning and uses fallback heuristics to extract scores and votes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentic_arena-0.1.0.tar.gz (68.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentic_arena-0.1.0-py3-none-any.whl (43.5 kB view details)

Uploaded Python 3

File details

Details for the file agentic_arena-0.1.0.tar.gz.

File metadata

  • Download URL: agentic_arena-0.1.0.tar.gz
  • Upload date:
  • Size: 68.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentic_arena-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b0d5e44366be8f7941039cbfff8de65a432920d2bc54998358bc76e651a89921
MD5 d1866398551a85154821d149205ca70e
BLAKE2b-256 a04a925092b0b1638a09b9e90f54f7d7edc5befa9cdd227e5d25a18e5bd37abf

See more details on using hashes here.

Provenance

The following attestation bundles were made for agentic_arena-0.1.0.tar.gz:

Publisher: pypi.yaml on maresb/agentic-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agentic_arena-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: agentic_arena-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 43.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentic_arena-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cb798e97ecf3a02541a8265a2e373d45d073dfcbf5d11d37dc2003c672d3bcce
MD5 b1ccafc15c1baa4024efd7a3ee52648c
BLAKE2b-256 5b378049454bf3d23fe30106fc288115abe4bd63610dc95962bdd4a18cf42a8f

See more details on using hashes here.

Provenance

The following attestation bundles were made for agentic_arena-0.1.0-py3-none-any.whl:

Publisher: pypi.yaml on maresb/agentic-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page