Multi-model consensus via Cursor Cloud Agents

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

maresb

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.13

Project description

Agentic Arena

Multi-model consensus via Cursor Cloud Agents.

Frontier models solve a task through iterative rounds of independent work, anonymized critique, and verified consensus. The orchestrator is a Python CLI that communicates with the Cursor Cloud Agents API over HTTP.

generate --> evaluate
               |-- CONSENSUS (score >= 9) --> done
               |-- CONTINUE  (score < 9)  --> generate (next round)
               '-- max rounds reached     --> done
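
The loop above can be read as a small decision function. The following is a minimal sketch, not the orchestrator's actual code: the name `next_phase`, the zero-based `round_index`, and the use of a single aggregate `score` are assumptions made for illustration.

```python
CONSENSUS_THRESHOLD = 9  # from the diagram: score >= 9 means consensus


def next_phase(score: int, round_index: int, max_rounds: int) -> str:
    """Decide what follows an evaluate phase (hypothetical helper)."""
    if score >= CONSENSUS_THRESHOLD:
        return "done"  # CONSENSUS
    if round_index + 1 >= max_rounds:
        return "done"  # max rounds reached
    return "generate"  # CONTINUE with the next round
```

With `max_rounds=3`, a score of 8 in the first two rounds sends the arena back to generate; in the third round it ends the run regardless of score.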

Getting started

Prerequisites

Python 3.13+ (or uv / pixi, which install it automatically).
A Cursor API key (see below).
A GitHub repository connected to your Cursor account.

Obtaining a Cursor API key

The arena uses the Cloud Agents API to launch and manage agents. You need a User API key (not a BYOK key for third-party providers).

Sign in at cursor.com/dashboard.
Go to the Integrations tab.
Click Create New API Key, give it a name, and copy the generated key. You will not be able to see the key again after leaving the page.
Export the key in your shell or put it in a .env file in your working directory:

# Option A: environment variable
export CURSOR_API_KEY="key_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Option B: .env file
echo 'CURSOR_API_KEY=key_xxx...' > .env

Note: Free-plan API keys do not support the Cloud Agents API. You need a paid Cursor plan (Pro, Business, or Enterprise).
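
To illustrate the two options, here is a rough sketch of how a tool might look up the key: environment variable first, then a `.env` file. The function `load_api_key` and its parsing rules are assumptions for illustration; the arena's own loader may behave differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_api_key(
    dotenv_path: str = ".env",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return CURSOR_API_KEY from the environment, else from a .env file."""
    env = os.environ if environ is None else environ
    key = env.get("CURSOR_API_KEY", "").strip()
    if key:
        return key  # Option A: environment variable wins
    path = Path(dotenv_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            name, _, value = line.partition("=")
            if name.strip() == "CURSOR_API_KEY":
                value = value.strip().strip("'\"")
                if value:
                    return value  # Option B: .env file
    raise RuntimeError("CURSOR_API_KEY environment variable is not set")
```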

Quick start

# With uvx (recommended)
uvx agentic-arena --help

# Or with pipx / pip
pipx install agentic-arena
arena --help

Requires Python 3.13+. The arena and agentic-arena commands are equivalent. All examples below use arena; substitute uvx agentic-arena if you haven't installed the package.

Developer install

# Install pixi if you don't have it
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc

# Clone and install dependencies (pixi handles everything)
git clone https://github.com/maresb/agentic-arena.git
cd agentic-arena
pixi install

All dependencies (Python 3.13, requests, pydantic, typer, pytest, mypy, ruff) are declared in pixi.toml and resolved via conda-forge. The project is also installed as an editable package via pyproject.toml, which provides the arena console entrypoint.

Verify the install

pixi run test       # unit tests
pixi run lint       # ruff
pixi run format     # ruff format
pixi run typecheck  # mypy

Design

Why Cloud Agents: each agent runs in an isolated VM with its own branch, exposes structured conversations via API, and removes local tmux/worktree orchestration complexity.
Anonymization: model identities are mapped to randomized aliases per run (for example, agent_a -> opus) and prompt ordering is shuffled to reduce positional bias.
Non-goals: real-time interaction, correctness guarantees from consensus, and building a general-purpose agent framework.
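
The anonymization step can be sketched as follows. This is an illustration under assumptions, not the project's code: `make_alias_mapping` and `shuffled_presentation_order` are hypothetical names, and the single-letter alias scheme is inferred from the `agent_a` example above.

```python
import random
import string


def make_alias_mapping(models: list[str], rng: random.Random) -> dict[str, str]:
    """Map agent_a, agent_b, ... to a random permutation of the models."""
    if len(models) > len(string.ascii_lowercase):
        raise ValueError("too many models for single-letter aliases")
    shuffled = list(models)
    rng.shuffle(shuffled)
    aliases = [f"agent_{letter}" for letter in string.ascii_lowercase[: len(models)]]
    return dict(zip(aliases, shuffled))


def shuffled_presentation_order(aliases: list[str], rng: random.Random) -> list[str]:
    """Return the aliases in a fresh random order for one prompt."""
    order = list(aliases)
    rng.shuffle(order)
    return order


mapping = make_alias_mapping(["opus", "gpt", "gemini"], random.Random())
print(mapping)  # e.g. {'agent_a': 'gemini', 'agent_b': 'opus', 'agent_c': 'gpt'}
```

Drawing the mapping once per run keeps aliases stable across rounds, while reshuffling the presentation order for each prompt is what reduces positional bias.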

Usage

The CLI has five commands: init, run, step, status, and add-comment.

Initialize an arena

arena init \
  --task "Review the authentication module for security issues" \
  --repo owner/repo \
  --base-branch main \
  --max-rounds 3

This creates arenas/0001/state.yaml with a randomized alias-to-model mapping and sets the phase to generate. By default all three models (Claude Opus, GPT, Gemini) are used; use --models to select a subset.

CLI flags

Flag	Default	Description
`--task`	placeholder	Task description for the agents to solve; if omitted, a placeholder is used and you should edit `state.yaml` before running
`--repo`	git remote	GitHub repository (`owner/repo` format); auto-detected from `origin`
`--base-branch`	`main`	Branch the agents work from
`--max-rounds`	`3`	Cap on generate-evaluate cycles (1-10)
`--models`	all	Comma-separated model list (e.g. `opus,gpt`)
`--verify-commands`	none	Comma-separated commands to run on consensus (e.g. `"pixi run pytest,pixi run mypy ."`)
`--verify-mode`	`advisory`	`advisory` (log failures) or `gating` (override consensus on failure)
`--arena-dir`	auto	Arena directory to use; defaults to the next sequentially numbered directory under `arenas/`

Run the orchestrator

export CURSOR_API_KEY="your-key-here"
arena run

The orchestrator loops through phases until consensus is reached or max rounds are exhausted. It is fully resumable -- kill it at any point and restart; previously completed work is never re-done.

Progress is logged to both stderr and arenas/NNNN/orchestrator.log. Add -v for DEBUG-level output.

During polling, the orchestrator prints dots (.) to stderr so you know it is still working. These dots are suppressed when verbose logging is enabled.

Single-step mode

arena step

Executes exactly one phase transition (e.g. generate → evaluate) and exits. Useful for debugging or running phases manually.

Check status

arena status

Shows the current phase, round, alias mapping, agent IDs, and per-agent progress.

Configuration

Model selection

By default, the arena uses all three models: opus, gpt, and gemini. Use --models to select a subset:

# Two-model arena
arena init --task "..." --repo owner/repo --models opus,gpt

# Single-model smoke test
arena init --task "..." --repo owner/repo --models opus --max-rounds 1

The alias list (agent_a, agent_b, ...) is automatically sized to match the number of models.

Verify commands

Verify commands run after the judge declares consensus. They let you gate consensus on passing tests:

arena init \
  --task "Fix the login bug" \
  --repo owner/repo \
  --verify-commands "pytest,mypy ." \
  --verify-mode gating

advisory (default): Log verify failures but accept the consensus.
gating: Override consensus to CONTINUE if any verify command fails, forcing another generate-evaluate round.
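
The difference between the two modes can be sketched in a few lines. This is a simplified illustration, not the orchestrator's implementation: `apply_verify`, `run_command`, and the injected `runner` parameter are hypothetical names, and real verify commands may need a shell or a specific working directory.

```python
import shlex
import subprocess
from typing import Callable


def run_command(command: str) -> bool:
    """Run one verify command and report whether it exited with status 0."""
    return subprocess.run(shlex.split(command), check=False).returncode == 0


def apply_verify(
    verdict: str,
    commands: list[str],
    mode: str = "advisory",
    runner: Callable[[str], bool] = run_command,
) -> str:
    """Return the verdict after verify commands have been taken into account."""
    if verdict != "CONSENSUS" or not commands:
        return verdict  # verify commands only run on consensus
    failed = [cmd for cmd in commands if not runner(cmd)]
    for cmd in failed:
        print(f"verify command failed: {cmd}")
    if failed and mode == "gating":
        return "CONTINUE"  # force another generate-evaluate round
    return verdict  # advisory: failures are logged, consensus stands
```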

Inject operator comments

Use add-comment to inject a message into running agent conversations:

# Interactive mode (walks through delivery, targets, framing)
arena add-comment

# Non-interactive: deliver immediately to all agents
arena add-comment --message "Focus on error handling" --immediate

# Queue for next phase start
arena add-comment --message "Ignore the failing lint rule" --queue

Comments can target specific agents with --targets agent_a,agent_b and can include file contents via --file path/to/context.md.

Crash recovery and restart semantics

The orchestrator is designed to survive crashes at any point:

Atomic state writes. State is written to a temp file and renamed, so a crash during write never leaves a corrupt state.yaml.
Idempotent phases. Each agent's progress is tracked individually (pending → sent → done). On restart, only unfinished agents are re-processed.
Crash-safe follow-ups. Before sending a follow-up, the message count is persisted. On restart, the orchestrator compares the current message count to the saved count to detect whether the follow-up was actually delivered, preventing duplicate prompts.
Vote state is persisted. Per-agent vote progress is saved before each verdict prompt, so a crash won't lose already-collected votes.

To resume after a crash, simply re-run the same command:

arena run
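
Two of the mechanisms above, atomic state writes and follow-up detection by message count, follow well-known patterns. The sketch below shows one common way to implement them with the standard library; the function names are hypothetical and the arena's own code may differ in detail.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave stray temp files behind
        raise


def follow_up_was_delivered(saved_count: int, current_count: int) -> bool:
    """A follow-up counts as delivered if the conversation grew since the save."""
    return current_count > saved_count
```

The temp file is created next to the target so that the rename stays on one filesystem, which is what makes the replacement atomic on POSIX systems.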

Output layout

Each arena run produces:

arenas/0001/
  state.yaml                  Main state file (file: references to artifacts)
  orchestrator.log            Full debug log
  report.md                   Rolling summary report (updated each phase)
  winning-solution.md         Winner's final solution (on completion)
  artifacts/                  Externalized large text from state
    solutions_agent_a.md
    critiques_agent_a.md
    final_verdict.md
    ...
  00-1-generate-opus-solution-a1b2c3.md   Round 0, generate phase archive
  00-2-evaluate-gpt-critique-d4e5f6.md    Round 0, evaluate phase archive
  00-2-evaluate-gpt-verdict-789abc.json   Round 0, verdict archive
  ...

Archive naming: {round:02d}-{phase_num}-{phase}-{model}-{artifact}-{uid}.{ext} where uid is a content-addressed SHA-256 prefix. Files are deduplicated -- restarting the orchestrator does not create duplicate archives.
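
As an illustration of the naming scheme, the following sketch builds an archive file name from its parts. The helper `archive_name` is hypothetical, and the six-character uid length is an assumption read off the examples in the listing above.

```python
import hashlib


def archive_name(
    round_index: int,
    phase_num: int,
    phase: str,
    model: str,
    artifact: str,
    content: str,
    ext: str,
    uid_len: int = 6,  # assumed from examples such as "a1b2c3"
) -> str:
    """Build an archive file name whose uid is a SHA-256 prefix of the content."""
    uid = hashlib.sha256(content.encode("utf-8")).hexdigest()[:uid_len]
    return f"{round_index:02d}-{phase_num}-{phase}-{model}-{artifact}-{uid}.{ext}"
```

Because the uid depends only on the content, archiving the same text twice yields the same file name, which is how a restart can avoid creating duplicates.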

Artifact externalization: Large text fields (solutions, critiques, verify results, final verdict) are stored as separate .md files under artifacts/. The YAML state file stores file: references that are resolved transparently on load. Old inline state files (without file: references) are still loaded correctly.
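
A minimal sketch of how `file:` references might be written and resolved is shown below. The exact reference format, the helper names `externalize` and `resolve`, and the path layout are assumptions; only the `file:` prefix and the `artifacts/` directory come from the description above.

```python
from pathlib import Path

FILE_PREFIX = "file:"


def externalize(arena_dir: Path, name: str, text: str) -> str:
    """Write text to artifacts/<name>.md and return a file: reference to it."""
    relative = Path("artifacts") / f"{name}.md"
    target = arena_dir / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return FILE_PREFIX + relative.as_posix()


def resolve(arena_dir: Path, value: str) -> str:
    """Return the referenced file's text, or the value itself if it is inline."""
    if value.startswith(FILE_PREFIX):
        return (arena_dir / value[len(FILE_PREFIX):]).read_text(encoding="utf-8")
    return value  # old inline state: the text is stored directly
```

Treating any value without the prefix as inline text is what lets older state files load unchanged.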

Project structure

arena/
  __init__.py        Package root (version)
  __main__.py        Typer CLI: init, run, step, status, add-comment
  api.py             Cursor Cloud Agents HTTP client with retry/backoff
  extraction.py      JSON verdict parsing, VoteVerdict model, fallback heuristics
  git.py             Git remote URL parsing
  orchestrator.py    Main loop, round archival, report generation
  phases.py          Phase functions: generate, evaluate
  prompts.py         Prompt templates, model name mapping, branch hints
  state.py           Pydantic models (ArenaConfig, ArenaState), persistence

tests/
  test_api.py          API client tests
  test_cli.py          CLI commands via Typer CliRunner
  test_extraction.py   JSON verdict parsing, fallback heuristics
  test_git.py          Git remote URL parsing tests
  test_integration.py  Live API tests (requires CURSOR_API_KEY)
  test_orchestrator.py Report generation, archive deduplication
  test_phases.py       Phase control flow with mock API
  test_prompts.py      Prompt template content, branch hints
  test_state.py        Pydantic models, serialization, externalization

.github/workflows/
  ci.yml                   CI pipeline: test, lint, format, typecheck
  pypi.yaml                Trusted publishing to PyPI on release
pyproject.toml             Package metadata, console_scripts entrypoint
pixi.toml                  Dependencies and task definitions

Key types

Type	Module	Purpose
`ArenaConfig`	`state.py`	Frozen config: task, repo, branch, rounds, models, verify
`ArenaState`	`state.py`	Full mutable state persisted to `state.yaml`
`Phase`	`state.py`	StrEnum: generate, evaluate, done
`ProgressStatus`	`state.py`	StrEnum: pending, sent, done
`DEFAULT_MODELS`	`state.py`	Default model short names: opus, gpt, gemini
`Verdict`	`extraction.py`	Parsed judge verdict with decision, score, etc.
`CursorCloudAPI`	`api.py`	HTTP client for the Cursor Cloud Agents endpoints
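
To make the shape of these types concrete, here is a rough standard-library approximation. The real classes are Pydantic models; the field names, defaults, and validation below are assumptions pieced together from the CLI flags table and should not be read as the package's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(str, Enum):
    GENERATE = "generate"
    EVALUATE = "evaluate"
    DONE = "done"


class ProgressStatus(str, Enum):
    PENDING = "pending"
    SENT = "sent"
    DONE = "done"


DEFAULT_MODELS = ("opus", "gpt", "gemini")


@dataclass(frozen=True)
class ArenaConfig:
    """Immutable run configuration (field names are hypothetical)."""

    task: str
    repo: str
    base_branch: str = "main"
    max_rounds: int = 3
    models: tuple[str, ...] = DEFAULT_MODELS
    verify_commands: tuple[str, ...] = ()
    verify_mode: str = "advisory"

    def __post_init__(self) -> None:
        if not 1 <= self.max_rounds <= 10:
            raise ValueError("max_rounds must be between 1 and 10")
        if self.verify_mode not in ("advisory", "gating"):
            raise ValueError("verify_mode must be 'advisory' or 'gating'")


@dataclass
class ArenaState:
    """Mutable state that would be persisted to state.yaml."""

    config: ArenaConfig
    phase: Phase = Phase.GENERATE
    round_index: int = 0
    progress: dict[str, ProgressStatus] = field(default_factory=dict)
```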

Testing

Unit tests (no API key needed)

The test suite mocks all API calls and validates control flow, state transitions, extraction logic, prompt construction, and serialization:

pixi run test

Integration tests (requires API key)

Live API tests are in tests/test_integration.py. They are skipped by default and require an explicit opt-in (they launch real agents and cost real money):

RUN_INTEGRATION_TESTS=1 CURSOR_API_KEY=... pixi run pytest tests/test_integration.py -v

These tests verify authentication, model listing, repository listing, and agent launch/stop against the real Cursor Cloud API.

CI

The GitHub Actions pipeline (.github/workflows/ci.yml) runs on every push and PR to main:

pixi run test — unit tests
pixi run lint — ruff linter
pixi run format-check — ruff format check
pixi run typecheck — mypy

Troubleshooting

`CURSOR_API_KEY environment variable is not set`

Export your key or create a .env file. See Obtaining a Cursor API key.

Agent stuck in `RUNNING` / `CREATING`

The orchestrator polls agents with exponential backoff. If an agent appears stuck, check the Cursor dashboard for the agent's status. The orchestrator waits indefinitely by default. If needed, kill and restart it; it resumes from where it left off.

`No arena state found`

Run arena init ... first to create the state file.

Verify commands fail in gating mode

In --verify-mode gating, failing verify commands override consensus and force another round. Check the verify command output in the report or in arenas/NNNN/artifacts/verify_results_*.md. Common causes:

Tests that depend on the local environment (missing dependencies, wrong Python version).
Tests that are unrelated to the task and were already failing before the arena run.

Rate limiting on `/repositories` endpoint

The Cursor API may rate-limit repository listing requests. The API client retries with exponential backoff (up to 5 attempts). If you hit persistent rate limits, wait a few minutes before retrying.
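
For readers unfamiliar with the pattern, retrying with exponential backoff looks roughly like this. The sketch is generic: the `RateLimited` exception, the one-second base delay, and the doubling factor are assumptions, and only the limit of 5 attempts comes from the description above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when the server answers HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each rate limit."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits of 1, 2, 4, 8 seconds
    raise AssertionError("unreachable")
```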

Verdict parsing failures

When an agent's evaluate response cannot be parsed as a valid JSON verdict, the orchestrator logs a warning and uses fallback heuristics to extract scores and votes.
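
The general idea of "strict parse first, heuristics second" can be sketched as below. This is not the logic in `extraction.py`: the regular expressions, the `parse_verdict` helper, and the two-field `Verdict` dataclass are illustrative assumptions, with `decision` and `score` taken from the Key types table.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    decision: str
    score: int


def parse_verdict(text: str) -> Optional[Verdict]:
    """Parse a JSON verdict, falling back to simple text heuristics."""
    # Strict path: the outermost {...} span is valid JSON with both fields.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            return Verdict(str(data["decision"]).upper(), int(data["score"]))
        except (ValueError, KeyError, TypeError):
            pass  # fall through to the heuristics
    # Fallback: look for a number shortly after the word "score".
    score_match = re.search(r"score\D{0,10}(\d{1,2})", text, re.IGNORECASE)
    if score_match is None:
        return None
    score = int(score_match.group(1))
    decision_match = re.search(r"\b(CONSENSUS|CONTINUE)\b", text, re.IGNORECASE)
    if decision_match:
        decision = decision_match.group(1).upper()
    else:
        decision = "CONSENSUS" if score >= 9 else "CONTINUE"
    return Verdict(decision, score)
```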

Release history

This version

0.1.0

Feb 23, 2026

Download files

Source Distribution

agentic_arena-0.1.0.tar.gz (68.9 kB view details)

Uploaded Feb 23, 2026 Source

Built Distribution

agentic_arena-0.1.0-py3-none-any.whl (43.5 kB view details)

Uploaded Feb 23, 2026 Python 3

File details

Details for the file agentic_arena-0.1.0.tar.gz.

File metadata

Download URL: agentic_arena-0.1.0.tar.gz
Upload date: Feb 23, 2026
Size: 68.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentic_arena-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b0d5e44366be8f7941039cbfff8de65a432920d2bc54998358bc76e651a89921`
MD5	`d1866398551a85154821d149205ca70e`
BLAKE2b-256	`a04a925092b0b1638a09b9e90f54f7d7edc5befa9cdd227e5d25a18e5bd37abf`

Provenance

The following attestation bundles were made for agentic_arena-0.1.0.tar.gz:

Publisher: pypi.yaml on maresb/agentic-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_arena-0.1.0.tar.gz
- Subject digest: b0d5e44366be8f7941039cbfff8de65a432920d2bc54998358bc76e651a89921
- Sigstore transparency entry: 983733916
- Sigstore integration time: Feb 23, 2026
Source repository:
- Permalink: maresb/agentic-arena@b502217a72363fd0a02fc7e0add8df286aa38518
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/maresb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@b502217a72363fd0a02fc7e0add8df286aa38518
- Trigger Event: release

File details

Details for the file agentic_arena-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentic_arena-0.1.0-py3-none-any.whl
Upload date: Feb 23, 2026
Size: 43.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentic_arena-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cb798e97ecf3a02541a8265a2e373d45d073dfcbf5d11d37dc2003c672d3bcce`
MD5	`b1ccafc15c1baa4024efd7a3ee52648c`
BLAKE2b-256	`5b378049454bf3d23fe30106fc288115abe4bd63610dc95962bdd4a18cf42a8f`

Provenance

The following attestation bundles were made for agentic_arena-0.1.0-py3-none-any.whl:

Publisher: pypi.yaml on maresb/agentic-arena

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentic_arena-0.1.0-py3-none-any.whl
- Subject digest: cb798e97ecf3a02541a8265a2e373d45d073dfcbf5d11d37dc2003c672d3bcce
- Sigstore transparency entry: 983733926
- Sigstore integration time: Feb 23, 2026
Source repository:
- Permalink: maresb/agentic-arena@b502217a72363fd0a02fc7e0add8df286aa38518
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/maresb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@b502217a72363fd0a02fc7e0add8df286aa38518
- Trigger Event: release
