Interactive-ARC benchmark for evaluating LLMs on ARC tasks

Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.

Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.

Features

  • Interactive evaluation: models build solutions incrementally using grid-editing tools
  • Full action traces: every tool call, grid state, and token count is recorded
  • 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
  • Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
  • Concurrent execution: evaluates tasks in parallel with configurable concurrency
  • Checkpointing: interrupted runs resume from where they left off
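The concurrency and checkpointing features can be pictured with a small sketch. This is not the package's runner: `run_all`, `fake_solve`, and the single-file JSON checkpoint format are all hypothetical, chosen only to show how bounded parallelism and resume-on-restart might fit together.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def run_all(task_ids, solve, checkpoint: Path, concurrency=4):
    """Run solve() over task_ids with bounded concurrency, skipping finished tasks."""
    # Results saved by an earlier (possibly interrupted) run, if any.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    limit = asyncio.Semaphore(concurrency)

    async def run_one(task_id):
        async with limit:  # at most `concurrency` tasks in flight
            done[task_id] = await solve(task_id)
            # Persist after every task so an interruption loses little work.
            checkpoint.write_text(json.dumps(done))

    pending = [t for t in task_ids if t not in done]
    await asyncio.gather(*(run_one(t) for t in pending))
    return done


calls = []


async def fake_solve(task_id):
    """Stand-in for a real model evaluation (hypothetical)."""
    calls.append(task_id)
    await asyncio.sleep(0)
    return True


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    asyncio.run(run_all(["a", "b", "c"], fake_solve, ckpt, concurrency=2))
    first_calls = list(calls)
    # Second run: only the new task "d" is evaluated.
    results = asyncio.run(run_all(["a", "b", "c", "d"], fake_solve, ckpt, concurrency=2))
```

Writing the checkpoint after each task keeps the sketch simple. A real runner would likely also need to handle failed tasks and partially written files.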

Installation

pip install interactive-arc

Requires Python 3.12+.

Quick Start

With a cloud provider

# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

With a local model (vLLM, Ollama, etc.)

interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42

Inspect a single task

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

CLI Reference

interactive-arc run [OPTIONS]
Option            Default      Description
--dataset         arc-agi-1    Dataset (arc-agi-1 or arc-agi-2)
--split           training     Split (training or evaluation)
--provider        bedrock      LLM provider (anthropic, bedrock, openai)
--model           -            Model identifier
--base-url        -            Base URL for OpenAI-compatible endpoints
--renderer        text         Grid format sent to model (text, json, markdown)
--sample          all          Number of tasks to sample
--seed            -            Random seed for reproducible sampling
--output          -            Path for summary statistics JSON
--traces          ./traces     Directory for full trace files
--max-attempts    2            Submission attempts per task (1-10)
--enabled-tools   all          Comma-separated subset of tools to enable
--grid-feedback   both         Grid state shown after actions (both, output, none)
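To illustrate what reproducible sampling with --sample and --seed means, here is a minimal sketch. It assumes sampling is done with a seeded random generator over the task IDs; the helper `sample_task_ids` is hypothetical and the tool's actual sampling logic may differ.

```python
import random


def sample_task_ids(task_ids, sample=None, seed=None):
    """Return a reproducible subset of task IDs (hypothetical helper)."""
    if sample is None or sample >= len(task_ids):
        return list(task_ids)  # "all": no sampling
    rng = random.Random(seed)  # private generator; global random state is untouched
    # Sort first so the result does not depend on the order tasks were listed in.
    return rng.sample(sorted(task_ids), sample)


ids = [f"task{i:03d}" for i in range(400)]
first = sample_task_ids(ids, sample=50, seed=42)
second = sample_task_ids(ids, sample=50, seed=42)
print(first == second)  # prints True
```

With the same seed and task list, two runs evaluate the same subset, which makes results comparable across models.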

Tools

Models interact with the grid through these tools:

Tool                       Description
set_cell(x, y, color)      Set a single cell
set_width(width)           Resize grid width
set_height(height)         Resize grid height
flood_fill(x, y, color)    Fill connected region
copy_input()               Copy test input to output grid
copy_region(x, y, w, h)    Copy a rectangular region to clipboard
paste_region(x, y)         Paste clipboard at position
undo()                     Undo last operation
reset()                    Reset grid to initial state
submit(explanation)        Submit current grid as answer
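As a rough picture of how a few of these tools behave, the toy class below implements set_cell, flood_fill, and undo on a list-of-lists grid. It is a sketch, not the package's ToolExecutor: the `Grid` class is hypothetical, and it assumes x is the column, y is the row, colors are integers, and flood fill uses 4-connectivity.

```python
from collections import deque
from copy import deepcopy


class Grid:
    """Toy grid state machine (hypothetical; not the package's implementation)."""

    def __init__(self, width, height, fill=0):
        self.cells = [[fill] * width for _ in range(height)]
        self.history = []  # snapshots taken before each mutating operation

    def _snapshot(self):
        self.history.append(deepcopy(self.cells))

    def set_cell(self, x, y, color):
        self._snapshot()
        self.cells[y][x] = color

    def flood_fill(self, x, y, color):
        target = self.cells[y][x]
        if target == color:
            return  # nothing to change
        self._snapshot()
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            # Skip positions outside the grid.
            if not (0 <= cy < len(self.cells) and 0 <= cx < len(self.cells[0])):
                continue
            # Skip cells that are not part of the connected region.
            if self.cells[cy][cx] != target:
                continue
            self.cells[cy][cx] = color
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self):
        if self.history:
            self.cells = self.history.pop()


grid = Grid(3, 3)
for row in range(3):
    grid.set_cell(1, row, 5)   # vertical wall in the middle column
grid.flood_fill(0, 0, 2)       # fills only the left column
filled = deepcopy(grid.cells)
grid.undo()                    # reverts the flood fill, keeps the wall
```

Storing a full snapshot per operation is wasteful for large grids but keeps undo trivial, which is reasonable for the small grids used in ARC tasks.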

Python API

from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Output

Each run produces:

  • Summary JSON: success rate, action efficiency, token usage, cost estimates
  • Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)

Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

  1. Environment: grid state machine, tool execution, task loading
  2. Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
  3. Runner: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
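The swappable-component idea can be sketched with `typing.Protocol`. The names below (`GridRenderer`, `SpaceSeparatedRenderer`, `JsonGridRenderer`, `build_prompt`) and the `render` method signature are assumptions for illustration; the package's real interfaces may look different.

```python
import json
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical interface: anything with a matching render() qualifies."""

    def render(self, grid: Grid) -> str: ...


class SpaceSeparatedRenderer:
    def render(self, grid: Grid) -> str:
        return "\n".join(" ".join(str(c) for c in row) for row in grid)


class JsonGridRenderer:
    def render(self, grid: Grid) -> str:
        return json.dumps(grid)


def build_prompt(grid: Grid, renderer: GridRenderer) -> str:
    # The caller depends only on the protocol, never on a concrete renderer.
    return "Current grid:\n" + renderer.render(grid)


text_prompt = build_prompt([[1, 2], [3, 4]], SpaceSeparatedRenderer())
json_prompt = build_prompt([[1, 2], [3, 4]], JsonGridRenderer())
```

Because protocols are structural, neither renderer inherits from `GridRenderer`; a new renderer only has to provide a compatible `render` method.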

Development

git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

Licence

MIT
