Interactive-ARC benchmark for evaluating LLMs on ARC tasks

Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.

Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.

Features

  • Interactive evaluation: models build solutions incrementally using grid-editing tools
  • Full action traces: every tool call, grid state, and token count is recorded
  • 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
  • Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
  • Concurrent execution: evaluates tasks in parallel with configurable concurrency
  • Checkpointing: interrupted runs resume from where they left off
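The last two features can be pictured with a minimal sketch. It is not the package's runner: `run_all`, `solve`, and the single-file JSON checkpoint are hypothetical names and choices made for illustration. It caps the number of in-flight tasks with a semaphore and skips any task already recorded by an earlier run.

```python
import asyncio
import json
from pathlib import Path


async def run_all(task_ids, solve, checkpoint_path, concurrency=4):
    """Run solve(task_id) for every task, skipping ones already checkpointed."""
    path = Path(checkpoint_path)
    # Results from an earlier, interrupted run (empty on the first run).
    done = json.loads(path.read_text()) if path.exists() else {}
    limit = asyncio.Semaphore(concurrency)

    async def worker(task_id):
        async with limit:  # at most `concurrency` tasks in flight
            done[task_id] = await solve(task_id)
            # Persist after every task so an interruption loses little work.
            path.write_text(json.dumps(done))

    pending = [t for t in task_ids if t not in done]
    await asyncio.gather(*(worker(t) for t in pending))
    return done
```

Calling `asyncio.run(run_all(ids, solve, "checkpoint.json"))` a second time with the same checkpoint file only evaluates the task IDs that are not yet in it.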

Installation

pip install interactive-arc

Requires Python 3.12+.

Quick Start

With a cloud provider

# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

With a local model (vLLM, Ollama, etc.)

interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42

Inspect a single task

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

CLI Reference

interactive-arc run [OPTIONS]
Option            Default     Description
--dataset         arc-agi-1   Dataset (arc-agi-1 or arc-agi-2)
--split           training    Split (training or evaluation)
--provider        bedrock     LLM provider (anthropic, bedrock, openai)
--model           -           Model identifier
--base-url        -           Base URL for OpenAI-compatible endpoints
--renderer        text        Grid format sent to model (text, json, markdown)
--sample          all         Number of tasks to sample
--seed            -           Random seed for reproducible sampling
--output          -           Path for summary statistics JSON
--traces          ./traces    Directory for full trace files
--max-attempts    2           Submission attempts per task (1-10)
--enabled-tools   all         Comma-separated subset of tools to enable
--grid-feedback   both        Grid state shown after actions (both, output, none)
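To illustrate what --sample and --seed provide together, here is a small sketch of seeded sampling. How the package draws its sample is not documented here, so `sample_tasks` and its sorting step are assumptions; the point is only that a fixed seed yields the same subset on every run.

```python
import random


def sample_tasks(task_ids, sample=None, seed=None):
    """Return `sample` task IDs chosen reproducibly, or all of them if sample is None."""
    if sample is None or sample >= len(task_ids):
        return list(task_ids)
    rng = random.Random(seed)  # private RNG; does not touch global random state
    # Sort first so the order in which IDs were listed cannot change the draw.
    return rng.sample(sorted(task_ids), sample)
```

With the same seed, two calls return the same list, which makes results from different models comparable on an identical task subset.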

Tools

Models interact with the grid through these tools:

Tool                       Description
set_cell(x, y, color)      Set a single cell
set_width(width)           Resize grid width
set_height(height)         Resize grid height
flood_fill(x, y, color)    Fill connected region
copy_input()               Copy test input to output grid
copy_region(x, y, w, h)    Copy a rectangular region to clipboard
paste_region(x, y)         Paste clipboard at position
undo()                     Undo last operation
reset()                    Reset grid to initial state
submit(explanation)        Submit current grid as answer
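To give a feel for what a few of these tools do to the grid, the toy model below mimics set_cell, flood_fill, and undo. It is a stand-in written for this README, not the package's ToolExecutor. Two details are assumptions: x is treated as the column and y as the row, and "connected" is taken to mean 4-connected (no diagonals).

```python
from collections import deque


class ToyGrid:
    """Illustrative stand-in for the output grid; not the package's class."""

    def __init__(self, width, height, fill=0):
        self.cells = [[fill] * width for _ in range(height)]
        self.history = []  # snapshots taken before each edit, for undo()

    def _snapshot(self):
        self.history.append([row[:] for row in self.cells])

    def set_cell(self, x, y, color):
        self._snapshot()
        self.cells[y][x] = color  # x is the column, y is the row (assumed)

    def flood_fill(self, x, y, color):
        self._snapshot()
        target = self.cells[y][x]
        if target == color:
            return  # nothing to change
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            if not (0 <= cy < len(self.cells) and 0 <= cx < len(self.cells[0])):
                continue  # outside the grid
            if self.cells[cy][cx] != target:
                continue  # different color, or already filled
            self.cells[cy][cx] = color
            # Visit the four orthogonal neighbours (4-connectivity, assumed).
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self):
        if self.history:
            self.cells = self.history.pop()
```

Drawing a vertical wall with set_cell and then calling flood_fill on one side colors only that side, and a single undo restores the grid to its state before the fill.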

Python API

from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Output

Each run produces:

  • Summary JSON: success rate, action efficiency, token usage, cost estimates
  • Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)
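Because traces are plain JSON files, they can be inspected with the standard library alone. The helper below is a sketch: `load_traces` is a hypothetical name, it assumes the files carry a .json extension inside the traces directory, and it deliberately makes no assumption about the fields inside each trace.

```python
import json
from pathlib import Path


def load_traces(traces_dir="./traces"):
    """Map each trace file's stem (file name without .json) to its parsed content."""
    traces = {}
    for path in sorted(Path(traces_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            traces[path.stem] = json.load(fh)
    return traces
```

A reasonable first step is to print the top-level keys of one loaded trace to see which fields a given run actually recorded.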

Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

  1. Environment: grid state machine, tool execution, task loading
  2. Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
  3. Runner: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
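The general pattern behind Protocol-based swapping looks roughly like the sketch below. The names `GridRenderer`, `CsvRenderer`, `show`, and the `render` signature are hypothetical and chosen for illustration; the package's actual interfaces may differ, so check the source before implementing one.

```python
from typing import Protocol


class GridRenderer(Protocol):
    """Hypothetical interface: anything with a matching render() qualifies."""

    def render(self, grid: list[list[int]]) -> str: ...


class CsvRenderer:
    """A new renderer; note that it does not inherit from GridRenderer."""

    def render(self, grid: list[list[int]]) -> str:
        return "\n".join(",".join(str(c) for c in row) for row in grid)


def show(grid: list[list[int]], renderer: GridRenderer) -> str:
    # Depends only on the protocol, so any conforming renderer can be passed in.
    return renderer.render(grid)
```

Since Protocol uses structural typing, a type checker accepts `CsvRenderer` wherever a `GridRenderer` is expected without any registration or base class, which is what keeps new components from touching other layers.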

Development

git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

Licence

MIT
