Interactive-ARC benchmark for evaluating LLMs on ARC tasks
Interactive-ARC
An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.
Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.
Features
- Interactive evaluation: models build solutions incrementally using grid-editing tools
- Full action traces: every tool call, grid state, and token count is recorded
- 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
- Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
- Concurrent execution: evaluates tasks in parallel with configurable concurrency
- Checkpointing: interrupted runs resume from where they left off
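The last two features can be pictured with a minimal sketch of bounded concurrency plus resume-from-checkpoint. This is not the package's implementation: `run_with_checkpoint`, `fake_solve`, and the single-file JSON checkpoint layout are hypothetical names and choices made only for illustration.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def run_with_checkpoint(task_ids, solve, checkpoint_path, concurrency=4):
    """Run solve(task_id) for every task, skipping ones already checkpointed."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    semaphore = asyncio.Semaphore(concurrency)  # caps the number of in-flight tasks

    async def worker(task_id):
        async with semaphore:
            done[task_id] = await solve(task_id)
            # Persist after each task so an interrupted run can resume.
            path.write_text(json.dumps(done))

    pending = [t for t in task_ids if t not in done]
    await asyncio.gather(*(worker(t) for t in pending))
    return done


async def fake_solve(task_id):
    """Stand-in for a real model call (hypothetical)."""
    await asyncio.sleep(0)
    return {"success": True}


def demo():
    """Pretend task 'a' finished in an earlier run, then resume with 'b' and 'c'."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoint.json"
        ckpt.write_text(json.dumps({"a": {"success": False}}))
        return asyncio.run(run_with_checkpoint(["a", "b", "c"], fake_solve, ckpt))
```

In `demo()`, task `a` keeps its earlier result because it is already in the checkpoint, and only `b` and `c` are run.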
Installation
pip install interactive-arc
Requires Python 3.12+.
Quick Start
With a cloud provider
# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514
# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0
With a local model (vLLM, Ollama, etc.)
interactive-arc run \
--provider openai \
--base-url http://localhost:8000/v1 \
--model Qwen/Qwen3.6-27B \
--dataset arc-agi-1 \
--split training \
--sample 50 --seed 42
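The README does not say how `--sample` and `--seed` pick tasks. One common way to get reproducible sampling, shown here as a sketch under that assumption, is to seed a private random generator and draw from a sorted list of task IDs. The `sample_tasks` helper and the example IDs other than `08ed6ac7` are hypothetical.

```python
import random


def sample_tasks(task_ids, k, seed):
    """Return k task IDs chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(task_ids), k)


all_ids = ["08ed6ac7", "task-b", "task-c", "task-d", "task-e"]
first = sample_tasks(all_ids, 3, seed=42)
second = sample_tasks(list(reversed(all_ids)), 3, seed=42)
```

With the same seed, `first` and `second` hold the same IDs in the same order, even though the inputs were ordered differently.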
Inspect a single task
interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514
CLI Reference
interactive-arc run [OPTIONS]
| Option | Default | Description |
|---|---|---|
| --dataset | arc-agi-1 | Dataset (arc-agi-1 or arc-agi-2) |
| --split | training | Split (training or evaluation) |
| --provider | bedrock | LLM provider (anthropic, bedrock, openai) |
| --model | | Model identifier |
| --base-url | | Base URL for OpenAI-compatible endpoints |
| --renderer | text | Grid format sent to model (text, json, markdown) |
| --sample | all | Number of tasks to sample |
| --seed | | Random seed for reproducible sampling |
| --output | | Path for summary statistics JSON |
| --traces | ./traces | Directory for full trace files |
| --max-attempts | 2 | Submission attempts per task (1-10) |
| --enabled-tools | all | Comma-separated subset of tools to enable |
| --grid-feedback | both | Grid state shown after actions (both, output, none) |
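When runs are launched from a script, it can help to assemble the command from the options above rather than concatenating strings. The sketch below uses only flags listed in the table; the `build_run_command` helper itself is hypothetical and not part of the package.

```python
import shlex


def build_run_command(provider, model, **options):
    """Assemble an `interactive-arc run` argv list from keyword options."""
    argv = ["interactive-arc", "run", "--provider", provider, "--model", model]
    for name, value in options.items():
        if value is None:
            continue  # leave unset options to the CLI defaults
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


cmd = build_run_command(
    "openai",
    "Qwen/Qwen3.6-27B",
    base_url="http://localhost:8000/v1",
    dataset="arc-agi-1",
    split="training",
    sample=50,
    seed=42,
    max_attempts=2,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.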
Tools
Models interact with the grid through these tools:
| Tool | Description |
|---|---|
| set_cell(x, y, color) | Set a single cell |
| set_width(width) | Resize grid width |
| set_height(height) | Resize grid height |
| flood_fill(x, y, color) | Fill connected region |
| copy_input() | Copy test input to output grid |
| copy_region(x, y, w, h) | Copy a rectangular region to clipboard |
| paste_region(x, y) | Paste clipboard at position |
| undo() | Undo last operation |
| reset() | Reset grid to initial state |
| submit(explanation) | Submit current grid as answer |
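To make the tool semantics concrete, here is a toy grid supporting three of the tools. It is a sketch under assumptions that the README does not confirm: coordinates are zero-indexed with `x` as the column and `y` as the row, `flood_fill` uses 4-connectivity, and `undo` restores a snapshot taken before each change. The `SketchGrid` class is hypothetical and is not the package's `ToolExecutor`.

```python
from collections import deque


class SketchGrid:
    """Toy grid with a few of the tools above and snapshot-based undo."""

    def __init__(self, width, height, fill=0):
        self.cells = [[fill] * width for _ in range(height)]
        self.history = []  # snapshots taken before each mutation

    def _snapshot(self):
        self.history.append([row[:] for row in self.cells])

    def set_cell(self, x, y, color):
        self._snapshot()
        self.cells[y][x] = color

    def flood_fill(self, x, y, color):
        target = self.cells[y][x]
        if target == color:
            return  # nothing to change, so no undo entry is recorded
        self._snapshot()
        height, width = len(self.cells), len(self.cells[0])
        queue = deque([(x, y)])
        self.cells[y][x] = color
        while queue:  # breadth-first walk over same-colored neighbors
            cx, cy = queue.popleft()
            for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if 0 <= nx < width and 0 <= ny < height and self.cells[ny][nx] == target:
                    self.cells[ny][nx] = color
                    queue.append((nx, ny))

    def undo(self):
        if self.history:
            self.cells = self.history.pop()


grid = SketchGrid(3, 3)
for row in range(3):
    grid.set_cell(1, row, 5)  # vertical wall of color 5 in column 1
grid.flood_fill(0, 0, 2)      # fills only the region left of the wall
```

After the fill, the left column is color 2 and the right column is untouched, because the wall separates the two regions. Calling `grid.undo()` restores the grid to the wall-only state.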
Python API
from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer
# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")
# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()
print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")
Output
Each run produces:
- Summary JSON: success rate, action efficiency, token usage, cost estimates
- Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)
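Per-task trace files lend themselves to ad hoc analysis. The trace schema is not documented here, so the sketch below assumes each JSON file carries `success` and `total_tool_calls` keys, mirroring the result attributes in the Python API example. Those key names, and the `summarize_traces` helper, are assumptions to adjust against the files a real run produces.

```python
import json
import tempfile
from pathlib import Path


def summarize_traces(trace_dir):
    """Compute success rate and mean tool calls over per-task JSON files."""
    records = [json.loads(p.read_text()) for p in sorted(Path(trace_dir).glob("*.json"))]
    if not records:
        return {"tasks": 0, "success_rate": 0.0, "mean_tool_calls": 0.0}
    solved = sum(1 for r in records if r["success"])
    calls = sum(r["total_tool_calls"] for r in records)
    return {
        "tasks": len(records),
        "success_rate": solved / len(records),
        "mean_tool_calls": calls / len(records),
    }


# Demo with two placeholder records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "t1.json").write_text(json.dumps({"success": True, "total_tool_calls": 4}))
    Path(tmp, "t2.json").write_text(json.dumps({"success": False, "total_tool_calls": 8}))
    summary = summarize_traces(tmp)
```

Pointing `summarize_traces` at the `--traces` directory of a run would work the same way, provided the key names match.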
Architecture
The codebase follows a three-layer architecture with strict one-directional dependencies:
- Environment: grid state machine, tool execution, task loading
- Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
- Runner: concurrent orchestration, checkpointing, metrics, CLI
Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
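The Protocol-based design can be illustrated with a short sketch. The interface below is hypothetical: the `Renderer` protocol, its `render` method, and `CsvRenderer` are invented for illustration, and the package's real interfaces may use different names and signatures.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Renderer(Protocol):
    """Hypothetical renderer interface: turn a grid into a prompt string."""

    def render(self, grid: list[list[int]]) -> str: ...


class CsvRenderer:
    """A new renderer; it satisfies Renderer structurally, without inheriting."""

    def render(self, grid: list[list[int]]) -> str:
        return "\n".join(",".join(str(cell) for cell in row) for row in grid)


def describe(grid: list[list[int]], renderer: Renderer) -> str:
    """Code in another layer depends only on the protocol, not the class."""
    return renderer.render(grid)


rendered = describe([[0, 1], [2, 3]], CsvRenderer())
```

Because `typing.Protocol` uses structural typing, `CsvRenderer` needs no base class or registration, which is what lets a new component be added without touching other layers.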
Development
git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/
Licence
MIT