Interactive-ARC benchmark for evaluating LLMs on ARC tasks

Project description

Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.

Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.

Features

- Interactive evaluation: models build solutions incrementally using grid-editing tools
- Full action traces: every tool call, grid state, and token count is recorded
- 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
- Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
- Concurrent execution: evaluates tasks in parallel with configurable concurrency
- Checkpointing: interrupted runs resume from where they left off
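
The concurrency and checkpointing features can be pictured with a small sketch. This is not the package's runner: `evaluate`, `run_all`, the `checkpoint.json` file, and its layout are all hypothetical names chosen for illustration. It only shows the general pattern of bounding parallel work with a semaphore and skipping tasks that a previous run already finished.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def evaluate(task_id):
    """Hypothetical placeholder for evaluating one task with a model."""
    await asyncio.sleep(0)
    return {"task_id": task_id, "success": False}


async def run_all(task_ids, checkpoint, concurrency=4):
    # Load results from an earlier (possibly interrupted) run, if any.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    limit = asyncio.Semaphore(concurrency)  # at most `concurrency` tasks in flight

    async def worker(task_id):
        async with limit:
            result = await evaluate(task_id)
        done[task_id] = result
        checkpoint.write_text(json.dumps(done))  # persist after every finished task

    pending = [t for t in task_ids if t not in done]
    await asyncio.gather(*(worker(t) for t in pending))
    return done, pending


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    # First run evaluates both tasks; the second run only picks up the new one.
    _, first_pending = asyncio.run(run_all(["a", "b"], ckpt))
    results, second_pending = asyncio.run(run_all(["a", "b", "c"], ckpt))
```

In this sketch the second call evaluates only task `c`, because `a` and `b` are already in the checkpoint file.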

Installation

pip install interactive-arc

Requires Python 3.12+.

Quick Start

With a cloud provider

# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

With a local model (vLLM, Ollama, etc.)

interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42

Inspect a single task

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

CLI Reference

interactive-arc run [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| `--dataset` | `arc-agi-1` | Dataset (`arc-agi-1` or `arc-agi-2`) |
| `--split` | `training` | Split (`training` or `evaluation`) |
| `--provider` | `bedrock` | LLM provider (`anthropic`, `bedrock`, `openai`) |
| `--model` | | Model identifier |
| `--base-url` | | Base URL for OpenAI-compatible endpoints |
| `--renderer` | `text` | Grid format sent to model (`text`, `json`, `markdown`) |
| `--sample` | all | Number of tasks to sample |
| `--seed` | | Random seed for reproducible sampling |
| `--output` | | Path for summary statistics JSON |
| `--traces` | `./traces` | Directory for full trace files |
| `--max-attempts` | `2` | Submission attempts per task (1-10) |
| `--enabled-tools` | all | Comma-separated subset of tools to enable |
| `--grid-feedback` | `both` | Grid state shown after actions (`both`, `output`, `none`) |
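
To illustrate what `--sample` combined with `--seed` is meant to guarantee, here is a minimal sketch of reproducible sampling. How the package samples internally is not documented here; `sample_task_ids` and the synthetic task IDs are hypothetical and only demonstrate the idea that a fixed seed yields the same subset on every run.

```python
import random


def sample_task_ids(task_ids, sample=None, seed=None):
    """Return a reproducible subset of task IDs (hypothetical helper)."""
    ids = sorted(task_ids)  # sort first so the input order does not affect the draw
    if sample is None or sample >= len(ids):
        return ids  # mirrors the "all" default: no sampling requested
    rng = random.Random(seed)  # local RNG; leaves the global random state alone
    return rng.sample(ids, sample)


ids = [f"task{i:03d}" for i in range(400)]
first = sample_task_ids(ids, sample=50, seed=42)
second = sample_task_ids(reversed(ids), sample=50, seed=42)
print(first == second)
```

Sorting before sampling is a design choice in this sketch: it makes the result depend only on the set of IDs and the seed, not on the order in which files happen to be listed.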

Tools

Models interact with the grid through these tools:

| Tool | Description |
| --- | --- |
| `set_cell(x, y, color)` | Set a single cell |
| `set_width(width)` | Resize grid width |
| `set_height(height)` | Resize grid height |
| `flood_fill(x, y, color)` | Fill connected region |
| `copy_input()` | Copy test input to output grid |
| `copy_region(x, y, w, h)` | Copy a rectangular region to clipboard |
| `paste_region(x, y)` | Paste clipboard at position |
| `undo()` | Undo last operation |
| `reset()` | Reset grid to initial state |
| `submit(explanation)` | Submit current grid as answer |
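
The semantics of a few of these tools can be sketched with a toy grid. `GridSketch` below is a hypothetical stand-in, not the package's grid class. It assumes `x` is the column and `y` is the row with the origin at the top-left, and that `flood_fill` uses 4-connectivity; the README does not state either convention.

```python
from collections import deque


class GridSketch:
    """Hypothetical toy grid illustrating set_cell, flood_fill, and undo."""

    def __init__(self, width, height, fill=0):
        self.cells = [[fill] * width for _ in range(height)]
        self._history = []  # snapshots taken before each mutating operation

    def _snapshot(self):
        self._history.append([row[:] for row in self.cells])

    def set_cell(self, x, y, color):
        self._snapshot()
        self.cells[y][x] = color

    def flood_fill(self, x, y, color):
        target = self.cells[y][x]
        if target == color:
            return  # nothing to change, so no undo entry either
        self._snapshot()
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            if not (0 <= cy < len(self.cells) and 0 <= cx < len(self.cells[0])):
                continue  # outside the grid
            if self.cells[cy][cx] != target:
                continue  # different color, or already filled
            self.cells[cy][cx] = color
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self):
        if self._history:
            self.cells = self._history.pop()


grid = GridSketch(3, 3)
for row in range(3):
    grid.set_cell(1, row, 5)  # vertical wall in the middle column
grid.flood_fill(0, 0, 2)      # fills only the region left of the wall
filled = [row[:] for row in grid.cells]
grid.undo()                   # reverts the flood fill as a single operation
restored = grid.cells
```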

Python API

from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Output

Each run produces:

- Summary JSON: success rate, action efficiency, token usage, cost estimates
- Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)
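
Because traces are one JSON file per task, they lend themselves to simple post-hoc analysis. The sketch below aggregates a directory of trace files into a few statistics. The trace schema is not documented in this README, so the field names `task_id`, `success`, and `tool_calls`, as well as the helper `summarize_traces`, are assumptions; adjust them to match the files an actual run produces.

```python
import json
import tempfile
from pathlib import Path


def summarize_traces(trace_dir):
    """Aggregate per-task trace files (assumed schema, see note above)."""
    traces = [json.loads(p.read_text()) for p in sorted(Path(trace_dir).glob("*.json"))]
    if not traces:
        return {"tasks": 0, "success_rate": 0.0, "mean_tool_calls": 0.0}
    solved = sum(1 for t in traces if t["success"])
    calls = sum(len(t["tool_calls"]) for t in traces)
    return {
        "tasks": len(traces),
        "success_rate": solved / len(traces),
        "mean_tool_calls": calls / len(traces),
    }


# Synthetic traces so the sketch runs without a real evaluation.
samples = [
    {"task_id": "t1", "success": True, "tool_calls": ["copy_input", "set_cell", "submit"]},
    {"task_id": "t2", "success": False, "tool_calls": ["set_cell"] * 4 + ["submit"]},
]
with tempfile.TemporaryDirectory() as tmp:
    for s in samples:
        (Path(tmp) / f"{s['task_id']}.json").write_text(json.dumps(s))
    summary = summarize_traces(tmp)
print(summary)
```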

Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

- Environment: grid state machine, tool execution, task loading
- Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
- Runner: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
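
As a rough illustration of the Protocol-based design, the sketch below defines a renderer interface and a custom implementation that satisfies it structurally, without inheriting from it. The names `GridRenderer`, `render`, `CsvRenderer`, and `build_prompt` are hypothetical; the package's actual Protocol classes and method signatures are not shown in this README and may differ.

```python
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical interface: anything with a matching render() qualifies."""

    def render(self, grid: Grid) -> str: ...


class CsvRenderer:
    """Custom renderer; note that it does not subclass GridRenderer."""

    def render(self, grid: Grid) -> str:
        return "\n".join(",".join(str(cell) for cell in row) for row in grid)


def build_prompt(grid: Grid, renderer: GridRenderer) -> str:
    # The caller depends only on the interface, never on a concrete renderer.
    return "Current grid:\n" + renderer.render(grid)


prompt = build_prompt([[0, 1], [2, 3]], CsvRenderer())
print(prompt)
```

Structural typing is what keeps the layers decoupled in a design like this: a new renderer only needs the right method shape, so the code that consumes it does not change.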

Development

git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

Licence

MIT

Project details

Release history

0.1.2

May 18, 2026

This version

0.1.1

May 18, 2026

0.1.0

May 18, 2026

Download files

Download the file for your platform.

Source Distribution

interactive_arc-0.1.1.tar.gz (2.2 MB)

Uploaded May 18, 2026 Source

Built Distribution

interactive_arc-0.1.1-py3-none-any.whl (43.3 kB)

Uploaded May 18, 2026 Python 3

File details

Details for the file interactive_arc-0.1.1.tar.gz.

File metadata

Download URL: interactive_arc-0.1.1.tar.gz
Upload date: May 18, 2026
Size: 2.2 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for interactive_arc-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `44e02a9cf501a11b0efb203c6964b3b9a37f69ce1d14716bc9548a9e59ec4bf0` |
| MD5 | `a9f817ec48e78029ace92e13b1714554` |
| BLAKE2b-256 | `92752b6cf25223b3c74f348c81218958d6c7d9aab28114ee7eb1062705dc8e43` |

See more details on using hashes here.
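
A downloaded file can be checked against the SHA256 digest listed above with the standard library's `hashlib`. This is a minimal sketch: the helper names `sha256_of` and `verify` are illustrative, and the script assumes the archive sits in the current directory under its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for interactive_arc-0.1.1.tar.gz
EXPECTED_SHA256 = "44e02a9cf501a11b0efb203c6964b3b9a37f69ce1d14716bc9548a9e59ec4bf0"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


archive = Path("interactive_arc-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("match" if verify(archive, EXPECTED_SHA256) else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```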

Provenance

The following attestation bundles were made for interactive_arc-0.1.1.tar.gz:

Publisher: cd.yml on interactive-arc/interactive-arc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interactive_arc-0.1.1.tar.gz
- Subject digest: 44e02a9cf501a11b0efb203c6964b3b9a37f69ce1d14716bc9548a9e59ec4bf0
- Sigstore transparency entry: 1567810638
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: interactive-arc/interactive-arc@8f7fb0beb65893661f3ff82f2e536de8cf392fd7
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/interactive-arc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: cd.yml@8f7fb0beb65893661f3ff82f2e536de8cf392fd7
- Trigger Event: push

File details

Details for the file interactive_arc-0.1.1-py3-none-any.whl.

File metadata

Download URL: interactive_arc-0.1.1-py3-none-any.whl
Upload date: May 18, 2026
Size: 43.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for interactive_arc-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `f43adac0a30c8fa510abe74378d9c79ed23b9c749b4d0f5ec1ffd97e9a2c38a6` |
| MD5 | `f15ed032a9283b514cf8df6befbf31d2` |
| BLAKE2b-256 | `4482ba6efaaf7204c121022496794f790da17502ef425cfdd6518b88abf8a8bb` |

See more details on using hashes here.

Provenance

The following attestation bundles were made for interactive_arc-0.1.1-py3-none-any.whl:

Publisher: cd.yml on interactive-arc/interactive-arc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interactive_arc-0.1.1-py3-none-any.whl
- Subject digest: f43adac0a30c8fa510abe74378d9c79ed23b9c749b4d0f5ec1ffd97e9a2c38a6
- Sigstore transparency entry: 1567810967
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: interactive-arc/interactive-arc@8f7fb0beb65893661f3ff82f2e536de8cf392fd7
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/interactive-arc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: cd.yml@8f7fb0beb65893661f3ff82f2e536de8cf392fd7
- Trigger Event: push
