SIA: Self-Improving Auto-researcher — an autonomous AI scientist framework

Project description

SIA (Self-Improving Auto-researcher)

Our goal is to build a self-improving AI scientist that autonomously improves its own performance on scientific tasks.

Results

Below are example results showing progressive improvement of SIA on scientific tasks:

Figure: Model performance plots show the improvement of SIA over multiple generations of self-improvement across tasks.

Overview

Figure: How the orchestrator runs Meta-, Target, and Feedback agents over successive generations.

SIA operates by coordinating three main types of AI agents that work together to continuously improve task performance:

Glossary

Meta-Agent: Reads the task description and generates an initial Target Agent tailored to the task.
Target Agent: Attempts to complete the task and records its actions and results.
Feedback/Improvement Agent: Reviews the Target Agent's performance logs, identifies improvements, and updates the Target Agent accordingly.

This iterative process allows the system to autonomously refine and enhance its ability to solve scientific tasks.
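
The loop described above can be sketched in a few lines. This is a toy illustration, not SIA's orchestrator: `meta_agent`, `run_target_agent`, `feedback_agent`, and `evolve` are hypothetical stand-ins that work on plain strings and a fake score instead of calling an LLM. It also assumes that no feedback step runs after the final generation.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    number: int
    agent_source: str    # stands in for target_agent.py
    execution_log: dict  # stands in for agent_execution.json


def meta_agent(task_description: str) -> str:
    """Hypothetical stand-in: write the initial target agent for the task."""
    return f"# agent for: {task_description}"


def run_target_agent(agent_source: str) -> dict:
    """Hypothetical stand-in: 'run' the agent and log a fake score.

    The score is the number of lines in the agent, so it is deterministic.
    """
    return {"score": len(agent_source.splitlines())}


def feedback_agent(agent_source: str, log: dict) -> str:
    """Hypothetical stand-in: review the log and return a revised agent."""
    return agent_source + f"\n# revised after score={log['score']}"


def evolve(task_description: str, max_gen: int) -> list[Generation]:
    history = []
    agent = meta_agent(task_description)        # happens once, before generation 1
    for n in range(1, max_gen + 1):
        log = run_target_agent(agent)           # target agent attempts the task
        history.append(Generation(n, agent, log))
        if n < max_gen:                         # assumption: no feedback after the last run
            agent = feedback_agent(agent, log)  # produces the next generation's agent
    return history


for gen in evolve("toy task", max_gen=3):
    print(gen.number, gen.execution_log["score"])
```

Each `Generation` record mirrors one `gen_{n}/` directory: the agent that ran and the log it produced.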

Directory Structure

sia/
├── sia/
│   ├── orchestrator.py           # Main orchestration logic
│   ├── context_manager.py        # Run/context tracking
│   ├── util.py                   # Agent runner utilities
│   ├── prepare_mlebench_dataset.py    # Dataset preparation script
│   └── tasks/                    # Bundled with the wheel
│       ├── _shared/
│       │   ├── reference_target_agent.py
│       │   └── sample_agent_execution.json
│       └── {task-id}/            # gpqa, lawbench, longcot-chess, spaceship-titanic
│           ├── data/
│           │   ├── public/       # Public dataset
│           │   │   ├── task.md       # Task description
│           │   │   └── *.csv         # Data files
│           │   └── private/      # Private evaluation data
│           └── reference/
│               ├── SAMPLE_TASK_DESCRIPTIONS.md
│               └── reference_target_agent.py
└── runs/                         # Generated during execution
    └── run_{id}/
        ├── venv/                 # Isolated Python environment
        └── gen_{n}/              # Each generation's artifacts
            ├── target_agent.py
            ├── agent_execution.json
            └── improvement.md    # (from gen_2 onwards)

Setup

Prerequisites

Python 3.11+ with venv support

Create a virtual environment (recommended):

python3 -m venv .venv
source .venv/bin/activate

Install sia-agent from PyPI (recommended) — this ships the four built-in tasks with the wheel:

pip install 'sia-agent[claude]'
# or, for the OpenHands backend:
pip install 'sia-agent[openhands]'

For development from a clone of this repo:

pip install -e '.[dev,claude]'

API Keys: Set the appropriate API keys based on which backend and models you plan to use:

For Claude Code backend (default):

export ANTHROPIC_API_KEY="your-anthropic-api-key"

For OpenHands backend with multiple LLMs:

# For Claude models via OpenHands
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# For Gemini models via OpenHands
export GOOGLE_API_KEY="your-google-api-key"
# OR
export GEMINI_API_KEY="your-gemini-api-key"

# For GPT models via OpenHands
export OPENAI_API_KEY="your-openai-api-key"

# Generic fallback (if specific keys not set)
export LLM_API_KEY="your-api-key"
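
A pre-flight check along these lines can confirm that a key is present before starting a run. This is a hypothetical helper, not part of sia-agent: `PROVIDER_KEYS` and `has_api_key` are illustrative names, and the sketch assumes the generic `LLM_API_KEY` fallback applies only to the OpenHands backend, as the grouping above suggests.

```python
import os

# Environment variables named in this README, grouped by model provider.
PROVIDER_KEYS = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY", "GEMINI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
}


def has_api_key(backend: str, provider: str, env=None) -> bool:
    """Return True if at least one usable key is set for this backend/provider."""
    env = os.environ if env is None else env
    candidates = list(PROVIDER_KEYS[provider])
    if backend == "openhands":
        # Assumption: the generic fallback is honored by the OpenHands backend only.
        candidates.append("LLM_API_KEY")
    return any(env.get(name) for name in candidates)


if not has_api_key("claude", "anthropic"):
    print("ANTHROPIC_API_KEY is not set; the default claude backend will likely fail.")
```

Passing `env` explicitly makes the check easy to try without touching the real environment.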

Example Usage

Quick start — run a bundled task

The wheel ships with four ready-to-run tasks: gpqa, lawbench, longcot-chess, spaceship-titanic.

sia --task gpqa --max_gen 2 --run_id 1

That's it — no clone, no dataset setup. To use a different bundled task, swap the name (e.g., --task spaceship-titanic).

Using SIA to build a custom task

If you want to run SIA on your own dataset, prepare a task directory with the layout below and point --task_dir at it.

Step 1: Set Up Your Custom Task Directory and Assets

Create the task directory structure:

mkdir -p my-tasks/gpqa/{data/public,data/private,reference}

Add your dataset and task description:
- Place your dataset files in the appropriate folders:
  - Public questions:
```
cp questions.json my-tasks/gpqa/data/public/
```
  - Private answers (ground truth):
```
cp answers.json my-tasks/gpqa/data/private/
```
  Note: The LLM is NOT provided any context about the private/ folder during evaluation. This prevents cheating and ensures fair assessment.
- Write the task description in my-tasks/gpqa/data/public/task.md.

Copy the reference agent template (from a clone of this repo):

cp sia/tasks/_shared/reference_target_agent.py my-tasks/gpqa/reference/

(Optional) Add sample task descriptions: create my-tasks/gpqa/reference/SAMPLE_TASK_DESCRIPTIONS.md with examples of similar tasks. These give the meta-agent broader context, which helps the generated agent generalize instead of overfitting to one specific task.
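
The directory setup in Step 1 can also be scripted. The sketch below is an illustration only: `scaffold_task` is a hypothetical helper, not something sia-agent provides, and the stub text written to task.md is a placeholder. It does not copy datasets or the reference agent.

```python
import tempfile
from pathlib import Path


def scaffold_task(root: Path, task_id: str) -> Path:
    """Create the empty directory layout for a custom task plus a stub task.md."""
    task_dir = root / task_id
    for sub in ("data/public", "data/private", "reference"):
        (task_dir / sub).mkdir(parents=True, exist_ok=True)
    task_md = task_dir / "data" / "public" / "task.md"
    if not task_md.exists():  # never overwrite a description that is already there
        task_md.write_text("# Task description\n\nDescribe the task here.\n")
    return task_dir


# Demo in a throwaway directory so nothing is written to the working tree.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_task(Path(tmp), "my-task")
    print(sorted(p.relative_to(created).as_posix() for p in created.rglob("*")))
```

For real use, call `scaffold_task(Path("my-tasks"), "gpqa")` and then copy the dataset files and reference agent as described above.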

Step 2: Run the Orchestrator

Bundled task (Claude backend):

sia --task gpqa --max_gen 5 --run_id 1

External custom task:

sia --task_dir ./my-tasks/gpqa --max_gen 5 --run_id 1

Using OpenHands with Gemini:

sia \
  --task gpqa \
  --max_gen 5 \
  --run_id 1 \
  --backend openhands \
  --meta_model "gemini/gemini-3.1-pro-preview"

Key Arguments:

--task: Name of a bundled task (gpqa, lawbench, longcot-chess, spaceship-titanic). Mutually exclusive with --task_dir.
--task_dir: Path to an external task directory. Mutually exclusive with --task.
--max_gen: Number of generations to evolve (default: 3)
--run_id: Unique identifier for this run (default: 1)
--backend: Agent backend to use: claude (default) or openhands
--meta_model: Model for meta/feedback agents (default: haiku)

See the Configuration section below for detailed backend and model options.

What happens during execution:

Generation 1:
- Meta-agent reads task and creates initial target_agent.py
- Target agent executes task and logs to agent_execution.json
- Feedback agent analyzes and creates improved agent for Gen 2
Generation 2-N:
- Target agent from current generation executes task
- Feedback agent analyzes and creates next generation
- Continues until max_gen is reached
Output:
- All artifacts saved in runs/run_{run_id}/gen_{n}/
- Each generation has its own target_agent.py and execution logs
- Improvement notes in improvement.md (from gen_2 onwards)
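
To check that a run completed, it can help to enumerate the files it should have produced. The sketch below derives that list from the layout documented in this README; `expected_artifacts` is a hypothetical helper and assumes every generation up to `max_gen` finished.

```python
from pathlib import PurePosixPath


def expected_artifacts(run_id: int, max_gen: int) -> list[str]:
    """List the files a completed run should contain, per the documented layout."""
    paths = []
    for n in range(1, max_gen + 1):
        gen_dir = PurePosixPath("runs") / f"run_{run_id}" / f"gen_{n}"
        paths.append(str(gen_dir / "target_agent.py"))
        paths.append(str(gen_dir / "agent_execution.json"))
        if n >= 2:  # improvement.md is documented as present from gen_2 onwards
            paths.append(str(gen_dir / "improvement.md"))
    return paths


for path in expected_artifacts(run_id=1, max_gen=2):
    print(path)
```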

Step 3: Analyze Results

# View execution logs
cat runs/run_1/gen_1/agent_execution.json

# View improvements made
cat runs/run_1/gen_2/improvement.md

# Compare agent versions
diff runs/run_1/gen_1/target_agent.py runs/run_1/gen_2/target_agent.py
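
The same comparison can be done from Python when you want to post-process the diff. This is a minimal sketch using the standard library's `difflib`; `diff_generations` is a hypothetical helper name, and it only assumes the `gen_{n}/target_agent.py` layout shown above.

```python
import difflib
import tempfile
from pathlib import Path


def diff_generations(run_dir: Path, gen_a: int, gen_b: int) -> str:
    """Return a unified diff of target_agent.py between two generations."""
    path_a = run_dir / f"gen_{gen_a}" / "target_agent.py"
    path_b = run_dir / f"gen_{gen_b}" / "target_agent.py"
    lines_a = path_a.read_text().splitlines(keepends=True)
    lines_b = path_b.read_text().splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(lines_a, lines_b, fromfile=str(path_a), tofile=str(path_b))
    )


# Demo on two throwaway generations.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run_1"
    for n, body in [(1, "print('v1')\n"), (2, "print('v2')\n")]:
        (run / f"gen_{n}").mkdir(parents=True)
        (run / f"gen_{n}" / "target_agent.py").write_text(body)
    print(diff_generations(run, 1, 2))
```

For a real run, pass `Path("runs/run_1")` as `run_dir`.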

Task Requirements

Each task directory must follow this structure:

{task-id}/
├── data/
│   ├── public/
│   │   ├── task.md                    # Task description (orchestrator reads this)
│   │   ├── train.csv
│   │   ├── test.csv
│   │   └── sample_submission.csv
│   └── private/
│       └── ...                        # Private evaluation data
└── reference/
    ├── SAMPLE_TASK_DESCRIPTIONS.md    # Similar tasks (for meta-agent context)
    └── reference_target_agent.py      # Template agent structure
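
Before pointing --task_dir at a new directory, a quick layout check can catch missing pieces. The sketch below is not part of sia-agent: `REQUIRED` and `missing_task_paths` are hypothetical names, and treating SAMPLE_TASK_DESCRIPTIONS.md as optional (so it is not checked) follows the custom-task walkthrough rather than any documented validation rule.

```python
import tempfile
from pathlib import Path

# Paths listed in the structure above, relative to the task directory.
REQUIRED = (
    "data/public/task.md",
    "data/private",
    "reference/reference_target_agent.py",
)


def missing_task_paths(task_dir: Path) -> list[str]:
    """Return the required paths that do not exist (empty list means OK)."""
    return [rel for rel in REQUIRED if not (task_dir / rel).exists()]


# Demo: a task directory that only has task.md so far.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "my-task"
    (task / "data" / "public").mkdir(parents=True)
    (task / "data" / "public" / "task.md").write_text("Describe the task.\n")
    print(missing_task_paths(task))
    # prints ['data/private', 'reference/reference_target_agent.py']
```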

Running SIA on an MLE-Bench task

Use the prepare_mlebench_dataset.py script (located under sia/ in a clone of this repo, per the directory structure above) to prepare a task dataset from MLE-Bench:

python sia/prepare_mlebench_dataset.py -c "spaceship-titanic"

This will:

Run mlebench prepare -c "spaceship-titanic"
Copy public and private datasets from ~/.cache/mle-bench/data/prepared/
Rename description.md to task.md in data/public/
Use Gemini to generate similar tasks (optional)
Create SAMPLE_TASK_DESCRIPTIONS.md in reference/
Copy reference_target_agent.py from _shared/ to reference/

Options:

--skip-gemini: Skip Gemini API call for similar tasks
--tasks-dir PATH: Specify custom tasks directory (default: ./tasks)

If you skip the Gemini step, you can create SAMPLE_TASK_DESCRIPTIONS.md manually in reference/.

Troubleshooting

"Run directory already exists"

The orchestrator prevents overwriting existing runs. Either:

Use a different --run_id
Delete the existing run: rm -rf runs/run_1
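
If you script many runs, picking an unused --run_id automatically avoids this error. The sketch below assumes run IDs are positive integers and that run directories are named run_{id}, as in the examples in this README; `next_free_run_id` is a hypothetical helper.

```python
import re
from typing import Iterable


def next_free_run_id(existing_names: Iterable[str]) -> int:
    """Return the smallest positive integer N for which 'run_N' is not taken."""
    used = {
        int(match.group(1))
        for name in existing_names
        if (match := re.fullmatch(r"run_(\d+)", name))
    }
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


print(next_free_run_id(["run_1", "run_2", "notes.txt"]))  # prints 3
```

With a real runs/ directory, pass the entry names: `next_free_run_id(p.name for p in Path("runs").iterdir())`.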

"No GEMINI_API_KEY environment variable set"

Without this variable, the prepare script skips similar-task generation. Either:

Set the environment variable: export GEMINI_API_KEY="your-key"
Use --skip-gemini flag to skip this step

Target agent fails during execution

Check the logs in the generation directory:

cat runs/run_1/gen_1/agent_execution.json

Common issues:

Dataset paths incorrect (ensure absolute paths are used)
Missing Python packages in the venv
ANTHROPIC_API_KEY not set

ImportError: No module named 'anthropic'

The orchestrator creates a fresh venv for each run. If packages are missing:

Check the venv creation in the orchestrator logs
Manually install: runs/run_1/venv/bin/pip install anthropic

Configuration

Agent Backend Selection

SIA supports two agent backends:

1. Claude Code Backend (Default)

Uses the Claude Agent SDK with Claude models only:

sia \
  --task gpqa \
  --max_gen 5 \
  --run_id 1 \
  --backend claude \
  --meta_model haiku

Supported Models:

haiku (claude-haiku-4-5-20251001)
sonnet (claude-sonnet-4-5-20250929)
opus (claude-opus-4-5-20251101)

2. OpenHands Backend

Uses the OpenHands SDK with support for multiple LLM providers:

sia \
  --task gpqa \
  --max_gen 5 \
  --run_id 2 \
  --backend openhands \
  --meta_model "gemini/gemini-3.1-pro-preview"

Supported Models:

Google Gemini:

--meta_model "gemini/gemini-3.0-pro"
--meta_model "gemini/gemini-3.1-pro-preview"

OpenAI GPT:

--meta_model "openai/gpt-4"
--meta_model "openai/gpt-4-turbo"

Anthropic Claude (via OpenHands):

--meta_model "anthropic/claude-sonnet-4-5-20250929"
--meta_model "anthropic/claude-opus-4-5-20251101"

Complete Example: Testing Multiple LLMs

# Run 1: Claude via Claude Code (default)
sia \
  --task gpqa \
  --max_gen 3 \
  --run_id 1 \
  --backend claude \
  --meta_model haiku

# Run 2: Gemini via OpenHands
sia \
  --task gpqa \
  --max_gen 3 \
  --run_id 2 \
  --backend openhands \
  --meta_model "gemini/gemini-3.1-pro-preview"

# Run 3: GPT-4 via OpenHands
sia \
  --task gpqa \
  --max_gen 3 \
  --run_id 3 \
  --backend openhands \
  --meta_model "openai/gpt-4"

Command-Line Arguments Reference

Argument	Required	Default	Description
`--task`	One of `--task` / `--task_dir`	-	Name of a bundled task (`gpqa`, `lawbench`, `longcot-chess`, `spaceship-titanic`); mutually exclusive with `--task_dir`
`--task_dir`	One of `--task` / `--task_dir`	-	Path to an external task directory; mutually exclusive with `--task`
`--max_gen`	No	3	Number of improvement generations
`--run_id`	No	1	Unique run identifier
`--backend`	No	`claude`	Agent backend: `claude` or `openhands`
`--meta_model`	No	`haiku`	Model for meta and feedback agents
`--task_model`	No	`claude-haiku-4-5-20251001`	Model for target agent execution
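
The table can be read as the following argparse declaration. This is a sketch reconstructed from the documented flags and defaults, not SIA's actual parser; `build_parser` is a hypothetical name, and the argument types (for example, keeping --run_id as a string) are assumptions.

```python
import argparse

BUNDLED_TASKS = ["gpqa", "lawbench", "longcot-chess", "spaceship-titanic"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sia")
    # Exactly one of --task / --task_dir must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--task", choices=BUNDLED_TASKS)
    source.add_argument("--task_dir")
    parser.add_argument("--max_gen", type=int, default=3)
    parser.add_argument("--run_id", default="1")  # type is an assumption
    parser.add_argument("--backend", choices=["claude", "openhands"], default="claude")
    parser.add_argument("--meta_model", default="haiku")
    parser.add_argument("--task_model", default="claude-haiku-4-5-20251001")
    return parser


args = build_parser().parse_args(["--task", "gpqa", "--max_gen", "5"])
print(args.backend, args.meta_model, args.max_gen)  # prints: claude haiku 5
```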

Model Selection

Both --meta_model and --task_model default to Claude Haiku (haiku, i.e. claude-haiku-4-5-20251001). To use a different model, pass --meta_model and --task_model as shown above.

Important Notes:

When using the claude backend, only Claude model names are supported (haiku, sonnet, opus)
When using the openhands backend, use fully-qualified model names (e.g., gemini/gemini-3.1-pro-preview)
Ensure the appropriate API keys are set in your environment for the models you choose
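
The naming rules in these notes can be checked before launching a run. The sketch below is illustrative only: `CLAUDE_ALIASES` and `check_model` are hypothetical names, and accepting the full Claude model IDs alongside the short aliases is an assumption based on the --task_model default.

```python
# Short aliases and the model IDs this README lists for the claude backend.
CLAUDE_ALIASES = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-5-20250929",
    "opus": "claude-opus-4-5-20251101",
}


def check_model(backend: str, model: str) -> str:
    """Return the model name if it fits the backend's naming rule, else raise."""
    if backend == "claude":
        # Assumption: full IDs are accepted as well as the short aliases.
        if model not in CLAUDE_ALIASES and model not in CLAUDE_ALIASES.values():
            raise ValueError(f"claude backend expects a Claude model, got {model!r}")
        return model
    if backend == "openhands":
        provider, sep, name = model.partition("/")
        if not (provider and sep and name):
            raise ValueError(f"openhands backend expects 'provider/model', got {model!r}")
        return model
    raise ValueError(f"unknown backend {backend!r}")


print(check_model("openhands", "gemini/gemini-3.1-pro-preview"))
```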

Customizing Prompts

Edit the prompts in orchestrator.py:

META_AGENT_PROMPT: Controls how the initial agent is created
FEEDBACK_AGENT_PROMPT: Controls how improvements are suggested

Project details

Maintainers

selvamhexo

Release history

0.2.1 (this version): May 27, 2026
0.2.0: May 27, 2026
0.1.1: May 20, 2026

Download files

Source Distribution

sia_agent-0.2.1.tar.gz (3.7 MB)

Uploaded May 27, 2026 Source

Built Distribution

sia_agent-0.2.1-py3-none-any.whl (3.7 MB)

Uploaded May 27, 2026 Python 3

File details

Details for the file sia_agent-0.2.1.tar.gz.

File metadata

Download URL: sia_agent-0.2.1.tar.gz
Upload date: May 27, 2026
Size: 3.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sia_agent-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`26bfa72eb07d652c9e69bbbe49377a745039014c18cfe635804e15c346bd93fc`
MD5	`2205272fff86041d170f139d8995d848`
BLAKE2b-256	`bd4e3129f239efd5449db5371dc10d9f6e370522513a60a3c39ff40cc13d80ea`
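
A downloaded file can be checked against the SHA256 digest above with the standard library. This is a minimal sketch: `sha256_hex` is a hypothetical helper name, and the local file name in the usage comment assumes you saved the sdist under its published name.

```python
import hashlib
import io


def sha256_hex(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Self-check against the well-known SHA-256 of empty input.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
assert sha256_hex(io.BytesIO(b"")) == EMPTY_SHA256

# Digest published above for sia_agent-0.2.1.tar.gz.
PUBLISHED_SHA256 = "26bfa72eb07d652c9e69bbbe49377a745039014c18cfe635804e15c346bd93fc"

# Usage (assumes the file is in the current directory):
# with open("sia_agent-0.2.1.tar.gz", "rb") as fh:
#     print(sha256_hex(fh) == PUBLISHED_SHA256)
```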

Provenance

The following attestation bundles were made for sia_agent-0.2.1.tar.gz:

Publisher: publish.yml on hexo-ai/sia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sia_agent-0.2.1.tar.gz
- Subject digest: 26bfa72eb07d652c9e69bbbe49377a745039014c18cfe635804e15c346bd93fc
- Sigstore transparency entry: 1646042190
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: hexo-ai/sia@6eedecf934a8471bc5867baa4a258cb38d07ca7f
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/hexo-ai
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6eedecf934a8471bc5867baa4a258cb38d07ca7f
- Trigger Event: push

File details

Details for the file sia_agent-0.2.1-py3-none-any.whl.

File metadata

Download URL: sia_agent-0.2.1-py3-none-any.whl
Upload date: May 27, 2026
Size: 3.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sia_agent-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f03ca47b71fe530322717e8731ca8d8a9a9a11ac357abe3187e93bd2d38da16b`
MD5	`5f3f718b831d316b40e40b3d2c34c813`
BLAKE2b-256	`c5afe1c8686e5b192fca060a54fd7902bd77d4b5a0cd5f64350f97b13754f99c`

Provenance

The following attestation bundles were made for sia_agent-0.2.1-py3-none-any.whl:

Publisher: publish.yml on hexo-ai/sia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sia_agent-0.2.1-py3-none-any.whl
- Subject digest: f03ca47b71fe530322717e8731ca8d8a9a9a11ac357abe3187e93bd2d38da16b
- Sigstore transparency entry: 1646042277
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: hexo-ai/sia@6eedecf934a8471bc5867baa4a258cb38d07ca7f
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/hexo-ai
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6eedecf934a8471bc5867baa4a258cb38d07ca7f
- Trigger Event: push
