SIA: Self-Improving Auto-researcher — an autonomous AI scientist framework
SIA (Self-Improving Auto-researcher)
Our goal is to build a self-improving AI scientist that autonomously improves its own performance on scientific tasks.
Results
Below are example results showing progressive improvement of SIA on scientific tasks:
Figure: Model performance plots show the improvement of SIA over multiple generations of self-improvement across tasks.
Overview
Figure: How the orchestrator runs Meta-, Target, and Feedback agents over successive generations.
SIA operates by coordinating three main types of AI agents that work together to continuously improve task performance:
Glossary
- Meta-Agent: Reads the task description and generates an initial Target Agent tailored to the task.
- Target Agent: Attempts to complete the task and records its actions and results.
- Feedback/Improvement Agent: Reviews the Target Agent's performance logs, identifies improvements, and updates the Target Agent accordingly.
This iterative process allows the system to autonomously refine and enhance its ability to solve scientific tasks.
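
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not SIA's actual orchestrator: the `Generation` record, the `run_generations` function, and the three callables passed to it are hypothetical names introduced here, and real agents would call an LLM backend rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Generation:
    """One generation's artifacts (hypothetical record, for illustration only)."""
    index: int
    agent_source: str
    execution_log: dict
    improvement_notes: str = ""


def run_generations(
    task_description: str,
    meta_agent: Callable[[str], str],
    target_runner: Callable[[str], dict],
    feedback_agent: Callable[[str, dict], Tuple[str, str]],
    max_gen: int = 3,
) -> List[Generation]:
    """Run the meta -> target -> feedback cycle for max_gen generations."""
    history: List[Generation] = []
    # The meta-agent is used once, to create the first target agent.
    agent_source = meta_agent(task_description)
    notes = ""
    for n in range(1, max_gen + 1):
        # The target agent attempts the task and records what happened.
        log = target_runner(agent_source)
        history.append(Generation(n, agent_source, log, notes))
        if n < max_gen:
            # The feedback agent reviews the log and produces the next agent
            # together with notes describing what it changed.
            agent_source, notes = feedback_agent(agent_source, log)
    return history


# Stand-in agents so the sketch runs without any LLM backend.
demo = run_generations(
    "toy task",
    meta_agent=lambda task: "v1",
    target_runner=lambda src: {"agent": src},
    feedback_agent=lambda src, log: (src + "+", "improved " + src),
    max_gen=3,
)
```

In this sketch the improvement notes are attached to the generation they produced, which matches the layout where `improvement.md` first appears in `gen_2`.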
Directory Structure
sia/
├── orchestration/
│   ├── orchestrator.py              # Main orchestration logic
│   ├── meta_agent.py                # Meta-agent implementation
│   ├── feedback_agent.py            # Feedback agent implementation
│   └── prepare_mlebench_dataset.py  # Dataset preparation script
├── tasks/
│   ├── _shared/
│   │   ├── reference_target_agent.py
│   │   └── sample_agent_execution.json
│   └── {task-id}/
│       ├── data/
│       │   ├── public/              # Public dataset
│       │   │   ├── task.md          # Task description
│       │   │   └── *.csv            # Data files
│       │   └── private/             # Private dataset
│       └── reference/
│           ├── SAMPLE_TASK_DESCRIPTIONS.md
│           └── reference_target_agent.py
└── runs/                            # Generated during execution
    └── run_{id}/
        ├── venv/                    # Isolated Python environment
        └── gen_{n}/                 # Each generation's artifacts
            ├── target_agent.py
            ├── agent_execution.json
            └── improvement.md       # (from gen_2 onwards)
Setup
Prerequisites
- Python 3.11+ with venv support
- Create a virtual environment (recommended):
      python3 -m venv .venv
      source .venv/bin/activate
- Install required dependencies from requirements.txt:
      pip install -r requirements.txt
- API Keys: Set the appropriate API keys based on which backend and models you plan to use:
  For Claude Code backend (default):
      export ANTHROPIC_API_KEY="your-anthropic-api-key"
  For OpenHands backend with multiple LLMs:
      # For Claude models via OpenHands
      export ANTHROPIC_API_KEY="your-anthropic-api-key"
      # For Gemini models via OpenHands
      export GOOGLE_API_KEY="your-google-api-key"
      # OR
      export GEMINI_API_KEY="your-gemini-api-key"
      # For GPT models via OpenHands
      export OPENAI_API_KEY="your-openai-api-key"
      # Generic fallback (if specific keys not set)
      export LLM_API_KEY="your-api-key"
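
Before starting a long run it can help to confirm that a usable key is actually exported. The sketch below only mirrors the variable names listed above; the `PROVIDER_KEYS` table, the provider labels, and the `find_api_key_var` helper are hypothetical, and the order in which SIA itself resolves keys is not confirmed by this document.

```python
import os

# Environment variables per provider, as listed in the setup notes above.
# The provider labels used as dictionary keys are chosen for this sketch.
PROVIDER_KEYS = {
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY", "GEMINI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
}
FALLBACK_KEY = "LLM_API_KEY"


def find_api_key_var(provider, env=None):
    """Return the name of the first non-empty key variable, or None."""
    env = os.environ if env is None else env
    # Try the provider-specific variables first, then the generic fallback.
    for name in PROVIDER_KEYS.get(provider, []) + [FALLBACK_KEY]:
        if env.get(name):
            return name
    return None
```

Calling `find_api_key_var("gemini")` in a shell where no key is exported returns `None`, which is a cheap check to run before launching the orchestrator.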
Example Usage
Using SIA to Build a SOTA Scientific Reasoning Agent
Step 1: Set Up Your Custom Task Directory and Assets
To create a new custom task (e.g., for GPQA), follow these streamlined steps:
- Create the task directory structure:
      mkdir -p tasks/gpqa/{data/public,data/private,reference}
- Add your dataset and task description:
  - Place your dataset files in the appropriate folders:
    - Public questions:
          cp questions.json tasks/gpqa/data/public/
    - Private answers, ground truths:
          cp answers.json tasks/gpqa/data/private/
    Note: The LLM is NOT provided any context about the private/ folder during evaluation. This prevents cheating and ensures fair assessment.
  - Write the task description in tasks/gpqa/data/public/task.md. Example content:
        # GPQA - Graduate-Level Google-Proof Q&A
        Answer graduate-level science questions across physics, chemistry, and biology.
        Each question has multiple choice answers. Select the correct answer.
        ## Data Format
        - questions.json: Contains questions with multiple choice options
- Copy the reference agent template:
      cp tasks/_shared/reference_target_agent.py tasks/gpqa/reference/
- (Optional) Add sample task descriptions: You may create tasks/gpqa/reference/SAMPLE_TASK_DESCRIPTIONS.md with examples of similar tasks. This helps the agent generalize and avoids overfitting to the specific task, if that is your intention.
Step 2: Run the Orchestrator
Basic Usage (Claude backend):
python orchestration/orchestrator.py --task_dir ./tasks/gpqa --max_gen 5 --run_id 1
Using OpenHands with Gemini:
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 5 \
--run_id 1 \
--backend openhands \
--meta_model "gemini/gemini-3.1-pro-preview"
Key Arguments:
- --task_dir: Path to the task directory (e.g., ./tasks/spaceship-titanic)
- --max_gen: Number of generations to evolve (default: 3)
- --run_id: Unique identifier for this run (default: 1)
- --backend: Agent backend to use: claude (default) or openhands
- --meta_model: Model for meta/feedback agents (default: haiku)
See the Configuration section below for detailed backend and model options.
What happens during execution:
- Generation 1:
  - Meta-agent reads task and creates initial target_agent.py
  - Target agent executes task and logs to agent_execution.json
  - Feedback agent analyzes and creates improved agent for Gen 2
- Generation 2-N:
  - Target agent from current generation executes task
  - Feedback agent analyzes and creates next generation
  - Continues until max_gen is reached
- Output:
  - All artifacts saved in runs/run_{run_id}/gen_{n}/
  - Each generation has its own target_agent.py and execution logs
  - Improvement notes in improvement.md
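
The output layout above can be expressed as a small path helper, which is handy when scripting checks over a finished run. This is a sketch based only on the file names given in this README; `expected_artifacts` and its `runs_root` parameter are hypothetical and not part of SIA.

```python
from pathlib import Path


def expected_artifacts(run_id, gen, runs_root="runs"):
    """List the files a generation directory is expected to contain."""
    gen_dir = Path(runs_root) / f"run_{run_id}" / f"gen_{gen}"
    names = ["target_agent.py", "agent_execution.json"]
    # improvement.md is written by the feedback agent, so it only exists
    # from the second generation onwards.
    if gen >= 2:
        names.append("improvement.md")
    return [gen_dir / name for name in names]
```

A post-run check could then report any path in `expected_artifacts(run_id, n)` for which `.exists()` is false.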
Step 3: Analyze Results
# View execution logs
cat runs/run_1/gen_1/agent_execution.json
# View improvements made
cat runs/run_1/gen_2/improvement.md
# Compare agent versions
diff runs/run_1/gen_1/target_agent.py runs/run_1/gen_2/target_agent.py
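
The same comparison can be done from Python, for example to diff every pair of consecutive generations in a loop. The sketch below uses the standard-library `difflib` module; `diff_generations` is a hypothetical helper, and the directory names follow the `runs/run_{id}/gen_{n}/` layout described earlier.

```python
import difflib
from pathlib import Path


def diff_generations(run_dir, gen_a, gen_b, filename="target_agent.py"):
    """Return a unified diff of one file between two generations."""
    old_path = Path(run_dir) / f"gen_{gen_a}" / filename
    new_path = Path(run_dir) / f"gen_{gen_b}" / filename
    old_lines = old_path.read_text().splitlines(keepends=True)
    new_lines = new_path.read_text().splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=str(old_path), tofile=str(new_path)
    )
    return "".join(diff)
```

For instance, `print(diff_generations("runs/run_1", 1, 2))` shows what the feedback agent changed between the first two generations; an empty string means the files are identical.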
Task Requirements
Each task directory must follow this structure:
tasks/{task-id}/
├── data/
│   ├── public/
│   │   ├── task.md                  # Task description (orchestrator reads this)
│   │   ├── train.csv
│   │   ├── test.csv
│   │   └── sample_submission.csv
│   └── private/
│       └── ...                      # Private evaluation data
└── reference/
    ├── SAMPLE_TASK_DESCRIPTIONS.md  # Similar tasks (for meta-agent context)
    └── reference_target_agent.py    # Template agent structure
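
A quick pre-flight check can catch a misplaced file before a run starts. The sketch below is an assumption-laden illustration: the `REQUIRED_PATHS` list is a minimal subset chosen here (data file names vary by task, and `SAMPLE_TASK_DESCRIPTIONS.md` is described as optional in Step 1), and `missing_task_paths` is a hypothetical helper rather than something the orchestrator provides.

```python
from pathlib import Path

# Minimal set of paths assumed necessary for this sketch; adjust per task.
REQUIRED_PATHS = [
    "data/public/task.md",
    "data/private",
    "reference/reference_target_agent.py",
]


def missing_task_paths(task_dir):
    """Return the required relative paths that do not exist under task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]
```

An empty list from `missing_task_paths("./tasks/gpqa")` means the checked paths are all present; it says nothing about whether the data files themselves are well formed.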
Running SIA on an MLE-Bench Task
Use the prepare_mlebench_dataset.py script to prepare a task dataset from MLE-Bench:
python orchestration/prepare_mlebench_dataset.py -c "spaceship-titanic"
This will:
- Run mlebench prepare -c "spaceship-titanic"
- Copy public and private datasets from ~/.cache/mle-bench/data/prepared/
- Rename description.md to task.md in data/public/
- Use Gemini to generate similar tasks (optional)
- Create SAMPLE_TASK_DESCRIPTIONS.md in reference/
- Copy reference_target_agent.py from _shared/ to reference/
Options:
- --skip-gemini: Skip Gemini API call for similar tasks
  - Optionally create SAMPLE_TASK_DESCRIPTIONS.md manually in reference/
- --tasks-dir PATH: Specify custom tasks directory (default: ./tasks)
Troubleshooting
"Run directory already exists"
The orchestrator prevents overwriting existing runs. Either:
- Use a different --run_id
- Delete the existing run:
      rm -rf runs/run_1
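
When many runs accumulate, picking an unused `--run_id` by hand gets tedious. The sketch below scans the runs directory for the lowest free numeric id. It assumes run ids are positive integers and that run directories are named `run_{id}`; `next_free_run_id` is a hypothetical helper, not an orchestrator feature.

```python
import re
from pathlib import Path


def next_free_run_id(runs_root="runs"):
    """Return the smallest positive integer with no run_{id} directory."""
    root = Path(runs_root)
    used = set()
    if root.is_dir():
        for child in root.iterdir():
            # Only directories named exactly run_<digits> count as runs.
            match = re.fullmatch(r"run_(\d+)", child.name)
            if match and child.is_dir():
                used.add(int(match.group(1)))
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate
```

The returned value can be passed straight to `--run_id`, which avoids both the error and the need to delete earlier results.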
"No GEMINI_API_KEY environment variable set"
The prepare script will skip similar task generation. Either:
- Set the environment variable:
      export GEMINI_API_KEY="your-key"
- Use the --skip-gemini flag to skip this step
Target agent fails during execution
Check the logs in the generation directory:
cat runs/run_1/gen_1/agent_execution.json
Common issues:
- Dataset paths incorrect (ensure absolute paths are used)
- Missing Python packages in the venv
- ANTHROPIC_API_KEY not set
ImportError: No module named 'anthropic'
The orchestrator creates a fresh venv for each run. If packages are missing:
- Check the venv creation in the orchestrator logs
- Manually install:
      runs/run_1/venv/bin/pip install anthropic
Configuration
Agent Backend Selection
SIA supports two agent backends:
1. Claude Code Backend (Default)
Uses the Claude Agent SDK with Claude models only:
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 5 \
--run_id 1 \
--backend claude \
--meta_model haiku
Supported Models:
- haiku (claude-haiku-4-5-20251001)
- sonnet (claude-sonnet-4-5-20250929)
- opus (claude-opus-4-5-20251101)
2. OpenHands Backend
Uses the OpenHands SDK with support for multiple LLM providers:
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 5 \
--run_id 2 \
--backend openhands \
--meta_model "gemini/gemini-3.1-pro-preview"
Supported Models:
Google Gemini:
--meta_model "gemini/gemini-3.0-pro"
--meta_model "gemini/gemini-3.1-pro-preview"
OpenAI GPT:
--meta_model "openai/gpt-4"
--meta_model "openai/gpt-4-turbo"
Anthropic Claude (via OpenHands):
--meta_model "anthropic/claude-sonnet-4-5-20250929"
--meta_model "anthropic/claude-opus-4-5-20251101"
Complete Example: Testing Multiple LLMs
# Run 1: Claude via Claude Code (default)
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 3 \
--run_id 1 \
--backend claude \
--meta_model haiku
# Run 2: Gemini via OpenHands
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 3 \
--run_id 2 \
--backend openhands \
--meta_model "gemini/gemini-3.1-pro-preview"
# Run 3: GPT-4 via OpenHands
python orchestration/orchestrator.py \
--task_dir ./tasks/gpqa \
--max_gen 3 \
--run_id 3 \
--backend openhands \
--meta_model "openai/gpt-4"
Command-Line Arguments Reference
| Argument | Required | Default | Description |
|---|---|---|---|
| --task_dir | Yes | - | Path to task directory (e.g., ./tasks/gpqa) |
| --max_gen | No | 3 | Number of improvement generations |
| --run_id | No | 1 | Unique run identifier |
| --backend | No | claude | Agent backend: claude or openhands |
| --meta_model | No | haiku | Model for meta and feedback agents |
| --task_model | No | claude-haiku-4-5-20251001 | Model for target agent execution |
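
For readers who want to wrap or script the CLI, the table above maps naturally onto an `argparse` parser. This is a reconstruction from the table only, not the orchestrator's actual source: `build_parser` is a hypothetical name, and treating `--run_id` and `--max_gen` as integers is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the orchestrator CLI")
    parser.add_argument("--task_dir", required=True,
                        help="Path to task directory (e.g., ./tasks/gpqa)")
    parser.add_argument("--max_gen", type=int, default=3,
                        help="Number of improvement generations")
    # Assumption: run ids are integers; the real CLI may accept strings.
    parser.add_argument("--run_id", type=int, default=1,
                        help="Unique run identifier")
    parser.add_argument("--backend", choices=["claude", "openhands"],
                        default="claude", help="Agent backend")
    parser.add_argument("--meta_model", default="haiku",
                        help="Model for meta and feedback agents")
    parser.add_argument("--task_model", default="claude-haiku-4-5-20251001",
                        help="Model for target agent execution")
    return parser


# Parsing only the required argument leaves every default in place.
args = build_parser().parse_args(["--task_dir", "./tasks/gpqa"])
```

Using `choices` for `--backend` makes an unsupported backend fail at parse time instead of partway through a run.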
Model Selection
The default model is haiku (claude-haiku-4-5-20251001). To use a different model, use the --meta_model and --task_model arguments as shown above.
Important Notes:
- When using the claude backend, only Claude model names are supported (haiku, sonnet, opus)
- When using the openhands backend, use fully-qualified model names (e.g., gemini/gemini-3.1-pro-preview)
- Ensure the appropriate API keys are set in your environment for the models you choose
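
The naming rules in these notes can be checked up front with a few lines of Python. The sketch below is illustrative only: `resolve_model` is a hypothetical helper, the alias table copies the model IDs listed under the Claude Code backend, and accepting a full Claude model ID on the `claude` backend is an assumption based on the `--task_model` default.

```python
# Short names and model IDs as listed for the Claude Code backend.
CLAUDE_ALIASES = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-5-20250929",
    "opus": "claude-opus-4-5-20251101",
}


def resolve_model(backend, model):
    """Validate a model name for a backend and return the name to use."""
    if backend == "claude":
        if model in CLAUDE_ALIASES:
            return CLAUDE_ALIASES[model]
        # Assumption: a full Claude model ID is also acceptable here.
        if model in CLAUDE_ALIASES.values():
            return model
        raise ValueError(f"not a Claude model name: {model!r}")
    if backend == "openhands":
        # Fully-qualified names look like "<provider>/<model>".
        provider, sep, name = model.partition("/")
        if not (provider and sep and name):
            raise ValueError(f"expected provider/model, got {model!r}")
        return model
    raise ValueError(f"unknown backend: {backend!r}")
```

Running this check before launching a run turns a mismatched backend and model pair into an immediate, readable error.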
Customizing Prompts
Edit the prompts in orchestrator.py:
- META_AGENT_PROMPT: Controls how the initial agent is created
- FEEDBACK_AGENT_PROMPT: Controls how improvements are suggested