Automated research sprint platform for HPC clusters
ResearchLoop
Automated AI research sprints on HPC clusters.
ResearchLoop automates multi-step AI research pipelines on SLURM and SGE clusters. You describe a research idea, and ResearchLoop submits it to your HPC cluster where Claude Code executes a full research pipeline -- coding, red-teaming, fixing, reporting -- inside a single job. Results are reported back via webhooks, Slack, or push notifications, and you can monitor everything from a web dashboard or the CLI.
The platform is built for researchers who run experiments on shared HPC infrastructure and want to iterate faster without babysitting jobs. Define your studies, point ResearchLoop at your cluster, and let it handle the rest: job submission, progress tracking, artifact collection, and even automatic generation of follow-up research ideas.
ResearchLoop's auto-loop feature chains sprints together automatically. After each sprint completes, Claude analyzes the results and proposes the next experiment. You set how many iterations to run, and each sprint's results seed the next -- turning a single research question into a sustained investigation.
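The control flow of an auto-loop can be sketched in a few lines of Python. This is an illustration of the idea only, not ResearchLoop's actual controller: `run_auto_loop`, `run_sprint`, and `propose_next` are hypothetical names, and the two stand-in functions replace the real cluster job and the real Claude call so the sketch runs anywhere.

```python
from typing import Callable, List, Tuple


def run_auto_loop(
    first_idea: str,
    count: int,
    run_sprint: Callable[[str], str],
    propose_next: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Run `count` sprints in sequence; each summary seeds the next idea."""
    history: List[Tuple[str, str]] = []
    idea = first_idea
    for i in range(count):
        summary = run_sprint(idea)  # hypothetical: blocks until the sprint ends
        history.append((idea, summary))
        if i < count - 1:
            # Only ask for a follow-up idea if another sprint will run.
            idea = propose_next(idea, summary)
    return history


# Stand-ins so the sketch runs without a cluster or Claude.
def fake_run_sprint(idea: str) -> str:
    return f"summary of [{idea}]"


def fake_propose_next(idea: str, summary: str) -> str:
    return f"follow-up to [{idea}]"


history = run_auto_loop("baseline", 3, fake_run_sprint, fake_propose_next)
for idea, summary in history:
    print(idea, "->", summary)
```

Passing the sprint runner and the idea generator in as callables keeps the loop logic testable without SSH access or an API key.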
How it works
ResearchLoop has two components:
- Orchestrator (researchloop serve) -- a lightweight server that manages studies and sprints in SQLite, submits jobs to HPC clusters via SSH, receives completion webhooks, stores artifacts, and serves the web dashboard.
- Sprint Runner -- runs inside each SLURM/SGE job on the HPC cluster. Chains claude -p calls through the research pipeline (research, red-team, fix, report, summarize), then sends artifacts and results back to the orchestrator.
You (CLI / Dashboard / Slack)
        |
        v
Orchestrator (Docker / Fly.io)          HPC Cluster
+--------------------------+            +----------------------------+
| FastAPI API + Dashboard  |---SSH----->| SLURM / SGE scheduler      |
| SQLite metadata          |            |                            |
| Artifact storage         |<--webhook--| Sprint Runner              |
| Slack bot                |<--upload---|  1. claude -p "research"   |
| ntfy.sh notifications    |            |  2. claude -p "red-team"   |
+--------------------------+            |  3. claude -p "fix"        |
                                        |  4. claude -p "report"     |
                                        |  5. claude -p "summarize"  |
                                        +----------------------------+
Core concepts
| Concept | Description |
|---|---|
| Study | A sustained research effort (e.g., "synthetic SAE improvements"). Tied to a cluster, has its own context and configuration. |
| Sprint | A single research attempt within a study. Gets a short ID (sp-a3f7b2), its own directory, and runs the full pipeline. |
| Auto-loop | Automatic sequential sprint execution. After each sprint, Claude analyzes results and generates the next research idea. |
Sprint pipeline
Each sprint runs these steps inside a single SLURM/SGE job:
- Research -- execute the research idea (coding, experiments, analysis)
- Red-team -- critique the work, find flaws (up to red_team_max_rounds rounds, each followed by a fix step)
- Fix -- address issues found by the red-team
- Report -- generate a comprehensive markdown report
- Summarize -- write a short summary for notifications and the dashboard
All steps share a single Claude session (via --resume), so Claude maintains full context of the sprint's work across steps.
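One way to picture the shared session is as a sequence of command lines in which every call after the first carries the session ID returned by the previous one. The sketch below only builds the argument lists and takes the function that executes them as a parameter, so it runs without Claude installed. It rests on several assumptions: that `claude -p ... --output-format json` reports a `session_id` that can be passed back with `--resume`, that the red-team/fix pair simply repeats `red_team_max_rounds` times (the real runner may stop earlier), and that the prompts look like the placeholders here. `plan_steps`, `build_claude_argv`, and `run_pipeline` are hypothetical names, not the runner's API.

```python
import shlex
from typing import Callable, Dict, List, Optional


def plan_steps(red_team_max_rounds: int) -> List[str]:
    """Upper bound on the step sequence for one sprint."""
    steps = ["research"]
    for _ in range(red_team_max_rounds):
        steps += ["red-team", "fix"]
    steps += ["report", "summarize"]
    return steps


def build_claude_argv(
    prompt: str,
    session_id: Optional[str] = None,
    claude_command: str = "claude",
) -> List[str]:
    """Build one non-interactive call, resuming the session when one exists."""
    argv = shlex.split(claude_command) + ["-p", prompt, "--output-format", "json"]
    if session_id:
        argv += ["--resume", session_id]
    return argv


def run_pipeline(
    idea: str,
    red_team_max_rounds: int,
    run_step: Callable[[List[str]], Dict[str, str]],
) -> List[List[str]]:
    """Issue every step in order, threading the session ID through."""
    session_id: Optional[str] = None
    issued: List[List[str]] = []
    for step in plan_steps(red_team_max_rounds):
        argv = build_claude_argv(f"[{step}] {idea}", session_id)
        issued.append(argv)
        # Assumption: the executor returns the parsed JSON output of the call.
        session_id = run_step(argv)["session_id"]
    return issued


def fake_run_step(argv: List[str]) -> Dict[str, str]:
    # Stand-in for subprocess execution; always reports the same session.
    return {"session_id": "session-123"}


commands = run_pipeline("try approach X", 1, fake_run_step)
for argv in commands:
    print(" ".join(shlex.quote(part) for part in argv))
```

Only the first command lacks `--resume`; every later step continues the same conversation, which is what lets the report step refer to work done in the research step.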
Features
- HPC cluster integration -- submit, monitor, and cancel jobs on SLURM and SGE clusters via SSH
- Multi-step research pipeline -- research, red-team, fix, report, summarize with configurable rounds
- Auto-loop -- chain sprints automatically with AI-generated follow-up ideas
- Web dashboard -- monitor studies, sprints, and loops from a browser with live status refresh
- Slack bot -- start sprints, check status, and have research conversations via Slack DMs or channels
- CLI -- full remote management from the command line with token-based auth
- Progress tracking -- live progress.md and output.log streaming from cluster to dashboard
- Notifications -- push notifications via ntfy.sh and Slack with PDF report attachments
- Per-sprint security -- webhook tokens, CSRF protection, signed session cookies, bcrypt password hashing
- Context hierarchy -- global, cluster, and study-level context files and inline configuration
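The context hierarchy in the last item can be thought of as three optional text blocks that are combined before a sprint starts. The sketch below is a guess at the shape of that step, not the project's code: the ordering (global, then cluster, then study), the section headings, and the name `assemble_context` are all assumptions made for illustration.

```python
from typing import List, Optional


def assemble_context(
    global_ctx: Optional[str],
    cluster_ctx: Optional[str],
    study_ctx: Optional[str],
) -> str:
    """Join the non-empty context levels, most general first."""
    sections = [
        ("Global", global_ctx),
        ("Cluster", cluster_ctx),
        ("Study", study_ctx),
    ]
    parts: List[str] = []
    for title, text in sections:
        if text and text.strip():
            parts.append(f"## {title} context\n{text.strip()}")
    return "\n\n".join(parts)


combined = assemble_context(
    "Always use Python 3.10+ features.",
    "GPUs are NVIDIA L40. Check CUDA_VISIBLE_DEVICES.",
    "\nFocus on improving F1 score. Use batch size 1024.\n",
)
print(combined)
```

Placing the study context last means the most specific guidance is the last thing read, which is a common convention when layering prompts.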
Quick start
Prerequisites
- Python 3.10+
- uv (recommended) or pip
- SSH access to an HPC cluster with SLURM or SGE
- Claude Code CLI installed and authenticated on the HPC cluster
Install
pip install git+https://github.com/chanind/researchloop.git
Or for development:
git clone https://github.com/chanind/researchloop.git
cd researchloop
uv sync
Initialize a project
researchloop init
# Creates researchloop.toml and artifacts/ directory
Configure
Edit researchloop.toml:
shared_secret = "change-me"
orchestrator_url = "https://your-server.fly.dev"
[[cluster]]
name = "hpc"
host = "login.cluster.example.com"
user = "researcher"
key_path = "~/.ssh/id_ed25519"
scheduler_type = "slurm" # "slurm", "sge", or "local"
working_dir = "/scratch/researcher/researchloop"
[cluster.job_options]
gres = "gpu:1"
mem = "64G"
cpus-per-task = "8"
[[study]]
name = "my-research"
cluster = "hpc"
description = "Investigating feature X"
max_sprint_duration_hours = 8
red_team_max_rounds = 3
Start the server and run a sprint
# Start the orchestrator
researchloop serve
# In another terminal, connect the CLI to the server
researchloop connect http://localhost:8080
# Submit a sprint
researchloop sprint run "try approach X on dataset Y" --study my-research
# Check status
researchloop sprint list
researchloop sprint show sp-a3f7b2
Configuration reference
Complete researchloop.toml example
# -- Top-level settings --
db_path = "researchloop.db" # SQLite database location
artifact_dir = "artifacts" # Local directory for uploaded artifacts
shared_secret = "your-secret" # Auth between runner and orchestrator
orchestrator_url = "https://example.com" # Public URL for webhooks
claude_command = "" # Override claude command globally
# Global context (included in all sprints)
context = "Always use Python 3.10+ features."
context_paths = ["./global-context.md"] # Files to include as context
# -- Cluster configuration --
[[cluster]]
name = "hpc"
host = "login.cluster.example.com"
port = 22
user = "researcher"
key_path = "~/.ssh/id_ed25519"
scheduler_type = "slurm" # "slurm", "sge", or "local"
working_dir = "/scratch/user/researchloop"
max_concurrent_jobs = 4
claude_command = "claude --dangerously-skip-permissions"
# Context specific to this cluster
context = "GPUs are NVIDIA L40. Check CUDA_VISIBLE_DEVICES."
context_paths = ["./cluster-notes.md"]
# Environment variables set in SLURM jobs
[cluster.environment]
# ANTHROPIC_API_KEY = "sk-ant-..." # Only if not using claude login
# SLURM job options (passed as #SBATCH directives)
[cluster.job_options]
gres = "gpu:l40:1"
cpus-per-task = "8"
mem = "64G"
# -- Study configuration --
[[study]]
name = "my-study"
cluster = "hpc" # Must match a cluster name
description = "Research into X"
claude_md_path = "./studies/my-study/CLAUDE.md" # Study-specific context file
sprints_dir = "/scratch/user/my-study" # Where sprints go (default: working_dir/<study>)
max_sprint_duration_hours = 8 # SLURM time limit
red_team_max_rounds = 3 # Red-team/fix cycles
allow_loop = true # Allow auto-loops for this study
claude_command = "" # Override claude command for this study
# Inline study context (included in research prompts)
context = """
Focus on improving F1 score. Use batch size 1024.
"""
# Per-study SLURM overrides
[study.job_options]
gres = "gpu:a100:2"
# -- Notifications --
[ntfy]
url = "https://ntfy.sh" # ntfy server URL (change if self-hosting)
topic = "researchloop" # ntfy topic name
# -- Slack integration --
[slack]
bot_token = "" # xoxb-... (prefer env var)
signing_secret = "" # Slack signing secret (prefer env var)
channel_id = "C0123456789" # Channel or user ID for notifications
allowed_user_ids = ["U0123456789"] # Users allowed to interact with bot
restrict_to_channel = false # If true, only respond in channel_id
# -- Dashboard --
[dashboard]
enabled = true
host = "0.0.0.0"
port = 8080
password_hash = "" # bcrypt hash (prefer env var or first-run setup)
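The comments in the example say that job options are passed as #SBATCH directives and that a study can override the cluster's values. A minimal sketch of that merge follows, assuming a key-by-key override and a straightforward `--key=value` rendering; the real job template (slurm.sh.j2) may format things differently, and `merge_job_options` and `render_sbatch_directives` are hypothetical names.

```python
from typing import Dict, List, Optional


def merge_job_options(
    cluster_opts: Dict[str, str], study_opts: Dict[str, str]
) -> Dict[str, str]:
    """Study-level values replace cluster-level values with the same key."""
    merged = dict(cluster_opts)
    merged.update(study_opts)
    return merged


def render_sbatch_directives(
    options: Dict[str, str], time_limit_hours: Optional[int] = None
) -> List[str]:
    """Turn an options mapping into #SBATCH lines for a job script header."""
    lines = [f"#SBATCH --{key}={value}" for key, value in options.items()]
    if time_limit_hours is not None:
        # Assumption: max_sprint_duration_hours maps onto sbatch's --time.
        lines.append(f"#SBATCH --time={time_limit_hours:02d}:00:00")
    return lines


cluster_opts = {"gres": "gpu:l40:1", "cpus-per-task": "8", "mem": "64G"}
study_opts = {"gres": "gpu:a100:2"}

directives = render_sbatch_directives(
    merge_job_options(cluster_opts, study_opts), time_limit_hours=8
)
print("\n".join(directives))
```

With the values from the example above, the study's `gres` replaces the cluster's while `cpus-per-task` and `mem` are inherited unchanged.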
Environment variable overrides
All secrets and sensitive settings can be set via environment variables with the RESEARCHLOOP_ prefix. Environment variables take precedence over TOML values.
| Environment variable | Overrides |
|---|---|
| RESEARCHLOOP_SHARED_SECRET | shared_secret |
| RESEARCHLOOP_ORCHESTRATOR_URL | orchestrator_url |
| RESEARCHLOOP_DB_PATH | db_path |
| RESEARCHLOOP_ARTIFACT_DIR | artifact_dir |
| RESEARCHLOOP_SLACK_BOT_TOKEN | slack.bot_token |
| RESEARCHLOOP_SLACK_SIGNING_SECRET | slack.signing_secret |
| RESEARCHLOOP_SLACK_CHANNEL_ID | slack.channel_id |
| RESEARCHLOOP_SLACK_ALLOWED_USER_IDS | slack.allowed_user_ids (comma-separated) |
| RESEARCHLOOP_NTFY_TOPIC | ntfy.topic |
| RESEARCHLOOP_NTFY_URL | ntfy.url |
| RESEARCHLOOP_DASHBOARD_PASSWORD | Auto-hashed on startup |
| RESEARCHLOOP_DASHBOARD_PASSWORD_HASH | dashboard.password_hash |
| RESEARCHLOOP_DASHBOARD_PORT | dashboard.port |
| RESEARCHLOOP_DASHBOARD_HOST | dashboard.host |
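The precedence rule (environment over TOML) can be sketched as a small function that overlays selected variables on a parsed config. This is an illustration covering three of the variables listed above, not the loader in core/config.py; `ENV_OVERRIDES` and `apply_env_overrides` are hypothetical names, and the type conversions (integer port, comma-split user IDs) are assumptions based on the table.

```python
import copy
from typing import Any, Dict, Mapping, Tuple

# Subset of the table above: env var -> path into the parsed TOML config.
ENV_OVERRIDES: Dict[str, Tuple[str, ...]] = {
    "RESEARCHLOOP_SHARED_SECRET": ("shared_secret",),
    "RESEARCHLOOP_DASHBOARD_PORT": ("dashboard", "port"),
    "RESEARCHLOOP_SLACK_ALLOWED_USER_IDS": ("slack", "allowed_user_ids"),
}


def apply_env_overrides(
    config: Dict[str, Any], environ: Mapping[str, str]
) -> Dict[str, Any]:
    """Return a copy of `config` where set environment variables win."""
    result = copy.deepcopy(config)
    for var, path in ENV_OVERRIDES.items():
        if var not in environ:
            continue
        value: Any = environ[var]
        if path[-1] == "allowed_user_ids":
            value = [part.strip() for part in value.split(",") if part.strip()]
        elif path[-1] == "port":
            value = int(value)
        target = result
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = value
    return result


toml_config = {"shared_secret": "from-toml", "dashboard": {"port": 8080}}
environ = {
    "RESEARCHLOOP_SHARED_SECRET": "from-env",
    "RESEARCHLOOP_SLACK_ALLOWED_USER_IDS": "U01, U02",
}
effective = apply_env_overrides(toml_config, environ)
print(effective)
```

In real use the second argument would be `os.environ`; a plain dict is passed here so the example does not depend on the caller's shell.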
Deployment
Docker
FROM python:3.12-slim
RUN apt-get update && \
apt-get install -y --no-install-recommends openssh-client curl git && \
rm -rf /var/lib/apt/lists/*
# Install Claude CLI
RUN curl -fsSL https://claude.ai/install.sh | bash
# Install researchloop
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
RUN uv venv /app/.venv && \
uv pip install --python /app/.venv/bin/python --no-cache \
"researchloop @ git+https://github.com/chanind/researchloop.git"
WORKDIR /app
COPY researchloop.toml .
ENV PATH="/root/.local/bin:/root/.claude/bin:/app/.venv/bin:$PATH"
ENV RESEARCHLOOP_DB_PATH="/data/researchloop.db"
ENV RESEARCHLOOP_ARTIFACT_DIR="/data/artifacts"
EXPOSE 8080
CMD ["researchloop", "serve"]
Fly.io
ResearchLoop works well on Fly.io with a persistent volume for the database and artifacts:
# fly.toml
app = "my-researchloop"
primary_region = "iad"
[build]
[[mounts]]
source = "researchloop_data"
destination = "/data"
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 0
[[vm]]
size = "shared-cpu-1x"
memory = "2gb"
Set secrets:
fly secrets set \
RESEARCHLOOP_SHARED_SECRET="your-secret" \
RESEARCHLOOP_ORCHESTRATOR_URL="https://my-researchloop.fly.dev" \
SSH_PRIVATE_KEY="$(cat ~/.ssh/id_ed25519)" \
RESEARCHLOOP_DASHBOARD_PASSWORD="your-password" \
-a my-researchloop
Deploy:
fly deploy
SSH key setup for Docker/Fly.io
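The orchestrator needs SSH access to your HPC cluster. Add an entrypoint script that writes the private key from the SSH_PRIVATE_KEY secret to ~/.ssh/id_ed25519 before the server starts, and set it as the image's ENTRYPOINT so it runs ahead of the CMD: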
#!/bin/bash
set -euo pipefail
if [ -n "${SSH_PRIVATE_KEY:-}" ]; then
mkdir -p ~/.ssh
echo "$SSH_PRIVATE_KEY" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
cat > ~/.ssh/config <<EOF
Host *
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
EOF
chmod 600 ~/.ssh/config
fi
mkdir -p /data/artifacts
exec "$@"
Dashboard
The web dashboard provides a browser-based interface for managing ResearchLoop. It is served by the orchestrator at /dashboard/.
Features
- Studies list -- overview of all configured studies with sprint counts
- Study detail -- view study configuration, submit new sprints with GPU/memory overrides
- Sprint list -- filterable list of all sprints across studies
- Sprint detail -- live status with progress.md display, tool log, script output, report rendering (markdown to HTML), PDF download, and artifact listing
- Auto-loop management -- start, stop, and resume loops with context guidance and job option overrides
- Loop detail -- progress tracking with links to individual loop sprints
- Refresh -- pull live status from the cluster via SSH (detects current pipeline step, reads logs)
Authentication
On first visit, the dashboard prompts you to set a password. Alternatively, set RESEARCHLOOP_DASHBOARD_PASSWORD as an environment variable and the password is auto-hashed on startup.
Sessions use signed cookies (7-day expiry) with a signing key persisted in the database. All mutating dashboard actions are protected by CSRF tokens.
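For readers unfamiliar with signed cookies, the general technique is to attach an HMAC of the cookie payload so that the server can detect tampering, and to embed an expiry timestamp inside the signed payload. The sketch below shows that technique with the standard library only. It is not the dashboard's implementation: the cookie layout, `sign_session`, and `verify_session` are invented for this example, and only the 7-day lifetime comes from the text above.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SESSION_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day expiry, as described above


def sign_session(payload: dict, key: bytes, now: float) -> str:
    """Serialize the payload with an expiry and append an HMAC-SHA256 tag."""
    body = dict(payload, exp=int(now) + SESSION_TTL_SECONDS)
    raw = base64.urlsafe_b64encode(
        json.dumps(body, sort_keys=True).encode()
    ).decode()
    tag = hmac.new(key, raw.encode(), hashlib.sha256).hexdigest()
    return f"{raw}.{tag}"


def verify_session(cookie: str, key: bytes, now: float) -> Optional[dict]:
    """Return the payload if the tag matches and the session has not expired."""
    raw, _, tag = cookie.rpartition(".")
    expected = hmac.new(key, raw.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return None
    body = json.loads(base64.urlsafe_b64decode(raw))
    if body["exp"] < now:
        return None
    return body


key = b"example-signing-key"  # the real key is persisted in the database
cookie = sign_session({"user": "admin"}, key, now=1_000)
print(verify_session(cookie, key, now=1_060))
```

Because the expiry is part of the signed body, a client cannot extend its own session without invalidating the tag.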
CLI authentication
The CLI authenticates to the orchestrator using password-based token auth:
researchloop connect https://your-server.fly.dev
# Prompts for password, saves token to ~/.config/researchloop/credentials.json
researchloop status # Check connection
researchloop disconnect # Remove saved credentials
Slack integration
Setup
- Go to api.slack.com/apps and create a new app
- Enable Event Subscriptions with request URL: https://your-server.fly.dev/api/slack/events
- Subscribe to bot events: app_mention, message.im
- Add OAuth Scopes: chat:write, files:write
- Install the app to your workspace
- Set environment variables:
RESEARCHLOOP_SLACK_BOT_TOKEN="xoxb-..."
RESEARCHLOOP_SLACK_SIGNING_SECRET="..."
RESEARCHLOOP_SLACK_CHANNEL_ID="C0123456789" # For notifications
RESEARCHLOOP_SLACK_ALLOWED_USER_IDS="U01,U02" # Comma-separated
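The signing secret is what allows a server to confirm that an incoming event really came from Slack. Slack's documented scheme signs the string `v0:<timestamp>:<raw body>` with HMAC-SHA256 and sends the result in the `X-Slack-Signature` header, alongside `X-Slack-Request-Timestamp`. The sketch below follows that scheme; the function names are chosen for this example and are not necessarily those used in comms/slack.py, and the five-minute replay window is Slack's recommendation rather than a documented ResearchLoop setting.

```python
import hashlib
import hmac


def slack_signature(signing_secret: str, timestamp: str, body: bytes) -> str:
    """Compute the value Slack would send in X-Slack-Signature."""
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    return "v0=" + digest


def is_valid_slack_request(
    signing_secret: str,
    timestamp: str,
    body: bytes,
    signature: str,
    now: float,
    max_age_seconds: int = 300,
) -> bool:
    """Reject stale or mis-signed requests."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > max_age_seconds:
        return False  # too old (or too far in the future): possible replay
    expected = slack_signature(signing_secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = "example-signing-secret"
body = b'{"type":"event_callback"}'
header = slack_signature(secret, "1700000000", body)
print(is_valid_slack_request(secret, "1700000000", body, header, now=1700000010))
```

The check must run against the raw request bytes, before any JSON parsing, since re-serializing the body can change it and break the signature.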
Commands
| Command | Description |
|---|---|
| sprint run <study> <idea> | Submit a new sprint |
| sprint list | List recent sprints |
| auth status | Check if Claude CLI is authenticated |
| help | Show available commands |
Conversational mode
Beyond commands, the Slack bot supports free-form conversations. Messages in a thread are tracked as a Claude session (via --resume), so the bot remembers context within a thread. The bot can:
- Discuss research ideas and help plan sprints
- Review results from completed sprints
- Look up papers and references (web search)
- Execute actions (start sprints, loops) when you ask
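The thread-to-session tracking described above amounts to a lookup table: the first message in a thread starts a fresh Claude session, and later messages in the same thread resume it. A minimal in-memory sketch follows. `ThreadSessions` is a hypothetical class written for this example; the real ConversationManager presumably persists this mapping, and keying on channel ID plus thread timestamp is an assumption.

```python
from typing import Dict, List, Tuple

ThreadKey = Tuple[str, str]  # (channel_id, thread_ts)


class ThreadSessions:
    """Remember which Claude session belongs to which Slack thread."""

    def __init__(self) -> None:
        self._sessions: Dict[ThreadKey, str] = {}

    def resume_args(self, channel_id: str, thread_ts: str) -> List[str]:
        """Extra CLI arguments for the next call in this thread."""
        session_id = self._sessions.get((channel_id, thread_ts))
        return ["--resume", session_id] if session_id else []

    def remember(self, channel_id: str, thread_ts: str, session_id: str) -> None:
        """Store the session ID reported by the latest call."""
        self._sessions[(channel_id, thread_ts)] = session_id


sessions = ThreadSessions()
first = sessions.resume_args("C0123456789", "1700000000.000100")
sessions.remember("C0123456789", "1700000000.000100", "session-abc")
second = sessions.resume_args("C0123456789", "1700000000.000100")
other = sessions.resume_args("C0123456789", "1700000999.000200")
print(first, second, other)
```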
Notifications
When sprints complete or fail, the bot sends notifications to the configured channel. Completed sprint notifications include the summary and a link to the dashboard. If a PDF report was generated, it is uploaded as an attachment.
CLI reference
researchloop [OPTIONS] COMMAND
Options:
-c, --config PATH Path to researchloop.toml
--version Show version
--help Show help
Commands:
init Initialize a new project with example config
serve Start the orchestrator server
connect [URL] Authenticate CLI to a remote orchestrator
disconnect Remove saved credentials
status Show connection status
study list List all configured studies
study show NAME Show study details and recent sprints
study init NAME Scaffold a new study directory with starter CLAUDE.md
sprint run IDEA Submit a new sprint (-s/--study required)
sprint list List sprints (--study, --limit options)
sprint show ID Show sprint details, artifacts, and summary
sprint cancel ID Cancel a running sprint
loop start Start an auto-loop (-s/--study, -n/--count, -m/--context)
loop status Show all auto-loops
loop stop LOOP_ID Stop a running auto-loop
cluster list List configured clusters
cluster check Test SSH connectivity (--name for specific cluster)
API endpoints
The orchestrator exposes a REST API at /api/:
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/auth | Password | Get API token |
| GET | /api/studies | Token/Secret | List all studies |
| GET | /api/sprints | Token/Secret | List sprints (?study_name=, ?limit=) |
| GET | /api/sprints/{id} | Token/Secret | Get sprint details |
| POST | /api/sprints | Token/Secret | Create and submit a sprint |
| POST | /api/sprints/{id}/cancel | Token/Secret | Cancel a sprint |
| POST | /api/loops | Token/Secret | Start an auto-loop |
| POST | /api/loops/{id}/stop | Token/Secret | Stop an auto-loop |
| POST | /api/webhook/sprint-complete | Webhook token | Sprint completion callback |
| POST | /api/webhook/heartbeat | Webhook token | Runner heartbeat with logs |
| POST | /api/artifacts/{sprint_id} | Webhook token | Upload artifact file |
| POST | /api/slack/events | Slack signature | Slack Events API handler |
Authentication uses either a bearer token (from /api/auth) or the X-Shared-Secret header. Webhook endpoints use per-sprint X-Webhook-Token headers.
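The three authentication modes map onto three different request headers. The sketch below builds (but does not send) requests with the standard library to show which header goes with which kind of endpoint. Two details are assumptions rather than documented behavior: that the bearer token travels in a standard `Authorization: Bearer ...` header, and that a sprint is created from a JSON body with `study_name` and `idea` fields. `auth_headers` is a hypothetical helper.

```python
import json
from typing import Dict, Optional
from urllib.request import Request


def auth_headers(
    token: Optional[str] = None,
    shared_secret: Optional[str] = None,
    webhook_token: Optional[str] = None,
) -> Dict[str, str]:
    """Pick the header that matches the credential supplied."""
    if token:
        return {"Authorization": f"Bearer {token}"}  # assumed header format
    if shared_secret:
        return {"X-Shared-Secret": shared_secret}
    if webhook_token:
        return {"X-Webhook-Token": webhook_token}  # per-sprint, runner only
    return {}


base_url = "https://your-server.fly.dev"

# Listing sprints with a token obtained from /api/auth.
list_request = Request(
    f"{base_url}/api/sprints?study_name=my-research&limit=5",
    headers=auth_headers(token="example-token"),
    method="GET",
)

# Creating a sprint with the shared secret; the body fields are hypothetical.
create_request = Request(
    f"{base_url}/api/sprints",
    data=json.dumps({"study_name": "my-research", "idea": "try approach X"}).encode(),
    headers={**auth_headers(shared_secret="change-me"), "Content-Type": "application/json"},
    method="POST",
)

print(list_request.get_method(), list_request.full_url)
print(create_request.get_method(), create_request.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the example has no network dependency.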
Development
Setup
git clone https://github.com/chanind/researchloop.git
cd researchloop
uv sync
Run tests
# Unit tests (339 tests, ~3s)
uv run pytest tests/ -v -m "not integration"
# Integration tests (requires Docker for SLURM container)
docker build -t researchloop-slurm-test tests/docker/slurm/
uv run pytest tests/integration/ -v --timeout=120
Code quality
uv run ruff check . # Lint
uv run ruff format --check . # Format check
uv run pyright researchloop/ # Type check
Project structure
researchloop/
core/
config.py TOML config loading into dataclasses
models.py SprintStatus enum, Sprint/Study/AutoLoop dataclasses
orchestrator.py Orchestrator class + create_app() FastAPI factory
credentials.py CLI credential storage (~/.config/researchloop/)
auth.py Claude CLI auth checking
db/
database.py Async SQLite wrapper (WAL mode, auto-migrations)
migrations.py Schema definitions (7 tables + indexes)
queries.py Async CRUD functions (parameterized SQL, return dicts)
clusters/
ssh.py SSHConnection + SSHManager (connection pooling)
monitor.py JobMonitor (polls active jobs, heartbeat tracking)
schedulers/
base.py BaseScheduler ABC
slurm.py SlurmScheduler (sbatch/squeue/sacct/scancel)
sge.py SGEScheduler (qsub/qstat/qacct/qdel)
local.py LocalScheduler (subprocesses, for testing)
sprints/
manager.py SprintManager (create/submit/cancel/handle_completion)
auto_loop.py AutoLoopController (start/stop/resume, idea generation)
studies/
manager.py StudyManager (config-to-DB sync, cluster resolution)
runner/
pipeline.py Pipeline class (research pipeline steps)
claude.py run_claude() wrapper + render_template()
upload.py upload_artifacts(), send_webhook(), send_heartbeat()
main.py Runner CLI entry point (researchloop-runner)
templates/ Jinja2 prompt templates (6 templates)
job_templates/ SLURM (slurm.sh.j2) and SGE (sge.sh.j2) job scripts
comms/
base.py BaseNotifier ABC
ntfy.py NtfyNotifier (ntfy.sh push notifications)
slack.py SlackNotifier + verify_slack_signature()
conversation.py ConversationManager (Slack threads to Claude sessions)
router.py NotificationRouter (fan-out to all backends)
dashboard/
app.py ASGI app entry point
auth.py Password auth (bcrypt + signed session cookies + CSRF)
routes.py Dashboard HTML routes
templates/ Jinja2 HTML templates (9 templates)
cli.py Click CLI entry point
CI
GitHub Actions runs on every push and PR to main:
- Lint -- ruff check, ruff format --check, pyright
- Test -- pytest on Python 3.10, 3.12, 3.13
- Integration -- builds a Docker SLURM container and runs integration tests
License
MIT