HPC orchestrator for Claude Code and external agent harnesses
hpc-agent
HPC orchestrator for array-batch experiments on SGE/SLURM clusters. Two surfaces over one core:
- Slash commands for humans in Claude Code (/submit-hpc, /monitor-hpc, /aggregate-hpc, /campaign-hpc, /preflight): interactive markdown templates in slash_commands/commands/*.md that walk you through choosing a cluster and authoring .hpc/tasks.py. Executor scaffolding is folded into /submit-hpc Step 1; preflight is folded into /submit-hpc Step 6b as an idempotent gate (with /preflight still available as a standalone diagnostic).
- CLI for agents and automation (hpc-agent <subcommand>): JSON-in, JSON-out, exit codes. Designed to be invoked via a Bash-style tool by external orchestrators. This is a POSIX-native agent surface: any tool that can shell out and parse JSON can drive a cluster; see docs/reference/agent-surface.md. For integrators: docs/integrations/CONTRACT.md.
Both surfaces invoke hpc-agent <subcommand>. The slash commands are pure markdown that orchestrate the binary; the binary's atomic-ops layer (hpc_agent.runner) ensures cross-surface state — in-flight runs, journal records under ~/.claude/hpc/<repo_hash>/ — is shared automatically.
Quick Start
For humans (Claude Code)
pip install hpc-agent # or `pip install -e .` from a checkout
hpc-agent setup # copy commands + skills, wire the Stop hooks
hpc-agent setup copies the bundled slash commands into
~/.claude/commands/ and the skills into ~/.claude/skills/, then
installs hpc-agent's Stop hooks — all idempotent, so re-running is
safe. Both asset trees ship inside the package, so this works the same
from a wheel install or an editable checkout. Pass --no-hooks to
skip the hook step or --dry-run to preview. Every command
(/preflight, /submit-hpc, /monitor-hpc, /aggregate-hpc,
/campaign-hpc, /hpc-axes-init) and skill ships inside the package.
Once installed:
- /preflight (optional): verify SSH agent + cluster reachability. /submit-hpc auto-runs this as a cached gate, so you only need it for ad-hoc diagnostics.
- /submit-hpc: answer prompts about cluster, executor, grid params. Scaffolds the executor inline if none exists.
- /monitor-hpc to monitor, /aggregate-hpc to collect results.
For agents and automation
pip install hpc-agent
hpc-agent preflight --cluster hoffman2 # health check
hpc-agent interview --spec intent.json --campaign-dir <d> # persist campaign intent next to tasks.py
hpc-agent recall --root ~/experiments --task-kind <kind> # query past interviews for next-interview grounding
hpc-agent submit --spec spec.json # JSON envelope on stdout
hpc-agent status --run-id <id> # one-shot snapshot; poll as needed
hpc-agent aggregate --run-id <id> --wave 1 # combiner + result pull
Stdout is a single-line JSON envelope: {"ok": true, "idempotent": ..., "data": {...}} or {"ok": false, "error_code": ..., "retry_safe": ..., "remediation": ...}. Exit codes: 0 ok, 1 user error, 2 cluster/network, 3 internal. Full schema in docs/reference/cli-spec.md; JSON Schema files for runtime validation under hpc_agent/schemas/.
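A caller might handle the envelope roughly as follows. This is a minimal sketch, not part of hpc-agent: the helper name interpret_envelope, the "action" labels, and the sample payloads are illustrative assumptions; only the envelope keys and exit-code meanings come from the description above.

```python
import json

# Exit-code meanings as listed above.
EXIT_MEANINGS = {0: "ok", 1: "user error", 2: "cluster/network", 3: "internal"}


def interpret_envelope(stdout: str, exit_code: int) -> dict:
    """Parse one single-line JSON envelope and suggest a next step.

    The "action" labels are hypothetical; the envelope keys are the documented ones.
    """
    envelope = json.loads(stdout.strip())
    if envelope.get("ok"):
        return {"action": "continue", "data": envelope.get("data", {})}
    return {
        "action": "retry" if envelope.get("retry_safe") else "stop",
        "error_code": envelope.get("error_code"),
        "remediation": envelope.get("remediation"),
        "exit_meaning": EXIT_MEANINGS.get(exit_code, "unknown"),
    }


# Sample payloads are made up for illustration.
ok = interpret_envelope('{"ok": true, "idempotent": false, "data": {"run_id": "r1"}}', 0)
failed = interpret_envelope(
    '{"ok": false, "error_code": "ssh_unreachable", "retry_safe": true, '
    '"remediation": "run preflight"}',
    2,
)
print(ok["action"], failed["action"], failed["exit_meaning"])
```

Whether a given error_code should actually be retried is defined by the retry policy table in docs/integrations/CONTRACT.md; the retry_safe check here is only a first approximation.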
For integrators
hpc-agent is Bash-invokable from any agent harness with a JSON
parser. See docs/integrations/CONTRACT.md
for the full contract: the spawn env block,
error_code → retry policy table, the find-prior-run → submit →
monitor-summary → verify-aggregation-complete workflow, the
.hpc/tasks.py boundary, and the executor import allowlist.
The canonical reference for .hpc/tasks.py is shipped inside the
package at
src/hpc_agent/mapreduce/templates/scaffolds/tasks_example.py.
It demonstrates three patterns (Cartesian product, chunking by row
count, date-window backtests) inline. Integrators locate it at runtime
via from hpc_agent import _PACKAGE_ROOT or rglob("tasks_example.py").
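The rglob route could look something like this. It is a sketch under assumptions: find_tasks_example is a hypothetical helper, and the demo builds a throwaway directory instead of searching a real install. With the package installed, the search root would presumably be _PACKAGE_ROOT, assuming it is path-like.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_tasks_example(search_root: Path) -> Optional[Path]:
    """Return the first tasks_example.py found under search_root, or None."""
    matches = sorted(search_root.rglob("tasks_example.py"))
    return matches[0] if matches else None


# Demo against a temporary tree that mimics the scaffold layout.
with tempfile.TemporaryDirectory() as tmp:
    scaffold = Path(tmp) / "mapreduce" / "templates" / "scaffolds"
    scaffold.mkdir(parents=True)
    (scaffold / "tasks_example.py").write_text("# placeholder\n")
    empty = Path(tmp) / "empty"
    empty.mkdir()

    found = find_tasks_example(Path(tmp))
    found_name = found.name if found else None
    missing = find_tasks_example(empty)
```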
The most common first-time failure is the harness's default-empty
spawn env dropping SSH_AUTH_SOCK. hpc-agent status/aggregate/reconcile fail fast with error_code: "ssh_unreachable" (exit 2) instead of hanging on auth — run
hpc-agent preflight first to verify the spawn env. hpc-agent does
not kill cluster jobs by design (settings.json denies
scancel/qdel); if the integrator decides a run is bad, stop
polling and let it expire.
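One way a harness might avoid the dropped-SSH_AUTH_SOCK failure is to build the spawn env explicitly. The sketch below is an assumption-laden illustration: the helper names and the list of forwarded variables are hypothetical, and the authoritative spawn env block is the one in docs/integrations/CONTRACT.md.

```python
# Variables this sketch forwards; the list is an assumption, not the contract.
FORWARDED_VARS = ("PATH", "HOME", "SSH_AUTH_SOCK")


def build_spawn_env(parent_env: dict) -> dict:
    """Copy the variables an SSH-backed subprocess typically needs."""
    return {name: parent_env[name] for name in FORWARDED_VARS if name in parent_env}


def missing_ssh_agent(spawn_env: dict) -> bool:
    """True when the agent socket variable is absent or empty."""
    return not spawn_env.get("SSH_AUTH_SOCK")


env = build_spawn_env(
    {"PATH": "/usr/bin", "HOME": "/home/u", "SSH_AUTH_SOCK": "/tmp/agent.sock", "EDITOR": "vi"}
)
empty_env = build_spawn_env({})
print(missing_ssh_agent(env), missing_ssh_agent(empty_env))
```

In a real harness, parent_env would be os.environ and the result would be passed as the env argument when spawning hpc-agent; checking missing_ssh_agent first gives a local error before any cluster call is attempted.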
Standalone usage
Organize your experiment repo
Keep standalone executor scripts in a dedicated directory, separate from shared utilities:
my_experiment/
├── executors/          # or src/; each file is a runnable experiment
│   ├── ml_ridge.py     # python3 executors/ml_ridge.py --help
│   ├── ml_xgboost.py
│   └── dl_patchts.py
├── lib/                # shared utilities (not executors)
│   ├── loading.py
│   └── transforms.py
└── data/
Each executor accepts experiment-specific arguments (--horizon, --start, --end, --features, etc.). No HPC awareness is needed — all parameters arrive as CLI flags.
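As an illustration, an executor such as ml_ridge.py might expose its arguments like this. The sketch is hypothetical: the flag names match those shown in this README, but the types, help strings, and the stubbed body are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical ridge-regression executor."""
    parser = argparse.ArgumentParser(description="Hypothetical ridge executor")
    parser.add_argument("--horizon", type=int, required=True, help="forecast horizon")
    parser.add_argument("--start", required=True, help="start date, e.g. 2020-01-01")
    parser.add_argument("--end", required=True, help="end date")
    parser.add_argument("--output-file", required=True, help="where to write results")
    return parser


def main(argv=None) -> dict:
    args = build_parser().parse_args(argv)
    # A real executor would load data and fit a model here.
    return vars(args)


# Passing argv explicitly keeps the demo independent of the real command line.
result = main(
    ["--horizon", "5", "--start", "2020-01-01", "--end", "2020-12-31", "--output-file", "out.json"]
)
print(result)
```

Because the script has a regular --help, its arguments can be discovered by reading either the source or the help output.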
Run
/preflight → verify SSH agent + cluster reachability before first submit
/submit-hpc → discovers executors, walks you through .hpc/tasks.py, syncs code, submits
/monitor-hpc → tracks completion per grid point, diagnoses failures, auto-resubmits
/aggregate-hpc → validates completeness, runs aggregation, downloads summaries
Example conversation:
You: /submit-hpc run ridge and xgboost with horizon=[1, 5, 25]
Claude: I found these executors in src/:
ml_ridge.py — --horizon, --start, --end, --output-file
ml_xgboost.py — --horizon, --start, --end, --output-file
Proposed plan:
Cluster: hoffman2 (SGE)
Grid: executor=[ml_ridge, ml_xgboost] × horizon=[1, 5, 25] → 6 grid points
Total: 6 tasks
Resources: 1 CPU, 16G, 4:00:00
Confirm?
You: yes
Claude: Submitted job 12345678 (6 tasks). Run /monitor-hpc to track progress.
No config files required. Claude discovers your executors by reading their source and --help, then suggests resources conversationally based on the executor and your input.
How It Works
The boundary between hpc-agent and your experiment repo is documented in docs/reference/boundary-contract.md and enforced by tests/test_boundary_contract.py.
- Claude reads your executor scripts and their --help output.
- You describe what to run in natural language; Claude walks you through writing .hpc/tasks.py once: a small Python module exposing total() and resolve(task_id) that returns the per-task kwargs. The file is committed to git and reused on every subsequent submit.
- A per-run sidecar .hpc/runs/<run_id>.json records the executor command, result-dir template, cmd_sha, and wave map for this particular submission.
- The framework executor _hpc_dispatch.py (zero deps, stdlib-only) is deployed to the cluster's .hpc/ by deploy_runtime.
- The job template runs the dispatcher, which imports your .hpc/tasks.py, calls resolve(task_id), formats the result_dir, and execs your executor command with kwargs as env vars.
- Your executor reads kwargs as ordinary env vars (uppercased + HPC_KW_*); no HPC awareness needed.
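To make the total()/resolve(task_id) contract concrete, a Cartesian-product .hpc/tasks.py could be sketched as below. This is an assumption-based illustration, not the shipped template: the axis values are borrowed from the example conversation, and the zero-based indexing and the exact shape of the returned dict are guesses. The canonical reference remains tasks_example.py.

```python
from itertools import product

# Axes taken from the example conversation; any axes would work.
EXECUTORS = ["ml_ridge", "ml_xgboost"]
HORIZONS = [1, 5, 25]

# Materialize the grid once so task ids map to stable positions.
_GRID = list(product(EXECUTORS, HORIZONS))


def total() -> int:
    """Number of tasks in the grid."""
    return len(_GRID)


def resolve(task_id: int) -> dict:
    """Per-task kwargs for a zero-based task_id (the indexing base is an assumption)."""
    executor, horizon = _GRID[task_id]
    return {"executor": executor, "horizon": horizon}


print(total(), resolve(0), resolve(total() - 1))
```

Keeping the grid as a fixed, ordered list means the same task_id always resolves to the same kwargs, which is what makes resubmitting individual tasks meaningful.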
Parallelism Model
The parallelization axis lives entirely in user code (.hpc/tasks.py). The framework is agnostic to whether you're doing a Cartesian grid, chunking by row count, date-window backtests, or something else — it just calls total() and resolve(i). The canonical reference at hpc_agent/mapreduce/templates/scaffolds/tasks_example.py shows three patterns inline; the agent helps you keep whichever applies and delete the rest.
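For a non-grid axis, chunking by row count might look like the following. It is a rough sketch with made-up numbers: TOTAL_ROWS, CHUNK_ROWS, and the row_start/row_end kwarg names are hypothetical, and the shipped tasks_example.py may structure this pattern differently.

```python
TOTAL_ROWS = 1050  # hypothetical dataset size
CHUNK_ROWS = 250   # hypothetical rows per task


def total() -> int:
    """Number of chunks, using integer ceiling division."""
    return -(-TOTAL_ROWS // CHUNK_ROWS)


def resolve(task_id: int) -> dict:
    """Half-open row range [row_start, row_end) for a zero-based task_id."""
    if not 0 <= task_id < total():
        raise IndexError(f"task_id {task_id} out of range")
    row_start = task_id * CHUNK_ROWS
    row_end = min(row_start + CHUNK_ROWS, TOTAL_ROWS)
    return {"row_start": row_start, "row_end": row_end}


print(total(), resolve(0), resolve(total() - 1))
```

The framework-facing surface is the same as in a grid: only total() and resolve(i) change, which is the sense in which the framework stays agnostic to the axis.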
Memory across campaigns
Two primitives — interview and recall — close the loop between consecutive campaigns. The interview agent (Claude Code or any external orchestrator) persists structured intent (goal, task_count, budget, abort_if, task_generator, cluster_target, transcript, provenance) into <campaign_dir>/interview.json next to the materialized tasks.py. The next interview calls recall --root <experiments-dir> to query past intents, returning recency-sorted summaries plus a 3-tier rollup (counts/histograms/quantiles, optional walltime aggregation, optional per-generator parameter envelopes). Observed ranges only — reasoning over them stays in the calling agent.
See docs/workflows/memory-across-campaigns.md for the full flow, including the task_generator typed materializer (5 shapes: enumerated, cartesian_product, items_x_seeds, numeric_logspace, numeric_linspace) and the ~/.hpc-agent/config.json:experiment_roots default-root config.
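The recency-sorted part of recall can be pictured with a small sketch. This is not the recall implementation: recent_interviews is a hypothetical helper, it sorts by file modification time as a stand-in for whatever recency signal recall really uses, and it reads only the goal and task_count fields named above.

```python
import json
import os
import tempfile
from pathlib import Path


def recent_interviews(root: Path, limit: int = 5) -> list:
    """Collect interview.json files under root, newest first by mtime."""
    paths = sorted(
        root.rglob("interview.json"), key=lambda p: p.stat().st_mtime, reverse=True
    )
    summaries = []
    for path in paths[:limit]:
        intent = json.loads(path.read_text())
        summaries.append(
            {
                "campaign_dir": path.parent.name,
                "goal": intent.get("goal"),
                "task_count": intent.get("task_count"),
            }
        )
    return summaries


# Demo with two fake campaigns whose mtimes are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime, count in [("old", 1_000_000_000, 6), ("new", 1_700_000_000, 12)]:
        campaign = Path(tmp) / name
        campaign.mkdir()
        record = campaign / "interview.json"
        record.write_text(json.dumps({"goal": f"{name} sweep", "task_count": count}))
        os.utime(record, (mtime, mtime))
    summaries = recent_interviews(Path(tmp))

print(summaries)
```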
Throughput Optimization
hpc-agent automatically optimizes job submissions for cluster constraints. When constraints are configured (max array size, walltime, concurrent job limits), the optimizer packs tasks into batched waves:
- Tasks are split into arrays of ≤max_array_size
- Arrays are grouped into waves of ≤max_concurrent_jobs
- Waves are staggered via scheduler dependencies (SLURM --dependency, SGE -hold_jid)
- Total wall-clock time is estimated when per-task duration is known
Configure constraints in clusters.yaml (cluster-level); per-experiment overrides resolved at /submit-hpc time are persisted to the run sidecar at .hpc/runs/<run_id>.json.
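The packing step can be illustrated with a short sketch. It is not the optimizer itself: plan_waves is a hypothetical function, it treats max_concurrent_jobs as the number of arrays per wave (an assumption), and it ignores walltime and the scheduler dependency flags.

```python
def plan_waves(total_tasks: int, max_array_size: int, max_concurrent_jobs: int) -> list:
    """Split task ids into arrays, then group arrays into waves.

    Each array is an inclusive (first_task, last_task) pair of zero-based ids.
    """
    if max_array_size < 1 or max_concurrent_jobs < 1:
        raise ValueError("constraints must be positive")

    arrays = []
    start = 0
    while start < total_tasks:
        end = min(start + max_array_size, total_tasks)
        arrays.append((start, end - 1))
        start = end

    # Consecutive groups of arrays form waves; wave N+1 would wait on wave N.
    return [
        arrays[i : i + max_concurrent_jobs]
        for i in range(0, len(arrays), max_concurrent_jobs)
    ]


waves = plan_waves(total_tasks=10, max_array_size=4, max_concurrent_jobs=2)
print(waves)
```

With 10 tasks, arrays of at most 4, and 2 arrays per wave, this yields three arrays in two waves; a real submission would attach a dependency from each wave to the previous one.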
Commands
| Command | What it does |
|---|---|
| /preflight | Standalone: verify SSH agent, ssh/rsync on PATH, clusters.yaml parses, cluster reachable. /submit-hpc auto-runs the same checks as a 24h-cached gate, so direct invocation is mostly for ad-hoc diagnostics. |
| /submit-hpc | Discover executors (scaffolds inline if none found), build grid conversationally, write .hpc/tasks.py with FLAGS dict + .hpc/cli.py dispatcher, sync code, submit array jobs |
| /monitor-hpc | Poll status, diagnose failures, auto-resubmit, self-schedule next check |
| /aggregate-hpc | Validate completeness, run aggregation on cluster, download summaries |
| /campaign-hpc | Closed-loop iteration: tag submits, read prior history, repeat /submit-hpc campaign_id=<slug> until the strategy stops. See docs/workflows/campaign.md. |
| /hpc-axes-init | Write <experiment>/.hpc/axes.yaml with the parallel-axis enumeration + homogeneity hint that drives the cold-start (and warm-path) array-axis picker. |
Primitives
The slash commands above compose ~50 primitives exposed as hpc-agent <name>. Full machine-readable catalog at docs/generated/operations.md (auto-regenerated). High-traffic ones for agent orchestration:
| Primitive | Replaces |
|---|---|
| submit-flow / submit-flow-batch | rsync + deploy + qsub + record (single or N-spec batch with shared rsync). Auto-dispatches when the spec is {specs: [...]}. |
| monitor-flow | Poll-and-combine loop the slash command's tick body wraps. |
| aggregate-flow | rsync_pull _combiner/ + reduce_partials + optional summary pull + ingest runtime samples. |
| build-submit-spec | Resolved-interview-values → validated submit_flow.input.json spec. |
| build-tasks-py | Cartesian-product axes → .hpc/tasks.py from the canonical Pattern 1 template. |
| discover-executors / discover-reducers | Scan repo for executor scripts / aggregator scripts (find existing reducer instead of writing a fresh one). |
| decide-monitor-arm | Pick cron/loop/none + cadence + cron schedule + literal armed: line. |
| monitor-summary | Canonical user-facing tick summary (byte-stable framing). |
| summarize-submit-plan | Canonical pre-submit confirmation summary. |
| verify-canary | Wait + grep + output-check protocol for 1-task canary submissions. |
| verify-aggregation-complete | All-waves-combined / all-tasks-present / no-cross-run-contamination invariant report. |
| suggest-setup-action / find-prior-run | /submit-hpc Setup priority cascade + cmd_sha resume detection. |
| prune-orphan-sidecars | Clean half-baked sidecars from failed batches. |
hpc-agent <name> --help shows the per-primitive args; many take --spec <path> for a JSON input. See docs/primitives/<name>.md for the per-primitive contract (idempotency, side effects, error codes, schemas).
Configuration
clusters.yaml (required)
Cluster infrastructure definitions. Ships inside the package at hpc_agent/config/clusters.yaml. Override the active path with HPC_CLUSTERS_CONFIG=/your/clusters.yaml (useful for integrators who want to keep their cluster definitions outside the package):
hoffman2:
  host: hoffman2.idre.ucla.edu
  user: <your_user>
  scheduler: sge
  scratch: <your_scratch>
  modules: [python/3.11.9]
  conda_source: /u/local/apps/anaconda3/2024.06/etc/profile.d/conda.sh
  conda_envs: [<your_env>]  # optional; Claude presents these as options
  gpu_types: [a100, h200, a6000]
~/.hpc-agent/config.json (optional)
Per-user config for the recall primitive's default --root. List one or more directories under experiment_roots and recall walks them all when --root is omitted:
{
  "experiment_roots": [
    "/home/user/experiments",
    "/scratch/user/campaigns"
  ]
}
The --root CLI flag still wins when set. If neither flag nor config is present, recall errors with spec_invalid rather than silently falling back to cwd.
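The precedence just described (flag, then config, then an error) could be sketched like this. The function and exception names are hypothetical stand-ins; the real primitive reports the failure as the spec_invalid error code rather than raising this exception.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class SpecInvalid(Exception):
    """Stand-in for the spec_invalid error code."""


def resolve_recall_roots(cli_root: Optional[str], config_path: Path) -> list:
    """Pick the directories to walk: --root wins, then experiment_roots, else error."""
    if cli_root:
        return [cli_root]
    if config_path.is_file():
        roots = json.loads(config_path.read_text()).get("experiment_roots", [])
        if roots:
            return list(roots)
    # Deliberately no fallback to the current working directory.
    raise SpecInvalid("no --root given and no experiment_roots configured")


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.json"
    config.write_text(json.dumps({"experiment_roots": ["/home/user/experiments"]}))

    from_flag = resolve_recall_roots("/scratch/override", config)
    from_config = resolve_recall_roots(None, config)
    try:
        resolve_recall_roots(None, Path(tmp) / "absent.json")
        raised = False
    except SpecInvalid:
        raised = True

print(from_flag, from_config, raised)
```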
Caching
Claude remembers your preferences (cluster, executor directory, environment, resources) across conversations via Claude Code memory. The .hpc/runs/<run_id>.json sidecars (paired with .hpc/tasks.py) serve as the submission record for monitoring and resubmission.
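How a sidecar can double as a resume key is easiest to see in a toy version. Everything below is an assumption for illustration: the helper names, the SHA-256-of-command-string scheme, and the sidecar fields other than cmd_sha are invented, and the package's own compute_cmd_sha and sidecar helpers may behave differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def command_sha(command: str) -> str:
    """SHA-256 of the command string; the real hashing scheme may differ."""
    return hashlib.sha256(command.encode("utf-8")).hexdigest()


def write_sidecar(runs_dir: Path, run_id: str, command: str) -> Path:
    """Write a minimal <run_id>.json record (fields are illustrative)."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{run_id}.json"
    record = {"run_id": run_id, "command": command, "cmd_sha": command_sha(command)}
    path.write_text(json.dumps(record))
    return path


def find_by_command(runs_dir: Path, command: str) -> Optional[str]:
    """Return the run_id of a prior run with the same command hash, if any."""
    wanted = command_sha(command)
    for path in sorted(runs_dir.glob("*.json")):
        if json.loads(path.read_text()).get("cmd_sha") == wanted:
            return path.stem
    return None


with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp) / ".hpc" / "runs"
    write_sidecar(runs, "run-001", "python3 executors/ml_ridge.py")
    same = find_by_command(runs, "python3 executors/ml_ridge.py")
    other = find_by_command(runs, "python3 executors/ml_xgboost.py")

print(same, other)
```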
Job Templates
| Template | SGE | SLURM |
|---|---|---|
| CPU array | hpc_agent/mapreduce/templates/runtime/sge/cpu_array.sh | hpc_agent/mapreduce/templates/runtime/slurm/cpu_array.slurm |
| GPU array | hpc_agent/mapreduce/templates/runtime/sge/gpu_array.sh | hpc_agent/mapreduce/templates/runtime/slurm/gpu_array.slurm |
Templates are parameterized via environment variables injected at submission time. Resolve paths via hpc_agent.get_template_path(scheduler, template). The GPU template is used when the configured resources include gpus; otherwise the CPU template is used.
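The CPU/GPU selection rule can be written down as a tiny sketch. pick_template is a hypothetical helper that mirrors the table of template paths; it does not call hpc_agent.get_template_path, and the shape of the resources dict (a "gpus" key) is an assumption.

```python
RUNTIME_DIR = "hpc_agent/mapreduce/templates/runtime"

# File names copied from the template table.
TEMPLATE_FILES = {
    ("sge", "cpu_array"): "sge/cpu_array.sh",
    ("sge", "gpu_array"): "sge/gpu_array.sh",
    ("slurm", "cpu_array"): "slurm/cpu_array.slurm",
    ("slurm", "gpu_array"): "slurm/gpu_array.slurm",
}


def pick_template(scheduler: str, resources: dict) -> str:
    """Choose the GPU template when resources request gpus, else the CPU one."""
    name = "gpu_array" if resources.get("gpus") else "cpu_array"
    return f"{RUNTIME_DIR}/{TEMPLATE_FILES[(scheduler, name)]}"


cpu_path = pick_template("sge", {"cpus": 1, "mem": "16G"})
gpu_path = pick_template("slurm", {"gpus": 2})
print(cpu_path, gpu_path)
```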
Supported Clusters
| Cluster | Institution | Scheduler |
|---|---|---|
| Hoffman2 | UCLA IDRE | SGE |
| Discovery | USC CARC | SLURM |
Cluster connection details are in hpc_agent/config/clusters.yaml (or whatever HPC_CLUSTERS_CONFIG points at).
Python API
from hpc_agent import (
    # Framework subdirectory layout
    framework_subdir, runs_subdir, tasks_path, load_tasks_module,
    # Per-run sidecars
    compute_cmd_sha, write_run_sidecar, read_run_sidecar,
    find_run_by_cmd_sha, find_existing_runs,
    # Cluster config
    load_clusters_config, get_template_path, _PACKAGE_ROOT,
    # Submission
    ClusterConstraints, parse_constraints,
    WorkloadSpec, compute_submission_plan, build_wave_map,
    deploy_runtime, run_combiner_checked,
)
from hpc_agent.infra.backends import get_backend
Development
pip install -e '.[dev]'
pre-commit install # auto-runs ruff, frontmatter regen, index regen
pytest -q # 1400+ tests
The pre-commit hook regenerates docs/primitives/*.md frontmatter,
docs/primitives/README.md catalog, and docs/generated/operations.md
from the @primitive registry, then auto-stages the result. Without it
you'll see CI fail on the corresponding --check gates and have to
push a follow-up chore: regenerate ... commit.