HPC orchestrator for Claude Code and external agent harnesses

hpc-agent

HPC orchestrator for array-batch experiments on SGE/SLURM clusters. Two surfaces over one core:

  • Slash commands for humans in Claude Code (/submit-hpc, /monitor-hpc, /aggregate-hpc, /campaign-hpc, /preflight) — interactive markdown templates in slash_commands/commands/*.md that walk you through choosing a cluster and authoring .hpc/tasks.py. Executor scaffolding is folded into /submit-hpc Step 1; preflight is folded into /submit-hpc Step 6b as an idempotent gate (with /preflight still available as a standalone diagnostic).
  • CLI for agents and automation (hpc-agent <subcommand>) — JSON-in, JSON-out, exit codes. Designed to be invoked via a Bash-style tool by external orchestrators. This is a POSIX-native agent surface: any tool that can shell out and parse JSON can drive a cluster — see docs/reference/agent-surface.md. For integrators: docs/integrations/CONTRACT.md.

Both surfaces invoke hpc-agent <subcommand>. The slash commands are pure markdown that orchestrate the binary; the binary's atomic-ops layer (hpc_agent.runner) ensures cross-surface state — in-flight runs, journal records under ~/.claude/hpc/<repo_hash>/ — is shared automatically.

Quick Start

For humans (Claude Code)

pip install hpc-agent          # or `pip install -e .` from a checkout
hpc-agent setup                # copy commands + skills, wire the Stop hooks

hpc-agent setup copies the bundled slash commands (/preflight, /submit-hpc, /monitor-hpc, /aggregate-hpc, /campaign-hpc, /hpc-axes-init) into ~/.claude/commands/ and the skills into ~/.claude/skills/, then installs hpc-agent's Stop hooks. Every step is idempotent, so re-running is safe. Both asset trees ship inside the package, so this works the same from a wheel install or an editable checkout. Pass --no-hooks to skip the hook step or --dry-run to preview.

Once installed:

  • /preflight (optional) — verify SSH agent + cluster reachability. /submit-hpc auto-runs this as a cached gate, so you only need it for ad-hoc diagnostics.
  • /submit-hpc — answer prompts about cluster, executor, grid params. Scaffolds the executor inline if none exists.
  • /monitor-hpc to monitor, /aggregate-hpc to collect results.

For agents and automation

pip install hpc-agent
hpc-agent preflight --cluster hoffman2                    # health check
hpc-agent interview --spec intent.json --campaign-dir <d> # persist campaign intent next to tasks.py
hpc-agent recall --root ~/experiments --task-kind <kind>  # query past interviews for next-interview grounding
hpc-agent submit --spec spec.json                          # JSON envelope on stdout
hpc-agent status --run-id <id>                             # one-shot snapshot; poll as needed
hpc-agent aggregate --run-id <id> --wave 1                 # combiner + result pull

Stdout is a single-line JSON envelope: {"ok": true, "idempotent": ..., "data": {...}} or {"ok": false, "error_code": ..., "retry_safe": ..., "remediation": ...}. Exit codes: 0 ok, 1 user error, 2 cluster/network, 3 internal. Full schema in docs/reference/cli-spec.md; JSON Schema files for runtime validation under hpc_agent/schemas/.
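
As a rough illustration of consuming that envelope from a harness, the sketch below parses one stdout line and pairs it with the exit code. The Envelope class and parse_envelope function are hypothetical helpers, not part of hpc-agent, and the run_id key inside data in the usage note is an invented example; docs/reference/cli-spec.md remains the authority on the schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Exit-code meanings exactly as listed in the paragraph above.
EXIT_MEANINGS = {0: "ok", 1: "user error", 2: "cluster/network", 3: "internal"}


@dataclass
class Envelope:
    """Hypothetical container for one parsed envelope (not an hpc-agent type)."""
    ok: bool
    exit_meaning: str
    data: Optional[dict] = None
    error_code: Optional[str] = None
    retry_safe: bool = False
    remediation: Optional[str] = None


def parse_envelope(stdout: str, exit_code: int) -> Envelope:
    """Parse the single-line JSON envelope printed on stdout."""
    payload = json.loads(stdout.strip())
    meaning = EXIT_MEANINGS.get(exit_code, "unknown")
    if payload.get("ok"):
        return Envelope(ok=True, exit_meaning=meaning, data=payload.get("data", {}))
    # Failure branch: surface the fields a caller needs to decide on a retry.
    return Envelope(
        ok=False,
        exit_meaning=meaning,
        error_code=payload.get("error_code"),
        retry_safe=bool(payload.get("retry_safe", False)),
        remediation=payload.get("remediation"),
    )
```

A caller would typically pass the stdout and return code captured from subprocess.run, for example parse_envelope('{"ok": true, "idempotent": false, "data": {"run_id": "r1"}}', 0), and retry only when ok is false and retry_safe is true.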

For integrators

hpc-agent is Bash-invokable from any agent harness with a JSON parser. See docs/integrations/CONTRACT.md for the full contract: the spawn env block, error_code → retry policy table, the find-prior-run → submit → monitor-summary → verify-aggregation-complete workflow, the .hpc/tasks.py boundary, and the executor import allowlist.

The canonical reference for .hpc/tasks.py is shipped inside the package at src/hpc_agent/mapreduce/templates/scaffolds/tasks_example.py. It demonstrates three patterns (Cartesian product, chunking by row count, date-window backtests) inline. Integrators locate it at runtime via from hpc_agent import _PACKAGE_ROOT or rglob("tasks_example.py").

The most common first-time failure is the harness's default-empty spawn env dropping SSH_AUTH_SOCK. hpc-agent status/aggregate/reconcile fail fast with error_code: "ssh_unreachable" (exit 2) instead of hanging on auth — run hpc-agent preflight first to verify the spawn env. hpc-agent does not kill cluster jobs by design (settings.json denies scancel/qdel); if the integrator decides a run is bad, stop polling and let it expire.
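
One way a harness might guard against the dropped SSH_AUTH_SOCK is to build the child environment explicitly and fail early when no agent socket is present. This is a sketch only: build_spawn_env is a hypothetical helper, and forwarding PATH and HOME alongside SSH_AUTH_SOCK is an assumption; the spawn env block in docs/integrations/CONTRACT.md is the actual contract.

```python
import os
from typing import Mapping, Optional

# SSH_AUTH_SOCK is the variable the text calls out. PATH and HOME are
# assumptions, forwarded so that ssh/rsync and user config can be found.
FORWARDED = ("SSH_AUTH_SOCK", "PATH", "HOME")


def build_spawn_env(parent: Optional[Mapping[str, str]] = None) -> dict:
    """Return a minimal env for spawning hpc-agent, or raise if no SSH agent."""
    parent = os.environ if parent is None else parent
    if not parent.get("SSH_AUTH_SOCK"):
        raise RuntimeError("SSH_AUTH_SOCK is not set; start or forward an SSH agent")
    # Copy only the allowlisted variables that actually exist in the parent.
    return {name: parent[name] for name in FORWARDED if name in parent}
```

The resulting dict can be passed as env= to subprocess.run when invoking hpc-agent preflight --cluster hoffman2, so a missing agent shows up as a local exception rather than a later ssh_unreachable.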


Standalone usage

Organize your experiment repo

Keep standalone executor scripts in a dedicated directory, separate from shared utilities:

my_experiment/
├── executors/           # or src/ — each file is a runnable experiment
│   ├── ml_ridge.py      # python3 executors/ml_ridge.py --help
│   ├── ml_xgboost.py
│   └── dl_patchts.py
├── lib/                 # shared utilities (not executors)
│   ├── loading.py
│   └── transforms.py
└── data/

Each executor accepts experiment-specific arguments (--horizon, --start, --end, --features, etc.). No HPC awareness is needed — all parameters arrive as CLI flags.
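
A minimal sketch of what such an executor's argument handling could look like, using the flags from the example conversation further down (--horizon, --start, --end, --output-file). The function names and the placeholder run body are illustrative assumptions, not code from this project.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the experiment's parameters as ordinary CLI flags."""
    parser = argparse.ArgumentParser(description="Ridge experiment (illustrative)")
    parser.add_argument("--horizon", type=int, required=True, help="forecast horizon")
    parser.add_argument("--start", required=True, help="start date, e.g. 2020-01-01")
    parser.add_argument("--end", required=True, help="end date, e.g. 2020-12-31")
    parser.add_argument("--output-file", required=True, help="where results are written")
    return parser


def run(argv: Optional[List[str]] = None) -> dict:
    """Parse flags and return the resolved configuration.

    A real executor would train and evaluate here; this placeholder only
    echoes the parameters so the sketch stays self-contained.
    """
    args = build_parser().parse_args(argv)
    return {
        "horizon": args.horizon,
        "start": args.start,
        "end": args.end,
        "output_file": args.output_file,
    }
```

Because every parameter is a flag with help text, python3 executors/ml_ridge.py --help describes the script without any scheduler-specific code in it.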

Run

/preflight      → verify SSH agent + cluster reachability before first submit
/submit-hpc     → discovers executors, walks you through .hpc/tasks.py, syncs code, submits
/monitor-hpc    → tracks completion per grid point, diagnoses failures, auto-resubmits
/aggregate-hpc  → validates completeness, runs aggregation, downloads summaries

Example conversation:

You: /submit-hpc run ridge and xgboost with horizon=[1, 5, 25]

Claude: I found these executors in src/:
  ml_ridge.py    — --horizon, --start, --end, --output-file
  ml_xgboost.py  — --horizon, --start, --end, --output-file

Proposed plan:
  Cluster: hoffman2 (SGE)
  Grid: executor=[ml_ridge, ml_xgboost] × horizon=[1, 5, 25] → 6 grid points
  Total: 6 tasks
  Resources: 1 CPU, 16G, 4:00:00
  Confirm?

You: yes

Claude: Submitted job 12345678 (6 tasks). Run /monitor-hpc to track progress.

No config files required. Claude discovers your executors by reading their source and --help, then suggests resources conversationally based on the executor and your input.

How It Works

The boundary between hpc-agent and your experiment repo is documented in docs/reference/boundary-contract.md and enforced by tests/test_boundary_contract.py.

  1. Claude reads your executor scripts and their --help output.
  2. You describe what to run in natural language — Claude walks you through writing .hpc/tasks.py once: a small Python module exposing total() and resolve(task_id) that returns the per-task kwargs. The file is committed to git and reused on every subsequent submit.
  3. A per-run sidecar .hpc/runs/<run_id>.json records the executor command, result-dir template, cmd_sha, and wave map for this particular submission.
  4. The framework executor _hpc_dispatch.py (zero deps, stdlib-only) is deployed to the cluster's .hpc/ by deploy_runtime.
  5. The job template runs the dispatcher, which imports your .hpc/tasks.py, calls resolve(task_id), formats the result_dir, and execs your executor command with kwargs as env vars.
  6. Your executor reads kwargs as ordinary env vars (uppercased + HPC_KW_*) — no HPC awareness needed.
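
To make step 2 concrete, here is a sketch of a Cartesian-product .hpc/tasks.py for the six-point grid from the example conversation. Only the total() and resolve(task_id) names come from the description above; the 0-based task_id and the kwarg names are assumptions, and the shipped tasks_example.py is the canonical reference.

```python
import itertools

# Axes taken from the example conversation: 2 executors x 3 horizons.
EXECUTORS = ["ml_ridge", "ml_xgboost"]
HORIZONS = [1, 5, 25]

# Materialize the grid once so total() and resolve() always agree.
_GRID = list(itertools.product(EXECUTORS, HORIZONS))


def total() -> int:
    """Number of tasks in the array job."""
    return len(_GRID)


def resolve(task_id: int) -> dict:
    """Return the per-task kwargs for one grid point.

    Assumption: task_id is 0-based. If the dispatcher passes the
    scheduler's 1-based index, subtract 1 before indexing.
    """
    if not 0 <= task_id < len(_GRID):
        raise IndexError(f"task_id {task_id} outside 0..{len(_GRID) - 1}")
    executor, horizon = _GRID[task_id]
    return {"executor": executor, "horizon": horizon}
```

Keeping the grid in a module-level list makes the mapping deterministic, which matters because the file is committed and reused on later submits.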

Parallelism Model

The parallelization axis lives entirely in user code (.hpc/tasks.py). The framework is agnostic to whether you're doing a Cartesian grid, chunking by row count, date-window backtests, or something else — it just calls total() and resolve(i). The canonical reference at hpc_agent/mapreduce/templates/scaffolds/tasks_example.py shows three patterns inline; the agent helps you keep whichever applies and delete the rest.
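
For contrast with a Cartesian grid, a chunking-by-row-count variant of the same two functions might look like the sketch below. N_ROWS, CHUNK_SIZE, and the row_start/row_end kwarg names are invented for illustration, and task_id is again assumed to be 0-based.

```python
import math

N_ROWS = 1_050     # hypothetical dataset size
CHUNK_SIZE = 200   # hypothetical rows handled by one task


def total() -> int:
    """One task per chunk; the last chunk may be short."""
    return math.ceil(N_ROWS / CHUNK_SIZE)


def resolve(task_id: int) -> dict:
    """Return the half-open row range [row_start, row_end) for one task."""
    if not 0 <= task_id < total():
        raise IndexError(f"task_id {task_id} outside 0..{total() - 1}")
    start = task_id * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, N_ROWS)
    return {"row_start": start, "row_end": end}
```

The framework-facing surface is identical in both patterns; only the body of resolve changes, which is what lets the parallelization axis live entirely in user code.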

Memory across campaigns

Two primitives — interview and recall — close the loop between consecutive campaigns. The interview agent (Claude Code or any external orchestrator) persists structured intent (goal, task_count, budget, abort_if, task_generator, cluster_target, transcript, provenance) into <campaign_dir>/interview.json next to the materialized tasks.py. The next interview calls recall --root <experiments-dir> to query past intents, returning recency-sorted summaries plus a 3-tier rollup (counts/histograms/quantiles, optional walltime aggregation, optional per-generator parameter envelopes). Observed ranges only — reasoning over them stays in the calling agent.
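
The recency-sorted part of that query can be pictured with the sketch below, which walks a root directory for interview.json files and orders them newest first. It is a simplified stand-in, not the recall implementation: find_interviews is a hypothetical name, file modification time is an assumed recency signal, and the rollup tiers are omitted entirely.

```python
import json
from pathlib import Path
from typing import List


def find_interviews(root: str) -> List[dict]:
    """Collect interview.json summaries under root, newest first."""
    found = []
    for path in Path(root).rglob("interview.json"):
        try:
            intent = json.loads(path.read_text(encoding="utf-8"))
            mtime = path.stat().st_mtime
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        found.append(
            {
                "campaign_dir": str(path.parent),
                "mtime": mtime,
                # goal and task_count are two of the intent fields named above.
                "goal": intent.get("goal"),
                "task_count": intent.get("task_count"),
            }
        )
    return sorted(found, key=lambda row: row["mtime"], reverse=True)
```

In practice an agent would call hpc-agent recall --root <experiments-dir> and read the JSON envelope rather than scan the tree itself.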

See docs/workflows/memory-across-campaigns.md for the full flow, including the task_generator typed materializer (5 shapes: enumerated, cartesian_product, items_x_seeds, numeric_logspace, numeric_linspace) and the ~/.hpc-agent/config.json:experiment_roots default-root config.

Throughput Optimization

hpc-agent automatically optimizes job submissions for cluster constraints. When constraints are configured (max array size, walltime, concurrent job limits), the optimizer packs tasks into batched waves:

  • Tasks are split into arrays of ≤max_array_size
  • Arrays are grouped into waves of ≤max_concurrent_jobs
  • Waves are staggered via scheduler dependencies (SLURM --dependency, SGE -hold_jid)
  • Total wall-clock time is estimated when per-task duration is known

Configure constraints in clusters.yaml (cluster-level); per-experiment overrides resolved at /submit-hpc time are persisted to the run sidecar at .hpc/runs/<run_id>.json.
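
The packing described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the optimizer itself: pack_waves and estimate_wall_seconds are hypothetical names, max_concurrent_jobs is read as "arrays per wave", and the time estimate assumes waves run back to back with every task in a wave running in parallel.

```python
from typing import List, Tuple

Array = Tuple[int, int]  # half-open task range [start, end)


def pack_waves(n_tasks: int, max_array_size: int, max_concurrent_jobs: int) -> List[List[Array]]:
    """Split tasks into arrays, then group arrays into waves."""
    if n_tasks < 0 or max_array_size < 1 or max_concurrent_jobs < 1:
        raise ValueError("n_tasks must be >= 0 and both limits must be >= 1")
    # Step 1: arrays of at most max_array_size tasks each.
    arrays = [
        (start, min(start + max_array_size, n_tasks))
        for start in range(0, n_tasks, max_array_size)
    ]
    # Step 2: waves of at most max_concurrent_jobs arrays each.
    return [
        arrays[i : i + max_concurrent_jobs]
        for i in range(0, len(arrays), max_concurrent_jobs)
    ]


def estimate_wall_seconds(waves: List[List[Array]], per_task_seconds: float) -> float:
    """Crude estimate: one task duration per wave (assumes full parallelism)."""
    return len(waves) * per_task_seconds
```

For example, 2500 tasks with max_array_size=1000 and max_concurrent_jobs=2 yield three arrays in two waves; the second wave would then be held behind the first through the scheduler dependency mechanism mentioned in the list.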

Commands

| Command | What it does |
|---|---|
| /preflight | Standalone: verify SSH agent, ssh/rsync on PATH, clusters.yaml parses, cluster reachable. /submit-hpc auto-runs the same checks as a 24h-cached gate, so direct invocation is mostly for ad-hoc diagnostics. |
| /submit-hpc | Discover executors (scaffolds inline if none found), build grid conversationally, write .hpc/tasks.py with FLAGS dict + .hpc/cli.py dispatcher, sync code, submit array jobs |
| /monitor-hpc | Poll status, diagnose failures, auto-resubmit, self-schedule next check |
| /aggregate-hpc | Validate completeness, run aggregation on cluster, download summaries |
| /campaign-hpc | Closed-loop iteration: tag submits, read prior history, repeat /submit-hpc campaign_id=<slug> until the strategy stops. See docs/workflows/campaign.md. |
| /hpc-axes-init | Write <experiment>/.hpc/axes.yaml with the parallel-axis enumeration + homogeneity hint that drives the cold-start (and warm-path) array-axis picker. |

Primitives

The slash commands above compose ~50 primitives exposed as hpc-agent <name>. Full machine-readable catalog at docs/generated/operations.md (auto-regenerated). High-traffic ones for agent orchestration:

| Primitive | Replaces |
|---|---|
| submit-flow / submit-flow-batch | rsync + deploy + qsub + record (single or N-spec batch with shared rsync). Auto-dispatches when the spec is {specs: [...]}. |
| monitor-flow | Poll-and-combine loop the slash command's tick body wraps. |
| aggregate-flow | rsync_pull _combiner/ + reduce_partials + optional summary pull + ingest runtime samples. |
| build-submit-spec | Resolved-interview-values → validated submit_flow.input.json spec. |
| build-tasks-py | Cartesian-product axes → .hpc/tasks.py from the canonical Pattern 1 template. |
| discover-executors / discover-reducers | Scan repo for executor scripts / aggregator scripts (find existing reducer instead of writing a fresh one). |
| decide-monitor-arm | Pick cron/loop/none + cadence + cron schedule + literal armed: line. |
| monitor-summary | Canonical user-facing tick summary (byte-stable framing). |
| summarize-submit-plan | Canonical pre-submit confirmation summary. |
| verify-canary | Wait + grep + output-check protocol for 1-task canary submissions. |
| verify-aggregation-complete | All-waves-combined / all-tasks-present / no-cross-run-contamination invariant report. |
| suggest-setup-action / find-prior-run | /submit-hpc Setup priority cascade + cmd_sha resume detection. |
| prune-orphan-sidecars | Clean half-baked sidecars from failed batches. |

hpc-agent <name> --help shows the per-primitive args; many take --spec <path> for a JSON input. See docs/primitives/<name>.md for the per-primitive contract (idempotency, side effects, error codes, schemas).

Configuration

clusters.yaml (required)

Cluster infrastructure definitions. Ships inside the package at hpc_agent/config/clusters.yaml. Override the active path with HPC_CLUSTERS_CONFIG=/your/clusters.yaml (useful for integrators who want to keep their cluster definitions outside the package):

hoffman2:
  host: hoffman2.idre.ucla.edu
  user: <your_user>
  scheduler: sge
  scratch: <your_scratch>
  modules: [python/3.11.9]
  conda_source: /u/local/apps/anaconda3/2024.06/etc/profile.d/conda.sh
  conda_envs: [<your_env>]          # optional — Claude presents these as options
  gpu_types: [a100, h200, a6000]

~/.hpc-agent/config.json (optional)

Per-user config for the recall primitive's default --root. List one or more directories under experiment_roots and recall walks them all when --root is omitted:

{
  "experiment_roots": [
    "/home/user/experiments",
    "/scratch/user/campaigns"
  ]
}

The --root CLI flag still wins when set. If neither flag nor config is present, recall errors with spec_invalid rather than silently falling back to cwd.
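
The precedence just described (flag, then config, then an error) can be sketched as follows. resolve_roots and the SpecInvalid exception are hypothetical stand-ins for whatever the recall primitive does internally; only the experiment_roots key and the spec_invalid outcome come from the text above.

```python
import json
from pathlib import Path
from typing import List, Optional


class SpecInvalid(Exception):
    """Hypothetical stand-in for the spec_invalid error code."""


def resolve_roots(cli_root: Optional[str], config_path: Path) -> List[str]:
    """Pick the directories to walk: --root wins, then the config file."""
    if cli_root:
        return [cli_root]
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        roots = config.get("experiment_roots", [])
        if roots:
            return list(roots)
    # No silent fallback to the current working directory.
    raise SpecInvalid("no --root given and no experiment_roots configured")
```

Raising instead of defaulting to cwd keeps an unconfigured agent from quietly grounding its next interview on an unrelated directory.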

Caching

Claude remembers your preferences (cluster, executor directory, environment, resources) across conversations via Claude Code memory. The .hpc/runs/<run_id>.json sidecars (paired with .hpc/tasks.py) serve as the submission record for monitoring and resubmission.

Job Templates

| Template | SGE | SLURM |
|---|---|---|
| CPU array | hpc_agent/mapreduce/templates/runtime/sge/cpu_array.sh | hpc_agent/mapreduce/templates/runtime/slurm/cpu_array.slurm |
| GPU array | hpc_agent/mapreduce/templates/runtime/sge/gpu_array.sh | hpc_agent/mapreduce/templates/runtime/slurm/gpu_array.slurm |

Templates are parameterized via environment variables injected at submission time. Resolve paths via hpc_agent.get_template_path(scheduler, template). The GPU template is used when the configured resources include gpus; otherwise the CPU template is used.
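
The CPU/GPU selection rule can be expressed as a small pure function over the four paths in the table. pick_template is a hypothetical helper written for illustration, and the "gpus" resource key is an assumed name; in real code, hpc_agent.get_template_path(scheduler, template) is the documented way to resolve a path.

```python
# Relative paths mirror the Job Templates table above.
TEMPLATE_ROOT = "hpc_agent/mapreduce/templates/runtime"
EXTENSIONS = {"sge": "sh", "slurm": "slurm"}


def pick_template(scheduler: str, resources: dict) -> str:
    """Return the template path for a scheduler and resource request."""
    if scheduler not in EXTENSIONS:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    # Assumption: a missing or zero "gpus" entry means a CPU-only job.
    kind = "gpu_array" if resources.get("gpus") else "cpu_array"
    return f"{TEMPLATE_ROOT}/{scheduler}/{kind}.{EXTENSIONS[scheduler]}"
```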

Supported Clusters

| Cluster | Institution | Scheduler |
|---|---|---|
| Hoffman2 | UCLA IDRE | SGE |
| Discovery | USC CARC | SLURM |

Cluster connection details are in hpc_agent/config/clusters.yaml (or whatever HPC_CLUSTERS_CONFIG points at).

Python API

from hpc_agent import (
    # Framework subdirectory layout
    framework_subdir, runs_subdir, tasks_path, load_tasks_module,
    # Per-run sidecars
    compute_cmd_sha, write_run_sidecar, read_run_sidecar,
    find_run_by_cmd_sha, find_existing_runs,
    # Cluster config
    load_clusters_config, get_template_path, _PACKAGE_ROOT,
    # Submission
    ClusterConstraints, parse_constraints,
    WorkloadSpec, compute_submission_plan, build_wave_map,
    deploy_runtime, run_combiner_checked,
)
from hpc_agent.infra.backends import get_backend

Development

pip install -e '.[dev]'
pre-commit install        # auto-runs ruff, frontmatter regen, index regen
pytest -q                 # 1400+ tests

The pre-commit hook regenerates docs/primitives/*.md frontmatter, docs/primitives/README.md catalog, and docs/generated/operations.md from the @primitive registry, then auto-stages the result. Without it you'll see CI fail on the corresponding --check gates and have to push a follow-up chore: regenerate ... commit.
