HPC orchestrator for Claude Code and external agent harnesses

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- POSIX
Programming Language
- Python :: 3
- Python :: 3.10
Topic
- System :: Distributed Computing

Project description

hpc-agent

HPC orchestrator for array-batch experiments on SGE/SLURM clusters. Two surfaces over one core:

- Slash commands for humans in Claude Code (/submit-hpc, /monitor-hpc, /aggregate-hpc, /campaign-hpc, /preflight): interactive markdown templates in slash_commands/commands/*.md that walk you through choosing a cluster and authoring .hpc/tasks.py. Executor scaffolding is folded into /submit-hpc Step 1; preflight is folded into /submit-hpc Step 6b as an idempotent gate, with /preflight still available as a standalone diagnostic.
- CLI for agents and automation (hpc-agent <subcommand>): JSON in, JSON out, with exit codes. It is designed to be invoked through a Bash-style tool by external orchestrators. Because the surface is POSIX-native, any tool that can shell out and parse JSON can drive a cluster; see docs/reference/agent-surface.md. Integrators should start with docs/integrations/CONTRACT.md.

Both surfaces invoke hpc-agent <subcommand>. The slash commands are pure markdown that orchestrate the binary; the binary's atomic-ops layer (hpc_agent.runner) ensures cross-surface state — in-flight runs, journal records under ~/.claude/hpc/<repo_hash>/ — is shared automatically.

Quick Start

For humans (Claude Code)

pip install hpc-agent          # or `pip install -e .` from a checkout
hpc-agent setup                # copy commands + skills, wire the Stop hooks

hpc-agent setup copies the bundled slash commands into ~/.claude/commands/ and the skills into ~/.claude/skills/, then installs hpc-agent's Stop hooks. Every step is idempotent, so re-running is safe. Every command (/preflight, /submit-hpc, /monitor-hpc, /aggregate-hpc, /campaign-hpc, /hpc-axes-init) and skill ships inside the package, so setup behaves the same from a wheel install or an editable checkout. Pass --no-hooks to skip the hook step or --dry-run to preview the changes.

Once installed:

/preflight (optional) — verify SSH agent + cluster reachability. /submit-hpc auto-runs this as a cached gate, so you only need it for ad-hoc diagnostics.
/submit-hpc — answer prompts about cluster, executor, grid params. Scaffolds the executor inline if none exists.
/monitor-hpc to monitor, /aggregate-hpc to collect results.

For agents and automation

pip install hpc-agent
hpc-agent preflight --cluster hoffman2                    # health check
hpc-agent interview --spec intent.json --campaign-dir <d> # persist campaign intent next to tasks.py
hpc-agent recall --root ~/experiments --task-kind <kind>  # query past interviews for next-interview grounding
hpc-agent submit --spec spec.json                          # JSON envelope on stdout
hpc-agent status --run-id <id>                             # one-shot snapshot; poll as needed
hpc-agent aggregate --run-id <id> --wave 1                 # combiner + result pull

Stdout is a single-line JSON envelope: {"ok": true, "idempotent": ..., "data": {...}} or {"ok": false, "error_code": ..., "retry_safe": ..., "remediation": ...}. Exit codes: 0 ok, 1 user error, 2 cluster/network, 3 internal. Full schema in docs/reference/cli-spec.md; JSON Schema files for runtime validation under hpc_agent/schemas/.
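As a rough illustration of consuming that envelope from a harness, the sketch below parses one stdout line and applies a retry rule. The field names come from the envelope shapes above; the `should_retry` policy is a hypothetical example, not the policy table in docs/integrations/CONTRACT.md.

```python
import json

# Exit-code meanings as listed above.
EXIT_MEANINGS = {0: "ok", 1: "user error", 2: "cluster/network", 3: "internal"}


def parse_envelope(stdout: str) -> dict:
    """Parse the single-line JSON envelope and normalize it for a caller."""
    envelope = json.loads(stdout.strip())
    if envelope.get("ok"):
        return {
            "ok": True,
            "idempotent": envelope.get("idempotent"),
            "data": envelope.get("data", {}),
        }
    return {
        "ok": False,
        "error_code": envelope.get("error_code"),
        "retry_safe": bool(envelope.get("retry_safe", False)),
        "remediation": envelope.get("remediation"),
    }


def should_retry(result: dict, exit_code: int) -> bool:
    """Hypothetical policy: retry only cluster/network failures marked retry_safe."""
    return (not result["ok"]) and exit_code == 2 and result["retry_safe"]
```

A caller would typically feed this the stdout and return code of a `subprocess.run(["hpc-agent", "status", "--run-id", run_id], capture_output=True, text=True)` call.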

For integrators

hpc-agent is Bash-invokable from any agent harness with a JSON parser. See docs/integrations/CONTRACT.md for the full contract: the spawn env block, error_code → retry policy table, the find-prior-run → submit → monitor-summary → verify-aggregation-complete workflow, the .hpc/tasks.py boundary, and the executor import allowlist.

The canonical reference for .hpc/tasks.py ships inside the package at hpc_agent/mapreduce/templates/scaffolds/tasks_example.py (src/hpc_agent/mapreduce/templates/scaffolds/tasks_example.py in a checkout). It demonstrates three patterns inline: Cartesian product, chunking by row count, and date-window backtests. Integrators can locate it at runtime by importing _PACKAGE_ROOT (from hpc_agent import _PACKAGE_ROOT) or by searching with rglob("tasks_example.py").

The most common first-time failure is the harness's default-empty spawn env dropping SSH_AUTH_SOCK. hpc-agent status/aggregate/reconcile fail fast with error_code: "ssh_unreachable" (exit 2) instead of hanging on auth — run hpc-agent preflight first to verify the spawn env. hpc-agent does not kill cluster jobs by design (settings.json denies scancel/qdel); if the integrator decides a run is bad, stop polling and let it expire.
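One way a harness might avoid the empty-env pitfall is to build the child environment explicitly and check that SSH_AUTH_SOCK survived. This is a minimal sketch; `build_spawn_env` and its default passthrough list are illustrative assumptions, and the authoritative spawn env block is the one in docs/integrations/CONTRACT.md.

```python
import os


def build_spawn_env(base_env=None, passthrough=("SSH_AUTH_SOCK", "PATH", "HOME")):
    """Return (env, missing): the variables to forward and those not found.

    The passthrough tuple is an assumed minimal set, not the documented contract.
    """
    source = os.environ if base_env is None else base_env
    env = {name: source[name] for name in passthrough if name in source}
    missing = [name for name in passthrough if name not in source]
    return env, missing
```

If `"SSH_AUTH_SOCK"` shows up in `missing`, the harness can stop before spawning anything; otherwise it can pass `env=env` to `subprocess.run(["hpc-agent", "preflight", "--cluster", "hoffman2"], env=env)`.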

Standalone usage

Organize your experiment repo

Keep standalone executor scripts in a dedicated directory, separate from shared utilities:

my_experiment/
├── executors/           # or src/ — each file is a runnable experiment
│   ├── ml_ridge.py      # python3 executors/ml_ridge.py --help
│   ├── ml_xgboost.py
│   └── dl_patchts.py
├── lib/                 # shared utilities (not executors)
│   ├── loading.py
│   └── transforms.py
└── data/

Each executor accepts experiment-specific arguments (--horizon, --start, --end, --features, etc.). No HPC awareness is needed — all parameters arrive as CLI flags.
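For orientation, a standalone executor along these lines could look like the sketch below. The flag names mirror the ones mentioned above; the body is a placeholder, and nothing here is taken from a real executor in the project.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse experiment-specific flags; no HPC concepts appear here."""
    parser = argparse.ArgumentParser(description="Illustrative ridge executor")
    parser.add_argument("--horizon", type=int, required=True)
    parser.add_argument("--start", required=True, help="start date, e.g. 2020-01-01")
    parser.add_argument("--end", required=True, help="end date, e.g. 2020-12-31")
    parser.add_argument("--output-file", required=True)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder for the real experiment: just echo the parameters.
    result = {"horizon": args.horizon, "start": args.start, "end": args.end}
    with open(args.output_file, "w", encoding="utf-8") as fh:
        json.dump(result, fh)
    return result

# A real script would end by calling main() under the usual
# `if __name__ == "__main__":` guard.
```

Because the script exposes everything through argparse, `python3 executors/ml_ridge.py --help` documents its own parameters.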

Run

/preflight     → verifies SSH agent + cluster reachability before first submit
/submit-hpc    → discovers executors, walks you through .hpc/tasks.py, syncs code, submits
/monitor-hpc   → tracks completion per grid point, diagnoses failures, auto-resubmits
/aggregate-hpc → validates completeness, runs aggregation, downloads summaries

Example conversation:

You: /submit-hpc run ridge and xgboost with horizon=[1, 5, 25]

Claude: I found these executors in src/:
  ml_ridge.py    — --horizon, --start, --end, --output-file
  ml_xgboost.py  — --horizon, --start, --end, --output-file

Proposed plan:
  Cluster: hoffman2 (SGE)
  Grid: executor=[ml_ridge, ml_xgboost] × horizon=[1, 5, 25] → 6 grid points
  Total: 6 tasks
  Resources: 1 CPU, 16G, 4:00:00
  Confirm?

You: yes

Claude: Submitted job 12345678 (6 tasks). Run /monitor-hpc to track progress.

No config files required. Claude discovers your executors by reading their source and --help, then suggests resources conversationally based on the executor and your input.

How It Works

The boundary between hpc-agent and your experiment repo is documented in docs/reference/boundary-contract.md and enforced by tests/test_boundary_contract.py.

Claude reads your executor scripts and their --help output.
You describe what to run in natural language — Claude walks you through writing .hpc/tasks.py once: a small Python module exposing total() and resolve(task_id) that returns the per-task kwargs. The file is committed to git and reused on every subsequent submit.
A per-run sidecar .hpc/runs/<run_id>.json records the executor command, result-dir template, cmd_sha, and wave map for this particular submission.
The framework executor _hpc_dispatch.py (zero deps, stdlib-only) is deployed to the cluster's .hpc/ by deploy_runtime.
The job template runs the dispatcher, which imports your .hpc/tasks.py, calls resolve(task_id), formats the result_dir, and execs your executor command with kwargs as env vars.
Your executor reads kwargs as ordinary env vars (uppercased + HPC_KW_*) — no HPC awareness needed.
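The last two steps can be pictured with the sketch below, which turns one task's kwargs into environment variables and reads them back. The exact variable naming is an assumption: this sketch sets both a plain uppercased name and an HPC_KW_-prefixed one, which is one reading of "uppercased + HPC_KW_*". The helper names are hypothetical and are not part of _hpc_dispatch.py.

```python
import os


def kwargs_to_env(kwargs, base_env=None):
    """Merge per-task kwargs into an env dict under two assumed naming schemes."""
    env = dict(os.environ if base_env is None else base_env)
    for key, value in kwargs.items():
        upper = key.upper()
        env[upper] = str(value)              # assumed: plain uppercased name
        env[f"HPC_KW_{upper}"] = str(value)  # assumed: prefixed variant
    return env


def read_kwarg(name, env=None, default=None):
    """Executor-side lookup: prefer the prefixed variable, then the plain one."""
    env = os.environ if env is None else env
    upper = name.upper()
    return env.get(f"HPC_KW_{upper}", env.get(upper, default))
```

Environment values are always strings, so an executor would still convert types itself, for example `int(read_kwarg("horizon"))`.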

Parallelism Model

The parallelization axis lives entirely in user code (.hpc/tasks.py). The framework is agnostic to whether you're doing a Cartesian grid, chunking by row count, date-window backtests, or something else — it just calls total() and resolve(i). The canonical reference at hpc_agent/mapreduce/templates/scaffolds/tasks_example.py shows three patterns inline; the agent helps you keep whichever applies and delete the rest.
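A minimal Cartesian-product .hpc/tasks.py might look like the sketch below. Only total() and resolve(task_id) come from the description above; the AXES dict, the 0-based task ids, and the error handling are assumptions for illustration, and the shipped tasks_example.py remains the reference.

```python
import itertools

# Hypothetical grid, mirroring the example conversation earlier in this README.
AXES = {
    "executor": ["ml_ridge", "ml_xgboost"],
    "horizon": [1, 5, 25],
}

# Enumerate every combination once; the last axis varies fastest.
_GRID = [dict(zip(AXES, combo)) for combo in itertools.product(*AXES.values())]


def total():
    """Number of tasks in the grid."""
    return len(_GRID)


def resolve(task_id):
    """Return the kwargs for one task. Assumes 0-based task ids."""
    if not 0 <= task_id < len(_GRID):
        raise IndexError(f"task_id {task_id} outside 0..{len(_GRID) - 1}")
    return dict(_GRID[task_id])
```

Swapping the grid for row chunks or date windows only changes how `_GRID` is built; the two-function interface stays the same.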

Memory across campaigns

Two primitives, interview and recall, close the loop between consecutive campaigns. The interview agent (Claude Code or any external orchestrator) persists structured intent (goal, task_count, budget, abort_if, task_generator, cluster_target, transcript, provenance) into <campaign_dir>/interview.json next to the materialized tasks.py. The next interview calls recall --root <experiments-dir> to query past intents; it returns recency-sorted summaries plus a 3-tier rollup (counts/histograms/quantiles, optional walltime aggregation, optional per-generator parameter envelopes). The rollup reports observed ranges only; reasoning over them stays in the calling agent.

See docs/workflows/memory-across-campaigns.md for the full flow, including the task_generator typed materializer (5 shapes: enumerated, cartesian_product, items_x_seeds, numeric_logspace, numeric_linspace) and the ~/.hpc-agent/config.json:experiment_roots default-root config.

Throughput Optimization

hpc-agent automatically optimizes job submissions for cluster constraints. When constraints are configured (max array size, walltime, concurrent job limits), the optimizer packs tasks into batched waves:

Tasks are split into arrays of ≤max_array_size
Arrays are grouped into waves of ≤max_concurrent_jobs
Waves are staggered via scheduler dependencies (SLURM --dependency, SGE -hold_jid)
Total wall-clock time is estimated when per-task duration is known

Configure constraints in clusters.yaml (cluster-level). Per-experiment overrides resolved at /submit-hpc time are persisted to the run sidecar at .hpc/runs/<run_id>.json.
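The packing rule can be sketched in a few lines. This is an illustration of the two size limits only, assuming max_concurrent_jobs counts arrays per wave; `pack_waves` is a hypothetical helper, not the package's compute_submission_plan, and it ignores walltime and dependency wiring.

```python
def pack_waves(total_tasks, max_array_size, max_concurrent_jobs):
    """Split task ids into arrays, then group the arrays into waves.

    Returns a list of waves; each wave is a list of (start, stop) half-open
    task-id ranges, one per array job.
    """
    if max_array_size < 1 or max_concurrent_jobs < 1:
        raise ValueError("limits must be positive")
    # Arrays of at most max_array_size tasks each.
    arrays = [
        (start, min(start + max_array_size, total_tasks))
        for start in range(0, total_tasks, max_array_size)
    ]
    # Waves of at most max_concurrent_jobs arrays each.
    return [
        arrays[i:i + max_concurrent_jobs]
        for i in range(0, len(arrays), max_concurrent_jobs)
    ]
```

With 10 tasks, arrays of at most 4, and 2 concurrent jobs, this yields two waves: `[(0, 4), (4, 8)]` followed by `[(8, 10)]`. In the real flow, the second wave would be held behind the first through the scheduler's dependency mechanism.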

Commands

Command	What it does
`/preflight`	Standalone: verify SSH agent, ssh/rsync on PATH, clusters.yaml parses, cluster reachable. `/submit-hpc` auto-runs the same checks as a 24h-cached gate, so direct invocation is mostly for ad-hoc diagnostics.
`/submit-hpc`	Discover executors (scaffolds inline if none found), build grid conversationally, write `.hpc/tasks.py` with FLAGS dict + `.hpc/cli.py` dispatcher, sync code, submit array jobs
`/monitor-hpc`	Poll status, diagnose failures, auto-resubmit, self-schedule next check
`/aggregate-hpc`	Validate completeness, run aggregation on cluster, download summaries
`/campaign-hpc`	Closed-loop iteration: tag submits, read prior history, repeat `/submit-hpc campaign_id=<slug>` until the strategy stops. See `docs/workflows/campaign.md`.
`/hpc-axes-init`	Write `<experiment>/.hpc/axes.yaml` with the parallel-axis enumeration + homogeneity hint that drives the cold-start (and warm-path) array-axis picker.

Primitives

The slash commands above compose ~50 primitives exposed as hpc-agent <name>. Full machine-readable catalog at docs/generated/operations.md (auto-regenerated). High-traffic ones for agent orchestration:

Primitive	Replaces
`submit-flow` / `submit-flow-batch`	rsync + deploy + qsub + record (single or N-spec batch with shared rsync). Auto-dispatches when the spec is `{specs: [...]}`.
`monitor-flow`	Poll-and-combine loop the slash command's tick body wraps.
`aggregate-flow`	rsync_pull `_combiner/` + `reduce_partials` + optional summary pull + ingest runtime samples.
`build-submit-spec`	Resolved-interview-values → validated `submit_flow.input.json` spec.
`build-tasks-py`	Cartesian-product axes → `.hpc/tasks.py` from the canonical Pattern 1 template.
`discover-executors` / `discover-reducers`	Scan repo for executor scripts / aggregator scripts (find existing reducer instead of writing a fresh one).
`decide-monitor-arm`	Pick cron/loop/none + cadence + cron schedule + literal `armed:` line.
`monitor-summary`	Canonical user-facing tick summary (byte-stable framing).
`summarize-submit-plan`	Canonical pre-submit confirmation summary.
`verify-canary`	Wait + grep + output-check protocol for 1-task canary submissions.
`verify-aggregation-complete`	All-waves-combined / all-tasks-present / no-cross-run-contamination invariant report.
`suggest-setup-action` / `find-prior-run`	`/submit-hpc` Setup priority cascade + `cmd_sha` resume detection.
`prune-orphan-sidecars`	Clean half-baked sidecars from failed batches.

hpc-agent <name> --help shows the per-primitive args; many take --spec <path> for a JSON input. See docs/primitives/<name>.md for the per-primitive contract (idempotency, side effects, error codes, schemas).

Configuration

`clusters.yaml` (required)

Cluster infrastructure definitions. Ships inside the package at hpc_agent/config/clusters.yaml. Override the active path with HPC_CLUSTERS_CONFIG=/your/clusters.yaml (useful for integrators who want to keep their cluster definitions outside the package):

hoffman2:
  host: hoffman2.idre.ucla.edu
  user: <your_user>
  scheduler: sge
  scratch: <your_scratch>
  modules: [python/3.11.9]
  conda_source: /u/local/apps/anaconda3/2024.06/etc/profile.d/conda.sh
  conda_envs: [<your_env>]          # optional — Claude presents these as options
  gpu_types: [a100, h200, a6000]

`~/.hpc-agent/config.json` (optional)

Per-user config for the recall primitive's default --root. List one or more directories under experiment_roots and recall walks them all when --root is omitted:

{
  "experiment_roots": [
    "/home/user/experiments",
    "/scratch/user/campaigns"
  ]
}

The --root CLI flag still wins when set. If neither flag nor config is present, recall errors with spec_invalid rather than silently falling back to cwd.
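The precedence described above amounts to a small resolution function. The sketch below is an assumption-laden illustration: `resolve_roots` and the `SpecInvalid` exception are hypothetical names, and the config is passed in as an already-parsed dict rather than read from ~/.hpc-agent/config.json.

```python
from pathlib import Path


class SpecInvalid(ValueError):
    """Stand-in for the spec_invalid error code (hypothetical class)."""


def resolve_roots(cli_root, config):
    """Pick recall roots: the --root flag wins, then experiment_roots, else error."""
    if cli_root:
        return [Path(cli_root)]
    roots = (config or {}).get("experiment_roots") or []
    if roots:
        return [Path(root) for root in roots]
    # No silent fallback to the current working directory.
    raise SpecInvalid("no --root given and no experiment_roots configured")
```

Failing loudly here keeps a recall query from quietly scanning whatever directory the harness happened to start in.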

Caching

Claude remembers your preferences (cluster, executor directory, environment, resources) across conversations via Claude Code memory. The .hpc/runs/<run_id>.json sidecars (paired with .hpc/tasks.py) serve as the submission record for monitoring and resubmission.

Job Templates

Template	SGE	SLURM
CPU array	`hpc_agent/mapreduce/templates/runtime/sge/cpu_array.sh`	`hpc_agent/mapreduce/templates/runtime/slurm/cpu_array.slurm`
GPU array	`hpc_agent/mapreduce/templates/runtime/sge/gpu_array.sh`	`hpc_agent/mapreduce/templates/runtime/slurm/gpu_array.slurm`

Templates are parameterized via environment variables injected at submission time. Resolve paths via hpc_agent.get_template_path(scheduler, template). The GPU template is used when the configured resources include gpus; otherwise the CPU template is used.
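The CPU/GPU selection rule can be illustrated as follows. The relative paths are copied from the table above; the `pick_template` helper and the assumption that GPU requests appear under a truthy "gpus" key are illustrative, and hpc_agent.get_template_path remains the supported way to resolve real paths.

```python
# Paths relative to hpc_agent/mapreduce/templates/runtime/, from the table above.
TEMPLATES = {
    ("sge", "cpu"): "sge/cpu_array.sh",
    ("sge", "gpu"): "sge/gpu_array.sh",
    ("slurm", "cpu"): "slurm/cpu_array.slurm",
    ("slurm", "gpu"): "slurm/gpu_array.slurm",
}


def pick_template(scheduler, resources):
    """Choose the GPU template when resources request GPUs, else the CPU one."""
    # Assumption: GPU requests appear under a truthy "gpus" key.
    kind = "gpu" if resources.get("gpus") else "cpu"
    try:
        return TEMPLATES[(scheduler.lower(), kind)]
    except KeyError:
        raise ValueError(f"unsupported scheduler: {scheduler!r}") from None
```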

Supported Clusters

Cluster	Institution	Scheduler
Hoffman2	UCLA IDRE	SGE
Discovery	USC CARC	SLURM

Cluster connection details are in hpc_agent/config/clusters.yaml (or whatever HPC_CLUSTERS_CONFIG points at).

Python API

from hpc_agent import (
    # Framework subdirectory layout
    framework_subdir, runs_subdir, tasks_path, load_tasks_module,
    # Per-run sidecars
    compute_cmd_sha, write_run_sidecar, read_run_sidecar,
    find_run_by_cmd_sha, find_existing_runs,
    # Cluster config
    load_clusters_config, get_template_path, _PACKAGE_ROOT,
    # Submission
    ClusterConstraints, parse_constraints,
    WorkloadSpec, compute_submission_plan, build_wave_map,
    deploy_runtime, run_combiner_checked,
)
from hpc_agent.infra.backends import get_backend

Development

pip install -e '.[dev]'
pre-commit install        # auto-runs ruff, frontmatter regen, index regen
pytest -q                 # 1400+ tests

The pre-commit hook regenerates docs/primitives/*.md frontmatter, docs/primitives/README.md catalog, and docs/generated/operations.md from the @primitive registry, then auto-stages the result. Without it you'll see CI fail on the corresponding --check gates and have to push a follow-up chore: regenerate ... commit.

Release history

This version

0.3.0

May 21, 2026

File details

Details for the file hpc_agent-0.3.0.tar.gz.

File metadata

Download URL: hpc_agent-0.3.0.tar.gz
Upload date: May 21, 2026
Size: 465.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.12

File hashes

Hashes for hpc_agent-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`2a783b0493e0e61ffa3affec180cd1797979b4f98d09fc1a2e77a6f68807467b`
MD5	`3bce575d1af9d9350828a1346516496e`
BLAKE2b-256	`a84b13a18a5e69a196b5db2b4fddf68aa027f7c971a2a7f22946b0c1986a09ae`


File details

Details for the file hpc_agent-0.3.0-py3-none-any.whl.

File metadata

Download URL: hpc_agent-0.3.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 615.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.12

File hashes

Hashes for hpc_agent-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c8ef493d2f1a6f14fd8255e32a39046bbcf0b23894e250cc0a6a607d33b4afc`
MD5	`0aaf48e96401d856772301376fcd1b41`
BLAKE2b-256	`0a10bdd564835dd4551f0c93675b98468dbd46c16f2e2637cbb022915347694e`
