A command-line tool for distributed parallel execution across multiple GPUs

Maintainers

yuanhaobo


🐙 OctoRun

Distributed Parallel Execution Made Simple

Run Python scripts across multiple GPUs and multiple machines with chunk-based work distribution, automatic failure recovery, and live job monitoring.

Overview

OctoRun launches one Python worker per GPU, splits the work into total_chunks shards by stride (not by file), and coordinates concurrent runs across machines via lock files on a shared filesystem. There is no central scheduler — each worker picks the next free chunk, refreshes a heartbeat while running, and writes a completion marker on success. Locks left behind by crashed workers are automatically reclaimed.
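
As a rough illustration of the stride split described above, the sketch below shows how a sorted key list might be divided so that every key lands in exactly one chunk. The function name `stride_shard` and the sample keys are hypothetical and are not part of OctoRun.

```python
def stride_shard(items, chunk_id, total_chunks):
    """Return the items assigned to one chunk under stride sharding."""
    if not 0 <= chunk_id < total_chunks:
        raise ValueError("chunk_id must be in [0, total_chunks)")
    # Chunk c takes items c, c + total_chunks, c + 2 * total_chunks, ...
    return items[chunk_id::total_chunks]


keys = [f"sample_{i:02d}" for i in range(10)]          # hypothetical input keys
shards = [stride_shard(keys, c, 4) for c in range(4)]
print(shards[1])  # ['sample_01', 'sample_05', 'sample_09']
```

Because the split depends only on the position in the sorted list, total_chunks does not need to match the number of input files.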

Features

- Auto GPU detection: one worker per available CUDA device, or supply an explicit slot list (CPU works too).
- Stride-based chunking: total_chunks is independent of input sharding; any value works.
- Crash-safe locks: heartbeat-refreshed locks; stale ones (including 0-byte locks left by mid-write deaths) auto-reclaim after 5 min.
- Multi-node by default: point N machines at the same shared log_dir and they cooperate without coordination.
- Live status: octorun status <log_dir> prints active sessions, completed chunks, and stale locks.

What's new in 1.3.0

- Per-chunk log writes use a single fd. start_process previously opened chunk_<id>.log twice: once for the header, once as the subprocess's stdout. On HDFS-fuse the second open could stall on a write lease held by the first writer, and the header could land after the child's first output. The header is now written and flush()ed on one fd that is then handed to Popen, with the parent closing its end immediately (the child keeps its inherited fd).
- Tidier config reads. cmd_run and cmd_save_config now use with open(...) as f: json.load(f) so config-file fds release on the spot instead of waiting for GC.
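
The single-fd pattern from the first item can be sketched roughly as follows. This is an illustration under assumptions, not OctoRun's actual start_process: the helper name `start_with_header`, the header text, and the child command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def start_with_header(log_path, argv, header):
    """Write a header, then hand the same open file to the child as stdout."""
    log = open(log_path, "w")
    try:
        log.write(header + "\n")
        log.flush()  # the header reaches the file before the child writes
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
    finally:
        log.close()  # parent closes its end; the child keeps its inherited fd
    return proc


log_path = Path(tempfile.mkdtemp()) / "chunk_0.log"
proc = start_with_header(
    log_path,
    [sys.executable, "-c", "print('child output')"],  # stand-in for a worker
    "=== chunk 0 ===",                                 # hypothetical header
)
proc.wait()
lines = log_path.read_text().splitlines()
print(lines)
```

The file is opened only once, so there is no second open that could wait on the first, and the child's output always follows the header.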

Install

uv tool install octorun                  # recommended — global CLI
uv add octorun                           # in a project venv
pip install octorun                      # via pip
pip install "octorun[benchmark]"         # GPU benchmark extras

Quick start

octorun save_config --script ./worker.py    # write a default config.json
octorun run --config config.json            # launch on all detected GPUs
octorun status ./logs                       # check progress (any machine)

CLI

Command	Purpose
`octorun run --config <cfg> [--kwargs '<json>']`	Launch workers; CLI `--kwargs` overrides config kwargs.
`octorun status <log_dir> [--alive-threshold N] [--no-clean]`	Live job summary; reaps stale locks before reporting unless `--no-clean`.
`octorun save_config [--script <path>]`	Write a default `config.json` here.
`octorun list_gpus [--detailed]`	Show detected GPUs (with usage and processes when `-d`).
`octorun benchmark [--gpus auto] [--test-duration s] [--interval s]`	Continuous TFLOPS / bandwidth probe. Requires `[benchmark]` extra.
`octorun install-skill [--dest <dir>] [--force]`	Install the bundled Claude Code skill into `~/.claude/skills/octorun`.

Status output

OctoRun Job Status — ./logs/stage3
──────────────────────────────────────────────────────────────
  Locks total  : 164
  Completed    : 29
  Active       : 48  (6 sessions)
  Stale locks  : 87  (cleaned 87 this run)
──────────────────────────────────────────────────────────────

Active sessions (6):
  node1               8 chunks  [25, 26, 28, 32, ...]  (heartbeat 43s ago)
  ...
Dead sessions — stale locks (4 node(s)):
  node3               8 chunks  [161, 162, ...]  (last seen 1h31m ago)

--alive-threshold (default 300 s) controls how silent a session can be before its locks are flagged stale.
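
A minimal sketch of the threshold rule, assuming a session counts as dead once its last heartbeat is more than the threshold in the past. The function `classify_sessions` and the exact boundary handling are assumptions, not OctoRun's code.

```python
def classify_sessions(heartbeat_age_s, alive_threshold=300):
    """Split hosts into active and dead by seconds since their last heartbeat."""
    active = sorted(h for h, age in heartbeat_age_s.items() if age <= alive_threshold)
    dead = sorted(h for h, age in heartbeat_age_s.items() if age > alive_threshold)
    return active, dead


# Ages taken from the sample output: 43 s and 1 h 31 m (5460 s).
active, dead = classify_sessions({"node1": 43, "node3": 91 * 60})
print(active, dead)  # ['node1'] ['node3']
```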

Configuration

{
    "script_path": "./worker.py",
    "gpus": "auto",
    "total_chunks": 1000,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": true,
    "max_retries": 2,
    "success_codes": [0],
    "kwargs": {
        "input_dir": "/data/input",
        "output_dir": "/data/output",
        "batch_size": 32
    }
}

Option	Description	Default
`script_path`	Worker script (absolute or relative to CWD)	—
`gpus`	`"auto"` or explicit list (`[0,1,2,3]`). On CPU-only nodes use a slot list, e.g. `[0,1]`.	`"auto"`
`total_chunks`	Number of stride shards. Independent of input file count.	`128`
`log_dir`	Session and chunk logs (must be shared FS for multi-node)	`"./logs"`
`chunk_lock_dir`	Lock files; must be on a POSIX FS (see Locks)	`"./logs/locks"`
`monitor_interval`	Seconds between worker health checks	`60`
`restart_failed`	Retry chunks whose worker exits with a non-success code	`false`
`max_retries`	Per-chunk retry budget	`3`
`success_codes`	Worker exit codes treated as success	`[0]`
`kwargs`	Forwarded to the worker script as CLI flags	`{}`
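
To show how the defaults in the table relate to a partial config file, here is a small sketch that fills in missing options. Whether OctoRun itself merges defaults this way is an assumption; `DEFAULTS` and `resolve_config` are hypothetical names, and the default values are copied from the table.

```python
import json

# Defaults copied from the options table; script_path has none and must be given.
DEFAULTS = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "success_codes": [0],
    "kwargs": {},
}


def resolve_config(text):
    """Parse a JSON config and fill in any option the user left out."""
    user = json.loads(text)
    if "script_path" not in user:
        raise ValueError("script_path is required")
    return {**DEFAULTS, **user}  # user values win over defaults


cfg = resolve_config('{"script_path": "./worker.py", "total_chunks": 1000}')
print(cfg["total_chunks"], cfg["max_retries"])  # 1000 3
```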

`success_codes`

Default [0]. Add codes here for known-benign non-zero exits — e.g. CPython interpreter shutdown errors (often 120) on slow networked filesystems where the chunk's data is already written:

"success_codes": [0, 120]

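One way to picture how success_codes, restart_failed, and max_retries could interact is the decision function below. It is a sketch under assumptions: the name `should_retry` and the exact counting of retries are hypothetical, and OctoRun's real logic may differ.

```python
def should_retry(returncode, retries_used, *, success_codes=(0,),
                 restart_failed=False, max_retries=3):
    """Decide whether a finished chunk should be run again."""
    if returncode in success_codes:
        return False  # treated as success, nothing to retry
    # Assumption: a failed chunk is retried only while budget remains.
    return restart_failed and retries_used < max_retries


print(should_retry(120, 0, success_codes=(0, 120), restart_failed=True))  # False
print(should_retry(1, 0, restart_failed=True))                            # True
print(should_retry(1, 3, restart_failed=True))                            # False
```
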
Writing a worker

Workers receive --gpu_id, --chunk_id, --total_chunks, plus everything in kwargs as flags.
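
For illustration, the command line a worker receives might be assembled along these lines. The `--key value` form for kwargs is an assumption inferred from the template's argparse flags, and `build_worker_argv` is a hypothetical helper; how OctoRun renders booleans or lists is not covered here.

```python
import sys


def build_worker_argv(script_path, gpu_id, chunk_id, total_chunks, kwargs):
    """Build the argv for one worker: fixed flags first, then kwargs as flags."""
    argv = [
        sys.executable, script_path,
        "--gpu_id", str(gpu_id),
        "--chunk_id", str(chunk_id),
        "--total_chunks", str(total_chunks),
    ]
    for key, value in kwargs.items():
        argv += [f"--{key}", str(value)]  # assumed rendering: --key value
    return argv


argv = build_worker_argv("./worker.py", 0, 7, 1000,
                         {"input_dir": "/data/input", "batch_size": 32})
print(argv[2:])
```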

`--gpu_id` is a local rank, not a device index

OctoRun does not set CUDA_VISIBLE_DEVICES. --gpu_id is the worker's index on the current machine (a local_rank). Two patterns:

# A: direct device index (fine when no other processes share the GPUs)
import torch
torch.cuda.set_device(args.gpu_id)   # args comes from your own argparse parser

# B: pin the device by setting CUDA_VISIBLE_DEVICES BEFORE importing torch
import os, argparse
_p = argparse.ArgumentParser(add_help=False)
_p.add_argument("--gpu_id", type=int, default=0)
_early, _ = _p.parse_known_args()    # ignore flags this early parser does not know
os.environ["CUDA_VISIBLE_DEVICES"] = str(_early.gpu_id)
import torch                         # torch now sees only the pinned GPU, as cuda:0
model = model.to("cuda:0")           # model is whatever your worker has loaded

Pattern B is recommended for transformers / large-model loaders, which often initialize CUDA at import time.

Minimal template

import argparse

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--gpu_id",       type=int, required=True)
    p.add_argument("--chunk_id",     type=int, required=True)
    p.add_argument("--total_chunks", type=int, required=True)
    p.add_argument("--input_dir",    type=str, required=True)
    p.add_argument("--output_dir",   type=str, required=True)
    p.add_argument("--batch_size",   type=int, default=32)
    return p.parse_args()

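# load_all_keys, get_output_path, process, and write_atomic are placeholders
# for functions you define yourself.
def main():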
    args = parse_args()
    all_keys = sorted(load_all_keys(args.input_dir))
    my_keys  = all_keys[args.chunk_id::args.total_chunks]   # stride shard

    output_path = get_output_path(args.output_dir, args.chunk_id, args.total_chunks)
    if output_path.exists():
        return                                              # idempotent skip

    results = process(my_keys, gpu=args.gpu_id, batch_size=args.batch_size)
    write_atomic(results, output_path)                      # tmp + rename

if __name__ == "__main__":
    main()
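
The template above leaves get_output_path and write_atomic undefined. A minimal sketch of the "tmp + rename" idea follows, assuming JSON output and a hypothetical file naming scheme; neither is prescribed by OctoRun.

```python
import json
import os
import tempfile
from pathlib import Path


def get_output_path(output_dir, chunk_id, total_chunks):
    """One output file per chunk (hypothetical naming scheme)."""
    return Path(output_dir) / f"chunk_{chunk_id:05d}_of_{total_chunks:05d}.json"


def write_atomic(results, output_path):
    """Write to a temp file in the same directory, then rename into place."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(results, f)
            f.flush()
            os.fsync(f.fileno())           # push the data to disk before the rename
        os.replace(tmp_name, output_path)  # readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)            # clean up the temp file on failure
        raise


out = get_output_path(tempfile.mkdtemp(), 3, 128)
write_atomic({"n": 1}, out)
print(out.name)  # chunk_00003_of_00128.json
```

Keeping the temp file in the destination directory matters because a rename is only atomic within one filesystem.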

Multi-machine usage

Point each machine at the same shared log_dir and chunk_lock_dir:

# machine A
octorun run --config config.json
# machine B (in parallel)
octorun run --config config.json
# any machine
octorun status ./logs

Lock files prevent any chunk from being processed twice. Workers write a completed/chunk_<id>.completed marker on success, which is honored across all machines.

How locks work

Each in-progress chunk owns a 3-line lock file under chunk_lock_dir:

<pid>
<iso-8601 timestamp>     # refreshed every monitor_interval
HEARTBEAT

A lock is considered stale (and reclaimable) when any of the following holds:

- its HEARTBEAT timestamp is older than _STALE_TIMEOUT_SECONDS (5 min);
- it is 0 bytes and its mtime is older than the same threshold (such locks form when a process dies between O_CREAT|O_EXCL and the subsequent write, or when a network FS swallows the write);
- it is malformed or in the pre-1.0.0 format (no HEARTBEAT marker); since 1.2.0 these are reaped on sight, because no live worker can refresh them.
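
The three rules above can be sketched as a pure function over a lock file's text and mtime. This is an approximation under assumptions: `is_stale` is a hypothetical name, and the timestamp is assumed to parse with `datetime.fromisoformat` (a naive timestamp would be read as local time).

```python
from datetime import datetime, timezone

STALE_TIMEOUT_SECONDS = 300  # 5 minutes, per the text above


def is_stale(lock_text, mtime, now, timeout=STALE_TIMEOUT_SECONDS):
    """Apply the three staleness rules; mtime and now are epoch seconds."""
    if lock_text == "":
        return now - mtime > timeout  # 0-byte lock: fall back to the mtime
    lines = lock_text.splitlines()
    if len(lines) != 3 or lines[2].strip() != "HEARTBEAT":
        return True  # malformed or old format: reaped on sight
    try:
        heartbeat = datetime.fromisoformat(lines[1].strip()).timestamp()
    except ValueError:
        return True  # an unparseable timestamp counts as malformed here
    return now - heartbeat > timeout


now = 1_800_000_000.0  # fixed clock so the example is reproducible


def stamp(age):
    """ISO-8601 UTC timestamp for a heartbeat written `age` seconds ago."""
    return datetime.fromtimestamp(now - age, tz=timezone.utc).isoformat()


fresh = f"1234\n{stamp(43)}\nHEARTBEAT\n"
old = f"1234\n{stamp(5460)}\nHEARTBEAT\n"
print(is_stale(fresh, now, now), is_stale(old, now, now))  # False True
```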

Stale locks are reclaimed lazily on the next acquire_lock, and eagerly by octorun status (which sweeps the lock dir before reporting; pass --no-clean to disable).

Filesystem requirements

Locking relies on O_CREAT | O_EXCL having atomic, cross-client semantics. Use a real POSIX filesystem (local disk, NFS, CephFS).
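
As a rough sketch of why O_CREAT | O_EXCL matters, the snippet below tries to create the same lock twice; only the first attempt succeeds. The helper `try_acquire` and the lock file name are hypothetical, and the three-line body simply mirrors the format described earlier.

```python
import os
import tempfile
from datetime import datetime, timezone


def try_acquire(lock_path):
    """Return True if this call created the lock, False if it already existed."""
    try:
        # O_EXCL makes creation fail if the file exists, so there is one winner.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        timestamp = datetime.now(timezone.utc).isoformat()
        f.write(f"{os.getpid()}\n{timestamp}\nHEARTBEAT\n")
    return True


lock_path = os.path.join(tempfile.mkdtemp(), "chunk_7.lock")  # hypothetical name
first = try_acquire(lock_path)
second = try_acquire(lock_path)
print(first, second)  # True False
```

On a filesystem without atomic exclusive create, both calls could succeed from different clients, which is the failure mode the next paragraph warns about.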

Do not place chunk_lock_dir on object-storage-backed filesystems like HDFS-fuse or s3fs — they fall back to non-atomic stat-then-create sequences and cache metadata client-side, so two workers can claim the same lock. Output data may live on HDFS / S3, but locks must not. If you must run that way, treat the lock as best-effort and write outputs idempotently (tmp + rename).

Claude Code skill

Install the bundled skill so an LLM agent in Claude Code can answer OctoRun questions with up-to-date semantics:

octorun install-skill          # installs to ~/.claude/skills/octorun
octorun install-skill --force  # overwrite an existing copy

The skill records the version of octorun it shipped with — re-run install-skill after upgrading.

Contributing

Fork, branch, PR.

License

MIT.


Release history

Version	Release date
1.4.0 (this version)	Jun 1, 2026
1.3.0	May 11, 2026
1.2.0	Apr 30, 2026
1.1.0	Apr 30, 2026
1.0.2	Apr 28, 2026
1.0.1	Apr 28, 2026
1.0.0	Apr 10, 2026
0.3.0	Mar 30, 2026
0.2.1	Oct 25, 2025
0.2.0	Oct 6, 2025
0.1.2	Jul 6, 2025
0.1.1.post1	Jul 6, 2025
0.1.0	Jul 6, 2025

Download files

Source Distribution

octorun-1.4.0.tar.gz (54.9 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

octorun-1.4.0-py3-none-any.whl (31.0 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file octorun-1.4.0.tar.gz.

File metadata

Download URL: octorun-1.4.0.tar.gz
Upload date: Jun 1, 2026
Size: 54.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.4.0.tar.gz
Algorithm	Hash digest
SHA256	`831ef36f8d14c063d3274db57a44e1fc2f70d4a444f85db713ec88585f073229`
MD5	`b70f6c4e0cc159d09fae45eb7be1ddea`
BLAKE2b-256	`d6b946561da90c4d850ea0f25de92dac783e1ed76db90d79e075c19cea6f9b59`

Provenance

The following attestation bundles were made for octorun-1.4.0.tar.gz:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octorun-1.4.0.tar.gz
- Subject digest: 831ef36f8d14c063d3274db57a44e1fc2f70d4a444f85db713ec88585f073229
- Sigstore transparency entry: 1697378069
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: HarborYuan/OctoRun@b706ec5bd164a3537b75bf31a174e34a1cc29de2
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/HarborYuan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b706ec5bd164a3537b75bf31a174e34a1cc29de2
- Trigger Event: release

File details

Details for the file octorun-1.4.0-py3-none-any.whl.

File metadata

Download URL: octorun-1.4.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 31.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1836ffcf63ece1163b37c3f8fbf024f5818623037c9324ee2769daa3c3c13673`
MD5	`ffced75971e51f2738243ccbd7e4bb49`
BLAKE2b-256	`af6723026c9cb26866bb874537f6faaf32450a2b5d28ab3939ce42f0fe35dc57`

Provenance

The following attestation bundles were made for octorun-1.4.0-py3-none-any.whl:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octorun-1.4.0-py3-none-any.whl
- Subject digest: 1836ffcf63ece1163b37c3f8fbf024f5818623037c9324ee2769daa3c3c13673
- Sigstore transparency entry: 1697378203
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: HarborYuan/OctoRun@b706ec5bd164a3537b75bf31a174e34a1cc29de2
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/HarborYuan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b706ec5bd164a3537b75bf31a174e34a1cc29de2
- Trigger Event: release
