A command-line tool for distributed parallel execution across multiple GPUs

Maintainers

yuanhaobo

🐙 OctoRun

Distributed Parallel Execution Made Simple

Run Python scripts across multiple GPUs with intelligent chunk management, failure recovery, and live job monitoring

Overview

OctoRun dispatches one Python worker process per GPU, divides your dataset into chunks, and coordinates work across multiple machines through shared lock files. Each worker independently claims a chunk, processes it, and writes its output atomically — making it safe to run OctoRun on many nodes simultaneously without any central coordinator.

Key Features

Automatic GPU Detection — detects and allocates all available GPUs
Chunk-Based Work Distribution — divides keys by stride, not by file, so any total_chunks value works regardless of input sharding
Failure Recovery — stale locks (heartbeat timeout 5 min) are automatically reclaimed; configurable retries
Multi-Machine Safe — multiple OctoRun instances on different nodes share the same lock directory on a shared filesystem
Live Job Monitoring — octorun status shows active sessions, completed chunks, and stale locks at a glance
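
The stride-based split can be illustrated with a few lines of Python. This is a minimal sketch rather than OctoRun's own code: `shard_by_stride` and the sample keys are hypothetical names, and the slicing simply mirrors the `all_keys[chunk_id::total_chunks]` pattern used in the worker template.

```python
def shard_by_stride(keys, chunk_id, total_chunks):
    """Return the keys assigned to one chunk using stride slicing."""
    # Sorting first gives every worker the same ordering of keys.
    return sorted(keys)[chunk_id::total_chunks]


# Hypothetical dataset keys; in practice these would come from your input files.
keys = [f"sample_{i:03d}" for i in range(10)]
total_chunks = 4
shards = [shard_by_stride(keys, c, total_chunks) for c in range(total_chunks)]

# Every key lands in exactly one shard, even though 10 is not divisible by 4.
assert sorted(k for shard in shards for k in shard) == sorted(keys)
print(shards[1])  # ['sample_001', 'sample_005', 'sample_009']
```

Because the split works on the sorted key list rather than on files, the number of chunks does not have to match the number of input shards.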

Installation

# Via uv (recommended)
uv tool install octorun

# In a project venv
uv add octorun

# Via pip
pip install octorun

Optional extras:

# GPU benchmark tooling (requires PyTorch)
pip install "octorun[benchmark]"

Quick Start

# 1. Generate a default config
octorun save_config --script ./your_script.py

# 2. Run across all available GPUs
octorun run --config config.json

# 3. Monitor progress from any machine
octorun status ./logs

Commands

`run` (`r`)

Launch workers across all available GPUs.

octorun run --config config.json
octorun run --config config.json --kwargs '{"batch_size": 64}'

Values passed with --kwargs on the command line override kwargs values from the config file.
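
One way to picture the override is a key-by-key merge. The sketch below is an assumption about the behaviour, not OctoRun's implementation: `merge_kwargs` is a hypothetical helper, and whether OctoRun merges per key or replaces the whole kwargs object is not stated here.

```python
import json


def merge_kwargs(config_kwargs, cli_kwargs_json=None):
    """Merge config-file kwargs with a --kwargs JSON string; CLI values win."""
    merged = dict(config_kwargs)  # copy so the config dict is left untouched
    if cli_kwargs_json:
        merged.update(json.loads(cli_kwargs_json))
    return merged


config_kwargs = {"input_dir": "/data/input", "batch_size": 32}
print(merge_kwargs(config_kwargs, '{"batch_size": 64}'))
# {'input_dir': '/data/input', 'batch_size': 64}
```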

`status` (`st`)

Show a live summary of a running or completed job.

octorun status <log_dir>
octorun status <log_dir> --alive-threshold 120

<log_dir> is the directory set as log_dir in your config. It contains the *_session_*.log files and the locks/ subdirectory.

Example output:

OctoRun Job Status — ./logs/stage3
──────────────────────────────────────────────────────────────
  Locks total  : 164
  Completed    : 29
  Active       : 48  (6 sessions)
  Stale locks  : 87  (dead workers, will be auto-reclaimed)
──────────────────────────────────────────────────────────────

Active sessions (6):
  node1               8 chunks  [25, 26, 28, 32, ...]  (heartbeat 43s ago)
  node2               8 chunks  [16, 18, 20, 21, ...]  (heartbeat 47s ago)
  ...

Dead sessions — stale locks (4 node(s)):
  node3               8 chunks  [161, 162, ...]  (last seen 1h31m ago)
  ...

Field	Meaning
Locks total	Lock files currently on disk (active + stale)
Completed	Chunks that finished successfully (persistent `.completed` markers)
Active	Chunks held by workers with a live heartbeat
Stale locks	Locks from crashed/killed workers — reclaimed automatically when the next chunk finishes on any active node

--alive-threshold (default 300 s) sets how long a session can be silent before it is considered dead.
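
The alive/dead decision amounts to comparing the age of the last heartbeat with the threshold. The following is a minimal sketch under that assumption; `is_session_alive` is a hypothetical helper, and how OctoRun actually stores heartbeat timestamps is not shown here.

```python
import time


def is_session_alive(last_heartbeat, alive_threshold=300.0, now=None):
    """Return True if the last heartbeat is no older than alive_threshold seconds."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= alive_threshold


now = time.time()
print(is_session_alive(now - 43, now=now))                         # True
print(is_session_alive(now - 5460, now=now))                       # False (1h31m ago)
print(is_session_alive(now - 150, alive_threshold=120, now=now))   # False
```

Lowering the threshold flags dead sessions sooner, at the cost of misclassifying workers whose heartbeats are delayed by a slow filesystem.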

`save_config` (`s`)

Write a default config.json to the current directory.

octorun save_config --script ./your_script.py

`list_gpus` (`l`)

List available GPUs.

octorun list_gpus
octorun list_gpus --detailed

`benchmark` (`b`)

Continuously measure GPU TFLOPs and communication bandwidth.

octorun benchmark
octorun benchmark --gpus 0,1,2 --test-duration 10 --interval 30

Requires octorun[benchmark].

Configuration

Generate a starter config with octorun save_config, then edit as needed:

{
    "script_path": "your_script.py",
    "gpus": "auto",
    "total_chunks": 1000,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": true,
    "max_retries": 2,
    "kwargs": {
        "input_dir": "/data/input",
        "output_dir": "/data/output",
        "batch_size": 32
    }
}

Option	Description	Default
`script_path`	Path to your Python worker script	—
`gpus`	`"auto"` or a list of GPU IDs	`"auto"`
`total_chunks`	Total number of chunks to divide work into	`128`
`log_dir`	Directory for session and chunk logs	`"./logs"`
`chunk_lock_dir`	Directory for lock files	`"./logs/locks"`
`monitor_interval`	Seconds between process-status checks	`60`
`restart_failed`	Retry failed chunks	`false`
`max_retries`	Maximum retries per chunk	`3`
`success_codes`	Worker exit codes treated as success	`[0]`
`kwargs`	Extra arguments forwarded to your script	`{}`
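
To see how the options relate to what a worker receives, the sketch below assembles a worker command line from a config. It is an illustration under assumptions, not OctoRun's launcher: `build_worker_command` is a hypothetical name, and the mapping of each kwargs entry to a `--name value` pair is inferred from the worker template, which reads `--input_dir`, `--output_dir`, and `--batch_size`.

```python
import sys


def build_worker_command(script_path, gpu_id, chunk_id, total_chunks, kwargs):
    """Build the argv for one worker process (hypothetical layout)."""
    cmd = [
        sys.executable, script_path,
        "--gpu_id", str(gpu_id),
        "--chunk_id", str(chunk_id),
        "--total_chunks", str(total_chunks),
    ]
    # Assumed mapping: each kwargs entry becomes a "--name value" pair.
    for name, value in kwargs.items():
        cmd += [f"--{name}", str(value)]
    return cmd


config = {
    "script_path": "your_script.py",
    "total_chunks": 1000,
    "kwargs": {"input_dir": "/data/input", "batch_size": 32},
}
cmd = build_worker_command(
    config["script_path"], gpu_id=0, chunk_id=17,
    total_chunks=config["total_chunks"], kwargs=config["kwargs"],
)
print(" ".join(cmd[1:]))
```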

`success_codes`

By default a chunk is considered complete only when the worker exits with code 0. Add further codes here when your worker has a known-benign non-zero exit. For example, a worker can call sys.exit(0) and still exit with code 120 if CPython's interpreter shutdown phase fails, typically because of atexit or stdout-flush errors on slow networked filesystems. In that case the chunk's data is already written, so you can safely treat 120 as success:

"success_codes": [0, 120]

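The check itself is a simple membership test on the worker's exit code. The sketch below simulates a worker that exits with 120 and evaluates it against two `success_codes` settings; `chunk_succeeded` is a hypothetical helper, not part of OctoRun's API.

```python
import subprocess
import sys


def chunk_succeeded(returncode, success_codes=(0,)):
    """Return True if the worker's exit code counts as success."""
    return returncode in success_codes


# Simulate a worker process that exits with code 120.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(120)"])

print(chunk_succeeded(proc.returncode))            # False with the default [0]
print(chunk_succeeded(proc.returncode, (0, 120)))  # True once 120 is allowed
```

Only add a code when you are sure the chunk's output is complete at the point the worker exits with it.
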
Writing a Worker Script

`--gpu_id` is a local rank, not a device index

OctoRun does not set CUDA_VISIBLE_DEVICES or any other GPU environment variable. --gpu_id is simply the index of this worker among the workers launched on the current machine — i.e. a local_rank. If you launch with gpus: [0, 1, 2, 3], four workers start with --gpu_id 0, 1, 2, 3 respectively, but OctoRun leaves device selection entirely to your script.

Common patterns:

# Pattern A — direct device index (works when no other processes use the GPUs)
import torch
torch.cuda.set_device(args.gpu_id)

# Pattern B — set CUDA_VISIBLE_DEVICES before any CUDA init so the process
# only sees one device and always addresses it as cuda:0.
# Do this at the very top of the script, before importing torch/transformers.
import os, argparse
_p = argparse.ArgumentParser(add_help=False)
_p.add_argument("--gpu_id", type=int, default=0)
_early, _ = _p.parse_known_args()
os.environ["CUDA_VISIBLE_DEVICES"] = str(_early.gpu_id)
# Now import torch — it will only see one GPU
import torch
device = torch.device("cuda:0")  # the only visible GPU; move your model with model.to(device)

Pattern B is strongly recommended when loading large models (transformers, etc.) because many libraries initialize CUDA at import time and will claim the wrong device if CUDA_VISIBLE_DEVICES is not set beforehand.

Minimal script template

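Your script must accept three OctoRun arguments plus any custom ones. In the template below, load_all_keys, get_output_path, process, and write_atomic are placeholders for functions you implement yourself: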

import argparse

def parse_args():
    p = argparse.ArgumentParser()
    # Required by OctoRun — gpu_id is a local_rank, not a device index
    p.add_argument("--gpu_id",       type=int, required=True)
    p.add_argument("--chunk_id",     type=int, required=True)
    p.add_argument("--total_chunks", type=int, required=True)
    # Your own arguments (forwarded via kwargs)
    p.add_argument("--input_dir",    type=str, required=True)
    p.add_argument("--output_dir",   type=str, required=True)
    p.add_argument("--batch_size",   type=int, default=32)
    return p.parse_args()

def main():
    args = parse_args()

    # Shard your dataset by stride
    all_keys = sorted(load_all_keys(args.input_dir))
    my_keys  = all_keys[args.chunk_id::args.total_chunks]

    # Skip if already done (idempotent)
    output_path = get_output_path(args.output_dir, args.chunk_id, args.total_chunks)
    if output_path.exists():
        return

    # Process and write atomically
    results = process(my_keys, gpu=args.gpu_id, batch_size=args.batch_size)
    write_atomic(results, output_path)

if __name__ == "__main__":
    main()
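
The template leaves `get_output_path` and `write_atomic` undefined. A possible implementation is sketched below, assuming JSON-serializable results and a hypothetical file-naming scheme. It writes to a temporary file in the destination directory and then renames it, so a reader never sees a half-written output file.

```python
import json
import os
import tempfile
from pathlib import Path


def get_output_path(output_dir, chunk_id, total_chunks):
    """Hypothetical naming scheme: one JSON file per chunk."""
    return Path(output_dir) / f"chunk_{chunk_id:05d}_of_{total_chunks:05d}.json"


def write_atomic(results, output_path):
    """Write results to a temp file in the same directory, then rename into place."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(results, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before renaming
        os.replace(tmp_name, output_path)  # atomic within a single filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as out_dir:
    path = get_output_path(out_dir, chunk_id=3, total_chunks=128)
    write_atomic({"keys": ["a", "b"]}, path)
    print(path.name, path.exists())
```

Creating the temporary file next to the final path matters: `os.replace` is only atomic when source and destination are on the same filesystem.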

Multi-Machine Usage

Run OctoRun on each machine, pointing at the same shared log_dir and chunk_lock_dir:

# machine A
octorun run --config config.json   # claims chunks 0, 1, 2, ...

# machine B (simultaneously)
octorun run --config config.json   # claims the next available chunks

Lock files on the shared filesystem prevent any chunk from being processed twice. Monitor all machines from anywhere:

octorun status ./logs

Shared filesystem requirements

chunk_lock_dir relies on O_CREAT | O_EXCL having atomic, cross-client semantics. Use a POSIX-compliant filesystem (local disk, NFS, CephFS, etc.).
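
The primitive in question looks like this in Python. This is a minimal sketch of exclusive-create locking, not OctoRun's lock format: `try_claim_chunk`, the lock file name, and the owner string written into the file are all hypothetical.

```python
import os
import socket
import tempfile


def try_claim_chunk(lock_dir, chunk_id):
    """Try to create the lock file exclusively; return True if this caller won."""
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, f"chunk_{chunk_id}.lock")
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        # Record who holds the lock (illustrative content only).
        f.write(f"{socket.gethostname()}:{os.getpid()}\n")
    return True


with tempfile.TemporaryDirectory() as lock_dir:
    print(try_claim_chunk(lock_dir, 7))  # True: first claim creates the file
    print(try_claim_chunk(lock_dir, 7))  # False: the lock already exists
```

The guarantee holds only if the filesystem makes the existence check and the create a single atomic step for all clients, which is why the choice of filesystem matters.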

Do not place chunk_lock_dir on object-storage-backed filesystems such as HDFS-fuse or s3fs — they typically fall back to non-atomic "stat-then-create" sequences and have client-side metadata caching, both of which let two workers acquire the same lock. Output data can still live on HDFS / S3, but locking requires a real filesystem. If you must run on such storage, ensure your worker writes outputs idempotently (tmp + rename) and treat the lock as best-effort rather than a correctness guarantee.

Contributing

Fork the repository, create a feature branch, and open a pull request.

License

MIT License.

Release history

Version	Release date
1.4.0	Jun 1, 2026
1.3.0	May 11, 2026
1.2.0	Apr 30, 2026
1.1.0	Apr 30, 2026
1.0.2	Apr 28, 2026
1.0.1 (this version)	Apr 28, 2026
1.0.0	Apr 10, 2026
0.3.0	Mar 30, 2026
0.2.1	Oct 25, 2025
0.2.0	Oct 6, 2025
0.1.2	Jul 6, 2025
0.1.1.post1	Jul 6, 2025
0.1.0	Jul 6, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

octorun-1.0.1.tar.gz (49.6 kB view details)

Uploaded Apr 28, 2026 Source

Built Distribution


octorun-1.0.1-py3-none-any.whl (25.4 kB view details)

Uploaded Apr 28, 2026 Python 3

File details

Details for the file octorun-1.0.1.tar.gz.

File metadata

Download URL: octorun-1.0.1.tar.gz
Upload date: Apr 28, 2026
Size: 49.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`a4fc0d6ff2947ab976b552a485d276ccf6ed09ce330aab3ea76f64b99984879a`
MD5	`26c2d9966e0d32bd5597200ca0dafbdb`
BLAKE2b-256	`b78fcb725f3b63d61a223a1cd86f69286a875794b7c93450eed8d5aa5141923f`


Provenance

The following attestation bundles were made for octorun-1.0.1.tar.gz:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octorun-1.0.1.tar.gz
- Subject digest: a4fc0d6ff2947ab976b552a485d276ccf6ed09ce330aab3ea76f64b99984879a
- Sigstore transparency entry: 1397671584
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: HarborYuan/OctoRun@28504978838cf6969f74bdf82b6ffd25fa54570f
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/HarborYuan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28504978838cf6969f74bdf82b6ffd25fa54570f
- Trigger Event: release

File details

Details for the file octorun-1.0.1-py3-none-any.whl.

File metadata

Download URL: octorun-1.0.1-py3-none-any.whl
Upload date: Apr 28, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc1546a28c172e3eeb90f5da48370bbf6072d94d0d13b6643fc88482eb909267`
MD5	`d9d67dab2160ce12670f237a1581b868`
BLAKE2b-256	`a19a5af262ed440a8bbac014a32fd3e53fdfbee0ccc7f39c137ee51052336194`


Provenance

The following attestation bundles were made for octorun-1.0.1-py3-none-any.whl:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: octorun-1.0.1-py3-none-any.whl
- Subject digest: cc1546a28c172e3eeb90f5da48370bbf6072d94d0d13b6643fc88482eb909267
- Sigstore transparency entry: 1397671586
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: HarborYuan/OctoRun@28504978838cf6969f74bdf82b6ffd25fa54570f
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/HarborYuan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28504978838cf6969f74bdf82b6ffd25fa54570f
- Trigger Event: release
