poormanray

A minimal alternative to Ray for distributed data processing on EC2 instances. Manage clusters, run commands, and distribute jobs without the complexity of a full Ray deployment.

Installation

Requires Python 3.10+.

# Install as a CLI tool (recommended)
uv tool install poormanray

# Or install as a library (use either command)
uv pip install poormanray
pip install poormanray

Quick Start

# Create a cluster of 5 instances
pmr create --name mycluster --number 5 --instance-type i4i.2xlarge

# List instances in the cluster
pmr list --name mycluster

# Run a command on all instances
# (single quotes keep $(hostname) from being expanded by your local shell)
pmr run --name mycluster --command 'echo "Hello from $(hostname)"'

# Terminate the cluster when done
pmr terminate --name mycluster

Prerequisites

  • AWS credentials configured via:
    • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    • AWS CLI (aws configure)
    • Credentials file (~/.aws/credentials)
  • SSH key pair in ~/.ssh/ (id_rsa, id_ed25519, etc.)
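
Before creating a cluster, it can help to confirm that these prerequisites are in place. The following is a minimal sketch, not part of pmr: the function names are hypothetical, the check only looks at the environment variables and default file locations listed above (so it will miss other credential sources such as SSO profiles), and id_ecdsa is an assumed addition to the key names.

```shell
# Hypothetical helpers, not part of pmr.

# Succeeds if both credential variables are set, or a credentials file exists.
has_aws_credentials() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        return 0
    fi
    [ -f "${HOME}/.aws/credentials" ]
}

# Succeeds if a commonly named private key exists in the given
# directory (defaults to ~/.ssh).
has_ssh_key() {
    local dir="${1:-${HOME}/.ssh}"
    local name
    for name in id_rsa id_ed25519 id_ecdsa; do
        [ -f "${dir}/${name}" ] && return 0
    done
    return 1
}

# Prints one line per missing prerequisite; prints nothing if all look fine.
check_prerequisites() {
    has_aws_credentials || echo "missing: AWS credentials"
    has_ssh_key || echo "missing: SSH key in ~/.ssh"
}
```

Running check_prerequisites after sourcing this snippet prints nothing when both checks pass.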

Commands

Cluster Management

create - Launch EC2 instances

pmr create --name mycluster --number 5 --instance-type i4i.2xlarge

# Options:
#   -n, --name          Cluster name (required)
#   -N, --number        Number of instances (default: 1)
#   -t, --instance-type EC2 instance type (default: i4i.xlarge)
#   -r, --region        AWS region (default: us-east-1)
#   -a, --ami-id        Custom AMI ID (default: Amazon Linux 2023)
#   -d, --detach        Don't wait for instances to be ready
#   --zone              Availability zone
#   --storage-type      EBS volume type (gp3, gp2, io1, io2, st1, sc1)
#   --storage-size      Root volume size in GiB
#   --storage-iops      IOPS for the root volume

list - Show cluster instances

pmr list --name mycluster

# Output includes: instance ID, name, type, state, IP, status checks
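
Because instances are tagged with Project set to the cluster name (see How It Works), the same set of instances can be cross-checked with the AWS CLI. This is a sketch under that assumption, not what pmr list runs internally. The function names are hypothetical, and the aws CLI must be installed separately.

```shell
# Build the EC2 filter for a cluster, assuming the Project tag
# holds the cluster name.
cluster_filter() {
    printf 'Name=tag:Project,Values=%s' "$1"
}

# Show instance ID, type, state, and public IP for a cluster.
# Usage: list_cluster_instances CLUSTER_NAME [REGION]
list_cluster_instances() {
    aws ec2 describe-instances \
        --region "${2:-us-east-1}" \
        --filters "$(cluster_filter "$1")" \
        --query 'Reservations[].Instances[].[InstanceId,InstanceType,State.Name,PublicIpAddress]' \
        --output table
}
```

For example, list_cluster_instances mycluster queries us-east-1, matching the default region of the pmr commands.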

terminate - Destroy instances

pmr terminate --name mycluster

# Terminate specific instances only:
pmr terminate --name mycluster -i i-abc123 -i i-def456

pause / resume - Stop and start instances

pmr pause --name mycluster    # Stop instances (preserves EBS)
pmr resume --name mycluster   # Start stopped instances

Command Execution

run - Execute commands on instances

# Run a command
pmr run --name mycluster --command "df -h"

# Run a script
pmr run --name mycluster --script ./my-script.sh

# Run in background (detached)
pmr run --name mycluster --command "./long-running-job.sh" --detach

# Auto-terminate after command completes
pmr run --name mycluster --command "./job.sh" --spindown

map - Distribute scripts across instances

Distributes a directory of scripts evenly across all instances and runs them in parallel.

# Create scripts directory with executable scripts
ls scripts/
# job_001.sh  job_002.sh  job_003.sh  job_004.sh  job_005.sh

# Distribute and run across cluster
pmr map --name mycluster --script scripts/

# Scripts are distributed round-robin and executed in parallel
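
To make the round-robin assignment concrete, the sketch below prints which instance each script would land on. It is an illustration only: assign_round_robin is a hypothetical helper, instances are numbered from 0 for convenience, and the ordering pmr actually uses for scripts and instances is not specified here.

```shell
# Illustration only, not pmr's implementation.
# Usage: assign_round_robin NUM_INSTANCES SCRIPT...
assign_round_robin() {
    local num_instances="$1"
    shift
    local index=0
    local script
    for script in "$@"; do
        # Script k goes to instance (k mod NUM_INSTANCES).
        echo "instance $((index % num_instances)): ${script}"
        index=$((index + 1))
    done
}

assign_round_robin 2 job_001.sh job_002.sh job_003.sh job_004.sh job_005.sh
```

With five scripts and two instances, instance 0 receives three scripts and instance 1 receives two, so the load differs by at most one script per instance.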

Instance Setup

setup - Configure AWS credentials

Copies your AWS credentials to all instances in the cluster.

pmr setup --name mycluster

setup-d2tk - Install Dolma2 Toolkit

Sets up RAID drives, installs Rust, and builds datamap-rs and minhash-rs.

pmr setup-d2tk --name mycluster --detach

setup-dolma-python - Install Dolma Python

Installs Python 3.12, uv, and the dolma package.

pmr setup-dolma-python --name mycluster --detach

setup-decon - Install DECON toolkit

Sets up the DECON pipeline with Rust toolchain.

pmr setup-decon --name mycluster --github-token ghp_xxx --detach

Common Options

These options are available on most commands:

Option          Short   Description
--name          -n      Cluster name (required)
--region        -r      AWS region (default: us-east-1)
--instance-id   -i      Target specific instance(s), repeatable
--ssh-key-path  -k      Path to SSH private key
--detach        -d      Run in background
--owner         -o      Owner tag for cost tracking

How It Works

  1. Instance Tagging: Instances are tagged with Project (cluster name) and Contact (owner) for easy identification and cost tracking.

  2. SSH Key Management: Your local SSH key is automatically imported to EC2 when creating instances.

  3. Remote Execution: Commands are executed over SSH using paramiko. Long-running commands use GNU screen for detached execution.

  4. Script Distribution: The map command base64-encodes scripts, transfers them to instances, and executes them in parallel.
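
The base64 step in item 4 can be tried locally. The sketch below is an assumption about the general technique, not pmr's code: encode_script and run_encoded are hypothetical names, and the decode step runs on the local machine here, whereas pmr performs it on the instance over SSH. Encoding a script this way lets it travel inside a single command string without quoting problems.

```shell
# Encode a script file as one base64 line (newlines stripped so the
# payload can be embedded in a command string).
encode_script() {
    base64 < "$1" | tr -d '\n'
}

# Decode a base64 payload and execute it with bash. On a cluster, this
# half would run on the remote side of the SSH session.
run_encoded() {
    printf '%s' "$1" | base64 --decode | bash
}
```

A round trip looks like payload="$(encode_script scripts/job_001.sh)" followed by run_encoded "$payload".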

Examples

Data Processing Pipeline

# 1. Create a cluster
pmr create --name dataproc --number 10 --instance-type i4i.4xlarge

# 2. Set up the environment (--detach returns immediately; let setup finish before step 3)
pmr setup-dolma-python --name dataproc --detach

# 3. Distribute processing scripts
pmr map --name dataproc --script ./processing-jobs/

# 4. Check progress (tail -f would never return, so print the latest lines instead)
pmr run --name dataproc --command "tail -n 20 ~/*/run_all.log"

# 5. Clean up
pmr terminate --name dataproc

Quick One-Off Command

# Create, run, and terminate in one go
pmr create --name quickjob --number 1
pmr run --name quickjob --command "./my-job.sh" --spindown
# Instance auto-terminates after job completes

License

Apache-2.0

