GPU Developer CLI

A command-line tool for reserving and managing GPU development servers on AWS EKS.

Installation

# Install directly from GitHub (recommended)
python3 -m pip install --upgrade "git+https://github.com/wdvr/osdc.git"

# Or install from local clone
git clone https://github.com/wdvr/osdc.git
cd osdc
pip install -e .

Configuration

Initial Setup

# Set your GitHub username (required for SSH key authentication)
gpu-dev config set github_user your-github-username

# View current configuration
gpu-dev config show

Configuration is stored at ~/.config/gpu-dev/config.json.

SSH Config Integration

Enable automatic SSH config for seamless VS Code/Cursor integration:

# Enable SSH config auto-include (recommended)
gpu-dev config ssh-include enable

# Disable if needed
gpu-dev config ssh-include disable

When enabled, this adds Include ~/.gpu-dev/*-sshconfig to:

  • ~/.ssh/config
  • ~/.cursor/ssh_config
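
A minimal sketch of what ~/.ssh/config looks like after enabling (the existing Host entry is illustrative):

Include ~/.gpu-dev/*-sshconfig

# ...your existing entries follow...
Host my-server
    HostName example.com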

AWS Authentication

The CLI uses your AWS credentials. Configure via:

  • aws configure command
  • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  • IAM roles (for EC2/Lambda)
  • SSO: aws sso login --profile your-profile
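
For example, to authenticate via SSO and point the CLI at that profile (the profile name is illustrative):

aws sso login --profile ml-team
export AWS_PROFILE=ml-team
gpu-dev list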

Quick Start

# Interactive reservation (guided setup)
gpu-dev reserve

# Reserve 4 H100 GPUs for 8 hours
gpu-dev reserve --gpu-type h100 --gpus 4 --hours 8

# Check your reservations
gpu-dev list

# Connect to your active reservation
gpu-dev connect

# Check GPU availability
gpu-dev avail

Commands Reference

gpu-dev reserve

Create a GPU reservation.

Interactive Mode (default when parameters omitted):

gpu-dev reserve

Guides you through GPU type, count, duration, disk, and Jupyter selection.

Command-line Mode:

gpu-dev reserve [OPTIONS]

Option                 Short  Description
--gpus                 -g     Number of GPUs (1, 2, 4, 8, 12, 16, 20, 24, 32, 40, 48)
--gpu-type             -t     GPU type: b200, h200, h100, a100, a10g, t4, l4, t4-small, cpu-arm, cpu-x86
--hours                -h     Duration in hours (0.0833 to 24, supports decimals)
--name                 -n     Optional reservation name
--jupyter                     Enable Jupyter Lab access
--disk                        Named persistent disk to use, or none for temporary storage
--no-persist                  Create without persistent disk (ephemeral /home/dev)
--ignore-no-persist           Skip warning when disk is in use
--recreate-env                Recreate shell environment on existing disk
--distributed          -d     Required for multinode reservations (>8 GPUs)
--dockerfile                  Path to custom Dockerfile (max 512KB)
--dockerimage                 Custom Docker image URL
--preserve-entrypoint         Keep original container ENTRYPOINT/CMD
--node-label           -l     Node selector labels (e.g., --node-label nsight=true)
--verbose              -v     Enable debug output
--no-interactive              Force non-interactive mode

Examples:

# 2 H100 GPUs for 4 hours with Jupyter
gpu-dev reserve -t h100 -g 2 -h 4 --jupyter

# Use specific persistent disk
gpu-dev reserve -t a100 -g 4 -h 8 --disk pytorch-dev

# Temporary storage only
gpu-dev reserve -t t4 -g 1 -h 2 --disk none

# 16 GPUs across 2 nodes (multinode)
gpu-dev reserve -t h100 -g 16 -h 12 --distributed

# Custom Docker image
gpu-dev reserve -t h100 -g 4 --dockerimage pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

# Request Nsight profiling node
gpu-dev reserve -t h100 -g 8 --node-label nsight=true

gpu-dev list

List your reservations.

gpu-dev list [OPTIONS]

Option    Short  Description
--user    -u     Filter by user (all for all users)
--status  -s     Filter by status: active, queued, pending, preparing, expired, cancelled, failed
--all     -a     Show all reservations (including expired/cancelled)
--watch          Continuously refresh every 2 seconds

gpu-dev show

Show detailed information for a specific reservation.

gpu-dev show [RESERVATION_ID] [OPTIONS]

If no ID is provided, shows details for your active or pending reservation.

Option   Description
--trace  Show detailed timing breakdown of reservation provisioning

Example with trace:

gpu-dev show abc12345 --trace

# Shows timing breakdown:
# ✓ CLI → Lambda: 0.084s
# ✓ Disk restore: 6.2s
# ✓ Volume attach: 26.1s
# ✓ Init containers: 1.3s
# ✓ Container startup: 13.4s

gpu-dev connect

SSH to your active reservation.

gpu-dev connect [RESERVATION_ID]

If no ID is provided, connects to your active reservation.

gpu-dev cancel

Cancel a reservation.

gpu-dev cancel [RESERVATION_ID]

Interactive Mode: if no ID is provided, a selection menu is shown.

Option  Short  Description
--all   -a     Cancel all your active reservations

gpu-dev edit

Modify an active reservation.

gpu-dev edit [RESERVATION_ID] [OPTIONS]

Option             Description
--enable-jupyter   Enable Jupyter Lab
--disable-jupyter  Disable Jupyter Lab
--extend           Extend reservation duration
--add-user         Add secondary user (GitHub username)

Examples:

# Enable Jupyter on existing reservation
gpu-dev edit abc12345 --enable-jupyter

# Extend reservation
gpu-dev edit abc12345 --extend

# Add collaborator
gpu-dev edit abc12345 --add-user colleague-github-name

gpu-dev avail

Check GPU availability by type.

gpu-dev avail [OPTIONS]

Option   Description
--watch  Continuously refresh every 5 seconds

gpu-dev status

Show overall cluster status and capacity.

gpu-dev status

gpu-dev disk

Manage persistent disks.

gpu-dev disk list

gpu-dev disk list [OPTIONS]

Option   Description
--watch  Continuously refresh every 2 seconds
--user   Impersonate another user

Shows: disk name, size, creation date, last used, snapshot count, and status (available/in-use/backing-up/deleted).

gpu-dev disk create

gpu-dev disk create <DISK_NAME>

Creates a new named persistent disk. Disk names can contain letters, numbers, hyphens, and underscores.
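
For example, a valid name using the allowed characters (the name itself is illustrative):

gpu-dev disk create llm-training_v2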

gpu-dev disk delete

gpu-dev disk delete <DISK_NAME> [--yes/-y]

Soft-deletes a disk. Snapshots are permanently deleted after 30 days.

gpu-dev disk list-content

gpu-dev disk list-content <DISK_NAME>

Shows file listing from the latest snapshot of a disk.

gpu-dev disk rename

gpu-dev disk rename <OLD_NAME> <NEW_NAME>

Renames an existing disk.

gpu-dev help

Show help information.


GPU Types

GPU Type  Instance Type     GPUs/Node  Memory/GPU  Best For
b200      p6-b200.48xlarge  8          192GB       Latest NVIDIA Blackwell, highest performance
h200      p5e.48xlarge      8          141GB       Large models, high memory workloads
h100      p5.48xlarge       8          80GB        Production training, large-scale inference
a100      p4d.24xlarge      8          40GB        General ML training
a10g      g5.12xlarge       4          24GB        Inference, smaller training
l4        g6.12xlarge       4          24GB        Inference, cost-effective
t4        g4dn.12xlarge     4          16GB        Development, testing
t4-small  g4dn.xlarge       1          16GB        Single GPU development
cpu-arm   c7g.4xlarge       0          N/A         ARM CPU-only workloads
cpu-x86   c7i.4xlarge       0          N/A         x86 CPU-only workloads

Storage

Persistent Disk (EBS) - /home/dev

Each user can have named persistent disks that preserve data between sessions:

  • Mount point: /home/dev (your home directory)
  • Size: 100GB per disk
  • Backed up: Automatic snapshots when the reservation ends
  • Content tracking: View contents via gpu-dev disk list-content

Workflow:

# Create a new disk
gpu-dev disk create my-project

# Use it in a reservation
gpu-dev reserve --disk my-project

# List your disks
gpu-dev disk list

# View disk contents (from snapshot)
gpu-dev disk list-content my-project

Multiple Disks: You can have multiple named disks for different projects (e.g., pytorch-dev, llm-training, experiments).

Disk Selection: During interactive reservation, you'll be prompted to select a disk or create a new one.

Shared Personal Storage (EFS) - /shared-personal

Per-user EFS filesystem for larger files that persist across all your reservations:

  • Mount point: /shared-personal
  • Size: Elastic (pay for what you use)
  • Use case: Datasets, model checkpoints, large files
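
For example, you might stage a dataset once and reuse it across reservations (the paths are illustrative):

# Copy a dataset into shared personal storage, then verify it landed
cp -r ~/data/tokenized-corpus /shared-personal/
ls /shared-personal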

Shared ccache (EFS) - /ccache

Shared compiler cache across ALL users:

  • Mount point: /ccache
  • Environment: CCACHE_DIR=/ccache
  • Benefit: Faster compilation for PyTorch and other C++ projects
  • Shared: Cache hits from any user benefit everyone
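
A sketch of a build that uses the shared cache (the CMake launcher variables are a common convention for ccache-enabled builds, assumed here rather than specific to this tool):

# CCACHE_DIR=/ccache is preconfigured, so hits land in the shared cache
ccache -s   # show cache statistics before the build
CMAKE_C_COMPILER_LAUNCHER=ccache CMAKE_CXX_COMPILER_LAUNCHER=ccache \
    python setup.py develop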

Temporary Storage

Use --disk none or --no-persist for reservations without persistent disk:

  • /home/dev uses ephemeral storage
  • Data is lost when reservation ends
  • Useful for quick experiments or CI-like workflows
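
For example, either flag produces an ephemeral reservation:

# Equivalent ways to skip the persistent disk
gpu-dev reserve -t t4 -g 1 -h 2 --disk none
gpu-dev reserve -t t4 -g 1 -h 2 --no-persist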

Multinode Reservations

For distributed training across multiple GPU nodes:

# 16 H100 GPUs (2 nodes x 8 GPUs)
gpu-dev reserve -t h100 -g 16 --distributed

# 24 H100 GPUs (3 nodes x 8 GPUs)
gpu-dev reserve -t h100 -g 24 --distributed

Requirements:

  • GPU count must be a multiple of GPUs-per-node (e.g., 16, 24, 32 for H100)
  • --distributed flag is required

What you get:

  • Multiple pods with hostname resolution: <podname>-headless.gpu-dev.svc.cluster.local
  • Shared network drive between nodes
  • Network connectivity between all pods
  • Master port 29500 available on all nodes
  • EFA (Elastic Fabric Adapter) for high-bandwidth inter-node communication

Node naming: Nodes are numbered 0 to N-1. Use $RANK or node index to set MASTER_ADDR.
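
A launch sketch using torchrun (the pod hostname and train.py are placeholders, and NODE_RANK is assumed to be set per node, e.g. 0 and 1):

# Run on every node; node 0's headless hostname is the rendezvous master
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=$NODE_RANK \
    --master_addr=<pod-0-name>-headless.gpu-dev.svc.cluster.local \
    --master_port=29500 \
    train.py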


Custom Docker Images

Using a Pre-built Image

gpu-dev reserve --dockerimage pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

Note: The image must include an SSH server for remote access.

Using a Custom Dockerfile

gpu-dev reserve --dockerfile ./my-project/Dockerfile

Limitations:

  • Dockerfile max size: 512KB
  • Build context (directory) max size: ~700KB compressed
  • Build happens at reservation time (adds startup time)

Example Dockerfile:

FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

# Install additional packages
RUN pip install transformers datasets accelerate

# Your customizations...

Preserving Entrypoint

To keep the original container's ENTRYPOINT/CMD instead of the SSH server:

gpu-dev reserve --dockerimage myimage:latest --preserve-entrypoint

Nsight Profiling

For GPU profiling with NVIDIA Nsight Compute (ncu) and Nsight Systems (nsys):

# Request a profiling-dedicated node
gpu-dev reserve -t h100 -g 8 --node-label nsight=true

Why dedicated nodes?

  • DCGM (GPU monitoring) conflicts with Nsight profiling
  • Profiling-dedicated nodes have DCGM disabled
  • One H100, one B200, and one T4 node are reserved for profiling

Profiling capabilities enabled:

  • CAP_SYS_ADMIN Linux capability on pods
  • NVreg_RestrictProfilingToAdminUsers=0 on nodes
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility

Available profiling tools:

  • ncu - Nsight Compute for kernel profiling
  • nsys - Nsight Systems for system-wide profiling
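
Typical invocations (the flags shown are standard Nsight options, not specific to this tool):

# System-wide timeline, written to trace.nsys-rep
nsys profile -o trace python train.py

# Kernel-level profiling with the full metric set
ncu --set full -o kernels python train.py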

Default Container Image

The default image (based on pytorch/pytorch:2.9.1-cuda12.8-cudnn9-devel) includes:

Pre-installed Software

Deep Learning:

  • PyTorch 2.9.1 with CUDA 12.8
  • cuDNN 9
  • CUDA Toolkit 12.8 + 13.0

Python Packages:

  • JupyterLab, ipywidgets
  • matplotlib, seaborn, plotly
  • pandas, numpy, scikit-learn
  • tensorboard

System Tools:

  • zsh with oh-my-zsh (default shell)
  • bash with bash-completion
  • vim, nano, neovim
  • tmux, htop, tree
  • git, curl, wget
  • ccache

Development:

  • Claude Code CLI (claude)
  • Node.js 20
  • SSH server

Shell Environment

  • Default shell: zsh with oh-my-zsh
  • Plugins: zsh-autosuggestions, zsh-syntax-highlighting
  • User: dev with passwordless sudo
  • Home: /home/dev (persistent or temporary based on disk settings)

Environment Variables

CUDA_12_PATH=/usr/local/cuda-12.8
CUDA_13_PATH=/usr/local/cuda-13.0
CCACHE_DIR=/ccache
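
For example, to build against the CUDA 13 toolkit (CUDA_HOME is a widely used convention, assumed here rather than guaranteed by the image):

export CUDA_HOME="$CUDA_13_PATH"
export PATH="$CUDA_HOME/bin:$PATH"
nvcc --version   # should now report CUDA 13.0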

SSH & IDE Integration

SSH Access

Once your reservation is active:

# Quick connect
gpu-dev connect

# Or use the SSH command shown in reservation details
ssh dev@<node-ip> -p <nodeport>

# With SSH config enabled (recommended)
ssh <pod-name>

VS Code Remote

With SSH config enabled:

code --remote ssh-remote+<pod-name> /home/dev

Or click the VS Code link shown in gpu-dev show output.

Cursor IDE

Works the same as VS Code when SSH config is enabled:

  1. Open Remote SSH in Cursor
  2. Select your pod from the list

SSH Agent Forwarding

To use your local SSH keys on the server (e.g., for git):

ssh -A <pod-name>

Or add to your SSH config:

Host gpu-dev-*
    ForwardAgent yes
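
To verify forwarding works end to end (the GitHub test is illustrative):

# Should greet you with your GitHub username if the local key was forwarded
ssh -A <pod-name> 'ssh -T git@github.com'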

Reservation Limits

Limit             Value
Maximum duration  24 hours
Minimum duration  5 minutes (0.0833 hours)
Extension         Once, up to 24 additional hours
Total max time    48 hours (24h initial + 24h extension)

Expiry Warnings:

  • 30 minutes before expiry
  • 15 minutes before expiry
  • 5 minutes before expiry

Warnings appear as files in your home directory and via wall messages.


Architecture

System Components

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  GPU Dev    │────▶│  SQS Queue   │────▶│  Lambda Processor   │
│    CLI      │     │              │     │                     │
└─────────────┘     └──────────────┘     └──────────┬──────────┘
       │                                            │
       │                                            ▼
       │            ┌──────────────┐     ┌─────────────────────┐
       └───────────▶│  DynamoDB    │◀────│    EKS Cluster      │
                    │ Reservations │     │   (GPU Nodes)       │
                    └──────────────┘     └─────────────────────┘

Infrastructure

  • EKS Cluster: Kubernetes cluster with GPU-enabled nodes
  • Node Groups: Auto-scaling groups per GPU type
  • NVIDIA GPU Operator: Manages GPU drivers and device plugin
  • EBS CSI Driver: Handles persistent volume attachments
  • EFS: Shared storage for personal files and ccache

Networking

  • SSH Access: Via NodePort services (30000-32767)
  • Inter-node: EFA (Elastic Fabric Adapter) for multinode
  • DNS: Pod hostname resolution via headless services
  • Internet: Full outbound access from pods

Troubleshooting

Common Issues

"Disk is in use":

  • Your disk is attached to another reservation
  • Cancel the other reservation or use --disk none
  • Check: gpu-dev disk list

"Queued" status:

  • No GPU capacity available
  • Wait for queue position to advance
  • Check availability: gpu-dev avail

SSH connection refused:

  • Pod may still be starting
  • Wait for status to become "active"
  • Check: gpu-dev show <id>

Pod stuck in "preparing":

  • Image pull may be slow (especially for custom images)
  • Disk attachment may take time
  • Check detailed status: gpu-dev show <id>

Debugging Commands

# Show detailed reservation info
gpu-dev show <reservation-id>

# Watch reservation status
gpu-dev list --watch

# Check cluster status
gpu-dev status

# View disk contents
gpu-dev disk list-content <disk-name>

Getting Help

For bugs and feature requests, open an issue on the project's GitHub repository: https://github.com/wdvr/osdc.

Development

# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .
poetry run isort .

# Type checking
poetry run mypy .
