AI Computing Power Scheduling Platform

🚀 Schedule GPU Clusters Like Writing Python Scripts

Declarative YAML · Zero K8s Knowledge · Resource Pool Isolation


✨ Why gpuctl

One Command to Rule Them All
gpuctl create -f job.yaml

Say goodbye to 100+ lines of K8s YAML and submit tasks with a short declarative configuration

Multi-Team Resource Isolation
Training Pool / Inference Pool / Dev Pool

Logical isolation prevents resource contention, with quota management per team

Distributed Training Built-In
Indexed Job + Headless Service

Set resources.nodes: N and the platform auto-injects the DDP env vars (MASTER_ADDR / RANK / WORLD_SIZE)

One-Stop Monitoring
Logs / Events / Resource Usage

No more kubectl get pods to find pod names

ML Engineer Friendly
kind / job / resources

Familiar YAML syntax, no need to understand Pod/Deployment concepts

Namespace-Level Quotas
CPU / Memory / GPU

A ResourceQuota is bound automatically when a Namespace is created

Complete API Support
HTTP / WebSocket

Easy integration with MLOps platforms or third-party tools

Existing K8s Cluster
Ready to Use

No cluster configuration changes, no impact on existing workloads

Zero-config NFS storage on every job
An operator runs gpuctl init once, and from then on every job auto-mounts a persistent, per-user /home/jovyan (read-write) and a shared /datasets (read-only). User YAML needs no mount paths, storage classes, or PVCs. Files survive restarts and are shared across a user's Notebook and Training jobs.
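
To check from inside a running job that the two mounts described above behave as stated, a small probe like the following may help. It is only a sketch: the helper names (can_write, check_mounts) are hypothetical and not part of gpuctl, and it assumes nothing beyond the two paths named above.

```python
import os
import tempfile
from typing import Dict


def can_write(directory: str) -> bool:
    """Return True if a probe file can be created (and auto-removed) in directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory, permission errors and read-only filesystems.
        return False


def check_mounts(home: str = "/home/jovyan", datasets: str = "/datasets") -> Dict[str, bool]:
    """Report whether the home mount is writable and the datasets mount is read-only."""
    return {
        "home_exists": os.path.isdir(home),
        "home_writable": can_write(home),
        "datasets_exists": os.path.isdir(datasets),
        "datasets_read_only": os.path.isdir(datasets) and not can_write(datasets),
    }


if __name__ == "__main__":
    for name, ok in check_mounts().items():
        print(f"{name}: {ok}")
```

Inside a job on a cluster where storage has been initialised, all four values would be expected to be True; anywhere else the function simply reports what it finds.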

🚀 Quick Start

# 1. Install CLI
pip install gpuctl

# (operator, once) Enable transparent persistent storage for every job
gpuctl init --nfs-server <IP> --nfs-path /exports

# 2. Submit LLM fine-tuning task (4x A100)
cat > training.yaml << 'EOF'
kind: training
version: v0.1
job:
  name: qwen2-7b-sft
environment:
  image: llama-factory:latest
  command: ["llamafactory-cli", "train", "--stage", "sft"]
resources:
  pool: training-pool
  gpu: 4
  cpu: 32
  memory: 128Gi
EOF

gpuctl create -f training.yaml

# 3. Check task status
gpuctl get jobs

# 4. View logs in real-time
gpuctl logs qwen2-7b-sft -f
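
The same submission can be driven from Python, for example from a sweep script. The sketch below renders the manifest from step 2 and shells out to the gpuctl create -f command from the steps above. It uses only the fields and the command that appear in this Quick Start; render_training_manifest and submit are hypothetical helper names, not part of gpuctl.

```python
import json
import subprocess
from pathlib import Path
from typing import List


def render_training_manifest(name: str, image: str, command: List[str],
                             pool: str, gpu: int, cpu: int, memory: str) -> str:
    """Return a training manifest with the same fields as the Quick Start example."""
    # A JSON array of strings is also a valid YAML flow sequence,
    # so json.dumps is enough to quote the command safely.
    return "\n".join([
        "kind: training",
        "version: v0.1",
        "job:",
        f"  name: {name}",
        "environment:",
        f"  image: {image}",
        f"  command: {json.dumps(command)}",
        "resources:",
        f"  pool: {pool}",
        f"  gpu: {gpu}",
        f"  cpu: {cpu}",
        f"  memory: {memory}",
        "",
    ])


def submit(manifest: str, path: Path) -> None:
    """Write the manifest to disk and hand it to `gpuctl create -f`."""
    path.write_text(manifest, encoding="utf-8")
    # check=True raises CalledProcessError if gpuctl exits non-zero.
    subprocess.run(["gpuctl", "create", "-f", str(path)], check=True)


manifest = render_training_manifest(
    name="qwen2-7b-sft",
    image="llama-factory:latest",
    command=["llamafactory-cli", "train", "--stage", "sft"],
    pool="training-pool",
    gpu=4, cpu=32, memory="128Gi",
)
print(manifest)
# To submit for real (requires gpuctl on PATH and cluster access):
# submit(manifest, Path("training.yaml"))
```

Rendering and submitting are kept separate so the manifest can be inspected or diffed before anything reaches the cluster.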

🆚 gpuctl vs Native Kubectl

| Scenario | ✨ gpuctl Way | Native Kubectl Way |
| --- | --- | --- |
| 📝 Submit Training Task | Just 15-20 lines of declarative config, fill in familiar fields like kind, job.name, resources.gpu, and submit | Write 120+ lines of K8s YAML, manually create Secret, ConfigMap, Job resources, understand PodSpec, ResourceRequirements, VolumeMounts |
| 📊 Check Task Status | One command for all tasks gpuctl get jobs, auto-aggregate Pod status, show task name, status, resource usage | kubectl get jobs to find Job, then get pods -l job-name=xxx to find Pod, finally describe pod for details, tedious process |
| 🔍 View Task Logs | Use task name directly gpuctl logs <job-name> -f, auto-track Pod changes, support multi-replica aggregated logs | Remember Pod name (e.g. training-job-7d9f4b8c5-x2mnp), run kubectl logs <pod-name> -f, re-find after Pod restart |
| 🧠 Multi-Node Distributed Training | Just set resources.nodes: N, platform creates an Indexed Job + Headless Service and auto-injects DDP rendezvous env vars (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK); all workers share one NFS /home/jovyan for checkpoints | Manually create an Indexed/JobSet + Headless Service, wire up MASTER_ADDR/RANK/WORLD_SIZE, provision shared storage, understand GPU communication and process groups |
| 🏊 Resource Pool Management | Declarative pool config, pool: training-pool auto-schedules to corresponding node group, supports multi-team isolation and quota control | Manually bind nodes via LabelSelector and NodeAffinity, maintain complex scheduling strategies and resource limits per team |
| 📋 Resource Quota Management | Quota auto-created with Namespace, gpuctl describe quota one-click view of used/total, auto-reject with friendly message when exceeded | Manually create ResourceQuota and LimitRange, configure per Namespace, query usage multiple times for aggregation |
| ⚡ Deploy Inference Service | Auto-create Deployment + Service, declare replicas and port, auto-generate NodePort to expose service, built-in readiness probe | Create Deployment, Service, Ingress/NodePort separately, configure HPA auto-scaling, understand Service types and network policies |
| 📓 Launch Notebook | One-click JupyterLab launch, auto-generate access link, support custom images and passwords, auto-mount storage volumes | Manually create StatefulSet, Headless Service, Ingress, configure PVC storage, handle Jupyter Token and passwords |
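
The multi-node row above lists five rendezvous variables that the platform injects. A training script can validate them up front and fail fast with a clear message, as in the sketch below. The variable names come from the table; the Rendezvous class, the read_rendezvous helper, and the example values (host name, port) are illustrative assumptions. Whether WORLD_SIZE counts nodes or processes under gpuctl is not stated here, so the sketch only parses the value.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names as listed in the comparison table.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")


@dataclass(frozen=True)
class Rendezvous:
    master_addr: str
    master_port: int
    world_size: int
    rank: int
    local_rank: int


def read_rendezvous(env: Mapping[str, str] = os.environ) -> Rendezvous:
    """Parse the injected variables, raising if any of them is missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError("missing rendezvous variables: " + ", ".join(missing))
    return Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
    )


# Example values only; inside a real job these come from the environment.
example_env = {
    "MASTER_ADDR": "trainer-0.trainer",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "RANK": "1",
    "LOCAL_RANK": "0",
}
rdv = read_rendezvous(example_env)
print(f"rank {rdv.rank} of {rdv.world_size}, master at {rdv.master_addr}:{rdv.master_port}")
```

With PyTorch, torch.distributed.init_process_group uses the env:// init method by default and reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK itself, so a helper like this is mainly useful for validation and logging before initialisation.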

🏗️ Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐
│   User      │────▶│  gpuctl CLI │────▶│  K8s Job/Deployment/        │
│  (YAML)     │     │   / REST API│     │  StatefulSet + Service      │
└─────────────┘     └─────────────┘     └─────────────────────────────┘

📚 Documentation

Complete documentation is available in the docs/ directory, or check out the quick navigation below:

Getting Started

User Guides

  • Training Tasks: LLM fine-tuning, conda env reuse, multi-node distributed training
  • Persistent Storage: transparent NFS /home/jovyan + /datasets, zero config in job YAML
  • Inference Services: vLLM inference deployment; single-node tensor-parallel and multi-node (resources.nodes: N) model-parallel serving
  • Notebooks: JupyterLab interactive development
  • Resource Pool Management: GPU resource pool configuration

Reference

  • CLI Commands: Complete command reference
  • API Documentation: RESTful API specifications
  • FAQ: Frequently asked questions and troubleshooting

Development & Contribution


💻 Installation

Prerequisites

  • Python 3.8+
  • Kubernetes cluster access (via kubectl)

From PyPI (Recommended)

pip install gpuctl

From Source

git clone https://github.com/runwhere-ai/gpuctl.git
cd gpuctl
pip install -e .

Binary Download

# Linux
wget https://github.com/runwhere-ai/gpuctl/releases/latest/download/gpuctl-linux-amd64
chmod +x gpuctl-linux-amd64
sudo mv gpuctl-linux-amd64 /usr/local/bin/gpuctl

🌟 Show Your Support

If gpuctl helps you, please give us a ⭐️ Star!

📄 License

MIT License © 2024 GPU Control Team
