AI Computing Power Scheduling Platform
Schedule GPU Clusters Like Writing Python Scripts
Declarative YAML · Zero K8s Knowledge · Resource Pool Isolation
简体中文 (Simplified Chinese) • Quick Start • Documentation • Features
Why gpuctl
| Feature | In short | What you get |
|---|---|---|
| One Command to Rule Them All | `gpuctl create -f job.yaml` | Say goodbye to 100+ lines of K8s YAML; submit tasks with declarative configuration |
| Multi-Team Resource Isolation | Training Pool / Inference Pool / Dev Pool | Logical isolation prevents resource contention, with quota management per team |
| Distributed Training Built-In | Indexed Job + Headless Service | Set `resources.nodes: N` and the platform auto-injects DDP env vars (MASTER_ADDR / RANK / WORLD_SIZE) |
| One-Stop Monitoring | Logs / Events / Resource Usage | No more `kubectl get pods` to find pod names |
| ML Engineer Friendly | kind / job / resources | Familiar YAML syntax, no need to understand Pod/Deployment concepts |
| Namespace-Level Quotas | CPU / Memory / GPU | A ResourceQuota is bound automatically when a Namespace is created |
| Complete API Support | HTTP / WebSocket | Easy integration with MLOps platforms or third-party tools |
| Existing K8s Cluster | Ready to Use | No cluster configuration changes, no impact on existing workloads |

Zero-config NFS storage on every job: the operator runs `gpuctl init` once, and from then on every job auto-mounts a persistent, per-user /home/jovyan (read-write) and a shared /datasets (read-only). No mount paths, no storage classes, no PVCs in user YAML. Files survive restarts and are shared across a user's Notebook and Training jobs.
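Because /home/jovyan survives restarts, a training script can pick up where it left off. The following is a minimal sketch of that idea, not part of gpuctl: the `step-<N>.ckpt` file naming and the `runs/` directory in the usage note are hypothetical conventions chosen for illustration; only the two mount paths come from the text above.

```python
import re
from pathlib import Path

HOME = Path("/home/jovyan")    # persistent, per-user, read-write (per the text above)
DATASETS = Path("/datasets")   # shared, read-only (per the text above)

_CKPT = re.compile(r"^step-(\d+)\.ckpt$")  # hypothetical checkpoint naming scheme


def latest_checkpoint(run_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    found = []
    for path in Path(run_dir).glob("step-*.ckpt"):
        match = _CKPT.match(path.name)
        if match:  # skips names such as step-final.ckpt
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0], default=None)


def resume_step(run_dir):
    """Step to continue from: one past the last saved step, or 0 for a fresh run."""
    latest = latest_checkpoint(run_dir)
    return 0 if latest is None else latest[0] + 1
```

A job could call `resume_step(HOME / "runs" / "qwen2-7b-sft")` at startup; since a user's Notebook and Training jobs share the same home directory, the same checkpoints are visible from both.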
Quick Start
# 1. Install CLI
pip install gpuctl
# (operator, once) Enable transparent persistent storage for every job
gpuctl init --nfs-server <IP> --nfs-path /exports
# 2. Submit LLM fine-tuning task (4x A100)
cat > training.yaml << 'EOF'
kind: training
version: v0.1
job:
  name: qwen2-7b-sft
environment:
  image: llama-factory:latest
  command: ["llamafactory-cli", "train", "--stage", "sft"]
resources:
  pool: training-pool
  gpu: 4
  cpu: 32
  memory: 128Gi
EOF
gpuctl create -f training.yaml
# 3. Check task status
gpuctl get jobs
# 4. View logs in real-time
gpuctl logs qwen2-7b-sft -f
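When several jobs differ only in name or GPU count, the manifest can be generated instead of hand-edited. The sketch below is illustrative only: it assumes `environment` and `resources` are top-level keys next to `job` (inferred from the Quick Start file, not confirmed by a schema), and the tiny renderer handles only plain scalars and flat string lists. The helper names are hypothetical; the only gpuctl call referenced is the documented `gpuctl create -f <file>`.

```python
import json


def render_yaml(node, indent=0):
    """Render a nested dict of plain scalars and flat lists as YAML lines."""
    pad = "  " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(render_yaml(value, indent + 1))
        elif isinstance(value, list):
            # A JSON array of plain ASCII strings is also a YAML flow sequence.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
        else:
            # No quoting is applied, so values must be YAML-safe scalars.
            lines.append(f"{pad}{key}: {value}")
    return lines


def training_manifest(name, gpu, pool="training-pool"):
    """Quick Start manifest with a configurable job name and GPU count."""
    return {
        "kind": "training",
        "version": "v0.1",
        "job": {"name": name},
        "environment": {
            "image": "llama-factory:latest",
            "command": ["llamafactory-cli", "train", "--stage", "sft"],
        },
        "resources": {"pool": pool, "gpu": gpu, "cpu": 32, "memory": "128Gi"},
    }


def manifest_text(name, gpu):
    """Full manifest as a string, ready to write to a .yaml file."""
    return "\n".join(render_yaml(training_manifest(name, gpu))) + "\n"


def create_command(path):
    """Argument list for the documented `gpuctl create -f <file>` call."""
    return ["gpuctl", "create", "-f", path]


if __name__ == "__main__":
    print(manifest_text("qwen2-7b-sft", 4), end="")
```

Writing `manifest_text(...)` to a file and passing `create_command(path)` to `subprocess.run` would submit the job; the sketch stops short of that so it can be run without a cluster.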
gpuctl vs Native Kubectl
| Scenario | gpuctl Way | Native Kubectl Way |
|---|---|---|
| Submit Training Task | Just 15-20 lines of declarative config: fill in familiar fields like kind, job.name, resources.gpu, and submit | Write 120+ lines of K8s YAML, manually create Secret, ConfigMap, and Job resources, and understand PodSpec, ResourceRequirements, VolumeMounts |
| Check Task Status | One command for all tasks, `gpuctl get jobs`: auto-aggregates Pod status and shows task name, status, resource usage | Run `kubectl get jobs` to find the Job, then `get pods -l job-name=xxx` to find the Pod, and finally `describe pod` for details; a tedious process |
| View Task Logs | Use the task name directly, `gpuctl logs <job-name> -f`: auto-tracks Pod changes and supports multi-replica aggregated logs | Remember the Pod name (e.g. training-job-7d9f4b8c5-x2mnp), run `kubectl logs <pod-name> -f`, and find the Pod again after every restart |
| Multi-Node Distributed Training | Just set `resources.nodes: N`; the platform creates an Indexed Job + Headless Service and auto-injects DDP rendezvous env vars (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK); all workers share one NFS /home/jovyan for checkpoints | Manually create an Indexed Job/JobSet + Headless Service, wire up MASTER_ADDR/RANK/WORLD_SIZE, provision shared storage, and understand GPU communication and process groups |
| Resource Pool Management | Declarative pool config: `pool: training-pool` auto-schedules to the corresponding node group, with multi-team isolation and quota control | Manually bind nodes via LabelSelector and NodeAffinity, and maintain complex scheduling strategies and resource limits per team |
| Resource Quota Management | Quota is auto-created with the Namespace; `gpuctl describe quota` shows used/total at a glance; requests over quota are rejected with a friendly message | Manually create ResourceQuota and LimitRange, configure them per Namespace, and query usage multiple times to aggregate it |
| Deploy Inference Service | Auto-creates Deployment + Service: declare replicas and port, a NodePort is generated to expose the service, and a readiness probe is built in | Create Deployment, Service, and Ingress/NodePort separately, configure HPA auto-scaling, and understand Service types and network policies |
| Launch Notebook | One-click JupyterLab launch with an auto-generated access link, custom images and passwords, and auto-mounted storage volumes | Manually create StatefulSet, Headless Service, and Ingress, configure PVC storage, and handle the Jupyter token and passwords |
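Inside a worker, the injected rendezvous variables listed in the table are ordinary environment variables. As a rough sketch of how a training script might consume them (the `Rendezvous` class and `read_rendezvous` helper are hypothetical names, and treating rank 0 as the master follows the usual PyTorch DDP convention rather than anything gpuctl documents):

```python
import os
from dataclasses import dataclass

# Variable names are the ones listed in the comparison table above.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK")


@dataclass(frozen=True)
class Rendezvous:
    master_addr: str
    master_port: int
    world_size: int
    rank: int
    local_rank: int

    @property
    def init_method(self):
        """TCP rendezvous URL of the form tcp://host:port."""
        return f"tcp://{self.master_addr}:{self.master_port}"

    @property
    def is_master(self):
        # Assumption: rank 0 acts as the master, as in standard DDP setups.
        return self.rank == 0


def read_rendezvous(env=None):
    """Parse and sanity-check the DDP variables from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError(f"missing DDP variables: {', '.join(missing)}")
    rdv = Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
    )
    if not 0 <= rdv.rank < rdv.world_size:
        raise ValueError(f"RANK {rdv.rank} outside [0, {rdv.world_size})")
    return rdv
```

Failing fast on a missing or inconsistent variable tends to be easier to debug than a process group that hangs while waiting for peers.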
Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐
│    User     │────▶│ gpuctl CLI  │────▶│  K8s Job/Deployment/        │
│   (YAML)    │     │ / REST API  │     │  StatefulSet + Service      │
└─────────────┘     └─────────────┘     └─────────────────────────────┘
Documentation
Complete documentation is available in the docs/ directory, or check out the quick navigation below:
Getting Started
- Quick Start: Get started with gpuctl in 5 minutes
- Installation Guide: Detailed installation steps
User Guides
- Training Tasks: LLM fine-tuning, conda env reuse, multi-node distributed training
- Persistent Storage: transparent NFS /home/jovyan + /datasets, zero config in job YAML
- Inference Services: vLLM inference deployment; single-node tensor-parallel and multi-node (resources.nodes: N) model-parallel serving
- Notebooks: JupyterLab interactive development
- Resource Pool Management: GPU resource pool configuration
Reference
- CLI Commands: Complete command reference
- API Documentation: RESTful API specifications
- FAQ: Frequently asked questions and troubleshooting
Development & Contribution
- Architecture Design: System design documentation
- Local Development: Development environment setup
- Contributing Guide: How to contribute
Installation
Prerequisites
- Python 3.8+
- Kubernetes cluster access (via kubectl)
From PyPI (Recommended)
pip install gpuctl
From Source
git clone https://github.com/runwhere-ai/gpuctl.git
cd gpuctl
pip install -e .
Binary Download
# Linux
wget https://github.com/runwhere-ai/gpuctl/releases/latest/download/gpuctl-linux-amd64
chmod +x gpuctl-linux-amd64
sudo mv gpuctl-linux-amd64 /usr/local/bin/gpuctl
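After installing, a quick preflight check can confirm the prerequisites listed above. This is a small sketch with a hypothetical `preflight` helper; it only verifies the Python version and that the `kubectl` and `gpuctl` executables are on PATH, which does not by itself prove that the cluster is reachable.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # from the Prerequisites list


def preflight(tools=("kubectl", "gpuctl"), version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python %d.%d found, 3.8+ required" % version)
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```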
Show Your Support
If gpuctl helps you, please give us a ⭐️ Star!
License
MIT License © 2024 GPU Control Team