gpuci
Test CUDA kernels across multiple GPUs via SSH
Installation • Quick Start • Providers • GitHub Actions • Documentation
Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ $ gpuci test matmul.cu │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ gpuci results: matmul.cu │ │
│ ├─────────────┬───────────────────┬────────┬──────────┬───────────────┤ │
│ │ Target │ GPU │ Status │ Median │ Compile │ │
│ ├─────────────┼───────────────────┼────────┼──────────┼───────────────┤ │
│ │ h100-cloud │ NVIDIA H100 │ PASS │ 0.42ms │ 2.1s │ │
│ │ a100-server │ NVIDIA A100 │ PASS │ 0.61ms │ 2.3s │ │
│ │ rtx-5070ti │ RTX 5070 Ti │ PASS │ 1.03ms │ 3.0s │ │
│ └─────────────┴───────────────────┴────────┴──────────┴───────────────┘ │
│ │
│ All 3 targets passed │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Features
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Multi-GPU │ │ Accurate │ │ 6 Cloud │
│ Testing │ │ Timing │ │ Providers │
├──────────────────┤ ├──────────────────┤ ├──────────────────┤
│ Run kernels on │ │ CUDA events for │ │ SSH, RunPod, │
│ multiple GPUs │ │ microsecond │ │ Lambda, Vast.ai │
│ simultaneously │ │ precision │ │ FluidStack, Brev │
└──────────────────┘ └──────────────────┘ └──────────────────┘
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Easy Setup │ │ GitHub Actions │ │ CI/CD Ready │
├──────────────────┤ ├──────────────────┤ ├──────────────────┤
│ Simple YAML │ │ First-class CI │ │ Exit codes for │
│ configuration │ │ integration with │ │ automation and │
│ + init wizard │ │ PR comments │ │ pipelines │
└──────────────────┘ └──────────────────┘ └──────────────────┘
How It Works
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │ │ Compile │ │ Execute │ │ Report │
│ Kernel │────▶│ on GPU │────▶│ Timed │────▶│ Results │
│ │ │ Machine │ │ Runs │ │ │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ SFTP │ │ nvcc │ │ CUDA │ │ Rich │
│ Upload │ │ -arch │ │ Events │ │ Table │
└─────────┘ │ auto │ │ Timing │ └─────────┘
└─────────┘ └─────────┘
The -arch flag passed to nvcc is auto-detected via nvidia-smi, and timed execution consists of warmup runs followed by benchmark runs with summary statistics.
Architecture
Execution Flow
Your Machine GPU Target(s)
┌──────────────────┐ ┌──────────────────┐
│ │ SSH/SFTP │ │
│ gpuci test │ ──────────────▶ │ /tmp/gpuci/ │
│ kernel.cu │ │ kernel.cu │
│ │ │ │
│ ┌────────────┐ │ │ ┌────────────┐ │
│ │ Wrap with │ │ │ │ nvcc │ │
│ │ timing │ │ │ │ compile │ │
│ │ macros │ │ │ └────────────┘ │
│ └────────────┘ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌────────────┐ │
│ │ │ │ Execute │ │
│ │ ◀───────────── │ │ 3 warmup │ │
│ Parse results │ stdout │ │ 10 timed │ │
│ GPUCI_* │ │ └────────────┘ │
│ │ │ │
└──────────────────┘ └──────────────────┘
│
▼
┌──────────────────┐
│ Display Table │
│ with timing │
│ statistics │
└──────────────────┘
Installation
# Basic installation
pip install gpuci

# With RunPod support (quotes keep zsh from globbing the brackets)
pip install "gpuci[runpod]"

# With Vast.ai support
pip install "gpuci[vastai]"

# All cloud providers
pip install "gpuci[cloud]"
From Source
git clone https://github.com/rightnow-ai/gpuci
cd gpuci
pip install -e .
Quick Start
1. Initialize Configuration
gpuci init
This creates gpuci.yml with your GPU targets.
2. Configure Your Targets
# gpuci.yml
targets:
  - name: my-gpu-server
    provider: ssh
    host: gpu.mycompany.com
    user: ubuntu
    gpu: RTX 4090

warmup_runs: 3
benchmark_runs: 10
timeout: 120
3. Write a CUDA Kernel
// kernel.cu
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
4. Run Tests
gpuci test kernel.cu
Providers
gpuci supports 6 GPU providers:
┌─────────────────────────────────────────────────────────────────────────────┐
│ GPU PROVIDERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ SSH │ │ RunPod │ │ Lambda │ │ Vast.ai │ │
│ │ ─────── │ │ ──────── │ │ ──────── │ │ ───────── │ │
│ │ Your own │ │ On-demand │ │ High-perf │ │ GPU market │ │
│ │ machines │ │ cloud GPUs │ │ cloud GPUs │ │ place │ │
│ │ │ │ │ │ │ │ │ │
│ │ FREE │ │ $0.20-4/hr │ │ $1.10-3/hr │ │ $0.10-2/hr │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ FluidStack │ │ Brev │ │
│ │ ────────── │ │ ──────── │ │
│ │ Enterprise │ │ NVIDIA │ │
│ │ GPUs │ │ managed │ │
│ │ │ │ │ │
│ │ Custom │ │ $0.50-4/hr │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Provider Comparison
| Provider | Type | Setup | Best For |
|---|---|---|---|
| SSH | Your machines | SSH key | Dedicated hardware, no cost |
| RunPod | Cloud | API key | Quick testing, simple pricing |
| Lambda Labs | Cloud | API key | Production workloads |
| Vast.ai | Marketplace | API key | Cost optimization |
| FluidStack | Enterprise | API key | Large scale deployments |
| Brev | Cloud | CLI login | Development & testing |
Configuration Examples
SSH (Your Own Machine)
- name: my-server
  provider: ssh
  host: gpu.example.com
  user: ubuntu
  port: 22
  key: ~/.ssh/id_rsa
  gpu: RTX 4090
Requirements:
- NVIDIA GPU with drivers
- CUDA toolkit (nvcc)
- SSH access
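Before adding a machine as a target, a quick manual check over SSH (independent of gpuci; the hostname and user are the placeholders from the example above) can confirm these requirements:

# Verify driver + GPU visibility and that nvcc is on PATH for the SSH user
ssh ubuntu@gpu.example.com \
  'nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; nvcc --version'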
RunPod
- name: runpod-a100
  provider: runpod
  gpu: "NVIDIA A100 80GB PCIe"
  gpu_count: 1
  image: runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04
export RUNPOD_API_KEY=your_key
pip install gpuci[runpod]
Lambda Labs
- name: lambda-h100
  provider: lambdalabs
  gpu: gpu_1x_h100_pcie
  region: us-west-1
export LAMBDA_API_KEY=your_key
Vast.ai
- name: vastai-4090
  provider: vastai
  gpu: RTX_4090
  max_price: 0.50
  min_gpu_ram: 24
export VASTAI_API_KEY=your_key
pip install gpuci[vastai]
FluidStack
- name: fluidstack-h100
  provider: fluidstack
  gpu: H100_SXM_80GB
export FLUIDSTACK_API_KEY=your_key
Brev (NVIDIA)
- name: brev-h100
  provider: brev
  gpu: H100
curl -fsSL https://raw.githubusercontent.com/brevdev/brev-cli/main/bin/install-latest.sh | bash
brev login
Full documentation: docs/providers.md
Kernel Formats
gpuci supports two kernel formats:
Format 1: Full Kernel (with main)
Use GPUCI macros for timing control:
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] *= 2.0f;
}

int main() {
    // Setup...

    // Warmup (not timed)
    GPUCI_WARMUP_START()
    myKernel<<<blocks, threads>>>(d_data, N);
    GPUCI_WARMUP_END()

    // Benchmark (timed with CUDA events)
    GPUCI_BENCHMARK_START()
    myKernel<<<blocks, threads>>>(d_data, N);
    GPUCI_BENCHMARK_END()

    gpuci_print_results();
    return 0;
}
Format 2: Kernel Only (auto-harness)
Just provide the kernel; gpuci generates the test harness (a hand-written equivalent is sketched after the example):
__global__ void myKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] *= 2.0f;
}
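For intuition, here is a hand-written harness doing what the generated one must: allocate device memory, run warmup launches, then time benchmark launches with CUDA events. The problem size, launch configuration, and printed format below are illustrative assumptions, not gpuci's actual generated code.

// Illustrative harness only - gpuci's generated code may differ.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] *= 2.0f;
}

int main() {
    const int N = 1 << 20;                      // assumed problem size
    float* d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;

    // Warmup runs (not timed) - matches the default warmup_runs: 3
    for (int i = 0; i < 3; ++i)
        myKernel<<<blocks, threads>>>(d_data, N);
    cudaDeviceSynchronize();

    // Benchmark runs timed with CUDA events - default benchmark_runs: 10
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < 10; ++i) {
        cudaEventRecord(start);
        myKernel<<<blocks, threads>>>(d_data, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.4f ms\n", i + 1, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}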
CLI Reference
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLI COMMANDS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ gpuci init Create gpuci.yml configuration │
│ gpuci test <kernel.cu> Run kernel tests │
│ gpuci targets List configured targets │
│ gpuci check Verify target connectivity │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ TEST OPTIONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ --target <name> Test specific target only │
│ --config <path> Use specific config file │
│ --runs <n> Number of benchmark runs (default: 10) │
│ --warmup <n> Number of warmup runs (default: 3) │
│ --verbose Show detailed output │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Examples
# Test all targets
gpuci test kernel.cu
# Test specific target
gpuci test kernel.cu --target h100-cloud
# Custom benchmark settings
gpuci test kernel.cu --runs 20 --warmup 5
# Verbose output
gpuci test kernel.cu --verbose
# Use specific config
gpuci test kernel.cu --config /path/to/gpuci.yml
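Because gpuci reports results through its exit code (see CI/CD Ready above), it composes with ordinary shell scripting. A minimal sketch, assuming the usual convention that a failing test exits nonzero:

#!/usr/bin/env bash
# Gate a release step on kernel tests passing on the H100 target.
# Assumes gpuci exits nonzero when any target fails (standard CLI convention).
set -euo pipefail

if gpuci test kernel.cu --target h100-cloud; then
    echo "kernel passed on H100; continuing"
else
    echo "kernel failed; aborting" >&2
    exit 1
fi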
GitHub Actions
Quick Setup
# .github/workflows/gpu-test.yml
name: GPU Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: rightnow-ai/gpuci@v1
        with:
          kernel: 'kernels/*.cu'
        env:
          RUNPOD_API_KEY: ${{ secrets.RUNPOD_API_KEY }}
Workflow Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ GitHub Actions Workflow │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ on: [push, pull_request] │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Checkout │───▶│ Setup │───▶│ Run gpuci │───▶│ Comment │ │
│ │ Code │ │ Python │ │ test │ │ on PR │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Cloud GPUs (RunPod/Lambda/etc) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │ H100│ │ A100│ │ 4090│ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │
│ └─────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Full guide: docs/github-actions.md
Configuration Reference
# gpuci.yml - Full Configuration Reference

# ┌─────────────────────────────────────────────────────────────────────────┐
# │                                 TARGETS                                 │
# └─────────────────────────────────────────────────────────────────────────┘
targets:
  # SSH Target
  - name: my-server              # Unique identifier
    provider: ssh                # Provider type
    host: gpu.example.com        # Hostname or IP
    user: ubuntu                 # SSH username
    port: 22                     # SSH port (optional)
    key: ~/.ssh/id_rsa           # SSH key path (optional)
    gpu: RTX 4090                # GPU name (for display)

  # Cloud Target (RunPod example)
  - name: runpod-a100
    provider: runpod
    gpu: "NVIDIA A100 80GB PCIe"
    gpu_count: 1
    image: runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

# ┌─────────────────────────────────────────────────────────────────────────┐
# │                             TIMING SETTINGS                             │
# └─────────────────────────────────────────────────────────────────────────┘
warmup_runs: 3            # Warmup iterations (not timed)
benchmark_runs: 10        # Timed iterations
timeout: 120              # Seconds per target

# ┌─────────────────────────────────────────────────────────────────────────┐
# │                            COMPILATION FLAGS                            │
# └─────────────────────────────────────────────────────────────────────────┘
nvcc_flags:
  - "-O3"                 # Optimization level
  - "-lineinfo"           # Debug info (optional)
Project Structure
gpuci/
├── gpuci/                    # Main package
│   ├── __init__.py           # Version
│   ├── __main__.py           # python -m gpuci
│   ├── cli.py                # Click CLI commands
│   ├── config.py             # YAML configuration
│   ├── runner.py             # Parallel execution
│   ├── reporter.py           # Rich table output
│   ├── timing.py             # CUDA timing wrapper
│   ├── exceptions.py         # Error hierarchy
│   └── providers/            # GPU providers
│       ├── base.py           # Abstract base class
│       ├── ssh.py            # Direct SSH
│       ├── runpod.py         # RunPod SDK
│       ├── lambdalabs.py     # Lambda REST API
│       ├── vastai.py         # Vast.ai SDK
│       ├── fluidstack.py     # FluidStack REST API
│       └── brev.py           # Brev CLI
├── examples/                 # Example kernels
│   ├── vector_add.cu
│   ├── matmul.cu
│   └── simple_kernel.cu
├── docs/                     # Documentation
│   ├── providers.md
│   └── github-actions.md
├── .github/workflows/        # CI/CD workflows
├── action.yml                # GitHub Action
├── pyproject.toml            # Package config
├── LICENSE                   # PolyForm Noncommercial License
├── CHANGELOG.md              # Version history
└── README.md                 # This file
Timing Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ CUDA Event Timing │
└─────────────────────────────────────────────────────────────────────────────┘
CPU Timeline
────────────────────────────────────────────────────────────────────────▶
cudaEventRecord(start) cudaEventRecord(stop)
│ │
▼ ▼
┌──────┴──────────────────────────┴──────┐
│ GPU Execution │
│ ┌────────────────────────────┐ │
│ │ Kernel Execution │ │
│ │ (what we measure) │ │
│ └────────────────────────────┘ │
└────────────────────────────────────────┘
│ │
└────────────┬─────────────┘
│
▼
cudaEventElapsedTime()
─────────────────────
Microsecond precision
GPU-side measurement
Independent of CPU load
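As a self-contained illustration of this pattern (plain CUDA runtime API, nothing gpuci-specific; compile with nvcc demo.cu -o demo):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void doubleIt(float* x) { x[0] *= 2.0f; }

int main() {
    float* d;
    cudaMalloc(&d, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);            // recorded on the GPU stream
    doubleIt<<<1, 1>>>(d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // block host until stop has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time measured on-GPU
    printf("kernel time: %.6f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}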
Benchmark Process
 Warmup 1..3 (not timed)         Timed 1..10 (CUDA event timing)

 ┌───┐ ┌───┐ ┌───┐       ┌───┐ ┌───┐ ┌───┐ ┌───┐        ┌───┐
 │ W │ │ W │ │ W │       │ T │ │ T │ │ T │ │ T │  ...   │ T │
 └───┘ └───┘ └───┘       └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘        └─┬─┘
                           └─────┴─────┴─────┴─── ... ────┘
                                        │
                                        ▼
┌─────────────────┐
│ Statistics │
│ ───────────── │
│ Median: 0.42ms │
│ Mean: 0.43ms │
│ Min: 0.40ms │
│ Max: 0.45ms │
└─────────────────┘
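Computing those statistics from the per-run timings is plain host-side code; a minimal sketch, with illustrative sample values echoing the box above:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Illustrative per-run timings in milliseconds (10 benchmark runs)
    std::vector<float> ms = {0.42f, 0.40f, 0.45f, 0.43f, 0.41f,
                             0.42f, 0.44f, 0.42f, 0.43f, 0.42f};
    std::sort(ms.begin(), ms.end());
    const size_t n = ms.size();
    float median = (n % 2) ? ms[n / 2]
                           : 0.5f * (ms[n / 2 - 1] + ms[n / 2]);
    float mean = std::accumulate(ms.begin(), ms.end(), 0.0f) / n;
    printf("Median: %.2fms  Mean: %.2fms  Min: %.2fms  Max: %.2fms\n",
           median, mean, ms.front(), ms.back());
    return 0;
}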
Links
| Resource | Link |
|---|---|
| GitHub Repository | github.com/rightnow-ai/gpuci |
| Issue Tracker | github.com/rightnow-ai/gpuci/issues |
| Changelog | CHANGELOG.md |
| Provider Docs | docs/providers.md |
| GitHub Actions | docs/github-actions.md |
Provider Documentation
| Provider | Website | API Docs |
|---|---|---|
| RunPod | runpod.io | docs.runpod.io |
| Lambda Labs | lambdalabs.com | docs.lambda.ai |
| Vast.ai | vast.ai | docs.vast.ai |
| FluidStack | fluidstack.io | docs.fluidstack.io |
| NVIDIA Brev | brev.dev | docs.nvidia.com/brev |
License
PolyForm Noncommercial License 1.0.0 - Free for non-commercial use.
| Use Case | Allowed |
|---|---|
| Personal projects | ✅ |
| Academic/Research | ✅ |
| Education/Learning | ✅ |
| Non-profits | ✅ |
| Commercial use | ❌ (contact for license) |
See LICENSE for full terms.
Part of the RightNow AI ecosystem.
For the complete GPU development environment, see RightNow Code Editor.