GPU cluster tail latency optimizer for DDP training and inference
Cooling Cube
GPU cluster tail latency optimizer for DDP training and multi-GPU inference.
Reduces per-step tail latency by identifying pressure workers and computing per-worker start-time offsets. Designed for clusters of 8–64 GPUs. Gains are guaranteed to be non-negative.
Install
pip install coolingcube
CLI usage
# From a timing log file
coolingcube --logs timing.json
# Inline worker times (microseconds)
coolingcube --workers '{"0": 12000, "1": 11800, "2": 15500, "3": 12500}'
# JSON output
coolingcube --logs timing.json --json
Python usage
from coolingcube import optimize

# Keys are worker ranks, values are step times in microseconds
result = optimize({
    "0": 12000,
    "1": 11800,
    "2": 15500,
    "3": 12500,
})
print(f"Gain: {result['gain_pct']:.3f}% ({result['gain_us']:.1f} µs)")
print(f"Schedule: {result['best_schedule']}")
Collecting logs from PyTorch DDP
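import time
import torch
import torch.distributed as dist

# model, batch, optimizer and num_steps come from your training script
rank = dist.get_rank()
total_us = 0.0
for step in range(num_steps):
    t0 = time.perf_counter()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    total_us += (time.perf_counter() - t0) * 1e6  # stop the clock before the barrier
    dist.barrier()

# Average this rank's step time, then share it so every rank sees all values
mean_us = total_us / num_steps
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, mean_us)
timing_logs = {str(r): t for r, t in enumerate(gathered)}

# After the training loop, on rank 0:
if rank == 0:
    from coolingcube import optimize
    result = optimize(timing_logs)
    print(f"Gain: {result['gain_pct']:.3f}%")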
Log file formats
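Cooling Cube detects the log format automatically. Accepted shapes include a flat rank-to-time mapping, a list of per-rank records, and an object with a "workers" list:
{"0": 12345, "1": 11800, "2": 13000}
[{"rank": 0, "total_iter_time": 0.186}, {"rank": 1, "total_iter_time": 0.212}]
{"workers": [{"rank": 0, "step_time": 0.172}, ...]}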
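A minimal sketch of producing a log file in the first (flat mapping) shape. The helper name `write_timing_log` is hypothetical and not part of coolingcube, and the values are assumed to be microseconds, as in the inline `--workers` example.

```python
import json

def write_timing_log(mean_step_us, path="timing.json"):
    """Write per-rank mean step times as a flat {"rank": time} JSON mapping.

    Hypothetical helper, not part of coolingcube. Values are assumed to be
    microseconds, matching the inline --workers example.
    """
    # JSON object keys must be strings, so ranks are stringified
    payload = {str(rank): float(t) for rank, t in mean_step_us.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    return payload
```

A file written this way can then be passed to `coolingcube --logs timing.json`.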
How it works
Standard DDP holds all workers at the barrier until the slowest finishes, so every step takes as long as the slowest worker. The bottleneck is usually not the straggler itself but the pressure workers pushing it: workers with slightly elevated times that create synchronization pressure.
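To make the barrier effect concrete, the sketch below computes the synchronized step time and each worker's idle wait for the example timings used earlier. It only illustrates the barrier arithmetic; the function name `barrier_wait` is hypothetical and this is not Cooling Cube's detection logic.

```python
def barrier_wait(worker_times_us):
    """Return the synchronized step time and each worker's idle time at the barrier."""
    # With a barrier, the step lasts as long as the slowest worker
    step_time = max(worker_times_us.values())
    # Every other worker idles for the difference
    waits = {rank: step_time - t for rank, t in worker_times_us.items()}
    return step_time, waits

times = {"0": 12000, "1": 11800, "2": 15500, "3": 12500}
step_time, waits = barrier_wait(times)
print(step_time)  # 15500
print(waits)      # {'0': 3500, '1': 3700, '2': 0, '3': 3000}
```

Here worker 2 sets the step time, and the other three workers each idle for 3000–3700 µs per step.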
Cooling Cube identifies those pressure workers and computes per-worker start-time offsets to reduce tail latency. The algorithm uses a Ridge surrogate model and converges in 40–80 oracle calls regardless of cluster size.
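The package's internals are not shown here, so the following is only a generic sketch of surrogate-guided search with a ridge model, not Cooling Cube's implementation. Everything in it is an assumption made for illustration: the quadratic feature map, the random candidate sampling, the 100 µs offset range, and the `toy_oracle` that stands in for timing a real step. Starting from the zero-offset baseline as the incumbent is one simple way to ensure the result is never worse than doing nothing.

```python
import random

MAX_OFFSET_US = 100.0  # assumed search range for start-time offsets

def features(offsets):
    """Map offsets to [1, x_i..., x_i^2...] so a linear model can fit curvature."""
    scaled = [v / MAX_OFFSET_US for v in offsets]
    return [1.0] + scaled + [v * v for v in scaled]

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: solve (X^T X + alpha*I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def toy_oracle(offsets):
    """Hypothetical stand-in for timing one step under the given offsets (µs)."""
    target = [0.0, 0.0, 0.0, 50.0]  # made-up optimum, for illustration only
    return 15500.0 + 0.01 * sum((o - t) ** 2 for o, t in zip(offsets, target))

def surrogate_search(oracle, n_workers, budget=40, seed=0):
    """Spend `budget` oracle calls; return the best offsets and step time seen."""
    rng = random.Random(seed)
    baseline = [0.0] * n_workers
    best_x, best_y = baseline, oracle(baseline)  # incumbent: no offsets
    X, y = [features(baseline)], [best_y]
    for _ in range(budget - 1):
        candidates = [[rng.uniform(0.0, MAX_OFFSET_US) for _ in range(n_workers)]
                      for _ in range(32)]
        if len(y) >= 5:
            # Fit the surrogate and query the candidate it predicts to be fastest
            w = ridge_fit(X, y)
            x = min(candidates,
                    key=lambda c: sum(wi * fi for wi, fi in zip(w, features(c))))
        else:
            x = candidates[0]  # too little data yet: explore at random
        val = oracle(x)
        X.append(features(x))
        y.append(val)
        if val < best_y:
            best_x, best_y = x, val
    return best_x, best_y
```

Calling `surrogate_search(toy_oracle, n_workers=4)` uses exactly 40 oracle calls and returns a step time no higher than the zero-offset baseline, because the baseline is only replaced when a measured candidate beats it.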
Typical gains: 0.2–0.9% step-time reduction on heterogeneous or PCIe-bound clusters.
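As a rough sense of scale, the sketch below converts the quoted percentage range into microseconds for the 15,500 µs step from the earlier example. It assumes the gain is expressed as a percentage of the baseline step time; the function name `gain_in_us` is hypothetical.

```python
def gain_in_us(step_time_us, gain_pct):
    """Convert a percentage step-time reduction into microseconds saved per step."""
    return step_time_us * gain_pct / 100.0

baseline_us = 15500  # slowest worker in the earlier example
low = gain_in_us(baseline_us, 0.2)
high = gain_in_us(baseline_us, 0.9)
print(f"{low:.1f} to {high:.1f} µs per step")
```

For this baseline the range works out to roughly 31 to 139.5 µs saved per step.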
Free
Free for open source and research use. No account required.