Cooling Cube

GPU cluster tail latency optimizer for DDP training and multi-GPU inference.

Reduces per-step tail latency by identifying pressure workers and computing per-worker start-time offsets. Designed for clusters of 8–64 GPUs. Guaranteed to produce no negative gains.

Install

pip install coolingcube

CLI usage

# From a timing log file
coolingcube --logs timing.json

# Inline worker times (microseconds)
coolingcube --workers '{"0": 12000, "1": 11800, "2": 15500, "3": 12500}'

# JSON output
coolingcube --logs timing.json --json

Python usage

from coolingcube import optimize

result = optimize({
    "0": 12000,
    "1": 11800,
    "2": 15500,
    "3": 12500,
})

print(f"Gain: {result['gain_pct']:.3f}% ({result['gain_us']:.1f} µs)")
print(f"Schedule: {result['best_schedule']}")

Collecting logs from PyTorch DDP

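import time
import torch
import torch.distributed as dist

# Assumes model, optimizer, batch, and num_steps are defined, the process
# group is initialized, and model(batch) returns a scalar loss.
rank = dist.get_rank()
total_us = 0.0

for step in range(num_steps):
    t0 = time.perf_counter()
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    # Stop the clock before the barrier so barrier wait time is not counted
    total_us += (time.perf_counter() - t0) * 1e6
    dist.barrier()

# Each rank only knows its own time, so gather the per-rank means everywhere
mean_us = total_us / num_steps
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, (rank, mean_us))

# After training loop, on rank 0:
if rank == 0:
    from coolingcube import optimize
    timing_logs = {str(r): t for r, t in gathered}
    result = optimize(timing_logs)
    print(f"Gain: {result['gain_pct']:.3f}%")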

Log file formats

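Cooling Cube detects the log format automatically. Accepted layouts include a flat rank-to-time mapping, a list of per-rank records, and an object with a "workers" list: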

{"0": 12345, "1": 11800, "2": 13000}
[{"rank": 0, "total_iter_time": 0.186}, {"rank": 1, "total_iter_time": 0.212}]
{"workers": [{"rank": 0, "step_time": 0.172}, ...]}
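If you want to inspect or merge logs before handing them to Cooling Cube, the three layouts above can be flattened into one rank-to-time mapping. This is a minimal sketch and not Cooling Cube's own parser: the helper name normalize is hypothetical, the field names are taken only from the examples above, and values are passed through unchanged because the examples do not state their units.

```python
import json

# Field names taken from the example layouts; other names may exist.
TIME_KEYS = ("total_iter_time", "step_time")

def normalize(log):
    """Flatten a parsed log into {rank_as_str: time}. Units are left unchanged."""
    # Layout 3: {"workers": [...]} wraps a list of per-rank records.
    if isinstance(log, dict) and "workers" in log:
        log = log["workers"]
    # Layout 2: a list of records, each with a rank and one time field.
    if isinstance(log, list):
        flat = {}
        for entry in log:
            key = next(k for k in TIME_KEYS if k in entry)
            flat[str(entry["rank"])] = float(entry[key])
        return flat
    # Layout 1: already a flat rank-to-time mapping.
    return {str(rank): float(t) for rank, t in log.items()}

flat = normalize(json.loads('[{"rank": 0, "total_iter_time": 0.186}, {"rank": 1, "total_iter_time": 0.212}]'))
print(flat)
```

A record with neither time field raises StopIteration here, which is a deliberate choice for a sketch: failing loudly is safer than guessing which field holds the step time.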

How it works

Standard DDP holds all workers at the barrier until the slowest finishes. The bottleneck is usually not the straggler itself but the pressure workers pushing it: workers with slightly elevated times that create synchronization pressure.
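The barrier behaviour can be made concrete with the worker times from the usage example. The sketch below only illustrates the arithmetic of a barrier (step time equals the slowest worker, everyone else idles for the difference). The helper name barrier_view is hypothetical and is not part of the coolingcube package.

```python
def barrier_view(worker_us):
    """Return (step time, per-worker idle time) for one barrier-synchronized step."""
    # Every worker waits at the barrier until the slowest one arrives.
    step_us = max(worker_us.values())
    # Idle time is how long each worker sits at the barrier.
    idle_us = {rank: step_us - t for rank, t in worker_us.items()}
    return step_us, idle_us

workers = {"0": 12000, "1": 11800, "2": 15500, "3": 12500}
step_us, idle_us = barrier_view(workers)
print(step_us)  # 15500
print(idle_us)  # {'0': 3500, '1': 3700, '2': 0, '3': 3000}
```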

Cooling Cube identifies those pressure workers and computes per-worker start-time offsets to reduce tail latency. The algorithm uses a Ridge surrogate model and converges in 40–80 oracle calls regardless of cluster size.
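To give a feel for what a surrogate-guided search with a fixed oracle budget can look like, here is a toy sketch. It is not Cooling Cube's implementation, which the source does not describe beyond the sentence above. Everything here is an assumption made for illustration: the function names, the 500 µs offset range, the budget of 60 calls, and especially toy_oracle, whose cost formula is invented so the example runs. In practice an oracle call would presumably be a measured training step.

```python
import random

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) w = X^T y."""
    d = len(X[0])
    # Augmented matrix [A | b] of the regularized normal equations.
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(d)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(d):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][d] / A[i][i] for i in range(d)]

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

BASE_US = [12000.0, 11800.0, 15500.0, 12500.0]  # worker times from the usage example

def toy_oracle(offsets):
    # INVENTED cost model: slowest finish time plus a made-up contention
    # term that shrinks when worker 3 is delayed.
    finish = max(t + o for t, o in zip(BASE_US, offsets))
    return finish + max(0.0, 200.0 - 0.5 * offsets[3])

def search(oracle, n_workers, budget=60, max_offset_us=500.0, seed=0):
    """Spend exactly `budget` oracle calls; return (best offsets, best time, baseline time)."""
    rng = random.Random(seed)
    baseline = [0.0] * n_workers
    X, y = [baseline + [1.0]], [oracle(baseline)]  # first call: no offsets
    best_x, best_y = baseline, y[0]
    while len(y) < budget:
        cands = [[rng.uniform(0.0, max_offset_us) for _ in range(n_workers)]
                 for _ in range(32)]
        if len(y) > n_workers:
            # Fit the surrogate on centered targets, then pick the candidate
            # it predicts to be fastest.
            mean_y = sum(y) / len(y)
            w = ridge_fit(X, [t - mean_y for t in y])
            cand = min(cands, key=lambda c: predict(w, c + [1.0]))
        else:
            cand = cands[0]  # not enough data yet: explore randomly
        t = oracle(cand)
        X.append(cand + [1.0])
        y.append(t)
        if t < best_y:
            best_x, best_y = cand, t
    return best_x, best_y, y[0]

best_offsets, best_us, baseline_us = search(toy_oracle, n_workers=len(BASE_US))
print(f"baseline {baseline_us:.1f} µs, best {best_us:.1f} µs")
```

Because the zero-offset baseline is always evaluated first and the search returns the best schedule it has seen, the result can never be slower than the baseline on the measured oracle. That is one simple way a non-negative gain could be ensured, though the source does not say whether Cooling Cube works this way.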

Typical gains: 0.2–0.9% step-time reduction on heterogeneous or PCIe-bound clusters.
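To put that range in absolute terms, the percentages can be applied to the slowest worker time from the usage example. This is a back-of-the-envelope sketch that assumes the gain percentage is measured relative to the baseline step time. The source does not define it explicitly, and the helper name gain_in_us is hypothetical.

```python
def gain_in_us(baseline_us, gain_pct):
    """Absolute time saved per step, assuming gain_pct is relative to the baseline."""
    return baseline_us * gain_pct / 100.0

baseline_us = 15500  # slowest worker in the usage example, in microseconds
low = gain_in_us(baseline_us, 0.2)
high = gain_in_us(baseline_us, 0.9)
print(f"{low:.1f} to {high:.1f} µs per step")  # 31.0 to 139.5 µs per step
```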

Free

Free for open source and research use. No account required.

https://coolingcube.cc · CoolingCubeInfo@proton.me
