GPU cluster tail latency optimizer for DDP training and inference

Cooling Cube

GPU cluster tail latency optimizer for DDP training and multi-GPU inference.

Reduces per-step tail latency by identifying pressure workers and computing start-time offsets. Works on 8–64 GPU clusters. Zero negative gains guaranteed.

Install

pip install coolingcube

CLI usage

# From a timing log file
coolingcube --logs timing.json

# Inline worker times (microseconds)
coolingcube --workers '{"0": 12000, "1": 11800, "2": 15500, "3": 12500}'

# JSON output
coolingcube --logs timing.json --json

Python usage

from coolingcube import optimize

result = optimize({
    "0": 12000,
    "1": 11800,
    "2": 15500,
    "3": 12500,
})

print(f"Gain: {result['gain_pct']:.3f}% ({result['gain_us']:.1f} µs)")
print(f"Schedule: {result['best_schedule']}")

Collecting logs from PyTorch DDP

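import time
import torch
import torch.distributed as dist

# Assumes model, batch, optimizer, and num_steps are already defined and
# that the process group has been initialized.
rank = dist.get_rank()
total_us = 0.0

for step in range(num_steps):
    t0 = time.perf_counter()
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    # Stop the clock before the barrier; timing after it would record the
    # slowest worker's time on every rank.
    total_us += (time.perf_counter() - t0) * 1e6
    dist.barrier()

# Each rank only knows its own time, so gather the per-rank averages.
local = {str(rank): total_us / num_steps}
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, local)

# After training loop, on rank 0:
if rank == 0:
    from coolingcube import optimize
    timing_logs = {k: v for d in gathered for k, v in d.items()}
    result = optimize(timing_logs)
    print(f"Gain: {result['gain_pct']:.3f}%")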

Log file formats

Cooling Cube accepts several formats automatically:

{"0": 12345, "1": 11800, "2": 13000}
[{"rank": 0, "total_iter_time": 0.186}, {"rank": 1, "total_iter_time": 0.212}]
{"workers": [{"rank": 0, "step_time": 0.172}, ...]}
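To make the three layouts concrete, here is a minimal sketch of how they could be normalized into a single rank-to-time mapping. It is not the package's own loader: `load_worker_times` is a hypothetical helper, and it passes values through unchanged because the page does not say how units are handled across formats.

```python
import json

def load_worker_times(text):
    """Parse one of the three layouts shown above into {rank: time}.

    Hypothetical helper for illustration; values are not unit-converted.
    """
    data = json.loads(text)
    # Third layout: the list of records is nested under a "workers" key.
    if isinstance(data, dict) and "workers" in data:
        data = data["workers"]
    # First layout: a flat mapping of rank to time.
    if isinstance(data, dict):
        return {str(k): float(v) for k, v in data.items()}
    out = {}
    for row in data:
        # The two list layouts use different field names for the time value.
        value = row["total_iter_time"] if "total_iter_time" in row else row["step_time"]
        out[str(row["rank"])] = float(value)
    return out

print(load_worker_times('{"0": 12345, "1": 11800, "2": 13000}'))
```

Keeping ranks as strings matches the keys used in the `optimize` example earlier on this page.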

How it works

Standard DDP holds all workers at the barrier until the slowest one finishes. The bottleneck is usually not the straggler itself but the pressure workers pushing it: workers with slightly elevated times that create synchronization pressure.
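The barrier effect can be shown with simple arithmetic on the example times from the usage section. This sketch only computes the step time and the idle time each worker spends waiting; `barrier_wait` is an illustrative name, not part of the package.

```python
# Per-worker compute times in microseconds, taken from the usage example.
times_us = {"0": 12000, "1": 11800, "2": 15500, "3": 12500}

def barrier_wait(times):
    """Return (step_time, idle time per worker) under a synchronous barrier."""
    step = max(times.values())  # everyone waits for the slowest worker
    return step, {rank: step - t for rank, t in times.items()}

step_us, idle_us = barrier_wait(times_us)
print(step_us)  # 15500
print(idle_us)  # {'0': 3500, '1': 3700, '2': 0, '3': 3000}
```

Here worker 2 sets the step time, and the other three workers idle for 3000 to 3700 µs per step.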

Cooling Cube identifies those pressure workers and computes per-worker start-time offsets to reduce tail latency. The algorithm uses a Ridge surrogate model and converges in 40–80 oracle calls regardless of cluster size.
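The page does not publish the algorithm itself, so the following is only a generic sketch of surrogate-guided search with a fixed budget of oracle calls. Everything in it is an assumption for illustration: `toy_oracle` is a made-up contention model standing in for a real step-time measurement, the ridge fit is a plain closed-form solve, and `search` is not the package's API. Because the search starts from the all-zero schedule and only ever keeps improvements, it cannot return a schedule worse than the baseline.

```python
import random

BASE_US = [12000.0, 11800.0, 15500.0, 12500.0]  # example worker times (µs)

def toy_oracle(offsets):
    """Hypothetical stand-in for measuring one step on a real cluster.

    Workers starting within 100 µs of each other pay a made-up contention
    penalty; the step ends when the last worker finishes.
    """
    finish = []
    for i, (base, off) in enumerate(zip(BASE_US, offsets)):
        crowd = sum(1 for j, o in enumerate(offsets) if j != i and abs(o - off) < 100.0)
        finish.append(off + base + 150.0 * crowd)
    return max(finish)

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(xs, ys, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(xtx, xty)

def search(oracle, n_workers, budget=60, max_offset=500.0, seed=0):
    """Spend `budget` oracle calls; return (best offsets, best time, baseline time, calls)."""
    rng = random.Random(seed)
    baseline = [0.0] * n_workers
    xs, ys = [baseline], [oracle(baseline)]
    best_x, best_y, calls = baseline, ys[0], 1
    while calls < budget:
        pool = [[rng.uniform(0.0, max_offset) for _ in range(n_workers)]
                for _ in range(50)]
        if calls < 10:
            cand = pool[0]  # warm-up: random schedules to seed the surrogate
        else:
            # Fit the surrogate on everything measured so far (with a bias
            # feature) and measure the candidate it predicts to be fastest.
            w = ridge_fit([x + [1.0] for x in xs], ys)
            cand = min(pool, key=lambda x: sum(wi * xi for wi, xi in zip(w, x + [1.0])))
        y = oracle(cand)
        calls += 1
        xs.append(cand)
        ys.append(y)
        if y < best_y:
            best_x, best_y = cand, y
    return best_x, best_y, ys[0], calls

best_x, best_y, base_y, calls = search(toy_oracle, len(BASE_US))
print(f"baseline {base_y:.0f} µs, best {best_y:.0f} µs after {calls} oracle calls")
```

The budget of 60 calls is chosen to sit inside the 40–80 range quoted above; in this sketch it is a fixed parameter, not a convergence result.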

Typical gains: 0.2–0.9% step-time reduction on heterogeneous or PCIe-bound clusters.
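As a rough sense of scale, the quoted range can be converted to microseconds for a given step time. This sketch assumes the percentage is taken relative to the baseline step time, which the page does not state explicitly, and uses the 15500 µs slowest-worker time from the usage example as that baseline.

```python
def gain_us(baseline_us, gain_pct):
    """Convert a percentage gain into microseconds saved per step."""
    return baseline_us * gain_pct / 100.0

baseline = 15500  # µs, slowest worker in the usage example
low, high = gain_us(baseline, 0.2), gain_us(baseline, 0.9)
print(f"{low:.1f} to {high:.1f} µs per step")  # 31.0 to 139.5 µs per step
```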

Free

Free for open source and research use. No account required.

https://coolingcube.cc · CoolingCubeInfo@proton.me
