Keep GPU is a simple CLI app that keeps your GPUs running


Keep GPU


Keep GPU keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device—without running a full training job.

Why it exists

On many clusters, idle GPUs are reaped or silently shared after a short grace period. The cost of losing your reservation (or discovering another job has taken your card) can dwarf the cost of a tiny keep-alive loop. KeepGPU is a minimal, auditable guardrail:

  • Predictable – Single-purpose controller with explicit resource knobs (VRAM size, interval, utilization backoff).
  • Polite – Uses NVML to read utilization and backs off when the GPU is busy.
  • Portable – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
  • Observable – Structured logging and optional file logs for auditing what kept the GPU alive.
  • Power-aware – Issues periodic elementwise ops instead of heavy matmul floods, presenting “busy” utilization while keeping power draw and thermals lower (see CudaGPUController._run_mat_batch for the loop).
  • NVML-backed – GPU telemetry comes from nvidia-ml-py (the pynvml module), with optional rocm-smi support when you install the rocm extra.
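The polite backoff described above boils down to a poll–decide–sleep loop. The sketch below shows that shape in plain Python; the function names and parameters are illustrative stand-ins (the real controller reads utilization through NVML and issues CUDA work), not KeepGPU's API.

```python
import time

def keep_alive_loop(read_utilization, do_burst, *,
                    interval=60, busy_threshold=25, iterations=None):
    """Illustrative keep-alive loop: poll utilization, burst only when
    the GPU looks idle, then sleep.  read_utilization and do_burst stand
    in for the NVML query and the lightweight CUDA work respectively."""
    done = 0
    while iterations is None or done < iterations:
        if read_utilization() <= busy_threshold:
            do_burst()          # tiny work so the scheduler sees activity
        time.sleep(interval)    # stay quiet between bursts
        done += 1
```

Because the loop checks utilization before every burst, a GPU that a real job is using simply never receives extra work.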

Quick start (CLI)

pip install keep-gpu

# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

Platform installs at a glance

  • CUDA (example: cu121)
    pip install --index-url https://download.pytorch.org/whl/cu121 torch
    pip install keep-gpu
    
  • ROCm (example: rocm6.1)
    pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
    pip install "keep-gpu[rocm]"
    
  • CPU-only
    pip install torch
    pip install keep-gpu
    

Flags that matter:

  • --vram (1GiB, 750MB, or bytes): how much memory to pin.
  • --interval (seconds): sleep between keep-alive bursts.
  • --busy-threshold: skip work when NVML reports higher utilization.
  • --gpu-ids: target a subset; otherwise all visible GPUs are guarded.
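The --vram flag accepts human-readable sizes as well as raw bytes. A minimal parser for such strings might look like the sketch below (a hypothetical helper; KeepGPU's actual parsing may differ in accepted units or edge cases):

```python
import re

# Decimal (MB, GB) and binary (MiB, GiB) units, plus bare bytes.
_UNITS = {"": 1, "b": 1,
          "kb": 10**3, "mb": 10**6, "gb": 10**9,
          "kib": 2**10, "mib": 2**20, "gib": 2**30}

def parse_size(text: str) -> int:
    """Turn '1GiB', '750MB', or '4096' into a byte count."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*", text.lower())
    if not m or m.group(2) not in _UNITS:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])

print(parse_size("1GiB"))   # 1073741824
print(parse_size("750MB"))  # 750000000
print(parse_size("4096"))   # 4096
```

Note the GiB/GB distinction: "1GiB" pins about 7% more memory than "1GB".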

Embed in Python

from keep_gpu.single_gpu_controller.cuda_gpu_controller import CudaGPUController

with CudaGPUController(rank=0, interval=0.5, vram_to_keep="1GiB", busy_threshold=20):
    preprocess_dataset()   # GPU is marked busy while you run CPU-heavy code

train_model()              # GPU freed after exiting the context
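The context-manager pattern above is what guarantees the GPU is released the moment the block exits, even on exceptions. A toy stand-in (pure Python, no CUDA; class and attribute names are illustrative) shows the shape: a background thread does the keep-alive work while the with-body runs, and is stopped cleanly on exit.

```python
import threading

class KeepAliveSketch:
    """Toy stand-in for CudaGPUController: runs a background loop while
    the `with` body executes, stops cleanly on exit."""

    def __init__(self, interval=0.5, work=lambda: None):
        self._interval = interval
        self._work = work
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self._work()                      # where the CUDA burst would go
            self._stop.wait(self._interval)   # sleep, but wake fast on exit

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False
```

Using an Event rather than a bare sleep means __exit__ never waits out a full interval before the guard stops.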

Need multiple devices?

from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy_threshold=30):
    run_pipeline_stage()
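A global guard over several devices can be pictured as one per-device controller per GPU id, entered together and released in reverse order; contextlib.ExitStack is the standard way to hold a variable number of context managers. The sketch below uses a hypothetical make_controller factory, not KeepGPU's internals.

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def make_controller(gpu_id, log):
    # Stand-in for a per-device controller; real code would allocate
    # VRAM and start its keep-alive loop here.
    log.append(f"guard gpu {gpu_id}")
    try:
        yield gpu_id
    finally:
        log.append(f"release gpu {gpu_id}")

def guard_all(gpu_ids, log):
    with ExitStack() as stack:
        for gid in gpu_ids:
            stack.enter_context(make_controller(gid, log))
        log.append("run stage")

log = []
guard_all([0, 1], log)
print(log)
# ['guard gpu 0', 'guard gpu 1', 'run stage', 'release gpu 1', 'release gpu 0']
```

ExitStack unwinds in LIFO order, so every guarded GPU is released even if one controller fails mid-setup.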

What you get

  • Battle-tested keep-alive loop built on PyTorch.
  • NVML-based utilization monitoring (via nvidia-ml-py) to avoid hogging busy GPUs; optional ROCm SMI support via pip install "keep-gpu[rocm]".
  • CLI + API parity: same controllers power both code paths.
  • Continuous docs + CI: mkdocs + mkdocstrings build in CI to keep guidance up to date.

For developers

  • Install dev extras: pip install -e ".[dev]" (add .[rocm] if you need ROCm SMI).
  • Fast CUDA checks: pytest tests/cuda_controller tests/global_controller tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py
  • ROCm-only tests carry @pytest.mark.rocm; run with pytest --run-rocm tests/rocm_controller.
  • Markers: rocm (needs ROCm stack) and large_memory (opt-in locally).

Contributing

Contributions are welcome—especially around ROCm support, platform fallbacks, and scheduler-specific recipes. Open an issue or PR if you hit edge cases on your cluster.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.


📖 Citation

If you find KeepGPU useful in your research or work, please cite it as:

@software{Wangmerlyn_KeepGPU_2025,
  author       = {Wang, Siyuan and Shi, Yaorui and Liu, Yida and Yin, Yuqi},
  title        = {KeepGPU: a simple CLI app that keeps your GPUs running},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17129114},
  url          = {https://github.com/Wangmerlyn/KeepGPU},
  note         = {GitHub repository},
  keywords     = {ai, hpc, gpu, cluster, cuda, torch, debug}
}

