# Keep GPU

Keep GPU is a simple CLI app that keeps your GPUs running.
Keep GPU keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device—without running a full training job.
- 🧾 License: MIT
- 📚 Docs: https://keepgpu.readthedocs.io
## Why it exists
On many clusters, idle GPUs are reaped or silently shared after a short grace period. The cost of losing your reservation (or discovering another job has taken your card) can dwarf the cost of a tiny keep-alive loop. KeepGPU is a minimal, auditable guardrail:
- Predictable – Single-purpose controller with explicit resource knobs (VRAM size, interval, utilization backoff).
- Polite – Uses NVML to read utilization and backs off when the GPU is busy.
- Portable – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
- Observable – Structured logging and optional file logs for auditing what kept the GPU alive.
- Power-aware – Uses intervalled elementwise ops instead of heavy matmul floods to present "busy" utilization while keeping power and thermals lower (see `CudaGPUController._run_relu_batch` for the loop).
- NVML-backed – GPU telemetry comes from `nvidia-ml-py` (the `pynvml` module), with optional `rocm-smi` support when you install the `rocm` extra.
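The polite, power-aware loop described above can be sketched roughly as follows. This is a minimal illustration, not KeepGPU's actual implementation: `keep_alive` and `should_issue_work` are hypothetical names, and the NVML calls come from the `nvidia-ml-py` package.

```python
import time


def should_issue_work(gpu_util_percent: int, busy_threshold: int) -> bool:
    """Back off when the GPU is already busier than the threshold."""
    return gpu_util_percent < busy_threshold


def keep_alive(gpu_id: int = 0, interval: float = 60.0, busy_threshold: int = 25) -> None:
    """Hypothetical keep-alive loop: read utilization via NVML, then issue a
    light elementwise op only when the device looks idle."""
    import pynvml  # provided by the nvidia-ml-py package
    import torch

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    buf = torch.empty(1 << 20, device=f"cuda:{gpu_id}")  # small VRAM allocation
    try:
        while True:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            if should_issue_work(util, busy_threshold):
                torch.relu_(buf)  # cheap elementwise op, not a matmul flood
                torch.cuda.synchronize(gpu_id)
            time.sleep(interval)
    finally:
        pynvml.nvmlShutdown()
```

The elementwise op registers as activity with the scheduler while drawing far less power than a sustained matmul would.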
## Quick start (CLI)

```bash
pip install keep-gpu

# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60

# Non-blocking mode for agent workflows (auto-starts local service)
keep-gpu start --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
keep-gpu status
keep-gpu stop --all
keep-gpu service-stop
```
Open the dashboard while service mode is running: `http://127.0.0.1:8765/`
## Platform installs at a glance

- CUDA (example: cu121)

  ```bash
  pip install --index-url https://download.pytorch.org/whl/cu121 torch
  pip install keep-gpu
  ```

- ROCm (example: rocm6.1)

  ```bash
  pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
  pip install "keep-gpu[rocm]"
  ```

- CPU-only

  ```bash
  pip install torch
  pip install keep-gpu
  ```
Flags that matter:

- Blocking mode knobs:
  - `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
  - `--interval` (seconds): sleep between keep-alive bursts.
  - `--busy-threshold`: skip work when NVML reports higher utilization.
  - `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
- Service mode commands:
  - `keep-gpu serve`: run the local service (HTTP + dashboard).
  - `keep-gpu start`: create a keep session and return immediately.
  - `keep-gpu status`: inspect active sessions.
  - `keep-gpu stop --job-id <id>` or `keep-gpu stop --all`: release sessions.
  - `keep-gpu service-stop`: stop the auto-started local daemon.
  - `keep-gpu list-gpus`: fetch telemetry from the local service.
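As an aside, the `--vram` strings above mix binary (`1GiB`) and decimal (`750MB`) units. A parser for that format might look like the sketch below; `parse_vram` is a hypothetical helper for illustration, not the package's actual code.

```python
import re

# Binary (KiB/MiB/GiB) vs decimal (KB/MB/GB) unit multipliers; bare numbers are bytes.
_UNITS = {
    "kib": 1 << 10, "mib": 1 << 20, "gib": 1 << 30,
    "kb": 10**3, "mb": 10**6, "gb": 10**9,
    "b": 1, "": 1,
}


def parse_vram(spec: str) -> int:
    """Parse strings like '1GiB', '750MB', or plain bytes into a byte count."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", spec)
    if not m or m.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized VRAM size: {spec!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2).lower()])
```

For example, `parse_vram("1GiB")` yields `1073741824` bytes while `parse_vram("1GB")` yields `1000000000`, which matters when a scheduler accounts VRAM in one convention or the other.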
## Embed in Python

```python
from keep_gpu.single_gpu_controller.cuda_gpu_controller import CudaGPUController

with CudaGPUController(rank=0, interval=0.5, vram_to_keep="1GiB", busy_threshold=20):
    preprocess_dataset()  # GPU is marked busy while you run CPU-heavy code

train_model()  # GPU freed after exiting the context
```
Need multiple devices?

```python
from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy_threshold=30):
    run_pipeline_stage()
```
## What you get

- Battle-tested keep-alive loop built on PyTorch.
- NVML-based utilization monitoring (by way of `nvidia-ml-py`) to avoid hogging busy GPUs; optional ROCm SMI support by way of `pip install "keep-gpu[rocm]"`.
- CLI + API parity: same controllers power both code paths.
- Continuous docs + CI: mkdocs + mkdocstrings build in CI to keep guidance up to date.
## For developers

- Install dev extras: `pip install -e ".[dev]"` (add `.[rocm]` if you need ROCm SMI).
- Fast CUDA checks: `pytest tests/cuda_controller tests/global_controller tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py`
- ROCm-only tests carry `@pytest.mark.rocm`; run with `pytest --run-rocm tests/rocm_controller`.
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).
## MCP and service API

- Start a simple JSON-RPC server on stdin/stdout (default):

  ```bash
  keep-gpu-mcp-server
  ```

- Or expose it over HTTP (JSON-RPC + REST + dashboard):

  ```bash
  keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
  ```

- JSON-RPC request example:

  ```json
  {"id": 1, "method": "start_keep", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}
  ```

- REST examples:

  ```bash
  curl http://127.0.0.1:8765/health
  curl http://127.0.0.1:8765/api/sessions
  ```

- Methods: `start_keep`, `stop_keep` (optional `job_id`, default stops all), `status` (optional `job_id`), `list_gpus` (basic info).
- Dashboard: `http://127.0.0.1:8765/`
- Minimal client config (stdio MCP):

  ```yaml
  servers:
    keepgpu:
      command: ["keep-gpu-mcp-server"]
      adapter: stdio
  ```

- Minimal client config (HTTP MCP):

  ```yaml
  servers:
    keepgpu:
      url: http://127.0.0.1:8765/
      adapter: http
  ```

- Remote/SSH tunnel example (HTTP):

  ```bash
  keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
  ```

  Client config (replace hostname/tunnel as needed):

  ```yaml
  servers:
    keepgpu:
      url: http://gpu-box.example.com:8765/
      adapter: http
  ```

  For untrusted networks, put the server behind your own auth/reverse-proxy or tunnel by way of SSH (for example, `ssh -L 8765:localhost:8765 gpu-box`).
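The JSON-RPC payload shown above can be driven from Python with nothing but the standard library. The sketch below is illustrative: `build_rpc_request` and `rpc_request` are hypothetical helper names, and it assumes the HTTP server accepts JSON-RPC POSTs at its root URL.

```python
import json
import urllib.request


def build_rpc_request(req_id: int, method: str, params: dict) -> bytes:
    """Serialize a JSON-RPC request matching the example payload above."""
    return json.dumps({"id": req_id, "method": method, "params": params}).encode()


def rpc_request(url: str, req_id: int, method: str, params: dict) -> dict:
    """POST a JSON-RPC request to the HTTP endpoint and decode the reply."""
    req = urllib.request.Request(
        url,
        data=build_rpc_request(req_id, method, params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example call (requires a running HTTP server on port 8765):
# rpc_request("http://127.0.0.1:8765/", 1, "start_keep",
#             {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20})
```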
## Contributing
Contributions are welcome—especially around ROCm support, platform fallbacks, and scheduler-specific recipes. Open an issue or PR if you hit edge cases on your cluster. See docs/contributing.md for dev setup, test commands, and PR tips.
## Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
## 📖 Citation
If you find KeepGPU useful in your research or work, please cite it as:
```bibtex
@software{Wangmerlyn_KeepGPU_2025,
  author    = {Wang, Siyuan and Shi, Yaorui and Liu, Yida and Yin, Yuqi},
  title     = {KeepGPU: a simple CLI app that keeps your GPUs running},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17129114},
  url       = {https://github.com/Wangmerlyn/KeepGPU},
  note      = {GitHub repository},
  keywords  = {ai, hpc, gpu, cluster, cuda, torch, debug}
}
```
## Project details
### Source distribution: keep_gpu-0.5.1.tar.gz

- Size: 90.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `91499d07923abcd2583ef3d05125d66bfc5782e7652d0a366e9cdaab4c920384` |
| MD5 | `a4826b5856ae2a9c375b30c8bfe358a5` |
| BLAKE2b-256 | `90d2b7a502f184243f5dbfef3a519992004174141b9c1eefaa5c253889883a30` |
### Built distribution: keep_gpu-0.5.1-py3-none-any.whl

- Size: 83.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a69430353a5d19eb762f06bfd9bbdf2678165565a0340e2c22b794ea06cb9fd0` |
| MD5 | `ebaed80d56435601ca4abfa6aa749deb` |
| BLAKE2b-256 | `595df9f64415e056eb47be54e30764e05e8e6a9b8d887651ef4443ef4cf466f2` |