GPU VRAM profiler (mprof-style) for NVIDIA GPUs
gmprof
gmprof is a small NVIDIA GPU VRAM profiler for Python. It is inspired by
memory_profiler/mprof, but focuses on GPU memory instead of CPU RAM.
It provides:
- an mprof-style CLI for sampling a subprocess and writing .dat files
- plotting and text reports for sampled .dat files
- vram_overview as both a decorator and context manager
- vram_profile as both a decorator and context manager
- Linux per-process VRAM accounting through NVML
- Windows fallback to total device VRAM when per-process memory is unavailable
Requirements
- Python 3.8+
- NVIDIA GPU and NVIDIA driver
- NVML access through nvidia-ml-py3
Core install:
pip install gmprof
Install plotting support:
pip install "gmprof[plot]"
Install the CuPy example dependencies for CUDA 12:
pip install "gmprof[examples]"
If you use CUDA 11, install the CuPy wheel that matches your CUDA runtime
instead of cupy-cuda12x.
Important Caveats
- gmprof run and vram_overview are sampling-based. If the sampling interval is too long, short-lived GPU allocations can be missed, and repeated runs of the same code may produce slightly different results due to sampling bias. This caveat does not apply to vram_profile, which samples at Python line events. For more precise VRAM measurements, especially when tracking short-lived allocations or small differences, use a smaller sampling interval; see the results under Plot Files for a comparison.
- vram_profile can significantly increase runtime because it samples on every executed line in the profiled function or block. Its time column is best used for comparing lines relative to each other, not as an exact benchmark of unprofiled application speed.
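The effect of the sampling interval can be illustrated without a GPU. The following is a minimal sketch, not gmprof code: synthetic_usage is a hypothetical workload with a 30 ms spike, and sampled_peak simply polls it at a fixed interval. Times are integer milliseconds to avoid floating-point drift.

```python
def synthetic_usage(t_ms):
    """Hypothetical workload: 266 MiB baseline with a 1290 MiB spike lasting 30 ms."""
    return 1290.0 if 505 <= t_ms < 535 else 266.0


def sampled_peak(usage_at_ms, duration_ms, interval_ms):
    """Return the highest value seen when polling every interval_ms milliseconds."""
    return max(usage_at_ms(t) for t in range(0, duration_ms + 1, interval_ms))


# A 100 ms interval polls at 500 ms and 600 ms and steps over the spike entirely.
coarse = sampled_peak(synthetic_usage, duration_ms=1000, interval_ms=100)

# A 10 ms interval lands inside the spike at 510, 520, and 530 ms.
fine = sampled_peak(synthetic_usage, duration_ms=1000, interval_ms=10)

print(f"coarse peak: {coarse} MiB, fine peak: {fine} MiB")
```

In a real run the sample times also shift from run to run, which is why repeated measurements of the same code can disagree.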
Platform Behavior
On Linux, gmprof uses NVML per-process accounting when available. With
include_children=True, child processes are included in the measurement.
On Windows, NVIDIA tooling often does not expose per-process VRAM and reports
process memory as unavailable. In that case, gmprof automatically reports
total device VRAM and emits a warning the first time it falls back.
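The fallback described above can be sketched as a small decision function. This is an illustration under stated assumptions, not gmprof's implementation: resolve_sample is a hypothetical name, the per-process values are assumed to arrive as byte counts with None standing for "not available", and an empty list is assumed to mean the process holds no VRAM yet.

```python
import warnings

_warned = False  # module-level flag so the fallback warning is emitted only once


def resolve_sample(per_process_bytes, device_used_bytes):
    """Return (vram_mib, scope) for one sample.

    per_process_bytes: list of byte counts, one per tracked process;
                       None marks a value the driver reported as unavailable.
    device_used_bytes: total used VRAM on the device, in bytes.
    """
    global _warned
    if any(b is None for b in per_process_bytes):
        if not _warned:
            warnings.warn("per-process VRAM unavailable; reporting device total")
            _warned = True
        return device_used_bytes / 2**20, "device_total"
    return sum(per_process_bytes) / 2**20, "process"


MIB = 2**20
# Parent and one child process, both reported by the driver.
print(resolve_sample([256 * MIB, 10 * MIB], 4096 * MIB))
# Driver could not report per-process memory, so the device total is used.
print(resolve_sample([None], 4096 * MIB))
```

The scope string mirrors the process and device_total values that appear in the scope column of the .dat file.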
Python API
Overview Decorator
from gmprof import vram_overview
@vram_overview(device=0, label="train_step")
def train_step():
    ...
train_step()
vram_overview reports start, end, peak, delta, peak delta, and elapsed time.
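How these fields relate to each other can be sketched from a list of samples. The sketch assumes that delta is end minus start and that peak delta is peak minus start; this is consistent with the numbers in the Example Output section but is not confirmed by the source. The function name summarize and the sample values are hypothetical.

```python
def summarize(samples_mib, elapsed_s):
    """Derive overview-style fields from VRAM samples given in MiB."""
    start, end = samples_mib[0], samples_mib[-1]
    peak = max(samples_mib)
    return {
        "start": start,
        "end": end,
        "peak": peak,
        "delta": end - start,        # assumed: net change over the block
        "peak_delta": peak - start,  # assumed: highest rise above the start
        "time": elapsed_s,
    }


summary = summarize([258.0, 522.0, 1290.0, 266.0], elapsed_s=0.468)
for field, value in summary.items():
    print(f"{field}={value}")
```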
Line-By-Line Decorator
from gmprof import vram_profile
@vram_profile(device=0, label="allocations")
def allocations():
    ...
allocations()
vram_profile reports each executed line with elapsed line time, used VRAM,
and delta from the previous measured line.
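Measuring at Python line events can be sketched with the standard sys.settrace hook. This is a minimal illustration of the general technique, not gmprof's implementation: profile_lines is a hypothetical helper, and the measure callback stands in for a real VRAM query. A line event fires before the line runs, so each reading reflects the state left by the previous line; a real profiler has to attribute the change accordingly.

```python
import sys


def profile_lines(func, measure):
    """Call func() and record (line number, measure()) at each line event in func."""
    records = []
    target = func.__code__

    def local_trace(frame, event, arg):
        if event == "line":
            records.append((frame.f_lineno, measure()))
        return local_trace

    def global_trace(frame, event, arg):
        # Only trace frames that execute the target function's own code.
        return local_trace if frame.f_code is target else None

    previous = sys.gettrace()
    sys.settrace(global_trace)
    try:
        func()
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return records


fake_vram = {"mib": 266.0}  # stand-in for a GPU memory reading


def workload():
    fake_vram["mib"] += 256.0
    fake_vram["mib"] += 256.0
    fake_vram["mib"] -= 512.0


records = profile_lines(workload, lambda: fake_vram["mib"])
for lineno, used in records:
    print(lineno, used)
```

Because the callback runs on every executed line, the overhead grows with the number of lines, which matches the runtime caveat above.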
Context Managers
from gmprof import vram_overview, vram_profile
with vram_overview(device=0, label="block"):
    ...
with vram_profile(device=0, label="line_block"):
    ...
The decorator and context-manager forms expose the same profiling behavior.
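One common way to make a single object usable in both forms is contextlib.ContextDecorator from the standard library. The sketch below shows that pattern with a hypothetical timed_block class that records elapsed time; whether gmprof is built this way is an assumption, and the sink parameter exists only to make the example observable.

```python
import time
from contextlib import ContextDecorator


class timed_block(ContextDecorator):
    """Hypothetical profiler usable as both a decorator and a context manager."""

    def __init__(self, label=None, sink=None):
        self.label = label
        self.sink = sink if sink is not None else []

    def __enter__(self):
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.sink.append((self.label, time.perf_counter() - self._t0))
        return False  # never swallow exceptions from the profiled code


log = []


@timed_block(label="train_step", sink=log)
def train_step():
    return 42


result = train_step()

with timed_block(label="block", sink=log):
    pass

for label, elapsed in log:
    print(f"{label}: {elapsed:.6f}s")
```

Both forms go through the same __enter__ and __exit__ methods, so they share one measurement path.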
Python Arguments
vram_overview(...)
| Argument | Default | Meaning |
|---|---|---|
| device | 0 | NVIDIA GPU index to inspect. |
| interval | 0.01 | Sampling interval in seconds for peak tracking. Shorter intervals catch shorter peaks but add overhead. |
| label | None | Optional name shown in the printed report. Defaults to the function name or "overview". |
| include_children | True | Include child process VRAM when per-process accounting is available. |
| pid | current process | Process ID to measure. Usually left as default. |
vram_profile(...)
| Argument | Default | Meaning |
|---|---|---|
| device | 0 | NVIDIA GPU index to inspect. |
| label | None | Optional name shown in the printed report. Defaults to the function name or "profile". |
| include_children | True | Include child process VRAM when per-process accounting is available. |
| pid | current process | Process ID to measure. Usually left as default. |
CLI
Profile a command:
gmprof run -i 0.1 -o profile.dat -- python train.py
Child processes are included by default. Exclude them when needed:
gmprof run --no-children -o profile.dat -- python train.py
Generate a report:
gmprof report profile.dat
Plot sampled VRAM:
gmprof plot profile.dat -o profile.png --no-show
CLI Options
Global options:
| Option | Meaning |
|---|---|
| -h, --help | Show help. |
| --version | Print the package version. |
gmprof run
| Option | Default | Meaning |
|---|---|---|
| -o, --out | gmprofile_TIMESTAMP.dat | Output .dat file. |
| -i, --interval | 0.1 | Sampling interval in seconds. Lower values catch shorter peaks and add overhead. |
| -c, --include-children | enabled | Include child process VRAM. |
| --no-children | disabled | Exclude child processes. |
| -d, --device | 0 | GPU device index. |
| cmd | required | Command to profile, usually after --. |
gmprof plot
| Option | Default | Meaning |
|---|---|---|
| dat_file | required | Input .dat file from gmprof run. |
| -o, --output | none | Save plot to this path, for example .png or .pdf. |
| -t, --title | generated | Plot title. |
| --no-show | disabled | Save without opening an interactive plot window. |
gmprof report
| Option | Default | Meaning |
|---|---|---|
| dat_file | required | Input .dat file from gmprof run. |
| -o, --output | none | Save report text to this path. |
| -f, --format | text | Report format. Currently only text. |
.dat Format
gmprof run writes a text file with metadata comments followed by samples:
| Column | Meaning |
|---|---|
| timestamp | Wall-clock sample time. |
| vram_mib | VRAM usage in MiB. |
| scope | process for per-process samples, or device_total when fallback is used. |
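A .dat file in this layout can be read back with a few lines of Python. The parser below is a sketch based only on the example file shown in this README: it assumes whitespace-separated columns, a timestamp made of a date and a time token, and metadata lines of the form "# key: value". parse_dat is a hypothetical helper, not part of the gmprof API.

```python
from datetime import datetime

SAMPLE = """\
# gmprof profiling data
# pid: 786600
# include_children: True
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
"""


def parse_dat(text):
    """Return (metadata dict, list of (timestamp, vram_mib, scope)) from .dat text."""
    meta, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Metadata comment; lines without "key: value" (the banner) are skipped.
            key, sep, value = line[1:].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        parts = line.split()
        if parts[0] == "timestamp":
            continue  # column header row
        # The timestamp contains a space, so take the last two fields first.
        scope = parts[-1]
        vram_mib = float(parts[-2])
        ts = datetime.strptime(" ".join(parts[:-2]), "%Y-%m-%d %H:%M:%S.%f")
        samples.append((ts, vram_mib, scope))
    return meta, samples


meta, samples = parse_dat(SAMPLE)
peak = max(samples, key=lambda s: s[1])
print(meta["pid"], len(samples), peak[1])
```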
Example Output
@vram_overview: start/end/peak for the workload
[gmprof:decorator_overview] device=0 scope=process | start=258.0 MiB | end=266.0 MiB | peak=1.3 GiB | delta=8.0 MiB | peak_delta=1.0 GiB | time=0.468s
@vram_profile: line-by-line usage for the same workload
[gmprof:decorator_profile] device=0 | scope=process | time=1.396s
+----------+---------+------------------------------------------------+-----------+------------+
| lineno | time | code | used | delta |
+==========+=========+================================================+===========+============+
| 22 | 0.0956s | assert cp is not None | 266.0 MiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 23 | 0.0927s | a = cp.ones((8192, 8192), dtype=cp.float32) | 522.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 24 | 0.1097s | cp.cuda.Stream.null.synchronize() | 522.0 MiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 25 | 0.1044s | b = cp.full((8192, 8192), 2, dtype=cp.float32) | 778.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 26 | 0.0900s | cp.cuda.Stream.null.synchronize() | 778.0 MiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 27 | 0.0815s | c = a + b | 1.0 GiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 28 | 0.0851s | cp.cuda.Stream.null.synchronize() | 1.0 GiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 29 | 0.0803s | d = c @ b | 1.3 GiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 30 | 0.0832s | cp.cuda.Stream.null.synchronize() | 1.3 GiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 31 | 0.0757s | del a | 1.3 GiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 32 | 0.0759s | cp.get_default_memory_pool().free_all_blocks() | 1.0 GiB | -256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 33 | 0.0851s | cp.cuda.Stream.null.synchronize() | 1.0 GiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 34 | 0.0907s | del b, c, d | 1.0 GiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
| 35 | 0.0835s | cp.get_default_memory_pool().free_all_blocks() | 266.0 MiB | -768.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
| 36 | 0.0785s | cp.cuda.Stream.null.synchronize() | 266.0 MiB | 0.0 B |
+----------+---------+------------------------------------------------+-----------+------------+
examples/results/gmprof_fast.dat
# gmprof profiling data
# pid: 786600
# include_children: True
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:37.055 0.000 process
2026-06-29 14:42:37.225 0.000 process
2026-06-29 14:42:37.336 0.000 process
2026-06-29 14:42:37.443 0.000 process
2026-06-29 14:42:37.537 0.000 process
2026-06-29 14:42:37.651 0.000 process
2026-06-29 14:42:37.738 0.000 process
2026-06-29 14:42:37.832 0.000 process
2026-06-29 14:42:37.923 0.000 process
2026-06-29 14:42:38.020 0.000 process
2026-06-29 14:42:38.110 0.000 process
2026-06-29 14:42:38.196 0.000 process
2026-06-29 14:42:38.302 0.000 process
2026-06-29 14:42:38.412 0.000 process
2026-06-29 14:42:38.512 0.000 process
2026-06-29 14:42:38.613 0.000 process
2026-06-29 14:42:38.712 0.000 process
2026-06-29 14:42:38.805 0.000 process
2026-06-29 14:42:38.899 0.000 process
2026-06-29 14:42:38.991 0.000 process
2026-06-29 14:42:39.082 0.000 process
2026-06-29 14:42:39.172 0.000 process
2026-06-29 14:42:39.264 0.000 process
2026-06-29 14:42:39.362 0.000 process
2026-06-29 14:42:39.459 0.000 process
2026-06-29 14:42:39.547 0.000 process
2026-06-29 14:42:39.636 0.000 process
2026-06-29 14:42:39.722 0.000 process
2026-06-29 14:42:39.813 0.000 process
2026-06-29 14:42:39.903 0.000 process
2026-06-29 14:42:39.996 0.000 process
2026-06-29 14:42:40.181 0.000 process
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.435 1026.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
2026-06-29 14:42:40.806 266.000 process
2026-06-29 14:42:40.930 266.000 process
2026-06-29 14:42:41.032 266.000 process
2026-06-29 14:42:41.133 266.000 process
2026-06-29 14:42:41.246 256.000 process
2026-06-29 14:42:41.406 0.000 process
examples/results/gmprof_fast_report.txt
============================================================
GMPROF REPORT
============================================================
COMMAND INFO
----------------------------------------
PID: 786600
Device: 0
Children: True
Start Time: 2026-06-29 14:42:36.939
Scope: process
SAMPLING INFO
----------------------------------------
Interval: 0.01s
Samples: 42
VRAM USAGE STATISTICS
----------------------------------------
Minimum: 0.0 B
Maximum: 1.3 GiB
Mean: 99.0 MiB
Median: 0.0 B
Std Dev: 260.9 MiB
Total Δ: 1.3 GiB
TIMELINE SUMMARY
----------------------------------------
First: 2026-06-29 14:42:37.055 - 0.000 MiB
Last: 2026-06-29 14:42:41.406 - 0.000 MiB
Peak: 2026-06-29 14:42:40.556 - 1290.000 MiB
============================================================
============================================================
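The statistics in this report can be cross-checked against the 42 samples in examples/results/gmprof_fast.dat using the standard statistics module. The values below are copied from that listing. The sample (n - 1) standard deviation reproduces the reported 260.9 MiB, which suggests, but does not confirm, that the report uses that estimator; treating Total Δ as maximum minus minimum is likewise an assumption that happens to match.

```python
import statistics

# vram_mib column of gmprof_fast.dat: 32 zero samples, the ramp, then the tail.
values = [0.0] * 32 + [256.0, 1026.0, 1290.0] + [266.0] * 5 + [256.0, 0.0]

count = len(values)
minimum, maximum = min(values), max(values)
mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
total_delta = maximum - minimum   # assumed definition of "Total Δ"

print(f"samples={count} min={minimum} max={maximum}")
print(f"mean={mean:.1f} median={median:.1f} stdev={stdev:.1f} total_delta={total_delta}")
```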
Plot Files
The same code measured with different sampling intervals:
License
MIT License. See LICENSE.