GPU VRAM profiler (mprof-style) for NVIDIA GPUs

gmprof

gmprof is a small NVIDIA GPU VRAM profiler for Python. It is inspired by memory_profiler/mprof, but focuses on GPU memory instead of CPU RAM.

It provides:

an mprof-style CLI for sampling a subprocess and writing .dat files
plotting and text reports for sampled .dat files
vram_overview as both a decorator and context manager
vram_profile as both a decorator and context manager
Linux per-process VRAM accounting through NVML
Windows fallback to total device VRAM when per-process memory is unavailable

Requirements

Python 3.8+
NVIDIA GPU and NVIDIA driver
NVML access through nvidia-ml-py3

Core install:

pip install gmprof

Install plotting support:

pip install "gmprof[plot]"

Install the CuPy example dependencies for CUDA 12:

pip install "gmprof[examples]"

If you use CUDA 11, install the CuPy wheel that matches your CUDA runtime instead of cupy-cuda12x.

Important Caveats

gmprof run and vram_overview are sampling-based. If the sampling interval is too long, short-lived GPU allocations can be missed, and repeated runs of the same code may produce slightly different results because the samples land at different points in the run. This caveat does not apply to vram_profile, which measures at Python line events. For more precise VRAM measurements, especially when tracking short-lived allocations or small differences, use a smaller sampling interval; see the Plot Files section for a comparison of the same code at different intervals.
vram_profile can significantly increase runtime because it samples on every executed line in the profiled function/block. Its time column is best used for comparing lines relative to each other, not as an exact benchmark of unprofiled application speed.
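
The sampling caveat can be illustrated with a small simulation. This is only a sketch and does not use gmprof: `usage_at` is a made-up VRAM timeline and `sampled_peak` is a hypothetical helper that polls it at a fixed interval.

```python
def usage_at(t_ms: int) -> float:
    """Hypothetical VRAM timeline in MiB: a 256 MiB baseline with a
    short spike to 1280 MiB between t = 120 ms and t = 150 ms."""
    return 1280.0 if 120 <= t_ms < 150 else 256.0


def sampled_peak(interval_ms: int, duration_ms: int = 1000) -> float:
    """Return the highest value seen when polling every interval_ms."""
    return max(usage_at(t) for t in range(0, duration_ms, interval_ms))


coarse = sampled_peak(100)  # polls at 0, 100, 200, ... ms
fine = sampled_peak(10)     # polls at 0, 10, 20, ... ms

print(coarse)  # 256.0: every sample falls outside the spike
print(fine)    # 1280.0: the samples at 120, 130, and 140 ms hit the spike
```

With the coarse interval the reported peak equals the baseline, which is the kind of missed allocation the caveat describes.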

Platform Behavior

On Linux, gmprof uses NVML per-process accounting when available. With include_children=True, child processes are included in the measurement.

On Windows, NVIDIA tooling often does not expose per-process VRAM and reports process memory as unavailable. In that case, gmprof automatically reports total device VRAM and emits a warning the first time it falls back.
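
The fallback described above can be sketched as a pure function. This is an illustration under stated assumptions, not gmprof's actual code: `resolve_vram` and the `_warned` flag are hypothetical names. With the `pynvml` bindings, the per-process values would typically come from the `usedGpuMemory` field of the entries returned by `nvmlDeviceGetComputeRunningProcesses(handle)` (which is `None` when the driver does not expose it), and the device total from `nvmlDeviceGetMemoryInfo(handle).used`.

```python
import warnings
from typing import Iterable, Optional, Tuple

_warned = False  # module-level flag so the warning is emitted only once


def resolve_vram(
    per_process_bytes: Iterable[Optional[int]],
    device_used_bytes: int,
) -> Tuple[int, str]:
    """Return (bytes, scope), where scope is "process" or "device_total".

    per_process_bytes holds one entry per tracked process; None means the
    driver did not report per-process memory for that process.
    """
    global _warned
    values = list(per_process_bytes)
    if any(v is None for v in values):
        if not _warned:
            warnings.warn("per-process VRAM unavailable; using device total")
            _warned = True
        return device_used_bytes, "device_total"
    return sum(values), "process"


# Parent and one child report usage: the values are summed.
print(resolve_vram([256 * 2**20, 10 * 2**20], 4 * 2**30))
# Per-process memory unavailable: fall back to the device total.
print(resolve_vram([None], 4 * 2**30))
```

The returned scope string mirrors the `scope` column of the `.dat` format, so a reader of the output can tell which kind of measurement was taken.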

Python API

Overview Decorator

from gmprof import vram_overview


@vram_overview(device=0, label="train_step")
def train_step():
    ...


train_step()

vram_overview reports start, end, peak, delta, peak delta, and elapsed time.
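
As a rough sketch of how these fields relate, the snippet below derives them from an ordered list of samples. The definitions (`delta = end - start`, `peak_delta = peak - start`) are inferred from the field names and the example output, and `summarize` is a hypothetical helper, not part of gmprof.

```python
def summarize(samples_mib):
    """Derive overview fields from an ordered list of VRAM samples (MiB)."""
    start, end, peak = samples_mib[0], samples_mib[-1], max(samples_mib)
    return {
        "start": start,
        "end": end,
        "peak": peak,
        "delta": end - start,        # net change over the whole block
        "peak_delta": peak - start,  # highest excursion above the start
    }


# Values chosen to mirror the Example Output section: 258 MiB at the start,
# 1290 MiB at the peak, 266 MiB at the end.
summary = summarize([258.0, 522.0, 778.0, 1290.0, 266.0])
print(summary["delta"])       # 8.0
print(summary["peak_delta"])  # 1032.0, about 1.0 GiB
```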

Line-By-Line Decorator

from gmprof import vram_profile


@vram_profile(device=0, label="allocations")
def allocations():
    ...


allocations()

vram_profile reports each executed line with elapsed line time, used VRAM, and delta from the previous measured line.
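
One common way to hook Python line events is `sys.settrace`. The sketch below shows that general technique; it is not taken from gmprof's source, and `trace_lines` and `measure` are hypothetical names, with `measure` standing in for a VRAM query.

```python
import sys


def trace_lines(func, measure):
    """Call func() and return [(lineno, measure()), ...], one per line event."""
    records = []
    target = func.__code__

    def local_tracer(frame, event, arg):
        # A "line" event fires just before the interpreter runs that line.
        if event == "line":
            records.append((frame.f_lineno, measure()))
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace frames that execute the target function's code.
        return local_tracer if frame.f_code is target else None

    previous = sys.gettrace()
    sys.settrace(global_tracer)
    try:
        func()
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return records


def work():
    a = 1
    b = a + 1
    return b


fake_vram = iter(range(100))  # stand-in readings: 0, 1, 2, ...
records = trace_lines(work, lambda: next(fake_vram))
print([value for _, value in records])  # [0, 1, 2]
```

Because the tracer runs on every line, the overhead grows with the number of executed lines, which is consistent with the runtime caveat for vram_profile.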

Context Managers

from gmprof import vram_overview, vram_profile


with vram_overview(device=0, label="block"):
    ...


with vram_profile(device=0, label="line_block"):
    ...

The decorator and context-manager forms expose the same profiling behavior.
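
For readers curious how a single object can serve as both a decorator and a context manager, the standard library's `contextlib.ContextDecorator` is one way to do it. The sketch below uses a hypothetical `timed_block` class that only measures elapsed time; whether gmprof uses this mechanism internally is an assumption.

```python
import time
from contextlib import ContextDecorator


class timed_block(ContextDecorator):
    """Hypothetical helper usable as @timed_block(...) or with timed_block(...)."""

    def __init__(self, label="block"):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        print(f"[{self.label}] time={self.elapsed:.3f}s")
        return False  # never swallow exceptions from the wrapped code


@timed_block(label="decorated")
def step():
    return sum(range(1000))


result = step()  # the report is printed when the call returns

with timed_block(label="inline") as block:
    total = sum(range(1000))
```

Both forms run the same `__enter__` and `__exit__` code, which is why the two usages behave identically.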

Python Arguments

vram_overview(...)

Argument	Default	Meaning
`device`	`0`	NVIDIA GPU index to inspect.
`interval`	`0.01`	Sampling interval in seconds for peak tracking. Shorter intervals catch shorter peaks but add overhead.
`label`	`None`	Optional name shown in the printed report. Defaults to the function name or `"overview"`.
`include_children`	`True`	Include child process VRAM when per-process accounting is available.
`pid`	current process	Process ID to measure. Usually left as default.

vram_profile(...)

Argument	Default	Meaning
`device`	`0`	NVIDIA GPU index to inspect.
`label`	`None`	Optional name shown in the printed report. Defaults to the function name or `"profile"`.
`include_children`	`True`	Include child process VRAM when per-process accounting is available.
`pid`	current process	Process ID to measure. Usually left as default.

CLI

Profile a command:

gmprof run -i 0.1 -o profile.dat -- python train.py

Child processes are included by default. Exclude them when needed:

gmprof run --no-children -o profile.dat -- python train.py

Generate a report:

gmprof report profile.dat

Plot sampled VRAM:

gmprof plot profile.dat -o profile.png --no-show

CLI Options

Global options:

Option	Meaning
`-h`, `--help`	Show help.
`--version`	Print the package version.

gmprof run

Option	Default	Meaning
`-o`, `--out`	`gmprofile_TIMESTAMP.dat`	Output `.dat` file.
`-i`, `--interval`	`0.1`	Sampling interval in seconds. Lower values catch shorter peaks and add overhead.
`-c`, `--include-children`	enabled	Include child process VRAM.
`--no-children`	disabled	Exclude child processes.
`-d`, `--device`	`0`	GPU device index.
`cmd`	required	Command to profile, usually after `--`.

gmprof plot

Option	Default	Meaning
`dat_file`	required	Input `.dat` file from `gmprof run`.
`-o`, `--output`	none	Save plot to this path, for example `.png` or `.pdf`.
`-t`, `--title`	generated	Plot title.
`--no-show`	disabled	Save without opening an interactive plot window.

gmprof report

Option	Default	Meaning
`dat_file`	required	Input `.dat` file from `gmprof run`.
`-o`, `--output`	none	Save report text to this path.
`-f`, `--format`	`text`	Report format. Currently only `text`.

`.dat` Format

gmprof run writes a text file with metadata comments followed by samples:

Column	Meaning
`timestamp`	Wall-clock sample time.
`vram_mib`	VRAM usage in MiB.
`scope`	`process` for per-process samples, or `device_total` when fallback is used.
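
A minimal parser for this layout might look like the following. It assumes the exact shape of the sample file shown later in this document (`# key: value` metadata lines, a header row, then whitespace-separated samples); `parse_dat` is a hypothetical helper, not a gmprof API.

```python
from datetime import datetime


def parse_dat(text):
    """Split .dat text into (metadata dict, list of sample tuples)."""
    meta, samples = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("timestamp "):
            continue  # skip blank lines and the column header
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if sep:  # comment lines without a colon carry no metadata
                meta[key.strip()] = value.strip()
            continue
        # The timestamp itself contains a space, so split from the right.
        stamp, vram_mib, scope = line.rsplit(None, 2)
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        samples.append((when, float(vram_mib), scope))
    return meta, samples


example = """\
# gmprof profiling data
# pid: 786600
# interval: 0.01
timestamp vram_mib scope
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.556 1290.000 process
"""
meta, samples = parse_dat(example)
print(meta["interval"])               # 0.01
print(max(v for _, v, _ in samples))  # 1290.0
```

Metadata values are kept as strings here; convert them (for example `float(meta["interval"])`) as needed.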

Example Output

@vram_overview: start/end/peak for the workload
[gmprof:decorator_overview] device=0 scope=process | start=258.0 MiB | end=266.0 MiB | peak=1.3 GiB | delta=8.0 MiB | peak_delta=1.0 GiB | time=0.468s

@vram_profile: line-by-line usage for the same workload
[gmprof:decorator_profile] device=0 | scope=process | time=1.396s
+----------+---------+------------------------------------------------+-----------+------------+
|   lineno | time    | code                                           | used      | delta      |
+==========+=========+================================================+===========+============+
|       22 | 0.0956s | assert cp is not None                          | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       23 | 0.0927s | a = cp.ones((8192, 8192), dtype=cp.float32)    | 522.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       24 | 0.1097s | cp.cuda.Stream.null.synchronize()              | 522.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       25 | 0.1044s | b = cp.full((8192, 8192), 2, dtype=cp.float32) | 778.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       26 | 0.0900s | cp.cuda.Stream.null.synchronize()              | 778.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       27 | 0.0815s | c = a + b                                      | 1.0 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       28 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       29 | 0.0803s | d = c @ b                                      | 1.3 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       30 | 0.0832s | cp.cuda.Stream.null.synchronize()              | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       31 | 0.0757s | del a                                          | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       32 | 0.0759s | cp.get_default_memory_pool().free_all_blocks() | 1.0 GiB   | -256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       33 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       34 | 0.0907s | del b, c, d                                    | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       35 | 0.0835s | cp.get_default_memory_pool().free_all_blocks() | 266.0 MiB | -768.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       36 | 0.0785s | cp.cuda.Stream.null.synchronize()              | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
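
The +256.0 MiB steps in the table match the size of one 8192 x 8192 float32 array. The arithmetic below is a worked check rather than gmprof output; the 266 MiB baseline is read from the first row of the table.

```python
rows, cols = 8192, 8192
bytes_per_float32 = 4

# Size of one array in MiB (1 MiB = 2**20 bytes).
array_mib = rows * cols * bytes_per_float32 / 2**20
print(array_mib)  # 256.0

# Four arrays (a, b, c, d) are alive at the peak, on top of the baseline.
baseline_mib = 266.0
peak_mib = baseline_mib + 4 * array_mib
print(peak_mib)         # 1290.0
print(peak_mib / 1024)  # roughly 1.26, displayed as 1.3 GiB
```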

`examples/results/gmprof_fast.dat`

# gmprof profiling data
# pid: 786600
# include_children: True
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:37.055 0.000 process
2026-06-29 14:42:37.225 0.000 process
2026-06-29 14:42:37.336 0.000 process
2026-06-29 14:42:37.443 0.000 process
2026-06-29 14:42:37.537 0.000 process
2026-06-29 14:42:37.651 0.000 process
2026-06-29 14:42:37.738 0.000 process
2026-06-29 14:42:37.832 0.000 process
2026-06-29 14:42:37.923 0.000 process
2026-06-29 14:42:38.020 0.000 process
2026-06-29 14:42:38.110 0.000 process
2026-06-29 14:42:38.196 0.000 process
2026-06-29 14:42:38.302 0.000 process
2026-06-29 14:42:38.412 0.000 process
2026-06-29 14:42:38.512 0.000 process
2026-06-29 14:42:38.613 0.000 process
2026-06-29 14:42:38.712 0.000 process
2026-06-29 14:42:38.805 0.000 process
2026-06-29 14:42:38.899 0.000 process
2026-06-29 14:42:38.991 0.000 process
2026-06-29 14:42:39.082 0.000 process
2026-06-29 14:42:39.172 0.000 process
2026-06-29 14:42:39.264 0.000 process
2026-06-29 14:42:39.362 0.000 process
2026-06-29 14:42:39.459 0.000 process
2026-06-29 14:42:39.547 0.000 process
2026-06-29 14:42:39.636 0.000 process
2026-06-29 14:42:39.722 0.000 process
2026-06-29 14:42:39.813 0.000 process
2026-06-29 14:42:39.903 0.000 process
2026-06-29 14:42:39.996 0.000 process
2026-06-29 14:42:40.181 0.000 process
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.435 1026.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
2026-06-29 14:42:40.806 266.000 process
2026-06-29 14:42:40.930 266.000 process
2026-06-29 14:42:41.032 266.000 process
2026-06-29 14:42:41.133 266.000 process
2026-06-29 14:42:41.246 256.000 process
2026-06-29 14:42:41.406 0.000 process

`examples/results/gmprof_fast_report.txt`

============================================================
GMPROF REPORT
============================================================

COMMAND INFO
----------------------------------------
PID:         786600
Device:      0
Children:    True
Start Time:  2026-06-29 14:42:36.939
Scope:       process

SAMPLING INFO
----------------------------------------
Interval:    0.01s
Samples:     42

VRAM USAGE STATISTICS
----------------------------------------
Minimum:     0.0 B
Maximum:     1.3 GiB
Mean:        99.0 MiB
Median:      0.0 B
Std Dev:     260.9 MiB
Total Δ:     1.3 GiB

TIMELINE SUMMARY
----------------------------------------
First:       2026-06-29 14:42:37.055 - 0.000 MiB
Last:        2026-06-29 14:42:41.406 - 0.000 MiB
Peak:        2026-06-29 14:42:40.556 - 1290.000 MiB

============================================================
============================================================
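
The statistics in this report can be cross-checked against the 42 samples in `examples/results/gmprof_fast.dat` using only the standard library. The sample standard deviation (`statistics.stdev`) happens to reproduce the reported value; that gmprof computes it the same way is an inference from the numbers, not a documented fact.

```python
import statistics

# Samples (MiB) from the .dat file above: 32 zeros, then the allocation burst.
samples = [0.0] * 32 + [256.0, 1026.0, 1290.0] + [266.0] * 5 + [256.0, 0.0]

print(len(samples))                         # 42
print(statistics.mean(samples))             # 99.0
print(statistics.median(samples))           # 0.0
print(round(statistics.stdev(samples), 1))  # 260.9
print(max(samples))                         # 1290.0, about 1.3 GiB
```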

Plot Files

The same code measured with different sampling intervals:

[Image: gmprof CLI plot]

[Image: gmprof fast CLI plot]

License

MIT License. See LICENSE.

This version

0.1.0

Jun 29, 2026
