GPU VRAM profiler (mprof-style) for NVIDIA GPUs

gmprof

gmprof is a small NVIDIA GPU VRAM profiler for Python. It is inspired by memory_profiler/mprof, but focuses on GPU memory instead of CPU RAM.

It provides:

  • an mprof-style CLI for sampling a subprocess and writing .dat files
  • plotting and text reports for sampled .dat files
  • vram_overview as both a decorator and context manager
  • vram_profile as both a decorator and context manager
  • Linux per-process VRAM accounting through NVML
  • Windows fallback to total device VRAM when per-process memory is unavailable

Requirements

  • Python 3.8+
  • NVIDIA GPU and NVIDIA driver
  • NVML access through nvidia-ml-py3

Core install:

pip install gmprof

Install plotting support:

pip install "gmprof[plot]"

Install the CuPy example dependencies for CUDA 12:

pip install "gmprof[examples]"

If you use CUDA 11, install the CuPy wheel that matches your CUDA runtime instead of cupy-cuda12x.

Important Caveats

  • gmprof run and vram_overview are sampling-based. If the sampling interval is too long, short-lived GPU allocations can be missed, and repeated runs of the same code may report slightly different values because the samples land at different points in the workload. This caveat does not apply to vram_profile, which samples at Python line events. For more precise VRAM measurements, especially when tracking short-lived allocations or small differences, use a smaller sampling interval; see the files under examples/results for a comparison of intervals.

  • vram_profile can significantly increase runtime because it samples on every executed line in the profiled function/block. Its time column is best used for comparing lines relative to each other, not as an exact benchmark of unprofiled application speed.
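The effect of the sampling interval can be illustrated without a GPU. The sketch below is not gmprof code: it assumes a hypothetical workload whose usage sits at 266 MiB except for a 30 ms spike to 1290 MiB, and compares the peak seen by a 100 ms sampler with the peak seen by a 10 ms sampler. The function names and the timeline are invented for illustration.

```python
def usage_at_ms(t_ms: int) -> float:
    """Hypothetical VRAM timeline in MiB: a 30 ms spike on a flat baseline."""
    return 1290.0 if 520 <= t_ms < 550 else 266.0


def observed_peak(interval_ms: int, duration_ms: int = 1000) -> float:
    """Peak seen by a sampler that reads usage every interval_ms milliseconds."""
    sample_times = range(0, duration_ms + 1, interval_ms)
    return max(usage_at_ms(t) for t in sample_times)


coarse = observed_peak(100)  # samples at 500 ms and 600 ms, both outside the spike
fine = observed_peak(10)     # samples at 520, 530 and 540 ms fall inside the spike

print(f"100 ms interval: peak={coarse} MiB")
print(f" 10 ms interval: peak={fine} MiB")
```

With the coarse interval the spike falls between two samples and the reported peak equals the baseline, which is the failure mode described above.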

Platform Behavior

On Linux, gmprof uses NVML per-process accounting when available. With include_children=True, child processes are included in the measurement.

On Windows, NVIDIA tooling often does not expose per-process VRAM and reports process memory as unavailable. In that case, gmprof automatically reports total device VRAM and emits a warning the first time it falls back.
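The fallback rule can be sketched as a small pure function. This is an illustration of the behavior described above, not gmprof's implementation: resolve_sample and its parameters are hypothetical names, and the caller is assumed to have already obtained a per-process byte count (or None when the driver reports it as unavailable) and a device-wide byte count. The scope strings process and device_total are the ones used in the .dat format.

```python
import warnings
from typing import Optional, Tuple

_fallback_warned = False  # module-level flag so the warning is emitted only once


def resolve_sample(process_bytes: Optional[int], device_used_bytes: int) -> Tuple[float, str]:
    """Return (MiB, scope), preferring per-process data when it is available."""
    global _fallback_warned
    if process_bytes is not None:
        return process_bytes / 2**20, "process"
    if not _fallback_warned:
        warnings.warn("per-process VRAM unavailable; reporting total device VRAM")
        _fallback_warned = True
    return device_used_bytes / 2**20, "device_total"


# Per-process accounting available (the Linux case described above).
print(resolve_sample(256 * 2**20, 1024 * 2**20))
# Per-process memory unavailable (the Windows case described above).
print(resolve_sample(None, 1024 * 2**20))
```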

Python API

Overview Decorator

from gmprof import vram_overview


@vram_overview(device=0, label="train_step")
def train_step():
    ...


train_step()

vram_overview reports start, end, peak, delta, peak delta, and elapsed time.
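As a rough illustration of how these fields relate, the sketch below derives them from a list of sampled values. It assumes that delta is end minus start and that peak delta is peak minus start; both definitions are assumptions that happen to agree with the sample output later in this document, not a statement about gmprof's internals. The summarize function and the sample values are hypothetical.

```python
from typing import Dict, List


def summarize(samples_mib: List[float]) -> Dict[str, float]:
    """Derive overview-style fields from a non-empty list of VRAM samples in MiB."""
    start, end, peak = samples_mib[0], samples_mib[-1], max(samples_mib)
    return {
        "start": start,
        "end": end,
        "peak": peak,
        "delta": end - start,        # assumed definition: end - start
        "peak_delta": peak - start,  # assumed definition: peak - start
    }


# Hypothetical samples: usage climbs to 1290 MiB, then mostly returns to baseline.
summary = summarize([258.0, 522.0, 778.0, 1034.0, 1290.0, 1034.0, 266.0])
print(summary)
```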

Line-By-Line Decorator

from gmprof import vram_profile


@vram_profile(device=0, label="allocations")
def allocations():
    ...


allocations()

vram_profile reports each executed line with elapsed line time, used VRAM, and delta from the previous measured line.
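Sampling at Python line events can be sketched with the standard sys.settrace hook. The code below is a simplified illustration, not gmprof's implementation: trace_lines and read_usage are hypothetical names, and a dictionary stands in for the GPU so the example runs anywhere. Because a line event fires before the line executes, each line's usage is read when the next event arrives.

```python
import sys
from typing import Callable, List, Tuple


def trace_lines(func: Callable[[], None],
                read_usage: Callable[[], float]) -> List[Tuple[int, float, float]]:
    """Run func and return (line offset, used, delta) for each executed line."""
    target = func.__code__
    rows: List[Tuple[int, float, float]] = []
    state = {"previous": read_usage(), "pending": None}

    def flush() -> None:
        # Record the line that just finished executing, if there is one.
        if state["pending"] is not None:
            used = read_usage()
            rows.append((state["pending"], used, used - state["previous"]))
            state["previous"] = used

    def local_tracer(frame, event, arg):
        if event == "line":
            flush()
            state["pending"] = frame.f_lineno - target.co_firstlineno
        elif event == "return":
            flush()
            state["pending"] = None
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace the profiled function itself, not its callees.
        return local_tracer if frame.f_code is target else None

    previous_tracer = sys.gettrace()
    sys.settrace(global_tracer)
    try:
        func()
    finally:
        sys.settrace(previous_tracer)
    return rows


fake_gpu = {"used": 10.0}  # stand-in for real VRAM readings, in MiB


def workload() -> None:
    fake_gpu["used"] += 256.0
    fake_gpu["used"] += 0.0
    fake_gpu["used"] -= 256.0


rows = trace_lines(workload, lambda: fake_gpu["used"])
for offset, used, delta in rows:
    print(f"line +{offset}: used={used} MiB delta={delta:+} MiB")
```

Running a callback on every line is also why this style of profiling slows the profiled code down noticeably.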

Context Managers

from gmprof import vram_overview, vram_profile


with vram_overview(device=0, label="block"):
    ...


with vram_profile(device=0, label="line_block"):
    ...

The decorator and context-manager forms expose the same profiling behavior.
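One common way to offer both forms from a single object is the standard library's contextlib.ContextDecorator. The sketch below shows the pattern with a timer instead of a VRAM reader; timed_block is a hypothetical class, and nothing here implies that gmprof is built this way.

```python
import time
from contextlib import ContextDecorator


class timed_block(ContextDecorator):
    """Measure elapsed time; usable as a decorator or as a context manager."""

    def __init__(self, label=None):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._t0
        print(f"[{self.label or 'block'}] time={self.elapsed:.3f}s")
        return False  # never swallow exceptions from the profiled code


@timed_block(label="train_step")
def train_step():
    return sum(range(1000))


result = train_step()

with timed_block(label="block") as block:
    total = sum(range(1000))
```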

Python Arguments

vram_overview(...)

| Argument | Default | Meaning |
|---|---|---|
| device | 0 | NVIDIA GPU index to inspect. |
| interval | 0.01 | Sampling interval in seconds for peak tracking. Shorter intervals catch shorter peaks but add overhead. |
| label | None | Optional name shown in the printed report. Defaults to the function name or "overview". |
| include_children | True | Include child process VRAM when per-process accounting is available. |
| pid | current process | Process ID to measure. Usually left as default. |

vram_profile(...)

| Argument | Default | Meaning |
|---|---|---|
| device | 0 | NVIDIA GPU index to inspect. |
| label | None | Optional name shown in the printed report. Defaults to the function name or "profile". |
| include_children | True | Include child process VRAM when per-process accounting is available. |
| pid | current process | Process ID to measure. Usually left as default. |

CLI

Profile a command:

gmprof run -i 0.1 -o profile.dat -- python train.py

Child processes are included by default. Exclude them when needed:

gmprof run --no-children -o profile.dat -- python train.py

Generate a report:

gmprof report profile.dat

Plot sampled VRAM:

gmprof plot profile.dat -o profile.png --no-show

CLI Options

Global options:

| Option | Meaning |
|---|---|
| -h, --help | Show help. |
| --version | Print the package version. |

gmprof run

| Option | Default | Meaning |
|---|---|---|
| -o, --out | gmprofile_TIMESTAMP.dat | Output .dat file. |
| -i, --interval | 0.1 | Sampling interval in seconds. Lower values catch shorter peaks and add overhead. |
| -c, --include-children | enabled | Include child process VRAM. |
| --no-children | disabled | Exclude child processes. |
| -d, --device | 0 | GPU device index. |
| cmd | required | Command to profile, usually after --. |
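The general shape of sampling a subprocess at a fixed interval can be sketched as follows. This is an illustration of the technique only: sample_command and read_mib are hypothetical names, the reader is a stub that always returns zero, and gmprof's actual loop may differ. The emitted lines follow the timestamp, vram_mib, scope layout used in .dat files.

```python
import subprocess
import sys
import time
from datetime import datetime
from typing import Callable, List, Sequence


def sample_command(cmd: Sequence[str],
                   read_mib: Callable[[int], float],
                   interval: float = 0.1) -> List[str]:
    """Run cmd and record one sample line per interval until it exits."""
    lines = ["timestamp vram_mib scope"]
    proc = subprocess.Popen(list(cmd))
    while True:
        # Millisecond-resolution timestamp, matching the sample files shown in this document.
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        lines.append(f"{stamp} {read_mib(proc.pid):.3f} process")
        if proc.poll() is not None:
            break
        time.sleep(interval)
    return lines


# Stub reader: a real one would query the GPU for the given process ID.
lines = sample_command(
    [sys.executable, "-c", "import time; time.sleep(0.3)"],
    read_mib=lambda pid: 0.0,
    interval=0.05,
)
print("\n".join(lines))
```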

gmprof plot

| Option | Default | Meaning |
|---|---|---|
| dat_file | required | Input .dat file from gmprof run. |
| -o, --output | none | Save plot to this path, for example .png or .pdf. |
| -t, --title | generated | Plot title. |
| --no-show | disabled | Save without opening an interactive plot window. |

gmprof report

| Option | Default | Meaning |
|---|---|---|
| dat_file | required | Input .dat file from gmprof run. |
| -o, --output | none | Save report text to this path. |
| -f, --format | text | Report format. Currently only text. |

.dat Format

gmprof run writes a text file with metadata comments followed by samples:

| Column | Meaning |
|---|---|
| timestamp | Wall-clock sample time. |
| vram_mib | VRAM usage in MiB. |
| scope | process for per-process samples, or device_total when fallback is used. |
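A file in this layout is straightforward to read back for custom analysis. The parser below is a minimal sketch based on the sample file shown later in this document; parse_dat is a hypothetical helper, not part of gmprof. It splits each sample line from the right because the timestamp itself contains a space.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float, str]


def parse_dat(text: str) -> Tuple[Dict[str, str], List[Sample]]:
    """Split .dat text into metadata from '# key: value' lines and sample rows."""
    meta: Dict[str, str] = {}
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if sep:  # comment lines without a colon carry no metadata
                meta[key.strip()] = value.strip()
            continue
        if line.startswith("timestamp"):
            continue  # column header row
        stamp, mib, scope = line.rsplit(maxsplit=2)
        samples.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), float(mib), scope))
    return meta, samples


example = """\
# gmprof profiling data
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:40.435 1026.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
"""

meta, samples = parse_dat(example)
print(meta)
print(max(samples, key=lambda s: s[1]))
```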

Example Output

@vram_overview: start/end/peak for the workload
[gmprof:decorator_overview] device=0 scope=process | start=258.0 MiB | end=266.0 MiB | peak=1.3 GiB | delta=8.0 MiB | peak_delta=1.0 GiB | time=0.468s
@vram_profile: line-by-line usage for the same workload
[gmprof:decorator_profile] device=0 | scope=process | time=1.396s
+----------+---------+------------------------------------------------+-----------+------------+
|   lineno | time    | code                                           | used      | delta      |
+==========+=========+================================================+===========+============+
|       22 | 0.0956s | assert cp is not None                          | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       23 | 0.0927s | a = cp.ones((8192, 8192), dtype=cp.float32)    | 522.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       24 | 0.1097s | cp.cuda.Stream.null.synchronize()              | 522.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       25 | 0.1044s | b = cp.full((8192, 8192), 2, dtype=cp.float32) | 778.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       26 | 0.0900s | cp.cuda.Stream.null.synchronize()              | 778.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       27 | 0.0815s | c = a + b                                      | 1.0 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       28 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       29 | 0.0803s | d = c @ b                                      | 1.3 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       30 | 0.0832s | cp.cuda.Stream.null.synchronize()              | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       31 | 0.0757s | del a                                          | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       32 | 0.0759s | cp.get_default_memory_pool().free_all_blocks() | 1.0 GiB   | -256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       33 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       34 | 0.0907s | del b, c, d                                    | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       35 | 0.0835s | cp.get_default_memory_pool().free_all_blocks() | 266.0 MiB | -768.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       36 | 0.0785s | cp.cuda.Stream.null.synchronize()              | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+

examples/results/gmprof_fast.dat

# gmprof profiling data
# pid: 786600
# include_children: True
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:37.055 0.000 process
2026-06-29 14:42:37.225 0.000 process
2026-06-29 14:42:37.336 0.000 process
2026-06-29 14:42:37.443 0.000 process
2026-06-29 14:42:37.537 0.000 process
2026-06-29 14:42:37.651 0.000 process
2026-06-29 14:42:37.738 0.000 process
2026-06-29 14:42:37.832 0.000 process
2026-06-29 14:42:37.923 0.000 process
2026-06-29 14:42:38.020 0.000 process
2026-06-29 14:42:38.110 0.000 process
2026-06-29 14:42:38.196 0.000 process
2026-06-29 14:42:38.302 0.000 process
2026-06-29 14:42:38.412 0.000 process
2026-06-29 14:42:38.512 0.000 process
2026-06-29 14:42:38.613 0.000 process
2026-06-29 14:42:38.712 0.000 process
2026-06-29 14:42:38.805 0.000 process
2026-06-29 14:42:38.899 0.000 process
2026-06-29 14:42:38.991 0.000 process
2026-06-29 14:42:39.082 0.000 process
2026-06-29 14:42:39.172 0.000 process
2026-06-29 14:42:39.264 0.000 process
2026-06-29 14:42:39.362 0.000 process
2026-06-29 14:42:39.459 0.000 process
2026-06-29 14:42:39.547 0.000 process
2026-06-29 14:42:39.636 0.000 process
2026-06-29 14:42:39.722 0.000 process
2026-06-29 14:42:39.813 0.000 process
2026-06-29 14:42:39.903 0.000 process
2026-06-29 14:42:39.996 0.000 process
2026-06-29 14:42:40.181 0.000 process
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.435 1026.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
2026-06-29 14:42:40.806 266.000 process
2026-06-29 14:42:40.930 266.000 process
2026-06-29 14:42:41.032 266.000 process
2026-06-29 14:42:41.133 266.000 process
2026-06-29 14:42:41.246 256.000 process
2026-06-29 14:42:41.406 0.000 process

examples/results/gmprof_fast_report.txt

============================================================
GMPROF REPORT
============================================================

COMMAND INFO
----------------------------------------
PID:         786600
Device:      0
Children:    True
Start Time:  2026-06-29 14:42:36.939
Scope:       process

SAMPLING INFO
----------------------------------------
Interval:    0.01s
Samples:     42

VRAM USAGE STATISTICS
----------------------------------------
Minimum:     0.0 B
Maximum:     1.3 GiB
Mean:        99.0 MiB
Median:      0.0 B
Std Dev:     260.9 MiB
Total Δ:     1.3 GiB

TIMELINE SUMMARY
----------------------------------------
First:       2026-06-29 14:42:37.055 - 0.000 MiB
Last:        2026-06-29 14:42:41.406 - 0.000 MiB
Peak:        2026-06-29 14:42:40.556 - 1290.000 MiB

============================================================
============================================================
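The statistics in the report above can be cross-checked against the sampled values from gmprof_fast.dat with the standard statistics module. The sketch below is a verification aid, not gmprof's report code. The reported standard deviation matches the sample standard deviation (statistics.stdev) rather than the population one.

```python
import statistics

# The 42 vram_mib values from examples/results/gmprof_fast.dat, in order.
values = [0.0] * 32 + [256.0, 1026.0, 1290.0] + [266.0] * 5 + [256.0, 0.0]

stats = {
    "samples": len(values),
    "minimum": min(values),
    "maximum": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1)
}

for name, value in stats.items():
    print(f"{name}: {value}")
```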

Plot Files

The same code measured with different sampling intervals:

[Figure: gmprof CLI plot]

[Figure: gmprof fast CLI plot]

License

MIT License. See LICENSE.
