Scan resources used during training using HFTrainer
HFResourceScanner
Scan resources consumed during training using HFTrainer and break down consumption by category with almost zero overhead.
Works for all training approaches (such as full fine-tuning, Prompt Tuning, LoRA) for HFTrainer and other trainers based on it, such as SFTTrainer. FSDP support is currently limited.
Measures and reports:
- GPU memory consumption broken up into 4 categories: Parameters, Optimizer, Gradients and Activations.
- Time taken for a step broken up into 3 categories: forward, backward and optimizer update.
- Other time components: init time, checkpoint time, etc.
- Number and sizes of network primitive calls.
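To make the memory categories concrete, here is a back-of-the-envelope estimate. It is not how the Scanner measures anything; it is a rough sketch that assumes fp32 weights and gradients (4 bytes per value), an Adam-style optimizer keeping two state tensors per trainable parameter, and no sharding or mixed precision. Activations depend on batch size and architecture, so they are left out. The function name and its arguments are hypothetical.

```python
def estimate_static_memory_bytes(total_params, trainable_params,
                                 bytes_per_value=4, optimizer_states=2):
    """Rough per-category estimate under the assumptions stated above."""
    return {
        # Every parameter is resident, whether or not it is trained.
        "parameters": total_params * bytes_per_value,
        # Gradients exist only for trainable parameters.
        "gradients": trainable_params * bytes_per_value,
        # Adam-style optimizers keep `optimizer_states` tensors per trainable parameter.
        "optimizer": trainable_params * bytes_per_value * optimizer_states,
    }


# Full fine-tuning: every parameter is trainable.
full = estimate_static_memory_bytes(1_000_000, 1_000_000)
# Adapter-style tuning (for example LoRA): only a small fraction is trainable.
adapter = estimate_static_memory_bytes(1_000_000, 10_000)
```

Under these assumptions, the gradient and optimizer categories shrink with the number of trainable parameters, while the parameter category does not. A measured breakdown is what tells you how far a real run deviates from such an estimate.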
Install
From PyPI:
pip install HFResourceScanner
From source:
pip install .
Usage
A two-line change to your existing code:
- Import the Scanner.
from HFResourceScanner import Scanner
- Create and add a Scanner object to the list of callbacks:
...
callbacks.append(Scanner())
...
In the default configuration, the Scanner prints its output to stdout.
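Put together, the change might look like the sketch below. The import and `Scanner()` come from the steps above, and `Trainer` and `TrainingArguments` are the standard `transformers` classes. The helper name `build_trainer` and its arguments (`model`, `train_dataset`, `output_dir`) are placeholders for whatever your existing code already has. `max_steps=10` is an arbitrary value, chosen only so that a short run gets past step 5.

```python
def build_trainer(model, train_dataset, output_dir="out"):
    """Hypothetical helper: wires a Scanner into an otherwise ordinary Trainer."""
    # Imported lazily so that defining this helper does not require the packages.
    from transformers import Trainer, TrainingArguments
    from HFResourceScanner import Scanner

    callbacks = []
    callbacks.append(Scanner())  # default configuration: report goes to stdout

    args = TrainingArguments(output_dir=output_dir, max_steps=10)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        callbacks=callbacks,
    )
```

Calling `build_trainer(model, train_dataset).train()` would then run training as usual, with the Scanner attached as one more callback.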
Configuring
You can further configure the Scanner to:
- Choose the step to instrument and scan at (we only scan at a single step). There is usually no reason to change from the default of 5.
- Output to stdout (the default), to a file, or to a callback function that handles the output. See the examples provided in the examples/ folder.
Methodology
Uses a combination of the following items:
- HFTrainer Callbacks to measure memory and its breakdown at step boundaries.
- PyTorch hook functions, such as those on nn.Module forward and the optimizer.step function, to measure memory at ideal locations.
It is important to note that this scanning happens for a single step:
- At step start, setup hook functions.
- During the step, run the functions to take single point measurements.
- At the end of the step, correlate the data and clean up the hook functions.
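The three phases above can be sketched with a toy example. This is not the library's implementation: `ToyModule` and `SingleStepScanner` are hypothetical stand-ins written in plain Python, and the sketch records timestamps rather than GPU memory. In real PyTorch code, the equivalent of the hook list is `nn.Module.register_forward_hook`, which returns a handle whose `remove()` method detaches the hook.

```python
import time


class ToyModule:
    """Stand-in for a model: calls registered hooks after its forward pass."""

    def __init__(self):
        self.forward_hooks = []

    def forward(self, x):
        y = x * 2
        for hook in list(self.forward_hooks):
            hook(self, x, y)
        return y


class SingleStepScanner:
    """Installs hooks at the start of one chosen step and removes them at its end."""

    def __init__(self, module, target_step=5):
        self.module = module
        self.target_step = target_step
        self.samples = []   # raw single-point measurements
        self.report = None  # filled in once, after the target step

    def _record(self, module, inputs, output):
        # During the step: take a single-point measurement.
        self.samples.append(("forward_done", time.perf_counter()))

    def on_step_begin(self, step):
        # At step start: set up the hook, but only for the target step.
        if step == self.target_step:
            self.samples.append(("step_start", time.perf_counter()))
            self.module.forward_hooks.append(self._record)

    def on_step_end(self, step):
        # At step end: correlate the measurements and clean up the hook.
        if step == self.target_step:
            self.samples.append(("step_end", time.perf_counter()))
            self.module.forward_hooks.remove(self._record)
            times = dict(self.samples)
            self.report = {
                "forward_s": times["forward_done"] - times["step_start"],
                "rest_of_step_s": times["step_end"] - times["forward_done"],
            }


module = ToyModule()
scanner = SingleStepScanner(module, target_step=5)
for step in range(1, 8):
    scanner.on_step_begin(step)
    module.forward(step)
    scanner.on_step_end(step)
```

Because the hook is attached only for the target step and removed immediately afterwards, every other step runs with no instrumentation at all, which is where the low overhead of this approach comes from.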
Alternatives
- PyTorch Profiler: can give a complete breakdown of stack traces and memory consumption. While this is much more exhaustive and useful for optimizing implementations, it may be overwhelming for casual users. This approach can also take a non-trivial amount of time to compute memory allocations and is quite slow for larger models.
License
Apache 2.0