
Scan resources used during training using HFTrainer


HFResourceScanner


Scan resources consumed during training with HFTrainer and break consumption down by category, with almost ZERO overhead.

Works with all training approaches (such as full fine-tuning, Prompt Tuning, and LoRA) for HFTrainer and trainers derived from it, such as SFTTrainer. FSDP support is currently limited.

Measures and reports:

  1. GPU memory consumption, broken down into 4 categories: Parameters, Optimizer, Gradients, and Activations.
  2. Time taken per step, broken down into 3 categories: forward, backward, and optimizer update.
  3. Other time components: initialization time, checkpoint time, etc.
  4. Number and sizes of network primitive calls.
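As a rough illustration of the step-time breakdown in item 2 (this mirrors the idea only, not the Scanner's actual implementation), timestamps taken at phase boundaries are enough to attribute a step's wall time:

```python
import time

def timed_step(forward, backward, optimizer_step):
    """Run one training step and attribute wall time to each phase.

    `forward`, `backward`, and `optimizer_step` are callables standing in
    for the real training phases; the Scanner hooks these points inside
    HFTrainer rather than wrapping them explicitly like this.
    """
    t0 = time.perf_counter()
    forward()
    t1 = time.perf_counter()
    backward()
    t2 = time.perf_counter()
    optimizer_step()
    t3 = time.perf_counter()
    return {
        "forward": t1 - t0,
        "backward": t2 - t1,
        "optimizer": t3 - t2,
        "total": t3 - t0,
    }

# Simulated phases for demonstration.
report = timed_step(lambda: time.sleep(0.01),
                    lambda: time.sleep(0.02),
                    lambda: time.sleep(0.005))
```

Because every measurement is a pair of timestamps at an existing boundary, the cost of the measurement itself is negligible, which is what keeps the overhead near zero.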

Install

From PyPI:

pip install HFResourceScanner

From source:

pip install .

Usage

A 2-line change to your existing code:

  1. Import the Scanner:

     from HFResourceScanner import Scanner

  2. Create and add a Scanner object to the list of callbacks:

     ...
     callbacks.append(Scanner())
     ...
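To show where those two lines fit, here is a toy stand-in for HFTrainer-style callback dispatch (ToyTrainer and ToyScanner are illustrative only; the real Scanner implements the transformers TrainerCallback events and needs no changes to your training loop):

```python
class ToyTrainer:
    """Minimal stand-in for HFTrainer's callback dispatch loop."""
    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def train(self, num_steps=3):
        for step in range(num_steps):
            for cb in self.callbacks:
                cb.on_step_begin(step)
            # ... forward / backward / optimizer update happen here ...
            for cb in self.callbacks:
                cb.on_step_end(step)

class ToyScanner:
    """Stand-in for HFResourceScanner.Scanner: records the steps it saw."""
    def __init__(self):
        self.steps_seen = []
    def on_step_begin(self, step):
        pass
    def on_step_end(self, step):
        self.steps_seen.append(step)

callbacks = []
callbacks.append(ToyScanner())  # the one-line addition from the usage above
ToyTrainer(callbacks=callbacks).train(num_steps=3)
```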

In the default configuration, the Scanner prints its data to stdout.

Configuring

You can further configure the Scanner to:

  1. Choose the step to instrument and scan (only a single step is scanned). The default of step 5 rarely needs changing.
  2. Send output to stdout (the default), to a file, or to a callback function that handles it. See the examples in the examples/ folder.
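The callback-style output option amounts to supplying a callable that receives the collected metrics. A minimal sketch of such a consumer (the payload's key names here are illustrative, not the package's real schema; see the examples/ folder for the actual configuration):

```python
# A hypothetical output consumer: instead of printing to stdout, the
# Scanner hands its collected metrics to a user-supplied callable.
collected = []

def handle_scan(metrics):
    """Receive one step's scan results and keep (or forward) them."""
    collected.append(metrics)
    # Could equally persist to disk, push to a metrics server, etc.

# Simulated payload covering the categories the Scanner reports.
handle_scan({
    "memory": {"parameters": 1.2, "optimizer": 2.4,
               "gradients": 1.2, "activations": 0.8},
    "time": {"forward": 0.031, "backward": 0.062, "optimizer": 0.011},
})
```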

Methodology

Uses a combination of the following items:

  1. HFTrainer callbacks, to measure memory and its breakdown at step boundaries.
  2. PyTorch hook functions, such as nn.Module forward hooks and a hook on the optimizer.step function, to measure memory at the ideal locations.

Memory breakdown

It is important to note that this scanning happens for a single step only:

  1. At step start, set up the hook functions.
  2. During the step, run those functions to take single-point measurements.
  3. At step end, correlate the data and clean up the hook functions.
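The single-step lifecycle above can be sketched in plain Python (the hook-handle shape and the measurement function are stand-ins; the real Scanner registers PyTorch hooks and reads CUDA memory counters such as torch.cuda.memory_allocated):

```python
class HookHandle:
    """Mimics the handle returned by PyTorch's register_*_hook methods."""
    def __init__(self, registry, fn):
        self._registry, self._fn = registry, fn
    def remove(self):
        self._registry.remove(self._fn)

def scan_one_step(phases, measure):
    """Install measurement hooks, run one step, then clean them up.

    `phases` lists phase names in execution order; `measure` is a stand-in
    for reading a memory counter at that point in the step.
    """
    hooks = []                   # 1. at step start, set up hook functions
    samples = {}

    def record(phase):
        samples[phase] = measure(phase)

    handle = HookHandle(hooks, record)
    hooks.append(record)

    for phase in phases:         # 2. during the step, take point measurements
        for fn in hooks:
            fn(phase)

    handle.remove()              # 3. at step end, clean up the hooks
    assert not hooks
    return samples

samples = scan_one_step(
    ["forward", "backward", "optimizer"],
    measure=lambda phase: {"forward": 3.1,
                           "backward": 4.3,
                           "optimizer": 5.5}[phase])
```

Removing the hooks at step end is what restricts the (already small) measurement cost to that single step.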

Alternatives

  1. PyTorch Profiler: can give a complete breakdown of stack traces and memory consumption. While this is much more exhaustive and useful for optimizing implementations, it can be overwhelming for casual users. It also takes a non-trivial amount of time to compute memory allocations and is quite slow for larger models.

License

Apache 2.0
