A framework for deep learning energy measurement and optimization.
Deep Learning Energy Measurement and Optimization
Project News ⚡
- [2023/12] The preprint of the Perseus paper is out here!
- [2023/10] We released Perseus, an energy optimizer for large model training. Get started here!
- [2023/09] We moved under the ml-energy organization! Please stay tuned for new exciting projects!
- [2023/07] ZeusMonitor was used to profile GPU time and energy consumption for the ML.ENERGY leaderboard & Colosseum.
- [2023/03] Chase, an automatic carbon optimization framework for DNN training, will appear at the ICLR'23 workshop.
- [2022/11] Carbon-Aware Zeus won the second overall best solution award at Carbon Hack 22.
Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.
Measuring GPU energy
from zeus.monitor import ZeusMonitor
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")
print(f"Energy: {measurement.total_energy} J")
print(f"Time : {measurement.time} s")
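The returned measurement also carries a per-GPU energy breakdown. As a small sketch (assuming the object exposes an `energy` dict mapping GPU index to joules, matching the `Measurement(...)` repr shown in the CLI energy monitor output below; the numbers here are borrowed from that output for illustration), you can derive total energy and average power per GPU:

```python
# Assumption: the Measurement object exposes an `energy` dict mapping
# GPU index to joules, as in the Measurement(...) repr shown in the
# CLI energy monitor output in this README.

def summarize(energy_j: dict, elapsed_s: float) -> dict:
    """Return average power draw (W) per GPU from an energy breakdown."""
    return {gpu: e / elapsed_s for gpu, e in energy_j.items()}

# Illustrative numbers borrowed from the CLI output below.
energy = {0: 224.297, 1: 232.838, 2: 233.31, 3: 234.537}
avg_power_w = summarize(energy, 3.448)
total_energy_j = sum(energy.values())
```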
Finding the optimal GPU power limit
Zeus silently profiles different power limits during training and converges to the optimal one.
from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)
plo.on_epoch_begin()
for x, y in train_dataloader:
plo.on_step_begin()
# Learn from x and y!
plo.on_step_end()
plo.on_epoch_end()
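Conceptually, the optimizer profiles each candidate power limit and keeps the one that minimizes a weighted energy-time cost (the NSDI'23 paper describes the actual metric and profiling procedure). A simplified, self-contained sketch of that selection step, with made-up per-step profiling numbers:

```python
def zeus_cost(energy_j: float, time_s: float, eta: float, max_power_w: float) -> float:
    """Weighted energy-time cost: eta * energy + (1 - eta) * max_power * time.
    (Simplified sketch of the tradeoff described in the Zeus paper.)"""
    return eta * energy_j + (1 - eta) * max_power_w * time_s

# Hypothetical profiling results: power limit (W) -> (J/step, s/step).
profile = {
    300: (90.0, 0.30),
    250: (80.0, 0.32),
    200: (72.0, 0.36),
    150: (70.0, 0.45),
}
eta, max_power = 0.5, 300
best_power_limit = min(profile, key=lambda pl: zeus_cost(*profile[pl], eta, max_power))
```

Lower power limits cut energy per step but stretch step time, so neither extreme is optimal; the weighted cost captures that tradeoff.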
CLI power and energy monitor
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
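The per-GPU energy totals above are consistent with integrating each GPU's power samples over time. A pure-Python sketch of that relationship using the trapezoidal rule (the sample values below are illustrative, loosely following the ~1 Hz log above):

```python
def integrate_power(samples: list) -> float:
    """Approximate energy (J) from (timestamp_s, power_w) samples
    using the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += (p0 + p1) / 2 * (t1 - t0)
    return energy

# Illustrative ~1 Hz samples for one GPU, similar to the log above.
samples = [(0.0, 66.176), (1.042, 66.078), (2.045, 66.078), (3.048, 66.177)]
energy_j = integrate_power(samples)
```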
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
Please refer to our NSDI’23 paper and slides for details. Check out the Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
Repository Organization
.
├── zeus/ # ⚡ Zeus Python package
│ ├── optimizer/ # - GPU energy and time optimizers
│ ├── run/ # - Tools for running Zeus on real training jobs
│ ├── policy/ # - Optimization policies and extension interfaces
│ ├── util/ # - Utility functions and classes
│ ├── monitor.py # - `ZeusMonitor`: Measure GPU time and energy of any code block
│ ├── controller.py # - Tools for controlling the flow of training
│ ├── callback.py # - Base class for Hugging Face-like training callbacks.
│ ├── simulate.py # - Tools for trace-driven simulation
│ ├── analyze.py # - Analysis functions for power logs
│ └── job.py # - Class for job specification
│
├── zeus_monitor/ # 🔌 GPU power monitor
│ ├── zemo/ # - A header-only library for querying NVML
│ └── main.cpp # - Source code of the power monitor
│
├── examples/ # 🛠️ Examples of integrating Zeus
│
├── capriccio/ # 🌊 A drifting sentiment analysis dataset
│
└── trace/ # 🗃️ Train and power traces for various GPUs and DNNs
Getting Started
Refer to Getting started for complete instructions on environment setup, installation, and integration.
Docker image
We provide a Docker image fully equipped with all dependencies and environments. The only command you need is:
docker run -it \
--gpus all `# Mount all GPUs` \
--cap-add SYS_ADMIN `# Needed to change the power limit of the GPU` \
--ipc host `# PyTorch DataLoader workers need enough shm` \
mlenergy/zeus:latest \
bash
Refer to Environment setup for details.
Examples
We provide working examples for integrating and running Zeus in the examples/ directory.
Extending Zeus
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.
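The actual extension interfaces live under zeus/policy/. As an illustration only (the class and method names below are hypothetical, not the real Zeus API), a power limit policy essentially maps profiling observations to the next power limit to try:

```python
class GreedyPowerLimitPolicy:
    """Hypothetical policy sketch: try each candidate power limit once,
    then stick with the one that observed the lowest cost.
    (Illustrative only; see zeus/policy/ for the real interfaces.)"""

    def __init__(self, candidates: list) -> None:
        self.untried = list(candidates)
        self.observed: dict = {}

    def next_power_limit(self) -> int:
        """Return the power limit to use for the next profiling window."""
        if self.untried:
            return self.untried[0]
        return min(self.observed, key=self.observed.get)

    def observe(self, power_limit: int, cost: float) -> None:
        """Record the cost measured under `power_limit`."""
        self.observed[power_limit] = cost
        if self.untried and self.untried[0] == power_limit:
            self.untried.pop(0)
```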
Carbon-Aware Zeus
The use of GPUs for training DNNs results in high carbon emissions and energy consumption. Building on top of Zeus, we introduce Chase -- a carbon-aware solution. Chase dynamically controls the energy consumption of GPUs and adapts to shifts in carbon intensity during DNN training, reducing the carbon footprint with minimal compromise on training performance. To proactively adapt to shifting carbon intensity, a lightweight machine learning algorithm forecasts the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our paper and the chase branch.
Citation
@inproceedings{zeus-nsdi23,
title = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
author = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
booktitle = {USENIX NSDI},
year = {2023}
}
Contact
Jae-Won Chung (jwnchung@umich.edu)