A framework for deep learning energy measurement and optimization.
Deep Learning Energy Measurement and Optimization
Project News ⚡
- [2023/12] The preprint of the Perseus paper is out here!
- [2023/10] We released Perseus, an energy optimizer for large model training. Get started here!
- [2023/09] We moved under the ml-energy organization! Please stay tuned for new exciting projects!
- [2023/07] ZeusMonitor was used to profile GPU time and energy consumption for the ML.ENERGY leaderboard & Colosseum.
- [2023/03] Chase, an automatic carbon optimization framework for DNN training, will appear at the ICLR'23 workshop.
- [2022/11] Carbon-Aware Zeus won the second overall best solution award at Carbon Hack 22.
Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.
Measuring GPU energy
from zeus.monitor import ZeusMonitor
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")
print(f"Energy: {measurement.total_energy} J")
print(f"Time : {measurement.time} s")
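The returned measurement also carries a per-GPU energy breakdown. As a small sketch (assuming the object exposes an `energy` dict mapping GPU index to joules, matching the `Measurement(...)` repr shown in the CLI energy monitor output below; the numbers here are borrowed from that output for illustration), you can derive total energy and average power per GPU:

```python
# Assumption: the Measurement object exposes an `energy` dict mapping
# GPU index to joules, as in the Measurement(...) repr shown in the
# CLI energy monitor output in this README.

def summarize(energy_j: dict, elapsed_s: float) -> dict:
    """Return average power draw (W) per GPU from an energy breakdown."""
    return {gpu: e / elapsed_s for gpu, e in energy_j.items()}

# Illustrative numbers borrowed from the CLI output below.
energy = {0: 224.297, 1: 232.838, 2: 233.31, 3: 234.537}
avg_power_w = summarize(energy, 3.448)
total_energy_j = sum(energy.values())
```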
Finding the optimal GPU power limit
Zeus silently profiles different power limits during training and converges to the optimal one.
from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer
monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)
plo.on_epoch_begin()
for x, y in train_dataloader:
plo.on_step_begin()
# Learn from x and y!
plo.on_step_end()
plo.on_epoch_end()
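Conceptually, the optimizer profiles each candidate power limit and keeps the one that minimizes a weighted energy-time cost (the NSDI'23 paper describes the actual metric and profiling procedure). A simplified, self-contained sketch of that selection step, with made-up per-step profiling numbers:

```python
def zeus_cost(energy_j: float, time_s: float, eta: float, max_power_w: float) -> float:
    """Weighted energy-time cost: eta * energy + (1 - eta) * max_power * time.
    (Simplified sketch of the tradeoff described in the Zeus paper.)"""
    return eta * energy_j + (1 - eta) * max_power_w * time_s

# Hypothetical profiling results: power limit (W) -> (J/step, s/step).
profile = {
    300: (90.0, 0.30),
    250: (80.0, 0.32),
    200: (72.0, 0.36),
    150: (70.0, 0.45),
}
eta, max_power = 0.5, 300
best_power_limit = min(profile, key=lambda pl: zeus_cost(*profile[pl], eta, max_power))
```

Lower power limits cut energy per step but stretch step time, so neither extreme is optimal; the weighted cost captures that tradeoff.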
CLI power and energy monitor
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
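The per-GPU energy totals above are consistent with integrating each GPU's power samples over time. A pure-Python sketch of that relationship using the trapezoidal rule (the sample values below are illustrative, loosely following the ~1 Hz log above):

```python
def integrate_power(samples: list) -> float:
    """Approximate energy (J) from (timestamp_s, power_w) samples
    using the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += (p0 + p1) / 2 * (t1 - t0)
    return energy

# Illustrative ~1 Hz samples for one GPU, similar to the log above.
samples = [(0.0, 66.176), (1.042, 66.078), (2.045, 66.078), (3.048, 66.177)]
energy_j = integrate_power(samples)
```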
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
Please refer to our NSDI’23 paper and slides for details. Check out the Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
Repository Organization
.
├── zeus/ # ⚡ Zeus Python package
│ ├── optimizer/ # - GPU energy and time optimizers
│ ├── run/ # - Tools for running Zeus on real training jobs
│ ├── policy/ # - Optimization policies and extension interfaces
│ ├── util/ # - Utility functions and classes
│ ├── monitor.py # - `ZeusMonitor`: Measure GPU time and energy of any code block
│ ├── controller.py # - Tools for controlling the flow of training
│ ├── callback.py # - Base class for Hugging Face-like training callbacks.
│ ├── simulate.py # - Tools for trace-driven simulation
│ ├── analyze.py # - Analysis functions for power logs
│ └── job.py # - Class for job specification
│
├── zeus_monitor/ # 🔌 GPU power monitor
│ ├── zemo/ # - A header-only library for querying NVML
│ └── main.cpp # - Source code of the power monitor
│
├── examples/ # 🛠️ Examples of integrating Zeus
│
├── capriccio/ # 🌊 A drifting sentiment analysis dataset
│
└── trace/ # 🗃️ Train and power traces for various GPUs and DNNs
Getting Started
Refer to Getting started for complete instructions on environment setup, installation, and integration.
Docker image
We provide a Docker image fully equipped with all dependencies and environments. The only command you need is:
docker run -it \
--gpus all `# Mount all GPUs` \
--cap-add SYS_ADMIN `# Needed to change the power limit of the GPU` \
--ipc host `# PyTorch DataLoader workers need enough shm` \
mlenergy/zeus:latest \
bash
Refer to Environment setup for details.
Examples
We provide working examples for integrating and running Zeus in the examples/ directory.
Extending Zeus
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.
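The actual extension interfaces live under zeus/policy/. As an illustration only (the class and method names below are hypothetical, not the real Zeus API), a power limit policy essentially maps profiling observations to the next power limit to try:

```python
class GreedyPowerLimitPolicy:
    """Hypothetical policy sketch: try each candidate power limit once,
    then stick with the one that observed the lowest cost.
    (Illustrative only; see zeus/policy/ for the real interfaces.)"""

    def __init__(self, candidates: list) -> None:
        self.untried = list(candidates)
        self.observed: dict = {}

    def next_power_limit(self) -> int:
        """Return the power limit to use for the next profiling window."""
        if self.untried:
            return self.untried[0]
        return min(self.observed, key=self.observed.get)

    def observe(self, power_limit: int, cost: float) -> None:
        """Record the cost measured under `power_limit`."""
        self.observed[power_limit] = cost
        if self.untried and self.untried[0] == power_limit:
            self.untried.pop(0)
```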
Carbon-Aware Zeus
The use of GPUs for training DNNs results in high carbon emissions and energy consumption. Building on top of Zeus, we introduce Chase -- a carbon-aware solution. Chase dynamically controls the energy consumption of GPUs and adapts to shifts in carbon intensity during DNN training, reducing the carbon footprint with minimal compromise on training performance. To proactively adapt to shifting carbon intensity, a lightweight machine learning algorithm forecasts the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our paper and the chase branch.
Citation
@inproceedings{zeus-nsdi23,
title = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
author = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
booktitle = {USENIX NSDI},
year = {2023}
}
Contact
Jae-Won Chung (jwnchung@umich.edu)