Skip to main content

A framework for deep learning energy measurement and optimization.

Project description

Zeus logo

Deep Learning Energy Measurement and Optimization

NSDI23 paper Docker Hub Slack workspace Homepage build Apache-2.0 License


Project News


Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.

Measuring GPU energy

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0,1,2,3])

monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")

print(f"Energy: {measurement.total_energy} J")
print(f"Time  : {measurement.time} s")

Finding the optimal GPU power limit

Zeus silently profiles different power limits during training and converges to the optimal one.

from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)

plo.on_epoch_begin()

for x, y in train_dataloader:
    plo.on_step_begin()
    # Learn from x and y!
    plo.on_step_end()

plo.on_epoch_end()

CLI power and energy monitor

$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})

Please refer to our NSDI’23 paper and slides for details. Checkout Overview for a summary.

Zeus is part of The ML.ENERGY Initiative.

Repository Organization

.
├── zeus/                # ⚡ Zeus Python package
│   ├── optimizer/       #    - GPU energy and time optimizers
│   ├── run/             #    - Tools for running Zeus on real training jobs
│   ├── policy/          #    - Optimization policies and extension interfaces
│   ├── util/            #    - Utility functions and classes
│   ├── monitor.py       #    - `ZeusMonitor`: Measure GPU time and energy of any code block
│   ├── controller.py    #    - Tools for controlling the flow of training
│   ├── callback.py      #    - Base class for Hugging Face-like training callbacks.
│   ├── simulate.py      #    - Tools for trace-driven simulation
│   ├── analyze.py       #    - Analysis functions for power logs
│   └── job.py           #    - Class for job specification
│
├── zeus_monitor/        # 🔌 GPU power monitor
│   ├── zemo/            #    -  A header-only library for querying NVML
│   └── main.cpp         #    -  Source code of the power monitor
│
├── examples/            # 🛠️ Examples of integrating Zeus
│
├── capriccio/           # 🌊 A drifting sentiment analysis dataset
│
└── trace/               # 🗃️ Train and power traces for various GPUs and DNNs

Getting Started

Refer to Getting started for complete instructions on environment setup, installation, and integration.

Docker image

We provide a Docker image fully equipped with all dependencies and environments. The only command you need is:

docker run -it \
    --gpus all                  `# Mount all GPUs` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --ipc host                  `# PyTorch DataLoader workers need enough shm` \
    symbioticlab/zeus:latest \
    bash

Refer to Environment setup for details.

Examples

We provide working examples for integrating and running Zeus in the examples/ directory.

Extending Zeus

You can easily implement custom policies for batch size and power limit optimization and plug it into Zeus.

Refer to Extending Zeus for details.

Carbon-Aware Zeus

The use of GPUs for training DNNs results in high carbon emissions and energy consumption. Building on top of Zeus, we introduce Chase -- a carbon-aware solution. Chase dynamically controls the energy consumption of GPUs; adapts to shifts in carbon intensity during DNN training, reducing carbon footprint with minimal compromises on training performance. To proactively adapt to shifting carbon intensity, a lightweight machine learning algorithm is used to forecast the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our paper and the chase branch.

Citation

@inproceedings{zeus-nsdi23,
    title     = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
    author    = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
    booktitle = {USENIX NSDI},
    year      = {2023}
}

Contact

Jae-Won Chung (jwnchung@umich.edu)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zeus-ml-0.7.0.tar.gz (115.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

zeus_ml-0.7.0-py3-none-any.whl (159.5 kB view details)

Uploaded Python 3

File details

Details for the file zeus-ml-0.7.0.tar.gz.

File metadata

  • Download URL: zeus-ml-0.7.0.tar.gz
  • Upload date:
  • Size: 115.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for zeus-ml-0.7.0.tar.gz
Algorithm Hash digest
SHA256 24b8613113c21875649fc6d8802da3dc579b94e2958caccb9724141c1d13e0a9
MD5 c24f44dc3e7551b211f4dd733279093c
BLAKE2b-256 1803d27a0dd6c19a7156f0a956d74bc53eef852db637aced23a00279189ba013

See more details on using hashes here.

File details

Details for the file zeus_ml-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: zeus_ml-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 159.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for zeus_ml-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 40d2ffdac110eda651a681b939e5caea2c7815a04c72d30dce5c2614116c42ed
MD5 965ff21e1d873bcdf6ef3fa7b43e44b4
BLAKE2b-256 e60090c8b0375adc2107989531823e40001b806cb8faea98657e0604970f5b21

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page