Resource-Aware Data systems Tracker (radT) for automatically tracking and training machine learning software

Project description

radT

radT (Resource Aware Data science Tracker) is an extension to MLFlow that simplifies the collection and exploration of hardware metrics for machine learning and deep learning applications. Collecting and processing all the required metrics for these workloads is usually a hassle. In contrast, RADT is easy to deploy and use, with minimal overhead in both runtime performance and developer time. The RADT codebase is documented and easy to extend.

This work has been published at the SIGMOD workshop DEEM 2023: https://itu-dasyalab.github.io/RAD/publication/papers/DEEM2023.pdf

pip install radt

Features

  • Wide configuration support including collocation
  • Track hardware and software metrics
  • Handle continuous streams of data
  • Support multiple visualization use-cases
  • Filter large amounts of inconsequential data
  • Minimal code impact

Sample usage & getting started

Replace python with radt when launching your training script, e.g.:

>>> radt train.py --batch-size 256

or, when using virtual environments/conda:

>>> python -m radt train.py --batch-size 256

For a complete getting started guide and examples please visit the Examples.

Easy to use via automated tracking

radT automatically tracks hardware metrics for your application: the listeners start collecting metrics as soon as your application is invoked.

As radT extends MLFlow, you can either use the advanced tracking or use MLFlow to track software metrics (e.g. loss).
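
For example, software metrics can be logged with the standard MLflow API while radT handles the hardware side. A minimal sketch, assuming the script is launched via radt; the parameter and metric values are illustrative:

import mlflow

# Software metrics go through the regular MLflow API; radT collects
# hardware metrics alongside them. Values below are placeholders.
mlflow.log_param("batch_size", 256)
for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # placeholder for the real training loss
    mlflow.log_metric("loss", loss, step=epoch)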

Advanced tracking options via context

If you want to have more control over what is logged, you can encapsulate your training loop in the RADT context:

from radt.run import RADT

with RADT() as run:
    # training loop
    ...
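
A minimal sketch of what the encapsulated loop could look like; the import path follows the snippet above, and the loop body and metric values are illustrative:

from radt.run import RADT
import mlflow

with RADT() as run:
    for epoch in range(5):
        loss = 1.0 / (epoch + 1)  # placeholder for the real training loss
        mlflow.log_metric("loss", loss, step=epoch)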

CSV syntax for larger experiments

RADT takes the hassle out of large experiments by training multiple models in succession. Models can even be trained at the same time on different GPUs, or collocated on the same GPU using a range of collocation schemes, as shown in the schedule below.

Experiment,Workload,Status,Run,Devices,Collocation,File,Listeners,Params
2,21,,,0,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
2,21,,,1,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
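
Since the schedule is a plain CSV file, it can also be generated programmatically. A minimal sketch using Python's csv module, assuming the column order shown above; the file paths and parameters are illustrative:

import csv

# Columns as in the example schedule above; row values are illustrative.
columns = ["Experiment", "Workload", "Status", "Run", "Devices",
           "Collocation", "File", "Listeners", "Params"]
rows = [
    [2, 21, "", "", 0, "-", "../pytorch/cifar10.py", "smi+top+dcgmi", "batch-size=128"],
    [2, 21, "", "", 1, "-", "../pytorch/cifar10.py", "smi+top+dcgmi", "batch-size=128"],
]

with open("experiment.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)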

If interrupted for any reason, a CSV experiment can be rescheduled to continue from where it left off.

Supported platforms

  • Linux

Contributors

Thank You!

Contributions are welcome. (Please add yourself to the list)

Download files

Source Distribution

radt-0.1.3.tar.gz (10.1 MB)

Built Distribution

radt-0.1.3-py2.py3-none-any.whl (8.7 kB)
