Resource-Aware Data systems Tracker (radT) for automatically tracking and training machine learning software
radT
radT (Resource Aware Data science Tracker) is an extension to MLflow that simplifies the collection and exploration of hardware metrics for machine learning and deep learning applications. Collecting and processing all the required metrics for these workloads is usually a hassle. In contrast, radT is easy to deploy and use, with minimal impact on both performance and time investment. The radT codebase is documented and easy to extend.
This work has been published at the SIGMOD workshop DEEM 2023: https://itu-dasyalab.github.io/RAD/publication/papers/DEEM2023.pdf
pip install radt
Features
- Wide configuration support including collocation
- Track hardware and software metrics
- Handle continuous streams of data
- Support multiple visualization use-cases
- Filter large amounts of inconsequential data
- Minimal code impact
Sample usage & getting started
Replace python with radt in the command that launches your training script, e.g.:
>>> radt train.py --batch-size 256
or, when using virtual environments/conda:
>>> python -m radt train.py --batch-size 256
For a complete getting-started guide and examples, please visit the Examples.
Easy to use via automated tracking
radT will automatically track hardware metrics for your application. The listeners will start tracking your application on invocation.
As radT extends MLflow, you can use either radT's advanced tracking or MLflow itself to track software metrics (e.g. loss).
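As an illustration of tracking a software metric such as loss, the sketch below reports one value per epoch through a logger function passed in by the caller. The `train` and `record` names and the loss values are placeholders made up for this example; they are not part of radT. The logger is called the same way as `mlflow.log_metric` (key, value, optional step), so the stand-in can be swapped for the real function.

```python
from typing import Callable, List


def train(log_metric: Callable[..., None], epochs: int = 3) -> List[float]:
    """Toy training loop that reports one loss value per epoch."""
    losses = []
    for epoch in range(epochs):
        # Placeholder value standing in for a real model loss.
        loss = 1.0 / (epoch + 1)
        # Same call shape as mlflow.log_metric(key, value, step=...).
        log_metric("loss", loss, step=epoch)
        losses.append(loss)
    return losses


# Stand-in logger (hypothetical) so the sketch runs without MLflow installed.
recorded = []


def record(key, value, step=None):
    recorded.append((key, value, step))


train(record)
print(recorded)
```

In a training script, passing `mlflow.log_metric` instead of `record` should send the same values to MLflow, assuming an MLflow run is active when the loop executes.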
Advanced tracking options via context
If you want to have more control over what is logged, you can encapsulate your training loop in the RADT context:
from radtrun import RADT

with RADT as run:
    # training loop
    ...
CSV syntax for larger experiments
radT takes the hassle out of large experiments by training multiple models in succession. Models can also be trained at the same time, either on different GPUs or on the same GPU using a range of collocation schemes.
Experiment,Workload,Status,Run,Devices,Collocation,File,Listeners,Params
2,21,,,0,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
2,21,,,1,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
When interrupted for any reason, a CSV experiment can be rescheduled to continue from where it left off.
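To make the column layout concrete, the following sketch reads the sample experiment above with Python's standard csv module. It is a reader-side illustration, not radT's own parser. It assumes, based only on the sample rows, that listeners are joined with "+" and that Params holds a single key=value pair; how multiple parameters are separated is not shown here, so the sketch does not attempt to handle them. The function name `parse_experiment` is hypothetical.

```python
import csv
import io

SAMPLE = """\
Experiment,Workload,Status,Run,Devices,Collocation,File,Listeners,Params
2,21,,,0,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
2,21,,,1,-,../pytorch/cifar10.py,smi+top+dcgmi,batch-size=128
"""


def parse_experiment(text):
    """Return one dict per CSV row, with Listeners and Params split up."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: listeners are joined with "+", as in the sample.
        listeners = row["Listeners"]
        row["Listeners"] = listeners.split("+") if listeners else []
        # Assumption: Params holds a single key=value pair, as in the sample.
        key, _, value = row["Params"].partition("=")
        row["Params"] = {key: value} if key else {}
        rows.append(row)
    return rows


runs = parse_experiment(SAMPLE)
for run in runs:
    print(run["Devices"], run["File"], run["Listeners"], run["Params"])
```

The two sample rows differ only in the Devices column (0 and 1), which matches the description of training models at the same time on different GPUs.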
Supported platforms
- Linux
Contributors
Thank You!
Contributions are welcome. (Please add yourself to the list)