GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
Meta GPU Cluster Monitoring (GCM)
GCM is a set of tools for at-scale monitoring of HPC (High-Performance Computing) clusters. It powers Meta FAIR (Fundamental AI Research) AI workloads across hundreds of thousands of GPUs.
GCM is a monorepo with the following components:
- Monitoring: Collects cluster statistics from the Slurm workload scheduler, providing visibility into job performance and resource utilization.
- Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
- Telemetry Processor / GPU Metrics: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata, enabling attribution of metrics (e.g., GPU utilization) to specific jobs and users.
For more information, check our documentation.
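To make the attribution idea behind the Telemetry Processor concrete, here is a minimal sketch of joining GPU metric samples with Slurm job metadata. It is not GCM's implementation: the `GpuSample` and `JobAllocation` types, the attribute names, and the (node, GPU index) join key are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class GpuSample:
    """One GPU metric reading (hypothetical shape, not a GCM type)."""
    node: str
    gpu_index: int
    utilization_pct: float


@dataclass(frozen=True)
class JobAllocation:
    """Slurm metadata for one job on one node (hypothetical shape)."""
    job_id: str
    user: str
    node: str
    gpu_indices: Tuple[int, ...]


def build_allocation_index(
    allocations: Iterable[JobAllocation],
) -> Dict[Tuple[str, int], JobAllocation]:
    """Map each (node, gpu_index) pair to the job that currently owns it."""
    index: Dict[Tuple[str, int], JobAllocation] = {}
    for alloc in allocations:
        for gpu in alloc.gpu_indices:
            index[(alloc.node, gpu)] = alloc
    return index


def attribute_sample(
    sample: GpuSample,
    index: Dict[Tuple[str, int], JobAllocation],
) -> Dict[str, object]:
    """Return the sample as a flat record, enriched with job and user if known."""
    record: Dict[str, object] = {
        "node": sample.node,
        "gpu_index": sample.gpu_index,
        "gpu_utilization_pct": sample.utilization_pct,
    }
    alloc = index.get((sample.node, sample.gpu_index))
    if alloc is not None:
        # Attribute names below are illustrative, not an established convention.
        record["slurm_job_id"] = alloc.job_id
        record["slurm_user"] = alloc.user
    return record
```

Samples from GPUs that no job owns pass through unchanged, so idle hardware stays visible in the metrics rather than being dropped.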
Contributing
Each component has its own README with detailed guides.
Possible Expansions
- Integration with more GPU types (AMD, Intel, Custom Accelerators)
- Support for additional schedulers beyond Slurm
- Additional Slurm-related monitoring
- Support for new exporters
- Support for Slurm REST API querying
- Support for new health checks
- Distribution via Docker Images and Helm Charts
Code of Conduct
Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the full text so that you can understand what actions will and will not be tolerated.
The Team
GPU Cluster Monitoring is actively maintained by Lucca Bertoncini, Caleb Ho, Apostolos Kokolis, Liao Hu, Thanh Nguyen, and Billy Campoli, with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): Jörg Doku, Vivian Peng, Parth Malani, Kalyan Saladi, Shubho Sengupta, Leo Huang, Robert Vincent, Max Wang, Sujit Verma, Teng Li, James Taylor, Xiaodong Ma, Chris Henry, Jakob Johnson, Kareem Sakher, Abinesh Ramakrishnan, Nabib Ahmed, Yong Li, Junjie Qian, David Watson, Guanyu Wu, Jaromir Latal, Samuel Doud, Yidi Wu, Xinyuan Zhang, Neha Saxena.
Feel free to contribute and add your name!
License
Each GCM component has its own license.
/gcm is licensed under the MIT License.
/shelper is licensed under the MIT License.
/slurmprocessor is licensed under the Apache 2.0 License.
Remaining files are licensed under the MIT License.
Download files
File details
Details for the file gpucm-0.0.2.tar.gz.
File metadata
- Download URL: gpucm-0.0.2.tar.gz
- Upload date:
- Size: 5.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98347a4ad91e74d2f96f861515b70880a61696a7daf1892364fa7535c84dc2a3 |
| MD5 | 2bb1ed8c2ed97425b0a46437fc4e96f6 |
| BLAKE2b-256 | 78a3c08600e0086b204bfc807bbca7c21963e4b86332a32910b4c220cb5b89c7 |
File details
Details for the file gpucm-0.0.2-py3-none-any.whl.
File metadata
- Download URL: gpucm-0.0.2-py3-none-any.whl
- Upload date:
- Size: 632.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dacc2dc9e8ce8a1ceb519c1ae2c8220a676970c623276f1ee967b3104992db88 |
| MD5 | 1bb7247c116e95e3d1e855098ce997d4 |
| BLAKE2b-256 | b6cd718a15de247d28b7ff6ac251c150320530a60c8c97bd4b0bef4314d81e6e |
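The published digests let you confirm that a downloaded file is intact before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_digest` are illustrative and not part of GCM.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the wheel was saved to the current directory, `matches_published_digest("gpucm-0.0.2-py3-none-any.whl", "dacc2dc9e8ce8a1ceb519c1ae2c8220a676970c623276f1ee967b3104992db88")` should return `True` for an unmodified download.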