
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring

Project description

Meta GPU Cluster Monitoring (GCM)

GCM is a set of tools for at-scale monitoring of HPC (High-Performance Computing) clusters. It powers Meta FAIR (Fundamental AI Research) AI workloads across hundreds of thousands of GPUs at Meta.

GCM is a monorepo with the following components:

  • Monitoring: Collects cluster statistics from the Slurm workload scheduler, providing visibility into job performance and resource utilization.
  • Health Checks: Verifies the proper functioning of hardware, software, network, storage, and services throughout the job lifecycle.
  • Telemetry Processor / GPU Metrics: Enhances OpenTelemetry data by correlating telemetry with Slurm metadata, enabling attribution of metrics (e.g., GPU utilization) to specific jobs and users.
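
The attribution idea in the last bullet can be illustrated with a minimal sketch. This is not GCM's implementation or API: the `GpuSample` and `SlurmAllocation` types, their fields, and the `attribute_samples` function are hypothetical names chosen for illustration, and the sketch assumes each GPU on a host is allocated to at most one job at a time. It joins per-GPU samples with Slurm allocation metadata on (hostname, GPU index) so that each metric can be tagged with a job and user.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class GpuSample:
    """One telemetry data point (hypothetical shape, not a GCM type)."""
    hostname: str
    gpu_index: int
    utilization_pct: float


@dataclass(frozen=True)
class SlurmAllocation:
    """Scheduler metadata for one job on one host (hypothetical shape)."""
    job_id: int
    user: str
    hostname: str
    gpu_indices: FrozenSet[int]


def build_index(
    allocations: Iterable[SlurmAllocation],
) -> Dict[Tuple[str, int], SlurmAllocation]:
    # Map every (hostname, gpu_index) pair to the allocation that owns it.
    # Assumption: a GPU belongs to at most one job at a time.
    index: Dict[Tuple[str, int], SlurmAllocation] = {}
    for alloc in allocations:
        for gpu in alloc.gpu_indices:
            index[(alloc.hostname, gpu)] = alloc
    return index


def attribute_samples(
    samples: Iterable[GpuSample],
    allocations: Iterable[SlurmAllocation],
) -> List[dict]:
    # Enrich each sample with job_id and user; unallocated GPUs get None.
    index = build_index(allocations)
    enriched: List[dict] = []
    for sample in samples:
        alloc: Optional[SlurmAllocation] = index.get(
            (sample.hostname, sample.gpu_index)
        )
        enriched.append(
            {
                "hostname": sample.hostname,
                "gpu_index": sample.gpu_index,
                "utilization_pct": sample.utilization_pct,
                "job_id": alloc.job_id if alloc else None,
                "user": alloc.user if alloc else None,
            }
        )
    return enriched
```

Samples from GPUs with no matching allocation are kept with `job_id` and `user` set to `None`, so idle hardware remains visible rather than being dropped.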

For more information, check our documentation.

Contributing

Each component has its own README with detailed guides.

Possible Expansions

Code of Conduct

Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the full text so that you can understand what actions will and will not be tolerated.

The Team

GPU Cluster Monitoring is actively maintained by Lucca Bertoncini, Caleb Ho, Apostolos Kokolis, Liao Hu, Thanh Nguyen, and Billy Campoli, with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): Jörg Doku, Vivian Peng, Parth Malani, Kalyan Saladi, Shubho Sengupta, Leo Huang, Robert Vincent, Max Wang, Sujit Verma, Teng Li, James Taylor, Xiaodong Ma, Chris Henry, Jakob Johnson, Kareem Sakher, Abinesh Ramakrishnan, Nabib Ahmed, Yong Li, Junjie Qian, David Watson, Guanyu Wu, Jaromir Latal, Samuel Doud, Yidi Wu, Xinyuan Zhang, and Neha Saxena.

Feel free to contribute and add your name!

License

Each GCM component has its own license.

/gcm is licensed under the MIT License.

/shelper is licensed under the MIT License.

/slurmprocessor is licensed under the Apache 2.0 License.

Remaining files are licensed under the MIT License.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gpucm-0.0.2.tar.gz (5.5 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gpucm-0.0.2-py3-none-any.whl (632.7 kB)

Uploaded Python 3

File details

Details for the file gpucm-0.0.2.tar.gz.

File metadata

  • Download URL: gpucm-0.0.2.tar.gz
  • Upload date:
  • Size: 5.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for gpucm-0.0.2.tar.gz
  • SHA256: 98347a4ad91e74d2f96f861515b70880a61696a7daf1892364fa7535c84dc2a3
  • MD5: 2bb1ed8c2ed97425b0a46437fc4e96f6
  • BLAKE2b-256: 78a3c08600e0086b204bfc807bbca7c21963e4b86332a32910b4c220cb5b89c7

See more details on using hashes here.
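
As a minimal sketch of how a downloaded file could be checked against the SHA256 digest listed above, the following uses only Python's standard `hashlib` module. The helper names `sha256_of_file` and `verify_sha256` and the constant `EXPECTED_SDIST_SHA256` are illustrative and not part of GCM; the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digest published above for gpucm-0.0.2.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "98347a4ad91e74d2f96f861515b70880a61696a7daf1892364fa7535c84dc2a3"
)


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    # Hash the file in fixed-size chunks so large archives are not
    # loaded into memory all at once.
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Union[str, Path], expected_hex: str) -> bool:
    # Compare case-insensitively, since hex digests may be written
    # in either upper or lower case.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

With the source distribution in the current directory, `verify_sha256("gpucm-0.0.2.tar.gz", EXPECTED_SDIST_SHA256)` returns `True` only if the local file matches the published digest. The same helper applies to the wheel, using the digest listed for that file.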

File details

Details for the file gpucm-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: gpucm-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 632.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for gpucm-0.0.2-py3-none-any.whl
  • SHA256: dacc2dc9e8ce8a1ceb519c1ae2c8220a676970c623276f1ee967b3104992db88
  • MD5: 1bb7247c116e95e3d1e855098ce997d4
  • BLAKE2b-256: b6cd718a15de247d28b7ff6ac251c150320530a60c8c97bd4b0bef4314d81e6e

