Project description
Deep-NCCL-Wrapper
Deep-NCCL-Wrapper is a wrapper for Deep-NCCL, which provides optimized primitives for inter-GPU communication on Aliyun machines.
Introduction
Deep-NCCL is an AI-Accelerator communication framework for NVIDIA-NCCL. It implements optimized all-reduce, all-gather, reduce, broadcast, reduce-scatter, all-to-all, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on Aliyun machines using PCIe, NVLink, and NVSwitch, as well as networking using InfiniBand Verbs, eRDMA, or TCP/IP sockets.
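For readers less familiar with these collectives, the following sketch simulates what each operation computes, using plain Python lists to stand in for per-rank GPU buffers. This is illustrative semantics only, not the NCCL (or Deep-NCCL) API.

```python
# Each inner list is one rank's buffer; functions return the buffer
# every rank holds after the collective completes (sum as the reduce op).

def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' buffers."""
    summed = [sum(vals) for vals in zip(*per_rank)]
    return [summed[:] for _ in per_rank]

def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' buffers."""
    gathered = [x for buf in per_rank for x in buf]
    return [gathered[:] for _ in per_rank]

def reduce_scatter(per_rank):
    """Elementwise sum, then each rank keeps one equal-sized shard."""
    n = len(per_rank)
    summed = [sum(vals) for vals in zip(*per_rank)]
    shard = len(summed) // n
    return [summed[i * shard:(i + 1) * shard] for i in range(n)]

# Two ranks, two elements each:
ranks = [[1, 2], [3, 4]]
print(all_reduce(ranks))      # [[4, 6], [4, 6]]
print(all_gather(ranks))      # [[1, 2, 3, 4], [1, 2, 3, 4]]
print(reduce_scatter(ranks))  # [[4], [6]]
```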
Install
To install Deep-NCCL on the system, install the package as root using one of the following two methods:
- Method 1: rpm/deb (recommended)
# CentOS:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/rpm/deep-nccl-2.0.1.rpm
rpm -i deep-nccl-2.0.1.rpm
# Ubuntu:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/deb/deep-nccl-2.0.1.deb
dpkg -i deep-nccl-2.0.1.deb
- Method 2: python-pypi
pip install deep-nccl-wrapper
Usage
After installing the deep-nccl package, you do not need to change any code: existing NCCL applications pick up the optimizations as-is.
Environment
- AIACC_FASTTUNING: Enable fast tuning for LLMs; default=1 (enabled).
- NCCL_AIACC_ALLREDUCE_DISABLE: Disable the optimized allreduce algorithm; default=0 (enabled).
- NCCL_AIACC_ALLGATHER_DISABLE: Disable the optimized allgather algorithm; default=0 (enabled).
- NCCL_AIACC_REDUCE_SCATTER_DISABLE: Disable the optimized reduce_scatter algorithm; default=0 (enabled).
- AIACC_UPDATE_ALGO_DISABLE: Disable updating the aiacc nccl algorithm from aiacc-sql-server; default=0 (enabled).
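These variables are read from the environment at process launch. A minimal sketch of a launch environment that keeps fast tuning on (the default) while disabling only the optimized all-reduce path, using the variable names from the list above:

```shell
# Keep LLM fast tuning enabled (the default, shown here for clarity).
export AIACC_FASTTUNING=1
# Disable only the optimized all-reduce algorithm; all-gather and
# reduce-scatter optimizations remain active (0 = enabled).
export NCCL_AIACC_ALLREDUCE_DISABLE=1
export NCCL_AIACC_ALLGATHER_DISABLE=0
export NCCL_AIACC_REDUCE_SCATTER_DISABLE=0
```

Export these in the shell (or job script) that launches the training processes so every rank inherits the same settings.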
Performance
Deep-NCCL can speed up NCCL performance on Aliyun EGS (GPU) instances. For example, instance type 'ecs.ebmgn7ex.32xlarge' has 8x A100 GPUs and uses eRDMA networking.
GPU(EGS) | Collective | Nodes | Network | Speedup (nccl-tests)
---|---|---|---|---
A100 x 8 | all_gather | 2-10 | VPC/eRDMA | 30%+
A100 x 8 | reduce_scatter | 2-10 | VPC/eRDMA | 30%+
A100 x 8 | all_reduce | 2-10 | VPC/eRDMA | 20%
V100 x 8 | all_reduce | 2-20 | VPC | 60%+
A10 x 8 | all_reduce | 1 | - | 20%
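The speedup column compares nccl-tests bus bandwidth (busbw) with and without Deep-NCCL. A small helper showing how such a percentage is derived from two busbw readings; the bandwidth numbers below are made-up placeholders for illustration, not measured results:

```python
def speedup_percent(baseline_busbw, optimized_busbw):
    """Percent improvement in bus bandwidth (GB/s), as reported by nccl-tests."""
    return (optimized_busbw / baseline_busbw - 1.0) * 100.0

# Hypothetical busbw readings (GB/s), for illustration only:
print(round(speedup_percent(100.0, 130.0)))  # 30  -> a "30%+" table entry
print(round(speedup_percent(50.0, 60.0)))    # 20  -> a "20%" table entry
```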
Copyright
All source code and accompanying documentation are copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved. All modifications are copyright (c) 2020-2024, ALIYUN CORPORATION. All rights reserved.
File details
Details for the file deep_nccl_wrapper-1.0.2.tar.gz
File metadata
- Download URL: deep_nccl_wrapper-1.0.2.tar.gz
- Upload date:
- Size: 2.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 33df29b698ca222f3ec116e67461e28537caf0db654f5c54fbd302b190ef9794
MD5 | 8e4908b76360e97e33ef79f67b96e8fe
BLAKE2b-256 | 9a78a8b40add9cec1e73a75cdea0ee593d50eb292b1ef87d190078bb01a9dbff