
Project description

Deep-NCCL-Wrapper

Deep-NCCL-Wrapper is a Python wrapper for Deep-NCCL, which provides optimized primitives for inter-GPU communication on Aliyun machines.

Introduction

Deep-NCCL is an AI-Accelerator communication framework for NVIDIA NCCL. It implements optimized all-reduce, all-gather, reduce, broadcast, reduce-scatter, and all-to-all collectives, as well as any send/receive-based communication pattern. It has been optimized to achieve high bandwidth on Aliyun machines using PCIe, NVLink, and NVSwitch, as well as networking over InfiniBand Verbs, eRDMA, or TCP/IP sockets.

Install

To install Deep-NCCL on the system, download the package and install it as root using one of the following two methods:

  • Method 1: rpm/deb (recommended)
# CentOS:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/rpm/deep-nccl-2.0.1.rpm
rpm -i deep-nccl-2.0.1.rpm
# Ubuntu:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/deb/deep-nccl-2.0.1.deb
dpkg -i deep-nccl-2.0.1.deb
  • Method 2: Python package (PyPI)
pip install deep-nccl-wrapper

Usage

After installing the deep-nccl package, no changes to application code are required; NCCL collectives are accelerated transparently, as sketched below.
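For example, an existing distributed PyTorch job launches exactly as before (a minimal sketch; torchrun and train.py are illustrative placeholders, not part of this package):

# Launch an existing 8-GPU job unchanged; NCCL collectives issued by the
# job are handled by Deep-NCCL automatically.
torchrun --nproc_per_node=8 train.py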

Environment

  • AIACC_FASTTUNING: enable fast tuning for LLMs; default=1 (enabled).
  • NCCL_AIACC_ALLREDUCE_DISABLE: disable the optimized allreduce algorithm; default=0 (enabled).
  • NCCL_AIACC_ALLGATHER_DISABLE: disable the optimized allgather algorithm; default=0 (enabled).
  • NCCL_AIACC_REDUCE_SCATTER_DISABLE: disable the optimized reduce_scatter algorithm; default=0 (enabled).
  • AIACC_UPDATE_ALGO_DISABLE: disable updating the aiacc nccl algorithms from aiacc-sql-server; default=0 (updates enabled).
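For example, to override one of these variables for a single run (a minimal sketch using a variable from the list above):

# Fall back to stock NCCL allreduce for this shell session while keeping
# the other optimizations, then launch the job as usual:
export NCCL_AIACC_ALLREDUCE_DISABLE=1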

Performance

Deep-NCCL can speed up NCCL performance on Aliyun EGS (GPU) instances; for example, the instance type ecs.ebmgn7ex.32xlarge provides 8x A100 GPUs with eRDMA networking. The speedups below were measured with nccl-tests; a typical invocation is shown after the table.

GPU (EGS)   Collective       Nodes   Network     Speedup (nccl-tests)
A100 x 8    all_gather       2-10    VPC/eRDMA   30%+
A100 x 8    reduce_scatter   2-10    VPC/eRDMA   30%+
A100 x 8    all_reduce       2-10    VPC/eRDMA   20%
V100 x 8    all_reduce       2-20    VPC         60%+
A10 x 8     all_reduce       1       -           20%
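A typical single-node nccl-tests run looks like the following (a sketch assuming the nccl-tests binaries have been built under ./build):

# Benchmark all_reduce on 8 GPUs, sweeping message sizes from 8 B to 128 MB:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8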

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved. All modifications are copyright (c) 2020-2024, ALIYUN CORPORATION. All rights reserved.

Download files

Download the file for your platform.

Source Distribution

deep_nccl_wrapper-1.0.2.tar.gz (2.9 kB)


File details

Details for the file deep_nccl_wrapper-1.0.2.tar.gz.

File metadata

  • Download URL: deep_nccl_wrapper-1.0.2.tar.gz
  • Upload date:
  • Size: 2.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for deep_nccl_wrapper-1.0.2.tar.gz
Algorithm     Hash digest
SHA256        33df29b698ca222f3ec116e67461e28537caf0db654f5c54fbd302b190ef9794
MD5           8e4908b76360e97e33ef79f67b96e8fe
BLAKE2b-256   9a78a8b40add9cec1e73a75cdea0ee593d50eb292b1ef87d190078bb01a9dbff
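To verify a downloaded file against the SHA256 digest above (standard coreutils; compare the printed digest with the table):

# Print the SHA256 digest of the sdist and compare it with the value above:
sha256sum deep_nccl_wrapper-1.0.2.tar.gz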

