
Deep-NCCL-Wrapper

Deep-NCCL-Wrapper is a wrapper for Deep-NCCL, which provides optimized primitives for inter-GPU communication on Aliyun machines.

Introduction

Deep-NCCL is an AI-Accelerator communication framework for NVIDIA NCCL. It implements optimized all-reduce, all-gather, reduce, broadcast, reduce-scatter, and all-to-all, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on Aliyun machines using PCIe, NVLink, and NVSwitch, as well as networking using InfiniBand Verbs, eRDMA, or TCP/IP sockets.

Install

To install Deep-NCCL on the system, install a package as root using one of the following two methods:

Method 1: rpm/deb (recommended)

# CentOS:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/rpm/deep-nccl-2.0.1.rpm
rpm -i deep-nccl-2.0.1.rpm
# Ubuntu:
wget https://aiacc.oss-accelerate.aliyuncs.com/nccl/deb/deep-nccl-2.0.1.deb
dpkg -i deep-nccl-2.0.1.deb

Method 2: PyPI (pip)

pip install deep-nccl-wrapper
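To confirm that the pip install succeeded, one option is to query the installed distribution metadata from Python. The following is a minimal sketch using the standard library; the helper name `installed_version` is hypothetical, and the check only covers the pip package, not the rpm/deb packages from Method 1.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("deep-nccl-wrapper")
    if version is None:
        print("deep-nccl-wrapper is not installed in this environment")
    else:
        print(f"deep-nccl-wrapper {version} is installed")
```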

Usage

After installing the deep-nccl package, no code changes are needed.
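As an illustration of code that needs no changes, here is a minimal sketch of an ordinary NCCL all-reduce check written with PyTorch. It uses only standard `torch.distributed` calls and contains nothing specific to Deep-NCCL. It assumes PyTorch with CUDA is installed, that the script is launched with `torchrun`, and that the framework picks up the installed Deep-NCCL library; none of these are stated in this README. The helper name `expected_allreduce_sum` is hypothetical.

```python
import os


def expected_allreduce_sum(world_size):
    """Sum of ranks 0..world_size-1: the value each element holds after the all-reduce."""
    return world_size * (world_size - 1) // 2


def main():
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes a tensor filled with its own rank number.
    x = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    assert x[0].item() == expected_allreduce_sum(world_size)
    if rank == 0:
        print(f"all_reduce OK across {world_size} ranks")
    dist.destroy_process_group()


# Run the check only when launched by torchrun, which sets LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Saved under a hypothetical name such as allreduce_check.py, it could be launched on one 8-GPU node with `torchrun --nproc_per_node=8 allreduce_check.py`.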

Environment

AIACC_FASTTUNING: Enables Fasttuning for LLMs. Default: 1 (enabled).
NCCL_AIACC_ALLREDUCE_DISABLE: Disables the AIACC allreduce algorithm. Default: 0 (algorithm enabled).
NCCL_AIACC_ALLGATHER_DISABLE: Disables the AIACC allgather algorithm. Default: 0 (algorithm enabled).
NCCL_AIACC_REDUCE_SCATTER_DISABLE: Disables the AIACC reduce_scatter algorithm. Default: 0 (algorithm enabled).
AIACC_UPDATE_ALGO_DISABLE: Disables updating the AIACC NCCL algorithm from aiacc-sql-server. Default: 0 (updates enabled).
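One way to apply these settings is to build the environment for a training process before launching it. The sketch below uses the variable names and defaults listed above; the helper `build_env` is hypothetical, and it assumes that only the values 0 and 1 are meaningful and that the variables must be set before the process initializes NCCL, neither of which is stated here.

```python
import os

# Variable names and default values as listed in the Environment section.
DEEP_NCCL_DEFAULTS = {
    "AIACC_FASTTUNING": "1",
    "NCCL_AIACC_ALLREDUCE_DISABLE": "0",
    "NCCL_AIACC_ALLGATHER_DISABLE": "0",
    "NCCL_AIACC_REDUCE_SCATTER_DISABLE": "0",
    "AIACC_UPDATE_ALGO_DISABLE": "0",
}


def build_env(overrides=None, base=None):
    """Return a copy of the base environment with Deep-NCCL overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in (overrides or {}).items():
        if name not in DEEP_NCCL_DEFAULTS:
            raise KeyError(f"unknown Deep-NCCL variable: {name}")
        text = str(value)
        if text not in ("0", "1"):  # assumption: the flags are 0/1 switches
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        env[name] = text
    return env


# Example: turn off the AIACC allreduce algorithm for a single run.
env = build_env({"NCCL_AIACC_ALLREDUCE_DISABLE": 1})
# The result can be passed to a launcher, for example:
# subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], env=env)
# where train.py is a placeholder for your own script.
```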

Performance

Deep-NCCL can speed up NCCL performance on Aliyun EGS (GPU machines). For example, the instance type 'ecs.ebmgn7ex.32xlarge' has 8 A100 GPUs and uses an eRDMA network.

GPU (EGS)	Collective	Nodes	Network	Speedup (nccl-tests)
A100 x 8	all_gather	2-10	VPC/eRDMA	30%+
A100 x 8	reduce_scatter	2-10	VPC/eRDMA	30%+
A100 x 8	all_reduce	2-10	VPC/eRDMA	20%
V100 x 8	all_reduce	2-20	VPC	60%+
A10 x 8	all_reduce	1	-	20%
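The table does not say which nccl-tests metric the speedup refers to. A minimal sketch, assuming it compares bus bandwidth between stock NCCL and Deep-NCCL, is shown below. The correction factors follow the nccl-tests performance notes (2(n-1)/n for all_reduce, (n-1)/n for all_gather and reduce_scatter); the function names and the sample numbers are hypothetical and are not measurements.

```python
def bus_bandwidth_gbs(collective, size_bytes, seconds, ranks):
    """Bus bandwidth in GB/s from message size, elapsed time, and rank count."""
    algorithm_bw = size_bytes / seconds / 1e9
    factors = {
        "all_reduce": 2 * (ranks - 1) / ranks,
        "all_gather": (ranks - 1) / ranks,
        "reduce_scatter": (ranks - 1) / ranks,
    }
    return algorithm_bw * factors[collective]


def speedup_percent(baseline, optimized):
    """Relative improvement of optimized over baseline, in percent."""
    return (optimized / baseline - 1.0) * 100.0


# Hypothetical timings for a 1 GB all_reduce across 16 ranks.
base = bus_bandwidth_gbs("all_reduce", 1e9, 0.120, 16)
fast = bus_bandwidth_gbs("all_reduce", 1e9, 0.100, 16)
print(f"speedup: {speedup_percent(base, fast):.1f}%")
```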

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved. All modifications are copyright (c) 2020-2024, ALIYUN CORPORATION. All rights reserved.

Release history

1.0.2 (this version): Nov 9, 2023
1.0.1: Nov 9, 2023
1.0.0: Oct 25, 2023

Download files

Source Distribution

deep_nccl_wrapper-1.0.2.tar.gz (2.9 kB), uploaded Nov 9, 2023

File metadata

Upload date: Nov 9, 2023
Size: 2.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for deep_nccl_wrapper-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`33df29b698ca222f3ec116e67461e28537caf0db654f5c54fbd302b190ef9794`
MD5	`8e4908b76360e97e33ef79f67b96e8fe`
BLAKE2b-256	`9a78a8b40add9cec1e73a75cdea0ee593d50eb292b1ef87d190078bb01a9dbff`