Automatically shard your large model between multiple GPUs; works without torch.distributed.

tensor_parallel

Run large PyTorch models on multiple GPUs in one line of code.

import transformers
import tensor_parallel as tp
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")  # use opt-125m for testing

model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # <- each GPU has half the weights

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"].to("cuda:0")
outputs = model.generate(inputs, num_beams=5)
print(tokenizer.decode(outputs[0])) # A cat sat on my lap for a few minutes ...

model(input_ids=inputs, labels=inputs).loss.backward()  # training works as usual

Installation

Latest stable version (recommended):

pip install tensor_parallel

Bleeding edge version:

pip install https://github.com/BlackSamorez/tensor_parallel/archive/main.zip

Usage

Simply wrap your PyTorch model with tp.tensor_parallel and use it normally. For best memory efficiency, call tp.tensor_parallel while the model is still on CPU.
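
The "wrap while still on CPU" advice can be sketched as a small loader. This is a minimal sketch, not part of the library: `cuda_device_names` and `load_tensor_parallel_model` are hypothetical helper names, and it assumes `from_pretrained` leaves the weights on CPU when no device arguments are passed, so each GPU only ever receives its own shard.

```python
def cuda_device_names(num_gpus):
    """Build device strings such as "cuda:0", "cuda:1" (hypothetical helper)."""
    return [f"cuda:{index}" for index in range(num_gpus)]


def load_tensor_parallel_model(model_name, num_gpus):
    """Load a causal LM on CPU, then shard it across num_gpus GPUs (hypothetical helper)."""
    # Imported lazily so the helper above can be used without these packages installed.
    import transformers
    import tensor_parallel as tp

    # No .to("cuda") here: the model is still on CPU when it gets wrapped.
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
    return tp.tensor_parallel(model, cuda_device_names(num_gpus))
```

With two GPUs available, `load_tensor_parallel_model("facebook/opt-125m", 2)` would be a cheap way to try it before moving to a larger checkpoint.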

Here are a few use cases:

examples/training_flan-t5-xl.ipynb - fine-tune full FLAN-T5 model on text summarization
TBA - running inference on a large language model with LLM.8bit + tensor_parallel
TBA - defining custom parallelism strategy

Advanced parameters to tensor_parallel:

device_ids: List[device] - which devices to use; defaults to all available GPUs
output_device: device - model outputs will have this device
config: tp.Config - use custom parallelism strategy, see slicing_configs.py
distributed: bool - if True, use torch.distributed backend instead of threading (requires torchrun)
sharded: bool - if True, find all trainable parameters that weren't split by tensor parallelism and split them using the ZeRO-3 algorithm
  - weights will be split between GPUs and re-assembled before each forward pass
  - TL;DR use this when training to avoid duplicate parameters (enabled by default!)
  - sharded_param_names: List[str] - parameter names that should be sharded this way, default = found automatically
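
To make the sharded option more concrete, here is a toy illustration of the "split between GPUs, re-assemble before each forward pass" idea in plain Python. It is a sketch of the general ZeRO-3-style pattern, not the library's implementation: the function names are hypothetical, lists stand in for tensors, and list positions stand in for devices.

```python
def shard(values, num_devices):
    """Split a flat parameter into contiguous chunks, one per device."""
    chunk = -(-len(values) // num_devices)  # ceiling division
    return [values[i * chunk:(i + 1) * chunk] for i in range(num_devices)]


def gather(shards):
    """Re-assemble the full parameter from its shards (an all-gather in a real setup)."""
    return [value for part in shards for value in part]


def forward(bias, inputs):
    """A stand-in layer: add the bias to the inputs element-wise."""
    return [x + b for x, b in zip(inputs, bias)]


bias = [1, 2, 3, 4, 5]            # a parameter that tensor parallelism did not split
bias_shards = shard(bias, 2)      # between forward passes, each device keeps only its chunk
full_bias = gather(bias_shards)   # re-assembled right before the forward pass
output = forward(full_bias, [10, 10, 10, 10, 10])
```

Between forward passes only `bias_shards` needs to stay in memory, which is how this scheme avoids keeping a duplicate copy of the parameter on every device.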

FAQ

Q: I don't have a multi-GPU server. Can I use tensor_parallel in Google Colab?
A: Colab has a single GPU, so there's no point in tensor parallelism. However, Kaggle offers two T4 GPUs for free to all phone-verified accounts.
Q: What is tensor parallelism?
A: You split each layer's weights into parts, multiply each part on a separate GPU, then gather the results.
Q: Should I use TensorParallel or DataParallel?
A: TensorParallel for large models, DataParallel for smaller ones.
Q: How does it compare against FullyShardedDataParallel and ZeRO?
A: ZeRO is better if you can fit a large batch; TensorParallel is better for small batches.
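
The "split, multiply, gather" answer to "What is tensor parallelism?" can be illustrated on a single linear layer. The following is a minimal pure-Python sketch with hypothetical function names: nested lists stand in for tensors, and each column block of the weight matrix stands in for the part held by one GPU. It assumes the number of columns is divisible by the number of parts.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_parts):
    """Split a weight matrix into num_parts contiguous column blocks."""
    step = len(w[0]) // num_parts  # assumes the column count is divisible by num_parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_parts)]


def tensor_parallel_matmul(x, w, num_parts):
    """Multiply by each weight shard separately, then concatenate the partial outputs."""
    shards = split_columns(w, num_parts)
    partial_outputs = [matmul(x, part) for part in shards]  # one matmul per "GPU"
    # Gather step: join the column blocks of every output row back together.
    return [sum((part[r] for part in partial_outputs), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
result = tensor_parallel_matmul(x, w, num_parts=2)
```

The sharded result matches the ordinary product `matmul(x, w)`; the only difference is that no single device ever needs the whole weight matrix.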

Why use tensor_parallel ...

vs. DeepSpeed and FairScale?
- DeepSpeed has many parallelization strategies, but requires careful configuration
- tensor_parallel has one strategy that works with 1 line of code
- tensor_parallel works in a Jupyter notebook
vs. MegatronLM?
- MegatronLM has great tensor parallelism for one model architecture
- tensor_parallel has good parallelism for any architecture
- tensor_parallel is way easier to install
vs. parallelformers?
- parallelformers implements a fixed list of architectures
- tensor_parallel works for any architecture automatically
- parallelformers is inference-only, tensor_parallel supports training
vs. alpa?
- alpa is a powerful tool for automatic distributed training / inference in JAX
- tensor_parallel works with PyTorch
vs. Model.parallelize()?
- both are easy to use, both fit large models
- in parallelize, one GPU works at a time
- in tensor_parallel, GPUs work in parallel
In short, use tensor_parallel for quick prototyping on a single machine. Use DeepSpeed+Megatron or alpa for million-dollar training runs.

Troubleshooting

If you experience NCCL errors or random hanging, you may have code errors that are not displayed properly. To debug these errors, we recommend restarting with export TENSOR_PARALLEL_USE_NATIVE=1 or on a single device.
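
In a notebook there is no shell in which to run export, so the same flag can be set from Python. This is a sketch under one assumption: that the variable is read when tensor_parallel is imported, which is why it is set first, in a fresh kernel.

```python
import os

# Assumption: the flag is picked up at import time, so set it before importing tensor_parallel.
os.environ["TENSOR_PARALLEL_USE_NATIVE"] = "1"

# import tensor_parallel as tp  # import only after the variable is set
```

If the variable is set after the library has already been imported, restart the kernel and run this cell first.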

If you found a bug or encountered a problem, please report it to our issue tracker. We will do our best to help, but it may take some time before we get to it. Please create issues only if your problem is specifically with tensor_parallel. For example, if you need help installing transformers or optimizing your code, please seek it elsewhere.

Code style

We use black and isort for all pull requests. Before committing your code, simply run black . && isort . and you will be fine.
