Automatically shard your large model across multiple GPUs; works without torch.distributed.
petals_local_parallel
A YSDA project.
```python
import torch
import torch.nn as nn
from tensor_parallel import TensorParallel

model = nn.Sequential(nn.Embedding(1337, 64), nn.LayerNorm(64), nn.ReLU(), nn.Linear(64, 10))
model = TensorParallel(model, device_ids=['cuda:0', 'cuda:1'])

inputs = torch.randint(0, 1337, (8, 16), device='cuda:0')  # dummy token ids
outputs = model(inputs)  # forward and backward work just like in the base model
```
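Since backward is supported as well, a regular training step runs unchanged on the sharded model. A minimal sketch continuing the snippet above (the dummy labels and loss are purely illustrative, not part of the library):

```python
targets = torch.randint(0, 10, (8, 16))  # dummy class labels, for illustration only

loss = nn.functional.cross_entropy(
    outputs.flatten(0, 1),                 # (batch * seq_len, num_classes)
    targets.flatten().to(outputs.device),  # put the labels wherever the outputs live
)
loss.backward()  # gradients flow back through the sharded modules as usual
```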
Benchmarking tutorial
You may use either the manual benchmark (`benchmark_manual.py`) or the automatic one (`markbench.py`).
Manual benchmark
Consider the following command-line arguments (an `argparse` sketch for the script's flags follows the list):

- `-d` | `do_backward` -- whether you need backward passes or not
- `-n` | `num_iter` -- number of iterations
- `-s` | `seq_length` -- sequence length
- `-b` | `batch_size` -- batch size
- `-c` | `bloomconfig` -- string passed to `BloomConfig.from_pretrained` to specify the model you need
- `CUDA_VISIBLE_DEVICES` -- the GPUs you are using
- `nproc_per_node` -- the number of GPUs / processes
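The script-level flags above might be wired up roughly like this (a hypothetical `argparse` sketch; the actual flag handling in `benchmark_manual.py` may differ, and the defaults below are made up):

```python
import argparse

parser = argparse.ArgumentParser(description="Manual tensor-parallel benchmark (sketch)")
parser.add_argument("-d", "--do_backward", type=int, default=1,
                    help="whether to run backward passes (0 or 1)")
parser.add_argument("-n", "--num_iter", type=int, default=100,
                    help="number of benchmark iterations")
parser.add_argument("-s", "--seq_length", type=int, default=128,
                    help="sequence length of the synthetic inputs")
parser.add_argument("-b", "--batch_size", type=int, default=16,
                    help="batch size of the synthetic inputs")
parser.add_argument("-c", "--bloomconfig", type=str, default="bigscience/bloom",
                    help="name passed to BloomConfig.from_pretrained")
args = parser.parse_args()
```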
Don't forget to make the GPU ids match the physical device order: `export CUDA_DEVICE_ORDER=PCI_BUS_ID`
So the following command

```bash
CUDA_VISIBLE_DEVICES=4,5 torchrun --nproc_per_node 2 benchmark.py -d 0 -n 100 -s 17 -b 16 -c bloom
```

will run the manual benchmark on GPUs 4 and 5 with no backward passes, 100 iterations, a sequence length of 17, a batch size of 16, and the "bloom" 176B model.
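Under the hood, a benchmark like this usually reduces to a timed loop over synthetic batches. Here is a minimal sketch of such a loop (not the actual `benchmark_manual.py` code; it assumes a model that maps token ids to a single output tensor):

```python
import time
import torch

def run_benchmark(model, vocab_size, batch_size, seq_length, num_iter, do_backward, device):
    """Time `num_iter` forward (and optionally backward) passes on synthetic token ids."""
    inputs = torch.randint(0, vocab_size, (batch_size, seq_length), device=device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iter):
        outputs = model(inputs)
        if do_backward:
            outputs.sum().backward()           # dummy scalar loss, just to exercise backward
            model.zero_grad(set_to_none=True)  # drop gradients between iterations
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / num_iter  # average seconds per iteration
```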
Auto benchmark
No command-line arguments this time; just run `markbench.py`. The script runs several experiments in a loop. To see the parameters, check the experiment settings section in `markbench.py`.
Models are tested both with and without backward passes, and the results are printed for all of the ranks.
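Conceptually, the experiment cycle might look like the sketch below (hypothetical; it reuses the `run_benchmark` sketch from the manual-benchmark section, and the real grid of settings lives in `markbench.py`). Catching per-experiment failures like this also relates to the TODO items that follow:

```python
import itertools
import torch.nn as nn

# Hypothetical experiment grid; the real settings live in markbench.py.
embed_dims = [64, 256]
batch_sizes = [4, 16]
seq_lengths = [128, 512]

for dim, bs, seq in itertools.product(embed_dims, batch_sizes, seq_lengths):
    model = nn.Sequential(nn.Embedding(1337, dim), nn.Linear(dim, 10)).cuda()
    for do_backward in (False, True):
        try:
            t = run_benchmark(model, 1337, bs, seq, num_iter=10,
                              do_backward=do_backward, device="cuda:0")
            print(f"dim={dim} bs={bs} seq={seq} backward={do_backward}: {t:.4f} s/iter")
        except RuntimeError as err:  # e.g. CUDA out of memory on larger settings
            print(f"dim={dim} bs={bs} seq={seq} backward={do_backward} failed: {err}")
```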
TODO:
- Decide which models are too big for backward passes and skip them
- Decide what to do if one of the experiments fails
Hashes for tensor_parallel-1.0.14-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `46c7d48ddab4c7276656c8d8e7ed5f294c3c5a172359b2075b36de48afe14996` |
| MD5 | `6201be66dd7845ccb8f6f5bb1363bca0` |
| BLAKE2b-256 | `570532726c66bd0017d4a225e5eb913a1ef6c853c3b594888e5a300c458dea5f` |