Transformers at any scale

TorchScale - A Library of Foundation Architectures

TorchScale is a PyTorch library that allows researchers and developers to scale up Transformers efficiently and effectively.

It supports fundamental research on new architectures for foundation models and A(G)I, focusing on modeling generality and capability, as well as training stability and efficiency.

Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
Capability - A Length-Extrapolatable Transformer
Efficiency - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

Revolutionizing Transformers for (M)LLMs and AI

RetNet: Retentive Network: A Successor to Transformer for Large Language Models
LongNet: Scaling Transformers to 1,000,000,000 Tokens

News

October, 2023: RMSNorm and SwiGLU are now the default modules in RetNet
November, 2022: TorchScale 0.1.1 released [Paper] [PyPI]

Installation

To install:

pip install torchscale

Alternatively, you can develop it locally:

git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
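
After either installation route, it can be handy to confirm which version Python actually sees. The snippet below is a minimal sketch using only the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of TorchScale.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("torchscale")
    if version is None:
        print("torchscale is not installed in this environment")
    else:
        print(f"torchscale {version} is installed")
```

With an editable install (pip install -e .), the reported version should come from the local checkout's package metadata rather than from a PyPI release.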

Getting Started

Only a few lines of code are needed to create a model with the research features above enabled. Here is how to quickly obtain a BERT-like encoder:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)

>>> print(model)

We also support the Decoder architecture and the EncoderDecoder architecture:

# Creating a decoder model
>>> from torchscale.architecture.config import DecoderConfig
>>> from torchscale.architecture.decoder import Decoder

>>> config = DecoderConfig(vocab_size=64000)
>>> decoder = Decoder(config)
>>> print(decoder)

# Creating an encoder-decoder model
>>> from torchscale.architecture.config import EncoderDecoderConfig
>>> from torchscale.architecture.encoder_decoder import EncoderDecoder

>>> config = EncoderDecoderConfig(vocab_size=64000)
>>> encdec = EncoderDecoder(config)
>>> print(encdec)

Creating a RetNet model also takes only a few lines of code:

# Creating a RetNet model
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder

>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)

>>> print(retnet)

Key Features

DeepNorm to improve the training stability of Post-LayerNorm Transformers
- enabled by setting deepnorm=True in the Config class.
- It adjusts both the residual connection and the initialization method according to the model architecture (i.e., encoder, decoder, or encoder-decoder).
SubLN for the model generality and the training stability
- enabled by subln=True. This is enabled by default.
- It introduces another LayerNorm to each sublayer and adjusts the initialization according to the model architecture.
- Note that SubLN and DeepNorm cannot be used together in a single model.
X-MoE: efficient and finetunable sparse MoE modeling
- enabled by use_xmoe=True.
- It replaces the FeedForwardNetwork of every 'moe_freq'-th layer with an X-MoE layer.
Multiway architecture for multimodality
- enabled by multiway=True.
- It provides a pool of Transformer parameters used for different modalities.
Extrapolatable position embedding (Xpos)
- enabled by xpos_rel_pos=True.
Relative position bias
- enabled by adjusting rel_pos_buckets and max_rel_pos.
SparseClip: improving the gradient clipping for sparse MoE models
- We provide sample code that can be easily adapted to the FairSeq (or other) repo.
Retentive Network: A Successor to Transformer for Large Language Models
- created by config = RetNetConfig(vocab_size=64000) and retnet = RetNetDecoder(config).
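
To make the DeepNorm item above more concrete: the DeepNet paper scales the residual branch as LayerNorm(alpha * x + sublayer(x)) and initializes selected weights with a gain beta, where both constants depend on the architecture and its depth. The sketch below computes those constants as given in the paper (N encoder layers, M decoder layers). The function name deepnorm_constants is hypothetical, and this is an illustration of the published formulas, not TorchScale's own code.

```python
from typing import Dict


def deepnorm_constants(
    arch: str, encoder_layers: int = 0, decoder_layers: int = 0
) -> Dict[str, Dict[str, float]]:
    """Return DeepNorm (alpha, beta) per stack, following the DeepNet paper.

    arch is one of "encoder", "decoder", or "encoder-decoder".
    Hypothetical helper for illustration only.
    """
    n, m = encoder_layers, decoder_layers
    if arch == "encoder":
        if n < 1:
            raise ValueError("encoder_layers must be at least 1")
        return {"encoder": {"alpha": (2 * n) ** 0.25, "beta": (8 * n) ** -0.25}}
    if arch == "decoder":
        if m < 1:
            raise ValueError("decoder_layers must be at least 1")
        return {"decoder": {"alpha": (2 * m) ** 0.25, "beta": (8 * m) ** -0.25}}
    if arch == "encoder-decoder":
        if n < 1 or m < 1:
            raise ValueError("both layer counts must be at least 1")
        # The encoder constants depend on both depths via (N^4 * M)^(1/16).
        joint = (n ** 4 * m) ** (1 / 16)
        return {
            "encoder": {"alpha": 0.81 * joint, "beta": 0.87 / joint},
            "decoder": {"alpha": (3 * m) ** 0.25, "beta": (12 * m) ** -0.25},
        }
    raise ValueError(f"unknown architecture: {arch!r}")


if __name__ == "__main__":
    for name, values in deepnorm_constants("encoder-decoder", 12, 12).items():
        print(name, {key: round(value, 4) for key, value in values.items()})
```

Deeper models get a larger alpha (the residual stream is weighted more heavily) and a smaller beta (a smaller initialization gain), which is how the paper bounds the size of model updates as depth grows.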

Most of the features above can be used by simply passing the corresponding parameters to the config. For example:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000, deepnorm=True, multiway=True)
>>> model = Encoder(config)

>>> print(model)
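
Some config parameters change the layer layout rather than toggling a single module. As an illustration, the sketch below shows one plausible reading of 'moe_freq' from the X-MoE item above: with 1-based layer numbering, every moe_freq-th layer uses an X-MoE layer in place of its FeedForwardNetwork. The function name moe_layer_flags, the 1-based convention, and the treatment of moe_freq=0 as "no MoE layers" are assumptions for illustration, not confirmed TorchScale behavior.

```python
from typing import List


def moe_layer_flags(num_layers: int, moe_freq: int) -> List[bool]:
    """Mark which layers would use an MoE layer instead of a dense FFN.

    Assumption: layer i (1-based) is an MoE layer when i is a multiple of
    moe_freq, and moe_freq <= 0 disables MoE layers entirely.
    """
    if moe_freq <= 0:
        return [False] * num_layers
    return [(index + 1) % moe_freq == 0 for index in range(num_layers)]


if __name__ == "__main__":
    flags = moe_layer_flags(num_layers=6, moe_freq=2)
    for layer_number, is_moe in enumerate(flags, start=1):
        print(f"layer {layer_number}: {'X-MoE' if is_moe else 'dense FFN'}")
```

Under this reading, moe_freq=2 alternates dense and sparse layers, while moe_freq=1 would make every layer sparse.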

Examples

We have examples of how to use TorchScale in the following scenarios/tasks:

Language
Vision
- ViT/BEiT [In progress]
Speech
Multimodal
- Multiway Transformers/BEiT-3

We plan to provide more examples regarding different tasks (e.g. vision pretraining and speech recognition) and various deep learning toolkits (e.g. DeepSpeed and Megatron-LM). Any comments or PRs are welcome!

Results

Stability Evaluation

With TorchScale the training curve is smooth, whereas the baseline Transformer fails to converge.

Scaling-up Experiments

TorchScale supports arbitrary depths and widths, scaling up models without difficulty.

Acknowledgments

Some implementations in TorchScale are either adapted from or inspired by the FairSeq repository and the UniLM repository.

Citations

If you find this repository useful, please consider citing our work:

@article{torchscale,
  author    = {Shuming Ma and Hongyu Wang and Shaohan Huang and Wenhui Wang and Zewen Chi and Li Dong and Alon Benhaim and Barun Patra and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {{TorchScale}: {Transformers} at Scale},
  journal   = {CoRR},
  volume    = {abs/2211.13184},
  year      = {2022}
}

@article{deepnet,
  author    = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
  title     = {{DeepNet}: Scaling {Transformers} to 1,000 Layers},
  journal   = {CoRR},
  volume    = {abs/2203.00555},
  year      = {2022},
}

@article{magneto,
  author    = {Hongyu Wang and Shuming Ma and Shaohan Huang and Li Dong and Wenhui Wang and Zhiliang Peng and Yu Wu and Payal Bajaj and Saksham Singhal and Alon Benhaim and Barun Patra and Zhun Liu and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {Foundation {Transformers}},
  journal   = {CoRR},
  volume    = {abs/2210.06423},
  year      = {2022}
}

@inproceedings{xmoe,
  title={On the Representation Collapse of Sparse Mixture of Experts},
  author={Zewen Chi and Li Dong and Shaohan Huang and Damai Dai and Shuming Ma and Barun Patra and Saksham Singhal and Payal Bajaj and Xia Song and Xian-Ling Mao and Heyan Huang and Furu Wei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=mWaYC6CZf5}
}

@article{retnet,
  author={Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},
  title     = {Retentive Network: A Successor to {Transformer} for Large Language Models},
  journal   = {ArXiv},
  volume    = {abs/2307.08621},
  year      = {2023}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact Furu Wei and Shuming Ma with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Project details

Release history

- 0.3.0 (this version): Oct 20, 2023
- 0.2.0: Mar 15, 2023
- 0.1.2: Mar 4, 2023
- 0.1.1: Nov 23, 2022
- 0.1.0: Nov 22, 2022

Download files

Source Distribution

torchscale-0.3.0.tar.gz (51.1 kB)

Uploaded Oct 20, 2023 Source

Built Distribution

torchscale-0.3.0-py3-none-any.whl (71.2 kB)

Uploaded Oct 20, 2023 Python 3

File details

Details for the file torchscale-0.3.0.tar.gz.

File metadata

Download URL: torchscale-0.3.0.tar.gz
Upload date: Oct 20, 2023
Size: 51.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.7.12

File hashes

Hashes for torchscale-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`10d109d7d01e87db573fb2fe183fe5469521b9c7a2fff02f039c6e674cf45685`
MD5	`725d2e9e2ac0550313b12095cafd9f55`
BLAKE2b-256	`1a959ca4618530bc2dce09a1e29b3ae2a48c087b1132f02c10c02020af2afc7f`

File details

Details for the file torchscale-0.3.0-py3-none-any.whl.

File metadata

Download URL: torchscale-0.3.0-py3-none-any.whl
Upload date: Oct 20, 2023
Size: 71.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.7.12

File hashes

Hashes for torchscale-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3bf4950cece4b84c5c20a8ce782ff52cf4234c10539a2638524dc087d389a73`
MD5	`40e9c4acfbcf626b8baa6e3899cac775`
BLAKE2b-256	`9e2838455ac7991ea7f250b1be484b1c1da1c5e089dc56542882abfd98497d75`
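
The digests listed above can be used to check that a downloaded file is intact. The following is a minimal sketch using the standard library's hashlib; the helper names sha256_of_file and matches_sha256 are hypothetical, and the file path is assumed to be wherever the archive was saved locally.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 listed for the source distribution.
    archive = Path("torchscale-0.3.0.tar.gz")
    expected = "10d109d7d01e87db573fb2fe183fe5469521b9c7a2fff02f039c6e674cf45685"
    if archive.exists():
        print("digest matches" if matches_sha256(archive, expected) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of file size. pip can also enforce digests itself through its hash-checking mode when requirements are pinned with hashes.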
