Transformers at any scale

TorchScale - A Library for Transformers at (Any) Scale

MIT License

TorchScale is a PyTorch library that lets researchers and developers scale up Transformers efficiently and effectively. It implements fundamental research that improves the modeling generality and capability of Transformers, as well as the stability and efficiency of training them at scale.

  • Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond
  • Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)
  • Capability - A Length-Extrapolatable Transformer
  • Efficiency - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

News

  • November, 2022: TorchScale 0.1.1 released [Paper] [PyPI]

Installation

To install:

pip install torchscale

Alternatively, you can develop it locally:

git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
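
After either install route, it can be useful to confirm that the package is visible to the current Python environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `torchscale`, as in the `pip install` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "torchscale") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not present in this environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("torchscale is not installed; run `pip install torchscale` first.")
    else:
        print(f"torchscale {version} is installed.")
```

Because the check reads package metadata rather than importing the library, it runs even in an environment where PyTorch itself is missing.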

Getting Started

It takes only a few lines of code to create a model with the research features above enabled. Here is how to quickly obtain a BERT-like encoder:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)

>>> print(model)

We also support the Decoder architecture and the EncoderDecoder architecture:

# Creating a decoder model
>>> from torchscale.architecture.config import DecoderConfig
>>> from torchscale.architecture.decoder import Decoder

>>> config = DecoderConfig(vocab_size=64000)
>>> decoder = Decoder(config)
>>> print(decoder)

# Creating an encoder-decoder model
>>> from torchscale.architecture.config import EncoderDecoderConfig
>>> from torchscale.architecture.encoder_decoder import EncoderDecoder

>>> config = EncoderDecoderConfig(vocab_size=64000)
>>> encdec = EncoderDecoder(config)
>>> print(encdec)

Key Features

Most of the features above can be used by simply passing the corresponding parameters to the config. For example:

>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

>>> config = EncoderConfig(vocab_size=64000, deepnorm=True, multiway=True)
>>> model = Encoder(config)

>>> print(model)
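
The `deepnorm=True` flag corresponds to the DeepNet work listed under Stability. As a rough illustration of why such a setting depends on depth, the DeepNet paper describes scaling the residual connection by a constant alpha and the initialization of some weights by a constant beta, both derived from the number of layers. For an encoder-only (or decoder-only) model with N layers, the paper gives alpha = (2N)^(1/4) and beta = (8N)^(-1/4). The sketch below only computes those two numbers; the helper `deepnorm_constants` is hypothetical and not part of the TorchScale API, and how the library applies these values internally is not shown here.

```python
import math
from typing import Tuple


def deepnorm_constants(num_layers: int) -> Tuple[float, float]:
    """Return (alpha, beta) for an encoder-only or decoder-only stack.

    Assumption: uses the single-stack formulas from the DeepNet paper,
    alpha = (2N)^(1/4) and beta = (8N)^(-1/4). The encoder-decoder case
    uses different formulas and is not covered here.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be a positive integer")
    alpha = math.pow(2.0 * num_layers, 0.25)   # residual scaling factor
    beta = math.pow(8.0 * num_layers, -0.25)   # initialization gain
    return alpha, beta


if __name__ == "__main__":
    # Print the constants for a few depths to show how they change with N.
    for depth in (12, 100, 1000):
        alpha, beta = deepnorm_constants(depth)
        print(f"layers={depth:5d}  alpha={alpha:.4f}  beta={beta:.4f}")
```

As depth grows, alpha increases and beta shrinks, so deeper models weight the residual path more heavily and start from smaller updates in each sublayer.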

Examples

We provide examples of how to use TorchScale in the following scenarios/tasks:

We plan to provide more examples for other tasks (e.g. vision pretraining and speech recognition) and for various deep learning toolkits (e.g. DeepSpeed and Megatron-LM). Comments and PRs are welcome!

Results

Stability Evaluation

With TorchScale, the training curve is smooth, whereas the baseline Transformer fails to converge.

Scaling-up Experiments

TorchScale supports arbitrary depths and widths, so models can be scaled up without extra tuning effort.

Acknowledgments

Some implementations in TorchScale are either adapted from or inspired by the FairSeq repository and the UniLM repository.

Citations

If you find this repository useful, please consider citing our work:

@article{torchscale,
  author    = {Shuming Ma and Hongyu Wang and Shaohan Huang and Wenhui Wang and Zewen Chi and Li Dong and Alon Benhaim and Barun Patra and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {{TorchScale}: {Transformers} at Scale},
  journal   = {CoRR},
  volume    = {abs/2211.13184},
  year      = {2022}
}
@article{deepnet,
  author    = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
  title     = {{DeepNet}: Scaling {Transformers} to 1,000 Layers},
  journal   = {CoRR},
  volume    = {abs/2203.00555},
  year      = {2022}
}
@article{magneto,
  author    = {Hongyu Wang and Shuming Ma and Shaohan Huang and Li Dong and Wenhui Wang and Zhiliang Peng and Yu Wu and Payal Bajaj and Saksham Singhal and Alon Benhaim and Barun Patra and Zhun Liu and Vishrav Chaudhary and Xia Song and Furu Wei},
  title     = {Foundation {Transformers}},
  journal   = {CoRR},
  volume    = {abs/2210.06423},
  year      = {2022}
}
@inproceedings{xmoe,
  title={On the Representation Collapse of Sparse Mixture of Experts},
  author={Zewen Chi and Li Dong and Shaohan Huang and Damai Dai and Shuming Ma and Barun Patra and Saksham Singhal and Payal Bajaj and Xia Song and Xian-Ling Mao and Heyan Huang and Furu Wei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=mWaYC6CZf5}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact Furu Wei and Shuming Ma with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

