Plug-in-and-Play Toolbox for Stabilizing Transformer Training
Admin-Torch
Transformers Training **Stabilized**
What's New? • Key Idea • How To Use • Docs • Examples • Citation • License
Here, we provide a plug-in-and-play implementation of Admin, which stabilizes previously-diverged Transformer training and achieves better performance, without introducing additional hyper-parameters. The design of Admin is half-precision friendly and can be reparameterized into the original Transformer.
What's New?
Beyond the original Admin implementation:
- `admin-torch` removes the profiling stage and is plug-in-and-play.
- `admin-torch`'s implementation is more robust (see below).
Comparison with the DeepNet Init and the Original Admin Init (on WMT'17).
| | Regular batch size (8x4096) | Huge batch size (128x4096) |
|---|---|---|
| Original Admin | ✅ | ❌ |
| DeepNet | ❌ | ✅ |
| `admin-torch` | ✅ | ✅ |
More details can be found in our example.
Key Idea
What complicates Transformer training?
For a Transformer f with input x and randomly initialized weights w, we describe its stability (`output_change_scale`) as the expected change of the output f(x, w) when w is perturbed, e.g., by a parameter update. In our study, we show that an original N-layer Transformer's `output_change_scale` is O(N), which destabilizes its training. Admin stabilizes Transformer training by regulating this scale to O(log N) or O(1).
More details can be found in our paper.
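As a rough illustration of `output_change_scale`, one can compare a model's outputs before and after a small parameter perturbation. The helper below is ours and not part of `admin-torch`; a random direction stands in for a real gradient step.

```python
import copy
import torch


@torch.no_grad()
def output_change_scale(f: torch.nn.Module, x: torch.Tensor, step: float = 1e-3) -> float:
    """Estimate the average squared output change of `f` under a small weight perturbation.

    Illustrative only: the perturbation is a random direction rather than an
    actual gradient update. Call `f.eval()` first if the model uses dropout.
    """
    f_perturbed = copy.deepcopy(f)
    for p in f_perturbed.parameters():
        p.add_(torch.randn_like(p), alpha=step)  # surrogate parameter update
    diff = f(x) - f_perturbed(x)
    return diff.pow(2).sum(dim=-1).mean().item()
```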
How to use?
install

```bash
pip install admin-torch
```

import

```python
import admin_torch
```
enjoy

```diff
def __init__(self, ...):
    ...
+   self.residual = admin_torch.as_module(self, self.number_of_sub_layers)
    ...

def forward(self, x):
    ...
-   x = x + f(x)
+   x = self.residual(x, f(x))
    x = self.LN(x)
    ...
```
An elaborated example can be found in our docs, and a real working example can be found at LiyuanLucasLiu/fairseq (the training recipe is available in our example).
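For orientation, here is a minimal sketch of a post-LayerNorm encoder layer wired up as in the diff above. It is an illustration only: the class and hyper-parameter names are ours, and the `admin_torch.as_module(...)` call simply mirrors the snippet above (see the docs for the exact signature and for how to choose the number of sub-layers).

```python
import torch.nn as nn
import admin_torch


class EncoderLayer(nn.Module):
    """Illustrative post-LN encoder layer; names and sizes are placeholders."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, number_of_sub_layers=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # One Admin residual per sub-layer, replacing the plain `x + f(x)` shortcut;
        # the call mirrors the README diff above.
        self.residual_attn = admin_torch.as_module(self, number_of_sub_layers)
        self.residual_ffn = admin_torch.as_module(self, number_of_sub_layers)
        self.ln_attn = nn.LayerNorm(d_model)
        self.ln_ffn = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln_attn(self.residual_attn(x, self.attn(x, x, x)[0]))
        x = self.ln_ffn(self.residual_ffn(x, self.ffn(x)))
        return x
```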
Citation
Please cite the following papers if you find our model useful. Thanks!
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20).
@inproceedings{liu2020admin,
title={Understanding the Difficulty of Training Transformers},
author = {Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
year={2020}
}
Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao (2020). Very Deep Transformers for Neural Machine Translation. arXiv preprint arXiv:2008.07772.
@article{liu_deep_2020,
  author = {Liu, Xiaodong and Duh, Kevin and Liu, Liyuan and Gao, Jianfeng},
  journal = {arXiv preprint arXiv:2008.07772},
  title = {Very Deep Transformers for Neural Machine Translation},
  year = {2020}
}
Download files
Source Distribution: `admin_torch-0.1.0.tar.gz`

Built Distribution: `admin_torch-0.1.0-py3-none-any.whl`
File details
Details for the file `admin_torch-0.1.0.tar.gz`.
File metadata
- Download URL: admin_torch-0.1.0.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc7fce3fafed83d719e0a0594a0201bee54e53807d6f9f0715391bfb89803db4 |
| MD5 | 10ac9b8de35b6f3a0f71457bf05b215b |
| BLAKE2b-256 | e56f9b420533c0f9f09536d88f17fe5b79e046c225b414d6f9b872de17978189 |
File details
Details for the file `admin_torch-0.1.0-py3-none-any.whl`.
File metadata
- Download URL: admin_torch-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd5696ff43a699b97bee2f58b476653dcf393c0320f43d9a22eee9a6a71ddc70 |
| MD5 | cde297e48c2df8115ddba75ce12ffb29 |
| BLAKE2b-256 | 8ff8a637a2448682e641efb852821084906ff07e6548d75c2ef322f35c342763 |