
PAI EasyNLP Toolkit

Project description



EasyNLP is a Comprehensive and Easy-to-use NLP Toolkit


EasyNLP

EasyNLP is an easy-to-use NLP development and application toolkit in PyTorch, first released inside Alibaba in 2021. It is built with scalable distributed training strategies and supports a comprehensive suite of NLP algorithms for various NLP applications. EasyNLP integrates knowledge distillation and few-shot learning for landing large pre-trained models and provides a unified framework of model training, inference, and deployment for real-world applications. It has powered more than 10 BUs and more than 20 business scenarios within the Alibaba group. It is seamlessly integrated with Platform of AI (PAI) products, including PAI-DSW for development, PAI-DLC for cloud-native training, PAI-EAS for serving, and PAI-Designer for zero-code model training.

Main Features

  • Easy to use and highly customizable: In addition to providing easy-to-use and concise commands to call cutting-edge models, it also provides abstracted custom modules such as AppZoo and ModelZoo to make it easy to build NLP applications. It is equipped with the PAI PyTorch distributed training framework TorchAccelerator to speed up distributed training.
  • Compatible with open-source libraries: EasyNLP has APIs to support the training of models from Huggingface/Transformers with the PAI distributed framework. It also supports the pre-trained models in EasyTransfer ModelZoo.
  • Knowledge-injected pre-training: The PAI team has conducted extensive research on knowledge-injected pre-training and has built a knowledge-injected model that won first place in the CCF knowledge pre-training competition. EasyNLP integrates these cutting-edge knowledge pre-trained models, including DKPLM and KGBERT.
  • Landing large pre-trained models: EasyNLP provides few-shot learning capabilities, allowing users to fine-tune large models with only a few samples and still achieve good results. It also provides knowledge distillation functions to quickly distill large models into small, efficient models that are easy to deploy online.

Installation

You can either install it with pip:

$ pip install pai-easynlp (to be released)

or set it up from source:

$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ python setup.py install

This repo is tested on Python 3.6 and PyTorch >= 1.8.
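
If you want to verify that your environment matches these requirements before installing, a quick check (assuming PyTorch is already installed) might look like this:

# Sanity-check the interpreter and PyTorch versions EasyNLP is tested against.
import sys
import torch

assert sys.version_info >= (3, 6), "EasyNLP is tested on Python 3.6"
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 8), "EasyNLP is tested on PyTorch >= 1.8"
print(f"Python {sys.version.split()[0]}, PyTorch {torch.__version__}, "
      f"CUDA available: {torch.cuda.is_available()}")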

Quick Start

Now let's show how to use just a few lines of code to build a text classification model based on BERT.

from easynlp.core import Trainer
from easynlp.appzoo import ClassificationDataset, SequenceClassification
from easynlp.utils import initialize_easynlp

# Parse command-line arguments and initialize the EasyNLP runtime.
args = initialize_easynlp()

# Build the training dataset from the input table according to the declared schema.
train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    data_file=args.tables,
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    is_training=True)

# Load the BERT backbone with a classification head and start training.
model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model, train_dataset=train_dataset).train()

Then you can run the code (save it as main.py, for example):

python main.py \
  --mode train \
  --tables=train_toy.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./tmp/ \
  --epoch_num=1  \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-tiny-uncased'
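
Here --input_schema declares one tab-separated column per field (label, sid1, sid2, sent1, sent2, all strings). If you do not have train_toy.tsv at hand, a file with that layout can be generated as follows; the sentence pairs and IDs below are made-up placeholders, not real data:

# Write a toy train_toy.tsv matching the declared schema:
# label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 (tab-separated).
rows = [
    ("0", "s1a", "s1b", "the movie was dull", "i would not watch it again"),
    ("1", "s2a", "s2b", "a wonderful performance", "the lead actor is brilliant"),
]
with open("train_toy.tsv", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")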

You can also use AppZoo Command Line Tools to quickly train an App model. Take text classification on the SST-2 dataset as an example. First, download train.tsv and dev.tsv, then start training:

$ easynlp \
   --mode=train \
   --worker_gpu=1 \
   --tables=train.tsv,dev.tsv \
   --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
   --first_sequence=sent1 \
   --label_name=label \
   --label_enumerate_values=0,1 \
   --checkpoint_dir=./classification_model \
   --epoch_num=1  \
   --sequence_length=128 \
   --app_name=text_classify \
   --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'

And then predict:

$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify
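
The prediction file dev.pred.tsv is tab-separated, with one column per field in --output_schema (predictions, probabilities, logits, output) followed by the columns listed in --append_cols (here, label). As a rough sketch, assuming this column order and no header row (verify against the actual file), it can be loaded with pandas:

import pandas as pd

# Assumed column order: the --output_schema fields followed by the appended label.
cols = ["predictions", "probabilities", "logits", "output", "label"]
pred_df = pd.read_csv("dev.pred.tsv", sep="\t", header=None, names=cols)
print(pred_df.head())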

To learn more about the usage of AppZoo, please refer to our documentation.

ModelZoo

EasyNLP currently provides the following models in ModelZoo:

  1. PAI-BERT-zh (from Alibaba PAI): BERT models pre-trained on a large Chinese corpus.
  2. DKPLM (from Alibaba PAI): released with the paper DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding by Taolin Zhang, Chengyu Wang, Nan Hu, Minghui Qiu, Chengguang Tang, Xiaofeng He and Jun Huang.
  3. KGBERT (from Alibaba Damo Academy & PAI): pre-trained BERT models with knowledge graph embeddings injected.
  4. BERT (from Google): released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
  5. RoBERTa (from Facebook): released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
  6. Chinese RoBERTa (from HFL): the Chinese version of RoBERTa.
  7. MacBERT (from HFL): released with the paper Revisiting Pre-trained Models for Chinese Natural Language Processing by Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang and Guoping Hu.
  8. WOBERT (from ZhuiyiTechnology): the word-based BERT for the Chinese language.
  9. FashionBERT (from Alibaba PAI & ICBU): in progress.
  10. GEEP (from Alibaba PAI): in progress.

Please refer to this readme for the usage of these models in EasyNLP. EasyNLP also supports loading pretrained models from Huggingface/Transformers; please refer to this tutorial for details.
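
To switch checkpoints, you typically only change the model name passed to the AppZoo classes shown in the Quick Start above. A minimal sketch, assuming bert-base-chinese (the checkpoint used in the CLUE section below) is available in your EasyNLP version:

from easynlp.appzoo import SequenceClassification

# Same API as in the Quick Start; only the checkpoint identifier changes.
# Whether a given name resolves to a PAI ModelZoo or a Hugging Face checkpoint
# depends on the EasyNLP version, so verify it against the linked readme.
model = SequenceClassification(pretrained_model_name_or_path="bert-base-chinese")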

Landing Large Pre-trained Models

EasyNLP provides few-shot learning and knowledge distillation to help land large pre-trained models.

  1. PET (from LMU Munich and Sulzer GmbH): released with the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference by Timo Schick and Hinrich Schutze. We have made some slight modifications to make the algorithm suitable for the Chinese language.
  2. P-Tuning (from Tsinghua University, Beijing Academy of AI, MIT and Recurrent AI, Ltd.): released with the paper GPT Understands, Too by Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang. We have made some slight modifications to make the algorithm suitable for the Chinese language.
  3. CP-Tuning (from Alibaba PAI): released with the paper Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning by Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang and Jun Huang.
  4. Vanilla KD (from Alibaba PAI): distilling the logits of large BERT-style models into smaller ones (see the sketch after this list).
  5. Meta KD (from Alibaba PAI): released with the paper Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains by Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li and Jun Huang.
  6. Data Augmentation (from Alibaba PAI): augmenting the data based on the MLM head of pre-trained language models.
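
To make the logit-distillation idea behind Vanilla KD concrete, here is a generic PyTorch sketch of the loss. It is an illustration of the technique only, not EasyNLP's actual implementation, and the temperature and weighting values are arbitrary:

import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Blend the usual cross-entropy with a KL term that matches the teacher's
    # softened logit distribution (generic sketch, not EasyNLP's API).
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd

# Toy usage with random logits for a batch of 4 examples and 2 classes.
student = torch.randn(4, 2, requires_grad=True)
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = vanilla_kd_loss(student, teacher, labels)
loss.backward()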

CLUE Benchmark

EasyNLP provides a simple toolkit to benchmark CLUE datasets. You can use the following command to benchmark a CLUE dataset:

# Format: bash run_clue.sh device_id train/predict dataset
# e.g.: 
bash run_clue.sh 0 train csl

We have tested a BERT model (bert-base-chinese) on these datasets; the results on the dev sets are:

Task   AFQMC    CMNLI    CSL      IFLYTEK  OCNLI    TNEWS    WSC
P      72.17%   79.10%   80.93%   60.22%   78.31%   57.52%   63.49%
F1     52.96%   79.10%   81.71%   60.22%   78.30%   57.52%   77.67%

Here is the detailed CLUE benchmark example.

Tutorials

License

This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.

ChangeLog

  • EasyNLP v0.0.3 was released in 01/04/2022. Please refer to tag_v0.0.3 for more details and history.

Contact Us

Scan the following QR codes to join the Dingtalk discussion group. The group discussions are mostly in Chinese, but English is also welcome.

Reference

Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution

pai_easynlp-0.0.3-py3-none-any.whl (434.8 kB)


File details

Details for the file pai_easynlp-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: pai_easynlp-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 434.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.25.1 requests-toolbelt/0.9.1 urllib3/1.26.5 tqdm/4.64.0 importlib-metadata/4.5.0 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.4 CPython/3.6.13

File hashes

Hashes for pai_easynlp-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 aa66b5a7e7ab087e7f1fd734c187ff78d1c60696bf6ebe88f425f6dfb2d5d76f
MD5 9fbbd2829011d12dbaba85108ca41cdd
BLAKE2b-256 fbf344c218cd8fedc35422b5dd09399aa05ec1e08fdc5b73ae6442a0e05d0df2

See more details on using hashes here.
