Chinese Generative Pre-Training Transformer

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Chinese-GPT: Chinese GPT Pre-Trained Model

Chinese Generative Pre-Training (GPT) Language Model

This project is a unidirectional Transformer GPT model (117M parameters) trained on a large corpus, following the approach of OpenAI GPT-2. Due to limited computational resources, we did not train the model from scratch. Instead, we take advantage of BERT and use its weights as the initialization for training our Chinese GPT. This makes training feasible on 4 x 1080Ti GPUs.

However, please note that the performance currently cannot match the original English GPT-2 model, for various reasons. One possible reason is that OpenAI has done better text filtering and has a higher-quality dataset. Also, they trained their model for at least about 300 GPU days. Still, the model here can be a good starting point if you want to apply it to downstream tasks.

Features

This repository contains a rewritten, cached Transformer based on BERT, using the same caching technique as the GPT-2 implementation. It caches intermediate results and therefore saves computation time and memory during the decoding stage.
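
To make the caching idea concrete, here is a minimal pure-Python sketch of key/value caching for a single attention head. It is an illustration under stated assumptions, not this repository's implementation: the names `attend`, `CachedAttention`, and `full_recompute` are hypothetical, and the learned query/key/value projections are left out so that each token vector serves as its own query, key, and value.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class CachedAttention:
    """Hypothetical helper: keeps keys/values of earlier positions between steps."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest position is appended; earlier entries are reused as-is.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(queries, keys, values):
    """Baseline: causal attention rebuilt from scratch for every position."""
    return [attend(queries[t], keys[: t + 1], values[: t + 1])
            for t in range(len(queries))]


if __name__ == "__main__":
    tokens = [[0.1, 0.2], [0.5, -0.3], [-0.7, 0.4]]
    layer = CachedAttention()
    incremental = [layer.step(x, x, x) for x in tokens]
    print(incremental == full_recompute(tokens, tokens, tokens))
```

Both paths produce the same outputs, but the cached version processes only the new token at each step instead of re-encoding the whole prefix.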

Also, a CUDA kernel version of the GELU activation function is provided. You have to install CuPy to use it. You can check cuda_gelu for the implementation. It is 2x faster than the original implementation.
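
For reference, GELU itself can be written in a few lines of plain Python. The sketch below shows the exact form based on the Gaussian CDF and the common tanh approximation. The function names are illustrative, and which variant cuda_gelu computes is not stated here, so treat this as background rather than a description of that kernel.

```python
import math


def gelu_exact(x):
    """GELU as x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, gelu_exact(x), gelu_tanh(x))
```

A GPU kernel applies the same elementwise formula to every entry of a tensor in one launch, which is where the speedup over composing several separate tensor operations comes from.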

Installation

Before using it, you might want to install the requirements first.

pip install -r requirements.txt

You can also install it via pip.

pip install chinese-gpt

Usage

See the tutorials for details.

A Colab notebook is also available as a demo: https://colab.research.google.com/drive/1cvBSt2uF7hYL1feDGt0dkCxIeaVXQs5x

Encoder Weights: https://drive.google.com/open?id=1Mr2-x_qT2hgyo0RalPjc09NmyNi6a_gs

Decoder Weights: https://drive.google.com/open?id=1W6n7Kv6kvHthUX18DhdGSzBYkyzDvxYh

Data Preparation

Training GPT requires a dataset drawn from a wide range of sources.

We collected data from the NLP Chinese Corpus.

In detail, we used:

Community Q&A, JSON version (webtext2019zh): a large-scale, high-quality dataset
Encyclopedia Q&A, JSON version (baike2018qa)
News corpus, JSON version (news2016zh)
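
As a rough sketch of how such JSON dumps could be turned into plain training text, the snippet below reads one JSON object per line and joins selected text fields. The one-object-per-line layout and the field names `title` and `content` are assumptions for illustration, so check the actual files and adjust `fields` accordingly. `iter_texts` is a hypothetical helper, not part of this package.

```python
import json


def iter_texts(lines, fields=("title", "content")):
    """Yield one text string per JSON record, joining the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if not isinstance(record, dict):
            continue  # only JSON objects carry named fields
        parts = [str(record[f]).strip() for f in fields if record.get(f)]
        if parts:
            yield "\n".join(parts)


if __name__ == "__main__":
    sample = ['{"title": "问题", "content": "回答"}', "not json"]
    for text in iter_texts(sample):
        print(text)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `iter_texts(open(path, encoding="utf-8"))`, which keeps memory use flat on large corpora.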

Text Filtering

One thing to take care of is text filtering. The BERT Chinese tokenizer's vocabulary does not include some punctuation marks, so you might want to clean your data first with code like the following:

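import regex as re  # third-party `regex` package; the standard `re` module also works here

def filterPunctuation(x):
    """Replace punctuation the tokenizer lacks with ASCII equivalents."""
    x = re.sub(r'[‘’]', "'", x)    # curly single quotes -> straight quote
    x = re.sub(r'[“”]', '"', x)    # curly double quotes -> straight quote
    x = re.sub(r'[…]', '...', x)   # ellipsis character -> three periods
    x = re.sub(r'[—]', '-', x)     # long dash -> hyphen
    x = re.sub(r"&nbsp;?", "", x)  # drop leftover HTML non-breaking-space entities
    return x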

You may also want to convert traditional Chinese to simplified Chinese and apply some other filtering techniques based on your data.
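
One way to combine these extra steps is a small line-level pipeline, sketched below under a few assumptions. `clean_lines` and its `min_chars` threshold are hypothetical, and the traditional-to-simplified step is passed in as an optional callable (for example, a wrapper around a converter library such as OpenCC) rather than implemented here.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def clean_lines(lines, convert=None, min_chars=5):
    """Yield cleaned lines; `convert` is an optional str -> str callable."""
    for line in lines:
        if convert is not None:
            line = convert(line)  # e.g. traditional -> simplified Chinese
        line = _WHITESPACE.sub(" ", line).strip()  # collapse runs of whitespace
        if len(line) >= min_chars:  # drop fragments too short to be useful
            yield line


if __name__ == "__main__":
    # Toy stand-in for a real converter: maps two traditional characters only.
    toy_table = {"學": "学", "語": "语"}

    def toy_convert(text):
        return "".join(toy_table.get(ch, ch) for ch in text)

    for cleaned in clean_lines(["  學習   語言  ", "短"], toy_convert, min_chars=3):
        print(cleaned)
```

Keeping the converter injectable means the pipeline has no hard dependency on any particular conversion library, and other filters can be added as further steps in the loop.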

Release history

0.1.3 (this version): May 4, 2019

0.1.2: May 3, 2019

Download files

Source Distribution

chinese_gpt-0.1.3.tar.gz (6.5 kB)

Uploaded May 4, 2019 Source

File details

Details for the file chinese_gpt-0.1.3.tar.gz.

File metadata

Download URL: chinese_gpt-0.1.3.tar.gz
Upload date: May 4, 2019
Size: 6.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for chinese_gpt-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`4207a4983965184710c7b11bd7a42d185e7f562b5830b2becf2ddb39ab53ebb3`
MD5	`c11d70c6b70796121003397c970171ef`
BLAKE2b-256	`fc07545022bcb92f355ae0005c0fb8391ad37679622e18bd7e816ede5727b61e`
