# Berserker
Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's [BERT](https://github.com/google-research/bert) model.
## Installation
```bash
pip install basaka
```
## Usage
```python
import berserker
berserker.load_model() # A one-off installation
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
```
## Benchmark
The table below shows that Berserker achieves state-of-the-art F1 scores on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [datasets](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip).
The results below were obtained by training for 15 epochs on each dataset with a batch size of 64.
| | PKU | CITYU | MSR | AS |
|--------------------|----------|----------|----------|----------|
| Liu et al. (2016) | **96.8** | -- | 97.3 | -- |
| Yang et al. (2017) | 96.3 | 96.9 | 97.5 | 95.7 |
| Zhou et al. (2017) | 96.0 | -- | 97.8 | -- |
| Cai et al. (2017) | 95.8 | 95.6 | 97.1 | -- |
| Chen et al. (2017) | 94.3 | 95.6 | 96.0 | 94.6 |
| Wang and Xu (2017) | 96.5 | -- | 98.0 | -- |
| Ma et al. (2018) | 96.1 | **97.2** | 98.1 | 96.2 |
| Berserker | 96.6 | 97.1 | **98.4** | **96.5** |
Reference: [Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://arxiv.org/pdf/1808.06511.pdf)
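The F1 measure above is the usual word-segmentation F1, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below illustrates how such a score can be computed; it is only an illustration, not the evaluation script used for this benchmark.

```python
def segmentation_f1(gold_words, pred_words):
    """Word-segmentation F1: a predicted word is correct only if its
    (start, end) character span also appears in the gold segmentation."""
    def spans(words):
        result, start = set(), 0
        for w in words:
            result.add((start, start + len(w)))
            start += len(w)
        return result

    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

print(segmentation_f1(['姑姑', '想', '過', '生活', '。'],
                      ['姑姑', '想', '過生活', '。']))  # ≈ 0.667
```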
## Limitation
Since Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.
Currently the default model is trained on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [PKU dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip). We plan to release more pretrained models in the future.
## Architecture
Berserker is fine-tuned on TPU from the [pretrained Chinese BERT model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip). A single dense layer is applied to every token, producing an output in [0, 1] for each token, where 1 denotes a split (word boundary).
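To make this concrete, below is a minimal sketch of such a segmentation head and of decoding its per-character probabilities into words. The names (`SplitHead`, `decode`), the PyTorch framing, and the hand-picked probabilities are illustrative assumptions only, not Berserker's actual implementation.

```python
import torch
import torch.nn as nn

class SplitHead(nn.Module):
    """A single dense layer applied to every token vector, emitting a split probability in [0, 1]."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):             # (batch, seq_len, hidden_size)
        logits = self.dense(hidden_states)        # (batch, seq_len, 1)
        return torch.sigmoid(logits).squeeze(-1)  # (batch, seq_len), values in [0, 1]

def decode(chars, split_probs, threshold=0.5):
    """Cut the sentence after every character whose split probability exceeds the threshold."""
    words, current = [], ''
    for ch, p in zip(chars, split_probs):
        current += ch
        if p > threshold:
            words.append(current)
            current = ''
    if current:
        words.append(current)
    return words

# Dummy tensor standing in for the per-character output of the Chinese BERT encoder.
hidden_states = torch.randn(1, 7, 768)
split_probs = SplitHead()(hidden_states)          # untrained, so these probabilities are arbitrary

# With hand-picked probabilities the decoding step is easy to follow:
print(decode(list('姑姑想過生活。'), [0.1, 0.9, 0.95, 0.9, 0.1, 0.9, 0.99]))
# ['姑姑', '想', '過', '生活', '。']
```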
## Training
We provide the source code for training under the `trainer` subdirectory. Feel free to contact me if you need any help reproducing the results.
## Bonus Video
[<img src="https://img.youtube.com/vi/H_xmyvABZnE/maxres1.jpg" alt="Yachae!! BERSERKER!!"/>](https://www.youtube.com/watch?v=H_xmyvABZnE)