
Project description

# Berserker
Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's [BERT](https://github.com/google-research/bert) model.

## Installation
```bash
pip install basaka
```

## Usage
```python
import berserker

berserker.load_model() # A one-off installation of the pretrained model
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
```

## Benchmark
The table below shows that Berserker achieves state-of-the-art F1 scores on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [datasets](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip).

The results below were obtained by training for 15 epochs on each dataset with a batch size of 64.
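
For reference, the F1 measure reported here is the standard harmonic mean of word-level precision and recall; a minimal sketch of the computation:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall: F1 = 2 * P * R / (P + R).
    return 2 * precision * recall / (precision + recall)
```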

| | PKU | CITYU | MSR | AS |
|--------------------|----------|----------|----------|----------|
| Liu et al. (2016) | **96.8** | -- | 97.3 | -- |
| Yang et al. (2017) | 96.3 | 96.9 | 97.5 | 95.7 |
| Zhou et al. (2017) | 96.0 | -- | 97.8 | -- |
| Cai et al. (2017) | 95.8 | 95.6 | 97.1 | -- |
| Chen et al. (2017) | 94.3 | 95.6 | 96.0 | 94.6 |
| Wang and Xu (2017) | 96.5 | -- | 98.0 | -- |
| Ma et al. (2018) | 96.1 | **97.2** | 98.1 | 96.2 |
| Berserker | 96.6 | 97.1 | **98.4** | **96.5** |

Reference: [Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://arxiv.org/pdf/1808.06511.pdf)

## Limitation
Since Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is a proof of concept of what can be achieved with BERT.

Currently the default model provided is trained on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [PKU dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip). We plan to release more pretrained models in the future.

## Architecture
Berserker is fine-tuned on TPU from the [pretrained Chinese BERT model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip). A single dense layer is applied to the BERT output of every token to produce a sequence of [0, 1] outputs, where 1 denotes a split.
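
As a minimal sketch (a hypothetical helper, not Berserker's actual post-processing), the snippet below shows how such a per-character [0, 1] output can be decoded into tokens, assuming a 1 marks the last character of a word:

```python
def decode_splits(text, labels):
    # Group characters into words: labels[i] == 1 marks a split after text[i].
    words, current = [], ''
    for char, label in zip(text, labels):
        current += char
        if label == 1:
            words.append(current)
            current = ''
    if current:  # flush any trailing characters without a final split
        words.append(current)
    return words

print(decode_splits('姑姑想過', [0, 1, 1, 1]))  # ['姑姑', '想', '過']
```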

## Training
We provide the source code for training under the `trainer` subdirectory. Feel free to contact me if you need any help reproducing the results.

## Bonus Video
[<img src="https://img.youtube.com/vi/H_xmyvABZnE/maxres1.jpg" alt="Yachae!! BERSERKER!!"/>](https://www.youtube.com/watch?v=H_xmyvABZnE)




Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

basaka-0.2.1.tar.gz (86.5 kB)

Uploaded Source

Built Distribution

basaka-0.2.1-py3-none-any.whl (70.6 kB)

Uploaded Python 3

File details

Details for the file basaka-0.2.1.tar.gz.

File metadata

  • Download URL: basaka-0.2.1.tar.gz
  • Upload date:
  • Size: 86.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.18.4 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.4

File hashes

Hashes for basaka-0.2.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `00985705ea711950378335a98f69b929dc602e3b92fd698f7ab76d4215e9cd69` |
| MD5 | `590c5e641358deb36cd353cd8ebcdfaa` |
| BLAKE2b-256 | `4b9e84c6924c38e6871c1edf5191a825e5ca356a3cdcddaf91d71cb8e94a4d35` |


File details

Details for the file basaka-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: basaka-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 70.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.18.4 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.23.4 CPython/3.6.4

File hashes

Hashes for basaka-0.2.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `04fb1365d3632de8f134a30ac10b4239dbb4380451a4717b45e82834d17abab7` |
| MD5 | `95c79ac232e69319e62489ba5e6f528b` |
| BLAKE2b-256 | `1b41c6e5cfce944b020a277c9f3491c9e094dc15a565875af6155becf0f6f3ad` |

