A Lightweight Word Piece Tokenizer
This library is an implementation of a modified version of Hugging Face's BERT tokenizer in pure Python.
Usage
Installing
Install and update using pip
pip install word-piece-tokenizer
Example
from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()
ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]
tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']
tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'
Running Tests
Test the tokenizer against Hugging Face's implementation:
pip install transformers
python tests/tokenizer_test.py
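A minimal sketch of such a comparison (not the contents of the actual test file) could look like the following, assuming transformers is installed and that Hugging Face's encode output is compared against this library's tokenize output:

from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

# Sketch: compare this library's ids against Hugging Face's bert-base-uncased tokenizer
hf_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
lite_tokenizer = WordPieceTokenizer()

sentence = 'reading a storybook!'
hf_ids = hf_tokenizer.encode(sentence)        # includes [CLS]/[SEP] ids by default
lite_ids = lite_tokenizer.tokenize(sentence)  # this library returns ids directly
assert hf_ids == lite_ids, f'{hf_ids} != {lite_ids}'
print('outputs match')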
Making It Lightweight
To make the tokenizer lighter and more versatile for environments such as embedded systems and browsers, it has been stripped of optional and unused features.
Optional Features
The following features are enabled by default instead of being configurable:

Category | Feature
---|---
Tokenizer | The tokenizer utilises the pre-trained bert-base-uncased vocab list; basic tokenization is performed before word piece tokenization
Text Cleaning | Chinese characters are padded with whitespace; characters are converted to lowercase; the input string is stripped of accents
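As a rough illustration of what these always-on cleaning steps amount to (a sketch only; the helper name and the CJK range below are illustrative, not the library's actual code):

import unicodedata

def clean_text(text):
    # Sketch of the always-on cleaning: lowercase, strip accents, pad CJK characters
    text = unicodedata.normalize('NFD', text.lower())
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')  # drop combining accents
    padded = []
    for ch in text:
        # pad basic CJK ideographs with whitespace (illustrative range only)
        padded.append(f' {ch} ' if 0x4E00 <= ord(ch) <= 0x9FFF else ch)
    return ''.join(padded)

print(clean_text('Héllo 你好'))  # 'hello  你  好 '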
Unused Features
The following features have been removed from the tokenizer:

- pad_token, mask_token, and special tokens
- Ability to add new tokens to the tokenizer
- Ability to never split certain strings (never_split)
- Unused functions such as build_inputs_with_special_tokens, get_special_tokens_mask, get_vocab, save_vocabulary, and more...
Matching Algorithm
The tokenizer's longest substring token matching algorithm is implemented using a trie instead of a greedy longest-match-first algorithm.
The Trie
The original Trie class has been modified to adapt to the modified longest substring token matching algorithm. Instead of a split function that separates the input string into substrings, the new trie implements a getLongestMatchToken function that returns the token value (int) of the longest substring match and the remaining unmatched substring (str).
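A minimal sketch of a trie exposing this kind of interface (the method name mirrors the description above, but the library's internal implementation may differ):

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, token, token_id):
        # walk/create one node per character and mark the end of a vocab token with its id
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        node['_id'] = token_id

    def getLongestMatchToken(self, text):
        # return the id of the longest matching prefix and the unmatched remainder
        node = self.root
        match_id, match_len = None, 0
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if '_id' in node:
                match_id, match_len = node['_id'], i + 1
        return match_id, text[match_len:]

trie = Trie()
trie.add('story', 2466)
trie.add('##book', 8654)
print(trie.getLongestMatchToken('storybook'))  # (2466, 'book')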
Download files
Source Distribution
word-piece-tokenizer-1.0.1.tar.gz (120.3 kB)

Built Distribution
word_piece_tokenizer-1.0.1-py3-none-any.whl (119.9 kB)
File details
Details for the file word-piece-tokenizer-1.0.1.tar.gz.
File metadata
- Download URL: word-piece-tokenizer-1.0.1.tar.gz
- Upload date:
- Size: 120.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 46348d0a75fb4df364fa20c1dcf050b813debe7dae9e2971cadd05e3c8c0b4ba
MD5 | e059e09ca96646a424f66a8868d67164
BLAKE2b-256 | c1e57a38a1cc6fe9d729c0ccbb92273bc6b0ffb8e1d4d9c76c3ee3b522b8fc8b
File details
Details for the file word_piece_tokenizer-1.0.1-py3-none-any.whl.
File metadata
- Download URL: word_piece_tokenizer-1.0.1-py3-none-any.whl
- Upload date:
- Size: 119.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 655b8e918cebddfd1a13d11d77ca2c670939443e3dc9c4802d4472cc3eb6322c
MD5 | a95321ddddc3bdeac359711ed1cdf72e
BLAKE2b-256 | 48f74bd06cb1294f0f6b133e415a1226c4e4f2a8ae01f1da869315f82674eed8