
A Lightweight Word Piece Tokenizer


This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.

Table of Contents

  1. Usage
  2. Making it Lightweight
  3. Matching Algorithm

Usage

Installing

Install and update using pip

pip install word-piece-tokenizer

Example

from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'

Running Tests

Test the tokenizer against Hugging Face's implementation:

pip install transformers
python tests/tokenizer_test.py
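
Conceptually, the test checks that both tokenizers produce the same ids for the same input. A rough sketch of that kind of comparison (illustrative only, not the actual test code; it assumes transformers' BertTokenizer for bert-base-uncased):

from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

reference = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = WordPieceTokenizer()

text = 'reading a storybook!'
# Both should produce ids with [CLS] and [SEP] added.
assert tokenizer.tokenize(text) == reference.encode(text)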

Making It Lightweight

To make the tokenizer more lightweight and versatile for usage such as embedded systems and browsers, the tokenizer has been stripped of optional and unused features.

Optional Features

The following features are enabled by default instead of being configurable (a short example follows the list):

Tokenizer
  • The tokenizer utilises the pre-trained bert-base-uncased vocab list.
  • Basic tokenization is performed before word piece tokenization.

Text Cleaning
  • Chinese characters are padded with whitespace.
  • Characters are converted to lowercase.
  • Accents are stripped from the input string.
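
Because lowercasing and accent stripping are always applied, inputs that differ only in case or accents should map to the same token ids. A minimal sketch (the exact ids depend on the vocab, so none are shown here):

from word_piece_tokenizer import WordPieceTokenizer

tokenizer = WordPieceTokenizer()

# Lowercasing and accent stripping are applied automatically,
# so both calls should return identical token ids.
print(tokenizer.tokenize('Résumé Reading'))
print(tokenizer.tokenize('resume reading'))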

Unused Features

The following features have been removed from the tokenizer:

  • pad_token, mask_token, and special tokens
  • Ability to add new tokens to the tokenizer
  • Ability to never split certain strings (never_split)
  • Unused functions such as build_inputs_with_special_tokens, get_special_tokens_mask, get_vocab, save_vocabulary, and more...

Matching Algorithm

The tokenizer's longest substring token matching algorithm is implemented using a trie instead of the greedy longest-match-first approach.

The Trie

The original Trie class has been modified to support the longest substring token matching algorithm described above.

Instead of a split function that separates the input string into substrings, the new trie implements a getLongestMatchToken function that returns the token value (int) of the longest substring match and the remaining unmatched substring (str).
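
A minimal sketch of what such a trie could look like (illustrative only; the class and method names mirror the description above but are not taken from the library's source):

class Trie:
    """Illustrative character trie mapping token strings to integer ids."""

    def __init__(self):
        self.root = {}

    def add(self, token, token_id):
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        node['_id'] = token_id  # marks the end of a complete token

    def getLongestMatchToken(self, text):
        # Walk the trie character by character, remembering the last
        # position where a complete token ended.
        node = self.root
        best_id, best_len = None, 0
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if '_id' in node:
                best_id, best_len = node['_id'], i + 1
        # Return the id of the longest matched token (or None if nothing
        # matched) together with the unmatched remainder of the input.
        return best_id, text[best_len:]

# Usage (token ids are arbitrary here):
trie = Trie()
trie.add('story', 1)
trie.add('storybook', 2)
trie.getLongestMatchToken('storybooks')  # -> (2, 's')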

