
A Lightweight Word Piece Tokenizer

Project description


This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.

Table of Contents

  1. Usage
  2. Making it Lightweight
  3. Matching Algorithm

Usage

Installing

Install and update using pip:

pip install word-piece-tokenizer

Example

from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()

ids = tokenizer.tokenize('reading a storybook!')
# [101, 3752, 1037, 2466, 8654, 999, 102]

tokens = tokenizer.convert_ids_to_tokens(ids)
# ['[CLS]', 'reading', 'a', 'story', '##book', '!', '[SEP]']

tokenizer.convert_tokens_to_string(tokens)
# '[CLS] reading a storybook ! [SEP]'

Running Tests

Test the tokenizer against Hugging Face's implementation:

pip install transformers
python tests/tokenizer_test.py

Making It Lightweight

To make the tokenizer more lightweight and versatile for use cases such as embedded systems and browsers, it has been stripped of optional and unused features.

Optional Features

The following features are enabled by default instead of being configurable:

Tokenizer
  • The tokenizer uses the pre-trained bert-base-uncased vocab list.
  • Basic tokenization is performed before word piece tokenization.

Text Cleaning (see the sketch after this list)
  • Chinese characters are padded with whitespace.
  • Characters are converted to lowercase.
  • The input string is stripped of accents.
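As a rough illustration of the cleaning steps above, here is a minimal sketch; the clean_text function and its exact rules are assumptions for this example, not the library's actual API (real BERT cleaning covers more CJK ranges and also handles control characters).

import unicodedata

def clean_text(text):
    # Lowercase, then strip accents via NFD decomposition.
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Pad common CJK ideographs with whitespace (simplified range check).
    padded = []
    for ch in text:
        padded.append(f" {ch} " if 0x4E00 <= ord(ch) <= 0x9FFF else ch)
    return "".join(padded)

print(clean_text("Café 读书"))  # 'cafe  读  书 '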

Unused Features

The following features have been removed from the tokenizer:

  • pad_token, mask_token, and special tokens
  • Ability to add new tokens to the tokenizer
  • Ability to never split certain strings (never_split)
  • Unused functions such as build_inputs_with_special_tokens, get_special_tokens_mask, get_vocab, save_vocabulary, and more...

Matching Algorithm

The tokenizer's longest-substring token matching is implemented with a trie instead of a greedy longest-match-first search.

The Trie

The original Trie class has been modified to fit this longest-substring token matching algorithm.

Instead of a split function that separates the input string into substrings, the new trie implements a getLongestMatchToken function that returns the token value (int) of the longest matching substring and the remaining unmatched substring (str).
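For illustration, a minimal sketch of the idea is shown below, assuming a plain Python dict as the vocab; the real Trie class differs in detail (it must also handle the '##' continuation prefix and unknown tokens). The token ids for 'story' and '##book' are taken from the example above.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocab token ends at this node

class Trie:
    def __init__(self, vocab):
        self.root = TrieNode()
        for token, token_id in vocab.items():
            node = self.root
            for ch in token:
                node = node.children.setdefault(ch, TrieNode())
            node.token_id = token_id

    def getLongestMatchToken(self, text):
        # Walk the trie, remembering the deepest node that completes a token,
        # then return that token's id and whatever input is left over.
        node = self.root
        best_id, best_len = None, 0
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.token_id is not None:
                best_id, best_len = node.token_id, i + 1
        return best_id, text[best_len:]

vocab = {'story': 2466, '##book': 8654}
trie = Trie(vocab)
print(trie.getLongestMatchToken('storybook'))  # (2466, 'book')
# The word piece loop would then prepend '##' and match '##book' -> 8654.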

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

word-piece-tokenizer-1.0.1.tar.gz (120.3 kB)

Built Distribution

word_piece_tokenizer-1.0.1-py3-none-any.whl (119.9 kB)

File details

Details for the file word-piece-tokenizer-1.0.1.tar.gz.

File metadata

  • Download URL: word-piece-tokenizer-1.0.1.tar.gz
  • Upload date:
  • Size: 120.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for word-piece-tokenizer-1.0.1.tar.gz
  • SHA256: 46348d0a75fb4df364fa20c1dcf050b813debe7dae9e2971cadd05e3c8c0b4ba
  • MD5: e059e09ca96646a424f66a8868d67164
  • BLAKE2b-256: c1e57a38a1cc6fe9d729c0ccbb92273bc6b0ffb8e1d4d9c76c3ee3b522b8fc8b


File details

Details for the file word_piece_tokenizer-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for word_piece_tokenizer-1.0.1-py3-none-any.whl
  • SHA256: 655b8e918cebddfd1a13d11d77ca2c670939443e3dc9c4802d4472cc3eb6322c
  • MD5: a95321ddddc3bdeac359711ed1cdf72e
  • BLAKE2b-256: 48f74bd06cb1294f0f6b133e415a1226c4e4f2a8ae01f1da869315f82674eed8

