A Lightweight Word Piece Tokenizer

Project description

This library is a pure-Python implementation of a modified version of Hugging Face's BERT tokenizer.

Table of Contents

  1. Usage
  2. Making it Lightweight
  3. Matching Algorithm

Usage

Installing

Install and update using pip

pip install word-piece-tokenizer

Example

from word_piece_tokenizer import WordPieceTokenizer
tokenizer = WordPieceTokenizer()
tokenizer.tokenize("Hello World!")

Running Tests

Test the tokenizer against Hugging Face's implementation:

pip install transformers
python tests/tokenizer_test.py
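
For a quick spot check without the full test suite, the two tokenizers can also be compared directly. The snippet below is a minimal sketch; it assumes that tokenize() returns the same list of input ids that transformers' encode() produces for plain text.

from transformers import BertTokenizer
from word_piece_tokenizer import WordPieceTokenizer

text = "Hello World!"
hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lite_tokenizer = WordPieceTokenizer()

# Assumption: tokenize() mirrors encode(), i.e. both return token ids
# (including any special tokens) for the same input string.
assert lite_tokenizer.tokenize(text) == hf_tokenizer.encode(text)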

Making It Lightweight

To keep the tokenizer lightweight and suitable for constrained environments such as embedded systems and browsers, it has been stripped of optional and unused features.

Optional Features

The following features have been enabled by default instead of being configurable:

Category       Feature
Tokenizer      • The tokenizer uses the pre-trained bert-base-uncased vocab list.
               • Basic tokenization is performed before word piece tokenization.
Text Cleaning  • Chinese characters are padded with whitespace.
               • Characters are converted to lowercase.
               • The input string is stripped of accents.
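
For illustration, the lowercasing and accent stripping roughly correspond to the standard-library steps below. This is a sketch of the usual BERT-style cleaning (Chinese-character padding omitted), not the library's exact code.

import unicodedata

def clean_text(text):
    # Lowercase, then strip accents: decompose characters (NFD) and
    # drop combining marks (Unicode category "Mn").
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

clean_text("Héllo Wörld!")  # 'hello world!'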

Unused Features

The following features have been removed from the tokenizer:

  • pad_token, mask_token, and special tokens
  • Ability to add new tokens to the tokenizer
  • Ability to never split certain strings (never_split)
  • Unused functions such as build_inputs_with_special_tokens, get_special_tokens_mask, get_vocab, save_vocabulary, and more...

Matching Algorithm

The tokenizer's longest substring token matching algorithm is implemented using a trie instead of a greedy longest-match-first search.

The Trie

The original Trie class has been adapted to the modified longest substring token matching algorithm.

Instead of a split function that separates the input string into substrings, the new trie implements a getLongestMatchToken function that returns the token value (int) of the longest substring match and the remaining unmatched substring (str).
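
A minimal sketch of that idea is shown below. The nested-dict node layout and the tiny vocabulary are assumptions for illustration, not the library's internal code; only the getLongestMatchToken behaviour described above comes from the source.

class Trie:
    def __init__(self):
        self.root = {}  # each node maps a character to a child node

    def add(self, word, token_id):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_id"] = token_id  # marks the end of a vocabulary token

    def getLongestMatchToken(self, text):
        # Walk the trie, remembering the last position where a full token ended.
        node, best_id, best_end = self.root, None, 0
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if "_id" in node:
                best_id, best_end = node["_id"], i + 1
        # Token id of the longest match, plus the unmatched remainder.
        return best_id, text[best_end:]

trie = Trie()
trie.add("un", 1)
trie.add("unaffable", 2)
trie.getLongestMatchToken("unaffixed")  # -> (1, 'affixed')

Repeated calls on the remainder (switching to the "##" continuation vocabulary after the first match) would then yield the full word piece segmentation.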

Project details


Download files

Download the file for your platform.

Source Distribution

word-piece-tokenizer-1.0.0.tar.gz (119.8 kB)

Uploaded Source

Built Distribution

word_piece_tokenizer-1.0.0-py3-none-any.whl (119.6 kB)

Uploaded Python 3

File details

Details for the file word-piece-tokenizer-1.0.0.tar.gz.

File metadata

  • Download URL: word-piece-tokenizer-1.0.0.tar.gz
  • Upload date:
  • Size: 119.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.7

File hashes

Hashes for word-piece-tokenizer-1.0.0.tar.gz
Algorithm Hash digest
SHA256 9eec64839995153eda13de124407b6f97056ff416f6ed99825621d01259175ae
MD5 b35878715c2d9c1a7dd4e4d97f5b6d1a
BLAKE2b-256 1a12ef710a371775acac63ac07aa75caeae9a3e6d985d284576b91c23a7f57c5

File details

Details for the file word_piece_tokenizer-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for word_piece_tokenizer-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d2de03358bcbe01acc9827f3e7256e0cc4debcdcc48dd170cd8f4039abdc926f
MD5 dd8fdb7869587f653ef0a26189f5fe7a
BLAKE2b-256 b1730e2ddab45ad7a8af6ae463074cc2af0d68ad994b28eb6ac3605dc86b81d9

