
A useful tool for BERT tokenization


Bert Tokens Tools

A set of useful tools for handling common problems when working with BERT.

Installation

With pip

This repository is tested on Python 3.6+ and can be installed with pip as follows:

pip install bert-tokens

Usage

Tokenization and token-span conversion

WordPiece tokenization for BERT, applicable across the different language versions of BERT. The tokenizer works with different BERT checkpoints, as long as they ship a standard WordPiece vocabulary file.

Token-span conversion

Convert token spans from char level to wordpiece level; this typically comes up in multilingual scenarios.

For example, given query = "播放mylove" ("播放" means "play"), the char-level span of the substring "mylove" is [2, 8], while the token span after BERT tokenization is [2, 4].

The reverse conversion, from wordpiece level back to char level, is also supported.

Example

from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span

# Build a WordPiece tokenizer from a BERT vocabulary file
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Tokenize a mixed Chinese/English query
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']

# Char-level span [2, 8] ("MYLOVE") -> wordpiece-level span
convert_word_span("播放MYLOVE", [2, 8], tokenizer)
## [2, 4]

# Wordpiece-level span -> char-level span (the reverse)
convert_char_span("播放MYLOVE", [2, 4], tokenizer)
## [2, 8]
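
For intuition, the char-to-wordpiece mapping can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's actual implementation: it assumes end-exclusive [start, end) spans, token indices counted over the plain wordpieces (no [CLS]/[SEP]), and that every piece can be found verbatim in the lowercased text (so no [UNK] or normalization handling).

def char_span_to_token_span(pieces, text, char_span):
    # Align each wordpiece with its character range in the lowercased text.
    text = text.lower()
    piece_spans, cursor = [], 0
    for piece in pieces:
        piece = piece[2:] if piece.startswith("##") else piece
        start = text.find(piece, cursor)
        piece_spans.append((start, start + len(piece)))
        cursor = start + len(piece)
    # Keep the indices of the pieces that overlap the query span.
    lo, hi = char_span
    hits = [i for i, (s, e) in enumerate(piece_spans) if s < hi and e > lo]
    return [hits[0], hits[-1] + 1] if hits else []

pieces = ["播", "放", "my", "##love"]  # wordpieces for "播放MYLOVE"
print(char_span_to_token_span(pieces, "播放MYLOVE", [2, 8]))
## [2, 4]

The reverse direction can reuse the same per-piece alignment: take the character range spanned by the first and last wordpiece inside the token span.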
