A useful tool for BERT tokenization
Project description
Bert Tokens Tools
A useful set of tools to handle common tokenization problems when working with BERT.
Installation
With Pip
This repository is tested on Python 3.6+ and can be installed with pip as follows:
pip install bert-tokens
Usage
Tokenization and token-span-convert
WordPiece tokenization for BERT, universally applicable across the different language versions of BERT and their checkpoints.
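WordPiece tokenization is typically implemented as a greedy longest-match-first lookup against the vocabulary, with continuation pieces prefixed by `##`. The following is a minimal sketch of that matching loop for a single word; the toy `vocab` is an assumption for illustration, not the shipped `vocab/vocab.txt`:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"my", "##love", "lo", "##ve"}
print(wordpiece_tokenize("mylove", vocab))  # ['my', '##love']
```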
Token-span-convert
Convert a token span from the char level to the wordpiece level, which is commonly needed in multilingual scenarios.
For example, for the query "播放mylove", the char-level span of the substring "mylove" is [2, 8], while the corresponding token span after BERT tokenization is [2, 4].
The reverse conversion, from the wordpiece level back to the char level, is also supported.
Example
from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']
convert_word_span("播放MYLOVE", [2,8], tokenizer)
## [2, 4]
convert_char_span("播放MYLOVE", [2,4], tokenizer)
## [2, 8]
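The char-to-wordpiece direction can be sketched by tracking each wordpiece's character offsets (with `##` prefixes stripped) and locating the tokens that cover the span. This is a simplified illustration, not the package's actual implementation: it assumes the tokens concatenate losslessly back to the lower-cased text, and it uses one plausible convention (special tokens such as [CLS]/[SEP] excluded, end-exclusive spans) that reproduces the numbers above:

```python
def char_span_to_token_span(tokens, char_span):
    """Map a char-level [start, end) span to a wordpiece-level [start, end) span.

    `tokens` excludes special tokens like [CLS]/[SEP]; '##' prefixes are
    stripped when counting characters.
    """
    start, end = char_span
    offsets = []  # (char_start, char_end) for each token
    pos = 0
    for tok in tokens:
        length = len(tok) - 2 if tok.startswith("##") else len(tok)
        offsets.append((pos, pos + length))
        pos += length
    # First token whose range contains the char start.
    tok_start = next(i for i, (s, e) in enumerate(offsets) if s <= start < e)
    # Token whose range contains the char end, end-exclusive.
    tok_end = next(i for i, (s, e) in enumerate(offsets) if s < end <= e) + 1
    return [tok_start, tok_end]

tokens = ["播", "放", "my", "##love"]
print(char_span_to_token_span(tokens, [2, 8]))  # [2, 4]
```

The reverse direction (wordpiece span back to char span) follows from the same offset table by reading the char boundaries of the first and last tokens in the span.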
Project details
Hashes for bert_tokens-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 569f01b9ee3e5c4d036d14efbac4bd4c29cee169ddb9744a21786477c691d6eb
MD5 | bea27ed5186447c6e50220a5f695ce71
BLAKE2b-256 | cede054f1dc784acaa8010a73b16e72cc3cf166cbf3318ec31beff7373fae3a3