A useful tool for BERT tokenization
Project description
Bert Tokens Tools
A useful set of tools for handling common problems when working with BERT.
Installation
With Pip
This repository is tested on Python 3.6+ and can be installed with pip:
pip install bert-tokens
Usage
Tokenization and token-span-convert
WordPiece tokenization for BERT, which works universally across the different language versions of BERT and a variety of BERT checkpoints.
Token-span-convert
Convert a token span from char level to wordpiece level. This is usually needed in multilingual scenarios.
For example, for query="播放mylove", the char-level span of "mylove" is [2, 8], while the token span after BERT tokenization is [2, 4].
Example
from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)
tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
## ['[CLS]', '播', '放', 'my', '##love', '[SEP]']
span = convert_word_span("播放MYLOVE", [2, 8], tokenizer)
print(span)
## [2, 4]
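To make the conversion above concrete, here is a minimal sketch of how a char-level span can be mapped onto wordpiece indices. This is not the library's actual implementation; `char_span_to_token_span` is a hypothetical helper, and it assumes the token list excludes special tokens and that lower-casing matches the tokenizer's behavior.

```python
def char_span_to_token_span(text, char_span, tokens):
    """Map a half-open char span [start, end) onto wordpiece indices."""
    start, end = char_span
    # Record the character range each wordpiece covers in the original text.
    offsets = []
    pos = 0
    for tok in tokens:
        piece = tok[2:] if tok.startswith("##") else tok  # strip "##" prefix
        pos = text.lower().find(piece, pos)
        offsets.append((pos, pos + len(piece)))
        pos += len(piece)
    # First token containing the span start, last token containing the span end.
    tok_start = next(i for i, (s, e) in enumerate(offsets) if s <= start < e)
    tok_end = next(i for i, (s, e) in enumerate(offsets) if s < end <= e)
    return [tok_start, tok_end + 1]

tokens = ["播", "放", "my", "##love"]
print(char_span_to_token_span("播放mylove", [2, 8], tokens))  # → [2, 4]
```

Here "my" and "##love" cover chars 2–8, so the half-open char span [2, 8) maps to the wordpiece span [2, 4].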
Download files
Source Distribution
bert-tokens-0.0.2.tar.gz (5.6 kB)
Built Distribution
bert_tokens-0.0.2-py3-none-any.whl

Hashes for bert_tokens-0.0.2-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 20ab3fc6bf48710e7b1938d303cbc944eff97cbd53dafc6ad3c8ccd4b864ca48
MD5 | 2340d682d45de46022f271672fcea145
BLAKE2b-256 | 818f22d75fb1d1dcdd0d8c35e9b7be91726ad3a84e803819aaa09be953b8129e