A useful tool for BERT tokenization
BERT Tokens Tools
A set of useful tools for handling common problems when working with BERT.
Installation
With Pip
This repository is tested on Python 3.6+, and can be installed using pip as follows:
pip install bert-tokens
Usage
Tokenization and token-span-convert
WordPiece tokenization for BERT that works across the different language versions of BERT. Supported BERT checkpoints include, but are not limited to:
Token-span-convert
Converts a token span from char level to WordPiece level. This is usually needed in multilingual scenarios, where a single WordPiece token can cover several characters.
For example, for the query "播放mylove", the char-level span of the sequence "mylove" is [2, 8], while the corresponding token-level span after BERT tokenization is [2, 4].
The reverse conversion, from a WordPiece-level span back to a char-level span, is also supported.
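To make the span arithmetic concrete, here is a minimal, self-contained sketch of both conversions. It is not the library's implementation: the helper names (`token_offsets`, `char_to_token_span`, `token_to_char_span`) are hypothetical, and it assumes half-open spans, a token list without `[CLS]`/`[SEP]`, text without whitespace, and no `[UNK]` tokens. Those assumptions are simply one reading that reproduces the numbers in the example above.

```python
def token_offsets(tokens):
    """Return the half-open (start, end) character offsets of each token.

    Assumption: tokens are in order, cover the text without gaps, and
    continuation pieces carry the "##" prefix used by WordPiece.
    """
    offsets = []
    pos = 0
    for tok in tokens:
        length = len(tok[2:]) if tok.startswith("##") else len(tok)
        offsets.append((pos, pos + length))
        pos += length
    return offsets


def char_to_token_span(tokens, char_span):
    """Map a half-open char-level span to a half-open token-level span."""
    start, end = char_span
    # Keep every token whose character range overlaps the requested span.
    covered = [i for i, (s, e) in enumerate(token_offsets(tokens))
               if s < end and e > start]
    if not covered:
        raise ValueError("char span does not overlap any token")
    return [covered[0], covered[-1] + 1]


def token_to_char_span(tokens, token_span):
    """Map a half-open token-level span back to a half-open char-level span."""
    start, end = token_span
    offsets = token_offsets(tokens)
    return [offsets[start][0], offsets[end - 1][1]]


# Hypothetical token list for "播放mylove", without special tokens.
tokens = ["播", "放", "my", "##love"]
print(char_to_token_span(tokens, [2, 8]))   # [2, 4]
print(token_to_char_span(tokens, [2, 4]))   # [2, 8]
```

A real tokenizer also has to deal with whitespace, unknown tokens, and normalization that changes string length, which this sketch deliberately ignores.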
Example
from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span

# Path to the vocabulary file of the BERT checkpoint in use
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)

tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
# ['[CLS]', '播', '放', 'my', '##love', '[SEP]']

# Char-level span -> WordPiece-level span
print(convert_word_span("播放MYLOVE", [2, 8], tokenizer))
# [2, 4]

# WordPiece-level span -> char-level span
print(convert_char_span("播放MYLOVE", [2, 4], tokenizer))
# [2, 8]
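The split of "mylove" into 'my' and '##love' in the output above follows the usual WordPiece scheme. As a rough illustration, the sketch below implements the greedy longest-match-first procedure commonly used for BERT WordPiece tokenization. The function name `wordpiece` and the toy vocabulary are made up for this example, and whether bert-tokens follows exactly this procedure internally is an assumption.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, hypothetical and far smaller than a real vocab.txt.
toy_vocab = {"my", "##love", "love", "m", "##l"}
print(wordpiece("mylove", toy_vocab))  # ['my', '##love']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```

Before this step, a BERT-style tokenizer normally lowercases the input (when `do_lower_case=True`) and splits CJK characters into single-character tokens, which is why "播" and "放" appear as separate tokens.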