A useful tool for BERT tokenization

BERT Tokens Tools

A set of utilities for handling common problems that come up when working with BERT.

Installation

With pip

This repository is tested on Python 3.6+, and can be installed using pip as follows:

pip install bert-tokens

Usage

Tokenization and token-span-convert

WordPiece tokenization for BERT, applicable across the different language versions of BERT. Supported BERT checkpoints include, but are not limited to:

Token-span-convert

Convert a token span from char level to wordpiece level. This is typically needed for multilingual text, where a single wordpiece may cover several characters.

For example, given query = "播放mylove", the char-level span of the substring "mylove" is [2, 8] (0-based, end-exclusive), while its span after BERT tokenization is [2, 4], because "mylove" is split into the two wordpieces "my" and "##love".

The reverse conversion, from a wordpiece-level span back to a char-level span, is also supported.
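
To make the index arithmetic concrete, here is a minimal, self-contained sketch of both conversions. It is not the library's implementation: the helper names (`token_char_offsets`, `char_to_word_span`, `word_to_char_span`) are hypothetical, and it assumes that spans are end-exclusive, that special tokens such as [CLS] and [SEP] are not counted, and that the wordpieces cover the text contiguously (no whitespace, no [UNK]). Those assumptions are consistent with the numbers in the example above.

```python
from typing import List, Tuple


def token_char_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the end-exclusive (start, end) character range of each wordpiece.

    Assumption: pieces cover the text contiguously and a leading "##"
    only marks a continuation piece (it is not part of the text).
    """
    offsets = []
    pos = 0
    for tok in tokens:
        length = len(tok) - 2 if tok.startswith("##") else len(tok)
        offsets.append((pos, pos + length))
        pos += length
    return offsets


def char_to_word_span(tokens: List[str], char_span: List[int]) -> List[int]:
    """Map an end-exclusive char span to the wordpieces that overlap it."""
    start, end = char_span
    offsets = token_char_offsets(tokens)
    covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    if not covered:
        raise ValueError("char span does not overlap any wordpiece")
    return [covered[0], covered[-1] + 1]


def word_to_char_span(tokens: List[str], word_span: List[int]) -> List[int]:
    """Map an end-exclusive wordpiece span back to a char span."""
    start, end = word_span
    offsets = token_char_offsets(tokens)
    return [offsets[start][0], offsets[end - 1][1]]


# Wordpieces of "播放mylove" without [CLS] / [SEP]
tokens = ["播", "放", "my", "##love"]
print(char_to_word_span(tokens, [2, 8]))  # [2, 4]
print(word_to_char_span(tokens, [2, 4]))  # [2, 8]
```

Real text with spaces, unknown tokens, or characters whose length changes under normalization would need a more careful alignment than this sketch provides.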

Example

from bert_tokens.bert_tokenizer import Tokenizer
from bert_tokens.convert_word_span import convert_word_span, convert_char_span

# Path to the vocabulary file of the BERT checkpoint in use
dict_path = "vocab/vocab.txt"
tokenizer = Tokenizer(dict_path, do_lower_case=True)

tokens = tokenizer.tokenize("播放MYLOVE")
print(tokens)
# ['[CLS]', '播', '放', 'my', '##love', '[SEP]']

# Char-level span -> wordpiece-level span
print(convert_word_span("播放MYLOVE", [2, 8], tokenizer))
# [2, 4]

# Wordpiece-level span -> char-level span
print(convert_char_span("播放MYLOVE", [2, 4], tokenizer))
# [2, 8]
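
The split of "mylove" into 'my' and '##love' in the output above follows from how WordPiece works in BERT: each word is matched against the vocabulary greedily, longest piece first, and continuation pieces carry a "##" prefix. The sketch below illustrates only that per-word step. The function name `wordpiece` and the toy vocabulary are made up for illustration; this is not the code of `bert_tokens`, and lower-casing and per-character splitting of Chinese text are assumed to have happened beforehand.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not a real BERT vocab file
toy_vocab = {"my", "##love", "m", "##y", "love"}
print(wordpiece("mylove", toy_vocab))  # ['my', '##love']
print(wordpiece("xyz", toy_vocab))     # ['[UNK]']
```

Because the longest match wins, "my" is chosen over "m" even though both are in the toy vocabulary.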
