Extremely fast BERT tokenizer

Project description

Usage Guide for FlashTokenizer

FlashTokenizer is a high-performance BERT tokenizer implemented in C++ for efficient LLM inference. It is designed to run significantly faster than conventional tokenizers, such as Hugging Face's BertTokenizer, while producing the same token IDs.


Installation

Install FlashTokenizer with pip:

pip install -U flash-tokenizer

Or from source:

git clone https://github.com/NLPOptimize/flash-tokenizer
cd flash-tokenizer/prj
pip install .
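
To confirm that either install route worked, one option is to query the installed distribution's version from Python. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `flash-tokenizer` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("flash-tokenizer")
    if version is None:
        print("flash-tokenizer is not installed in this environment")
    else:
        print(f"flash-tokenizer {version} is installed")
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to install.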

Prerequisites

  • Windows (AMD64), macOS (ARM64), Ubuntu (x86-64)
  • Python 3.8 to 3.13
  • g++, clang++, or MSVC
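
The platform and Python requirements above can be checked programmatically before installing. The sketch below is an illustration, not part of FlashTokenizer: `is_supported` is a hypothetical helper, and it assumes that "Ubuntu (x86-64)" can be approximated as any Linux system reporting `x86_64`, which may be broader than what the project actually tests.

```python
import platform
import sys

# (system, machine) pairs from the prerequisites list, in the form reported
# by platform.system() and platform.machine(); compared case-insensitively.
# Assumption: "Ubuntu (x86-64)" is treated as any Linux on x86_64.
SUPPORTED_PLATFORMS = {
    ("windows", "amd64"),
    ("darwin", "arm64"),
    ("linux", "x86_64"),
}
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 13)


def is_supported(version, system, machine):
    """Return True if the interpreter and platform match the list above."""
    python_ok = MIN_PYTHON <= tuple(version[:2]) <= MAX_PYTHON
    platform_ok = (system.lower(), machine.lower()) in SUPPORTED_PLATFORMS
    return python_ok and platform_ok


if __name__ == "__main__":
    ok = is_supported(sys.version_info, platform.system(), platform.machine())
    if ok:
        print("This environment matches the listed prerequisites")
    else:
        print("This environment is not in the listed prerequisites")
```

A negative result does not necessarily mean installation will fail; it only means the environment falls outside the combinations listed here.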

Quick Start

FlashTokenizer supports several pretrained models. To list them:

from flash_tokenizer import BertTokenizerFlash

print(*BertTokenizerFlash.get_pretrained(), sep="\n")

Output:

bert-base-cased
bert-base-uncased
bert-base-chinese
bert-base-multilingual-cased
bert-base-multilingual-uncased
kcbert-base
llmlingua-2-bert-base-multilingual-cased-meetingbank
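
Because the supported models are identified by plain names, a small guard can catch typos before a tokenizer is loaded. The following is a sketch under stated assumptions: `pick_pretrained` is a hypothetical helper, and the name list is copied from the output above rather than fetched from the library.

```python
import difflib

# Names copied from the output listed above. In practice they could be
# obtained with list(BertTokenizerFlash.get_pretrained()).
PRETRAINED_NAMES = [
    "bert-base-cased",
    "bert-base-uncased",
    "bert-base-chinese",
    "bert-base-multilingual-cased",
    "bert-base-multilingual-uncased",
    "kcbert-base",
    "llmlingua-2-bert-base-multilingual-cased-meetingbank",
]


def pick_pretrained(name, available=PRETRAINED_NAMES):
    """Return name if it is available, else raise with close suggestions."""
    if name in available:
        return name
    suggestions = difflib.get_close_matches(name, available, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown pretrained tokenizer {name!r}.{hint}")


print(pick_pretrained("kcbert-base"))
```

The returned name can then be passed to the loading call, so a misspelled model fails early with a readable message.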

Tokenizing Text

FlashTokenizer usage aligns closely with Hugging Face's BertTokenizer:

from flash_tokenizer import BertTokenizerFlash

titles = [
    'is there any doubt about it "None whatsoever"',
    "세상 어떤 짐승이 이를 드러내고 사냥을 해? 약한 짐승이나 몸을 부풀리지, 진짜 짐승은 누구보다 침착하지.",
    'そのように二番目に死を偽装して生き残るようになったイタドリがどうして初めて見る自分をこんなに気遣ってくれるのかと尋ねると「私が大切にする人たちがあなたを大切にするから」と答えては'
]

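# Load the tokenizer from a local BERT vocabulary file (one token per line)
tokenizer = BertTokenizerFlash('vocab.txt', do_lower_case=False, model_max_length=512)

for title in titles:
    # Split the text into WordPiece tokens
    tokens = tokenizer.tokenize(title)
    # Encode the text to token IDs and take the sequence for this single input
    token_ids = tokenizer(title, max_length=512, padding="longest").input_ids[0]
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}\n")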
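
For readers unfamiliar with what a BERT tokenizer produces, the sketch below shows the greedy longest-match-first WordPiece split that BERT-style tokenizers apply to each word. It is a plain-Python illustration with a made-up toy vocabulary, not FlashTokenizer's C++ implementation, and it omits the preceding steps (whitespace and punctuation splitting, optional lowercasing). The function name `wordpiece` and its parameters are hypothetical.

```python
def wordpiece(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one whitespace-free word into WordPiece sub-tokens."""
    if len(word) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
toy_vocab = {"token", "##izer", "##s", "fast", "##er"}
print(wordpiece("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece("faster", toy_vocab))      # ['fast', '##er']
print(wordpiece("slow", toy_vocab))        # ['[UNK]']
```

The token IDs returned by a tokenizer are the positions of these pieces in the vocabulary file.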

Using Pretrained Tokenizers

You can also load a pretrained tokenizer directly by name:

from flash_tokenizer import BertTokenizerFlash

tokenizer = BertTokenizerFlash.from_pretrained('bert-base-multilingual-cased')

Comparing Accuracy

To check FlashTokenizer's output against Hugging Face's BertTokenizer, compare the token IDs both produce for the same texts:

from flash_tokenizer import BertTokenizerFlash
from transformers import BertTokenizer

texts = ["Chess is Life.", "Dies Spiel ist ein Probierstein des Gehirns."]

flash_tokenizer = BertTokenizerFlash.from_pretrained('bert-base-multilingual-uncased', original=True)
hf_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

# Count the texts for which both tokenizers produce identical token IDs
correct = 0
for text in texts:
    flash_ids = flash_tokenizer(text, max_length=512).input_ids[0]
    hf_ids = hf_tokenizer(text, max_length=512, return_tensors="np").input_ids[0].tolist()
    correct += int(flash_ids == hf_ids)

# Percentage of texts with an exact match
accuracy = correct * 100 / len(texts)
print(f"Accuracy: {accuracy:.2f}%")
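
When the comparison is run over a larger corpus, it helps to know which inputs disagree, not only how many. The helper below is a minimal sketch of that idea; `exact_match_report` is a hypothetical name, and the ID sequences in the usage lines are made up purely to show the return value.

```python
def exact_match_report(candidate_ids, reference_ids):
    """Compare two lists of token-ID sequences position by position.

    Returns a tuple (accuracy_percent, mismatched_indices).
    """
    if len(candidate_ids) != len(reference_ids):
        raise ValueError("both inputs must contain the same number of sequences")
    mismatches = [
        i
        for i, (cand, ref) in enumerate(zip(candidate_ids, reference_ids))
        if list(cand) != list(ref)
    ]
    total = len(reference_ids)
    accuracy = 100.0 * (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Made-up ID sequences, not real tokenizer output.
flash_batch = [[101, 7, 8, 102], [101, 9, 102]]
hf_batch = [[101, 7, 8, 102], [101, 10, 102]]
accuracy, bad = exact_match_report(flash_batch, hf_batch)
print(f"Accuracy: {accuracy:.2f}%  mismatched inputs: {bad}")
```

The mismatched indices point back into the original list of texts, which makes individual disagreements easy to inspect.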

Support

FlashTokenizer is actively maintained and optimized. For issues, feature requests, or contributions, visit the GitHub repository: https://github.com/NLPOptimize/flash-tokenizer


Enjoy fast and efficient tokenization with FlashTokenizer!

Download files

Download the file for your platform.

Source Distribution

  • flash_tokenizer-1.2.0.tar.gz (5.2 MB): Source

Built Distributions

  • flash_tokenizer-1.2.0-cp313-cp313-win_amd64.whl (330.6 kB): CPython 3.13, Windows x86-64
  • flash_tokenizer-1.2.0-cp313-cp313-manylinux_2_28_x86_64.whl (374.0 kB): CPython 3.13, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp313-cp313-macosx_15_0_arm64.whl (198.0 kB): CPython 3.13, macOS 15.0+ ARM64
  • flash_tokenizer-1.2.0-cp312-cp312-win_amd64.whl (330.5 kB): CPython 3.12, Windows x86-64
  • flash_tokenizer-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl (374.1 kB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp312-cp312-macosx_15_0_arm64.whl (197.9 kB): CPython 3.12, macOS 15.0+ ARM64
  • flash_tokenizer-1.2.0-cp311-cp311-win_amd64.whl (329.9 kB): CPython 3.11, Windows x86-64
  • flash_tokenizer-1.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (376.0 kB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp311-cp311-macosx_15_0_arm64.whl (198.4 kB): CPython 3.11, macOS 15.0+ ARM64
  • flash_tokenizer-1.2.0-cp310-cp310-win_amd64.whl (328.4 kB): CPython 3.10, Windows x86-64
  • flash_tokenizer-1.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (371.0 kB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp310-cp310-macosx_15_0_arm64.whl (197.0 kB): CPython 3.10, macOS 15.0+ ARM64
  • flash_tokenizer-1.2.0-cp39-cp39-win_amd64.whl (327.9 kB): CPython 3.9, Windows x86-64
  • flash_tokenizer-1.2.0-cp39-cp39-manylinux_2_28_x86_64.whl (370.1 kB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp39-cp39-macosx_15_0_arm64.whl (197.1 kB): CPython 3.9, macOS 15.0+ ARM64
  • flash_tokenizer-1.2.0-cp38-cp38-win_amd64.whl (328.6 kB): CPython 3.8, Windows x86-64
  • flash_tokenizer-1.2.0-cp38-cp38-manylinux_2_28_x86_64.whl (370.4 kB): CPython 3.8, manylinux: glibc 2.28+ x86-64
  • flash_tokenizer-1.2.0-cp38-cp38-macosx_15_0_arm64.whl (196.8 kB): CPython 3.8, macOS 15.0+ ARM64
