Extremely fast BERT tokenizer
Usage Guide for FlashTokenizer
FlashTokenizer is a high-performance BERT tokenizer implemented in C++ for efficient LLM inference. It is designed to be significantly faster than traditional tokenizers while matching their accuracy.
Installation
Install FlashTokenizer with pip:
pip install -U flash-tokenizer
Or from source:
git clone https://github.com/NLPOptimize/flash-tokenizer
cd flash-tokenizer/prj
pip install .
Prerequisites
- Windows (AMD64), macOS (ARM64), Ubuntu (x86-64)
- Python 3.8 to 3.13
- g++, clang++, or MSVC
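The interpreter range listed above can be checked before installing. The following is a minimal sketch: the helper name `is_supported_python` is hypothetical and not part of FlashTokenizer, and the bounds simply restate the 3.8 to 3.13 range given in the prerequisites.

```python
import sys

# Bounds restate the range listed in the prerequisites (Python 3.8 to 3.13).
# This helper is illustrative only; it is not shipped with FlashTokenizer.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 13)


def is_supported_python(version=None):
    """Return True if the (major, minor) version lies in the listed range."""
    if version is None:
        # Default to the interpreter that is running this script.
        version = sys.version_info[:2]
    return MIN_VERSION <= tuple(version[:2]) <= MAX_VERSION


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          f"supported: {is_supported_python()}")
```

Running the script with the interpreter you plan to install into reports whether it falls inside the listed range.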
Quick Start
FlashTokenizer supports various pretrained models:
from flash_tokenizer import BertTokenizerFlash
print(*BertTokenizerFlash.get_pretrained(), sep="\n")
Output:
bert-base-cased
bert-base-uncased
bert-base-chinese
bert-base-multilingual-cased
bert-base-multilingual-uncased
kcbert-base
llmlingua-2-bert-base-multilingual-cased-meetingbank
Tokenizing Text
FlashTokenizer usage aligns closely with Hugging Face's BertTokenizer:
from flash_tokenizer import BertTokenizerFlash
from transformers import BertTokenizer
titles = [
'is there any doubt about it "None whatsoever"',
"세상 어떤 짐승이 이를 드러내고 사냥을 해? 약한 짐승이나 몸을 부풀리지, 진짜 짐승은 누구보다 침착하지.",
'そのように二番目に死を偽装して生き残るようになったイタドリがどうして初めて見る自分をこんなに気遣ってくれるのかと尋ねると「私が大切にする人たちがあなたを大切にするから」と答えては'
]
# Load tokenizer with vocab file
tokenizer = BertTokenizerFlash('vocab.txt', do_lower_case=False, model_max_length=512)
for title in titles:
    tokens = tokenizer.tokenize(title)
    token_ids = tokenizer(title, max_length=512, padding="longest").input_ids[0]
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}\n")
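For readers unfamiliar with what a BERT tokenizer computes for each word, the sketch below shows the greedy longest-match-first WordPiece step used by BERT-style tokenizers. It is a plain-Python illustration under stated assumptions, not FlashTokenizer's C++ implementation: the function name `wordpiece`, the toy vocabulary, and the `max_chars` limit are all hypothetical, and a real tokenizer also performs text cleanup, optional lowercasing, and punctuation splitting before this step.

```python
def wordpiece(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into WordPiece tokens by greedy longest-match-first.

    Illustrative only: this mirrors the well-known BERT algorithm, not
    FlashTokenizer's internals.
    """
    if len(word) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary: whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not a real vocab.txt).
toy_vocab = {"un", "##aff", "##able", "token", "##izer", "##s"}

print(wordpiece("unaffable", toy_vocab))   # ['un', '##aff', '##able']
print(wordpiece("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

The token IDs printed by the loop above are the vocabulary line indices of pieces like these, with special tokens such as `[CLS]` and `[SEP]` added around them.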
Using Pretrained Tokenizers
You can also directly load pretrained tokenizers:
from flash_tokenizer import BertTokenizerFlash
tokenizer = BertTokenizerFlash.from_pretrained('bert-base-multilingual-cased')
Comparing Accuracy
FlashTokenizer allows easy accuracy comparison with Hugging Face's tokenizer:
from flash_tokenizer import BertTokenizerFlash
from transformers import BertTokenizer
texts = ["Chess is Life.", "Dies Spiel ist ein Probierstein des Gehirns."]
flash_tokenizer = BertTokenizerFlash.from_pretrained('bert-base-multilingual-uncased', original=True)
hf_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
correct = 0
for text in texts:
    flash_ids = flash_tokenizer(text, max_length=512).input_ids[0]
    hf_ids = hf_tokenizer(text, max_length=512, return_tensors="np").input_ids[0].tolist()
    correct += int(flash_ids == hf_ids)
accuracy = correct * 100 / len(texts)
print(f"Accuracy: {accuracy:.2f}%")
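The exact-match metric used above can be factored into a small helper that works on any two lists of token-ID sequences, which makes it easy to reuse on a larger corpus. This is a sketch only: `exact_match_accuracy` is a hypothetical name, not part of FlashTokenizer or transformers, and the ID lists in the example are made-up values rather than real tokenizer output.

```python
def exact_match_accuracy(reference, candidate):
    """Return the percentage of texts whose token-ID sequences match exactly.

    `reference` and `candidate` are parallel lists, one ID sequence per text.
    Illustrative helper; not provided by either tokenizer library.
    """
    if len(reference) != len(candidate):
        raise ValueError("both lists must contain one entry per text")
    if not reference:
        return 0.0
    # A text counts as correct only if every ID matches, in order.
    matches = sum(int(list(r) == list(c)) for r, c in zip(reference, candidate))
    return matches * 100 / len(reference)


# Made-up ID sequences standing in for real tokenizer output.
hf_output = [[101, 7, 8, 102], [101, 9, 102]]
flash_output = [[101, 7, 8, 102], [101, 10, 102]]

print(f"Accuracy: {exact_match_accuracy(hf_output, flash_output):.2f}%")  # Accuracy: 50.00%
```

Collecting the IDs from both tokenizers into two lists first, then scoring once, keeps the comparison logic separate from the tokenization calls.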
Support
FlashTokenizer is actively maintained and optimized. For issues, feature requests, or contributions, visit the GitHub repository at https://github.com/NLPOptimize/flash-tokenizer.
Enjoy fast and efficient tokenization with FlashTokenizer!