text2tok
Text tokenizers optimized for sparse retrieval.
Installation
python -m pip install text2tok
# (optional) enabling ICU-based tokenizers
apt install pkg-config libicu-dev
python -m pip install --no-binary=:pyicu: pyicu
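Since the ICU dependency is optional, it can help to check whether PyICU is importable before relying on the ICU-based tokenizer. The following is a minimal sketch, not part of text2tok: the helper name `icu_available` is hypothetical, and how text2tok itself behaves when PyICU is missing is not documented here. The only fact it relies on is that PyICU installs under the import name `icu`.

```python
import importlib.util


def icu_available() -> bool:
    """Return True if the PyICU bindings (import name `icu`) can be found."""
    # find_spec returns None for a missing top-level module instead of raising.
    return importlib.util.find_spec("icu") is not None


if icu_available():
    print("PyICU found: ICU-based tokenizers should be usable.")
else:
    print("PyICU not found: install libicu-dev and pyicu to enable ICU-based tokenizers.")
```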
Usage
from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

# Sample inputs mixing Chinese, English, contractions, quotes, and numbers
text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

# Model names and the cache directory passed to the model-based tokenizers
cache_dir = "/root/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

# Each entry is a (label, callable) pair; every callable maps a string to a token list
tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

# Print each text followed by the output of every tokenizer
for text in text_list:
    print(f"{text}")
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()
Result:
去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']
I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']
I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
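In the results above, the REG tokenizer appears to emit overlapping character bigrams for runs of Chinese characters (for example '去過', '過中', '中國'), while a lone character such as '院' is kept as a single token. The sketch below reproduces only that CJK part of the output to show why it suits sparse retrieval: a short query shares exact terms with a longer document. It is an illustration inferred from the printed results, not text2tok's actual implementation; the names `CJK_RUN`, `cjk_bigrams`, and `cjk_terms` are hypothetical, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Contiguous runs of CJK Unified Ideographs (basic block only; an assumption of this sketch).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(run: str) -> list[str]:
    """Return overlapping character bigrams; a single character is kept as-is."""
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_terms(text: str) -> list[str]:
    """Collect bigram terms from every CJK run in the text, ignoring everything else."""
    terms = []
    for match in CJK_RUN.finditer(text):
        terms.extend(cjk_bigrams(match.group()))
    return terms


doc_terms = cjk_terms("去過中國science院,覺得it's pretty good。")
query_terms = cjk_terms("中國")

# Terms shared by query and document are what a sparse retriever can match on.
shared = set(doc_terms) & set(query_terms)
print(doc_terms)
print(shared)
```

Because the bigrams overlap, the query "中國" matches the document even though no word segmentation was performed; the ICU output above reaches the same match through dictionary-based segmentation instead.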
Download files
Source Distribution
text2tok-1.1.1.tar.gz (44.0 kB)
Built Distribution
text2tok-1.1.1-py3-none-any.whl (31.6 kB)
File details
Details for the file text2tok-1.1.1.tar.gz.
File metadata
- Download URL: text2tok-1.1.1.tar.gz
- Upload date:
- Size: 44.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b1c2c0a10e85e469301bee01bf69c12c57af8ae17a38c4026b3e1b1599346097 |
| MD5 | df9e0223e23a8ac62d0175f43a4f75b2 |
| BLAKE2b-256 | 5373748c9e1316851d07a263b3283b6446dbbeb6f2ddd015cf3eac9f8b80765f |
File details
Details for the file text2tok-1.1.1-py3-none-any.whl.
File metadata
- Download URL: text2tok-1.1.1-py3-none-any.whl
- Upload date:
- Size: 31.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7467f57e545be68a92e79e4b6295950522cccf0a2d4089adab9513383d9c0043 |
| MD5 | c6ad87dae9edf134aeb8ec499ad46b56 |
| BLAKE2b-256 | 0b127c53f523185b14b514e702179a96a0c3d5abb92fd76cd248ec4c96a8260b |