text2tok
Text tokenizers optimized for sparse retrieval.
Installation
python -m pip install text2tok
# (optional) enabling ICU-based tokenizers
apt install pkg-config libicu-dev
python -m pip install --no-binary=:pyicu: pyicu
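Because the ICU-based tokenizers depend on the optional PyICU build above, it can help to check whether PyICU is importable before relying on them. The following is a minimal sketch, assuming only that PyICU installs under the import name `icu`. The helper name `icu_available` is hypothetical and is not part of text2tok, and how text2tok itself behaves when PyICU is missing is not described here.

```python
import importlib.util


def icu_available() -> bool:
    """Return True if the PyICU module (import name `icu`) can be located."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec("icu") is not None


if icu_available():
    print("PyICU found: ICU-based tokenizers should be usable.")
else:
    print("PyICU not found: install it as shown above to use ICU-based tokenizers.")
```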
Usage
from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

cache_dir = "/volume/medical-llm/cache/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

for text in text_list:
    print(f"{text}")
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()
Result:
去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']
I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']
I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
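To illustrate how token lists like these could feed a sparse retriever, here is a minimal sketch of an inverted index scored with Okapi BM25. BM25 is chosen here only as a common sparse-retrieval scoring function; the project description does not say which retriever text2tok targets. The stand-in tokenizer `standin_tokenize` is hypothetical: it imitates the overlapping CJK bigrams visible in the [REG] output above, but it is not text2tok's implementation and does no stopword removal or number handling. Any callable that maps a string to a list of tokens, such as `reg_tokenize`, could presumably be passed in its place. The names `build_index` and `bm25_search` and the parameters `k1` and `b` are likewise assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]

# Runs of common CJK ideographs, or runs of ASCII letters/digits.
_CHUNK = re.compile(r"[\u4e00-\u9fff]+|[a-z0-9]+")


def standin_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer, NOT text2tok's implementation.

    CJK runs become overlapping character bigrams (a single character is
    kept as-is); ASCII runs are kept whole after lowercasing.
    """
    tokens: List[str] = []
    for chunk in _CHUNK.findall(text.lower()):
        if "\u4e00" <= chunk[0] <= "\u9fff" and len(chunk) > 1:
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
        else:
            tokens.append(chunk)
    return tokens


def build_index(
    docs: List[str], tokenize: Tokenizer
) -> Tuple[Dict[str, Dict[int, int]], List[int]]:
    """Return (postings, doc_lengths); postings maps term -> {doc_id: tf}."""
    postings: Dict[str, Dict[int, int]] = defaultdict(dict)
    doc_lengths: List[int] = []
    for doc_id, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        doc_lengths.append(sum(counts.values()))
        for term, tf in counts.items():
            postings[term][doc_id] = tf
    return postings, doc_lengths


def bm25_search(
    query: str,
    postings: Dict[str, Dict[int, int]],
    doc_lengths: List[int],
    tokenize: Tokenizer,
    k1: float = 1.5,
    b: float = 0.75,
) -> List[Tuple[int, float]]:
    """Rank documents for a query with BM25; returns (doc_id, score) pairs."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / n_docs if n_docs else 0.0
    scores: Dict[int, float] = defaultdict(float)
    for term in set(tokenize(query)):
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue
        df = len(docs_with_term)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in docs_with_term.items():
            norm = k1 * (1.0 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1.0) / (tf + norm)
    # Highest score first; ties broken by doc_id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = ["去過中國", "中國的天氣很好", "science is pretty good"]
postings, doc_lengths = build_index(docs, standin_tokenize)
results = bm25_search("中國", postings, doc_lengths, standin_tokenize)
for doc_id, score in results:
    print(f"{score:.3f}  {docs[doc_id]}")
```

The same tokenizer is applied to documents and queries so that both sides produce matching terms. With overlapping bigrams, a query such as 中國 matches any document containing that character pair without requiring a word segmenter.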