text2tok
Text tokenizers optimized for sparse retrieval.
Installation
apt install pkg-config libicu-dev
python -m pip install text2tok
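After installing, a quick check can confirm that the package is importable. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical and not part of text2tok, and the check does not verify that the ICU system libraries work at runtime.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# "text2tok" is the import name used in the usage example of this README.
print(is_installed("text2tok"))
```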
Usage
from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

# Sample inputs mixing Chinese, English, quotes, and numbers.
text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

# Directory passed as cache_dir to the model-based tokenizers.
cache_dir = "/root/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

# Each entry pairs a short label with a callable that maps text to tokens.
tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

# Print every tokenizer's output for every input text.
for text in text_list:
    print(f"{text}")
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()
Result:
去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']
I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']
I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
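In the [REG] output above, each run of Chinese characters appears as overlapping bigrams, and a lone character stays a single token. The sketch below illustrates that general technique and how such tokens can be matched as sparse terms. It is not text2tok's actual implementation: the names `CJK_RUN`, `cjk_bigrams`, and `overlap` are hypothetical, only the basic CJK Unified Ideographs block is covered, and non-Chinese text is ignored for brevity.

```python
import re
from collections import Counter

# Assumption: the basic CJK Unified Ideographs block is enough for this demo.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Split each run of Chinese characters into overlapping bigrams."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)  # a lone character stays as a unigram
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def overlap(query, doc):
    """Count shared terms between two texts (minimum of per-term counts)."""
    shared = Counter(cjk_bigrams(query)) & Counter(cjk_bigrams(doc))
    return sum(shared.values())


doc = "去過中國science院,覺得it's pretty good。"
print(cjk_bigrams(doc))     # ['去過', '過中', '中國', '院', '覺得']
print(overlap("中國", doc))  # 1
```

Because every adjacent pair of characters becomes a term, a two-character query such as "中國" matches without a word segmentation dictionary. A real retrieval system would likely weight terms with BM25 or a similar scheme rather than raw overlap.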