text2tok
Text tokenizers optimized for sparse retrieval.
Installation
apt install pkg-config libicu-dev
python -m pip install --no-binary=:pyicu: text2tok
Usage
from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

# Sample inputs mixing Chinese, English, quotes, and numbers.
text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

# Directory where the Hugging Face tokenizer files are cached.
cache_dir = "/root/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

# Each entry is (label, callable); every callable maps a string to a list of tokens.
tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

# Print every text followed by the tokens each tokenizer produces for it.
for text in text_list:
    print(text)
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()
Result:
去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']
I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']
I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
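In the REG rows above, runs of Chinese characters appear as overlapping two-character tokens ('去過', '過中', '中國'), while an isolated character such as '院' stays as a single token. Overlapping bigrams are a common way to index CJK text for sparse retrieval without a dictionary. The sketch below illustrates that idea only. It is not text2tok's implementation: `cjk_bigram_tokenize` is a hypothetical helper, it covers only the basic CJK Unified Ideographs block, and it omits behavior visible in the REG output such as dropping common English words and keeping numbers like 59,000 whole.

```python
import re

# A run of CJK ideographs (basic block only) or a run of ASCII letters/digits.
# Assumption for illustration: everything else is treated as a separator.
_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def cjk_bigram_tokenize(text):
    """Hypothetical helper: overlapping bigrams for CJK runs, lowercased words otherwise."""
    tokens = []
    for run in _RUN.findall(text):
        if "\u4e00" <= run[0] <= "\u9fff":
            if len(run) == 1:
                # A lone ideograph cannot form a bigram, so keep it as is.
                tokens.append(run)
            else:
                # Slide a window of width 2 across the run.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens


print(cjk_bigram_tokenize("去過中國science院"))
# ['去過', '過中', '中國', 'science', '院']
```

Because every adjacent character pair becomes a term, a query for 中國 matches the document without either side needing word segmentation, at the cost of extra terms such as '過中' that cross word boundaries.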