text2tok

Text tokenizers optimized for sparse retrieval.

Installation

python -m pip install text2tok

# (Optional) Enable the ICU-based tokenizer
apt install pkg-config libicu-dev
python -m pip install --no-binary=pyicu pyicu
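
Because the ICU dependency is optional, it can help to check whether PyICU is importable before relying on the ICU-based tokenizer. The following is a minimal sketch using only the standard library; the helper name `icu_available` is hypothetical, and how text2tok itself behaves when PyICU is missing is not stated in this document.

```python
import importlib.util


def icu_available() -> bool:
    """Return True if the PyICU bindings (module name `icu`) can be found."""
    return importlib.util.find_spec("icu") is not None


if icu_available():
    print("PyICU found: ICU-based tokenization can be used.")
else:
    print("PyICU not found: install it as shown above, or use another tokenizer.")
```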

Usage

from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

cache_dir = "/root/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

# Run every tokenizer on every sample and print the tokens side by side.
for text in text_list:
    print(text)
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()

Result:

去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']

I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']

I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']

最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
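
In the [REG] rows above, runs of Chinese characters appear as overlapping two-character tokens (for example '去過', '過中', '中國'), while a lone character such as '院' is kept as-is. The sketch below illustrates that general technique, overlapping character bigrams. It is not text2tok's implementation: the function names are hypothetical, and the character range is limited to the basic CJK Unified Ideographs block as a simplifying assumption.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as CJK.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(run: str) -> list[str]:
    """Return overlapping character bigrams; a single character is kept as-is."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def bigram_tokens(text: str) -> list[str]:
    """Collect bigrams from every CJK run in the text, ignoring everything else."""
    tokens = []
    for match in CJK_RUN.finditer(text):
        tokens.extend(cjk_bigrams(match.group()))
    return tokens


print(bigram_tokens("去過中國science院,覺得it's pretty good。"))
# ['去過', '過中', '中國', '院', '覺得']
```

Overlapping bigrams avoid the need for a word segmenter: a query for 中國 still matches the document even though no dictionary decided where the word boundaries fall.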

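Each tokenizer returns a plain list of strings, so its output can feed a sparse index directly. The sketch below builds a toy inverted index from three of the [REG] token lists shown above and scores documents by summed term frequency. The document ids, `build_index`, and `search` are hypothetical names for illustration only; a real retrieval system would typically use a weighting scheme such as BM25.

```python
from collections import Counter, defaultdict

# Token lists copied from the [REG] rows above, keyed by a hypothetical doc id.
docs = {
    0: ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good'],
    1: ['having', 'state', 'art', 'whopper', 'mendy', 'james'],
    2: ['admire', 'beautiful', 'dog'],
}


def build_index(docs):
    """Map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token, tf in Counter(tokens).items():
            index[token][doc_id] = tf
    return index


def search(index, query_tokens):
    """Score documents by the summed term frequency of matching query tokens."""
    scores = Counter()
    for token in query_tokens:
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    return scores.most_common()


index = build_index(docs)
print(search(index, ['中國', 'science']))  # [(0, 2)]
print(search(index, ['dog']))              # [(2, 1)]
```

Queries should be tokenized with the same tokenizer as the documents; otherwise the query tokens will not line up with the index keys.
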
Download files

Source Distribution

text2tok-1.1.2.tar.gz (44.1 kB)

Uploaded Source

Built Distribution

text2tok-1.1.2-py3-none-any.whl (31.6 kB)

Uploaded Python 3

File details

Details for the file text2tok-1.1.2.tar.gz.

File metadata

  • Download URL: text2tok-1.1.2.tar.gz
  • Upload date:
  • Size: 44.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.17

File hashes

Hashes for text2tok-1.1.2.tar.gz
Algorithm    Hash digest
SHA256       983212c0694ce0b416890dd7262956962d89cf7c7ffa1f8c0de876a683f1fb55
MD5          b8bb770e31efbc53a2bcffc877b5f47f
BLAKE2b-256  7f036dc822aca99fdf73682122670bdc581c74bb2e8ad3502e28655313e93343

File details

Details for the file text2tok-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: text2tok-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 31.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.17

File hashes

Hashes for text2tok-1.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       89da29cb7f0b26d8b5800b33e6cc64b2cc7f7e9062c11c9307dff030411bc9f9
MD5          ccd91e089bb49f6a1b1569097ae168ff
BLAKE2b-256  33d85604d6b7e4fb284d63e38a1bd0ef928540a8c45451c2ac43a2581570a920
