
Text tokenizers optimized for sparse retrieval.


text2tok


Installation

python -m pip install text2tok

# Optional: enable the ICU-based tokenizer (Debian/Ubuntu; builds PyICU from source)
apt install pkg-config libicu-dev
python -m pip install --no-binary=:pyicu: pyicu
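
After the optional step, a quick check can confirm whether PyICU is importable in the current environment. This is a minimal sketch: it relies only on the fact that PyICU installs a top-level module named `icu`; the `HAS_ICU` name is made up for illustration, and how text2tok itself behaves when PyICU is missing is not covered here.

```python
import importlib.util

# PyICU installs a top-level module named `icu`.
# find_spec returns None when the module cannot be found.
HAS_ICU = importlib.util.find_spec("icu") is not None

print("ICU available:", HAS_ICU)
```

If this prints `False`, the optional installation step above has probably not completed in this environment.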

Usage

from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

# Cache directory for downloaded tokenizer files (the author's path; replace with your own).
cache_dir = "/volume/medical-llm/cache/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

for text in text_list:
    print(text)
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()

Result:

去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']

I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']

I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']

最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
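
Token lists like the ones above are typically fed into a sparse index. The sketch below shows one way that could look: a tiny inverted index scored by summed term frequency. Everything in it is an assumption for illustration. `toy_tokenize` is a hypothetical stand-in that only loosely imitates the REG rows (lowercased words, overlapping bigrams over runs of CJK characters, no stopword removal) so the example runs without text2tok; `build_index` and `search` are not part of the package. Any callable that maps a string to a list of tokens, such as `reg_tokenize`, could presumably be passed instead.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in tokenizer, NOT the text2tok implementation.
# Lowercased Latin/digit words are kept whole; each run of CJK characters
# is split into overlapping bigrams (a single character stays as-is).
_PIECE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]+")


def toy_tokenize(text):
    tokens = []
    for piece in _PIECE.findall(text.lower()):
        if "\u4e00" <= piece[0] <= "\u9fff" and len(piece) > 1:
            tokens.extend(piece[i:i + 2] for i in range(len(piece) - 1))
        else:
            tokens.append(piece)
    return tokens


def build_index(docs, tokenize):
    """Map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for token, tf in Counter(tokenize(text)).items():
            index[token][doc_id] = tf
    return index


def search(query, index, tokenize):
    """Rank documents by the summed term frequency of the query tokens."""
    scores = Counter()
    for token in set(tokenize(query)):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    return scores.most_common()


docs = ["去過中國science院", "pretty good dog", "最多容納五萬人"]
index = build_index(docs, toy_tokenize)
results = search("中國 science", index, toy_tokenize)
print(results)
```

The same tokenizer must be used for indexing and for queries; otherwise a query token such as `中國` would never match the tokens stored in the index.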

