text2tok

Text tokenizers optimized for sparse retrieval.

Installation

python -m pip install text2tok

# (optional) enable ICU-based tokenizers (Debian/Ubuntu packages shown)
apt install pkg-config libicu-dev
python -m pip install --no-binary=:pyicu: pyicu
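
Because the ICU-based tokenizers are optional, it can help to confirm that PyICU built correctly before relying on icu_tokenize. The following is a minimal sketch that assumes only that PyICU is importable under its usual module name `icu`. The helper name `has_pyicu` is hypothetical, and how text2tok itself behaves when PyICU is missing is not covered here.

```python
import importlib.util


def has_pyicu() -> bool:
    """Return True if the PyICU extension module (`icu`) can be located."""
    # find_spec returns None when no top-level module of that name is installed.
    return importlib.util.find_spec("icu") is not None


if __name__ == "__main__":
    print("PyICU available:", has_pyicu())
```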

Usage

from text2tok import reg_tokenize, icu_tokenize, BPETokenizer, BERTTokenizer

text_list = [
    "去過中國science院,覺得it's pretty good。",
    "I'm having a state-of-the-art \"whopper\" at Mendy's and James'.",
    "I can’t ‘admire’ such a 'beautiful' dog.",
    "最多容納59,000個人,或5.9萬人,坪數對人數為1:3.",
]

# Directory used to cache downloaded tokenizer files; change it to a writable path.
cache_dir = "/root/hf_cache"
bpe_model = "Qwen/Qwen3-8B"
bert_model = "google-bert/bert-base-multilingual-cased"

tokenizer_list = [
    ("REG", reg_tokenize),
    ("ICU", icu_tokenize),
    ("BPE", BPETokenizer(bpe_model, cache_dir=cache_dir)),
    ("BRT", BERTTokenizer(bert_model, cache_dir=cache_dir)),
]

# Run every tokenizer on every sample text and print the tokens side by side.
for text in text_list:
    print(text)
    for name, tokenize in tokenizer_list:
        token_list = tokenize(text)
        print(f"[{name}] {token_list}")
    print()

Result:

去過中國science院,覺得it's pretty good。
[REG] ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good']
[ICU] ['去', '過', '中國', 'science', '院', '覺得', 'pretty', 'good']
[BPE] ['去', '過', '中國', 'science', '院', ',', '覺得', 'it', "'s", 'pretty', 'good', '。']
[BRT] ['去', '過', '中', '國', 'science', '院', ',', '覺', '得', 'it', "'", 's', 'pretty', 'good', '。']

I'm having a state-of-the-art "whopper" at Mendy's and James'.
[REG] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[ICU] ['having', 'state', 'art', 'whopper', 'mendy', 'james']
[BPE] ['I', "'m", 'having', 'a', 'state', '-of', '-the', '-art', '"', 'whopper', '"', 'at', 'Mendy', "'s", 'and', 'James', "'."]
[BRT] ['I', "'", 'm', 'having', 'a', 'state', '-', 'of', '-', 'the', '-', 'art', '"', 'whopper', '"', 'at', 'Mendy', "'", 's', 'and', 'James', "'", '.']

I can’t ‘admire’ such a 'beautiful' dog.
[REG] ['admire', 'beautiful', 'dog']
[ICU] ['admire', 'beautiful', 'dog']
[BPE] ['I', 'can', '’t', '‘', 'admire', '’', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']
[BRT] ['I', 'can', '[UNK]', 't', '[UNK]', 'admire', '[UNK]', 'such', 'a', "'", 'beautiful', "'", 'dog', '.']

最多容納59,000個人,或5.9萬人,坪數對人數為1:3.
[REG] ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人', '坪數', '數對', '對人', '人數', '數為', '1', '3']
[ICU] ['最多', '容納', '59', '000', '個人', '或', '5', '9', '萬人', '坪', '數', '對', '人數', '為', '1', '3']
[BPE] ['最多', '容', '納', '5', '9', ',', '0', '0', '0', '個人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
[BRT] ['最', '多', '容', '納', '59', ',', '000', '個', '人', ',', '或', '5', '.', '9', '萬', '人', ',', '坪', '數', '對', '人', '數', '為', '1', ':', '3', '.']
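
Token lists like these are what a sparse retriever indexes. As an illustration only (it is not part of text2tok), the sketch below builds a tiny inverted index from the [REG] outputs above and ranks documents by how many query tokens they share. The names `build_index` and `search` are hypothetical, and a real system would likely use a weighting scheme such as BM25 rather than raw overlap.

```python
from collections import Counter, defaultdict

# [REG] token lists copied from the results above, keyed by document id.
docs = {
    0: ['去過', '過中', '中國', 'science', '院', '覺得', 'pretty', 'good'],
    1: ['having', 'state', 'art', 'whopper', 'mendy', 'james'],
    2: ['admire', 'beautiful', 'dog'],
    3: ['最多', '多容', '容納', '59,000', '個人', '或', '5.9', '萬人',
        '坪數', '數對', '對人', '人數', '數為', '1', '3'],
}


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, query_tokens):
    """Rank documents by how many distinct query tokens they contain."""
    scores = Counter()
    for token in set(query_tokens):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return scores.most_common()


index = build_index(docs)
# Query tokens are written by hand here; in practice they would come from
# the same tokenizer that produced the document tokens.
print(search(index, ['中國', 'science']))              # [(0, 2)]
print(search(index, ['beautiful', 'dog', 'whopper']))  # [(2, 2), (1, 1)]
```

Queries and documents should go through the same tokenizer: in the results above, [REG] keeps '59,000' as one token while [ICU] splits it into '59' and '000', so tokens from one would not match an index built with the other.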
