UnicodeTokenizer: tokenize all Unicode text
Tokenize Rules
- break at line boundaries
- break at punctuation
- break where the Unicode script changes
- split with the pattern `" ?[^(\s|[.,!?…。,、।۔،])]+"`
- break words
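The third rule above (splitting where the Unicode script changes) can be sketched roughly as follows. This is an illustration only, not the library's implementation; the `simple_script` helper, which uses the first word of a character's Unicode name as a crude script proxy, is a hypothetical simplification:

```python
import unicodedata

def simple_script(ch):
    """Very rough script bucket for one character (illustration only):
    the first word of the Unicode character name serves as a script proxy,
    e.g. 'a' -> 'LATIN', '中' -> 'CJK'."""
    if ch.isspace():
        return "SPACE"
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def split_on_script_change(text):
    """Break text into tokens wherever the (approximate) script changes,
    dropping whitespace."""
    tokens, cur, cur_script = [], "", None
    for ch in text:
        s = simple_script(ch)
        if cur and s != cur_script:
            tokens.append(cur)
            cur = ""
        if s != "SPACE":
            cur += ch
        cur_script = s
    if cur:
        tokens.append(cur)
    return tokens

print(split_on_script_change("abc中文def"))  # → ['abc', '中文', 'def']
```

The real tokenizer combines this with the line, punctuation, and word-break rules above; this sketch shows only why mixed-script runs like "수속해야pneumono…" get separated.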
Usage

```shell
pip install UnicodeTokenizer
```
```python
from UnicodeTokenizer import UnicodeTokenizer

tokenizer = UnicodeTokenizer()
line = """
首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终 no such name"
的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國).
""".strip()

print(tokenizer.tokenize(line))
print(tokenizer.split_lines(line))
```
Or install from source:

```shell
git clone https://github.com/laohur/UnicodeTokenizer
cd UnicodeTokenizer  # modify as needed
pip install -e .
```
Reference
- PyICU https://gitlab.pyicu.org/main/pyicu
- tokenizers https://github.com/huggingface/tokenizers
- ICU-tokenizer https://github.com/mingruimingrui/ICU-tokenizer/tree/master
License
Latest release: UnicodeTokenizer 0.2.2