UnicodeTokenizer: tokenize all Unicode text
Tokenize Rules

- Break at line breaks
- Break at punctuation
- Break at Unicode script boundaries
- Split remaining runs with the pattern `" ?[^(\s|[.,!?…。,、।۔،])]+"`
- Break into words
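The rule cascade above can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: `rough_tokenize` is a hypothetical helper that breaks on line breaks, whitespace, and punctuation, but it does not detect Unicode script boundaries (the real tokenizer uses scripts, which the standard library does not expose directly).

```python
import unicodedata

# Punctuation marks from the split pattern above.
PUNCT = set(".,!?…。,、।۔،")

def rough_tokenize(text):
    """Approximate the rule cascade: line breaks, whitespace, punctuation."""
    tokens = []
    for line in text.splitlines():  # rule 1: break at line breaks
        word = ""
        for ch in line:
            if ch.isspace():
                if word:
                    tokens.append(word)
                    word = ""
            elif ch in PUNCT or unicodedata.category(ch).startswith("P"):
                # rule 2: punctuation becomes its own token
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens

print(rough_tokenize("Hello, 世界! açaí"))
# → ['Hello', ',', '世界', '!', 'açaí']
```

Note that without script detection, mixed-script runs such as "Hello世界" would stay as one token here, whereas the library would split them.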
Usage

```shell
pip install UnicodeTokenizer
```

```python
from UnicodeTokenizer import UnicodeTokenizer

tokenizer = UnicodeTokenizer()
doc0 = """
首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终 no such name"
的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國).
"""
print(tokenizer.tokenize(doc0))
print(tokenizer.split_lines(doc0))
```
References

- PyICU: https://gitlab.pyicu.org/main/pyicu
- tokenizers: https://github.com/huggingface/tokenizers
- ICU-tokenizer: https://github.com/mingruimingrui/ICU-tokenizer/tree/master