UnicodeTokenizer: tokenize all Unicode text
Tokenization rules
- Split on line breaks
- Split on punctuation
- Split on Unicode script boundaries
- Split(" ?[^(\s|[.,!?…。,、।۔،])]+")
- Break words
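The rules above can be illustrated with a rough, standard-library-only sketch. This is not the library's implementation: the helper names `char_class` and `rough_tokenize` are hypothetical, and the first word of each character's Unicode name is used only as a stand-in for its script, because Python's `unicodedata` module does not expose the script property.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Return a coarse class for one character (an approximation only)."""
    if ch.isspace():
        return "SPACE"
    category = unicodedata.category(ch)
    if category.startswith(("P", "S")):
        return "PUNCT"  # punctuation and symbols become single-character tokens
    if category.startswith("M"):
        return "MARK"  # combining marks stay attached to the preceding character
    if category.startswith("N"):
        return "DIGIT"
    # Assumption: the first word of the Unicode name approximates the script,
    # e.g. LATIN, CJK, HANGUL, THAI.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def rough_tokenize(text: str) -> list[str]:
    """Split on line breaks, whitespace, punctuation, and class changes."""
    tokens = []
    for line in text.splitlines():
        current, current_class = "", None
        for ch in line:
            cls = char_class(ch)
            if cls == "SPACE":
                if current:
                    tokens.append(current)
                current, current_class = "", None
            elif cls == "PUNCT":
                if current:
                    tokens.append(current)
                tokens.append(ch)
                current, current_class = "", None
            elif cls == "MARK" and current:
                current += ch
            elif cls == current_class:
                current += ch
            else:
                # The class changed (for example LATIN to CJK): start a new token.
                if current:
                    tokens.append(current)
                current, current_class = ch, cls
        if current:
            tokens.append(current)
    return tokens


print(rough_tokenize("Hello, 世界 abc123"))
# ['Hello', ',', '世界', 'abc', '123']
```

The real tokenizer may draw boundaries differently, in particular for the Split pattern and the word-breaking rule, which this sketch does not attempt to reproduce.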
Usage
pip install UnicodeTokenizer
from UnicodeTokenizer import UnicodeTokenizer
tokenizer = UnicodeTokenizer()
line = """
首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终 no such name"
的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國).
""".strip()
print(tokenizer.tokenize(line))
print(tokenizer.split_lines(line))
Or install from source in editable mode:
git clone https://github.com/laohur/UnicodeTokenizer
cd UnicodeTokenizer  # edit the source as needed
pip install -e .
References
- PyICU https://gitlab.pyicu.org/main/pyicu
- tokenizers https://github.com/huggingface/tokenizers
- ICU-tokenizer https://github.com/mingruimingrui/ICU-tokenizer/tree/master