
UnicodeTokenizer: tokenize all Unicode text


Tokenization Rules

  • Line breaks
  • Punctuation
  • Unicode scripts
  • Split(" ?[^(\s|[.,!?…。,、।۔،])]+")
  • Word breaks
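To make the rules above concrete, here is a minimal sketch of rule-based splitting using only the Python standard library. It is not UnicodeTokenizer's actual implementation: the helper names `char_class` and `sketch_tokenize` are hypothetical, and the script of a character is only approximated by the first word of its Unicode name, which is an assumption made for illustration.

```python
import unicodedata


def char_class(ch):
    """Return a rough class for a character (hypothetical helper)."""
    if ch.isspace():
        return "space"
    cat = unicodedata.category(ch)
    if cat.startswith("P") or cat.startswith("S"):
        return "punct"  # punctuation and symbols become single tokens
    if cat.startswith("M"):
        return "mark"   # combining marks stay attached to the current token
    if cat.startswith("N"):
        return "digit"
    # Assumption: the first word of the Unicode name ("LATIN", "CJK",
    # "HANGUL", "THAI", ...) is a good enough stand-in for the script.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "unknown"


def sketch_tokenize(text):
    """Split on line breaks, whitespace, punctuation, and script changes."""
    tokens = []
    for line in text.splitlines():          # rule: line breaks
        current, current_class = "", None
        for ch in line:
            cls = char_class(ch)
            if cls == "space":
                if current:
                    tokens.append(current)
                current, current_class = "", None
            elif cls == "punct":             # rule: punctuation
                if current:
                    tokens.append(current)
                tokens.append(ch)
                current, current_class = "", None
            elif cls == "mark" and current:
                current += ch
            elif cls == current_class:
                current += ch
            else:                            # rule: script change
                if current:
                    tokens.append(current)
                current, current_class = ch, cls
        if current:
            tokens.append(current)
    return tokens


print(sketch_tokenize("hello世界 8.88!"))
# → ['hello', '世界', '8', '.', '88', '!']
```

The sketch leaves out the final word-breaking rule and the regular-expression split, so its output may differ from what the package produces on the same input.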

Usage

pip install UnicodeTokenizer

from UnicodeTokenizer import UnicodeTokenizer

# Create a tokenizer with the default settings.
tokenizer = UnicodeTokenizer()

line = """ 
        首先8.88设置 st。art_new_word=True 和 output=[açaí],output 就是最终‘ no such name"
        的输出คุณจะจัดพิธีแต่งงานเมื่อไรคะ탑승 수속해야pneumonoultramicroscopicsilicovolcanoconiosis"
        하는데 카운터가 어디에 있어요ꆃꎭꆈꌠꊨꏦꏲꅉꆅꉚꅉꋍꂷꂶꌠلأحياء تمارين تتطلب من [MASK] [PAD] [CLS][SEP]
        est 𗴂𗹭𘜶𗴲𗂧, ou "phiow-bjij-lhjij-lhjij", ce que l'on peut traduire par « pays-grand-blanc-élevé » (白高大夏國). 
    """.strip()
print(tokenizer.tokenize(line))
print(tokenizer.split_lines(line))

Or install from source:

git clone https://github.com/laohur/UnicodeTokenizer
cd UnicodeTokenizer  # modify the source as needed
pip install -e .

License

Anti-996 License
