
ZiTokenizer: tokenize world text as Zi


ZiTokenizer

Tokenize text in any language into Zi.

Supports 300+ languages from Wikipedia, plus a combined "global" vocab.

Usage

  • pip install ZiTokenizer

from ZiTokenizer.ZiTokenizer import ZiTokenizer

# encode a line to token ids, decode the ids to tokens, then rebuild the line
tokenizer = ZiTokenizer()
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏20222019\U0010ffff"
indices = tokenizer.encode(line)             # text -> token ids
tokens = tokenizer.decode(indices)           # token ids -> tokens
line2 = tokenizer.tokens2line(tokens)        # tokens -> text

# build
See demo/unit.py.

UnicodeTokenizer

Basic tokenizer.
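What a basic Unicode-aware tokenizer does can be sketched with the standard library alone. This is an illustrative sketch, not UnicodeTokenizer's actual implementation: the function name `basic_tokenize` and its rules are assumptions. It splits on whitespace and control characters, and isolates punctuation, symbols, and CJK ideographs.

```python
import unicodedata


def basic_tokenize(text):
    """Hypothetical basic tokenizer; not the UnicodeTokenizer API."""
    tokens, current = [], []

    def flush():
        # Emit the word collected so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace() or category.startswith("C"):
            # Whitespace and control/unassigned characters only separate tokens.
            flush()
        elif category[0] in "PS" or "CJK UNIFIED" in unicodedata.name(ch, ""):
            # Punctuation, symbols, and CJK ideographs become single-character tokens.
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens


print(basic_tokenize("pays-grand (白高) ok"))
```

Runs of letters and digits stay together, so a later stage can split them further into prefixes, roots, and suffixes.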

ZiCutter

Decomposes a Chinese character into its components.

'瞼' -> ['⿰', '目', '僉']
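The decomposition above is a table lookup from a character to a layout operator followed by its components. This is a minimal sketch, assuming a tiny hand-written table: `IDS_TABLE` and `cut` are hypothetical names, not the ZiCutter API, and a real table would hold many thousands of entries.

```python
# Hypothetical, hand-written decomposition table for illustration only.
# "⿰" is the ideographic description character for a left-to-right layout.
IDS_TABLE = {
    "瞼": ["⿰", "目", "僉"],
    "好": ["⿰", "女", "子"],
}


def cut(char):
    """Return the components of `char`, or the character itself if unknown."""
    return list(IDS_TABLE.get(char, [char]))


print(cut("瞼"))
print(cut("a"))
```

Falling back to the character itself keeps the mapping total, so unknown or non-Chinese characters pass through unchanged.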

ZiSegmenter

word => prefix + root + suffix

'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']
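One way to produce a split like this is to find the longest known root, then greedily match prefixes to its left and suffixes to its right. The sketch below is an assumption about how such a segmenter could work, not ZiSegmenter's actual algorithm: the function names and the three small vocab sets are hypothetical.

```python
def greedy_pieces(text, vocab):
    """Split `text` left to right, taking the longest vocab match at each step."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry starts here: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


def segment(word, roots, prefixes, suffixes):
    """Hypothetical prefix + root + suffix segmentation around the longest root."""
    best = None
    for i in range(len(word)):
        for j in range(len(word), i, -1):
            if word[i:j] in roots:
                if best is None or j - i > best[1] - best[0]:
                    best = (i, j)
                break  # longest root starting at i has been found
    if best is None:
        return [word]
    start, end = best
    left = greedy_pieces(word[:start], prefixes)
    right = greedy_pieces(word[end:], suffixes)
    return [p + "--" for p in left] + [word[start:end]] + ["--" + s for s in right]


# Toy vocab sets chosen to reproduce the example above.
print(segment("modernbritishdo", {"british"}, {"mod", "er", "n"}, {"do"}))
```

The `--` markers record which side of the root a piece came from, which is what lets the pieces be joined back into the original word.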

languages

The default is the "global" vocab. Vocabs for other languages are available from https://laohur.github.io/ZiTokenizer/index.html

tokenizer = ZiTokenizer(vocab_dir)


Download files


Source Distribution

ZiTokenizer-0.0.8.tar.gz (14.8 kB)

Built Distribution

ZiTokenizer-0.0.8-py3-none-any.whl (15.2 kB, Python 3)
