ZiTokenizer: tokenize world text as Zi
Project description
ZiTokenizer
Tokenize text in any language into Zi.
Supports 300+ languages from Wikipedia, plus a combined "global" vocab.
use
- pip install ZiTokenizer
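from ZiTokenizer.ZiTokenizer import ZiTokenizer

# use
tokenizer = ZiTokenizer()  # no argument: load the default vocab
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏20222019\U0010ffff"
# encode the text to token indices, then decode the indices back to tokens
indices = tokenizer.encode(line)
tokens = tokenizer.decode(indices)
# join the tokens back into a single line of text
line2 = tokenizer.tokens2line(tokens)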
# build
See demo/unit.py.
UnicodeTokenizer
Basic tokenizer.
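The README does not say how UnicodeTokenizer splits text, so the following is only a rough sketch of what a basic Unicode-aware tokenizer might do, not the library's implementation. It assumes that whitespace separates tokens, that punctuation and symbols stand alone, and that each CJK ideograph is its own token. The names `basic_tokenize` and `is_cjk` are hypothetical.

```python
import unicodedata


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def basic_tokenize(text):
    """Split text into words; punctuation, symbols and CJK ideographs
    become single-character tokens (an assumed rule set)."""
    tokens, word = [], []

    def flush():
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace():
            flush()
        elif category[0] in "PS" or is_cjk(ch):
            flush()
            tokens.append(ch)
        else:
            word.append(ch)
    flush()
    return tokens


print(basic_tokenize("pays-grand (白高) ok"))
```

The output of a pass like this is what a later stage (character decomposition or word segmentation) would then work on.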
ZiCutter
Chinese character decomposition: splits a Hanzi into its components.
'瞼' -> ['⿰', '目', '僉']
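The example above can be read as a table lookup: each character maps to an Ideographic Description Character ('⿰' means left-to-right composition) followed by its parts. The sketch below illustrates that idea with a one-entry table taken from the README example. `IDS_TABLE`, `cut` and `cut_text` are hypothetical names, not ZiCutter's API, and the fallback of returning an unknown character unchanged is an assumption.

```python
# Hypothetical decomposition table; the single entry is the README example.
IDS_TABLE = {
    "瞼": ["⿰", "目", "僉"],
}


def cut(char, table=IDS_TABLE):
    """Return the components of char, or [char] if no decomposition is known."""
    return list(table.get(char, [char]))


def cut_text(text, table=IDS_TABLE):
    """Decompose every character of text and concatenate the results."""
    pieces = []
    for ch in text:
        pieces.extend(cut(ch, table))
    return pieces


print(cut("瞼"))
print(cut_text("瞼目"))
```

Decomposing rare characters into common components keeps the vocab small, since the components are shared across many characters.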
ZiSegmenter
word => prefix + root + suffix
'modernbritishdo' -> ['mod--', 'er--', 'n--', 'british', '--do']
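One way to get a split like the one above is to find the longest known root inside the word, then greedily match prefixes to its left and suffixes to its right. The sketch below does that with a toy vocab chosen to reproduce the README example. The function names, the vocab sets, and the greedy longest-match strategy are all assumptions, not ZiSegmenter's actual algorithm.

```python
def longest_match(text, vocab):
    """Return the longest piece of vocab that text starts with, else None."""
    for size in range(len(text), 0, -1):
        if text[:size] in vocab:
            return text[:size]
    return None


def segment(word, roots, prefixes, suffixes):
    """Split word into prefix-- pieces, one root, and --suffix pieces."""
    # 1. Find the longest root anywhere in the word.
    best = None
    for start in range(len(word)):
        for end in range(len(word), start, -1):
            if word[start:end] in roots:
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    if best is None:
        return [word]  # no known root: keep the word whole
    start, end = best
    head, tail = word[:start], word[end:]

    tokens = []
    # 2. Greedily cover the text before the root with prefixes.
    while head:
        piece = longest_match(head, prefixes) or head[0]
        tokens.append(piece + "--")
        head = head[len(piece):]
    tokens.append(word[start:end])
    # 3. Greedily cover the text after the root with suffixes.
    while tail:
        piece = longest_match(tail, suffixes) or tail[0]
        tokens.append("--" + piece)
        tail = tail[len(piece):]
    return tokens


# Toy vocab (hypothetical), picked to match the README example.
print(segment("modernbritishdo",
              roots={"british"},
              prefixes={"mod", "er", "n"},
              suffixes={"do"}))
```

The trailing and leading `--` markers record which side of the root a piece came from, so the pieces can be joined back into the original word.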
languages
The default is the "global" vocab; vocabs for other languages are available from https://laohur.github.io/ZiTokenizer/index.html. To use one, pass its directory:
tokenizer = ZiTokenizer(vocab_dir)
Download files
Source Distribution
ZiTokenizer-0.0.8.tar.gz (14.8 kB)
Built Distribution
ZiTokenizer-0.0.8-py3-none-any.whl (15.2 kB)
File details
Details for the file ZiTokenizer-0.0.8.tar.gz.
File metadata
- Download URL: ZiTokenizer-0.0.8.tar.gz
- Upload date:
- Size: 14.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 801983c24bc1c860b0c7a7337e958a93fa0867a5a16ada7176ddc19e5915e9df
MD5 | d6e44056b89fae47fdef28e70bfc2258
BLAKE2b-256 | 48f4715a0232d01d4dddbe69da973313fcd34b71be165eee14624fb69c62bb2d
File details
Details for the file ZiTokenizer-0.0.8-py3-none-any.whl.
File metadata
- Download URL: ZiTokenizer-0.0.8-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | a10b259a4681cc961caf67d209ea8c7038db59f4db9872a904acc10de9d15786
MD5 | ada3fca6fb3ac8f52e31123edc54bc7e
BLAKE2b-256 | e36d3ac57c119e6e5a1927be0680746dfdc222be528962859ac0d9108ac58331