ZiCutter: cut characters into smaller pieces
use
pip install ZiCutter
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About two-thirds of them are HanZi (Chinese characters). To shrink the vocabulary size, ZiCutter cuts each character into smaller units.
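The "about two-thirds" figure can be roughly checked against the Unicode database bundled with Python. This is only a sketch: it counts named code points whose name starts with "CJK UNIFIED IDEOGRAPH", which is an assumed approximation of "HanZi", and the totals depend on the Unicode version of the running interpreter, so they will not match the Unicode 14.0 numbers exactly.

```python
import sys
import unicodedata

named = 0   # code points that have a Unicode name in this interpreter
hanzi = 0   # of those, the CJK unified ideographs (assumed proxy for HanZi)

for cp in range(sys.maxunicode + 1):
    # Unassigned code points, surrogates and control characters have no name.
    name = unicodedata.name(chr(cp), "")
    if not name:
        continue
    named += 1
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        hanzi += 1

share = hanzi / named
print(unicodedata.unidata_version, named, hanzi, round(share, 2))
```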
vocab
minimum: az 26 + number 10 = Gram 36; YuanZi 2366; total 2402
cut rare characters by name
name = Unicode name of the character 'x'
tokens=[name[:2],"#"+name[-1]]
base: Grams: [a-z][a-z], [0-9][0-9], #[a-z], #[0-9]
'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
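The name-based rule can be sketched with the standard library alone. This follows the formula tokens=[name[:2],"#"+name[-1]] literally, with the name lowercased. The function name `cut_by_name` and the fallback for unnamed characters are assumptions for illustration, not the package's API, and the package's own output format may differ from what this sketch prints.

```python
import unicodedata


def cut_by_name(char: str) -> list[str]:
    """Cut one character into two tokens derived from its Unicode name."""
    # Characters without a name (e.g. control characters) yield "".
    name = unicodedata.name(char, "").lower()
    if not name:
        # Assumed fallback: return the character unchanged.
        return [char]
    # First two letters of the name, then "#" plus its last letter or digit.
    return [name[:2], "#" + name[-1]]


print(cut_by_name("😀"))  # name is 'GRINNING FACE' -> ['gr', '#e']
```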
cut HanZi by IDS (Ideographic Description Sequence)
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
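The IDS-based cut can be sketched as a lookup that expands a HanZi until every piece is either a base component or an IDS operator. The one-entry `ids_table`, the `base` set, and the recursive expansion are hypothetical stand-ins for illustration; the package builds and ships its own data, and its actual algorithm may differ.

```python
# Ideographic description characters U+2FF0..U+2FFB (⿰ through ⿻).
IDS_OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def cut_hanzi(char: str, ids_table: dict[str, str], base: set[str]) -> list[str]:
    """Expand a HanZi into IDS operators and base components."""
    # Stop at base components, operators, and characters with no IDS entry.
    if char in base or char in IDS_OPERATORS or char not in ids_table:
        return [char]
    ids = ids_table[char]
    # Some IDS data maps atomic characters to themselves; do not recurse.
    if ids == char:
        return [char]
    tokens = []
    for part in ids:
        tokens.extend(cut_hanzi(part, ids_table, base))
    return tokens


# Hypothetical toy data taken from the example above.
ids_table = {"熇": "⿰火高"}
base = {"火", "高"}

print(cut_hanzi("熇", ids_table, base))  # ['⿰', '火', '高']
```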
File details
Details for the file ZiCutter-0.0.10.tar.gz.
File metadata
- Download URL: ZiCutter-0.0.10.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | d4d03609d8083a7fd8a57858660089e8786f1764ca4ec024c2ec236349596c9b
MD5 | d91b7d5a5e5931f7a65abb159da7ce57
BLAKE2b-256 | ca5716025be5e484da835c1f77cce7cb88f47b6d2e6c79645d6a7d0cf359cf9e
File details
Details for the file ZiCutter-0.0.10-py3-none-any.whl.
File metadata
- Download URL: ZiCutter-0.0.10-py3-none-any.whl
- Upload date:
- Size: 1.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8ee06158216385ca5eed2601c8edb38e76b60c4b0b3a5fbeca95d785500339fc
MD5 | 4d86227600afd9921ffffe801c6eb550
BLAKE2b-256 | cfdadecdafaa11e25e349dfc2a53caa45f67192a2ed09149bc786c6338a63e80