ZiCutter: cut characters into smaller pieces
use
pip install ZiCutter
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, ZiCutter cuts each character into smaller pieces.
vocab
minimum:
az: 26
number: 10
Gram (az + number): 36
YuanZi: 2366
total (Gram + YuanZi): 2402
cut rare characters by name
name = Unicode name of the character 'x'
tokens=[name[:2],"#"+name[-1]]
base: Grams, [az][az],[09][09],#[az],#[09]
'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
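The rule `tokens=[name[:2],"#"+name[-1]]` can be sketched with the standard `unicodedata` module. This is only an illustration of the rule as literally written, not ZiCutter's implementation. The function name `name_tokens` and the lowercasing step are assumptions (the base is listed as `[az]`, so lowercase is assumed). The example output listed above has a different shape, so the library's real tokens may differ from this sketch.

```python
import unicodedata


def name_tokens(char):
    """Cut a character into two tokens derived from its Unicode name.

    Hypothetical helper: follows the stated rule
    tokens = [name[:2], "#" + name[-1]] on the lowercased name.
    """
    # unicodedata.name returns e.g. 'GRINNING FACE'; "" if the char has no name.
    name = unicodedata.name(char, "").lower()
    if not name:
        return []
    # First two characters of the name, then '#' plus its last character.
    return [name[:2], "#" + name[-1]]


print(name_tokens("😀"))
```

Under this rule, any named character maps to two short tokens, so rare symbols never need their own vocabulary entry.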
cut HanZi by IDS (Ideographic Description Sequence)
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
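The IDS cut can be pictured as a table lookup followed by splitting the sequence into its symbols. The sketch below assumes a hypothetical one-entry table `IDS_TABLE` and a hypothetical helper `cut_han`. ZiCutter builds and loads its own data, which this does not reproduce.

```python
# Hypothetical IDS table with a single entry taken from the example above.
IDS_TABLE = {"熇": "⿰火高"}


def cut_han(char, table=IDS_TABLE):
    """Split a HanZi into its IDS symbols (layout operator plus components)."""
    ids = table.get(char)
    if ids is None:
        # No decomposition known: keep the character as a single token.
        return [char]
    # Each code point of the IDS string becomes one token.
    return list(ids)


print(cut_han("熇"))  # ['⿰', '火', '高']
```

Characters that are themselves base components (here `火`) fall through unchanged, which is how a small YuanZi base can cover a large set of HanZi.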