ZiCutter: cut character smaller
use
pip install ZiCutter
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About two thirds of them are HanZi. To shrink the vocabulary size, we cut each character into smaller pieces.
vocab
minimum vocab:
a-z: 26
digits: 10
Grams (a-z + digits): 36
YuanZi: 2366
total (Grams + YuanZi): 2402
cut rare characters by name
name = Unicode name of 'x'
tokens=[name[:2],"#"+name[-1]]
base: Grams: [a-z][a-z], [0-9][0-9], #[a-z], #[0-9]
'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
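The name-based cut can be sketched with the standard library alone. This is a minimal sketch that follows the formula tokens=[name[:2],"#"+name[-1]] exactly as written; the helper name cut_by_name and the lowercasing step (suggested by the [a-z] base) are assumptions, and the package's real cutChar may format its tokens differently, as the example above suggests.

```python
import unicodedata


def cut_by_name(ch):
    """Cut one character into two tokens derived from its Unicode name.

    Hypothetical helper, not part of ZiCutter's API.
    """
    # unicodedata.name returns the default ("") for characters without a name.
    name = unicodedata.name(ch, "").lower()  # lowercasing is an assumption
    if not name:
        return [ch]  # no name available: keep the character as-is
    # First two letters of the name, then "#" plus the last letter.
    return [name[:2], "#" + name[-1]]


print(unicodedata.name("😀"))  # GRINNING FACE
print(cut_by_name("😀"))       # ['gr', '#e']
```

Whatever the exact token format, every token drawn from this scheme falls into a small fixed set of letter and digit Grams, which is what keeps the vocabulary small.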
cut HanZi by IDS (Ideographic Description Sequence)
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
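The IDS cut above can be illustrated with a small lookup. This is only a sketch under stated assumptions: the one-entry table, the base set, and the function name cut_hanzi are hypothetical stand-ins for the data that cutter.build() prepares, and the recursion down to base components is one plausible reading of "base: YuanZi", not the package's confirmed behavior.

```python
# Ideographic Description Characters U+2FF0..U+2FFB (⿰ ... ⿻).
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical data: a one-entry IDS table and a tiny YuanZi base set.
IDS_TABLE = {"熇": "⿰火高"}
YUANZI = {"火", "高"}


def cut_hanzi(ch, ids_table, base):
    """Expand a HanZi into layout operators and base components."""
    # Stop at base components, layout operators, and unknown characters.
    if ch in base or ch in IDC or ch not in ids_table:
        return [ch]
    tokens = []
    for part in ids_table[ch]:
        if part == ch:
            tokens.append(part)  # guard against a self-referencing entry
        else:
            tokens.extend(cut_hanzi(part, ids_table, base))
    return tokens


print(cut_hanzi("熇", IDS_TABLE, YUANZI))  # ['⿰', '火', '高']
```

Characters missing from the table are returned unchanged, so the sketch never fails on input outside its toy data.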