ZiCutter: cut characters into smaller pieces
use
pip install ZiCutter
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, ZiCutter cuts each character into smaller pieces.
vocab
minimum vocab:
az      26
number  10
bigram  1296
index   26
YuanZi  2365
total   3723
cut rare characters by name
name = Unicode name of 'x', lowercased
tokens=[name[:2],"#"+name[-1]]
base: bigrams [a-z][a-z] and [0-9][0-9], plus suffix tokens #[a-z] and #[0-9]
'😀' : name is 'GRINNING FACE'
'😀' -> ["gr","#e"]
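The name-based rule above can be sketched with Python's standard unicodedata module. This is a minimal illustration, not ZiCutter's own code: the function name cut_by_name is hypothetical, and returning the character unchanged when it has no Unicode name is an assumption.

```python
import unicodedata


def cut_by_name(char):
    """Sketch of the name rule: first two letters of the Unicode name,
    plus '#' followed by the last letter. Not ZiCutter's implementation."""
    # unicodedata.name returns the default ("") for unnamed code points.
    name = unicodedata.name(char, "").lower()
    if not name:
        # Assumption: characters without a name are kept whole.
        return [char]
    return [name[:2], "#" + name[-1]]


print(cut_by_name("😀"))  # ['gr', '#e']  (name is 'GRINNING FACE')
```

Every character then maps to two tokens drawn from a small fixed set, at the cost of collisions between characters whose names share the same first two letters and last letter.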
cut HanZi by IDS (Ideographic Description Sequence)
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
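The IDS-based cut can be sketched as a table lookup that expands a character until only base components remain. The names IDS_TABLE, YUANZI, and cut_by_ids are hypothetical, the tiny table is illustrative only (the package uses its own data), and whether ZiCutter expands recursively like this is an assumption.

```python
# Illustrative IDS entries only; the real package uses its own data files.
IDS_TABLE = {
    "熇": "⿰火高",
    "林": "⿰木木",
    "森": "⿱木林",
}

# Hypothetical base set standing in for the YuanZi vocabulary.
YUANZI = {"火", "高", "木"}


def cut_by_ids(char, ids=IDS_TABLE, base=YUANZI):
    """Expand a HanZi into IDS operators and base components.
    Assumes the table contains no cyclic definitions."""
    if char in base or char not in ids:
        # Base components and IDS operators such as '⿰' are emitted as-is.
        return [char]
    tokens = []
    for part in ids[char]:
        tokens.extend(cut_by_ids(part, ids, base))
    return tokens


print(cut_by_ids("熇"))  # ['⿰', '火', '高']
```

With this sketch, a nested character such as '森' expands through '林' down to base components, so the output vocabulary never needs to grow beyond the operators and the base set.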