ZiCutter: cut character smaller

use

pip install ZiCutter

from ZiCutter import ZiCutter

line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"

# build
cutter = ZiCutter(dir="")
cutter.build()

# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))

background

Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, ZiCutter cuts each character into smaller pieces.

vocab

minimum:
    az      26
    number  10
    bigram  1296
    index   26
    YuanZi  2365
    total   3723
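As a quick sanity check, the parts of the minimum vocab can be summed to confirm the stated total. The counts below are copied from the line above; reading the bigram count as all ordered pairs over 26 letters plus 10 digits is an assumption, not something the package states.

```python
# Counts copied from the vocab line above.
vocab_sizes = {
    "az": 26,
    "number": 10,
    "bigram": 1296,
    "index": 26,
    "YuanZi": 2365,
}

total = sum(vocab_sizes.values())
print(total)  # 3723

# Assumption: a bigram is any ordered pair drawn from 26 letters + 10 digits.
alphabet_size = vocab_sizes["az"] + vocab_sizes["number"]
print(alphabet_size ** 2)  # 1296
```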

cut rare characters by name

name = Unicode name of the character 'x'
tokens = [name[:2], "#" + name[-1]]
base: bigrams, [az][az], [09][09], #[az], #[09]

'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
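A minimal sketch of the name-based rule, written from the formula tokens=[name[:2],"#"+name[-1]] and using only the standard unicodedata module. The function name cut_by_name is hypothetical and is not part of ZiCutter; lowercasing the name is an assumption. The package's own output (see the '😀' example above) uses a different marker layout, so this illustrates the idea rather than reproducing cutChar.

```python
import unicodedata


def cut_by_name(ch):
    """Cut one character into two short tokens taken from its Unicode name.

    Hypothetical helper: follows the formula in the text, not ZiCutter's code.
    """
    name = unicodedata.name(ch, "")  # "" when the character has no name
    if not name:
        return [ch]  # nothing to cut, keep the character as is
    name = name.lower()  # assumption: tokens are lowercase
    return [name[:2], "#" + name[-1]]


# 'GRINNING FACE' -> first two letters, then '#' plus the last letter
print(cut_by_name("😀"))  # ['gr', '#e']
```

Both tokens fall inside the small base set of bigrams and '#'-prefixed single symbols, which is what keeps the vocabulary small.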

cut HanZi by IDS (Ideographic Description Sequence)

base: YuanZi (minimum)

熇	⿰火高    
'熇' -> ['⿰','火','高']    
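The IDS cut can be sketched as a table lookup followed by recursive expansion. The table below is hypothetical and holds only the single entry shown above; ZiCutter ships its own data, and the name cut_ids is an assumption for illustration.

```python
# Hypothetical one-entry IDS table, taken from the example above.
IDS_TABLE = {"熇": "⿰火高"}


def cut_ids(ch, table=IDS_TABLE):
    """Expand a HanZi into its IDS components until no entry applies."""
    seq = table.get(ch)
    if seq is None or seq == ch:
        return [ch]  # base component (YuanZi) or unknown character
    tokens = []
    for part in seq:
        tokens.extend(cut_ids(part, table))
    return tokens


print(cut_ids("熇"))  # ['⿰', '火', '高']
```

With a fuller table, components that are not themselves base YuanZi would be expanded further by the same recursion.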

