ZiCutter: cut characters into smaller pieces

use

pip install ZiCutter

from ZiCutter import ZiCutter

line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"

# build
cutter = ZiCutter(dir="")
cutter.build()

# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))

background

Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, we cut each character into smaller pieces.
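The 2/3 figure can be roughly checked with Python's standard `unicodedata` module. This is only a sketch: it counts named code points in whatever Unicode version the local Python ships (which may not be 14.0), and it treats "HanZi" as names starting with `CJK UNIFIED IDEOGRAPH`, which is an assumption about what the figure covers.

```python
import sys
import unicodedata

named = 0  # code points that have a character name
hanzi = 0  # of those, CJK unified ideographs

for cp in range(sys.maxunicode + 1):
    # Unassigned code points, controls and surrogates have no name; use "" as default.
    name = unicodedata.name(chr(cp), "")
    if not name:
        continue
    named += 1
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        hanzi += 1

ratio = hanzi / named
print("Unicode", unicodedata.unidata_version)
print("named characters:", named)
print("HanZi share:", round(ratio, 2))
```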

vocab

minimum vocab:
  a-z      26
  number   10
  Gram     36    (a-z + number)
  YuanZi   2366
  total    2402  (Gram + YuanZi)
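The totals above can be reproduced in a few lines. This is a minimal sketch: the YuanZi list itself is not reproduced here, so `YUANZI_COUNT` is just the figure quoted above, and the variable names are illustrative rather than taken from the package.

```python
import string

# Single-character Grams: 26 lowercase letters plus 10 digits.
letters = list(string.ascii_lowercase)
digits = list(string.digits)
grams = letters + digits

# Size of the YuanZi base as quoted above (the list itself is not included here).
YUANZI_COUNT = 2366

total = len(grams) + YUANZI_COUNT
print(len(letters), len(digits), len(grams), total)  # 26 10 36 2402
```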

cut rare characters by name

name = Unicode name of the character 'x'
tokens=[name[:2],"#"+name[-1]]
base: Grams, [a-z][a-z], [0-9][0-9], #[a-z], #[0-9]

'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
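The name-based rule can be sketched with the standard `unicodedata` module. This follows the formula `tokens=[name[:2],"#"+name[-1]]` literally; the helper `cut_by_name`, the lowercasing, and the removal of spaces and hyphens are assumptions made for illustration, and the exact token spelling emitted by ZiCutter itself may differ.

```python
import unicodedata


def cut_by_name(char):
    """Cut a character into two tokens derived from its Unicode name.

    Hypothetical helper, not part of the ZiCutter API.
    """
    # unicodedata.name raises ValueError for unnamed characters unless a default is given.
    name = unicodedata.name(char, "")
    # Assumption: keep only ASCII letters and digits, lowercased, so that
    # tokens stay inside the [a-z0-9] Gram alphabet.
    name = "".join(ch for ch in name.lower() if ch.isascii() and ch.isalnum())
    if len(name) < 2:
        return []
    # First two characters, plus the last character marked with '#'.
    return [name[:2], "#" + name[-1]]


print(cut_by_name("😀"))  # 'GRINNING FACE' -> ['gr', '#e']
print(cut_by_name("Ⅷ"))  # 'ROMAN NUMERAL EIGHT' -> ['ro', '#t']
```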

cut HanZi by IDS (Ideographic Description Sequence)

base: YuanZi (minimum)

熇	⿰火高
'熇' -> ['⿰','火','高']
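The IDS cut can be sketched as a table lookup that expands a character until every piece is in the base set. The tiny `IDS` table, the `YUANZI` set and the helper `cut_hanzi` below are hypothetical stand-ins for the package's data, and the recursive expansion of nested components is an assumption, since the example above only shows a single level.

```python
# Hypothetical, tiny IDS table for illustration; real IDS data covers
# tens of thousands of characters.
IDS = {
    "熇": "⿰火高",
    "湖": "⿰氵胡",
    "胡": "⿰古月",
}

# Hypothetical base set standing in for the YuanZi list.
YUANZI = {"火", "高", "氵", "古", "月"}


def cut_hanzi(char, ids=IDS, base=YUANZI):
    """Expand a character into IDS operators and base components."""
    # Base components, IDS operators such as '⿰', and unknown characters are kept as-is.
    if char in base or char not in ids:
        return [char]
    tokens = []
    for part in ids[char]:
        if part == char:
            # Guard against entries that describe a character by itself.
            tokens.append(part)
        else:
            tokens.extend(cut_hanzi(part, ids, base))
    return tokens


print(cut_hanzi("熇"))  # ['⿰', '火', '高']
print(cut_hanzi("湖"))  # ['⿰', '氵', '⿰', '古', '月']
```

Keeping the operator ('⿰') in the output preserves the layout of the components, so the sequence describes the original character's structure rather than just a bag of parts.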
