ZiCutter: cut characters into smaller pieces
use
pip install ZiCutter
from ZiCutter import ZiCutter
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"
# build
cutter = ZiCutter(dir="")
cutter.build()
# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))
background
Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocabulary size, we cut each character into smaller pieces.
vocab
minimum: a-z 26, numbers 10, Grams 36, YuanZi 2366, total 2402
cut rare characters by name
name = Unicode name of 'x'
tokens=[name[:2],"#"+name[-1]]
base: Grams, [a-z][a-z], [0-9][0-9], #[a-z], #[0-9]
'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
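The name lookup above can be reproduced with the standard library's `unicodedata.name`. The following is a minimal sketch that applies the formula `tokens=[name[:2],"#"+name[-1]]` literally. The helper `name_tokens` and its fallback for unnamed characters are assumptions for illustration, not ZiCutter's API, and the package's own casing and prefix conventions may differ from what this sketch prints.

```python
import unicodedata


def name_tokens(ch):
    """Hypothetical helper: cut one character into two tokens taken from its Unicode name."""
    # unicodedata.name raises ValueError for unnamed characters unless a default is given.
    name = unicodedata.name(ch, "")
    if not name:
        # Assumption: characters without a Unicode name are returned unchanged.
        return [ch]
    # First two letters of the name, then "#" plus the last letter of the name.
    return [name[:2], "#" + name[-1]]


print(unicodedata.name("😀"))
print(name_tokens("😀"))
```

Every token produced this way is drawn from a small, fixed alphabet of letter and digit grams, which is what keeps the vocabulary bounded regardless of how many rare characters appear in the input.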
cut HanZi by IDS (Ideographic Description Sequence)
base: YuanZi (minimum)
熇 ⿰火高
'熇' -> ['⿰','火','高']
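The IDS step can be pictured as a table lookup followed by splitting the sequence into its symbols. The sketch below assumes a hypothetical one-entry table `IDS_TABLE` and a hypothetical helper `cut_hanzi`; the single entry is copied from the example above, and how ZiCutter stores and applies its own data is not shown here.

```python
# Hypothetical one-entry table: HanZi -> Ideographic Description Sequence (IDS).
IDS_TABLE = {"熇": "⿰火高"}


def cut_hanzi(ch):
    """Return the IDS of ch as a list of symbols, or [ch] if ch is not in the table."""
    ids = IDS_TABLE.get(ch, ch)
    # Each symbol in the IDS (layout operator or component) becomes one token.
    return list(ids)


print(cut_hanzi("熇"))  # ['⿰', '火', '高']
```

The first token is the layout operator (here, left-right composition) and the remaining tokens are the components, so the original character can be recovered from the token sequence.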
File details
Details for the file ZiCutter-0.0.10.tar.gz.
File metadata
- Download URL: ZiCutter-0.0.10.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d4d03609d8083a7fd8a57858660089e8786f1764ca4ec024c2ec236349596c9b |
| MD5 | d91b7d5a5e5931f7a65abb159da7ce57 |
| BLAKE2b-256 | ca5716025be5e484da835c1f77cce7cb88f47b6d2e6c79645d6a7d0cf359cf9e |
File details
Details for the file ZiCutter-0.0.10-py3-none-any.whl.
File metadata
- Download URL: ZiCutter-0.0.10-py3-none-any.whl
- Upload date:
- Size: 1.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ee06158216385ca5eed2601c8edb38e76b60c4b0b3a5fbeca95d785500339fc |
| MD5 | 4d86227600afd9921ffffe801c6eb550 |
| BLAKE2b-256 | cfdadecdafaa11e25e349dfc2a53caa45f67192a2ed09149bc786c6338a63e80 |