ZiCutter: cut characters into smaller pieces

use

pip install ZiCutter

from ZiCutter import ZiCutter

line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'"

# build
cutter = ZiCutter(dir="")
cutter.build()

# use
cutter = ZiCutter(dir="")
for c in line:
    print(cutter.cutChar(c))

background

Unicode 14.0 adds 838 characters, for a total of 144,697 characters (https://www.unicode.org/versions/Unicode14.0.0/). About 2/3 of them are HanZi. To shrink the vocab size, we cut each character into smaller pieces.

vocab

minimum vocab:
az	26
number	10
bigram	1296
index	26
YuanZi	2365
total	3723
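
As a quick arithmetic check, the total is the sum of the five component counts listed above. The dictionary name `vocab_parts` is only illustrative and is not part of the package.

```python
# Component sizes of the smallest vocab, copied from the counts above.
vocab_parts = {
    "az": 26,
    "number": 10,
    "bigram": 1296,
    "index": 26,
    "YuanZi": 2365,
}

# The total vocab size is the sum of the parts.
total = sum(vocab_parts.values())
print(total)  # 3723
```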

cut rare characters by name

name = Unicode name of the character 'x'
tokens = [name[:2], "#" + name[-1]]
base: bigrams, [a~z][a~z], [0~9][0~9], #[a~z], #[0~9]

'😀' : name is 'GRINNING FACE'
'😀' -> ["##gr","ce"]
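
The formula for `tokens` can be sketched with Python's standard `unicodedata` module. This is a literal reading of the formula only, not the package's implementation: the function name `cut_by_name` and the lowercasing step are assumptions, and the output shown in the example above has a different token shape, so the library evidently applies further conventions that are not reproduced here.

```python
import unicodedata


def cut_by_name(char):
    """Hypothetical helper: apply [name[:2], "#" + name[-1]] to a character's Unicode name."""
    # unicodedata.name raises ValueError for characters without a name.
    name = unicodedata.name(char).lower()  # lowercasing is an assumption
    return [name[:2], "#" + name[-1]]


tokens = cut_by_name("😀")  # Unicode name: 'GRINNING FACE'
print(tokens)
```

Two short tokens per character keep the base vocab small, at the cost of mapping many different names to the same token pair.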

cut HanZi by IDS (Ideographic Description Sequence)

base: YuanZi (minimum)

熇	⿰火高
'熇' -> ['⿰','火','高']
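
A minimal sketch of the IDS lookup, assuming a table that maps each HanZi to its description sequence. The one-entry `IDS_TABLE` and the function `cut_by_ids` are hypothetical; the package presumably uses a much larger table and may decompose components further, down to YuanZi.

```python
# Hypothetical one-entry IDS table, taken from the example above.
IDS_TABLE = {"熇": "⿰火高"}


def cut_by_ids(char, table=IDS_TABLE):
    """Return the IDS symbols for char, or [char] if the table has no entry."""
    ids = table.get(char)
    # Each code point of the sequence (structure mark or component) is one token.
    return list(ids) if ids else [char]


print(cut_by_ids("熇"))  # ['⿰', '火', '高']
```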

release history

0.0.10	Feb 24, 2023
0.0.9	Feb 19, 2023
0.0.8	Jan 10, 2023 (this version)
0.0.7	Jan 2, 2023
0.0.6	Jan 2, 2023
0.0.5	Aug 31, 2022
0.0.4	Aug 15, 2022
0.0.3	Jul 9, 2022
0.0.2	Jul 5, 2022
0.0.1	Jun 29, 2022
0.0.0	Jun 28, 2022
