Python 3 Han character library for CJKV languages

Hanzilib

Hanzilib is a library for working with Han characters (Hanzi, Kanji, Hanja, and chữ Hán) used in Chinese, Japanese, Korean, and Vietnamese (CJKV) languages. It provides tools to manage, analyze, and query information about characters based on their visual structure (radicals, glyphs, strokes), pronunciation (readings), and dictionary definitions.

Hanzilib is the successor of cjklib, which has been dormant for more than a decade and is difficult to install on modern versions of Python. This fork is an attempt to port all (or most) of cjklib to modern Python.

Hanzilib also provides the CLI tool hanzi for convenience.

Important: the interface (both Python and CLI) is not final.

Installation

pip install hanzilib

After installing, run hanzi build to build the database.

Current state of the project

  • Core ported to Python 3
  • Core now uses SQLAlchemy 2.0
  • There is still a LOT to update/document
  • The legacy EDICT is used instead of EDICT2; this will be changed shortly

CLI examples

Get character information: -i / --information

$ hanzi -i 個
Information for character 個 (Chinese simplified locale, Unicode domain)
Unicode codepoint: U+500B (20491, character form)
In character domains: Unicode, BIG5, IICore, JISX0208_0213, GlyphInformation, JISX0208, BIG5HKSCS
Radical index: 9, radical form: ⼈, variants: ⺅
Stroke count: 10
Phonetic data (CantoneseYale): go
Phonetic data (GR): gee, geh
...
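
The codepoint line in the output above can be reproduced with the Python standard library alone. The following is a minimal sketch and not hanzilib's own implementation; describe_codepoint is a hypothetical helper name introduced only for illustration.

```python
import unicodedata


def describe_codepoint(char):
    """Return (U+XXXX notation, decimal value, Unicode name) for one character.

    Hypothetical helper for illustration; it is not part of hanzilib.
    """
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    value = ord(char)
    # Format the codepoint as at least four uppercase hex digits, e.g. U+500B.
    return f"U+{value:04X}", value, unicodedata.name(char)


print(describe_codepoint("個"))
```

Data such as radical index, stroke count, and readings is not available from unicodedata, which is why a library backed by a built database is needed for the rest of the output.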

List characters with KangXi radical index: -k / --by-radicalidx

$ hanzi -k 30 # 30 corresponds to 口
+2: 台叽叶叼召叭叾叵叨只史可叺叱叻叫号叧叹句叴另叮叩古
+3: 各吇吏吓吂吖叿吉同吋吐名吆吀叫吗吁吸吊吃吕吒吔合吅
+4: 呗君吩吶呚含呜呃呈呙呅听吲吠呐呖吼呒吙吝呀吣呌吹 ...
...
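
The mapping from a KangXi radical index to its radical form (for example, 30 to 口) can be sketched with the standard library, because the Unicode "Kangxi Radicals" block lists the 214 radicals in index order starting at U+2F00. This is an illustrative sketch under that assumption, not how hanzilib looks up radicals; radical_form and equivalent_character are hypothetical names.

```python
import unicodedata

KANGXI_RADICALS_START = 0x2F00  # first codepoint of the "Kangxi Radicals" block
KANGXI_RADICAL_COUNT = 214


def radical_form(index):
    """Return the radical-form character for a 1-based KangXi radical index."""
    if not 1 <= index <= KANGXI_RADICAL_COUNT:
        raise ValueError("KangXi radical index must be between 1 and 214")
    return chr(KANGXI_RADICALS_START + index - 1)


def equivalent_character(index):
    """Return the ordinary ideograph that the radical form is compatible with."""
    # NFKC normalization replaces a radical codepoint with its
    # compatibility-equivalent unified ideograph.
    return unicodedata.normalize("NFKC", radical_form(index))


print(radical_form(30), equivalent_character(30))
```

Listing the characters that belong to a radical, grouped by additional stroke count as in the output above, still requires the character database.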

Get reading from string: -r / --get-reading

$ hanzi -r 凍檸茶 # defaults to Pinyin
dòng níng chá

$ hanzi -r 凍檸茶 --target-reading Jyutping
dung3 ning4 caa4

Get character from reading: -a / --by-reading

$ hanzi -a dòng # defaults to Pinyin
㓊㗢㢥㣚㣫㼯䆚䞒侗倲働冻凍动動勭垌姛娻峒崠恫戙挏栋桐棟洞湩烔狪甬硐筒筩絧胨胴腖衕詷迵酮霘駧騆𠄉𢳾𥫎𧡍𧼩𧽿𩐤𩐵𩭩𪔦

$ hanzi -a dung3 --source-reading Jyutping
冻凍崠胨腖𰎏

Conversion between simplified and traditional Chinese: -f / --convert-form

$ hanzi -f 龍馬精神
Simplified: 龙马精神
Traditional: 龍馬精神

$ hanzi -f 飛機
Simplified: 飞[机機]     # two or more variants
Traditional: 飛機

$ hanzi -f 龙馬精神
Warning: input string has mixed simplified and traditional forms
Simplified: 龙马精神
Traditional: 龍馬精神
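
The mixed-form warning above can be understood as a check for characters from both sides of a simplified/traditional mapping. The sketch below uses a tiny hand-written table that covers only the characters in these examples; the table and function names are hypothetical, and hanzilib's actual data and logic may differ.

```python
# Illustration-only table; a real converter needs a complete mapping and
# must also handle one-to-many cases such as 机 / 機.
SIMPLIFIED_TO_TRADITIONAL = {"龙": "龍", "马": "馬", "飞": "飛"}
TRADITIONAL_TO_SIMPLIFIED = {t: s for s, t in SIMPLIFIED_TO_TRADITIONAL.items()}


def has_mixed_forms(text):
    """True if text contains both simplified-only and traditional-only characters."""
    has_simplified = any(ch in SIMPLIFIED_TO_TRADITIONAL for ch in text)
    has_traditional = any(ch in TRADITIONAL_TO_SIMPLIFIED for ch in text)
    return has_simplified and has_traditional


def to_simplified(text):
    """Replace each known traditional character; leave everything else as-is."""
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)


def to_traditional(text):
    """Replace each known simplified character; leave everything else as-is."""
    return "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in text)


text = "龙馬精神"
if has_mixed_forms(text):
    print("Warning: input string has mixed simplified and traditional forms")
print("Simplified:", to_simplified(text))
print("Traditional:", to_traditional(text))
```

Characters shared by both scripts, such as 精 and 神, appear in neither table and pass through unchanged.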

Python examples

Character operations

from hanzilib.characterlookup import CharacterLookup
cjk = CharacterLookup("C")

# Many methods; to be documented

Reading conversions

from hanzilib.reading import ReadingFactory
f = ReadingFactory()
f.convert('lǎoshī', 'Pinyin', 'MandarinIPA') # lau˨˩.ʂʅ˥˥

Details of hanzilib

Readings

  • Reading operators provide linguistic operations on a specific reading, such as decomposition ('hok6jyut6ping3' -> ['hok6', 'jyut6', 'ping3'])

  • Reading converters can convert from one reading to another
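
The decomposition example above can be sketched for Jyutping with a regular expression, since every Jyutping syllable ends in a tone digit from 1 to 6. This is a simplified illustration, not hanzilib's reading operator; decompose_jyutping is a hypothetical name, and input without tone digits would need a syllable inventory instead.

```python
import re

# One syllable: lowercase letters followed by a single tone digit 1-6.
SYLLABLE = r"[a-z]+[1-6]"


def decompose_jyutping(text):
    """Split a run of tone-numbered Jyutping syllables into a list."""
    if not re.fullmatch(f"(?:{SYLLABLE})+", text):
        raise ValueError(f"not a sequence of tone-numbered syllables: {text!r}")
    return re.findall(SYLLABLE, text)


print(decompose_jyutping("hok6jyut6ping3"))  # ['hok6', 'jyut6', 'ping3']
```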

The supported readings of this library are tabulated below.

Mandarin

Mandarin is a spoken language and the most widely used branch of the Chinese linguistic family. It relies on hanzi for its written expression.

Reading | Description | Example
Pinyin | standard romanization of Mandarin; uses Latin letters with tone diacritics to spell out sounds | nǐ hǎo
Mandarin IPA | phonetic symbols to represent exact sounds in Mandarin | ni˨˩˦ xau˨˩˦
Gwoyeu Romatzyh (GR) | romanization of Mandarin; uses spelling changes to represent tones (e.g., ai (1st), air (2nd), ae (3rd), ay (4th)) | nii hao; Koong fu tzyy
Wade-Giles | predecessor of Pinyin; uses numbers, hyphens and apostrophes | ni hao / ni3 hao3; K’ung-fu-tzu / Kʻung3-fu1-tzŭ3
Mandarin Braille | tactile writing system of Mandarin; maps Pinyin into braille cells | ⠝⠊⠂ ⠓⠖⠐
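
The table above shows tones written both as diacritics (nǐ hǎo) and as numbers (ni3 hao3). The relationship between the two notations can be sketched with Unicode normalization from the standard library. This is an illustration for space-separated Pinyin syllables only, not hanzilib's converter; the function names are hypothetical, and writing unmarked syllables with tone 5 is an assumed convention.

```python
import unicodedata

# Combining diacritics that carry the four Mandarin tones in Pinyin.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def syllable_to_numbered(syllable):
    """Convert one Pinyin syllable from tone diacritic to trailing tone number."""
    tone = 5  # assumption: syllables without a tone mark are written with 5
    kept = []
    # NFD splits a letter such as ǎ into the base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes what is left, so ü keeps its diaeresis.
    return unicodedata.normalize("NFC", "".join(kept)) + str(tone)


def to_numbered(text):
    """Convert space-separated Pinyin syllables to tone-number notation."""
    return " ".join(syllable_to_numbered(s) for s in text.split())


print(to_numbered("nǐ hǎo"))  # ni3 hao3
```

Unspaced input such as lǎoshī would first have to be split into syllables, which is the kind of task a reading operator handles.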

Cantonese

Cantonese is primarily spoken in Hong Kong, Macau, and the Guangdong province of China. It relies on hanzi for its written expression.

Reading | Description | Example
Jyutping (粵拼) | standard romanization of Cantonese; uses Latin letters and numbers to describe sounds | ning4 mung4 caa4
Cantonese Yale | more intuitive romanization of Cantonese (mainly for English speakers); uses letters and diacritics | lihng mùng chà
Cantonese IPA | phonetic symbols to represent exact sounds in Cantonese | neŋ˨˩ mʊŋ˨˩ tsʰaː˨˩

Supported reading conversions

Mandarin

  • All inter-conversions are supported except those from Mandarin IPA (i.e., all except MandarinIPA -> X)

Cantonese

  • Conversions between Jyutping and Cantonese Yale, in both directions
  • No conversion support for Cantonese IPA
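
The support rules above can be written down as a small lookup. The sketch below only encodes what this section states; it does not query hanzilib. The identifiers Pinyin, MandarinIPA, GR, Jyutping, and CantoneseYale appear elsewhere in this document, while WadeGiles, MandarinBraille, and CantoneseIPA are assumed spellings. Conversions between Mandarin and Cantonese readings are not described here, so the sketch reports them as unsupported.

```python
# Reading names; "WadeGiles", "MandarinBraille" and "CantoneseIPA" are assumed spellings.
MANDARIN_READINGS = {"Pinyin", "MandarinIPA", "GR", "WadeGiles", "MandarinBraille"}
CANTONESE_CONVERTIBLE = {"Jyutping", "CantoneseYale"}


def is_supported(source, target):
    """Return True if the rules in this section allow converting source -> target."""
    if source == target:
        return False  # identity is not a conversion
    if source in MANDARIN_READINGS and target in MANDARIN_READINGS:
        # Everything within Mandarin works except conversions from Mandarin IPA.
        return source != "MandarinIPA"
    # Within Cantonese, only Jyutping and Cantonese Yale convert to each other.
    return source in CANTONESE_CONVERTIBLE and target in CANTONESE_CONVERTIBLE


print(is_supported("Pinyin", "MandarinIPA"))  # True
print(is_supported("MandarinIPA", "Pinyin"))  # False
```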

Changes since cjklib

  • kRSKangXi (data from 康熙字典) was removed from Unihan in favour of kRSUnicode, which is now the standard for getting radicals of Chinese characters (See: Unicode proposal L2/22-195)

  • kRSKanwa (data from 大漢和辭典), kRSJapanese, kRSKorean were removed from Unihan (See: Unicode proposal L2/19-209)

Therefore, in this project, CharacterKangxiRadical and CharacterKanwaRadical are merged into CharacterChineseRadical.
