A character vomiting library — Unicode character sets for CJK, Thai, Vietnamese, and Perl uniprops.

Charguana

A library for "character vomiting".

Works on Python 3.10+ (tested through 3.14).

Install

pip install charguana

What's new in 0.2.0

  • get_charset(name) now returns a list instead of a generator. Use iter_charset(name) if you want the old lazy behavior.
  • all_in_charset(string, charset) added alongside islang — the former requires every character to match; islang remains "any character matches".
  • is_in_charsets(ch, ranges) is now exposed at the top level (previously only in charguana.korean).
  • perluniprops props (IsAlpha, IsAlnum, IsLower, IsUpper, IsSo) and chinese_strokes are now loaded lazily on first access, so import charguana is cheap.
  • Multiple Vietnamese IME bug fixes (VNI U7*, O1..5, U1..5; Telex Uws/Owf/Os/Us families previously produced the wrong vowel).
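
The "any character matches" versus "every character matches" distinction in the list above can be sketched with plain code point ranges. The helper names (`in_ranges`, `any_match`, `all_match`) and the `(start, end)` range format are illustrative assumptions; this is not charguana's implementation and does not reproduce its signatures.

```python
# Stand-in sketch: ranges are inclusive (start, end) code point pairs.
HIRAGANA = [(0x3040, 0x309F)]


def in_ranges(ch, ranges):
    """True if the single character ch falls inside any of the ranges."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ranges)


def any_match(text, ranges):
    """'Any character matches' semantics."""
    return any(in_ranges(ch, ranges) for ch in text)


def all_match(text, ranges):
    """'Every character matches' semantics."""
    return all(in_ranges(ch, ranges) for ch in text)


print(any_match("abcあ", HIRAGANA))
print(all_match("abcあ", HIRAGANA))
print(all_match("あい", HIRAGANA))
```

One design point worth checking in real code: with Python's built-in `all`, an empty string counts as "every character matches", which may or may not be what a caller expects.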

Usage

CJK characters:

>>> from charguana import get_charset

# Hiragana.
>>> ''.join(list(get_charset('hiragana')))
'\u3040ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ\u3097\u3098゙゚゛゜ゝゞゟ'
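
The escaped entries in the output above (`\u3040`, `\u3097`, `\u3098`) are code points that Unicode leaves unassigned inside the Hiragana block, which suggests the charset spans the whole block U+3040..U+309F. The following standard-library sketch assumes that block-based reading (it does not call charguana) and shows how to rebuild the block and drop the unassigned code points if you do not want them.

```python
import unicodedata

# Hiragana block per the Unicode standard: U+3040..U+309F inclusive.
HIRAGANA_START, HIRAGANA_END = 0x3040, 0x309F

block = [chr(cp) for cp in range(HIRAGANA_START, HIRAGANA_END + 1)]

# Unassigned code points have the general category "Cn".
assigned = [ch for ch in block if unicodedata.category(ch) != "Cn"]

print(len(block), len(assigned))
print("".join(assigned[:5]))
```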

# Katakana.
>>> ''.join(list(get_charset('katakana')))
'゠ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ'

# Bopomofo.
>>> ''.join(list(get_charset('bopomofo')))
'\u3100\u3101\u3102\u3103\u3104ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩㄪㄫㄬㄭ\u312e\u312f'

# Punctuation.
>>> ''.join(list(get_charset('punctuation')))
'\u3000、。〃〄々〆〇〈〉《》「」『』【】〒〓〔〕〖〗〘〙〚〛〜〝〞〟〠〡〢〣〤〥〦〧〨〩〪〭〮〯〫〬〰〱〲〳〴〵〶〷〸〹〺〻〼〽〾〿'

# Romanji
>>> ''.join(list(get_charset('romanji')))
'\uff00!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン゙゚ᅠᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ\uffbf\uffc0\uffc1ᅡᅢᅣᅤᅥᅦ\uffc8\uffc9ᅧᅨᅩᅪᅫᅬ\uffd0\uffd1ᅭᅮᅯᅰᅱᅲ\uffd8\uffd9ᅳᅴᅵ\uffdd\uffde\uffdf¢£¬ ̄¦¥₩\uffe7│←↑→↓■○\uffef'
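
The `\uff00` and `\uffef` endpoints in the output above indicate that the 'romanji' charset covers the Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF). Characters in that block are compatibility variants of ordinary letters, digits, and kana, so comparing them with plain ASCII text usually needs a normalization step. The sketch below uses only the standard library; `is_fullwidth_form` is a hypothetical helper, not a charguana function.

```python
import unicodedata

# Fullwidth "ABC123", written with escapes so the code points are unambiguous.
fullwidth = "\uff21\uff22\uff23\uff11\uff12\uff13"

# NFKC normalization folds compatibility variants to their ordinary forms.
folded = unicodedata.normalize("NFKC", fullwidth)


def is_fullwidth_form(ch):
    """True if ch lies in the Halfwidth and Fullwidth Forms block."""
    return 0xFF00 <= ord(ch) <= 0xFFEF


print(folded)
print(all(is_fullwidth_form(ch) for ch in fullwidth))
```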


# Chinese.
>>> from charguana import tradify, simplify, chinese_strokes
>>> get_charset('chinese') == get_charset('zh')
True
>>> get_charset('zh') == get_charset('cn')
True
>>> get_charset('simplified_chinese')[:10]
['锕', '皑', '蔼', '碍', '爱', '嗳', '嫒', '瑷', '暧', '霭']
>>> get_charset('traditional_chinese')[:10]
['錒', '皚', '藹', '礙', '愛', '噯', '嬡', '璦', '曖', '靄']
>>> simplify('錒')
'锕'
>>> tradify('锕')
'錒'
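
For intuition, a character-by-character conversion like the one above can be sketched with `str.maketrans` and `str.translate`. The five-pair table below is a tiny stand-in built from the characters shown in this section, and `simplify_sketch` / `tradify_sketch` are hypothetical names; how charguana stores and applies its own mapping is not shown here.

```python
# Stand-in table: positions in the two strings line up pairwise.
SIMPLIFIED = "锕皑蔼碍爱"
TRADITIONAL = "錒皚藹礙愛"

TO_SIMPLIFIED = str.maketrans(TRADITIONAL, SIMPLIFIED)
TO_TRADITIONAL = str.maketrans(SIMPLIFIED, TRADITIONAL)


def simplify_sketch(text):
    """Replace each known traditional character; leave everything else alone."""
    return text.translate(TO_SIMPLIFIED)


def tradify_sketch(text):
    """Replace each known simplified character; leave everything else alone."""
    return text.translate(TO_TRADITIONAL)


print(simplify_sketch("愛"))
print(tradify_sketch("锕"))
```

A one-to-one table like this cannot represent simplified characters that correspond to more than one traditional character, so treat it as an illustration of the lookup idea only.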
>>> chinese_strokes['绝']
9
>>> chinese_strokes['絕']
12
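
Loading a large table such as the stroke counts only on first access can be done in several ways. The sketch below shows one of them under stated assumptions: `LazyTable` and `load_strokes` are hypothetical names, the two stroke counts are copied from the example above, and nothing here is taken from charguana's source.

```python
class LazyTable:
    """Dict-like wrapper that defers loading until the first lookup."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None

    def __getitem__(self, key):
        if self._data is None:  # the first lookup triggers the load
            self._data = self._loader()
        return self._data[key]


load_calls = []


def load_strokes():
    # Stand-in for parsing a large data file.
    load_calls.append(1)
    return {"绝": 9, "絕": 12}


strokes = LazyTable(load_strokes)
print(len(load_calls))   # nothing has been loaded yet
print(strokes["绝"], strokes["絕"])
print(len(load_calls))   # loaded exactly once
```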

# Japanese.
>>> ''.join(list(get_charset('japanese'))) == ''.join(list(get_charset('ja')))
True
>>> ''.join(list(get_charset('ja'))) == ''.join(list(get_charset('jp')))
True

# Korean.
>>> ''.join(list(get_charset('korean'))) == ''.join(list(get_charset('ko'))) == ''.join(list(get_charset('kr')))
True
>>> ''.join(list(get_charset('ko'))) == ''.join(list(get_charset('kr')))
True

# All Chinese, Korean, Japanese and Romanji.
>>> ''.join(list(get_charset('cjk')))

Perluniprops Characters:

>>> from charguana import get_charset

# Open Punctuation.
>>> ''.join(get_charset('Open_Punctuation'))
'([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「'

# Close Punctuation.
>>> ''.join(get_charset('Close_Punctuation'))
')]}༻༽᚜⁆⁾₎〉❩❫❭❯❱❳❵⟆⟧⟩⟫⟭⟯⦄⦆⦈⦊⦌⦎⦐⦒⦔⦖⦘⧙⧛⧽⸣⸥⸧⸩〉》」』】〕〗〙〛〞〟﴿︘︶︸︺︼︾﹀﹂﹄﹈﹚﹜﹞)]}⦆」'

# Currency Symbols.
>>> ''.join(get_charset('Currency_Symbol'))
'$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩'
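
The property names above correspond to Unicode general categories: Open_Punctuation is `Ps`, Close_Punctuation is `Pe`, and Currency_Symbol is `Sc`. As a cross-check, similar sets can be derived from Python's own `unicodedata` module. This is a sketch, not how charguana builds its lists; `chars_in_category` is a hypothetical helper, it only scans the Basic Multilingual Plane, and the result depends on the Unicode version bundled with your Python, so it may not match the lists above exactly.

```python
import unicodedata


def chars_in_category(category, limit=0x10000):
    """Collect characters below `limit` whose general category matches."""
    return [
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)) == category
    ]


open_punct = chars_in_category("Ps")   # Open_Punctuation
close_punct = chars_in_category("Pe")  # Close_Punctuation
currency = chars_in_category("Sc")     # Currency_Symbol

print("".join(open_punct[:3]))
print("".join(close_punct[:3]))
print("".join(currency[:5]))
```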

# Numbers.
>>> ''.join(list(get_charset('IsN'))[:50])
'0123456789²³¹¼½¾٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३'

# Alphabetic
>>> ''.join(list(get_charset('IsAlpha'))[:50])
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx'

# Lowercase.
>>> ''.join(list(get_charset('IsLower'))[:50])
'abcdefghijklmnopqrstuvwxyzªµºßàáâãäåæçèéêëìíîïðñòó'

# Uppercase.
>>> ''.join(list(get_charset('IsUpper'))[:50])
'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØ'

# Alphanumeric.
>>> ''.join(list(get_charset('IsAlnum'))[:50])
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmn'

Thai

# Thai.
>>> ''.join(list(get_charset('thai')))[:50]
'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ\u0e7f฿ะั'
# Thai consonants.
>>> from charguana import get_charset_ranges
>>> from charguana.thai import thai_consonants
>>> list(get_charset_ranges([thai_consonants]))[:10]
['ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', 'ช']
# Thai Vowels
>>> from charguana.thai import thai_vowels_1, thai_vowels_2
>>> list(get_charset_ranges([thai_vowels_1, thai_vowels_2]))[:10]
['ะ', 'ั', 'า', 'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู']
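
Expanding code point ranges into characters, as the range-based calls above do, can be sketched with a small generator. The inclusive `(first, last)` tuple format, the `THAI_CONSONANT_RANGE` constant, and `expand_ranges` are assumptions made for illustration; charguana's own range objects may be shaped differently.

```python
from itertools import islice

# Thai consonant letters occupy U+0E01 (ก) through U+0E2E (ฮ).
THAI_CONSONANT_RANGE = (0x0E01, 0x0E2E)


def expand_ranges(ranges):
    """Lazily yield every character in each inclusive (first, last) range."""
    for first, last in ranges:
        for cp in range(first, last + 1):
            yield chr(cp)


# islice takes the first ten characters without expanding the rest.
first_ten = list(islice(expand_ranges([THAI_CONSONANT_RANGE]), 10))
print(first_ten)
```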

Vietnamese

# Vietnamese
>>> from charguana import get_charset
>>> ''.join(list(get_charset('viet'))[:50])
'AĂÂBCChDĐEÊGGhGiHIKKhLMNNgNghNhOÔƠPPhQRSTThTrUƯVXYFJWZaăâbcchd'

# Vietnamese tones.
>>> from charguana.viet import viet_tones
>>> viet_tones.huyen
'̀'
>>> 'o' + viet_tones.huyen
'ò'
>>> 'o' + viet_tones.sac
'ó'
>>> 'o' + viet_tones.hoi
'ỏ'
>>> 'o' + viet_tones.nga
'õ'
>>> 'o' + viet_tones.nang
'ọ'
>>> 'o' + viet_tones.ngang
'o'
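
Because the tones are combining marks, `'o' + viet_tones.huyen` is a two-code-point string that merely renders as one glyph. If you need to compare it with precomposed text, normalize to NFC first. The sketch below uses only the standard library; the `TONE_MARKS` table is an assumption about which combining marks the tone names stand for, not an import from charguana.

```python
import unicodedata

# Assumed combining marks for each tone name.
TONE_MARKS = {
    "ngang": "",        # level tone: no mark
    "huyen": "\u0300",  # combining grave accent
    "sac": "\u0301",    # combining acute accent
    "hoi": "\u0309",    # combining hook above
    "nga": "\u0303",    # combining tilde
    "nang": "\u0323",   # combining dot below
}

decomposed = "o" + TONE_MARKS["huyen"]
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(decomposed == composed)
```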

# Vietnamese consonants.
>>> from charguana.viet import viet_consonants
>>> list(viet_consonants)
['A', 'Ă', 'Â', 'B', 'C', 'Ch', 'D', 'Đ', 'E', 'Ê', 'G', 'Gh', 'Gi', 'H', 'I', 'K', 'Kh', 'L', 'M', 'N', 'Ng', 'Ngh', 'Nh', 'O', 'Ô', 'Ơ', 'P', 'Ph', 'Q', 'R', 'S', 'T', 'Th', 'Tr', 'U', 'Ư', 'V', 'X', 'Y', 'F', 'J', 'W', 'Z', 'a', 'ă', 'â', 'b', 'c', 'ch', 'd', 'đ', 'e', 'ê', 'g', 'gh', 'gi', 'h', 'i', 'k', 'kh', 'l', 'm', 'n', 'ng', 'ngh', 'nh', 'o', 'ô', 'ơ', 'p', 'ph', 'q', 'r', 's', 't', 'th', 'tr', 'u', 'ư', 'v', 'x', 'y', 'f', 'j', 'w', 'z']

# Vietnamese vowels with diacritics.
>>> from charguana.viet import a, a6, a8
>>> a
['A', 'Á', 'À', 'Ả', 'Ã', 'Ạ', 'a', 'á', 'à', 'ả', 'ã', 'ạ']
>>> a6
['Â', 'Ấ', 'Ầ', 'Ẩ', 'Ẫ', 'Ậ', 'â', 'ấ', 'ầ', 'ẩ', 'ẫ', 'ậ']
>>> a8
['Ă', 'Ắ', 'Ằ', 'Ẳ', 'Ẵ', 'Ặ', 'ă', 'ắ', 'ằ', 'ẳ', 'ẵ', 'ặ']

# The full set of Vietnamese tone marks.
>>> from charguana.viet import viet_tones
>>> viet_tones
Tones(ngang='', huyen='̀', sac='́', hoi='̉', nga='̃', nang='̣')

# Vietnamese IME.
>>> from charguana.viet import viet_ime
>>> viet_ime('Nguye64n Tra62n Anh Thu7')
'Nguyễn Trần Anh Thư'
# IME typo: by default an unknown sequence such as 'u8' is left unchanged.
>>> viet_ime('Nguye64n Tra62n Anh Thu8')  # unchecked
'Nguyễn Trần Anh Thu8'
>>> viet_ime('Nguye64n Tra62n Anh Thu8', raise_keyerror=True)  # checked
Traceback (most recent call last):
    ...
KeyError: 'u8'
# Telex
>>> viet_ime('Nguyeefn Traafn Anh Thuw', mapping='telex')
'Nguyền Trần Anh Thư'
# Shortcut for the Telex IME using functools.partial.
>>> from functools import partial
>>> from charguana.viet import viet_ime
>>> telex_ime = partial(viet_ime, mapping='telex')
>>> telex_ime('Nguyeefn Traafn Anh Thuw')
'Nguyền Trần Anh Thư'
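
To make the VNI convention concrete, here is a deliberately small sketch of how digits after a vowel can be turned into diacritics with combining marks and NFC normalization. It is not charguana's `viet_ime`: the name `vni_sketch`, the tables, and the rule that a shape digit must directly follow its vowel are all assumptions of this example, and `9` for đ is omitted.

```python
import unicodedata

# Shape digits: which base vowels they apply to, and the combining mark.
SHAPES = {
    "6": ("aeo", "\u0302"),  # circumflex: â ê ô
    "7": ("ou", "\u031b"),   # horn: ơ ư
    "8": ("a", "\u0306"),    # breve: ă
}
# Tone digits: sắc, huyền, hỏi, ngã, nặng.
TONES = {"1": "\u0301", "2": "\u0300", "3": "\u0309", "4": "\u0303", "5": "\u0323"}
VOWELS = "aeiouy"


def vni_sketch(text):
    out = []  # each item: one base character plus any combining marks
    for ch in text:
        prev = out[-1] if out else ""
        base = prev[:1].lower()
        has_tone = any(mark in prev for mark in TONES.values())
        if ch in SHAPES and len(prev) == 1 and base in SHAPES[ch][0]:
            out[-1] = prev + SHAPES[ch][1]
        elif ch in TONES and base in VOWELS and prev and not has_tone:
            out[-1] = prev + TONES[ch]
        else:
            out.append(ch)  # unknown combinations are passed through
    return unicodedata.normalize("NFC", "".join(out))


print(vni_sketch("Nguye64n Tra62n Anh Thu7"))
print(vni_sketch("Thu8"))
```

Note that this sketch also rewrites ordinary digits that happen to follow a vowel (for example in "a1"), so a real input method needs more context than is modeled here.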

Download files

Source Distribution

charguana-0.3.0.tar.gz (220.1 kB)

Uploaded Source

Built Distribution

charguana-0.3.0-py3-none-any.whl (222.5 kB)

Uploaded Python 3

File details

Details for the file charguana-0.3.0.tar.gz.

File metadata

  • Download URL: charguana-0.3.0.tar.gz
  • Upload date:
  • Size: 220.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for charguana-0.3.0.tar.gz
Algorithm Hash digest
SHA256 df36e64baae4450c4ddf91ae6a8cbd1fa41d2664d364e0010f341052c2425905
MD5 5d3bf6b5e48fa01d89bb61d2fa62799d
BLAKE2b-256 8df2fe5bdc5bf3d6dd574301adfabe22189b8853efa8308fa8aad98bf58e8c86

File details

Details for the file charguana-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: charguana-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 222.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for charguana-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 08af67e6a135cbd8ec448b77fa483808f52af1047c2524729b875c90759727c5
MD5 4dfa1f15025550b1587b36cba4ea673c
BLAKE2b-256 79a4f8f364fd62138c4c26e54249f973cdf11865dd009a3ca8c92955ed6507cf
