Adds pinyin to lists of Chinese strings (UTF-8 only)

Project description

Setup

import pinyiniser as pyer

pyer.add_pinyin(zh_string, zh_dict, special={}, do_not_parse=do_not_parse_set)

Adds pinyin to a UTF-8 Chinese string.
Returns the original string, a newline, the pinyin, and a trailing newline (string + \n + pinyin + \n).

special

A dictionary that maps Chinese strings to the pinyin you want output for them, like:

{
    '卡妮雅': 'Ka3ni1ya3',
    '伊雷米': 'Yi1lei3mi3',
    '乌蕾妮': 'Wu1lei3ni1',
}

It searches for the keys and outputs the value of the matching key-value pair.
This is a 1:1 mapping: if the string doesn't match the key exactly, it will not match. It is more than just a way to map names, since any string can be wholly replaced using this method.
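The exact-match behaviour can be illustrated with a minimal sketch. This is not the library's code: `apply_special` is a hypothetical helper, and it assumes the lookup is applied to a list of already segmented words.

```python
def apply_special(words, special):
    """Replace every word that exactly equals a key of `special`.

    Hypothetical helper for illustration only; words that are not
    an exact key are passed through unchanged.
    """
    return [special.get(word, word) for word in words]


special = {'卡妮雅': 'Ka3ni1ya3'}

# '卡妮' is only a prefix of the key, so it is left untouched.
result = apply_special(['卡妮雅', '卡妮'], special)
```

Because the match is exact, a partial occurrence of a key is never rewritten, which is what makes the mapping safe for names.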

do_not_parse

do_not_parse is a set that by default looks like so:

do_not_parse_set = {
    #Chinese special chars
    '?', ',', '!', '。', ';', '“', '”', ':', '–', '—', '*',
    '…', '、', '~', '-', '(', ')', '─', '<', '>', '.', '《', '》',
    '%', '·', '’', '‘', '……', '【', '】',
    #Standard special chars
    '`', '~', '!', '@', '#', '^', '&', '*', '(', ')', '-', '_',
    '[', ']', '{', '}', '\\', '|', ';', ':', '\'', '"', ',', '<', '.',
    '>', '/', '?',
    #Maths
    '=', '+', '-', '/', '%',
    #Currency chars
    '$', '¥', '£', '€'}

Jieba returns a list of the words it has detected. English words and punctuation are also returned as entries in that list.

We cut up the sentence using Jieba to generate a list of words, then step through this list and add the pinyin for each entry to the output sentence.

We need to add spaces between the elements of the list when they are added to the sentence, but an element that is in do_not_parse is added without a preceding space, as punctuation should be.

For example, ['ni3hao3', '.'] becomes 'ni3hao3 .' if we don't use the do_not_parse set, and 'ni3hao3.' with the set.
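The spacing rule described above can be sketched as follows. `join_tokens` is a hypothetical name and this is a simplified reading of the rule, not the library's implementation: a space is placed before every token except the first one and any token found in the set.

```python
def join_tokens(tokens, do_not_parse):
    """Join pinyin tokens, skipping the space before do_not_parse entries.

    Hypothetical helper: it only models the rule stated in the text.
    """
    sentence = ''
    for token in tokens:
        # Separate ordinary tokens with a space; attach punctuation directly.
        if sentence and token not in do_not_parse:
            sentence += ' '
        sentence += token
    return sentence


with_set = join_tokens(['ni3hao3', '.'], {'.'})
without_set = join_tokens(['ni3hao3', '.'], set())
```

Note that this simple version also attaches opening brackets and quotes to the preceding word, so it is a sketch of the idea rather than a complete formatter.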

To extend this, create your own set (called whatever you like) and union it with the original do_not_parse_set:

my_do_not_parse_set = my_do_not_parse_set.union(pyer.do_not_parse_set)
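For the union to work, your own set has to exist before it is combined. A minimal sketch, using a small stand-in for the default set so that it runs without the package installed (the extra characters are arbitrary examples):

```python
# Stand-in for pyer.do_not_parse_set, shortened for illustration.
default_do_not_parse_set = {'?', ',', '!', '。'}

# Extra characters that should also be attached without a space.
my_do_not_parse_set = {'※', '→'}

# union() returns a new set and leaves both inputs unchanged.
my_do_not_parse_set = my_do_not_parse_set.union(default_do_not_parse_set)
```

The combined set would then be passed as the do_not_parse argument shown in the function signatures.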

zh_dict

zh_dict = pyer.get_dictionary(True)

True for numerals - shuo1
False for diacritics - shuō

Personally I prefer numerals as it makes it harder to read, but depending on your application this may not be what you want.
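To make the difference between the two formats concrete, here is a sketch that converts numbered pinyin into diacritic pinyin using the usual tone-mark placement rules. It is an illustration only: `numeral_to_diacritic` and `convert_word` are hypothetical helpers, and nothing here is claimed to be how get_dictionary builds its entries.

```python
import re
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING_TONE_MARKS = ['\u0304', '\u0301', '\u030c', '\u0300']


def numeral_to_diacritic(syllable):
    """Convert one numbered syllable, e.g. 'shuo1', to its diacritic form."""
    if not syllable or syllable[-1] not in '12345':
        return syllable              # no tone numeral: leave unchanged
    base, tone = syllable[:-1], int(syllable[-1])
    if tone == 5:
        return base                  # neutral tone carries no mark
    lower = base.lower()
    # Placement: 'a' or 'e' if present, the 'o' of 'ou', else the last vowel.
    if 'a' in lower:
        idx = lower.index('a')
    elif 'e' in lower:
        idx = lower.index('e')
    elif 'ou' in lower:
        idx = lower.index('ou')
    else:
        idx = max(lower.rfind(v) for v in 'iou\u00fc')
    if idx < 0:
        return base                  # no vowel found
    marked = unicodedata.normalize(
        'NFC', base[idx] + COMBINING_TONE_MARKS[tone - 1])
    return base[:idx] + marked + base[idx + 1:]


def convert_word(word):
    """Convert every numbered syllable inside a word such as 'ni3hao3'."""
    return re.sub(r'[A-Za-z\u00fc]+[1-5]',
                  lambda m: numeral_to_diacritic(m.group()), word)
```

With this sketch, convert_word('shuo1') gives the diacritic form shown above for False.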

zh_dict details

zh_dict is a dictionary of dictionaries, where the first key is the character and the second key is 'pinyin', e.g. zh_dict[zh_char]['pinyin']

Any dictionary with this structure will work, which gives you flexibility in what you use. For example, the same dict can also hold English, as zh_dict[zh_char]['english'], for further processing.
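A small hand-built dictionary with this shape might look like the sketch below. The two entries and the `lookup` helper are illustrative assumptions; only the 'pinyin' key is required by the structure described above, and 'english' is the optional extra field.

```python
# Toy dictionary in the described shape: character -> {'pinyin': ..., ...}
zh_dict = {
    '说': {'pinyin': 'shuo1', 'english': 'to speak'},
    '好': {'pinyin': 'hao3', 'english': 'good'},
}


def lookup(zh_char, zh_dict, field='pinyin'):
    """Return one field of an entry, or None if the character is unknown.

    Hypothetical helper for illustration only.
    """
    entry = zh_dict.get(zh_char)
    return entry.get(field) if entry else None


pinyin = lookup('说', zh_dict)
english = lookup('说', zh_dict, field='english')
missing = lookup('猫', zh_dict)
```

Returning None for unknown characters keeps the lookup from raising a KeyError, which is one possible design choice when a custom dictionary is incomplete.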

pyer.get_pinyin(zh_string, zh_dict, do_not_parse=do_not_parse_set)

Gets the pinyin as a list.

zh_string is any UTF-8 string of Chinese characters.
