
Adds pinyin to lists of Chinese strings (UTF-8 only)

Project description

Setup

import pinyiniser as pyer

pyer.add_pinyin(zh_string, zh_dict, special={}, do_not_parse=do_not_parse_set)

Adds pinyin to a UTF-8 Chinese string.
Returns the string + \n + pinyin + \n.
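
The return format can be pictured with a small stand-alone sketch. It does not call the library: format_with_pinyin is a hypothetical helper, and the pinyin value is written by hand for illustration.

```python
def format_with_pinyin(zh_string: str, pinyin: str) -> str:
    """Return the original string, a newline, the pinyin, and a final newline."""
    return f"{zh_string}\n{pinyin}\n"


# Hand-written pinyin, standing in for what the library would look up.
result = format_with_pinyin("你好。", "ni3hao3.")
print(result, end="")
```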

special

a dictionary of strings like:

{
    '卡妮雅': 'Ka3ni1ya3',
    '伊雷米': 'Yi1lei3mi3',
    '乌蕾妮': 'Wu1lei3ni1',
}

It searches for the keys and outputs the corresponding value.
This is a 1:1 mapping: if the string doesn't match the key exactly, it will not match. It is not limited to names; any string can be wholly replaced using this method.
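
As a rough illustration of exact matching, here is a stand-alone sketch. It assumes matching happens per segmented word, which the text above does not confirm, and lookup_special is a hypothetical helper rather than part of the library.

```python
special = {
    '卡妮雅': 'Ka3ni1ya3',
    '伊雷米': 'Yi1lei3mi3',
}


def lookup_special(word, special):
    """Return the replacement for an exact key match, otherwise None."""
    return special.get(word)


exact = lookup_special('卡妮雅', special)    # whole key: replaced
partial = lookup_special('卡妮', special)    # only part of a key: no match
print(exact, partial)
```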

do_not_parse

do_not_parse is a set that by default looks like this:

do_not_parse_set = {
    #Chinese special chars
    '?', ',', '!', '。', ';', '“', '”', ':', '–', '—', '*',
    '…', '、', '~', '-', '(', ')', '─', '<', '>', '.', '《', '》',
    '%', '·', '’', '‘', '……', '【', '】',
    #Standard special chars
    '`', '~', '!', '@', '#', '^', '&', '*', '(', ')', '-', '_',
    '[', ']', '{', '}', '\\', '|', ';', ':', '\'', '"', ',', '<', '.',
    '>', '/', '?',
    #Maths
    '=', '+', '-', '/', '%',
    #Currency chars
    '$', '¥', '£', '€'}

Jieba returns a list of the words it has detected. English words and punctuation are also returned as entries in the list.

We cut up the sentence using Jieba to generate a list of words, then step through this list and add the pinyin to the sentence.

We need to add spaces between the elements of the list when they are added to the sentence, but an element that is in do_not_parse is added without a space, as punctuation should be.

For example, ['ni3hao3', '.'] becomes 'ni3hao3 .' without the do_not_parse set, and 'ni3hao3.' with it.
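
The spacing rule can be sketched as follows. This is a minimal stand-alone approximation, not the library's own code; join_tokens is a hypothetical name.

```python
def join_tokens(tokens, do_not_parse):
    """Join tokens with spaces, attaching do_not_parse entries without one."""
    sentence = ""
    for token in tokens:
        if token in do_not_parse or not sentence:
            # Punctuation (or the very first token) gets no leading space.
            sentence += token
        else:
            sentence += " " + token
    return sentence


tokens = ['ni3hao3', '.']
with_set = join_tokens(tokens, {'.'})
without_set = join_tokens(tokens, set())
print(repr(with_set), repr(without_set))
```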

To extend this, create your own do_not_parse_set (called whatever you like) and union it with the original do_not_parse_set.

my_do_not_parse_set = my_do_not_parse_set.union(pyer.do_not_parse_set)

zh_dict

zh_dict = pyer.get_dictionary(True)

True for numerals - shuo1
False for diacritics - shuō

Personally I prefer numerals, as they make the pinyin harder to read, but depending on your application this may not be what you want.
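
To make the difference between the two formats concrete, here is a stand-alone sketch that converts one numbered syllable to its diacritic form using the usual tone-mark placement rules. It is not how the library produces its output (the dictionary is simply loaded in one format or the other), and numeral_to_diacritic is a hypothetical helper that only handles lowercase single syllables.

```python
TONE_MARKS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'ü': 'ǖǘǚǜ',
}


def numeral_to_diacritic(syllable: str) -> str:
    """Convert one lowercase numbered syllable, e.g. 'shuo1', to 'shuō'."""
    if not syllable or syllable[-1] not in '12345':
        return syllable  # no tone number: leave unchanged
    base, tone = syllable[:-1], int(syllable[-1])
    if tone == 5:
        return base  # neutral tone carries no mark
    # Placement: 'a' or 'e' takes the mark; in 'ou' the 'o' does;
    # otherwise the last vowel does.
    if 'a' in base:
        idx = base.index('a')
    elif 'e' in base:
        idx = base.index('e')
    elif 'ou' in base:
        idx = base.index('o')
    else:
        vowels = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowels:
            return base  # nothing to put a mark on
        idx = vowels[-1]
    marked = TONE_MARKS[base[idx]][tone - 1]
    return base[:idx] + marked + base[idx + 1:]


print(numeral_to_diacritic('shuo1'))
```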

zh_dict details

zh_dict is a dictionary of dictionaries, where the first key is the character and the second key is 'pinyin', e.g. zh_dict[zh_char]['pinyin'].

Any dictionary with this set of key-value pairs will work, which gives you flexibility in what you use. For example, your dict can hold English too, as zh_dict[zh_char]['english'], for further processing.
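
A toy dictionary with this shape might look like the sketch below. The entries and the lookup helper are made up for illustration and are not taken from the dictionary the library ships; the 'english' field is the optional extra mentioned above.

```python
toy_dict = {
    '你': {'pinyin': 'ni3', 'english': 'you'},
    '好': {'pinyin': 'hao3', 'english': 'good'},
    '说': {'pinyin': 'shuo1', 'english': 'to speak'},
}


def lookup(chars, zh_dict, field='pinyin'):
    """Collect one field per character; characters with no entry pass through unchanged."""
    return [zh_dict[ch][field] if ch in zh_dict else ch for ch in chars]


pinyin = lookup('你好', toy_dict)
english = lookup('你好', toy_dict, field='english')
print(pinyin, english)
```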

pyer.get_pinyin(zh_string, zh_dict, do_not_parse=do_not_parse_set)

Gets pinyin as a list.

zh_string is any UTF-8 string of Chinese characters.
