Adds pinyin to lists of Chinese strings (UTF-8 only)
Setup
import pinyiniser as pyer
pyer.add_pinyin(zh_string, dict, special={}, do_not_parse=do_not_parse_set)
Adds pinyin to a UTF-8 Chinese string.
Returns string + \n + pinyin + \n
special
A dictionary mapping Chinese strings to the text that should replace them, for example:
{
'卡妮雅': 'Ka3ni1ya3',
'伊雷米': 'Yi1lei3mi3',
'乌蕾妮': 'Wu1lei3ni1',
}
It searches for the keys and outputs the value of the matching key-value pair.
This is a 1:1 mapping: if the string doesn't match the key exactly, it will not match.
This is more than just a way to map names; any string can be wholly replaced using this method.
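The exact-match behaviour described above can be sketched in plain Python. This is only an illustration under the assumption that matching is a whole-string dictionary lookup; `lookup_special` is a hypothetical helper, not part of pinyiniser.

```python
# Hypothetical helper: illustrates exact-match lookup only,
# not how pinyiniser implements `special` internally.
special = {
    '卡妮雅': 'Ka3ni1ya3',
    '伊雷米': 'Yi1lei3mi3',
    '乌蕾妮': 'Wu1lei3ni1',
}


def lookup_special(token, special):
    """Return the replacement if `token` is exactly a key, else None."""
    return special.get(token)


print(lookup_special('卡妮雅', special))  # whole key matches: Ka3ni1ya3
print(lookup_special('卡妮', special))    # partial key does not match: None
```

A partial string such as '卡妮' falls through to normal processing because it is not a key in its own right.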
do_not_parse
do_not_parse is a set that by default looks like this:
do_not_parse_set = {
#Chinese special chars
'?', ',', '!', '。', ';', '“', '”', ':', '–', '—', '*',
'…', '、', '~', '-', '(', ')', '─', '<', '>', '.', '《', '》',
'%', '·', '’', '‘', '……', '【', '】',
#Standard special chars
'`', '~', '!', '@', '#', '^', '&', '*', '(', ')', '-', '_',
'[', ']', '{', '}', '\\', '|', ';', ':', '\'', '"', ',', '<', '.',
'>', '/', '?',
#Maths
'=', '+', '-', '/', '%',
#Currency chars
'$', '¥', '£', '€'}
Jieba
returns a list of the words it has detected. English words and punctuation are also returned,
each as an entry in the list.
We cut up the sentence using Jieba to generate a list of words, then step through this list and add the pinyin to the sentence.
We need to add spaces between the elements of the list when they are added to the sentence, but an element that is in do_not_parse is added without a space, as punctuation should be.
For example, ['ni3hao3', '.'] becomes 'ni3hao3 .' without the do_not_parse set, and 'ni3hao3.' with it.
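The spacing rule above can be sketched as follows. This is a minimal illustration, not pinyiniser's actual code: `join_tokens` is a hypothetical name, and the sketch assumes every do_not_parse entry attaches to the text before it (which would not be ideal for opening brackets).

```python
def join_tokens(tokens, do_not_parse):
    """Join tokens with spaces, attaching do_not_parse entries directly."""
    sentence = ''
    for token in tokens:
        if token in do_not_parse or not sentence:
            # Punctuation (or the very first token) gets no leading space.
            sentence += token
        else:
            sentence += ' ' + token
    return sentence


print(join_tokens(['ni3hao3', '.'], {'.'}))   # ni3hao3.
print(join_tokens(['ni3hao3', '.'], set()))   # ni3hao3 .
```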
To extend this, create your own set (called whatever you like) and union it with the original do_not_parse_set:
my_do_not_parse_set = my_do_not_parse_set.union(pyer.do_not_parse_set)
zh_dict
zh_dict = pyer.get_dictionary(True)
True for numerals - shuo1
False for diacritics - shuō
Personally I prefer numerals as it makes it harder to read, but depending on your application this may not be what you want.
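To make the difference between the two formats concrete, here is a rough sketch that converts a single numbered syllable into its diacritic form. It is not part of pinyiniser; `numeral_to_diacritic` is a hypothetical helper, and it assumes lowercase vowels, a trailing tone digit, and that tone 5 denotes the neutral tone.

```python
# Tone-marked forms for each vowel, indexed by tone number minus one.
TONE_MARKS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'ü': 'ǖǘǚǜ',
}


def numeral_to_diacritic(syllable):
    """Convert one numbered syllable, e.g. 'shuo1', to 'shuō'."""
    if not syllable or syllable[-1] not in '12345':
        return syllable  # no tone digit: leave unchanged
    body, tone = syllable[:-1], int(syllable[-1])
    if tone == 5:
        return body  # assumption: 5 means neutral tone, no mark
    # Usual placement: 'a' or 'e' first, then the 'o' of 'ou',
    # otherwise the last vowel in the syllable.
    if 'a' in body:
        idx = body.index('a')
    elif 'e' in body:
        idx = body.index('e')
    elif 'ou' in body:
        idx = body.index('o')
    else:
        positions = [i for i, ch in enumerate(body) if ch in TONE_MARKS]
        if not positions:
            return body
        idx = positions[-1]
    marked = TONE_MARKS[body[idx]][tone - 1]
    return body[:idx] + marked + body[idx + 1:]


print(numeral_to_diacritic('shuo1'))  # shuō
```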
zh_dict details
zh_dict
is a dictionary of dictionaries, where the first key is the
character and the second key is 'pinyin', e.g. zh_dict[zh_char]['pinyin']
Any dictionary that has this set of key-value pairs will work, giving you flexibility in
what you use, so you can have a dict with English too
zh_dict[zh_char]['english']
for further processing.
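A minimal sketch of a dictionary in this shape is shown below. The entries and the `lookup` helper are hypothetical and only illustrate the nested-key structure; the real dictionary returned by get_dictionary may hold different fields.

```python
# Hypothetical three-entry dictionary in the documented shape:
# first key is the character, second key is the field name.
zh_dict = {
    '你': {'pinyin': 'ni3', 'english': 'you'},
    '好': {'pinyin': 'hao3', 'english': 'good'},
    '说': {'pinyin': 'shuo1', 'english': 'to speak'},
}


def lookup(zh_dict, zh_char, field='pinyin'):
    """Return zh_dict[zh_char][field], or None if either key is missing."""
    entry = zh_dict.get(zh_char)
    return None if entry is None else entry.get(field)


print(lookup(zh_dict, '说'))             # shuo1
print(lookup(zh_dict, '说', 'english'))  # to speak
```

Only the 'pinyin' key is required by the description above; extra keys such as 'english' are simply carried along for your own use.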
pyer.get_pinyin(zh_string, zh_dict, do_not_parse=do_not_parse_set)
Gets pinyin as a list.
zh_string
is any UTF-8 string of Chinese characters.