Adds pinyin to lists of Chinese strings (UTF-8 only)
Project description
Pinyiniser
A Python library that converts Chinese strings (UTF-8) to pinyin using rjieba for word segmentation and CC-CEDICT for pinyin lookup.
Installation
pip install pinyiniser
Quick Example
import pinyiniser as pyer
# Load the dictionary (True = numeral tones, False = diacritic tones)
zh_dict = pyer.get_dictionary(True)
# Get pinyin as a list
pinyin = pyer.get_pinyin('你好,世界!', zh_dict)
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']
# Get word segments and pinyin together
segments, pinyin = pyer.get_segments_and_pinyin('你好,世界!', zh_dict)
print(segments)
# ['你好', ',', '世界', '!']
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']
API
pyer.get_dictionary(numeric=True)
Loads the CC-CEDICT pinyin dictionary.
zh_dict = pyer.get_dictionary(True)
True for numerals e.g. shuo1
False for diacritics e.g. shuō
Numerals are useful for language learning because they add friction to reading the pinyin alone, while diacritics read more naturally and look more professional. Choose whichever suits your application.
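To make the difference between the two formats concrete, here is a minimal sketch of how a numeral-toned syllable maps to its diacritic form. It is an illustration only, not the library's code: the helper names `numeral_to_diacritic` and `word_to_diacritic` are hypothetical, and it assumes lowercase input with tone 5 marking the neutral tone. In practice, calling pyer.get_dictionary(False) gives you diacritics directly.

```python
import re
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING_MARKS = ['\u0304', '\u0301', '\u030c', '\u0300']
VOWELS = 'aeiou\u00fc'  # \u00fc is u with diaeresis

def numeral_to_diacritic(syllable):
    """Convert one numeral-toned syllable, e.g. 'shuo1', to its diacritic form."""
    match = re.fullmatch(r'([a-z\u00fc]+)([1-5])', syllable)
    if match is None:
        return syllable  # punctuation and untoned text pass through unchanged
    letters, tone = match.group(1), int(match.group(2))
    if tone == 5:
        return letters  # assumed neutral tone: no mark
    # Placement rule: 'a' or 'e' takes the mark, the 'o' in 'ou' takes it,
    # otherwise the last vowel does.
    if 'a' in letters:
        idx = letters.index('a')
    elif 'e' in letters:
        idx = letters.index('e')
    elif 'ou' in letters:
        idx = letters.index('ou')
    else:
        positions = [i for i, ch in enumerate(letters) if ch in VOWELS]
        if not positions:
            return letters  # no vowel to mark; the tone digit is dropped
        idx = positions[-1]
    # NFC composes the base vowel and combining mark into one code point.
    marked = unicodedata.normalize('NFC', letters[idx] + COMBINING_MARKS[tone - 1])
    return letters[:idx] + marked + letters[idx + 1:]

def word_to_diacritic(word):
    """Convert every numeral-toned syllable inside a word such as 'ni3hao3'."""
    return re.sub(r'[a-z\u00fc]+[1-5]',
                  lambda m: numeral_to_diacritic(m.group(0)), word)

print(numeral_to_diacritic('shuo1'))  # shuō
print(word_to_diacritic('ni3hao3'))   # nǐhǎo
```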
zh_dict details
zh_dict is a dictionary of dictionaries: the outer key is the character and the inner key is 'pinyin', e.g. zh_dict[zh_char]['pinyin']
Any dictionary with this structure will work, which gives you flexibility in what you store. For example, you could add English definitions:
zh_dict[zh_char]['english']
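As a rough sketch of that structure, the snippet below builds a small dictionary by hand and reads from it. The entries, the `custom_dict` name, and the `lookup` helper (including its fall-back-to-the-token behaviour) are hypothetical and only illustrate the shape described above; real data comes from pyer.get_dictionary().

```python
# Hypothetical hand-written entries that follow the documented shape:
# outer key = character, inner key = 'pinyin' (plus any extra fields).
custom_dict = {
    '说': {'pinyin': 'shuo1', 'english': 'to speak'},
    '你': {'pinyin': 'ni3', 'english': 'you'},
}

def lookup(token, zh_dict, field='pinyin'):
    """Return one field of an entry, or the token itself if it is missing."""
    entry = zh_dict.get(token)
    if entry is None or field not in entry:
        return token
    return entry[field]

print(lookup('说', custom_dict))             # shuo1
print(lookup('你', custom_dict, 'english'))  # you
print(lookup('?', custom_dict))              # ?
```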
pyer.get_pinyin(zh_string, zh_dict, punctuation=special_tokens)
Returns pinyin as a flat list. Punctuation is preserved in place.
pinyin = pyer.get_pinyin('你好,世界!', zh_dict)
# ['ni3hao3', ',', 'shi4jie4', '!']
pyer.get_segments_and_pinyin(zh_string, zh_dict, punctuation=special_tokens)
Returns a tuple of (segments, pinyin), both list[str]:
- segments: the word-level tokens as segmented by rjieba, with punctuation preserved as individual elements.
- pinyin: a list of pinyin strings, one per segment.
segments, pinyin = pyer.get_segments_and_pinyin('你好,世界!', zh_dict)
print(segments)
# ['你好', ',', '世界', '!']
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']
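Because the two lists line up one-to-one, they can be zipped together. The sketch below uses a hypothetical `annotate` helper (not part of pinyiniser) to render each word with its pinyin in brackets. It assumes that pass-through tokens such as punctuation appear identically in both lists, as in the documented output.

```python
def annotate(segments, pinyin):
    """Join segments with their pinyin in brackets; pass-through tokens stay bare."""
    parts = []
    for segment, reading in zip(segments, pinyin):
        if segment == reading:
            # Punctuation is returned unchanged, so there is nothing to annotate.
            parts.append(segment)
        else:
            parts.append(f'{segment}({reading})')
    return ''.join(parts)

# Lists copied from the documented output of get_segments_and_pinyin.
segments = ['你好', ',', '世界', '!']
pinyin = ['ni3hao3', ',', 'shi4jie4', '!']
print(annotate(segments, pinyin))
# 你好(ni3hao3),世界(shi4jie4)!
```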
punctuation / special_tokens
Both get_pinyin and get_segments_and_pinyin accept a punctuation parameter. This is a set of characters that are split on and passed through as-is rather than being looked up in the dictionary.
The default set (pyer.special_tokens) includes Chinese punctuation, standard ASCII punctuation, maths operators, and currency symbols.
To extend the default set, union your own set with the built-in one:
my_punctuation = pyer.special_tokens | {'†', '‡'}
pinyin = pyer.get_pinyin('你好', zh_dict, punctuation=my_punctuation)
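To illustrate what "split on and passed through as-is" can look like, here is a minimal regex-based sketch. It is not the library's implementation: `split_on_punctuation` is a hypothetical helper, and the small sets passed to it stand in for pyer.special_tokens.

```python
import re

def split_on_punctuation(text, punctuation):
    """Split text on any character in punctuation, keeping those characters."""
    if not punctuation:
        return [text] if text else []
    # Escape each character so regex metacharacters such as '.' or ']' are literal.
    char_class = '[' + ''.join(re.escape(ch) for ch in sorted(punctuation)) + ']'
    # The capturing group makes re.split keep the delimiters in the result.
    pattern = re.compile('(' + char_class + ')')
    return [piece for piece in pattern.split(text) if piece]

print(split_on_punctuation('你好,世界!', {',', '!'}))
# ['你好', ',', '世界', '!']
print(split_on_punctuation('你好†世界', {',', '!'} | {'†', '‡'}))
# ['你好', '†', '世界']
```

Each non-punctuation piece would then go on to segmentation and dictionary lookup, while the punctuation pieces are emitted unchanged.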
Performance
Benchmarked against v1.0.3 (PyPI) on the Shadowrun Returns Chinese corpus (~100k words):
Default punctuation (precompiled regex):
| Metric | v2.0.0 | v1.0.3 | Speedup |
|---|---|---|---|
| Avg over 10 runs | 0.071s | 0.305s | 4.3x faster |
| Per entry | 0.011ms | 0.046ms | 4.2x faster |
Custom punctuation (regex built per call):
| Metric | v2.0.0 | v1.0.3 | Speedup |
|---|---|---|---|
| Avg over 10 runs | 0.128s | 0.305s | 2.4x faster |
| Per entry | 0.019ms | 0.046ms | 2.4x faster |
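The gap between the two tables comes down to where the regex is built. The sketch below shows the general pattern under stated assumptions: `DEFAULT_TOKENS`, `build_pattern`, and `count_punctuation` are hypothetical names, not pinyiniser internals, and a custom set is assumed to be non-empty.

```python
import re

def build_pattern(tokens):
    """Compile a regex that matches any single character in tokens."""
    return re.compile('[' + ''.join(re.escape(ch) for ch in sorted(tokens)) + ']')

# Hypothetical stand-in for the library's default punctuation set.
DEFAULT_TOKENS = frozenset(',。!?,.!?')
DEFAULT_PATTERN = build_pattern(DEFAULT_TOKENS)  # built once at import time

def count_punctuation(text, punctuation=None):
    """Count punctuation characters, reusing the precompiled default if possible."""
    # The default path reuses DEFAULT_PATTERN; a custom set pays the
    # pattern-building cost again on every call.
    pattern = DEFAULT_PATTERN if punctuation is None else build_pattern(punctuation)
    return len(pattern.findall(text))

print(count_punctuation('你好,世界!'))         # 2
print(count_punctuation('你好,世界!', {','}))  # 1
```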
Dependencies
- rjieba (>= 0.2.0) — Chinese text segmentation (Rust implementation of Jieba)
Attribution
This project uses dictionary data derived from CC-CEDICT, licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
Download files
File details
Details for the file pinyiniser-2.0.1.tar.gz.
File metadata
- Download URL: pinyiniser-2.0.1.tar.gz
- Upload date:
- Size: 11.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cb9bf3a794938907874e5288a1970750f5211c3fc1ff01b066180e2d9e7ef87 |
| MD5 | d400b89177ef8022f72bd6e4f21849e8 |
| BLAKE2b-256 | 71357d1cf23d7f7b3729da3b3c0a73cc71ad1201f8e40ca395ade8489dd9009f |
File details
Details for the file pinyiniser-2.0.1-py3-none-any.whl.
File metadata
- Download URL: pinyiniser-2.0.1-py3-none-any.whl
- Upload date:
- Size: 11.0 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fec2a6f3449033d1695461300cdfa148d549b2b14c0617b17f1573d0252d3902 |
| MD5 | 929582ed8be87d19a8bca9d4c76b3fda |
| BLAKE2b-256 | 6996159945fb6a84557661f0bb693f11d4121f75dcbd6d238306732231d5b06b |