
Adds pinyin to lists of Chinese strings (UTF-8 only)


Pinyiniser

License: MIT | Python 3.9+

A Python library that converts Chinese strings (UTF-8) to pinyin using rjieba for word segmentation and CC-CEDICT for pinyin lookup.

Installation

pip install pinyiniser

Quick Example

import pinyiniser as pyer

# Load the dictionary (True = numeral tones, False = diacritic tones)
zh_dict = pyer.get_dictionary(True)

# Get pinyin as a list
pinyin = pyer.get_pinyin('你好,世界!', zh_dict)
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']

# Get word segments and pinyin together
segments, pinyin = pyer.get_segments_and_pinyin('你好,世界!', zh_dict)
print(segments)
# ['你好', ',', '世界', '!']
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']

API

pyer.get_dictionary(numeric=True)

Loads the CC-CEDICT pinyin dictionary.

zh_dict = pyer.get_dictionary(True)

True for numerals, e.g. shuo1
False for diacritics, e.g. shuō

Numeric tones suit language-learning tools because they make it harder to read the pinyin alone, whereas diacritics read more naturally and look more polished. Choose whichever suits your application.
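To make the difference between the two formats concrete, here is a minimal sketch that converts a single numbered syllable to its diacritic form using the standard tone-mark placement rule. It is an illustration only, not pinyiniser's implementation; the names TONE_MARKS and numeric_to_diacritic are hypothetical, and the sketch handles one lowercase syllable at a time.

```python
# Illustrative only: this is NOT how pinyiniser builds its dictionary.
# Each vowel maps to its four tone-marked forms (tones 1 to 4).
TONE_MARKS = {
    "a": ["ā", "á", "ǎ", "à"],
    "e": ["ē", "é", "ě", "è"],
    "i": ["ī", "í", "ǐ", "ì"],
    "o": ["ō", "ó", "ǒ", "ò"],
    "u": ["ū", "ú", "ǔ", "ù"],
    "ü": ["ǖ", "ǘ", "ǚ", "ǜ"],
}


def numeric_to_diacritic(syllable: str) -> str:
    """Convert one numbered syllable, e.g. 'shuo1', to its diacritic form."""
    if not syllable or syllable[-1] not in "12345":
        return syllable  # no tone digit: return unchanged
    base, tone = syllable[:-1], int(syllable[-1])
    if tone == 5:
        return base  # neutral tone carries no mark
    # Placement rule: 'a' or 'e' takes the mark; in 'ou' the 'o' does;
    # otherwise the last vowel does.
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        vowels = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowels:
            return base  # nothing to mark
        idx = vowels[-1]
    return base[:idx] + TONE_MARKS[base[idx]][tone - 1] + base[idx + 1:]


for s in ["shuo1", "ni3", "hao3", "ma5"]:
    print(s, "->", numeric_to_diacritic(s))
```

In practice you would simply pass True or False to get_dictionary; the sketch only shows what each format encodes.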

zh_dict details

zh_dict is a dictionary of dictionaries: the outer key is the character and the inner key is 'pinyin', e.g. zh_dict[zh_char]['pinyin']

Any dictionary with this structure will work, so you can store extra fields alongside the pinyin. For example, you could add English definitions under zh_dict[zh_char]['english']
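As a rough illustration of that structure, the sketch below builds a tiny dictionary by hand and reads fields from it. The two entries, their English glosses, and the lookup helper are hypothetical and exist only for this example; the shipped CC-CEDICT data is not reproduced here.

```python
# A hand-built dictionary with the same shape as zh_dict.
# Entries and the 'english' field are illustrative, not the shipped data.
custom_dict = {
    "你": {"pinyin": "ni3", "english": "you"},
    "好": {"pinyin": "hao3", "english": "good"},
}


def lookup(zh_char, zh_dict, field="pinyin"):
    """Return one field for a character, or None if it is missing."""
    return zh_dict.get(zh_char, {}).get(field)


print(lookup("你", custom_dict))             # pinyin field
print(lookup("好", custom_dict, "english"))  # extra field
print(lookup("世", custom_dict))             # character not present
```

Because only the 'pinyin' key is described as required, extra keys such as 'english' should be free to carry whatever your application needs.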

pyer.get_pinyin(zh_string, zh_dict, punctuation=special_tokens)

Returns pinyin as a flat list. Punctuation is preserved in place.

pinyin = pyer.get_pinyin('你好,世界!', zh_dict)
# ['ni3hao3', ',', 'shi4jie4', '!']

pyer.get_segments_and_pinyin(zh_string, zh_dict, punctuation=special_tokens)

Returns a tuple of (segments, pinyin), both list[str]:

  • segments — the word-level tokens as segmented by rjieba, with punctuation preserved as individual elements.
  • pinyin — a list of pinyin strings, one per segment.

segments, pinyin = pyer.get_segments_and_pinyin('你好,世界!', zh_dict)
print(segments)
# ['你好', ',', '世界', '!']
print(pinyin)
# ['ni3hao3', ',', 'shi4jie4', '!']
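Since the two lists line up one-to-one, they can be zipped to annotate each word with its reading. The sketch below uses the example output above as literal data so that it runs without the library; the annotate helper is hypothetical, and it assumes punctuation appears unchanged in both lists.

```python
# Literal copies of the example output above, so no library call is needed.
segments = ['你好', ',', '世界', '!']
pinyin = ['ni3hao3', ',', 'shi4jie4', '!']


def annotate(segments, pinyin):
    """Render 'word(pinyin)' for words and leave pass-through tokens bare."""
    parts = []
    for seg, py in zip(segments, pinyin):
        # Tokens that pass through unchanged have identical entries.
        parts.append(seg if seg == py else f"{seg}({py})")
    return " ".join(parts)


print(annotate(segments, pinyin))
# 你好(ni3hao3) , 世界(shi4jie4) !
```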

punctuation / special_tokens

Both get_pinyin and get_segments_and_pinyin accept a punctuation parameter. This is a set of characters that are split on and passed through as-is rather than being looked up in the dictionary.

The default set (pyer.special_tokens) includes Chinese punctuation, standard ASCII punctuation, maths operators, and currency symbols.

To extend the default set, union your own set with the built-in one:

my_punctuation = pyer.special_tokens | {'†', '‡'}
pinyin = pyer.get_pinyin('你好', zh_dict, punctuation=my_punctuation)
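To show what "split on and passed through as-is" can look like, here is a minimal sketch that splits a string on a set of characters while keeping each one as its own token. It is an assumption about the general technique, not pinyiniser's actual code; build_splitter, split_on_punctuation, and the small default_tokens set (a stand-in for pyer.special_tokens) are hypothetical.

```python
import re


def build_splitter(punctuation):
    """Compile a regex matching any single character in a non-empty set."""
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    char_class = "".join(re.escape(ch) for ch in sorted(punctuation))
    return re.compile(f"([{char_class}])")


def split_on_punctuation(text, splitter):
    # The capture group makes split() keep the delimiters; drop empty pieces.
    return [piece for piece in splitter.split(text) if piece]


default_tokens = {",", "!", "。"}          # stand-in for pyer.special_tokens
extended = default_tokens | {"†", "‡"}     # same union pattern as above
splitter = build_splitter(extended)

print(split_on_punctuation("你好,世界!†", splitter))
# ['你好', ',', '世界', '!', '†']
```

Each non-punctuation piece would then go on to segmentation and dictionary lookup, while the punctuation tokens are emitted unchanged.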

Performance

v2.0.0 benchmarked against v1.0.3 (PyPI) on the Shadowrun Returns Chinese corpus (~100k words):

Default punctuation (precompiled regex):

Metric             v2.0.0    v1.0.3    Speedup
Avg over 10 runs   0.071s    0.305s    4.3x faster
Per entry          0.011ms   0.046ms   4.2x faster

Custom punctuation (regex built per call):

Metric             v2.0.0    v1.0.3    Speedup
Avg over 10 runs   0.128s    0.305s    2.4x faster
Per entry          0.019ms   0.046ms   2.4x faster
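The gap between the two tables comes from where the regex is built. The sketch below contrasts a pattern compiled once with one rebuilt on every call, using a small stand-in token set. It is a rough illustration under assumed names (TOKENS, build_pattern, and the two counting functions are hypothetical), not the benchmark that produced the figures above, and the timings it prints will vary by machine.

```python
import re
import timeit

TOKENS = frozenset(",!。?")  # stand-in for a punctuation set


def build_pattern(tokens):
    """Build a character-class regex from a set of single characters."""
    return re.compile("[" + "".join(re.escape(t) for t in sorted(tokens)) + "]")


PRECOMPILED = build_pattern(TOKENS)  # built once, reused on every call


def count_precompiled(text):
    return len(PRECOMPILED.findall(text))


def count_per_call(text, tokens):
    # The pattern string is reassembled and looked up again on each call.
    return len(build_pattern(tokens).findall(text))


text = "你好,世界!" * 10
print("precompiled:", timeit.timeit(lambda: count_precompiled(text), number=2_000))
print("per call:   ", timeit.timeit(lambda: count_per_call(text, TOKENS), number=2_000))
```

Both functions return the same count; only the setup work per call differs.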

Dependencies

  • rjieba (>= 0.2.0) — Chinese text segmentation (Rust implementation of Jieba)

Attribution

This project uses dictionary data derived from CC-CEDICT, licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
