Library for splitting Hanyu Pinyin phrases into all valid syllable combinations
py-pinyin-split
A Python library for splitting Hanyu Pinyin words into syllables. Built on NLTK's tokenizer interface, it handles standard syllables defined in the Pinyin Table and supports tone marks.
Originally based on pinyinsplit by @tomlee.
PyPI: https://pypi.org/project/py-pinyin-split/
Installation
pip install py-pinyin-split
Usage
Instantiate a tokenizer and split away.
The tokenizer handles standard Hanyu Pinyin, including whitespace and punctuation. Invalid pinyin syllables raise a ValueError.
The tokenizer uses some basic heuristics to determine the most likely split: number of syllables, presence of vowels, and syllable frequency data.
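To make the idea of choosing among competing splits concrete, here is a minimal sketch of one such heuristic, preferring the split with the fewest syllables. It is not the library's implementation: `TOY_SYLLABLES`, `all_splits`, and `fewest_syllables` are hypothetical names, the syllable set is a tiny toneless stand-in for the full Pinyin Table, and the vowel and frequency heuristics are left out.

```python
# Toy, toneless syllable inventory for illustration only (an assumption);
# the real Pinyin Table is far larger and the library also handles tone marks.
TOY_SYLLABLES = frozenset({"xi", "an", "xian", "ke", "ken", "neng", "eng"})


def all_splits(text, syllables=TOY_SYLLABLES):
    """Return every way to cut `text` into pieces that are all in `syllables`."""
    if not text:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in syllables:
            # Extend every valid split of the remainder with this piece.
            for rest in all_splits(text[end:], syllables):
                results.append([piece] + rest)
    return results


def fewest_syllables(text, syllables=TOY_SYLLABLES):
    """Pick a candidate split with the fewest syllables; reject unsplittable text."""
    candidates = all_splits(text, syllables)
    if not candidates:
        raise ValueError(f"no valid split for {text!r}")
    return min(candidates, key=len)


print(all_splits("xian"))        # [['xi', 'an'], ['xian']]
print(fewest_syllables("xian"))  # ['xian']
print(all_splits("keneng"))      # [['ke', 'neng'], ['ken', 'eng']]
```

Under this toy inventory, "keneng" has two candidate splits of equal length, so syllable count alone cannot separate them. That kind of tie is where additional signals such as frequency data become necessary.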
from py_pinyin_split import PinyinTokenizer
tokenizer = PinyinTokenizer()
# Basic splitting
tokenizer.tokenize("nǐhǎo") # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng") # ['Běi', 'jīng']
# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?") # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!") # ['Wǒ', 'hěn', 'hǎo', '!']
# Handles ambiguous splits using heuristics
tokenizer.tokenize("kěnéng") # ['kě', 'néng']
tokenizer.tokenize("rènào") # ['rè', 'nào']
tokenizer.tokenize("xīan") # ['xī', 'an']
tokenizer.tokenize("xián") # ['xián']
tokenizer.tokenize("wǎn'ān") # ["wǎn", "'", "ān"]
# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīān") # ['xī', 'ān']
tokenizer.tokenize("xián") # ['xián']
tokenizer.tokenize("Xī'ān") # ["Xī", "'", "ān"]
# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello") # ValueError
# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang") # ['duang']
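Because invalid input raises a ValueError, batch processing usually needs a guard around each call. The sketch below shows one way to do that. `tokenize_or_none` and `partition_valid` are hypothetical helper names, not part of the library, and the only assumption about the tokenizer is that it exposes a `tokenize(text)` method that raises ValueError on invalid pinyin, as described above.

```python
def tokenize_or_none(tokenizer, text):
    """Return the token list, or None if the tokenizer rejects the text."""
    try:
        return tokenizer.tokenize(text)
    except ValueError:
        return None


def partition_valid(tokenizer, phrases):
    """Split `phrases` into a {phrase: tokens} dict and a list of rejected phrases."""
    valid = {}
    rejected = []
    for phrase in phrases:
        tokens = tokenize_or_none(tokenizer, phrase)
        if tokens is None:
            rejected.append(phrase)
        else:
            valid[phrase] = tokens
    return valid, rejected


# Intended usage with the library (requires py-pinyin-split to be installed):
#   from py_pinyin_split import PinyinTokenizer
#   valid, rejected = partition_valid(PinyinTokenizer(), ["nǐhǎo", "hello"])
```

Passing the tokenizer in as an argument keeps the helpers independent of how it was configured, for example with `include_nonstandard=True`.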
Related Projects
File details
Details for the file py_pinyin_split-5.0.0.tar.gz.
File metadata
- Download URL: py_pinyin_split-5.0.0.tar.gz
- Upload date:
- Size: 34.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97 |
| MD5 | 584718c4b8198ebd509a2fabccefd553 |
| BLAKE2b-256 | 319e4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd |
Provenance
The following attestation bundles were made for py_pinyin_split-5.0.0.tar.gz:
Publisher: publish.yml on lstrobel/py-pinyin-split
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_pinyin_split-5.0.0.tar.gz
- Subject digest: 19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97
- Sigstore transparency entry: 154146588
- Sigstore integration time:
- Permalink: lstrobel/py-pinyin-split@e6f5c311266e3e314519132ec5191ddd3bf75624
- Branch / Tag: refs/tags/5.0.0
- Owner: https://github.com/lstrobel
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e6f5c311266e3e314519132ec5191ddd3bf75624
- Trigger Event: push
File details
Details for the file py_pinyin_split-5.0.0-py3-none-any.whl.
File metadata
- Download URL: py_pinyin_split-5.0.0-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 05b1f74ad50a27f43977be1aab0570028146213d3ec86b2b40403d1f8f040fb9 |
| MD5 | d707d599cbf54689543ac6ab08be182b |
| BLAKE2b-256 | c7b64e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9 |
Provenance
The following attestation bundles were made for py_pinyin_split-5.0.0-py3-none-any.whl:
Publisher: publish.yml on lstrobel/py-pinyin-split
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: py_pinyin_split-5.0.0-py3-none-any.whl
- Subject digest: 05b1f74ad50a27f43977be1aab0570028146213d3ec86b2b40403d1f8f040fb9
- Sigstore transparency entry: 154146589
- Sigstore integration time:
- Permalink: lstrobel/py-pinyin-split@e6f5c311266e3e314519132ec5191ddd3bf75624
- Branch / Tag: refs/tags/5.0.0
- Owner: https://github.com/lstrobel
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e6f5c311266e3e314519132ec5191ddd3bf75624
- Trigger Event: push