Library for splitting Hanyu Pinyin phrases into all valid syllable combinations

Project description

py-pinyin-split

A Python library for splitting Hanyu Pinyin words into syllables. Built on NLTK's tokenizer interface, it handles standard syllables defined in the Pinyin Table and supports tone marks.

Based originally on pinyinsplit by @tomlee.

PyPI: https://pypi.org/project/py-pinyin-split/

Installation

pip install py-pinyin-split

Usage

Instantiate a tokenizer and split away.

The tokenizer handles standard Hanyu Pinyin, including whitespace and punctuation. Invalid pinyin syllables raise a ValueError.

To pick the most likely split among alternatives, the tokenizer applies a few basic heuristics: syllable count, presence of vowels, and syllable frequency data.

from py_pinyin_split import PinyinTokenizer

tokenizer = PinyinTokenizer()

# Basic splitting
tokenizer.tokenize("nǐhǎo")  # ['nǐ', 'hǎo']
tokenizer.tokenize("Běijīng")  # ['Běi', 'jīng']

# Handles whitespace and punctuation
tokenizer.tokenize("Nǐ hǎo ma?")  # ['Nǐ', 'hǎo', 'ma', '?']
tokenizer.tokenize("Wǒ hěn hǎo!")  # ['Wǒ', 'hěn', 'hǎo', '!']

# Handles ambiguous splits using heuristics
tokenizer.tokenize("kěnéng")  # ['kě', 'néng']
tokenizer.tokenize("rènào")  # ['rè', 'nào']

# Tone marks or punctuation help resolve ambiguity
tokenizer.tokenize("xīan")  # ['xī', 'an']
tokenizer.tokenize("xián")  # ['xián']
tokenizer.tokenize("wǎn'ān")  # ['wǎn', "'", 'ān']
tokenizer.tokenize("Xī'ān")  # ['Xī', "'", 'ān']

# Raises ValueError for invalid pinyin
tokenizer.tokenize("hello")  # ValueError

# Optional support for non-standard syllables
tokenizer = PinyinTokenizer(include_nonstandard=True)
tokenizer.tokenize("duang")  # ['duang']

Download files

Source Distribution

py_pinyin_split-5.0.0.tar.gz (34.7 kB)

Uploaded Source

Built Distribution

py_pinyin_split-5.0.0-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file py_pinyin_split-5.0.0.tar.gz.

File metadata

  • Download URL: py_pinyin_split-5.0.0.tar.gz
  • Upload date:
  • Size: 34.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for py_pinyin_split-5.0.0.tar.gz:

  • SHA256: 19ebd71af5bc136ea78d8b124d1155fca91498809be464861f7273e848719e97
  • MD5: 584718c4b8198ebd509a2fabccefd553
  • BLAKE2b-256: 319e4f6134653f4fcc04cbeec57506ced3d3eb5c63ecab5c5a2de626576e45cd

Provenance

The following attestation bundles were made for py_pinyin_split-5.0.0.tar.gz:

Publisher: publish.yml on lstrobel/py-pinyin-split

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file py_pinyin_split-5.0.0-py3-none-any.whl.

File hashes

Hashes for py_pinyin_split-5.0.0-py3-none-any.whl:

  • SHA256: 05b1f74ad50a27f43977be1aab0570028146213d3ec86b2b40403d1f8f040fb9
  • MD5: d707d599cbf54689543ac6ab08be182b
  • BLAKE2b-256: c7b64e068cff1bdf59625b7691c8c8fceb33dda0e01cd3facb6b69fe42f6e7e9

Provenance

The following attestation bundles were made for py_pinyin_split-5.0.0-py3-none-any.whl:

Publisher: publish.yml on lstrobel/py-pinyin-split

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
