Cantonese Linguistics and NLP in Python

Project description

Full Documentation: https://pycantonese.org


PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features:

  • Accessing and searching corpus data

  • Parsing and conversion tools for Jyutping romanization

  • Parsing Cantonese text

  • Stop words

  • Word segmentation

  • Part-of-speech tagging
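To make the Jyutping item above concrete, here is a minimal sketch of what "parsing" a romanized string involves: splitting it into syllables and then into onset, nucleus, coda, and tone. It uses only the standard library and is not PyCantonese's own implementation or API; the name `split_jyutping` and the simplified sound inventory in the regular expression are assumptions made for illustration.

```python
import re

# Stdlib-only illustration; this is NOT how PyCantonese itself is implemented.
# The inventory below is a simplified assumption about Jyutping syllables.
_SYLLABLE = re.compile(
    r"(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou])?"
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


def split_jyutping(text):
    """Return a list of (onset, nucleus, coda, tone) tuples, one per syllable."""
    syllables = []
    # Each syllable is a run of lowercase letters ending in a tone digit 1-6.
    for chunk in re.findall(r"[a-z]+[1-6]", text):
        match = _SYLLABLE.fullmatch(chunk)
        if match is None:
            raise ValueError(f"unrecognized Jyutping syllable: {chunk!r}")
        onset = match.group("onset") or ""
        nucleus = match.group("nucleus") or ""
        coda = match.group("coda") or ""
        if not nucleus:
            # Syllabic nasals such as "m4" or "ng5" have no vowel, so the
            # nasal itself is treated as the nucleus (optionally after "h").
            letters = onset + coda
            if letters not in ("m", "ng", "hm", "hng"):
                raise ValueError(f"unrecognized Jyutping syllable: {chunk!r}")
            onset = "h" if letters.startswith("h") else ""
            nucleus, coda = letters[len(onset):], ""
        syllables.append((onset, nucleus, coda, match.group("tone")))
    return syllables
```

For example, `split_jyutping("gwong2dung1waa2")` returns `[('gw', 'o', 'ng', '2'), ('d', 'u', 'ng', '1'), ('w', 'aa', '', '2')]`. For real work, the library's own Jyutping tools described in the full documentation are the better choice.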

The design of PyCantonese prioritizes ease of use and linguistic knowledge. It has been successfully used by both academic and commercial organizations, including major US tech companies.

Since v4.0.0 (March 2026), PyCantonese depends on Rustling, a library for efficient CHAT data handling, word segmentation, and part-of-speech tagging.

Download and Install

Using pip:

pip install --upgrade pycantonese

Using conda:

conda install -c conda-forge pycantonese
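After either install command, one way to confirm what ended up in the environment is to query the package metadata. The sketch below uses only the standard library; the helper names `installed_version` and `major_version` are illustrative and not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pycantonese"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Return the leading integer of a version string such as '4.1.0'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pycantonese is not installed in this environment")
    else:
        # Per the project description, v4.0.0 and later depend on Rustling.
        print(f"pycantonese {found}; depends on Rustling: {major_version(found) >= 4}")
```

Reading the metadata rather than importing the package keeps the check cheap and works even if the import itself would fail.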

For Pyodide, install the WASM wheels (the .whl files with emscripten in the filename) from the GitHub releases of Rustling and PyCantonese.

Ready for more? Check out the Quickstart page.

How to Cite

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.

@inproceedings{lee-etal-2022-pycantonese,
   title = "PyCantonese: Cantonese Linguistics and NLP in Python",
   author = "Lee, Jackson L.  and
      Chen, Litong  and
      Lam, Charles  and
      Lau, Chaak Ming  and
      Tsui, Tsz-Him",
   booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
   month = jun,
   year = "2022",
   publisher = "European Language Resources Association",
}

License

MIT License.

Please note that PyCantonese includes data from the following sources, all of which are permissively licensed:

  • Hong Kong Cantonese Corpus (CC BY)

  • rime-cantonese (CC BY 4.0)

  • Common Voice Cantonese (Mozilla Public License 2.0)

  • Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)

For details about these datasets, please see their documentation.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycantonese-4.1.0.tar.gz (39.0 MB)

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • pycantonese-4.1.0-cp310-abi3-win_amd64.whl (41.0 MB): CPython 3.10+, Windows x86-64
  • pycantonese-4.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.6 MB): CPython 3.10+, manylinux (glibc 2.17+) x86-64
  • pycantonese-4.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (41.5 MB): CPython 3.10+, manylinux (glibc 2.17+) ARM64
  • pycantonese-4.1.0-cp310-abi3-macosx_11_0_arm64.whl (41.2 MB): CPython 3.10+, macOS 11.0+ ARM64
  • pycantonese-4.1.0-cp310-abi3-macosx_10_12_x86_64.whl (41.4 MB): CPython 3.10+, macOS 10.12+ x86-64
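The wheel names listed here follow the standard wheel file name layout: distribution, version, an optional build tag, Python tag, ABI tag, and platform tag, separated by hyphens. The sketch below splits a name into those fields using only the standard library. The names `WheelName`, `parse_wheel_name`, and `is_emscripten` are hypothetical helpers for illustration, not part of pip or PyCantonese.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename):
    """Split a wheel file name into its hyphen-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(*parts)


def is_emscripten(filename):
    """Heuristic: True if the platform tag mentions emscripten (WASM builds)."""
    return "emscripten" in parse_wheel_name(filename).platform_tag
```

In these names, `cp310` with `abi3` indicates a build against CPython's stable ABI starting at 3.10, which is why one wheel per platform covers CPython 3.10 and later. A platform tag may hold several dot-separated aliases, as in the manylinux wheels; `platform_tag.split(".")` separates them.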

File details

Details for the file pycantonese-4.1.0.tar.gz.

File metadata

  • Download URL: pycantonese-4.1.0.tar.gz
  • Size: 39.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycantonese-4.1.0.tar.gz
Algorithm Hash digest
SHA256 f56cab5ac0b001ce95b614a9a5b6de75c35a1b625959c79195f64e257457e482
MD5 38e96b551084af8b9f0181045b3f0ad2
BLAKE2b-256 524f81d3cde8479f567234b7a8d8e81896e089232fe1b2faad465ab151b04b90

See more details on using hashes here.
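The published digests can be used to check a downloaded file before installing it. The following is a minimal standard-library sketch; the function names `sha256_of_file` and `matches_sha256` are illustrative, and the file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For the source distribution, the expected value is the SHA256 digest listed above for pycantonese-4.1.0.tar.gz. A mismatch means the file differs from what was published and should not be installed.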

Provenance

The following attestation bundles were made for pycantonese-4.1.0.tar.gz:

Publisher: release.yml on jacksonllee/pycantonese

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pycantonese-4.1.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: pycantonese-4.1.0-cp310-abi3-win_amd64.whl
  • Size: 41.0 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycantonese-4.1.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 fa81d78e9242bec194fb1bf3febe6fd631b96fb26d1cd932031beab40d925034
MD5 8fd96f6d0ea0152287f44afa8a030ebc
BLAKE2b-256 5813eafac15da6368654633644a7e70ab756226e8b531303bea4ef51423685d6

Provenance

The following attestation bundles were made for pycantonese-4.1.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pycantonese-4.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3c6a14a7407aa80fb6781eb08fcf7f1bb133227d83a31a7bc8b8be350cd8e9c8
MD5 a160cb135cff39d65ea43641a8937585
BLAKE2b-256 4b01ceba3e6dfd05db0ec4333b076dff51ef7fca7efe4e96789cfc1df5a68742

Provenance

The following attestation bundles were made for pycantonese-4.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for pycantonese-4.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2d1f8c0eb8c3f1017678529ef226ba3e8982c96701670705b800dcc60f81bf94
MD5 a6b15faad477d16c89b0e94b56b77c3b
BLAKE2b-256 6f3c40e9b20ed47cf20a458ed9f0d3cc75dddb30ab1dde7cde202d68b2ba314b

Provenance

The following attestation bundles were made for pycantonese-4.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.1.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for pycantonese-4.1.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2d061bdbee4b5748ff4caf8be1e72cd7e8bd720026c01ef1e7a609d158095d73
MD5 11c074f54b213502c9e0e3d579cd7ddf
BLAKE2b-256 8e02ef4cb6b8a26ea391a261f3406a42c568680036e557b570492bc5496d43df

Provenance

The following attestation bundles were made for pycantonese-4.1.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.1.0-cp310-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for pycantonese-4.1.0-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 af3cc2cb78134eee14fbc5d0a58404fd04f8e79a7f8ca290eb83a36812bf3e02
MD5 3c2d82f3cbc38ee328d6bf8fabf93953
BLAKE2b-256 cd5ed674817c9fd1a098683a2ca1b57255f9ca205b49c5b81619979602121f3a

Provenance

The following attestation bundles were made for pycantonese-4.1.0-cp310-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on jacksonllee/pycantonese
