Cantonese Linguistics and NLP in Python

Project description

Full Documentation: https://pycantonese.org


PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features:

  • Accessing and searching corpus data

  • Parsing and conversion tools for Jyutping romanization

  • Parsing Cantonese text

  • Stop words

  • Word segmentation

  • Part-of-speech tagging
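As a rough illustration of the features listed above, here is a minimal sketch. It assumes that pycantonese is installed and that the top-level functions `segment`, `pos_tag`, and `characters_to_jyutping` are available as documented for earlier releases; the helper name `analyze` is hypothetical and not part of the library. Please check the full documentation for the current API.

```python
# Sketch only: assumes pycantonese is installed and exposes the top-level
# functions used below. `analyze` is a hypothetical helper for illustration.
try:
    import pycantonese
except ImportError:  # keep the sketch importable without the package
    pycantonese = None


def analyze(text):
    """Return words, POS tags, and Jyutping for a Cantonese string.

    Returns None when pycantonese is not installed.
    """
    if pycantonese is None:
        return None
    words = pycantonese.segment(text)  # word segmentation
    return {
        "words": words,
        "pos": pycantonese.pos_tag(words),  # part-of-speech tagging
        "jyutping": pycantonese.characters_to_jyutping(text),  # romanization
    }


if __name__ == "__main__":
    print(analyze("香港人講廣東話"))
```

No expected output is shown here because the exact segmentation and tags depend on the installed version and its bundled models.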

The design of PyCantonese prioritizes ease of use and linguistic knowledge. It has been successfully used by both academic and commercial organizations, including major US tech companies.

Since v4.0.0 (March 2026), PyCantonese depends on Rustling, a library for efficient CHAT data handling, word segmentation, and part-of-speech tagging.

Download and Install

Using pip:

pip install --upgrade pycantonese

Using conda:

conda install -c conda-forge pycantonese
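To confirm that the installation worked, one option is to ask Python for the installed version. The sketch below uses only the standard library (`importlib.metadata`); the function name `installed_version` is a hypothetical helper, not something PyCantonese provides.

```python
from importlib import metadata


def installed_version(package="pycantonese"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version or "pycantonese is not installed in this environment")
```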

PyCantonese also works in JavaScript.

Ready for more? Check out Quickstart.

How to Cite

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.

@inproceedings{lee-etal-2022-pycantonese,
   title = "PyCantonese: Cantonese Linguistics and NLP in Python",
   author = "Lee, Jackson L.  and
      Chen, Litong  and
      Lam, Charles  and
      Lau, Chaak Ming  and
      Tsui, Tsz-Him",
   booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
   month = jun,
   year = "2022",
   publisher = "European Language Resources Association",
}

License

MIT License.

Please note that PyCantonese includes data from the following sources, each of which carries its own license:

  • Hong Kong Cantonese Corpus (CC BY)

  • CantoMap (GPL-3.0)

  • rime-cantonese (CC BY 4.0)

  • Common Voice Cantonese (Mozilla Public License 2.0)

  • Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)

For details about these datasets, please see their documentation.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycantonese-4.2.0.tar.gz (39.6 MB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pycantonese-4.2.0-cp310-abi3-win_amd64.whl (42.1 MB)

Uploaded CPython 3.10+ Windows x86-64

pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.7 MB)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ x86-64

pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (42.6 MB)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ ARM64

pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl (42.3 MB)

Uploaded CPython 3.10+ macOS 11.0+ ARM64

pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl (42.5 MB)

Uploaded CPython 3.10+ macOS 10.12+ x86-64
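The wheel file names above follow the standard pattern `{name}-{version}-{python tag}-{abi tag}-{platform tag}.whl`. The sketch below splits a file name into those fields so you can see which wheel matches your platform. It is illustrative only: `parse_wheel_filename` and `WheelTags` are hypothetical names, and the function assumes the file name has no optional build tag (true of the files listed here).

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel file name (without a build tag) into its five fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    return WheelTags(*parts)


if __name__ == "__main__":
    tags = parse_wheel_filename(
        "pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
    )
    print(tags.python_tag, tags.abi_tag)
    # A platform tag may bundle several compatible tags separated by dots.
    print(tags.platform_tag.split("."))
```

The `cp310` and `abi3` tags together indicate a wheel built against CPython's stable ABI, which is why a single file per platform covers CPython 3.10 and later.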

File details

Details for the file pycantonese-4.2.0.tar.gz.

File metadata

  • Download URL: pycantonese-4.2.0.tar.gz
  • Upload date:
  • Size: 39.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycantonese-4.2.0.tar.gz
Algorithm Hash digest
SHA256 5a9ec7a8a10b08ef0ff12772b51c6f7a22727503bfbeb9edfea4eb90fc77d68c
MD5 80d7871080e8757ddfebc4f4e3ed1622
BLAKE2b-256 0a0ade68c2fe8c952e9f57ab0237e1dcc834f9bd95602121330900cdeb1d6f10

See more details on using hashes here.
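If you download a file manually, you can compare its SHA256 digest with the value listed for that file. The sketch below uses only the standard library; `sha256_of` and `matches` are hypothetical helper names, and the path you pass in is wherever you saved the download.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches("pycantonese-4.2.0.tar.gz", "<SHA256 value listed for that file>")` should return True for an intact download of the source distribution.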

Provenance

The following attestation bundles were made for pycantonese-4.2.0.tar.gz:

Publisher: release.yml on jacksonllee/pycantonese

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pycantonese-4.2.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: pycantonese-4.2.0-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 42.1 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pycantonese-4.2.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 0a75f2f3ea7c2ec7ef1ab0929d4ee2a436cdb9e4e7de2acedaa012d2b024e725
MD5 88678c63815085745a36e0cf13d95332
BLAKE2b-256 3f73e3c061d1ec2439cbc844b9718c6aa2a7866a91e6eeaeb13f3c9ae3af64db

Provenance

The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c203c263dbf312626100a371f9acfb62ef937d31f3541e3798ca0f64bdd2a062
MD5 858879630ad6f9cea928dbd4ba18e1c1
BLAKE2b-256 4dd828cc1517cd2e76129ccae79b7b11c2af323d7a64ad3ee42d8837a76fd784

Provenance

The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 bd556cfd75dc73d5a6f2258bbb5c51f043a155d2a52bffaacaf3702f6923831d
MD5 7f26398db8a8a78cce554a85a96214db
BLAKE2b-256 10d88f7c9b1c0574c0bf5e3a0c9321920a51745e388b52563a6be2eb41cdc2f6

Provenance

The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5170664f5159228eda3f8246e63fe02ea69fa2caae252e590c9fa955f67ee9b6
MD5 11e80f405ad79852296835b8bed2735a
BLAKE2b-256 b627be172065ab4381d0e4afa400f8bb3221e74eee3c873aa3ccdfe887139a40

Provenance

The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on jacksonllee/pycantonese

File details

Details for the file pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 1380081aedbf0a776be4ad5d6a74fcbf0427e0c5fad5b7676dc93979f623035e
MD5 3e8ec49456cae89c77e2bcd191d60d94
BLAKE2b-256 0d8178b726265e29588008e70ac54235275e90d9ff7f77a3a809e38a4b03b5b4

Provenance

The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on jacksonllee/pycantonese
