Cantonese Linguistics and NLP in Python
Project description
Full Documentation: https://pycantonese.org
PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features:
Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging
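As a rough illustration of how several of these features fit together, here is a minimal sketch. It assumes the top-level functions `segment`, `pos_tag`, and `characters_to_jyutping` documented for the 3.x releases are still available under the same names in the installed version; check the full documentation before relying on them. The `analyze` helper is a hypothetical name used only for this example.

```python
def analyze(text: str) -> dict:
    """Segment, tag, and romanize a Cantonese string with PyCantonese.

    Assumption: the function names below follow the 3.x documentation
    and may differ in other releases.
    """
    # Imported lazily so the sketch can be loaded without the library installed.
    import pycantonese

    # Word segmentation: split the unsegmented string into words.
    words = pycantonese.segment(text)
    # Part-of-speech tagging: operates on the already segmented words.
    tagged = pycantonese.pos_tag(words)
    # Jyutping conversion: romanize the original characters.
    jyutping = pycantonese.characters_to_jyutping(text)

    return {"words": words, "pos_tags": tagged, "jyutping": jyutping}
```

Calling `analyze("廣東話好難學")` with PyCantonese installed should return the three analyses in one dictionary; the exact segmentation and tags depend on the models shipped with the installed release.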
The design of PyCantonese prioritizes ease of use and linguistic knowledge. It has been successfully used by both academic and commercial organizations, including major US tech companies.
Since v4.0.0 (March 2026), PyCantonese depends on Rustling, a library for efficient CHAT data handling, word segmentation, and part-of-speech tagging.
Download and Install
Using pip:
pip install --upgrade pycantonese
Using conda:
conda install -c conda-forge pycantonese
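To confirm that the installation succeeded, one option is to query the installed distribution's version from Python. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of PyCantonese.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string such as "4.2.0" if PyCantonese is installed, else None.
print(installed_version("pycantonese"))
```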
PyCantonese also works in JavaScript.
Ready for more? Check out Quickstart.
Links
Author: Jackson L. Lee
Source code: https://github.com/jacksonllee/pycantonese
Social media: Facebook
How to Cite
Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.
@inproceedings{lee-etal-2022-pycantonese,
title = "PyCantonese: Cantonese Linguistics and NLP in Python",
author = "Lee, Jackson L. and
Chen, Litong and
Lam, Charles and
Lau, Chaak Ming and
Tsui, Tsz-Him",
booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
month = jun,
year = "2022",
publisher = "European Language Resources Association",
}
License
MIT License.
Please note that PyCantonese includes data from the following sources, each released under the open license shown in parentheses:
Hong Kong Cantonese Corpus (CC BY)
CantoMap (GPL-3.0)
rime-cantonese (CC BY 4.0)
Common Voice Cantonese (Mozilla Public License 2.0)
Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)
For details about these datasets, please see their documentation.
Logo
The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).
File details
Details for the file pycantonese-4.2.0.tar.gz.
File metadata
- Download URL: pycantonese-4.2.0.tar.gz
- Upload date:
- Size: 39.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5a9ec7a8a10b08ef0ff12772b51c6f7a22727503bfbeb9edfea4eb90fc77d68c |
| MD5 | 80d7871080e8757ddfebc4f4e3ed1622 |
| BLAKE2b-256 | 0a0ade68c2fe8c952e9f57ab0237e1dcc834f9bd95602121330900cdeb1d6f10 |
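A downloaded file can be checked against its published SHA256 digest before installation. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever location you downloaded the file to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("pycantonese-4.2.0.tar.gz", "5a9ec7a8a10b08ef0ff12772b51c6f7a22727503bfbeb9edfea4eb90fc77d68c")` should return `True` for an unmodified copy of the source distribution.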
Provenance
The following attestation bundles were made for pycantonese-4.2.0.tar.gz:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0.tar.gz
- Subject digest: 5a9ec7a8a10b08ef0ff12772b51c6f7a22727503bfbeb9edfea4eb90fc77d68c
- Sigstore transparency entry: 1188960195
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release
File details
Details for the file pycantonese-4.2.0-cp310-abi3-win_amd64.whl.
File metadata
- Download URL: pycantonese-4.2.0-cp310-abi3-win_amd64.whl
- Upload date:
- Size: 42.1 MB
- Tags: CPython 3.10+, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
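The tags shown for a built distribution are encoded in its file name, which follows the wheel naming convention `{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl`. The sketch below splits a file name into those fields with the standard library only; `WheelName` and `parse_wheel_name` are hypothetical names for this illustration, and the parser does no validation beyond counting the fields.

```python
from typing import NamedTuple, Optional, Tuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tags: Tuple[str, ...]
    abi_tags: Tuple[str, ...]
    platform_tags: Tuple[str, ...]


def parse_wheel_name(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, ver, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        dist, ver, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    # Each tag field may hold several dot-separated alternatives.
    return WheelName(
        dist, ver, build,
        tuple(py.split(".")), tuple(abi.split(".")), tuple(plat.split(".")),
    )


info = parse_wheel_name("pycantonese-4.2.0-cp310-abi3-win_amd64.whl")
print(info.python_tags, info.abi_tags, info.platform_tags)
```

Here `cp310` together with `abi3` indicates a wheel built against CPython's stable ABI starting at 3.10, which is why the page lists the tag as "CPython 3.10+".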
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a75f2f3ea7c2ec7ef1ab0929d4ee2a436cdb9e4e7de2acedaa012d2b024e725 |
| MD5 | 88678c63815085745a36e0cf13d95332 |
| BLAKE2b-256 | 3f73e3c061d1ec2439cbc844b9718c6aa2a7866a91e6eeaeb13f3c9ae3af64db |
Provenance
The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-win_amd64.whl:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0-cp310-abi3-win_amd64.whl
- Subject digest: 0a75f2f3ea7c2ec7ef1ab0929d4ee2a436cdb9e4e7de2acedaa012d2b024e725
- Sigstore transparency entry: 1188960206
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release
File details
Details for the file pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 42.7 MB
- Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c203c263dbf312626100a371f9acfb62ef937d31f3541e3798ca0f64bdd2a062 |
| MD5 | 858879630ad6f9cea928dbd4ba18e1c1 |
| BLAKE2b-256 | 4dd828cc1517cd2e76129ccae79b7b11c2af323d7a64ad3ee42d8837a76fd784 |
Provenance
The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: c203c263dbf312626100a371f9acfb62ef937d31f3541e3798ca0f64bdd2a062
- Sigstore transparency entry: 1188960204
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release
File details
Details for the file pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 42.6 MB
- Tags: CPython 3.10+, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd556cfd75dc73d5a6f2258bbb5c51f043a155d2a52bffaacaf3702f6923831d |
| MD5 | 7f26398db8a8a78cce554a85a96214db |
| BLAKE2b-256 | 10d88f7c9b1c0574c0bf5e3a0c9321920a51745e388b52563a6be2eb41cdc2f6 |
Provenance
The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Subject digest: bd556cfd75dc73d5a6f2258bbb5c51f043a155d2a52bffaacaf3702f6923831d
- Sigstore transparency entry: 1188960209
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release
File details
Details for the file pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 42.3 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5170664f5159228eda3f8246e63fe02ea69fa2caae252e590c9fa955f67ee9b6 |
| MD5 | 11e80f405ad79852296835b8bed2735a |
| BLAKE2b-256 | b627be172065ab4381d0e4afa400f8bb3221e74eee3c873aa3ccdfe887139a40 |
Provenance
The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0-cp310-abi3-macosx_11_0_arm64.whl
- Subject digest: 5170664f5159228eda3f8246e63fe02ea69fa2caae252e590c9fa955f67ee9b6
- Sigstore transparency entry: 1188960215
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release
File details
Details for the file pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 42.5 MB
- Tags: CPython 3.10+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1380081aedbf0a776be4ad5d6a74fcbf0427e0c5fad5b7676dc93979f623035e |
| MD5 | 3e8ec49456cae89c77e2bcd191d60d94 |
| BLAKE2b-256 | 0d8178b726265e29588008e70ac54235275e90d9ff7f77a3a809e38a4b03b5b4 |
Provenance
The following attestation bundles were made for pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl:
Publisher: release.yml on jacksonllee/pycantonese
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pycantonese-4.2.0-cp310-abi3-macosx_10_12_x86_64.whl
- Subject digest: 1380081aedbf0a776be4ad5d6a74fcbf0427e0c5fad5b7676dc93979f623035e
- Sigstore transparency entry: 1188960201
- Sigstore integration time:
- Permalink: jacksonllee/pycantonese@2dff909520db251966e0ec033334bb2ad6a02672
- Branch / Tag: refs/tags/v4.2.0
- Owner: https://github.com/jacksonllee
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@2dff909520db251966e0ec033334bb2ad6a02672
- Trigger Event: release