PyCantonese: Cantonese Linguistics and NLP in Python
Full Documentation: https://pycantonese.org
PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):
Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Stop words
Word segmentation
Part-of-speech tagging
Quick Examples
With PyCantonese imported:
>>> import pycantonese
Word segmentation
>>> pycantonese.segment("廣東話好難學?") # Is Cantonese difficult to learn?
['廣東話', '好', '難', '學', '?']
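For readers curious how dictionary-based segmentation can work in principle, here is a minimal sketch of greedy longest-match segmentation. It is an illustration only: `segment_longest_match` and `toy_lexicon` are hypothetical names introduced here, and nothing in it is claimed to be PyCantonese's actual implementation or data.

```python
def segment_longest_match(text, lexicon, max_word_length=None):
    """Greedily split text, preferring the longest lexicon entry at each position.

    Illustrative sketch only; not PyCantonese's implementation.
    """
    if max_word_length is None:
        max_word_length = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"廣東話", "香港人", "廣東"}
print(segment_longest_match("廣東話好難學?", toy_lexicon))
```

With this toy lexicon, any character that does not start a known word becomes a one-character word, so the quality of the result depends entirely on the lexicon supplied.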
Conversion from Cantonese characters to Jyutping
>>> pycantonese.characters_to_jyutping('香港人講廣東話') # Hongkongers speak Cantonese
[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
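The result is a list of (word, Jyutping) pairs, so it can be post-processed with plain Python. The sketch below joins the romanizations into a single string. `join_jyutping` is a hypothetical helper, not part of PyCantonese, and skipping `None` values is an assumption made here to cope with items that might have no romanization.

```python
def join_jyutping(pairs, sep=" "):
    """Join the Jyutping of each (word, jyutping) pair into one string.

    Hypothetical helper; pairs whose Jyutping is None are skipped (an assumption).
    """
    return sep.join(jyutping for _, jyutping in pairs if jyutping is not None)


# The pairs shown in the example above, typed in by hand.
pairs = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]
print(join_jyutping(pairs))
```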
Finding all verbs in the HKCanCor corpus
In this example, we use the regular expression '^V' to find all words whose part-of-speech tag begins with "V" in the original HKCanCor annotations:
>>> corpus = pycantonese.hkcancor() # get HKCanCor
>>> all_verbs = corpus.search(pos='^V')
>>> len(all_verbs) # number of all verbs
29726
>>> all_verbs[:10] # print 10 results
[Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gra=None),
Token(word='有冇', pos='V1', jyutping='jau5mou5', mor=None, gra=None),
Token(word='要', pos='VU', jyutping='jiu3', mor=None, gra=None),
Token(word='有得', pos='VU', jyutping='jau5dak1', mor=None, gra=None),
Token(word='冇得', pos='VU', jyutping='mou5dak1', mor=None, gra=None),
Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None),
Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None)]
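To see why the anchor in '^V' matters, the following standard-library sketch applies the same pattern to a handful of tag strings. It assumes the pattern is applied in the style of `re.search`, which this document does not confirm for `corpus.search`. The tags "N" and "ADV" are hypothetical and included only for contrast; the other tags are copied from the ten results above.

```python
import re
from collections import Counter

pos_pattern = re.compile(r"^V")

# "N" and "ADV" are hypothetical tags added for contrast.
tags = ["V", "VN", "V1", "VU", "N", "ADV"]
verb_tags = [tag for tag in tags if pos_pattern.search(tag)]
print(verb_tags)  # "ADV" contains "V" but does not begin with it

# Tags of the ten results shown above, typed in by hand.
first_ten_tags = ["V", "V", "VN", "V1", "VU", "VU", "VU", "V", "V", "V"]
tag_counts = Counter(first_ten_tags)
print(tag_counts)
```

Without the `^` anchor, a tag such as "ADV" would also be selected, because the pattern would then match a "V" anywhere in the tag.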
Parsing Jyutping for the onset, nucleus, coda, and tone
>>> pycantonese.parse_jyutping('gwong2dung1waa2') # 廣東話
[Jyutping(onset='gw', nucleus='o', coda='ng', tone='2'),
Jyutping(onset='d', nucleus='u', coda='ng', tone='1'),
Jyutping(onset='w', nucleus='aa', coda='', tone='2')]
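The onset, nucleus, coda, and tone decomposition can be approximated with a single regular expression. The sketch below is a simplified illustration, not PyCantonese's parser: `SyllableSketch` and `parse_jyutping_sketch` are hypothetical names, the sound inventories are written out by hand, and syllabic nasals such as "m4" or "ng5" are deliberately not handled.

```python
import re
from collections import namedtuple

# Hypothetical container, named differently from the library's own class.
SyllableSketch = namedtuple("SyllableSketch", "onset nucleus coda tone")

_SYLLABLE = re.compile(
    r"(gw|kw|ng|[bpmfdtnlgkhwzcsj])?"  # optional onset
    r"(aa|oe|eo|yu|[aeiou])"           # nucleus
    r"(ng|[ptkmniu])?"                 # optional coda
    r"([1-6])"                         # tone
)


def parse_jyutping_sketch(romanization):
    """Split a Jyutping string into syllables (simplified sketch)."""
    syllables = []
    pos = 0
    while pos < len(romanization):
        match = _SYLLABLE.match(romanization, pos)
        if match is None:
            raise ValueError(f"cannot parse {romanization!r} at position {pos}")
        onset, nucleus, coda, tone = match.groups()
        # Missing onset or coda is reported as an empty string.
        syllables.append(SyllableSketch(onset or "", nucleus, coda or "", tone))
        pos = match.end()
    return syllables


for syllable in parse_jyutping_sketch("gwong2dung1waa2"):
    print(syllable)
```

Multi-letter alternatives such as "gw", "ng", and "aa" are listed before the single-letter classes so that the longer unit is tried first.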
Download and Install
To download and install the most recent stable version:
$ pip install --upgrade pycantonese
To test your installation in the Python interpreter:
>>> import pycantonese
>>> pycantonese.__version__ # show version number
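If you would rather check for the package from a script without importing it, the standard library's `importlib.metadata` (Python 3.8+) can report the installed version. This is a generic sketch, and `installed_version` is a hypothetical helper name, not something PyCantonese provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print(installed_version("pycantonese"))
```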
Links
Source code: https://github.com/jacksonllee/pycantonese
Bug tracker, feature requests: https://github.com/jacksonllee/pycantonese/issues
Email: Please contact Jackson Lee.
How to Cite
PyCantonese is authored and maintained by Jackson L. Lee.
A talk introducing PyCantonese:
Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides
License
MIT License. Please see LICENSE.txt in the GitHub source code for details.
The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.
The rime-cantonese data (release 2020.09.09) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.
Logo
The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).
Acknowledgments
Wonderful resources with a permissive license that have been incorporated into PyCantonese:
HKCanCor
rime-cantonese
Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):
@cathug
Litong Chen
Jenny Chim
@g-traveller
Rachel Han
Ryan Lai
Charles Lam
Hill Ma
@richielo
@rylanchiu
Stephan Stiller
Tsz-Him Tsui
Robin Yuen
Changelog
Please see CHANGELOG.md.
Setting up a Development Environment
The latest code under development is available on GitHub at jacksonllee/pycantonese. You need to have Git LFS installed on your system. To obtain this version for experimental features or for development:
$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ git lfs pull
$ pip install -r dev-requirements.txt
$ pip install -e .
To run tests and styling checks:
$ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs
$ flake8 pycantonese
$ black --check pycantonese
To build the documentation website files:
$ python build_docs.py
Hashes for pycantonese-3.2.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a866c9021841741265c1b1793c4ffd859936715a118f5edb3f1c399a3da9c5c7
MD5 | 700136089912abe7a8ede2176ea6c7b2
BLAKE2b-256 | 15efe4e3c9a639671bbed3448a0b5e636fbf6539fc972888d8e20644af1b0402