pycantonese

PyCantonese: Cantonese Linguistics and NLP in Python

Full Documentation: https://pycantonese.org

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). The goal is to provide general-purpose tools to work with Cantonese language data:

Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Stop words
Word segmentation
Part-of-speech tagging

Quick Examples

With PyCantonese imported:

>>> import pycantonese as pc

Word segmentation

>>> pc.segment("廣東話好難學？")  # Is Cantonese difficult to learn?
['廣東話', '好', '難', '學', '？']

Conversion from Cantonese characters to Jyutping

>>> pc.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
[('香港人', 'hoeng1gong2jan4'), ('講', 'gong2'), ('廣東話', 'gwong2dung1waa2')]
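
The result is a list of (word, Jyutping) pairs, so ordinary list handling applies. The sketch below joins the romanizations into one space-separated string. It does not call PyCantonese: the output shown above is copied into a literal, and the names `pairs` and `romanized` are illustrative only.

```python
# Output of pc.characters_to_jyutping('香港人講廣東話'), copied from above
# so that this sketch runs without PyCantonese installed.
pairs = [
    ("香港人", "hoeng1gong2jan4"),
    ("講", "gong2"),
    ("廣東話", "gwong2dung1waa2"),
]

# Keep only the Jyutping half of each pair and join with spaces.
romanized = " ".join(jyutping for _, jyutping in pairs)
print(romanized)  # hoeng1gong2jan4 gong2 gwong2dung1waa2
```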

Finding all verbs in the HKCanCor corpus

In this example, we use the regular expression '^V' to find all words whose part-of-speech tag begins with “V” in the original HKCanCor annotations:

>>> corpus = pc.hkcancor()  # get HKCanCor
>>> all_verbs = corpus.search(pos='^V')
>>> len(all_verbs)  # number of all verbs
29012
>>> from pprint import pprint
>>> pprint(all_verbs[:10])  # print 10 results
[('去', 'V', 'heoi3', ''),
 ('去', 'V', 'heoi3', ''),
 ('旅行', 'VN', 'leoi5hang4', ''),
 ('有冇', 'V1', 'jau5mou5', ''),
 ('要', 'VU', 'jiu3', ''),
 ('有得', 'VU', 'jau5dak1', ''),
 ('冇得', 'VU', 'mou5dak1', ''),
 ('去', 'V', 'heoi3', ''),
 ('係', 'V', 'hai6', ''),
 ('係', 'V', 'hai6', '')]
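
Because each result is a plain tuple, the matches can be summarized with the standard library. A minimal sketch, assuming only the ten results printed above (copied into a literal rather than fetched from the corpus), that tallies how often each part-of-speech tag occurs. The names `verbs` and `tag_counts` are illustrative.

```python
from collections import Counter

# First ten results of corpus.search(pos='^V'), copied from above.
# The tuples are read by position: word, part-of-speech tag, Jyutping,
# and a fourth field that is empty in these results.
verbs = [
    ("去", "V", "heoi3", ""),
    ("去", "V", "heoi3", ""),
    ("旅行", "VN", "leoi5hang4", ""),
    ("有冇", "V1", "jau5mou5", ""),
    ("要", "VU", "jiu3", ""),
    ("有得", "VU", "jau5dak1", ""),
    ("冇得", "VU", "mou5dak1", ""),
    ("去", "V", "heoi3", ""),
    ("係", "V", "hai6", ""),
    ("係", "V", "hai6", ""),
]

# Count the tags in the second position of each tuple.
tag_counts = Counter(pos for _, pos, _, _ in verbs)
print(tag_counts["V"], tag_counts["VU"], tag_counts["VN"], tag_counts["V1"])  # 5 3 1 1
```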

Parsing Jyutping for (onset, nucleus, coda, tone)

>>> pc.parse_jyutping('gwong2dung1waa2')  # 廣東話
[('gw', 'o', 'ng', '2'), ('d', 'u', 'ng', '1'), ('w', 'aa', '', '2')]
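
Each parsed syllable is an (onset, nucleus, coda, tone) tuple, so the parts can be recombined or picked apart with ordinary unpacking. The sketch below works on the output shown above, copied into a literal, and rebuilds the syllables and extracts the tone sequence. The names `parsed`, `syllables`, and `tones` are illustrative.

```python
# Output of pc.parse_jyutping('gwong2dung1waa2'), copied from above.
parsed = [("gw", "o", "ng", "2"), ("d", "u", "ng", "1"), ("w", "aa", "", "2")]

# Concatenating the four parts gives back each Jyutping syllable;
# an empty coda ('') simply contributes nothing.
syllables = [onset + nucleus + coda + tone for onset, nucleus, coda, tone in parsed]

# The tone is always the last element of the tuple.
tones = [tone for *_, tone in parsed]

print(syllables)  # ['gwong2', 'dung1', 'waa2']
print(tones)      # ['2', '1', '2']
```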

Download and Install

PyCantonese requires Python 3.6 or above. To download and install the most recent stable version:

$ pip install --upgrade pycantonese
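
If it is unclear whether the interpreter meets the Python 3.6 minimum, a quick check along these lines may help. This is a small sketch using only the standard library; `MINIMUM_PYTHON` and `meets_minimum` are illustrative names, not part of PyCantonese.

```python
import sys

# Minimum interpreter version stated above.
MINIMUM_PYTHON = (3, 6)

# sys.version_info compares element by element against a plain tuple.
meets_minimum = sys.version_info >= MINIMUM_PYTHON

# str.format is used instead of an f-string so that the check itself
# still runs on interpreters older than 3.6.
current = ".".join(str(part) for part in sys.version_info[:3])
print("Python {}: {}".format(current, "OK" if meets_minimum else "too old for PyCantonese"))
```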

For bug fixes and new features not yet available in a released version (they are documented under the “Unreleased” section of the changelog), you can install the development version of PyCantonese, which may be unstable, directly from the source code hosted on GitHub:

If you haven’t done so already, install Git LFS on your system. You only have to do this step once per system. Git LFS is needed to fetch the model files properly, as they are stored differently because of their size and/or binary nature.

Download and install PyCantonese from the GitHub source:

$ pip install git+https://github.com/jacksonllee/pycantonese.git@master#egg=pycantonese

To test your installation in the Python interpreter:

>>> import pycantonese as pc
>>> pc.__version__  # show version number

How to Cite

PyCantonese is authored and maintained by Jackson L. Lee.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2020.09.09) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Acknowledgments

Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names if known):

@cathug
Litong Chen
@g-traveller
Rachel Han
Ryan Lai
Charles Lam
Hill Ma
@richielo
@rylanchiu
Stephan Stiller
Tsz-Him Tsui

Logo design by albino.snowman (Instagram handle).

Changelog

Please see CHANGELOG.md.

Setting up a Development Environment

The latest code under development is available on GitHub at jacksonllee/pycantonese. You need to have Git LFS installed on your system. To obtain this version for experimental features or for development:

$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ pip install -r requirements.txt
$ pip install -e .

To run tests and styling checks:

$ py.test -vv --cov pycantonese pycantonese
$ flake8 pycantonese
$ black --check --line-length=79 pycantonese

To build the documentation website files:

$ python build_docs.py

Release history

3.4.0 (Dec 28, 2021)
3.3.1 (May 15, 2021)
3.3.0 (May 14, 2021)
3.2.4 (May 8, 2021)
3.2.3 (Apr 12, 2021)
3.2.2 (Mar 23, 2021)
3.2.1 (Mar 21, 2021)
3.2.0 (Mar 20, 2021)
3.1.1 (Mar 18, 2021)
3.1.0 (Feb 21, 2021)
3.1.0.dev3, pre-release (Dec 6, 2020)
3.1.0.dev2, pre-release (Nov 9, 2020), this version
3.0.0 (Oct 26, 2020)
2.4.1 (Oct 11, 2020)
2.4.0 (Oct 11, 2020)
2.3.0 (Jul 24, 2020)
2.2.0 (Jul 1, 2018)
2.1.0 (Jun 11, 2018)
2.0.0 (Feb 7, 2016)
1.0 (Sep 6, 2015)
1.0dev, pre-release (Sep 6, 2015)
0.2.1 (Jan 25, 2015)
0.2 (Jan 22, 2015)
0.1 (Dec 17, 2014)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycantonese-3.1.0.dev2.tar.gz (3.7 MB), uploaded Nov 9, 2020

Built Distribution

pycantonese-3.1.0.dev2-py3-none-any.whl (3.7 MB), uploaded Nov 9, 2020, Python 3

Hashes for pycantonese-3.1.0.dev2.tar.gz
Algorithm	Hash digest
SHA256	`646b6af1a7a405ba8776180ae3681ad8674ca49d3ea204e02f147e184320aa25`
MD5	`b8edc001d4424a4ddb8925f93b23cd98`
BLAKE2b-256	`c35def974ccaa52433f2695404594b55f327d19224cbdeff7ebf8cf20bed1639`

Hashes for pycantonese-3.1.0.dev2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e75f1e7f32e96187f8e4607ca37bb3c92d7655481b66984fd0f74b8ea083e23`
MD5	`fac4b33c5a66fbc545c309c577c2ef82`
BLAKE2b-256	`0ff5b4ba830dad60f8b883f45e6bc0ed57bec975f42f4ebf8580322de47d1aec`
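
The digests above can be used to check a downloaded file. A minimal sketch using the standard library `hashlib` module: the helper `sha256_of` and the local path in `wheel` are assumptions for illustration (the file is assumed to sit in the current directory), and the expected value is the SHA256 digest listed above for the wheel.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest published above for the 3.1.0.dev2 wheel.
EXPECTED_SHA256 = "3e75f1e7f32e96187f8e4607ca37bb3c92d7655481b66984fd0f74b8ea083e23"

# Assumed download location; adjust to wherever the file was saved.
wheel = Path("pycantonese-3.1.0.dev2-py3-none-any.whl")

if wheel.exists():
    print("digest matches" if sha256_of(wheel) == EXPECTED_SHA256 else "digest MISMATCH")
else:
    print("file not found: {}".format(wheel))
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 3.7 MB wheel but costs nothing.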