Tools and algorithms for phonology-aware Early Chinese NLP.

dphon

installation

this software is tested on the latest versions of macOS, windows, and ubuntu. you will need a supported version of python (>=3.6; see development setup), along with pip.

$ pip install dphon

if you're on windows and are seeing incorrectly formatted output in your terminal, have a look at this stackoverflow answer.

usage

the main function of dphon is to look for instances of text reuse in a corpus of old chinese texts. instead of relying purely on graphemes, it does this by performing grapheme-to-phoneme conversion, and determining possible reuse based on whether passages are likely to have sounded similar (or rhymed) when spoken aloud.

you will need to have files stored locally in utf-8 encoded plain-text (.txt) or json-lines (.jsonl) format. for the former, one file is assumed to represent one document. for the latter, one file can contain any number of lines, each of which is a document with the required keys id (a unique identifier) and text (text content), plus any number of optional keys. you can obtain a representative corpus of old chinese sourced from the kanseki repository via direct-phonology/ect-krp.
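as an illustration of the json-lines layout described above, here is a minimal sketch that writes and re-reads a small corpus file. the helper names (write_jsonl, read_jsonl), the file name corpus.jsonl, and the optional "title" key are assumptions made for this example; they are not part of dphon.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, docs):
    """Write one JSON object per line; each document needs "id" and "text"."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            if "id" not in doc or "text" not in doc:
                raise ValueError("each document needs 'id' and 'text' keys")
            # ensure_ascii=False keeps the characters readable in the file
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read every non-empty line back as one document (a dict)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


docs = [
    # "title" is a hypothetical optional key, used only for illustration
    {"id": "text_a", "text": "趙怱及齊將顏聚代之", "title": "example"},
    {"id": "text_b", "text": "趙蔥及齊將顏聚代李"},
]

corpus_path = Path(tempfile.mkdtemp()) / "corpus.jsonl"
write_jsonl(corpus_path, docs)
loaded = read_jsonl(corpus_path)
```

a file written this way holds two documents, one per line, each carrying the two required keys.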

a simple invocation of dphon might look like:

$ dphon text_a.txt text_b.txt

which would look for phonetically similar passages between text_a and text_b. the output will be a list of sequences, with an identifier based on the file's name and an indicator of where in the text the sequence occurs:

score 9, weighted 1.0
趙怱及齊將顏聚代之 (text_a 107505–107512)
趙蔥及齊將顏聚代李 (text_b 95016–95024)

the numbers next to the identifiers are token indices, and may vary depending on how the text is tokenized – dphon currently uses character-based tokenization. whitespace will be removed, and the output will be aligned to make it easier to spot differences between the two sequences.

the score is an indicator of how many characters in the sequences were a phonetic match, while the weighted score normalizes the score by the length of the match. results are sorted by score, so the longest contiguous matches are listed first.
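to make the relationship between the two numbers concrete, here is a rough sketch that assumes the score counts aligned positions with identical sounds and that the weighted score is that count divided by the match length. this reading is consistent with the sample output (score 9 over 9 characters gives 1.0), but dphon's actual scoring may differ; score_match is a hypothetical helper.

```python
def score_match(sounds_a, sounds_b):
    """Return (score, weighted) for two aligned, equal-length sound sequences.

    Assumption: one point per position whose sounds are identical, and
    weighted = score / length. This is a sketch, not dphon's own scoring.
    """
    if len(sounds_a) != len(sounds_b):
        raise ValueError("sequences must be aligned to the same length")
    if not sounds_a:
        return 0, 0.0
    score = sum(1 for a, b in zip(sounds_a, sounds_b) if a == b)
    return score, score / len(sounds_a)


full = score_match(["qˤə", "ʔˤəj", "qˤats"], ["qˤə", "ʔˤəj", "qˤats"])
partial = score_match(["qˤə", "ʔˤəj", "qˤats"], ["qˤə", "ʔˤəj", "qˤə"])
```

under these assumptions, a longer match can only outrank a shorter one on raw score, while the weighted score lets matches of different lengths be compared on the same 0 to 1 scale.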

by default, dphon only returns matches that display at least one instance of graphic variation – a case where two different graphemes are used in the same place to represent the same sound. if you're interested in all instances of reuse, regardless of graphic variation, you can use the --all flag:

$ dphon text_a.txt text_b.txt --all
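the idea of graphic variation can be sketched as a check over two aligned sequences: is there any position where the graphemes differ but the sounds agree? the function name and the toy sound table below are assumptions for illustration only, and the sound labels are placeholders rather than real reconstructions.

```python
def has_graphic_variation(seq_a, seq_b, sound_of):
    """True if some aligned position pairs different graphemes with the same sound."""
    return any(
        a != b
        and sound_of.get(a) is not None
        and sound_of.get(a) == sound_of.get(b)
        for a, b in zip(seq_a, seq_b)
    )


# placeholder labels, NOT real old chinese readings
TOY_SOUNDS = {"趙": "sound-1", "怱": "sound-2", "蔥": "sound-2"}

varied = has_graphic_variation("趙怱", "趙蔥", TOY_SOUNDS)
identical = has_graphic_variation("趙怱", "趙怱", TOY_SOUNDS)
```

in this sketch the first pair counts as graphic variation and the second does not, which mirrors the difference between the default output and the output with --all.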

you can view the full list of command options with:

$ dphon --help

this tool is under active development, and results may vary. to find the version you are running:

$ dphon --version

methodology

matching sequences are determined by a "dictionary" file that represents a particular reconstruction of old chinese phonology. the dictionary is used to perform grapheme-to-phoneme conversion, yielding the associated sound for each character:

"埃": "qˤə"
"哀": "ʔˤəj"
"藹": "qˤats"
...

for characters with multiple readings, dphon currently chooses the first available reading for comparison. more work is planned for version 3.0 to address this shortcoming.
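a minimal sketch of this lookup follows, assuming the dictionary is loaded as a mapping from each character to a list of readings. the list representation, the to_phonemes helper, and the "□" entry with placeholder readings are assumptions for this example; only the first three entries come from the excerpt above.

```python
DICTIONARY = {
    "埃": ["qˤə"],
    "哀": ["ʔˤəj"],
    "藹": ["qˤats"],
    # hypothetical entry with placeholder readings, to show the
    # "first available reading" rule; not a real dictionary entry
    "□": ["reading-1", "reading-2"],
}


def to_phonemes(text, dictionary):
    """Map each character to its first listed reading, or None if unknown."""
    result = []
    for char in text:
        readings = dictionary.get(char)
        result.append(readings[0] if readings else None)
    return result


sounds = to_phonemes("埃哀藹", DICTIONARY)
first_only = to_phonemes("□", DICTIONARY)
```

returning None for characters missing from the dictionary is a design choice of this sketch; how dphon itself treats out-of-vocabulary characters is not stated here.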

in version 1.0, dphon's default reconstruction was based on Schuessler 2007 [1], but used a single "dummy" character to represent all the lexemes in a particular sound class. the dictionary was compiled by John O'Leary (@valgrinderror) and Gian Duri Rominger (@GDRom). since version 2.0, dphon uses a dictionary based on the Baxter-Sagart 2014 reconstruction [2], with additional work by Gian Duri Rominger.

the matching algorithm is based on Paul Vierthaler's chinesetextreuse project, with some modifications. it uses a BLAST-like strategy to identify initial match candidates, and then extends them via phonetic edit distance comparison. finally, the results are aligned using a version of the Smith-Waterman algorithm that operates on phonemes.
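for readers unfamiliar with the final step, here is a textbook Smith-Waterman local alignment over sequences of sound tokens. it is a sketch, not dphon's implementation: the scoring values (match=2, mismatch=-1, gap=-1) and the exact-equality comparison between sounds are assumptions, and a real phonetic aligner would likely use a graded similarity between sounds instead.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Gaps in the aligned output are represented as None.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # fill the scoring matrix; cells never drop below zero (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + sub,  # align two tokens
                H[i - 1][j] + gap,      # gap in seq_b
                H[i][j - 1] + gap,      # gap in seq_a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # trace back from the best cell until the score reaches zero
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# "x", "y", "z" are placeholder tokens standing in for unrelated sounds
a = ["x", "qˤə", "ʔˤəj", "qˤats", "y"]
b = ["z", "qˤə", "ʔˤəj", "qˤats"]
best_score, aligned_a, aligned_b = smith_waterman(a, b)
```

because the matrix is clamped at zero, the alignment covers only the shared run of sounds and ignores the unrelated tokens on either side.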

development setup

python >=3.6 is required.

first, clone the repository:

$ git clone https://github.com/direct-phonology/dphon.git
$ cd dphon

then, to create and activate a virtual environment (recommended):

$ python -m venv venv
$ source venv/bin/activate

install dependencies:

$ pip install -r dev-requirements.txt

finally, install the package itself in development mode:

$ pip install -e .

now your changes will be automatically picked up when you run dphon.

pull requests should be made against the develop branch.

code documentation

code documentation is available on github pages and is automatically generated with pdoc3 on pushes to main.

to build documentation locally:

$ pdoc --html --output-dir docs dphon

tests

unit tests are written with unittest. you can run them with:

$ python -m unittest

releases

make sure the version number in dphon/__init__.py is correct!

if there are any built files in dist/ from older releases, remove them before you start this process:

$ rm dist/*

to build a source archive and distribution for a release:

$ python setup.py sdist bdist_wheel

to publish the release on test PyPI (useful for making sure everything worked):

$ twine upload --repository-url https://test.pypi.org/legacy/ dist/*

if everything is OK, publish the package to PyPI:

$ twine upload dist/*

[1] Schuessler, Axel (2007), ABC Etymological Dictionary of Old Chinese, Honolulu: University of Hawaii Press, ISBN 978-0-8248-2975-9.

[2] Baxter, William H.; Sagart, Laurent (2014), Old Chinese: A New Reconstruction, Oxford University Press, ISBN 978-0-19-994537-5.
