Tools and algorithms for phonology-aware Early Chinese NLP.

dphon

Installation

This software is tested on the latest versions of macOS, Windows, and Ubuntu. You will need a supported version of Python (see .python-version), along with pip (or another dependency manager).

pip install dphon

If you're on Windows and are seeing incorrectly formatted output in your terminal, have a look at this Stack Overflow answer.

Usage

Basics

The main function of dphon is to look for instances of text reuse in a corpus of Old Chinese texts. Instead of relying purely on graphemes, it performs grapheme-to-phoneme conversion and identifies possible reuse based on whether passages are likely to have sounded similar (or rhymed) when spoken aloud.

Your files need to be stored locally as UTF-8 encoded plain text (.txt) or in JSON-lines (.jsonl) format. In a plain-text file, one file is assumed to represent one document. A JSON-lines file can contain any number of lines, each of which is a document with the required keys id (a unique identifier) and text (the text content), plus any number of optional keys. You can obtain a representative corpus of Old Chinese texts sourced from the Kanseki repository via direct-phonology/ect-krp.
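As a minimal sketch of the JSON-lines layout described above, the following Python writes and re-reads a small corpus file. Only the id and text keys come from the description; the title key, the helper names write_jsonl and read_jsonl, and the sample documents are placeholders chosen for illustration.

```python
import json
from pathlib import Path

# Placeholder documents: "id" and "text" are the required keys;
# "title" stands in for any optional key (hypothetical).
documents = [
    {"id": "text_a", "text": "夏后啟曰以為可為故為之為之天下弗能", "title": "example A"},
    {"id": "text_b", "text": "不可弗爲以爲可故爲之爲之繇其道物"},
]


def write_jsonl(path, docs):
    """Write one JSON object per line, UTF-8 encoded."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for doc in docs:
            # ensure_ascii=False keeps the Chinese characters readable in the file
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, checking the required keys on every line."""
    docs = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            doc = json.loads(line)
            if "id" not in doc or "text" not in doc:
                raise ValueError("each line needs 'id' and 'text' keys")
            docs.append(doc)
    return docs
```

Calling write_jsonl("corpus.jsonl", documents) would produce a two-line file in which each line is one document.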

A simple invocation of dphon might look like:

dphon text_a.txt text_b.txt

which would look for phonetically similar passages between text_a and text_b. The output will be a list of sequences and their phonemic transcriptions, with an identifier based on the file's name and an indicator of where in the text the sequence occurs:

1.  text_a (2208–2216)    夏后啟曰以為可為故為之為之天下弗能
    *ləʔ ɢʷraj kʰˤajʔ ɢʷraj kˤaʔs ɢʷraj  ɢʷraj
2.  text_b (3340–3348)    不可弗爲以爲可 故爲之爲之繇其道物
    *ləʔ ɢʷraj kʰˤajʔ kˤaʔs ɢʷraj  ɢʷraj  pit

The numbers next to the identifiers are token indices, and may vary depending on how the text is tokenized – dphon currently uses character-based tokenization. Whitespace will be removed, and the output will be aligned to make it easier to spot differences between the two sequences. By default, insertions are highlighted in green, and mismatches (differences between the two sequences) are highlighted in red. Additional (non-matching) context added to either side of match sequences is displayed using a dimmed color (see "advanced usage" below for more information on colorization).
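To make the token indices concrete, here is a small sketch of character-based tokenization with whitespace removed. The function names are hypothetical, and treating the end index as exclusive is an assumption of this sketch rather than a statement about how dphon reports ranges.

```python
def tokenize(text):
    """Character-based tokenization: one token per non-whitespace character."""
    return [char for char in text if not char.isspace()]


def span(tokens, start, end):
    """Return the tokens from index start up to, but not including, end."""
    return "".join(tokens[start:end])


# Spaces and line breaks in the input do not receive token indices.
tokens = tokenize("不可弗爲以爲可 故爲之\n爲之繇其道物")
first_three = span(tokens, 0, 3)
```

Because whitespace is dropped before indexing, the same passage gets the same indices regardless of how the source file is wrapped.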

Matches are sorted by the ratio of their phonemic similarity to their graphic similarity – in other words, matches between texts that sound highly similar but were written very differently will be at the top of the list.
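The ordering can be pictured with a short sketch. It assumes each match carries a phonemic and a graphic similarity score between 0 and 1; the Match class, the field names, and the numbers are all hypothetical, and the way dphon actually computes these scores is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Match:
    label: str
    phonemic_similarity: float  # hypothetical score in [0, 1]
    graphic_similarity: float   # hypothetical score in [0, 1]


def variation_ratio(match, epsilon=1e-9):
    """Phonemic similarity divided by graphic similarity.

    epsilon guards against division by zero when no graphemes are shared.
    """
    return match.phonemic_similarity / max(match.graphic_similarity, epsilon)


def rank(matches):
    """Sort so that 'sounds alike, written differently' comes first."""
    return sorted(matches, key=variation_ratio, reverse=True)


# Toy values for illustration only, not output from dphon.
matches = [
    Match("identical wording", 1.0, 1.0),
    Match("same sounds, different graphs", 1.0, 0.5),
    Match("loosely similar", 0.6, 0.5),
]
ranked = [m.label for m in rank(matches)]
```

With these toy values, the pair that sounds identical but shares only half of its graphemes ranks above the pair with identical wording.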

By default, dphon only returns matches that display at least one instance of graphic variation – a case where two different graphemes are used in the same place to represent the same sound. These cases are highlighted in blue. If you're interested in all instances of reuse, regardless of graphic variation, you can use the --all flag:

dphon --all text_a.txt text_b.txt

You can view the full list of command options with:

dphon --help

This tool is under active development, and results may vary. To find the version you are running:

dphon --version

Advanced usage

By default, dphon uses your system's $PAGER to display output, since the results can be quite long. On macOS and Linux, the pager will likely be less, which supports additional options like searching through the output once it's displayed. For more information, see the man page:

$ man less

dphon can colorize output for nicer display in the terminal if your pager supports it. To enable this behavior on macOS and Linux, set LESS=R:

$ export LESS=R

If you want to save the results of a run to a file, you can use redirection. This is useful when writing structured formats like .csv and .jsonl. You can also write HTML to preserve colors:

$ dphon -o html files/*.txt > results.html

Alternatively, you can pipe the output of dphon to another utility like sed to filter the results further. For example, you could strip out the ideographic space (U+3000) from the results to remove the alignments:

$ dphon files/*.txt | sed 's/　//g'
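The same filter can be sketched in Python, which may be convenient where sed is unavailable. The function names and the script name mentioned below are hypothetical; the sample line reuses text from the example output above with an ideographic space written as an escape.

```python
import sys

IDEOGRAPHIC_SPACE = "\u3000"  # full-width space that appears in aligned output


def strip_alignment(line):
    """Remove every ideographic space from one line of output."""
    return line.replace(IDEOGRAPHIC_SPACE, "")


def filter_stream(source=sys.stdin, target=sys.stdout):
    """Copy lines from source to target with the alignment spacing removed."""
    for line in source:
        target.write(strip_alignment(line))


sample = "不可弗爲以爲可\u3000故爲之爲之繇其道物"
cleaned = strip_alignment(sample)
```

Saved as, say, strip_alignment.py with a call to filter_stream() added at the end, this could sit on the right-hand side of the pipe in place of sed.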

Methodology

Matching sequences are determined by a "dictionary" file that represents a particular reconstruction of Old Chinese phonology. The dictionary is used for grapheme-to-phoneme conversion: it maps each character to its associated sound:

"埃": "qˤə"
"哀": "ʔˤəj"
"藹": "qˤats"
...

If two characters have the same phonemes, they're treated as a match. For characters with multiple readings, dphon currently chooses the first available reading for comparison. More work is planned for version 3.0 to address this shortcoming.
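A minimal sketch of this lookup-and-compare step follows. The first three entries repeat the excerpt above; the readings for 為 and 爲 are taken from how the two characters appear to be transcribed in the example output earlier in this document. Storing readings as a list, returning None for unknown characters, and all function names are assumptions made for illustration.

```python
# Character -> list of readings; the list form is an assumption of this sketch.
READINGS = {
    "埃": ["qˤə"],
    "哀": ["ʔˤəj"],
    "藹": ["qˤats"],
    "為": ["ɢʷraj"],
    "爲": ["ɢʷraj"],
}


def to_phonemes(char):
    """Return the first available reading, or None if the character is unknown."""
    readings = READINGS.get(char)
    return readings[0] if readings else None


def is_phonetic_match(a, b):
    """Two characters match when both have a reading and the readings are equal."""
    sound_a, sound_b = to_phonemes(a), to_phonemes(b)
    return sound_a is not None and sound_a == sound_b


def is_graphic_variant(a, b):
    """Different graphemes standing for the same sound."""
    return a != b and is_phonetic_match(a, b)
```

Here is_graphic_variant("為", "爲") is true because both graphemes resolve to the same reading, while 埃 and 哀 do not match because their readings differ.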

In version 1.0, dphon's default reconstruction was based on Schuessler 2007¹, but used a single "dummy" character to represent all the lexemes in a rhyming group. The dictionary was compiled by John O'Leary (@valgrinderror) and Gian Duri Rominger (@GDRom). Since version 2.0, dphon uses a dictionary based on the Baxter-Sagart 2014 reconstruction², with additional work by Rominger.

The matching algorithm is based on Paul Vierthaler's chinesetextreuse project³, with some modifications. It uses a BLAST-like strategy to identify initial match candidates, and then extends them via phonetic edit distance comparison. Finally, the results are aligned using a version of the Smith-Waterman algorithm that operates on phonemes, powered by the lingpy library⁴.
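To illustrate the final alignment step, here is a plain-Python sketch of Smith-Waterman local alignment over sequences of phoneme strings. It is not lingpy's implementation and does not use its API; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for readability. The two input sequences are the transcriptions from the example output above.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two sequences of phoneme strings.

    Returns (best_score, aligned_a, aligned_b); gaps are shown as None.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the
    # alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align the two tokens
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    aligned_a, aligned_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


text_a = ["ləʔ", "ɢʷraj", "kʰˤajʔ", "ɢʷraj", "kˤaʔs", "ɢʷraj", "ɢʷraj"]
text_b = ["ləʔ", "ɢʷraj", "kʰˤajʔ", "kˤaʔs", "ɢʷraj", "ɢʷraj", "pit"]
best_score, row_a, row_b = smith_waterman(text_a, text_b)
```

With these scoring values, the alignment keeps six matching syllables and places a single gap in the second sequence opposite the fourth syllable of the first; the trailing pit is left outside the local alignment.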

Development setup

You need uv installed to set up the development environment.

First, clone the repository:

git clone https://github.com/direct-phonology/dphon.git
cd dphon

Then, you can install dependencies using uv:

uv sync

Pull requests can be made against main.

Code documentation

Code documentation is available on GitHub Pages and is generated with pdoc3.

To build the docs:

uv run pdoc --html --output-dir docs dphon

Tests

Unit tests are written with unittest. You can run them with:

uv run -m unittest

Releases

Prior to release, you can bump the version number with:

uv version --bump minor # or major, patch, etc.

The package is built and published to PyPI automatically when using GitHub's release functionality.


¹ Schuessler, Axel (2007), ABC Etymological Dictionary of Old Chinese, Honolulu: University of Hawaii Press, ISBN 978-0-8248-2975-9.

² Baxter, William H.; Sagart, Laurent (2014), Old Chinese: A New Reconstruction, Oxford University Press, ISBN 978-0-19-994537-5.

³ Vierthaler, Paul, and Mees Gelein. “A BLAST-Based, Language-Agnostic Text Reuse Algorithm with a MARKUS Implementation and Sequence Alignment Optimized for Large Chinese Corpora,” April 26, 2019. https://doi.org/10.31235/osf.io/7xpqe.

⁴ List, Johann-Mattis; Greenhill, Simon; Tresoldi, Tiago; and Forkel, Robert (2019): LingPy. A Python library for historical linguistics. Version 2.6.5. URL: http://lingpy.org, DOI: https://zenodo.org/badge/latestdoi/5137/lingpy/lingpy. With contributions by Christoph Rzymski, Gereon Kaiping, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama, Frank Nagel. Jena: Max Planck Institute for the Science of Human History.
