Benchmark Datasets for Computational Historical Linguistics

pylexibench

Installing

Install pylexibench via pip, preferably in a new virtual environment:

pip install pylexibench

This also installs the CLI command lexibench.

Usage

pylexibench provides a set of CLI commands to curate a data repository suitable as a benchmark for computational methods for cognate detection and phylogenetic reconstruction. At the core of such a repository is a list of suitable lexical datasets from the Lexibank collection. Each command computes artefacts derived from these datasets which are suitable as input for various computational methods.

The output of each command is put in a sub-directory of the repository named after the command, so lexibench download populates the download directory, and so on. In addition, summary statistics about the computed artefacts are written to a TSV file stats.tsv in that sub-directory, as well as to a table in a Markdown-formatted file README.md there. The README.md also records the options passed when running the command.
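The per-command summary statistics can be inspected with a few lines of Python. The following is a minimal sketch, not part of pylexibench: the helper name read_stats is hypothetical, and it assumes that stats.tsv is tab-separated with a header row, whose column names are not documented here.

```python
import csv
from pathlib import Path


def read_stats(repos, command):
    """Return the rows of <repos>/<command>/stats.tsv as a list of dicts.

    Assumption: the file is tab-separated and its first row holds the
    column names. The columns themselves depend on the command.
    """
    stats_path = Path(repos) / command / "stats.tsv"
    with stats_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

Calling read_stats("[my_lexibench_repos]", "download") would then return one dictionary per row of download/stats.tsv, keyed by whatever column names the file contains.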

The commands are implemented as sub-commands of the main lexibench command, which is installed when installing the pylexibench package.

Before running any pylexibench command, create a directory ([my_lexibench_repos]) where you want to store the lexibench data. Set up a dataset list lexibank.tsv naming the datasets you aim to examine (for a list of all datasets contained in lexibench, see here). Place it in [my_lexibench_repos]/etc/.
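The directory layout described above can be prepared from Python as well. This is a small sketch under the assumption that only the etc/ sub-directory needs to exist beforehand; the helper name prepare_repos is hypothetical, and the dataset list itself still has to be supplied by hand, since its column layout is not described here.

```python
from pathlib import Path


def prepare_repos(repos):
    """Create <repos>/etc/ and return the path where lexibank.tsv is expected."""
    etc = Path(repos) / "etc"
    # parents=True also creates the repository directory itself if needed.
    etc.mkdir(parents=True, exist_ok=True)
    return etc / "lexibank.tsv"


def has_dataset_list(repos):
    """Check whether the dataset list has been placed in <repos>/etc/."""
    return (Path(repos) / "etc" / "lexibank.tsv").is_file()
```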

Note that download needs to be executed first, followed by lingpy_wordlists. The remaining commands can be executed afterwards.
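The required ordering can be made explicit by building the command lines in sequence. The sketch below only uses sub-commands and options that appear in the help output on this page; the helper names build_pipeline and run_pipeline are hypothetical, and running the commands assumes that lexibench is installed and that a Glottolog clone is available.

```python
import shutil
import subprocess


def build_pipeline(repos, glottolog, formats=("bin.nex", "multi.nex"), method="lexstat"):
    """Return the lexibench invocations in an order that respects the note above."""
    base = ["lexibench", "--repos", str(repos)]
    return [
        base + ["download"],          # must run first
        base + ["lingpy_wordlists"],  # must run second
        # The remaining commands depend on the wordlists created above.
        base + ["glottolog_trees", "--glottolog", str(glottolog)],
        base + ["character_matrices", "--formats", *formats, "--glottolog", str(glottolog)],
        base + ["lingpy_cognates", method],
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    if shutil.which("lexibench") is None:
        raise RuntimeError("lexibench is not on PATH; install pylexibench first")
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to print or review the planned invocations before anything is downloaded.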

$ lexibench -h
usage: lexibench [-h] [--log-level LOG_LEVEL] [--repos REPOS]
                 COMMAND ...

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)
  --repos REPOS         Directory where dataset list can be found and results stored. (default: None)

available commands:
  Run "COMAMND -h" to get help for a specific command.

  COMMAND
    character_matrices  Write character matrix_formats for cognate data encoded in the lingpy_wordlists.
    download            Download lexibank datasets as specified in the repository's dataset list and write a corresponding
    glottolog_trees     Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
    lingpy_cognates     Compute cognate sets.
    lingpy_wordlists    Extract LingPy lingpy_wordlists from lexibank datasets.

`lexibench download`

The lexibench download command downloads the CLDF datasets listed in etc/lexibank.tsv from Zenodo into download/.

$ lexibench download -h
usage: lexibench download [-h] [-f] [-u]

Download lexibank datasets as specified in the repository's dataset list and write a corresponding
BibTeX file for reference.

options:
  -h, --help     show this help message and exit
  -f, --force    Force download of a dataset even if it already exists. (default: False)
  -u, --upgrade  Download newest release of a dataset. (default: False)

Example usage with the upgrade option:

lexibench --repos [my_lexibench_repos] download --upgrade

`lexibench lingpy_wordlists`

The lexibench lingpy_wordlists command extracts single-family wordlists from the datasets and writes these in LingPy's Wordlist format to lingpy_wordlists/.

$ lexibench lingpy_wordlists -h
usage: lexibench lingpy_wordlists [-h] [--language-threshold LANGUAGE_THRESHOLD] [--concept-threshold CONCEPT_THRESHOLD] [--coverage-threshold COVERAGE_THRESHOLD]

Extract LingPy lingpy_wordlists from lexibank datasets.

options:
  -h, --help            show this help message and exit
  --language-threshold LANGUAGE_THRESHOLD
                        Number of different varieties a wordlist must contain to be considered (default: 4)
  --concept-threshold CONCEPT_THRESHOLD
                        Number of different concepts a wordlist must contain to be considered (default: 85)
  --coverage-threshold COVERAGE_THRESHOLD
                        Minimum coverage (computed as `lingpy.sanity.average_coverage`) a wordlist must have to be considered (default: 0.45)

Example usage:

lexibench --repos [my_lexibench_repos] lingpy_wordlists
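The three thresholds act together as a filter on candidate wordlists. The following sketch illustrates that filter with the default values from the help text. It is not the package's code: the function name is hypothetical, the coverage value is taken as given (the help text says it is computed with lingpy.sanity.average_coverage), and treating each threshold as an inclusive minimum is an assumption.

```python
def meets_thresholds(
    n_varieties,
    n_concepts,
    coverage,
    language_threshold=4,
    concept_threshold=85,
    coverage_threshold=0.45,
):
    """Return True if a wordlist passes all three thresholds.

    Assumption: each threshold is an inclusive lower bound.
    """
    return (
        n_varieties >= language_threshold
        and n_concepts >= concept_threshold
        and coverage >= coverage_threshold
    )
```

Under these assumptions, a wordlist with 10 varieties, 100 concepts and an average coverage of 0.4 would be skipped, because it fails the coverage criterion alone.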

`lexibench glottolog_trees`

The lexibench glottolog_trees command computes topological trees based on the Glottolog classification, i.e. the doculects in a wordlist are matched to Glottolog languoids, and the associated Glottolog family tree is then pruned to contain only these doculects as leaf nodes.

$ lexibench glottolog_trees -h
usage: lexibench glottolog_trees [-h] [--wordlist WORDLIST] [--glottolog GLOTTOLOG] [--glottolog-version GLOTTOLOG_VERSION]

Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
and pruned and renamed to the varieties in the wordlist.

options:
  -h, --help            show this help message and exit
  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage:

lexibench --repos [my_lexibench_repos] glottolog_trees --glottolog [my_glottolog]
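The pruning step described for glottolog_trees can be illustrated on a toy tree. This is a self-contained sketch of the general technique, not the implementation used by pylexibench: trees are represented as (name, children) tuples, all node names are hypothetical, and collapsing internal nodes that are left with a single child is a design choice made for this example.

```python
def prune(tree, keep):
    """Prune a (name, children) tree so that only leaves named in `keep` remain.

    Returns None if nothing below `tree` is kept. Internal nodes left with
    a single child are replaced by that child.
    """
    name, children = tree
    if not children:
        return tree if name in keep else None
    kept = [p for p in (prune(child, keep) for child in children) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return (name, kept)


def leaves(tree):
    """Return the leaf names of a (name, children) tree, left to right."""
    name, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]


family = (
    "family",
    [
        ("branch1", [("lang_a", []), ("lang_b", [])]),
        ("branch2", [("lang_c", []), ("lang_d", [])]),
    ],
)
pruned = prune(family, {"lang_a", "lang_b", "lang_c"})
print(leaves(pruned))  # ['lang_a', 'lang_b', 'lang_c']
```

The sketch only considers leaf nodes; a doculect that matches an internal node of the classification would need additional handling.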

`lexibench character_matrices`

The lexibench character_matrices command creates character matrices in the specified formats for the wordlists in lingpy_wordlists/ and saves them to character_matrices/. Additionally, it creates character matrices that contain only those languages for which a Glottocode is available. These are saved to character_matrices_compatible/. Trees inferred from these character matrices can be compared to the Glottolog tree.

$ lexibench character_matrices -h
usage: lexibench character_matrices [-h] [--missing-is-zero] [--polymorphism-is-zero] --formats FORMATS [FORMATS ...] [--wordlist WORDLIST]

Write character matrices in specified formats for cognate data encoded in the lingpy_wordlists.

options:
  -h, --help            show this help message and exit
  --missing-is-zero     Code a missing counterpart for a concept in a doculect as 0 rather than as missing data (default: False)
  --polymorphism-is-zero
                        Code the case of multiple counterparts (in different cognate sets) for a concept in a doculect as 0 (default: False)
  --formats FORMATS [FORMATS ...]
                        Character matrix formats which are to be constructed (default: ['bin.catg', 'multi.catg', 'bin.phy', 'multi.phy', 'bin.nex', 'multi.nex'])

  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage (creating character matrices in bin.nex and multi.nex format):

lexibench --repos [my_lexibench_repos] character_matrices --formats bin.nex multi.nex --glottolog [my_glottolog]
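To illustrate what a binary character matrix encodes and what --missing-is-zero changes, here is a minimal sketch of presence/absence coding of cognate sets. It is an illustration under stated assumptions, not pylexibench's implementation: the input structure, the function name, the toy data and the "?" symbol for missing data are all hypothetical, and --polymorphism-is-zero is not modelled.

```python
def binary_matrix(cognates, missing_is_zero=False, missing="?"):
    """Code cognate data as one presence/absence character per cognate set.

    cognates: {doculect: {concept: set of cognate-set ids}}
    Returns (columns, rows), where columns is a sorted list of
    (concept, cognate-set id) pairs and rows maps doculects to strings.
    """
    columns = sorted(
        {
            (concept, cogid)
            for concepts in cognates.values()
            for concept, cogids in concepts.items()
            for cogid in cogids
        }
    )
    rows = {}
    for doculect, concepts in cognates.items():
        cells = []
        for concept, cogid in columns:
            attested = concepts.get(concept)
            if not attested:
                # No counterpart for this concept in this doculect.
                cells.append("0" if missing_is_zero else missing)
            else:
                cells.append("1" if cogid in attested else "0")
        rows[doculect] = "".join(cells)
    return columns, rows


data = {
    "A": {"hand": {1}, "foot": {3}},
    "B": {"hand": {2}},
    "C": {"hand": {1, 2}, "foot": {3}},
}
columns, rows = binary_matrix(data)
print(rows)  # {'A': '110', 'B': '?01', 'C': '111'}
```

With missing_is_zero=True, the row for B becomes '001', i.e. the missing counterpart for "foot" is coded as absence rather than as missing data.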

`lexibench lingpy_cognates`

$ lexibench lingpy_cognates -h
usage: lexibench lingpy_cognates [-h] [--cognate-threshold COGNATE_THRESHOLD] [--sca-threshold SCA_THRESHOLD] [--lexstat-threshold LEXSTAT_THRESHOLD] {lexstat,sca}

Compute cognate sets.

positional arguments:
  {lexstat,sca}

options:
  -h, --help            show this help message and exit
  --cognate-threshold COGNATE_THRESHOLD
  --sca-threshold SCA_THRESHOLD
  --lexstat-threshold LEXSTAT_THRESHOLD

Example usage for cognate clustering with lexstat:

lexibench --repos [my_lexibench_repos] lingpy_cognates lexstat

This version

1.0.0

Apr 22, 2025

File details

Details for the file pylexibench-1.0.0.tar.gz.

File metadata

Download URL: pylexibench-1.0.0.tar.gz
Upload date: Apr 22, 2025
Size: 32.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pylexibench-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8b8e3a41c0d1b6421d9ed76c2c478e5263a773819d98b606ea7b50918a979dab`
MD5	`8087d1c1dda73fc0b0c14deddfeaa8ab`
BLAKE2b-256	`cb3d946e5d20a1217a0c1ef82cea93f519c3fe9ebbdb13b7a1daa9a7e9565636`

File details

Details for the file pylexibench-1.0.0-py3-none-any.whl.

File metadata

Download URL: pylexibench-1.0.0-py3-none-any.whl
Upload date: Apr 22, 2025
Size: 34.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pylexibench-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa62bfa3f7e55301b234c58ceda77304bcec8de129cb5086cf92e7bd048c1eef`
MD5	`af0793568fb2f107f8089e8f617a1150`
BLAKE2b-256	`f0e0d2bcef9f2752872721fee8e5494c5129ca7c21d689b84ced3b93cd9572e0`
