
Benchmark Datasets for Computational Historical Linguistics


pylexibench

Installing

Install pylexibench via pip, preferably in a new virtual environment:

pip install pylexibench

This also installs the CLI command lexibench.

Usage

pylexibench provides a set of CLI commands to curate a data repository suitable as a benchmark for computational methods for cognate detection and phylogenetic reconstruction. At the core of such a repository is a list of suitable lexical datasets from the Lexibank collection. Each command computes artefacts derived from these datasets that are suitable as input for various computational methods. The output of each command is placed in a sub-directory of the repository named after the command; for example, lexibench download populates the download directory. In addition, summary statistics about the computed artefacts are written to a TSV file stats.tsv as well as to a table in a Markdown-formatted file README.md in this directory. The README.md also records the options passed when running the command.

The commands are implemented as sub-commands of the main lexibench command, which is installed when installing the pylexibench package.

Before running any pylexibench command, create a directory [my_lexibench_repos] where you want to store the lexibench data. Set up a dataset list lexibank.tsv specifying the datasets you aim to examine (for a list of all datasets contained in lexibench see here). Place it in [my_lexibench_repos]/etc/.

Note that download needs to be executed first, followed by lingpy_wordlists. The remaining commands can be executed afterwards.

$ lexibench -h
usage: lexibench [-h] [--log-level LOG_LEVEL] [--repos REPOS]
                 COMMAND ...

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)
  --repos REPOS         Directory where dataset list can be found and results stored. (default: None)

available commands:
  Run "COMMAND -h" to get help for a specific command.

  COMMAND
    character_matrices  Write character matrices in specified formats for cognate data encoded in the lingpy_wordlists.
    download            Download lexibank datasets as specified in the repository's dataset list and write a corresponding BibTeX file for reference.
    glottolog_trees     Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
    lingpy_cognates     Compute cognate sets.
    lingpy_wordlists    Extract LingPy lingpy_wordlists from lexibank datasets.

lexibench download

The lexibench download command downloads the CLDF datasets listed in etc/lexibank.tsv from Zenodo into download/.

$ lexibench download -h
usage: lexibench download [-h] [-f] [-u]

Download lexibank datasets as specified in the repository's dataset list and write a corresponding
BibTeX file for reference.

options:
  -h, --help     show this help message and exit
  -f, --force    Force download of a dataset even if it already exists. (default: False)
  -u, --upgrade  Download newest release of a dataset. (default: False)

Example usage with the upgrade option:

lexibench --repos [my_lexibench_repos] download --upgrade

lexibench lingpy_wordlists

The lexibench lingpy_wordlists command extracts single-family wordlists from the datasets and writes these in LingPy's Wordlist format to lingpy_wordlists/.

$ lexibench lingpy_wordlists -h
usage: lexibench lingpy_wordlists [-h] [--language-threshold LANGUAGE_THRESHOLD] [--concept-threshold CONCEPT_THRESHOLD] [--coverage-threshold COVERAGE_THRESHOLD]

Extract LingPy lingpy_wordlists from lexibank datasets.

options:
  -h, --help            show this help message and exit
  --language-threshold LANGUAGE_THRESHOLD
                        Number of different varieties a wordlist must contain to be considered (default: 4)
  --concept-threshold CONCEPT_THRESHOLD
                        Number of different concepts a wordlist must contain to be considered (default: 85)
  --coverage-threshold COVERAGE_THRESHOLD
                        Minimum coverage (computed as `lingpy.sanity.average_coverage`) a wordlist must have to be considered (default: 0.45)

Example usage:

lexibench --repos [my_lexibench_repos] lingpy_wordlists
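The three thresholds can be pictured with a toy re-implementation. This is a simplified stand-in for lingpy.sanity.average_coverage — the real function's exact computation may differ — applied to a wordlist modelled as a dict mapping language to concept to forms:

```python
def average_coverage(wordlist: dict[str, dict[str, list[str]]]) -> float:
    """Mean fraction of concepts attested per language (simplified sketch)."""
    concepts = {c for forms in wordlist.values() for c in forms}
    if not concepts or not wordlist:
        return 0.0
    per_lang = [len(forms) / len(concepts) for forms in wordlist.values()]
    return sum(per_lang) / len(per_lang)

def passes_thresholds(wordlist, languages=4, concepts=85, coverage=0.45):
    """Apply the three filters with lexibench's documented defaults."""
    all_concepts = {c for forms in wordlist.values() for c in forms}
    return (len(wordlist) >= languages
            and len(all_concepts) >= concepts
            and average_coverage(wordlist) >= coverage)
```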

lexibench glottolog_trees

The lexibench glottolog_trees command computes topological trees based on the Glottolog classification, i.e. the doculects in a wordlist are matched to Glottolog languoids and the associated Glottolog family tree is then pruned to contain only these doculects as leaf nodes.

$ lexibench glottolog_trees -h
usage: lexibench glottolog_trees [-h] [--wordlist WORDLIST] [--glottolog GLOTTOLOG] [--glottolog-version GLOTTOLOG_VERSION]

Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
and pruned and renamed to the varieties in the wordlist.

options:
  -h, --help            show this help message and exit
  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage:

lexibench --repos [my_lexibench_repos] glottolog_trees --glottolog [my_glottolog]
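The pruning step can be sketched on a toy nested-list tree; Newick parsing and the actual Glottolog data access are omitted, and the doculect names are hypothetical:

```python
def prune(tree, keep):
    """Prune a nested-list tree so only leaves in `keep` remain,
    collapsing internal nodes left with a single child (topology only,
    as in Glottolog classification trees)."""
    if isinstance(tree, str):                 # leaf node
        return tree if tree in keep else None
    children = [c for c in (prune(ch, keep) for ch in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:                    # collapse unary node
        return children[0]
    return children

family = ["doculect_a", ["doculect_b", "doculect_c"], "doculect_d"]
prune(family, {"doculect_a", "doculect_c"})   # → ['doculect_a', 'doculect_c']
```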

lexibench character_matrices

The lexibench character_matrices command creates character matrices in the specified formats for the wordlists in lingpy_wordlists/ and saves them to character_matrices/. Additionally, character matrices are created which contain only those languages for which a Glottocode is available; these are saved to character_matrices_compatible/. Trees inferred from these character matrices can be compared to the Glottolog tree.

$ lexibench character_matrices -h
usage: lexibench character_matrices [-h] [--missing-is-zero] [--polymorphism-is-zero] --formats FORMATS [FORMATS ...] [--wordlist WORDLIST]

Write character matrices in specified formats for cognate data encoded in the lingpy_wordlists.

options:
  -h, --help            show this help message and exit
  --missing-is-zero     Code a missing counterpart for a concept in a doculect as 0 rather than as missing data (default: False)
  --polymorphism-is-zero
                        Code the case of multiple counterparts (in different cognate sets) for a concept in a doculect as 0 (default: False)
  --formats FORMATS [FORMATS ...]
                        Character matrix formats which are to be constructed (default: ['bin.catg', 'multi.catg', 'bin.phy', 'multi.phy', 'bin.nex', 'multi.nex'])

  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage (creating character matrices in bin.nex and multi.nex format):

lexibench --repos [my_lexibench_repos] character_matrices --formats bin.nex multi.nex --glottolog [my_glottolog]
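The binary coding and the two zero-coding options can be illustrated with a minimal sketch; the input layout (doculect → concept → set of cognate-set IDs) and the option semantics are simplified, and pylexibench's actual matrix file formats differ:

```python
def binary_matrix(cognates, missing_is_zero=False, polymorphism_is_zero=False):
    """Code cognate sets as binary characters, one column per cognate set.

    Returns (columns, matrix) where columns is a list of (concept, cogid)
    pairs and matrix maps doculect -> character string.
    """
    concepts = sorted({c for d in cognates.values() for c in d})
    columns = []
    for concept in concepts:
        cogids = sorted({cid for d in cognates.values()
                         for cid in d.get(concept, set())})
        columns += [(concept, cid) for cid in cogids]
    matrix = {}
    for doculect, data in cognates.items():
        row = []
        for concept, cogid in columns:
            cell = data.get(concept)
            if cell is None:                       # concept missing
                row.append("0" if missing_is_zero else "?")
            elif len(cell) > 1 and polymorphism_is_zero:
                row.append("0")                    # polymorphic counterpart
            else:
                row.append("1" if cogid in cell else "0")
        matrix[doculect] = "".join(row)
    return columns, matrix
```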

lexibench lingpy_cognates

$ lexibench lingpy_cognates -h
usage: lexibench lingpy_cognates [-h] [--cognate-threshold COGNATE_THRESHOLD] [--sca-threshold SCA_THRESHOLD] [--lexstat-threshold LEXSTAT_THRESHOLD] {lexstat,sca}

Compute cognate sets.

positional arguments:
  {lexstat,sca}

options:
  -h, --help            show this help message and exit
  --cognate-threshold COGNATE_THRESHOLD
  --sca-threshold SCA_THRESHOLD
  --lexstat-threshold LEXSTAT_THRESHOLD

Example usage for cognate clustering with lexstat:

lexibench --repos [my_lexibench_repos] lingpy_cognates lexstat
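The threshold-based flat clustering idea behind both methods can be sketched with normalized edit distance standing in for SCA/LexStat's sound-class alignment scores — a toy illustration, not LingPy's actual algorithms:

```python
from itertools import combinations

def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 0.0
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,                     # deletion
                dist[j - 1] + 1,                 # insertion
                prev + (a[i - 1] != b[j - 1]),   # (mis)match
            )
    return dist[n] / max(m, n)

def flat_cluster(words: list[str], threshold: float = 0.45) -> list[list[str]]:
    """Single-linkage flat clusters: words within the distance threshold
    of each other end up in the same putative cognate set."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w
    for a, b in combinations(words, 2):
        if normalized_edit_distance(a, b) <= threshold:
            parent[find(a)] = find(b)
    groups: dict[str, list[str]] = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())
```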

Download files


Source Distribution

pylexibench-1.0.0.tar.gz (32.6 kB)

Built Distribution


pylexibench-1.0.0-py3-none-any.whl (34.9 kB)

File details

Details for the file pylexibench-1.0.0.tar.gz.

File metadata

  • Download URL: pylexibench-1.0.0.tar.gz
  • Size: 32.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pylexibench-1.0.0.tar.gz:

  • SHA256: 8b8e3a41c0d1b6421d9ed76c2c478e5263a773819d98b606ea7b50918a979dab
  • MD5: 8087d1c1dda73fc0b0c14deddfeaa8ab
  • BLAKE2b-256: cb3d946e5d20a1217a0c1ef82cea93f519c3fe9ebbdb13b7a1daa9a7e9565636


File details

Details for the file pylexibench-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: pylexibench-1.0.0-py3-none-any.whl
  • Size: 34.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pylexibench-1.0.0-py3-none-any.whl:

  • SHA256: aa62bfa3f7e55301b234c58ceda77304bcec8de129cb5086cf92e7bd048c1eef
  • MD5: af0793568fb2f107f8089e8f617a1150
  • BLAKE2b-256: f0e0d2bcef9f2752872721fee8e5494c5129ca7c21d689b84ced3b93cd9572e0
