Benchmark Datasets for Computational Historical Linguistics
pylexibench
Installing
Install pylexibench via pip, preferably in a new virtual environment:
pip install pylexibench
This also installs the CLI command lexibench.
Usage
pylexibench provides a set of CLI commands to curate a data repository suitable as a benchmark for
computational methods for cognate detection and phylogenetic reconstruction. At the core of such a
repository is a list of suitable lexical datasets from the Lexibank collection. Each command
computes artefacts derived from these datasets which are suitable as input for various computational
methods. The output of each command is put in a sub-directory of the repository, named after the
command, so lexibench download will populate the download directory and so on. In addition,
summary statistics about the computed artefacts are written to a TSV file stats.tsv as well as to
a table in a Markdown formatted file README.md in this directory. The README.md also contains
information about the options passed when running the command.
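As a rough illustration of how these summary statistics could be consumed downstream, the sketch below reads stats.tsv from a command's output directory using only Python's standard library. The columns of stats.tsv are not documented here, so the code makes no assumption about them and returns each row as a dictionary keyed by whatever header the file has. The helper name read_stats is hypothetical and not part of pylexibench.

```python
import csv
from pathlib import Path


def read_stats(repos, command):
    """Return the rows of <repos>/<command>/stats.tsv as a list of dicts.

    `read_stats` is an illustrative helper, not a pylexibench function.
    """
    path = Path(repos) / command / "stats.tsv"
    with path.open(newline="", encoding="utf-8") as handle:
        # stats.tsv is tab-separated; the header row supplies the dict keys.
        return list(csv.DictReader(handle, delimiter="\t"))
```

Calling read_stats("[my_lexibench_repos]", "download") would then give one dictionary per row of download/stats.tsv, with all values as strings.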
The commands are implemented as sub-commands of the main lexibench command, which is installed
when installing the pylexibench package.
Before running any pylexibench command, create a directory ([my_lexibench_repos]) where you want to store the lexibench data.
Set up a dataset list lexibank.tsv naming the datasets you want to examine
(for a list with all datasets contained in lexibench see here).
Place it in [my_lexibench_repos]/etc/.
Note that download needs to be executed first, followed by lingpy_wordlists. The remaining commands can be executed afterwards.
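The setup steps and the required command order can be sketched in Python as follows. This is only a convenience wrapper around the command line shown in this document (lexibench --repos ... COMMAND). The function names are hypothetical, and the sketch does not create lexibank.tsv itself, since its column layout is not described here.

```python
import subprocess
from pathlib import Path

# Order stated in the text: download first, then lingpy_wordlists.
REQUIRED_ORDER = ["download", "lingpy_wordlists"]


def prepare_repos(repos):
    """Create <repos>/etc/ and return the expected path of the dataset list."""
    etc = Path(repos) / "etc"
    etc.mkdir(parents=True, exist_ok=True)
    return etc / "lexibank.tsv"


def initial_commands(repos):
    """Build the argument lists for the two commands that must run first."""
    return [["lexibench", "--repos", str(repos), cmd] for cmd in REQUIRED_ORDER]


def run_initial_commands(repos):
    """Run download and lingpy_wordlists in order, once the dataset list exists."""
    dataset_list = prepare_repos(repos)
    if not dataset_list.is_file():
        # The dataset list has to be supplied by the user.
        raise FileNotFoundError(f"Place your dataset list at {dataset_list}")
    for argv in initial_commands(repos):
        # check=True stops the sequence if a command fails.
        subprocess.run(argv, check=True)
```

After run_initial_commands has finished, the remaining commands can be invoked in the same way.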
$ lexibench -h
usage: lexibench [-h] [--log-level LOG_LEVEL] [--repos REPOS]
                 COMMAND ...

options:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)
  --repos REPOS         Directory where dataset list can be found and results stored. (default: None)

available commands:
  Run "COMAMND -h" to get help for a specific command.

  COMMAND
    character_matrices  Write character matrix_formats for cognate data encoded in the lingpy_wordlists.
    download            Download lexibank datasets as specified in the repository's dataset list and write a corresponding
    glottolog_trees     Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
    lingpy_cognates     Compute cognate sets.
    lingpy_wordlists    Extract LingPy lingpy_wordlists from lexibank datasets.

lexibench download
The lexibench download command downloads the CLDF datasets listed in etc/lexibank.tsv from
Zenodo into download/.
$ lexibench download -h
usage: lexibench download [-h] [-f] [-u]

Download lexibank datasets as specified in the repository's dataset list and write a corresponding
BibTeX file for reference.

options:
  -h, --help     show this help message and exit
  -f, --force    Force download of a dataset even if it already exists. (default: False)
  -u, --upgrade  Download newest release of a dataset. (default: False)

Example usage with the upgrade option:
lexibench --repos [my_lexibench_repos] download --upgrade
lexibench lingpy_wordlists
The lexibench lingpy_wordlists command extracts single-family wordlists from the datasets and
writes these in LingPy's Wordlist format to lingpy_wordlists/.
$ lexibench lingpy_wordlists -h
usage: lexibench lingpy_wordlists [-h] [--language-threshold LANGUAGE_THRESHOLD] [--concept-threshold CONCEPT_THRESHOLD] [--coverage-threshold COVERAGE_THRESHOLD]

Extract LingPy lingpy_wordlists from lexibank datasets.

options:
  -h, --help            show this help message and exit
  --language-threshold LANGUAGE_THRESHOLD
                        Number of different varieties a wordlist must contain to be considered (default: 4)
  --concept-threshold CONCEPT_THRESHOLD
                        Number of different concepts a wordlist must contain to be considered (default: 85)
  --coverage-threshold COVERAGE_THRESHOLD
                        Minimum coverage (computed as `lingpy.sanity.average_coverage`) a wordlist must have to be considered (default: 0.45)

Example usage:
lexibench --repos [my_lexibench_repos] lingpy_wordlists
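To make the effect of the three thresholds concrete, here is a minimal sketch of the kind of filter they describe. It assumes that each threshold is an inclusive minimum (a wordlist with exactly 4 varieties passes), which the help text suggests but does not state outright. The function is_suitable is hypothetical, and the coverage value is taken as given rather than recomputed with LingPy.

```python
def is_suitable(n_varieties, n_concepts, coverage,
                language_threshold=4,
                concept_threshold=85,
                coverage_threshold=0.45):
    """Decide whether a wordlist meets all three thresholds.

    Defaults mirror the CLI defaults listed above. Treating the
    thresholds as inclusive minimums is an assumption of this sketch.
    """
    return (
        n_varieties >= language_threshold
        and n_concepts >= concept_threshold
        and coverage >= coverage_threshold
    )
```

Under these assumptions a wordlist with 12 varieties, 110 concepts and a coverage of 0.4 would be rejected with the default settings, but accepted if --coverage-threshold were lowered to 0.4.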
lexibench glottolog_trees
The lexibench glottolog_trees command computes topological trees based on the Glottolog
classification, i.e. the doculects in a wordlist are matched to Glottolog languoids and the
associated Glottolog family tree is then pruned to contain only these doculects as leaf nodes.
$ lexibench glottolog_trees -h
usage: lexibench glottolog_trees [-h] [--wordlist WORDLIST] [--glottolog GLOTTOLOG] [--glottolog-version GLOTTOLOG_VERSION]

Create trees for the families referenced in the lingpy_wordlists, based on the Glottolog classification
and pruned and renamed to the varieties in the wordlist.

options:
  -h, --help            show this help message and exit
  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage:
lexibench --repos [my_lexibench_repos] glottolog_trees --glottolog [my_glottolog]
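The pruning step described for this command can be illustrated with a small, self-contained sketch. It is not the implementation used by pylexibench: the tree is represented here as nested (name, children) tuples rather than a Newick or Glottolog object, the function name prune is hypothetical, and collapsing internal nodes that end up with a single child is a choice of this sketch. It also only handles doculects that are leaves of the input tree.

```python
def prune(node, keep):
    """Restrict a (name, children) tree to the leaves named in `keep`.

    Returns the pruned tree, or None if no requested leaf is below `node`.
    """
    name, children = node
    if not children:
        # A leaf survives only if it is one of the requested doculects.
        return (name, []) if name in keep else None
    pruned = [p for p in (prune(child, keep) for child in children) if p is not None]
    if not pruned:
        # Nothing below this node was requested, so drop the whole subtree.
        return None
    if len(pruned) == 1:
        # Collapse internal nodes that are left with a single child.
        return pruned[0]
    return (name, pruned)


# Toy classification with placeholder names (not real Glottolog data).
toy_tree = ("root", [("a", [("a1", []), ("a2", [])]), ("b", [("b1", [])])])
pruned_tree = prune(toy_tree, {"a1", "b1"})
```

Here pruned_tree keeps only a1 and b1 directly under root, because the intermediate nodes a and b each retain a single child and are collapsed.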
lexibench character_matrices
The lexibench character_matrices command creates character matrices in the specified formats for the wordlists in lingpy_wordlists/ and saves them to character_matrices/. In addition, it creates character matrices that contain only those languages for which a Glottocode is available. These are saved to character_matrices_compatible/. Trees inferred from these character matrices can be compared to the Glottolog tree.
$ lexibench character_matrices -h
usage: lexibench character_matrices [-h] [--missing-is-zero] [--polymorphism-is-zero] --formats FORMATS [FORMATS ...] [--wordlist WORDLIST]

Write character matrices in specified formats for cognate data encoded in the lingpy_wordlists.

options:
  -h, --help            show this help message and exit
  --missing-is-zero     Code a missing counterpart for a concept in a doculect as 0 rather than as missing data (default: False)
  --polymorphism-is-zero
                        Code the case of multiple counterparts (in different cognate sets) for a concept in a doculect as 0 (default: False)
  --formats FORMATS [FORMATS ...]
                        Character matrix formats which are to be constructed (default: ['bin.catg', 'multi.catg', 'bin.phy', 'multi.phy', 'bin.nex', 'multi.nex'])
  --wordlist WORDLIST   Name of a specific wordlist to process (default: None)
  --glottolog GLOTTOLOG
                        Path to repository clone of Glottolog data (default: None)
  --glottolog-version GLOTTOLOG_VERSION
                        Version of Glottolog data to checkout (default: None)

Example usage (for creating character matrices in bin.nex and multi.nex format):
lexibench --repos [my_lexibench_repos] character_matrices --formats bin.nex multi.nex --glottolog [my_glottolog]
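To show what a binary character matrix and the --missing-is-zero option amount to, the sketch below codes cognate data as presence/absence characters, one per (concept, cognate set) pair. This is a common coding scheme and only an approximation of what the command may do internally. The function binary_matrix, its input format and the '?' symbol for missing data are assumptions of this sketch, and the multistate coding and --polymorphism-is-zero are not covered.

```python
def binary_matrix(rows, missing_is_zero=False):
    """Code (doculect, concept, cognate_set) triples as binary characters.

    Returns (characters, matrix): `characters` is the sorted list of
    (concept, cognate_set) pairs, `matrix` maps each doculect to a string
    of '1' (attested), '0' (absent) and '?' (concept missing).
    """
    rows = list(rows)
    doculects = sorted({d for d, _, _ in rows})
    characters = sorted({(c, s) for _, c, s in rows})
    attested = set(rows)
    has_concept = {(d, c) for d, c, _ in rows}
    matrix = {}
    for doculect in doculects:
        states = []
        for concept, cogset in characters:
            if (doculect, concept, cogset) in attested:
                states.append("1")
            elif (doculect, concept) in has_concept or missing_is_zero:
                # The concept is attested in another cognate set, or
                # missing data is explicitly coded as absence.
                states.append("0")
            else:
                states.append("?")
        matrix[doculect] = "".join(states)
    return characters, matrix


# Toy data with placeholder doculects and cognate set IDs.
toy_rows = [
    ("A", "hand", 1), ("B", "hand", 1), ("C", "hand", 2),
    ("A", "foot", 3), ("B", "foot", 4),
]
characters, matrix = binary_matrix(toy_rows)
```

In this toy example doculect C has no word for "foot", so both "foot" characters are coded as '?' by default and as '0' when missing_is_zero=True.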
lexibench lingpy_cognates
$ lexibench lingpy_cognates -h
usage: lexibench lingpy_cognates [-h] [--cognate-threshold COGNATE_THRESHOLD] [--sca-threshold SCA_THRESHOLD] [--lexstat-threshold LEXSTAT_THRESHOLD] {lexstat,sca}

Compute cognate sets.

positional arguments:
  {lexstat,sca}

options:
  -h, --help            show this help message and exit
  --cognate-threshold COGNATE_THRESHOLD
  --sca-threshold SCA_THRESHOLD
  --lexstat-threshold LEXSTAT_THRESHOLD

Example usage for cognate clustering with lexstat:
lexibench --repos [my_lexibench_repos] lingpy_cognates lexstat