
Python library implementing a CLDF workbench


cldfbench

Tooling to create CLDF datasets from existing data


Overview

This package provides tools to curate cross-linguistic data, with the goal of packaging it as a CLDF dataset.

In particular, it supports a workflow where

  • "raw" source data is downloaded to a raw subdirectory,
  • and subsequently converted to a CLDF dataset in a cldf subdirectory, with the help of
    • configuration data in an etc directory
    • custom Python code (a subclass of cldfbench.Dataset which implements the workflow actions)

This workflow is supported via

  • a commandline interface cldfbench which calls the workflow actions via subcommands,
  • a cldfbench.Dataset base class, which must be subclassed in a custom module to hook custom code into the workflow.
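
The shape of such a custom module can be sketched as follows. This is a minimal illustration of the pattern, not the library's implementation: BaseDataset is a stand-in defined in the snippet so that it runs without cldfbench installed, and in a real module it would be replaced by cldfbench.Dataset. The dataset ID "mydataset", the cmd_download method and the run dispatcher are assumptions made for illustration; only cmd_makecldf is named in this document.

```python
import pathlib


class BaseDataset:
    """Stand-in for cldfbench.Dataset (assumption: NOT the real base class)."""

    dir = None  # dataset directory, set by subclasses
    id = None   # dataset ID, set by subclasses

    @property
    def raw_dir(self):
        return pathlib.Path(self.dir) / "raw"

    @property
    def cldf_dir(self):
        return pathlib.Path(self.dir) / "cldf"

    def run(self, action, args=None):
        # Hypothetical dispatcher: subcommand "makecldf" -> method "cmd_makecldf".
        return getattr(self, "cmd_" + action)(args)


class Dataset(BaseDataset):
    dir = pathlib.Path("mydataset")  # hypothetical location
    id = "mydataset"                 # hypothetical dataset ID

    def cmd_download(self, args):
        # Fetch "raw" source data into the raw subdirectory (omitted here).
        return self.raw_dir

    def cmd_makecldf(self, args):
        # Convert the raw data to CLDF in the cldf subdirectory (omitted here).
        return self.cldf_dir
```

The point of the design is that each workflow action is one method on the subclass, so the command line interface only has to map a subcommand name to a method.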

Install

cldfbench can be installed via pip - preferably in a virtual environment - by running

pip install cldfbench[excel]

Note: The [excel] extra specification will also install support for reading spreadsheet data.

The command line interface cldfbench

Installing the Python package also installs a cldfbench command, available on the command line:

$ cldfbench -h
usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ...

optional arguments:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)

available commands:
  Run "COMAMND -h" to get help for a specific command.

  COMMAND
    check               Run generic CLDF checks
    ...

As shown above, run cldfbench -h to get help, and cldfbench COMMAND -h to get help on individual subcommands, e.g. cldfbench new -h to read about the usage of the new subcommand.
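
When the help output is needed from a script, the same calls can be wrapped in Python. The following is a minimal sketch using only the standard library; the helper names help_argv and show_help are hypothetical, and the sketch assumes nothing about cldfbench beyond the -h flag and the COMMAND -h form shown above.

```python
import shutil
import subprocess


def help_argv(subcommand=None, executable="cldfbench"):
    """Build the argument list for `cldfbench -h` or `cldfbench COMMAND -h`."""
    argv = [executable]
    if subcommand:
        argv.append(subcommand)
    argv.append("-h")
    return argv


def show_help(subcommand=None):
    """Return the help text, or None if the cldfbench command is not on PATH."""
    argv = help_argv(subcommand)
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling show_help("new") corresponds to running cldfbench new -h in a shell.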

Dataset discovery

Most cldfbench commands operate on an existing dataset (unlike new, which creates a new one). Datasets can be discovered in two ways:

  1. Via the python module (i.e. the *.py file containing the Dataset subclass). To use this mode of discovery, pass the path to the python module as DATASET argument whenever a command requires one.

  2. Via entry point and dataset ID. To use this mode, specify the name of the entry point as value of the --entry-point option (or use the default name cldfbench.dataset) and the Dataset.id as DATASET argument.

Discovery via entry point is particularly useful for commands that can operate on multiple datasets. To select all datasets advertising a given entry point, pass "_" (i.e. an underscore) as DATASET argument.
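
The two discovery modes can be approximated with the standard library. This is a sketch of the idea under stated assumptions, not cldfbench's own lookup code: the function names are hypothetical, the module is assumed to expose its subclass under the name Dataset, and each advertised entry point is assumed to load to a class with an id attribute.

```python
import importlib.util
import pathlib
from importlib import metadata


def dataset_from_module(path):
    """Mode 1: import a *.py file by path and return its `Dataset` class."""
    path = pathlib.Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.Dataset


def datasets_from_entry_point(group="cldfbench.dataset", dataset_id="_"):
    """Mode 2: load classes advertised under an entry point group.

    Passing "_" selects every dataset advertising the entry point.
    """
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        eps = eps.select(group=group)
    else:                       # Python 3.8 / 3.9: mapping of group -> entry points
        eps = eps.get(group, [])
    classes = [ep.load() for ep in eps]
    return [cls for cls in classes if dataset_id in ("_", cls.id)]
```

Mode 1 needs nothing but a file path, while mode 2 only finds datasets that have been installed as Python packages, which is what makes it suitable for operating on many datasets at once.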

Workflow

For a full example of the cldfbench curation workflow, see the tutorial.

Creating a skeleton for a new dataset directory

A directory containing stub entries for a dataset can be created by running

cldfbench new cldfbench OUTDIR

This will create the following layout (where <ID> stands for the chosen dataset ID):

<ID>/
├── cldf               # A stub directory for the CLDF data
│   └── README.md
├── cldfbench_<ID>.py  # The python module, providing the Dataset subclass
├── etc                # A stub directory for the configuration data
│   └── README.md
├── metadata.json      # The metadata provided to the subcommand serialized as JSON
├── raw                # A stub directory for the raw data
│   └── README.md
├── setup.cfg          # Python setup config, providing defaults for test integration
├── setup.py           # Python setup file, making the dataset "installable" 
├── test.py            # The python code to run for dataset validation
└── .travis.yml        # Integrate the validation with Travis-CI
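
To confirm that a directory matches this layout, a small check along the following lines may help. It is a sketch using only the standard library; the function names are hypothetical, and the list of paths is copied from the tree above rather than taken from cldfbench itself.

```python
import pathlib


def expected_entries(dataset_id):
    """Relative paths from the layout above, for a given dataset ID."""
    return [
        "cldf/README.md",
        "cldfbench_{}.py".format(dataset_id),
        "etc/README.md",
        "metadata.json",
        "raw/README.md",
        "setup.cfg",
        "setup.py",
        "test.py",
        ".travis.yml",
    ]


def missing_entries(dataset_dir, dataset_id):
    """Return the expected paths that do not exist below dataset_dir."""
    root = pathlib.Path(dataset_dir)
    return [rel for rel in expected_entries(dataset_id) if not (root / rel).exists()]
```

An empty list from missing_entries(OUTDIR / ID, ID) means every stub file shown in the tree is present.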

Implementing CLDF creation

cldfbench provides tools to make CLDF creation simple. Still, each dataset is different, and so each dataset will have to provide its own custom code to do so. This custom code goes into the cmd_makecldf method of the Dataset subclass in the dataset's python module.

Typically, this code will make use of one or more

  • cldfbench.CLDFWriter instances, which can be obtained by calling Dataset.cldf_writer, passing in a
  • cldfbench.CLDFSpec instance, which describes what kind of CLDF to create.

cldfbench supports several scenarios of CLDF creation:

  • The typical use case is turning raw data into a single CLDF dataset. This would require instantiating one CLDFWriter in the cmd_makecldf method, and the defaults of CLDFSpec will probably be ok.
  • But it is also possible to create multiple CLDF datasets:
    • For a dataset containing both lexical and typological data, it may be appropriate to create a Wordlist and a StructureDataset. To do so, one would have to call cldf_writer twice, each time passing in an appropriate CLDFSpec. Note that if both CLDF datasets are created in the same directory, they can share the LanguageTable - but would have to specify distinct file names for the ParameterTable, passing distinct values to CLDFSpec.data_fnames.
    • When creating multiple datasets of the same CLDF module, e.g. to split a large dataset into smaller chunks, care must be taken to also disambiguate the name of the metadata file, passing distinct values to CLDFSpec.metadata_fname.
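
The file name constraints for multiple datasets in one directory can be made concrete with a small model. This sketch does not use cldfbench: SpecSketch is a stand-in dataclass, not the real CLDFSpec. The field names metadata_fname and data_fnames follow the attribute names mentioned above, whereas dir, module, the clashes helper and all file names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SpecSketch:
    """Stand-in mirroring a few CLDFSpec attribute names (NOT the real class)."""
    dir: str
    module: str
    metadata_fname: str
    data_fnames: dict = field(default_factory=dict)


def clashes(spec_a, spec_b, shared=("LanguageTable",)):
    """List files that two specs in the same directory would both write."""
    if spec_a.dir != spec_b.dir:
        return []
    problems = []
    if spec_a.metadata_fname == spec_b.metadata_fname:
        problems.append("metadata: " + spec_a.metadata_fname)
    for component, fname in spec_a.data_fnames.items():
        # Components listed in `shared` may deliberately use the same file.
        if component not in shared and spec_b.data_fnames.get(component) == fname:
            problems.append(component + ": " + fname)
    return problems


# Hypothetical file names: a shared LanguageTable, distinct ParameterTables.
wordlist = SpecSketch(
    dir="cldf", module="Wordlist",
    metadata_fname="Wordlist-metadata.json",
    data_fnames={"LanguageTable": "languages.csv", "ParameterTable": "concepts.csv"},
)
structure = SpecSketch(
    dir="cldf", module="StructureDataset",
    metadata_fname="StructureDataset-metadata.json",
    data_fnames={"LanguageTable": "languages.csv", "ParameterTable": "features.csv"},
)
```

Here clashes(wordlist, structure) returns an empty list, because only the LanguageTable file is shared. Two specs of the same module with identical metadata_fname values would be reported as a clash.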

When creating CLDF, it is also often useful to have standard reference catalogs accessible, in particular Glottolog. See the section on Catalogs for a description of how this is supported by cldfbench.

Catalogs

Linking data to reference catalogs is a major goal of CLDF datasets; thus, cldfbench provides tools to make catalog access and maintenance easier. Catalog data must be accessible in local clones of the data repository. cldfbench provides commands

  • catconfig to create the clones and make them known through a configuration file,
  • catinfo to get an overview of the installed catalogs and their versions,
  • catupdate to update local clones from the upstream repositories.

Curating a dataset on GitHub

One of the design goals of CLDF was to specify a data format that plays well with version control. Thus, it's natural - and actually recommended - to curate a CLDF dataset in a version-controlled repository. The most popular way to do this in a collaborative fashion is by using a git repository hosted on GitHub.

The directory layout supported by cldfbench caters to this use case in several ways:

  • Each directory contains a file README.md, which will be rendered as a human-readable description when browsing the repository on GitHub.
  • The file .travis.yml contains the configuration for hooking up a repository with Travis CI, to provide continuous consistency checking of the data.

Archiving a dataset with Zenodo

Curating a dataset on GitHub also provides a simple way to archive and publish released versions of the data. You can hook up your repository with Zenodo (following this guide). Then, Zenodo will pick up any released package, assign a DOI to it, archive it and make it accessible in the long term.

Some notes:

  • Hook-up with Zenodo requires the repository to be public (not private).
  • You should consider using an institutional account on GitHub and Zenodo to associate the repository with. Currently, only the user account registering a repository on Zenodo can change any metadata of releases later on.
  • Once released and archived with Zenodo, it's a good idea to add the DOI assigned by Zenodo to the release description on GitHub.
  • To make sure a release is picked up by Zenodo, the version number must start with a letter, e.g. "v1.0" - not "1.0".

Thus, with a setup as described here, you can make sure you create FAIR data.
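
The version number rule from the notes above is easy to check before tagging a release. The helper below is a hypothetical sketch, not part of cldfbench or Zenodo; it tests only the single condition stated above, that the version starts with a letter.

```python
def zenodo_friendly(tag):
    """True if the release tag starts with a letter, e.g. "v1.0"."""
    return bool(tag) and tag[0].isalpha()
```

With this check, "v1.0" passes and "1.0" does not.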

Extending cldfbench

Custom dataset templates

A python package can provide alternative dataset templates to be run with cldfbench new.

TODO

Commands

A python package can provide additional subcommands to be run from cldfbench.

TODO
