Skip to main content

A python library to read and write CLDF datasets

Project description

pycldf

A python package to read and write CLDF datasets.

Build Status codecov Requirements Status Documentation Status PyPI

Install

Install pycldf from PyPI:

pip install pycldf

Command line usage

Installing the pycldf package will also install a command line interface cldf, which provides some sub-commands to manage CLDF datasets.

Summary statistics

$ cldf stats mydataset/Wordlist-metadata.json 
<cldf:v1.0:Wordlist at mydataset>

Path                   Type          Rows
---------------------  ----------  ------
forms.csv              Form Table       1
mydataset/sources.bib  Sources          1

Validation

Arguably the most important functionality of pycldf is validating CLDF datasets.

By default, data files are read in strict-mode, i.e. invalid rows will result in an exception being raised. To validate a data file, it can be read in validating-mode.

For example the following output is generated

$ cldf validate mydataset/forms.csv
WARNING forms.csv: duplicate primary key: (u'1',)
WARNING forms.csv:4:Source missing source key: Mei2005

when reading the file

ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]

Extracting human readable metadata

The information in a CLDF metadata file can be converted to markdown (a human readable markup language) running

cldf markdown PATH/TO/metadata.json

A typical usage of this feature is to create a README.md for your dataset (which, when uploaded to e.g. GitHub will be rendered nicely in the browser).

Converting a CLDF dataset to an SQLite database

A very useful feature of CSVW in general and CLDF in particular is that it provides enough metadata for a set of CSV files to load them into a relational database - including relations between tables. This can be done running the cldf createdb command:

$ cldf createdb -h
usage: cldf createdb [-h] [--infer-primary-keys] DATASET SQLITE_DB_PATH

Load a CLDF dataset into a SQLite DB

positional arguments:
  DATASET               Dataset specification (i.e. path to a CLDF metadata
                        file or to the data file)
  SQLITE_DB_PATH        Path to the SQLite db file

For a specification of the resulting database schema refer to the documentation in src/pycldf/db.py.

Python API

For a detailed documentation of the Python API, refer to the docs on ReadTheDocs.

Reading CLDF

As an example, we'll read data from WALS Online, v2020:

>>> from pycldf import Dataset
>>> wals2020 = Dataset.from_metadata('https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json')

For exploratory purposes, accessing a remote dataset over HTTP is fine. But for real analysis, you'd want to download the datasets first and then access them locally, passing a local file path to Dataset.from_metadata.

Let's look at what we got:

>>> print(wals2020)
<cldf:v1.0:StructureDataset at https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json>
>>> for c in wals2020.components:
  ...     print(c)
...
ValueTable
ParameterTable
CodeTable
LanguageTable
ExampleTable

As expected, we got a StructureDataset, and in addition to the required ValueTable, we also have a couple more components.

We can investigate the values using pycldf's ORM functionality, i.e. mapping rows in the CLDF data files to convenient python objects. (Take note of the limitations describe in orm.py, though.)

>>> for value in wals2020.objects('ValueTable'):
  ...     break
...
>>> value
<pycldf.orm.Value id="81A-aab">
>>> value.language
<pycldf.orm.Language id="aab">
>>> value.language.cldf
Namespace(glottocode=None, id='aab', iso639P3code=None, latitude=Decimal('-3.45'), longitude=Decimal('142.95'), macroarea=None, name='Arapesh (Abu)')
>>> value.parameter
<pycldf.orm.Parameter id="81A">
>>> value.parameter.cldf
Namespace(description=None, id='81A', name='Order of Subject, Object and Verb')
>>> value.references
(<Reference Nekitel-1985[94]>,)
>>> value.references[0]
<Reference Nekitel-1985[94]>
>>> print(value.references[0].source.bibtex())
@misc{Nekitel-1985,
    olac_field = {syntax; general_linguistics; typology},
    school     = {Australian National University},
    title      = {Sociolinguistic Aspects of Abu', a Papuan Language of the Sepik Area, Papua New Guinea},
    wals_code  = {aab},
    year       = {1985},
    author     = {Nekitel, Otto I. M. S.}
}

If performance is important, you can just read rows of data as python dicts, in which case the references between tables must be resolved "by hand":

>>> params = {r['id']: r for r in wals2020.iter_rows('ParameterTable', 'id', 'name')}
>>> for v in wals2020.iter_rows('ValueTable', 'parameterReference'):
    ...     print(params[v['parameterReference']]['name'])
...     break
...
Order of Subject, Object and Verb

Note that we passed names of CLDF terms to Dataset.iter_rows (e.g. id) specifying which columns we want to access by CLDF term - rather than by the column names they are mapped to in the dataset.

Writing CLDF

Warning: Writing CLDF with pycldf does not automatically result in valid CLDF! It does result in data that can be checked via cldf validate (see below), though, so you should always validate after writing.

from pycldf import Wordlist, Source

dataset = Wordlist.in_dir('mydataset')
dataset.add_sources(Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))
dataset.write(FormTable=[
    {
        'ID': '1', 
        'Form': 'word', 
        'Language_ID': 'abcd1234', 
        'Parameter_ID': '1277', 
        'Source': ['Meier2005[3-7]'],
    }])

results in

$ ls -1 mydataset/
forms.csv
sources.bib
Wordlist-metadata.json
  • mydataset/forms.csv
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
  • mydataset/sources.bib
@book{Meier2005,
    author = {Meier, Hans},
    year = {2005},
    title = {The Book}
}
  • mydataset/Wordlist-metadata.json

Advanced writing

To add predefined CLDF components to a dataset, use the add_component method:

from pycldf import StructureDataset, term_uri

dataset = StructureDataset.in_dir('mydataset')
dataset.add_component('ParameterTable')
dataset.write(
    ValueTable=[{'ID': '1', 'Language_ID': 'abc', 'Parameter_ID': '1', 'Value': 'x'}],
	ParameterTable=[{'ID': '1', 'Name': 'Grammatical Feature'}])

It is also possible to add generic tables:

dataset.add_table('contributors.csv', term_uri('id'), term_uri('name'))

which can also be linked to other tables:

dataset.add_columns('ParameterTable', 'Contributor_ID')
dataset.add_foreign_key('ParameterTable', 'Contributor_ID', 'contributors.csv', 'ID')

Addressing tables and columns

Tables in a dataset can be referenced using a Dataset's __getitem__ method, passing

  • a full CLDF Ontology URI for the corresponding component,
  • the local name of the component in the CLDF Ontology,
  • the url of the table.

Columns in a dataset can be referenced using a Dataset's __getitem__ method, passing a tuple (<TABLE>, <COLUMN>) where <TABLE> specifies a table as explained above and <COLUMN> is

  • a full CLD Ontolgy URI used as propertyUrl of the column,
  • the name property of the column.

Object oriented access to CLDF data

The pycldf.orm module implements functionality to access CLDF data via an ORM. Read its docstring for details.

Accessing CLDF data via SQL

The pycldf.db module implements functionality to load CLDF data into a SQLite database. Read its docstring for details.

See also

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycldf-1.21.2.tar.gz (54.0 kB view details)

Uploaded Source

Built Distribution

pycldf-1.21.2-py2.py3-none-any.whl (62.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file pycldf-1.21.2.tar.gz.

File metadata

  • Download URL: pycldf-1.21.2.tar.gz
  • Upload date:
  • Size: 54.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.8.5

File hashes

Hashes for pycldf-1.21.2.tar.gz
Algorithm Hash digest
SHA256 bc4d649c34aa11bcdb6e773d5e754fcd137dc7cef484c7e72df34861f0f744f6
MD5 432bf2859f795c42eff3d1fb90861767
BLAKE2b-256 57916db5fd5e3f1ea3ef68795e4d51459e1e19134b86c6a280b7f8ae2e3b2bb8

See more details on using hashes here.

File details

Details for the file pycldf-1.21.2-py2.py3-none-any.whl.

File metadata

  • Download URL: pycldf-1.21.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 62.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.8.5

File hashes

Hashes for pycldf-1.21.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 ab7e6d0e096ac781ec80187cd78137e81e2491eda6ea2049901b6151d877ba88
MD5 39636006430364d9bf0b1ce58ddbc1ca
BLAKE2b-256 68669451789b71d3db9ec76db8b7a28d2457dd1366d7b6b14d868a641ea5721a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page