pycldf

A Python package to read and write CLDF datasets.

Install

Install pycldf from PyPI:

pip install pycldf

Command line usage

Installing the pycldf package will also install a command line interface cldf, which provides some sub-commands to manage CLDF datasets.

Dataset discovery

cldf subcommands support dataset discovery as specified in the standard.

So a typical workflow involving a remote dataset could look as follows.

Create a local directory to which to download the dataset (ideally including version info):

$ mkdir wacl-1.0.0

Accessing CLDF datasets on Zenodo requires installing cldfzenodo (via pip install cldfzenodo). Validating a dataset from Zenodo will implicitly download it, so running

$ cldf validate https://zenodo.org/record/7322688#rdf:ID=wacl --download-dir wacl-1.0.0/

will download the dataset to wacl-1.0.0.

Subsequently we can access the data locally for better performance:

$ cldf stats wacl-1.0.0/#rdf:ID=wacl
<cldf:v1.0:StructureDataset at wacl-1.0.0/cldf>
                          value
------------------------  --------------------------------------------------------------------
dc:bibliographicCitation  Her, One-Soon, Harald Hammarström and Marc Allassonnière-Tang. 2022.
dc:conformsTo             http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
dc:identifier             https://wacl.clld.org
dc:license                https://creativecommons.org/licenses/by/4.0/
dc:source                 sources.bib
dc:title                  World Atlas of Classifier Languages
dcat:accessURL            https://github.com/cldf-datasets/wacl
rdf:ID                    wacl
rdf:type                  http://www.w3.org/ns/dcat#Distribution

                Type              Rows
--------------  --------------  ------
values.csv      ValueTable        3338
parameters.csv  ParameterTable       1
languages.csv   LanguageTable     3338
codes.csv       CodeTable            2
sources.bib     Sources           2000

(Note that locating datasets on Zenodo requires installation of cldfzenodo.)
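The dataset locators used above combine a location (a local path or a URL) with an optional `#rdf:ID=...` fragment that selects one dataset at that location. The following is a minimal sketch of how such a locator could be taken apart; `split_locator` is a hypothetical helper written for illustration, not a function provided by pycldf, and the real discovery logic does considerably more.

```python
def split_locator(locator):
    """Split 'LOCATION#rdf:ID=NAME' into (location, dataset_id or None).

    Hypothetical helper for illustration only; not part of pycldf.
    """
    location, _, fragment = locator.partition('#')
    prefix = 'rdf:ID='
    dataset_id = fragment[len(prefix):] if fragment.startswith(prefix) else None
    return location, dataset_id


# Locators taken from the examples in this section.
print(split_locator('wacl-1.0.0/#rdf:ID=wacl'))
print(split_locator('https://zenodo.org/record/7322688#rdf:ID=wacl'))
```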

Summary statistics

$ cldf stats tests/data/wordlist_with_cognates/metadata.json 
<cldf:v1.0:Wordlist at tests/data/wordlist_with_cognates>
               value
-------------  --------------------------------------------
dc:conformsTo  http://cldf.clld.org/v1.0/terms.rdf#Wordlist
dc:source      sources.bib

                 Type               Rows
---------------  ---------------  ------
languages.csv    LanguageTable         2
parameters.csv   ParameterTable        2
forms.csv        FormTable             3
cognates.csv     CognateTable          2
cognatesets.csv  CognatesetTable       1
sources.bib      Sources               1
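To make the second table of this output more concrete: each table in a CLDF metadata file has a `url` and, for CLDF components, a `dc:conformsTo` property naming the component; the row count is the number of data rows in the corresponding CSV file. The sketch below reproduces that idea with the standard library only. The metadata and file contents are made-up miniature examples, and `table_stats` is a hypothetical helper, not how `cldf stats` is implemented.

```python
import csv
import io
import json

# Made-up, heavily abbreviated metadata and data files (assumptions for illustration).
METADATA = json.loads("""
{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
  "tables": [
    {"url": "forms.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable"},
    {"url": "languages.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#LanguageTable"}
  ]
}
""")
FILES = {
    'forms.csv': 'ID,Language_ID,Parameter_ID,Form\n1,l1,p1,word\n2,l2,p1,hand\n3,l2,p2,foot\n',
    'languages.csv': 'ID,Name\nl1,Language 1\nl2,Language 2\n',
}


def table_stats(metadata, files):
    """Return (url, component name, number of data rows) for each table."""
    stats = []
    for table in metadata['tables']:
        # The component name is the fragment of the dc:conformsTo URI.
        component = table.get('dc:conformsTo', '').rpartition('#')[2]
        reader = csv.reader(io.StringIO(files[table['url']]))
        next(reader)  # skip the header row
        stats.append((table['url'], component, sum(1 for _ in reader)))
    return stats


for url, component, rows in table_stats(METADATA, FILES):
    print(url, component, rows)
```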

Validation

Arguably the most important functionality of pycldf is validating CLDF datasets.

By default, data files are read in strict mode, i.e. invalid rows will result in an exception being raised. To validate a data file, it can be read in validating mode instead, in which problems are reported as warnings.

For example, the following output is generated

$ cldf validate mydataset/forms.csv
WARNING forms.csv: duplicate primary key: (u'1',)
WARNING forms.csv:4:Source missing source key: Mei2005

when reading the file

ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
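The two warnings correspond to two simple checks: a primary key that occurs more than once, and a source key that is not defined in the bibliography. The following standard-library sketch applies just these two checks to the file shown above. It is only an illustration of the idea; `check_forms` and the set of known source keys are assumptions, and actual validation in pycldf covers far more (datatypes, foreign keys, required columns, and so on).

```python
import csv
import io

FORMS_CSV = """\
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
"""
KNOWN_SOURCES = {'Meier2005'}  # assumed content of sources.bib


def check_forms(text, known_sources):
    """Collect warnings for duplicate IDs and unknown source keys."""
    warnings, seen = [], set()
    # Data rows start on line 2 of the file, after the header.
    for lineno, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if row['ID'] in seen:
            warnings.append('duplicate primary key: {}'.format(row['ID']))
        seen.add(row['ID'])
        # A reference looks like KEY[PAGES]; multiple references are separated by ';'.
        for ref in filter(None, row['Source'].split(';')):
            key = ref.split('[', 1)[0].strip()
            if key not in known_sources:
                warnings.append('line {}: missing source key: {}'.format(lineno, key))
    return warnings


for warning in check_forms(FORMS_CSV, KNOWN_SOURCES):
    print('WARNING', warning)
```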

Extracting human readable metadata

The information in a CLDF metadata file can be converted to markdown (a human readable markup language) by running

cldf markdown PATH/TO/metadata.json

A typical use of this feature is to create a README.md for your dataset (which, when uploaded to e.g. GitHub, will be rendered nicely in the browser).

Downloading media listed in a dataset's MediaTable

Typically, CLDF datasets only reference media items. The MediaTable provides enough information, though, to download and save an item's content. This can be done by running

cldf downloadmedia PATH/TO/metadata.json PATH/TO/DOWNLOAD/DIR

To minimize bandwidth usage, relevant items can be filtered by passing selection criteria in the form COLUMN_NAME=SUBSTRING as optional arguments. E.g. downloading could be limited to audio files by passing Media_Type=audio/ (provided Media_Type is the name of the column with propertyUrl http://cldf.clld.org/v1.0/terms.rdf#mediaType).
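The COLUMN_NAME=SUBSTRING criteria amount to a substring test on the named column of each MediaTable row. A minimal sketch of that filtering logic follows; `parse_criteria`, `matches` and the sample rows are hypothetical and only illustrate the selection rule, not the actual implementation of `cldf downloadmedia`.

```python
def parse_criteria(args):
    """Turn ['COLUMN=SUBSTRING', ...] into a list of (column, substring) pairs."""
    criteria = []
    for arg in args:
        column, sep, substring = arg.partition('=')
        if not sep:
            raise ValueError('expected COLUMN_NAME=SUBSTRING, got {!r}'.format(arg))
        criteria.append((column, substring))
    return criteria


def matches(row, criteria):
    """A row is selected if every criterion's substring occurs in its column."""
    return all(substring in (row.get(column) or '') for column, substring in criteria)


# Made-up MediaTable rows for illustration.
MEDIA_ROWS = [
    {'ID': 'm1', 'Media_Type': 'audio/x-wav'},
    {'ID': 'm2', 'Media_Type': 'image/png'},
    {'ID': 'm3', 'Media_Type': 'audio/mpeg'},
]
criteria = parse_criteria(['Media_Type=audio/'])
selected = [row['ID'] for row in MEDIA_ROWS if matches(row, criteria)]
print(selected)
```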

Converting a CLDF dataset to an SQLite database

A very useful feature of CSVW in general and CLDF in particular is that it provides enough metadata for a set of CSV files to load them into a relational database, including relations between tables. This can be done by running the cldf createdb command:

$ cldf createdb -h
usage: cldf createdb [-h] [--infer-primary-keys] DATASET SQLITE_DB_PATH

Load a CLDF dataset into a SQLite DB

positional arguments:
  DATASET               Dataset specification (i.e. path to a CLDF metadata
                        file or to the data file)
  SQLITE_DB_PATH        Path to the SQLite db file

For a specification of the resulting database schema refer to the documentation in src/pycldf/db.py.
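To illustrate what "including relations between tables" means in practice, here is a small sketch using Python's built-in sqlite3 module. The table and column names are made up for this example and are not the schema that `cldf createdb` produces (see src/pycldf/db.py for that); the point is only that foreign keys declared in the metadata can become enforceable constraints and allow joins.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute('PRAGMA foreign_keys = ON')
conn.executescript("""
CREATE TABLE languages (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE forms (
    id TEXT PRIMARY KEY,
    language_id TEXT NOT NULL REFERENCES languages (id),
    form TEXT
);
""")
# Made-up rows standing in for the content of two CSV files.
conn.executemany('INSERT INTO languages VALUES (?, ?)',
                 [('l1', 'Language 1'), ('l2', 'Language 2')])
conn.executemany('INSERT INTO forms VALUES (?, ?, ?)',
                 [('1', 'l1', 'word'), ('2', 'l2', 'hand')])

# The relation between the tables can now be used in a join ...
joined = conn.execute(
    'SELECT f.form, l.name FROM forms AS f '
    'JOIN languages AS l ON f.language_id = l.id ORDER BY f.id'
).fetchall()

# ... and rows pointing to a non-existing language are rejected.
try:
    conn.execute("INSERT INTO forms VALUES ('3', 'missing', 'foot')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(joined, rejected)
```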

Handling large media files

Often, platforms like GitHub impose limits on the size of individual files in a repository. Thus, in order to facilitate curation of datasets with large media files on such platforms, pycldf provides a pragmatic solution as follows:

Running

cldf splitmedia <dataset-locator>

on a dataset will split all media files with sizes bigger than a configurable threshold into multiple files, just like the UNIX split command would. A file named audio.wav will be split into files audio.wav.aa, audio.wav.ab and so on.

Caution: With large files split (and removed), the dataset will no longer validate.

In order to restore the files, the corresponding command

cldf catmedia <dataset-locator>

can be used.

Thus, in a typical workflow each commit to the repository would be wrapped in a cldf splitmedia and a cldf catmedia call (possibly automated via git hooks).
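The split/concatenate round trip can be sketched with the standard library as follows. This is an illustration of the naming scheme described above (audio.wav.aa, audio.wav.ab, ...), not the code behind `cldf splitmedia` and `cldf catmedia`; the function names and the tiny chunk size are assumptions chosen for the demo.

```python
import itertools
import string
import tempfile
from pathlib import Path


def suffixes():
    """Yield 'aa', 'ab', ..., 'zz', the default suffixes of UNIX split."""
    for first, second in itertools.product(string.ascii_lowercase, repeat=2):
        yield first + second


def split_file(path, chunk_size):
    """Replace path with chunks path.aa, path.ab, ... of at most chunk_size bytes."""
    path = Path(path)
    parts = []
    with path.open('rb') as stream:
        for suffix in suffixes():
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            part = path.with_name(path.name + '.' + suffix)
            part.write_bytes(chunk)
            parts.append(part)
        else:
            # All suffixes are used up; refuse to drop remaining data.
            if stream.read(1):
                raise ValueError('more chunks than two-letter suffixes')
    path.unlink()  # the original file is removed once it has been split
    return parts


def cat_file(path):
    """Restore path from its chunks and remove the chunks."""
    path = Path(path)
    with path.open('wb') as out:
        for suffix in suffixes():
            part = path.with_name(path.name + '.' + suffix)
            if not part.exists():
                break
            out.write(part.read_bytes())
            part.unlink()
    return path


with tempfile.TemporaryDirectory() as tmp:
    audio = Path(tmp) / 'audio.wav'
    audio.write_bytes(b'0123456789')
    part_names = [p.name for p in split_file(audio, chunk_size=4)]
    removed = not audio.exists()
    restored = cat_file(audio).read_bytes()

print(part_names, removed, restored)
```

Because the original file is gone between the two calls, anything that checks for the presence of the media files in that state will fail, which is why validation is expected to break until the files are restored.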

Python API

For detailed documentation of the Python API, refer to the docs on ReadTheDocs.

Reading CLDF

As an example, we'll read data from WALS Online, v2020:

>>> from pycldf import Dataset
>>> wals2020 = Dataset.from_metadata('https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json')

For exploratory purposes, accessing a remote dataset over HTTP is fine. But for real analysis, you'd want to download the datasets first and then access them locally, passing a local file path to Dataset.from_metadata.

Let's look at what we got:

>>> print(wals2020)
<cldf:v1.0:StructureDataset at https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json>
>>> for c in wals2020.components:
...     print(c)
...
ValueTable
ParameterTable
CodeTable
LanguageTable
ExampleTable

As expected, we got a StructureDataset, and in addition to the required ValueTable, we also have a couple more components.

We can investigate the values using pycldf's ORM functionality, i.e. mapping rows in the CLDF data files to convenient Python objects. (Take note of the limitations described in orm.py, though.)

>>> for value in wals2020.objects('ValueTable'):
...     break
...
>>> value
<pycldf.orm.Value id="81A-aab">
>>> value.language
<pycldf.orm.Language id="aab">
>>> value.language.cldf
Namespace(glottocode=None, id='aab', iso639P3code=None, latitude=Decimal('-3.45'), longitude=Decimal('142.95'), macroarea=None, name='Arapesh (Abu)')
>>> value.parameter
<pycldf.orm.Parameter id="81A">
>>> value.parameter.cldf
Namespace(description=None, id='81A', name='Order of Subject, Object and Verb')
>>> value.references
(<Reference Nekitel-1985[94]>,)
>>> value.references[0]
<Reference Nekitel-1985[94]>
>>> print(value.references[0].source.bibtex())
@misc{Nekitel-1985,
    olac_field = {syntax; general_linguistics; typology},
    school     = {Australian National University},
    title      = {Sociolinguistic Aspects of Abu', a Papuan Language of the Sepik Area, Papua New Guinea},
    wals_code  = {aab},
    year       = {1985},
    author     = {Nekitel, Otto I. M. S.}
}

If performance is important, you can just read rows of data as Python dicts, in which case the references between tables must be resolved "by hand":

>>> params = {r['id']: r for r in wals2020.iter_rows('ParameterTable', 'id', 'name')}
>>> for v in wals2020.iter_rows('ValueTable', 'parameterReference'):
...     print(params[v['parameterReference']]['name'])
...     break
...
Order of Subject, Object and Verb

Note that we passed names of CLDF terms to Dataset.iter_rows (e.g. id) specifying which columns we want to access by CLDF term - rather than by the column names they are mapped to in the dataset.
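The idea of addressing columns by CLDF term rather than by local column name can be illustrated without pycldf. In the sketch below, a hypothetical `iter_rows` function takes plain dicts plus an explicit term-to-column mapping (in a real dataset this mapping comes from the propertyUrl entries in the metadata) and adds the requested terms as extra keys. The mappings and rows are assumptions built from the identifiers shown above; this is a toy model, not pycldf's implementation.

```python
def iter_rows(rows, term_to_column, *terms):
    """Yield copies of rows with the requested CLDF terms added as keys."""
    for row in rows:
        item = dict(row)
        for term in terms:
            item[term] = row[term_to_column[term]]
        yield item


# Assumed local column names for the CLDF terms used in the example.
PARAMETER_TERMS = {'id': 'ID', 'name': 'Name'}
VALUE_TERMS = {'id': 'ID', 'parameterReference': 'Parameter_ID'}

PARAMETERS = [{'ID': '81A', 'Name': 'Order of Subject, Object and Verb'}]
VALUES = [{'ID': '81A-aab', 'Language_ID': 'aab', 'Parameter_ID': '81A'}]

# Resolve the reference from values to parameters "by hand" via a lookup dict.
params = {r['id']: r for r in iter_rows(PARAMETERS, PARAMETER_TERMS, 'id', 'name')}
names = [
    params[v['parameterReference']]['name']
    for v in iter_rows(VALUES, VALUE_TERMS, 'parameterReference')
]
print(names)
```

The benefit of this indirection is that analysis code written against CLDF terms keeps working when a dataset uses different local column names.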

Writing CLDF

Warning: Writing CLDF with pycldf does not automatically result in valid CLDF! It does result in data that can be checked via cldf validate (see above), though, so you should always validate after writing.

from pycldf import Wordlist, Source

dataset = Wordlist.in_dir('mydataset')
dataset.add_sources(Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))
dataset.write(FormTable=[
    {
        'ID': '1', 
        'Form': 'word', 
        'Language_ID': 'abcd1234', 
        'Parameter_ID': '1277', 
        'Source': ['Meier2005[3-7]'],
    }])

results in

$ ls -1 mydataset/
forms.csv
sources.bib
Wordlist-metadata.json

mydataset/forms.csv

ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]

mydataset/sources.bib

@book{Meier2005,
    author = {Meier, Hans},
    year = {2005},
    title = {The Book}
}

mydataset/Wordlist-metadata.json

Advanced writing

To add predefined CLDF components to a dataset, use the add_component method:

from pycldf import StructureDataset, term_uri

dataset = StructureDataset.in_dir('mydataset')
dataset.add_component('ParameterTable')
dataset.write(
    ValueTable=[{'ID': '1', 'Language_ID': 'abc', 'Parameter_ID': '1', 'Value': 'x'}],
    ParameterTable=[{'ID': '1', 'Name': 'Grammatical Feature'}])

It is also possible to add generic tables:

dataset.add_table('contributors.csv', term_uri('id'), term_uri('name'))

which can also be linked to other tables:

dataset.add_columns('ParameterTable', 'Contributor_ID')
dataset.add_foreign_key('ParameterTable', 'Contributor_ID', 'contributors.csv', 'ID')

Addressing tables and columns

Tables in a dataset can be referenced using a Dataset's __getitem__ method, passing one of

- a full CLDF Ontology URI for the corresponding component,
- the local name of the component in the CLDF Ontology,
- the url of the table.

Columns in a dataset can be referenced using a Dataset's __getitem__ method, passing a tuple (<TABLE>, <COLUMN>) where <TABLE> specifies a table as explained above and <COLUMN> is one of

- a full CLDF Ontology URI used as propertyUrl of the column,
- the name property of the column.
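The lookup rules above can be modelled with a small toy class. `ToyDataset` below is purely illustrative: it works on plain dicts, it is not pycldf's `Dataset`, and the table description is made up. It only shows how a single `__getitem__` can accept either a table specifier or a (table, column) tuple.

```python
TERMS = 'http://cldf.clld.org/v1.0/terms.rdf#'


class ToyDataset:
    """Toy model of the addressing rules; not pycldf's Dataset class."""

    def __init__(self, tables):
        self.tables = tables

    def _table(self, spec):
        for table in self.tables:
            component = table.get('component')
            # Match the table url, the component's local name or its full URI.
            if spec == table['url'] or (
                    component and spec in (component, TERMS + component)):
                return table
        raise KeyError(spec)

    def _column(self, table, spec):
        for column in table['columns']:
            # Match the column name or the full URI used as propertyUrl.
            if spec in (column['name'], column.get('propertyUrl')):
                return column
        raise KeyError(spec)

    def __getitem__(self, item):
        if isinstance(item, tuple):
            table_spec, column_spec = item
            return self._column(self._table(table_spec), column_spec)
        return self._table(item)


# Made-up table description for illustration.
ds = ToyDataset([{
    'url': 'languages.csv',
    'component': 'LanguageTable',
    'columns': [
        {'name': 'ID', 'propertyUrl': TERMS + 'id'},
        {'name': 'Name', 'propertyUrl': TERMS + 'name'},
    ],
}])
print(ds['LanguageTable']['url'])
print(ds['languages.csv', TERMS + 'name']['name'])
```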

Object oriented access to CLDF data

The pycldf.orm module implements functionality to access CLDF data via an ORM. See https://pycldf.readthedocs.io/en/latest/orm.html for details.

Accessing CLDF data via SQL

The pycldf.db module implements functionality to load CLDF data into a SQLite database. See https://pycldf.readthedocs.io/en/latest/ext_sql.html for details.
