A cldfbench plugin to create visualizations of CLDF datasets
cldfviz
Python library providing tools to visualize CLDF datasets.
Install
Run
pip install cldfviz
If you want to create maps in image formats (PNG, JPG, PDF), the cartopy package is needed, which will be installed with
pip install cldfviz[cartopy]
Note: Since cartopy has quite a few system-level requirements, installation may be somewhat tricky. Should problems arise, https://scitools.org.uk/cartopy/docs/v0.15/installing.html may help.
CLI
cldfviz is implemented as a cldfbench plugin, i.e. it provides subcommands for the cldfbench command. After installation you should see subcommands with a cldfviz. prefix listed when running
cldfbench -h
cldfviz.map
A common way to visualize data from a CLDF StructureDataset is as "dots on a map", i.e. as WALS-like geographic maps.
This can be done using the cldfviz.map command. If you need to look up geo-coordinates for languages in Glottolog (because the dataset you are interested in does not provide coordinates, but has Glottocodes), this command needs
- access to a local clone or export of the glottolog/glottolog repository,
- Glottocodes for all languages in the set, either given as languageReference in the ValueTable or as glottocode in the LanguageTable.
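Whether every language has a Glottocode can be checked up front. The following is a minimal sketch, not part of cldfviz: it assumes a languages table with ID and Glottocode columns (as in the WALS languages.csv used below), and the helper name and the sample rows are made up for illustration.

```python
import csv
import io


def languages_without_glottocode(csv_file, id_col="ID", glottocode_col="Glottocode"):
    """Return the IDs of all rows whose Glottocode cell is missing or empty.

    The column names are assumptions; adjust them to the dataset at hand.
    """
    reader = csv.DictReader(csv_file)
    return [
        row[id_col]
        for row in reader
        if not (row.get(glottocode_col) or "").strip()
    ]


# Tiny in-memory stand-in for a languages.csv file (illustrative rows only).
sample = io.StringIO(
    "ID,Glottocode\n"
    "aar,aari1239\n"
    "xyz,\n"
)
missing = languages_without_glottocode(sample)
print(missing)
```

For a real dataset, pass an open file instead, e.g. open("languages.csv", newline="", encoding="utf-8"). An empty result means the Glottocode requirement is met.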
We'll explain the usage of the command by using it with the WALS CLDF data.
(Run cldfbench cldfviz.map -h to list all options of the command.)
You can download the WALS data, for example, using another cldfbench plugin, cldfzenodo:
cldfbench zenodo.download 10.5281/zenodo.4683137 --directory wals-2020.1/
HTML maps
With the leaflet library, we can create interactive maps which can be explored in a browser.
Running
cldfbench cldfviz.map wals-2020.1/StructureDataset-metadata.json --base-layer Esri_WorldPhysical --pacific-centered
will create an HTML page map.html and open it in the browser, thus rendering an interactive map of the languages in the dataset.
For smaller language samples, it may be suitable to display the language names on the map, too. Here's WALS' feature 10B:
cldfbench cldfviz.map wals-2020.1/StructureDataset-metadata.json --parameters 10B --colormaps tol --markersize 20 --language-labels
cldfviz.map can detect and display continuous variables, too. There are no continuous features in WALS, but since cldfviz.map also works with metadata-free CLDF datasets, let's quickly create one. Using the UNIX shell tools sed and awk and the tools of the csvkit toolbox, we can run
csvgrep -c Latitude,Glottocode -r".+" wals-2020.1/languages.csv | \
csvcut -c ID,Glottocode,Latitude | \
awk '{if(NR==1){print $0",Parameter_ID"}else{print $0",latitude"}}' | \
sed 's/ID,Glottocode,Latitude,Parameter_ID/ID,Language_ID,Value,Parameter_ID/g' > values.csv
Let's break this down: The first line selects all WALS languages for which both latitude and Glottocode are given. The next line narrows the resulting CSV to just three columns - the future ID, Language_ID and Value columns of our metadata-free StructureDataset. The awk command adds a constant column Parameter_ID, and the sed command renames the columns appropriately.
The resulting CSV looks as follows:
$ head -n 4 values.csv
ID,Language_ID,Value,Parameter_ID
aar,aari1239,6,latitude
aba,abau1245,-4,latitude
abb,chad1249,13.8333333333,latitude
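The same transformation can be sketched with Python's standard csv module, which also copes with quoted or multi-line cells. This is an illustrative alternative to the shell pipeline, not something cldfviz provides; the function name latitude_values is hypothetical, and the input is assumed to have ID, Glottocode and Latitude columns as in the WALS languages.csv.

```python
import csv
import io


def latitude_values(languages_file, out_file):
    """Write an ID,Language_ID,Value,Parameter_ID table from a languages table."""
    reader = csv.DictReader(languages_file)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["ID", "Language_ID", "Value", "Parameter_ID"])
    for row in reader:
        # Keep only rows where both Latitude and Glottocode are given,
        # mirroring the csvgrep step of the pipeline.
        if row["Latitude"] and row["Glottocode"]:
            # The Glottocode becomes Language_ID, the latitude becomes Value,
            # and every row gets the constant parameter ID "latitude".
            writer.writerow([row["ID"], row["Glottocode"], row["Latitude"], "latitude"])


# In-memory sample; the second row is made up and lacks a Glottocode.
languages = io.StringIO(
    "ID,Glottocode,Latitude\n"
    "aar,aari1239,6\n"
    "xxx,,1.5\n"
)
out = io.StringIO()
latitude_values(languages, out)
print(out.getvalue())
```

With real data, the two StringIO objects would be replaced by files opened with newline="" for reading and writing.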
Now we can run
cldfbench cldfviz.map values.csv --parameters latitude --glottolog PATH/TO/GLOTTOLOG
Note that for metadata-free datasets, cldfviz.map needs to look up coordinates in Glottolog. Thus, languages may be displayed at slightly different locations than above (when the coordinates in WALS differ).
Now we could have done this in a simpler way, too, because cldfviz.map has a special option to display language properties encoded as columns in the LanguageTable as if they were parameters of the dataset. We can use this option to visualize a claim from WALS' chapter 129 that there is a "strong correlation between values [for feature 129] and latitudinal location":
cldfbench cldfviz.map wals-2020.1/cldf/StructureDataset-metadata.json --parameters 129A --colormaps tol \
--markersize 20 --language-properties Latitude --pacific-centered
As seen above, cldfviz.map can visualize multiple parameters at once. E.g. we can explore the related WALS features 129A, 130A and 130B, selecting suitable colormaps for the two boolean parameters:
cldfbench cldfviz.map wals-2020.1/cldf/StructureDataset-metadata.json --parameters 129A,130A,130B \
--colormaps base,base,tol --pacific-centered --markersize 30
Printable maps via cartopy
If cldfviz is installed with cartopy, maps similar to the ones shown above can also be created in various image formats:
cldfbench cldfviz.map wals-2020.1/StructureDataset-metadata.json --parameters 129A --colormaps tol \
--language-properties Latitude --pacific-centered \
--format jpg --width 20 --height 10 --dpi 300 --markersize 40
While these maps lack the interactivity of the HTML maps, they may be better suited for inclusion in print formats than screenshots of maps in the browser. They also provide some additional options like a choice between various map projections.
Advanced dataset pre-processing
Going one step further, we might visualize data that has been synthesized on the fly. E.g. we can visualize the AES endangerment information given in the Glottolog CLDF data for the WALS languages.
Since we will alter the WALS CLDF data, we make a copy of it first:
cp -r wals-2020.1 wals-copy
Now we extract the AES data from Glottolog ...
csvgrep -c Parameter_ID -m"aes" glottolog-cldf-4.4/cldf/values.csv |\
csvgrep -c Value -m"NA" -i |\
csvcut -c Language_ID,Parameter_ID,Code_ID > aes1.csv
... and massage it into a form that can be appended to the WALS ValueTable:
csvjoin -y 0 -c Glottocode,Language_ID wals-2020.1/cldf/languages.csv aes1.csv |\
csvcut -c Parameter_ID,Code_ID,ID |\
awk '{if(NR==1){print $0",ID"}else{print $0",aes-"NR}}' |\
sed 's/Parameter_ID,Code_ID,ID,ID/Parameter_ID,Value,Language_ID,ID/g' |\
csvcut -c ID,Language_ID,Parameter_ID,Value |\
awk '{if(NR==1){print $0",Code_ID,Comment,Source,Example_ID"}else{print $0",,,,"}}' > aes2.csv
Notes:
- The first awk call adds a unique value ID. We cannot re-use the value ID from Glottolog, because the mapping between WALS and Glottolog languages is many-to-one.
- Using awk to manipulate CSV data is somewhat fragile, since it will break if the data contains multi-line cell content. To guard against that, you may compare the row count reported by csvstat with the line count from wc -l before using awk.
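The guard described in the second note can also be sketched in Python: count the records a real CSV parser sees and compare that with the number of newline characters, which is what wc -l reports. The helper name counts is hypothetical and the sample data is made up. The sketch assumes the file ends with a newline, and it counts the header row on both sides, whereas csvstat's row count excludes it.

```python
import csv
import io


def counts(text):
    """Return (CSV records, newline characters) for the given CSV text."""
    # newline="" lets the csv module handle line breaks inside quoted cells.
    records = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))
    lines = text.count("\n")  # same number that `wc -l` would print
    return records, lines


clean = "ID,Comment\n1,plain\n"
multi = 'ID,Comment\n1,"two\nlines"\n2,plain\n'

for name, text in [("clean", clean), ("multi", multi)]:
    records, lines = counts(text)
    verdict = "safe for awk" if records == lines else "multi-line cells present"
    print(name, records, lines, verdict)
```

If the two numbers differ, at least one cell spans several lines and line-based tools like awk and sed should not be applied to the file.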
Now we append the values and a row for the ParameterTable ...
csvstack aes2.csv wals-copy/cldf/values.csv > values.csv
cp values.csv wals-copy/cldf
echo "ID,Name,Description,Chapter_ID" > aes_param.csv
echo "aes,AES,," >> aes_param.csv
csvstack aes_param.csv wals-copy/cldf/parameters.csv > parameters.csv
cp parameters.csv wals-copy/cldf
... and make sure the resulting dataset is valid:
cldf validate wals-copy/cldf/StructureDataset-metadata.json
Finally, we can plot the map:
cldfbench cldfviz.map wals-copy/cldf/StructureDataset-metadata.json --pacific-centered --colormaps seq --parameters aes
cldfviz.text
A rather traditional visualization of linguistic data is the practice of interspersing bits of data in descriptive texts, most obviously perhaps as examples formatted as Interlinear Glossed Text. Other examples of data in text include forms, either in running text or in a table.
To support this use case, the cldfviz.text command can fill data from a CLDF dataset into a markdown document, where references to CLDF data objects (rows of tables or complete tables) are marked using the markdown link format with a special URL syntax. To reference a single row:
[An arbitrary label](some/path/<component-name-or-csv-filename>#cldf:<object-id>)
To reference a whole table:
[An arbitrary label](some/path/<component-name-or-csv-filename>#cldf:__all__)
Note: Only the last component of the URL path is used to determine a CLDF component or table of the dataset, while the first part is ignored. This allows using URLs that are even somewhat functional in the unrendered document. E.g.
[Meier 2020](cldf/sources.bib#cldf:Meier2020)
will render as Meier 2020, linking to the BibTeX file when the document is simply rendered as markdown by a service like GitHub, while the enhanced document created from cldfviz.text will replace the link with the reference data expanded to a full citation according to the Unified Stylesheet for Linguistics.
Rendering of data objects is controlled with templates using the Jinja template language. Some templates can be parametrized, e.g. to choose only cognates belonging to the same cognate set from a CognateTable. These parameters can be specified as the query string of the reference URL, e.g.
[cognateset X](some/path/CognateTable?cognatesetReference=X#cldf:__all__)
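To make the URL syntax concrete, here is a minimal sketch of how such a reference link decomposes into label, table, object ID and parameters. It only illustrates the conventions described above using the standard library; it is not how cldfviz.text is implemented, and the function name parse_cldf_link is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# A markdown link: [label](url)
LINK = re.compile(r"\[(?P<label>[^\]]*)\]\((?P<url>[^)\s]+)\)")


def parse_cldf_link(markdown_link):
    """Split a CLDF reference link into its parts, or return None."""
    match = LINK.fullmatch(markdown_link)
    if not match:
        return None
    parts = urlsplit(match.group("url"))
    if not parts.fragment.startswith("cldf:"):
        return None  # an ordinary link, not a CLDF reference
    return {
        "label": match.group("label"),
        # Only the last path component identifies the component or table.
        "table": parts.path.rsplit("/", 1)[-1],
        # "__all__" stands for the whole table.
        "object_id": parts.fragment[len("cldf:"):],
        # Query string parameters are handed to the template.
        "params": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


single = parse_cldf_link("[Meier 2020](cldf/sources.bib#cldf:Meier2020)")
table = parse_cldf_link(
    "[cognateset X](some/path/CognateTable?cognatesetReference=X#cldf:__all__)"
)
print(single)
print(table)
```

Links without a cldf: fragment yield None, which matches the idea that ordinary markdown links are left untouched.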
In addition to data objects you can also specify maps to be created with cldfviz.map and included in the resulting markdown document; e.g.:
![](map.jpg?parameters=1A#cldfviz.map)
An example of a document rendered with cldfviz.text is docs/text_example/README.md: several paragraphs of WALS' chapter 21, rewritten in "CLDF markdown" and rendered by "filling in" data from the WALS CLDF dataset.