Functionality to access and curate GeoJSON in CLDF datasets

cldfgeojson

cldfgeojson provides tools to work with geographic data structures encoded as GeoJSON in the context of CLDF datasets.

Install

pip install cldfgeojson

Creating CLDF datasets with speaker area data in GeoJSON

The functionality in cldfgeojson.create helps with adding speaker area information when creating CLDF datasets (e.g. with cldfbench).

Working around Antimeridian problems

Tools like shapely allow doing geometry with shapes derived from GeoJSON, e.g. computing intersections or centroids. But shapely treats coordinates as points in the Cartesian plane rather than on the surface of the earth. While this generally works well enough close to the equator, it fails for geometries crossing the antimeridian. To prepare GeoJSON objects for investigation with shapely, we provide a function that "moves" objects onto a (somewhat linguistically informed) Pacific-centered Cartesian plane: longitudes west of 26°W (i.e. less than -26°) are adapted by adding 360°, basically moving the interval of valid longitudes from -180°..180° to -26°..334°. While this just moves the antimeridian problems to 26°W, it is still useful because most spatial data about languages does not cross 26°W. The same cannot be said for the 180° meridian, because that longitude is crossed by the speaker area of the Austronesian family.

>>> from cldfgeojson.geojson import pacific_centered
>>> from shapely.geometry import shape
>>> p1 = shape({"type": "Point", "coordinates": [179, 0]})
>>> p2 = shape({"type": "Point", "coordinates": [-179, 0]})
>>> p1.distance(p2)
358.0
>>> p1 = shape(pacific_centered({"type": "Point", "coordinates": [179, 0]}))
>>> p2 = shape(pacific_centered({"type": "Point", "coordinates": [-179, 0]}))
>>> p1.distance(p2)
2.0
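The longitude shift described above can be sketched in plain Python. This is only an illustration of the arithmetic for Point geometries, not the implementation of pacific_centered; the helper names shift_longitude and shift_point are hypothetical.

```python
def shift_longitude(lon, cutoff=-26.0):
    """Map a longitude from -180..180 onto -26..334 by adding 360 west of the cutoff."""
    return lon + 360.0 if lon < cutoff else lon


def shift_point(geometry, cutoff=-26.0):
    """Return a copy of a GeoJSON Point with its longitude shifted (hypothetical helper)."""
    lon, lat = geometry["coordinates"][:2]
    return {"type": "Point", "coordinates": [shift_longitude(lon, cutoff), lat]}


east = shift_point({"type": "Point", "coordinates": [179, 0]})
west = shift_point({"type": "Point", "coordinates": [-179, 0]})

# After the shift, both points lie on the same side of the plane's "seam".
gap = abs(east["coordinates"][0] - west["coordinates"][0])
print(gap)
```

The two points, 358 grid units apart before the shift, end up 2 grid units apart, matching the shapely session above.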

Manipulating geo-referenced images in GeoTIFF format

The cldfgeojson.geotiff module provides functionality related to images in GeoTIFF format.

Commandline interface

cldfgeojson also provides cldfbench sub-commands. These are particularly useful to validate GeoJSON speaker areas during dataset creation/curation.

geojson.validate

The geojson.validate command can be used to make sure GeoJSON Polygon and MultiPolygon geometries for speaker areas are valid. (For a short explanation of what this validity entails, see https://postgis.net/workshops/postgis-intro/validity.html .)

The dataset used for testing this package contains one invalid geometry, which can be detected and reported by running

$ cldfbench geojson.validate tests/fixtures/dataset/
id        glottocode    reason                        fixable
--------  ------------  ----------------------------  ---------
bare1276  abcd1234      Ring Self-intersection[15 5]  True
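To illustrate what a "Ring Self-intersection" is, here is a minimal sketch in plain Python that tests whether any two edges of a polygon ring properly cross. It is not the check that geojson.validate performs (the source does not show that), it ignores touching or overlapping edges, and the function names and both example rings are made up for illustration.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a single interior point."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def ring_self_intersects(ring):
    """ring: closed list of [x, y] positions (first position equals last)."""
    edges = list(zip(ring[:-1], ring[1:]))
    return any(
        segments_cross(*edges[i], *edges[j])
        for i in range(len(edges))
        for j in range(i + 1, len(edges))
    )


# Illustrative rings: a "bow-tie" whose edges cross, and a plain square.
bow_tie = [[0, 0], [10, 10], [10, 0], [0, 10], [0, 0]]
square = [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]
print(ring_self_intersects(bow_tie), ring_self_intersects(square))
```

Edges that merely share an endpoint, as neighbouring edges of a ring always do, are not counted as crossings, because the orientation test yields 0 for the shared point.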

geojson.glottolog_distance

While speakers of languages, and thus the "language area", can move around over time, for most languages this has not happened on a large scale in the timespan since the languages were first described in the literature. Thus, comparing speaker areas reported in a dataset to the corresponding point coordinates for the languages as reported by Glottolog is a good plausibility check which will detect issues such as mistyped Glottocodes.

Such a comparison is provided by the geojson.glottolog_distance command, which computes the distances between speaker area (Multi)Polygons and Glottolog's point coordinate. To keep the implementation simple, this computation is done with shapely, which allows for "analysis of geometric objects in the Cartesian plane". Thus, distances are reported in "grid units" of a geographic coordinate system, i.e. degrees, and require some interpretation. Close to the equator, where we find the biggest linguistic diversity, one grid unit roughly equals a distance of 110 km on the globe, whereas closer to the poles it may be less, because a degree of longitude shrinks with the cosine of the latitude. A distance of 0 means that the speaker area (or its convex hull, taking into account that Glottolog's point coordinate is often chosen as some sort of midpoint in cases of spread out, disjoint speaker populations) contains the Glottolog coordinate.
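As a rough aid for interpreting grid units, the following sketch converts a distance in degrees into a range of kilometres at a given latitude. It assumes a spherical earth with about 111.32 km per degree of arc; the function names are hypothetical and not part of cldfgeojson.

```python
import math

# Approximate length of one degree of arc on a spherical earth, in km (an assumption).
KM_PER_DEGREE = 111.32


def km_per_degree_longitude(latitude):
    """Approximate east-west length in km of one degree of longitude at a latitude."""
    return KM_PER_DEGREE * math.cos(math.radians(latitude))


def grid_distance_to_km_range(distance, latitude):
    """Return (lower, upper) km estimates for a distance given in grid units.

    The lower bound assumes a purely east-west offset, the upper bound a
    purely north-south one; the true value lies somewhere in between.
    """
    return (distance * km_per_degree_longitude(latitude), distance * KM_PER_DEGREE)


# Two grid units at 60°N: roughly 111 km if east-west, roughly 223 km if north-south.
low, high = grid_distance_to_km_range(2.0, 60.0)
print(round(low), round(high))
```

The estimate is only meant for small distances, where the latitude does not change much between the two locations.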

The command needs to access the Glottolog data. To do so, a local clone or export of a specific version must be available. The path to the clone or export must be passed as value of the --glottolog option. If a full clone is available, a particular release may be selected by passing the release tag as value of the --glottolog-version option.

As an example, we can compute Glottolog distances for the speaker areas of Uralic languages as reported in the CLDF dataset derived from Rantanen et al.'s "Geographical database of the Uralic languages". Assuming this dataset is downloaded to rantanenurageo and Glottolog data is available at glottolog, we can run

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog

and get a (long) listing of the results printed to the screen. Since we are typically interested in the outliers, i.e. cases where the Glottolog coordinate is not contained in the area, we can just use grep to filter the result list:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog  | grep False
Ingrian                                0.00  False              13
Karelian                               0.00  False              16
...

But we can also make use of the --format option to create TSV output which we can then manipulate with the csvkit tools to give a better overview:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog --format tsv | csvformat -t | csvsort -c Distance | csvcut -c ID,Distance
...
KomiYazva,1.0292353508840884
EasternMari,1.6462432136183747
KarelianLivvi,2.431383058632656
TomskregionSelkupSouthernSelkup,2.4567889514058106
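If csvkit is not at hand, the same overview can be produced with Python's standard library. The sketch below assumes the TSV output has a header row with ID and Distance columns, as the csvcut call above suggests; the outliers function is hypothetical, and the first sample row is a made-up placeholder for an area that contains its Glottolog coordinate.

```python
import csv
import io

# Sample input: two rows taken from the listing above plus one made-up zero-distance row.
SAMPLE_TSV = (
    "ID\tDistance\n"
    "EasternMari\t1.6462432136183747\n"
    "PlaceholderZero\t0.0\n"
    "KomiYazva\t1.0292353508840884\n"
)


def outliers(tsv_text, threshold=0.0):
    """Return (ID, distance) pairs with distance above threshold, sorted ascending."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row["ID"], float(row["Distance"])) for row in reader]
    return sorted((r for r in rows if r[1] > threshold), key=lambda r: r[1])


for language_id, distance in outliers(SAMPLE_TSV):
    print(language_id, distance)
```

In practice the text would come from the command's standard output rather than from a string literal, and any additional columns are simply ignored.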

geojson.multipolygon_spread

The geojson.multipolygon_spread command provides another check for the plausibility of Glottocode assignments to the speaker areas reported in a dataset. Sometimes datasets assign dialect-level Glottocodes to polygons and later aggregate these polygons to compute the area for the parent language. Incorrect Glottocode assignments may then result in a language area containing one outlier polygon, i.e. one polygon which is far away from the rest of the area.

With geojson.multipolygon_spread we compute the spread of polygons which are aggregated into a single language area. High spread may be a symptom of wrong Glottocode assignment. Of course, the spread numbers need interpretation as well. For languages spoken on multiple islands in the Pacific a spread > 5 may be expected, while for languages spoken in Morobe province (Papua New Guinea) a spread > 2 already means that polygons of the area are probably separated by at least one different language area.

The options and output of geojson.multipolygon_spread are largely the same as for geojson.glottolog_distance. A row in the dataset's LanguageTable is considered to represent a language-level Glottolog languoid either if LanguageTable contains a column named Glottolog_Languoid_Level with the value language or if the glottocode column of LanguageTable specifies a language-level Glottolog languoid. (In the latter case, access to Glottolog data is necessary, see above.)
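The rule for recognising language-level rows can be sketched as follows. This is an illustration of the logic only, not the package's code: the function name is hypothetical, the row is assumed to be a dict keyed by column name with the Glottocode stored under "Glottocode", and the set language_glottocodes stands in for a lookup against Glottolog data.

```python
def is_language_level(row, language_glottocodes=frozenset()):
    """Decide whether a LanguageTable row represents a language-level languoid.

    row: dict mapping column names to values (assumed layout).
    language_glottocodes: Glottocodes known to be language-level, e.g. as
    derived from Glottolog (a stand-in for the real lookup).
    """
    # Case 1: the dataset states the level explicitly.
    if row.get("Glottolog_Languoid_Level") == "language":
        return True
    # Case 2: fall back to looking up the row's Glottocode.
    return row.get("Glottocode") in language_glottocodes


# Made-up rows and Glottocodes for illustration.
known_languages = {"abcd1234"}
rows = [
    {"ID": "a", "Glottolog_Languoid_Level": "language"},
    {"ID": "b", "Glottocode": "abcd1234"},
    {"ID": "c", "Glottocode": "efgh5678", "Glottolog_Languoid_Level": "dialect"},
]
print([r["ID"] for r in rows if is_language_level(r, known_languages)])
```

Only the second case requires Glottolog data, which is why a dataset that carries a Glottolog_Languoid_Level column can be checked without it.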

geojson.geojson

While most of the GeoJSON data that comes with CLDF datasets can be loaded as such directly in tools like QGIS, it is sometimes useful to inspect only subsets of the data. The geojson.geojson command creates GeoJSON representations of configurable subsets of the speaker areas reported in a dataset. This command is intended to be used in tandem with the validation commands above. I.e. the output of the commands above can be manipulated, filtered and pruned to a simple list of language IDs which can then serve as input to geojson.geojson.

A full example of this workflow would look as follows:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog --format tsv | \
csvformat -t | \
csvgrep -c Distance -i -r"^0" | \
csvcut -c ID | csvformat -E | \
cldfbench geojson.geojson rantanenurageo/cldf -

The resulting GeoJSON can be viewed via https://geojson.io/, where the point markers locate the Glottolog coordinates and the polygons represent the speaker areas reported in the dataset.

Other commands

leaflet.draw

This package contains the leaflet.draw plugin in the form of data:// URLs in a mako template. leaflet.draw is distributed under an MIT license:

Copyright 2012-2017 Jon West, Jacob Toye, and Leaflet

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
