Functionality to access and curate GeoJSON in CLDF datasets

cldfgeojson

cldfgeojson provides tools to work with geographic data structures encoded as GeoJSON in the context of CLDF datasets.

Install

pip install cldfgeojson

Creating CLDF datasets with speaker area data in GeoJSON

The functionality in cldfgeojson.create helps adding speaker area information when creating CLDF datasets (e.g. with cldfbench).

Working around Antimeridian problems

Tools like shapely allow doing geometry with shapes derived from GeoJSON, e.g. computing intersections or centroids. But shapely considers coordinates to be in the cartesian plane rather than on the surface of the earth. While this generally works well enough close to the equator, it fails for geometries crossing the antimeridian. To prepare GeoJSON objects for investigation with shapely, we provide a function that "moves" objects onto a (somewhat linguistically informed) pacific-centered cartesian plane: longitudes west of 26°W (i.e. less than -26°) are adapted by adding 360°, basically moving the interval of valid longitudes from -180°..180° to -26°..334°. While this just moves the antimeridian problems to 26°W, it's still useful because most spatial data about languages does not cross 26°W. The same cannot be said for 180°E, because this longitude is crossed by the speaker area of the Austronesian family.

>>> from cldfgeojson.geojson import pacific_centered
>>> from shapely.geometry import shape
>>> p1 = shape({"type": "Point", "coordinates": [179, 0]})
>>> p2 = shape({"type": "Point", "coordinates": [-179, 0]})
>>> p1.distance(p2)
358.0
>>> p1 = shape(pacific_centered({"type": "Point", "coordinates": [179, 0]}))
>>> p2 = shape(pacific_centered({"type": "Point", "coordinates": [-179, 0]}))
>>> p1.distance(p2)
2.0
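The longitude shift described above can be illustrated in plain Python. This is a minimal sketch and not the implementation of pacific_centered: the helper names shift_longitude and shift_point are hypothetical, and only Point geometries are handled.

```python
def shift_longitude(lon, cutoff=-26.0):
    """Map a longitude from -180..180 to the pacific-centered interval -26..334."""
    # Longitudes west of the cutoff (26°W) are moved east by a full turn.
    return lon + 360.0 if lon < cutoff else lon


def shift_point(geometry):
    """Return a copy of a GeoJSON Point with its longitude shifted (hypothetical helper)."""
    lon, lat = geometry["coordinates"][:2]
    return {"type": "Point", "coordinates": [shift_longitude(lon), lat]}


p1 = shift_point({"type": "Point", "coordinates": [179, 0]})
p2 = shift_point({"type": "Point", "coordinates": [-179, 0]})
# After the shift both points lie next to each other in the plane.
print(p1["coordinates"], p2["coordinates"])
```

Points east of 26°W are left untouched, so data that does not come near the antimeridian is unaffected by the shift.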

Manipulating geo-referenced images in GeoTIFF format

The cldfgeojson.geotiff module provides functionality related to images in GeoTIFF format.

Commandline interface

cldfgeojson also provides cldfbench sub-commands. These are particularly useful to validate GeoJSON speaker areas during dataset creation/curation.

geojson.validate

The geojson.validate command can be used to make sure GeoJSON Polygon and MultiPolygon geometries for speaker areas are valid. (For a short explanation of what this validity entails, see https://postgis.net/workshops/postgis-intro/validity.html .)

The dataset used for testing this package contains one invalid geometry, which can be detected and reported by running

$ cldfbench geojson.validate tests/fixtures/dataset/
id        glottocode    reason                        fixable
--------  ------------  ----------------------------  ---------
bare1276  abcd1234      Ring Self-intersection[15 5]  True

geojson.glottolog_distance

While speakers of languages, and thus the "language area", can move around over time, for most languages and the timespan since the languages have been described in the literature this has not happened on a large scale. Thus, comparing speaker areas reported in a dataset to the corresponding point coordinates for the languages as reported by Glottolog is a good plausibility check which will detect issues such as mistyped Glottocodes, etc.

Such a comparison is provided by the geojson.glottolog_distance command, which computes the distances between speaker area (Multi)Polygons and Glottolog's point coordinate. To keep the implementation simple, this computation is done with shapely, which allows for "analysis of geometric objects in the Cartesian plane". Thus, distances are reported in "grid units" of a geographic coordinate system and require some interpretation. Close to the equator, where we find the biggest linguistic diversity, one grid unit roughly equals a distance of 110 km on the globe, whereas closer to the poles a grid unit in east-west direction covers less. A distance of 0 means that the speaker area (or its convex hull, taking into account that Glottolog's point coordinate is often chosen as some sort of midpoint in cases of spread out, disjoint speaker populations) contains the Glottolog coordinate.
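To get a feeling for how grid units translate to kilometres, the following sketch approximates the length of one degree at a given latitude. It assumes a spherical earth; the function name km_per_grid_unit and the constant are illustrative and not part of cldfgeojson.

```python
import math

# Approximate length of one degree along the equator (40075 km / 360).
KM_PER_DEGREE_AT_EQUATOR = 111.32


def km_per_grid_unit(latitude):
    """Approximate (east-west, north-south) length in km of one degree at a latitude."""
    # Meridians converge towards the poles, so east-west degrees shrink with cos(latitude).
    east_west = KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude))
    # On a sphere, north-south degrees have the same length everywhere.
    north_south = KM_PER_DEGREE_AT_EQUATOR
    return east_west, north_south


for lat in (0, 45, 60):
    ew, ns = km_per_grid_unit(lat)
    print(f"{lat:>2} deg: east-west {ew:6.1f} km, north-south {ns:6.1f} km")
```

So a reported distance of 2 may correspond to roughly 220 km near the equator, but to considerably less for an east-west offset at the latitudes of, say, the Uralic languages.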

The command needs to access the Glottolog data. To do so, a local clone or export of a specific version must be available. The path to the clone or export must be passed as value of the --glottolog option. If a full clone is available, a particular release may be selected by passing the release tag as value of the --glottolog-version option.

As an example, we can compute Glottolog distances for the speaker areas of Uralic languages as reported in the CLDF dataset derived from Rantanen et al.'s "Geographical database of the Uralic languages". Assuming this dataset is downloaded to rantanenurageo and Glottolog data is available at glottolog, we can run

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog

and get a (long) listing of the results printed to the screen. Since we are typically interested in the outliers, i.e. cases where the Glottolog coordinate is not contained in the area, we can just use grep to filter the result list:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog  | grep False
Ingrian                                0.00  False              13
Karelian                               0.00  False              16
...

But we can also make use of the --format option to create TSV output which we can then manipulate with the csvkit tools to give a better overview:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog --format tsv | csvformat -t | csvsort -c Distance | csvcut -c ID,Distance
...
KomiYazva,1.0292353508840884
EasternMari,1.6462432136183747
KarelianLivvi,2.431383058632656
TomskregionSelkupSouthernSelkup,2.4567889514058106
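If csvkit is not at hand, the same sort-and-cut step can be sketched with Python's standard library. The TSV excerpt below is a hypothetical stand-in for the command's output: only the ID and Distance column names are taken from the example above, and the real output may contain further columns.

```python
import csv
import io

# Hypothetical excerpt of the TSV output, reduced to the two columns used here.
tsv_output = (
    "ID\tDistance\n"
    "KarelianLivvi\t2.431383058632656\n"
    "KomiYazva\t1.0292353508840884\n"
    "EasternMari\t1.6462432136183747\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_output), delimiter="\t"))
# Sort numerically; sorting the raw strings would put "10.0" before "2.0".
rows.sort(key=lambda row: float(row["Distance"]))

for row in rows:
    print(f'{row["ID"]},{row["Distance"]}')
```

In practice the string would be replaced by the captured output of the command, e.g. read from sys.stdin.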

geojson.multipolygon_spread

The geojson.multipolygon_spread command provides another check for the plausibility of Glottocode assignments to the speaker areas reported in a dataset. Sometimes datasets assign dialect-level Glottocodes to polygons and later aggregate these polygons to compute the area for the parent language. Incorrect Glottocode assignments may then result in a language area containing one outlier polygon, i.e. one polygon which is far away from the rest of the area.

With geojson.multipolygon_spread we compute the spread of polygons which are aggregated into a single language area. High spread may be a symptom of wrong Glottocode assignment. Of course, the spread numbers need interpretation as well. For languages spoken on multiple islands in the Pacific a spread > 5 may be expected, while for languages spoken in Morobe province (Papua New Guinea) a spread > 2 already means that polygons of the area are probably separated by at least one different language area.
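How the command computes spread is not described here. Purely to illustrate the idea of an outlier polygon, the sketch below uses one possible measure, the largest distance between the (vertex-average) centers of any two polygons of a MultiPolygon. Both the measure and the function names are assumptions and may differ from what geojson.multipolygon_spread actually does.

```python
import itertools
import math


def ring_center(ring):
    """Average of a ring's vertices; a rough stand-in for a true centroid."""
    # GeoJSON rings repeat the first vertex at the end, so drop the last one.
    points = ring[:-1]
    return (
        sum(p[0] for p in points) / len(points),
        sum(p[1] for p in points) / len(points),
    )


def spread(multipolygon):
    """Largest distance, in grid units, between the centers of any two polygons."""
    # Only the exterior ring (index 0) of each polygon is considered.
    centers = [ring_center(polygon[0]) for polygon in multipolygon["coordinates"]]
    return max(
        (math.dist(a, b) for a, b in itertools.combinations(centers, 2)),
        default=0.0,
    )


# Two unit squares, five degrees of longitude apart.
area = {
    "type": "MultiPolygon",
    "coordinates": [
        [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
        [[[5, 0], [6, 0], [6, 1], [5, 1], [5, 0]]],
    ],
}
print(spread(area))
```

Like the distances above, such a number is in grid units and needs the same kind of interpretation.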

The options and output of geojson.multipolygon_spread are largely the same as for geojson.glottolog_distance. A row in the dataset's LanguageTable is considered to represent a language-level Glottolog languoid either if LanguageTable contains a column named Glottolog_Languoid_Level with the value language or if the glottocode column of LanguageTable specifies a language-level Glottolog languoid. (In the latter case, access to Glottolog data is necessary, see above.)
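The rule for recognizing language-level rows can be sketched as follows. Rows are modelled as plain dicts; the column name Glottocode and the set of language-level Glottocodes (which would have to be derived from Glottolog data) are assumptions made for this illustration, as is the function name.

```python
def is_language_level(row, language_level_glottocodes=frozenset()):
    """Decide whether a LanguageTable row represents a language-level languoid."""
    # Case 1: the dataset states the level explicitly.
    if row.get("Glottolog_Languoid_Level") == "language":
        return True
    # Case 2: the Glottocode is known (from Glottolog) to be language-level.
    return row.get("Glottocode") in language_level_glottocodes


# Hypothetical rows and Glottocodes, for illustration only.
rows = [
    {"ID": "a", "Glottocode": "abcd1234", "Glottolog_Languoid_Level": "language"},
    {"ID": "b", "Glottocode": "efgh5678"},
    {"ID": "c", "Glottocode": "ijkl9012", "Glottolog_Languoid_Level": "dialect"},
]
known_languages = {"efgh5678"}

print([row["ID"] for row in rows if is_language_level(row, known_languages)])
```

Only the second case requires a lookup in Glottolog, which is why datasets providing the level column can be checked without a Glottolog clone.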

geojson.geojson

While most of the GeoJSON data that comes with CLDF datasets can be loaded as such directly in tools like QGIS, it is sometimes useful to inspect only subsets of the data. The geojson.geojson command creates GeoJSON representations of configurable subsets of the speaker areas reported in a dataset. This command is intended to be used in tandem with the validation commands above. I.e. the output of the commands above can be manipulated, filtered and pruned to a simple list of language IDs which can then serve as input to geojson.geojson.

A full example of this workflow would look as follows:

cldfbench geojson.glottolog_distance rantanenurageo/cldf --glottolog glottolog --format tsv | \
csvformat -t | \
csvgrep -c Distance -i -r"^0" | \
csvcut -c ID | csvformat -E | \
cldfbench geojson.geojson rantanenurageo/cldf -

The resulting GeoJSON can be viewed via https://geojson.io/, where the point markers locate the Glottolog coordinates and the polygons represent the speaker areas reported in the dataset.
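The filtering steps between the two cldfbench calls in the pipeline above can also be sketched in Python. The TSV excerpt is hypothetical (only the ID and Distance column names are taken from the example). Note that the inverted regular expression "^0" drops every distance whose text starts with "0", i.e. also values such as 0.52, not just exact zeros.

```python
import csv
import io
import re

# Hypothetical excerpt of the TSV output of geojson.glottolog_distance.
tsv_output = (
    "ID\tDistance\n"
    "Ingrian\t0.00\n"
    "SomeLanguage\t0.52\n"
    "KomiYazva\t1.0292353508840884\n"
    "EasternMari\t1.6462432136183747\n"
)

reader = csv.DictReader(io.StringIO(tsv_output), delimiter="\t")
# Equivalent of: csvgrep -c Distance -i -r "^0" | csvcut -c ID | csvformat -E
selected_ids = [
    row["ID"] for row in reader if not re.search(r"^0", row["Distance"])
]

# One ID per line, without a header row, ready to be piped into geojson.geojson.
print("\n".join(selected_ids))
```

A numeric threshold, e.g. float(row["Distance"]) > 0, would be an alternative if small non-zero distances should be inspected as well.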

Other commands

leaflet.draw

This package contains the leaflet.draw plugin in the form of data:// URLs in a mako template. leaflet.draw is distributed under an MIT license:

Copyright 2012-2017 Jon West, Jacob Toye, and Leaflet

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
