
A package to extract geospatial extent from files and directories


geoextent

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Python library for extracting geospatial and temporal extents from files and directories.



Installation

pip install geoextent

Requirements: Python 3.10+ and GDAL 3.11.x

See the installation guide for system dependencies and Docker setup.

Quick Start

Command Line

# Extract from a file
geoextent -b -t tests/testdata/geojson/muenster_ring_zeit.geojson

# Extract from research repository
python -m geoextent -b -t https://doi.org/10.5281/zenodo.4593540

# Extract merged bbox from multiple local files
geoextent -b -t tests/testdata/geojson/muenster_ring_zeit.geojson tests/testdata/csv/cities_NL.csv

# Extract from multiple repositories (returns merged geometry)
python -m geoextent -b 10.5281/zenodo.123 10.25532/OPARA-456

# Extract convex hull from multiple Wikidata items and open in geojson.io.
# --convex-hull keeps the GeoJSON payload under the 150 KB URL-fragment limit
# of the geojsonio wrapper; the anonymous-gist fallback for larger payloads
# is no longer reachable since GitHub requires auth for gist creation.
# See the text-extraction guide for details.
python -m geoextent -b --convex-hull --geojsonio Q64 Q35 Q60786916

# Parallel extraction from a directory (auto-detect CPU cores)
geoextent -p -b -t path/to/geodata_directory

# Parallel extraction with 4 workers
geoextent -p 4 -b -t path/to/geodata_directory

# Extract place names from free text — spaCy NER + Nominatim by default,
# no API key required. Install the optional extra and English model once:
#   pip install geoextent[nlp] && python -m spacy download en_core_web_sm
geoextent -b --text "Field campaigns in Berlin and Paris"
echo "Workshops in Tokyo and London" | geoextent -b -
geoextent -b notes.md

# Keep the highest-ranked gazetteer match instead of dropping ambiguous names
geoextent -b --ner-ambiguity top --text "Field campaigns in Berlin and Paris"

# Administrative boundaries: Nominatim returns the polygon of areal features,
# so a state name resolves to its bounding polygon rather than a centroid.
geoextent -b --ner-ambiguity top --text "Field campaign in Saxony"
# Force the centroid instead with --place-geometry point
geoextent -b --ner-ambiguity top --place-geometry point --text "Field campaign in Saxony"

# Extract a temporal extent from text — calendar dates, decades, centuries,
# ranges, and named geological time periods (ICS GTS2020 bundled gazetteer)
geoextent -t --text "Monitoring ran between 2010 and 2015"
# → "tbox": ["2010-01-01", "2015-12-31"]
geoextent -t --text "Sediment cores from the Holocene"
# → "tbox": ["-9750-01-01", "1950-01-01"]  (signed ISO 8601: years before 1 BCE
#    are prefixed with `-`; deep-time periods like the Mesozoic produce
#    long-year strings such as "-251900050-01-01")
geoextent -b -t --text "Pleistocene cores near Berlin re-surveyed on 2024-05-12"

# Show the source text with matched place names and periods highlighted
geoextent -b -t --annotate brackets \
  --text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12"
# → ...JSON...
# → ---annotated source (brackets)---
# → Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; resurvey on [[2024-05-12|date]]

# Disable text extraction (e.g. when processing directories of structured
# data and you don't want README.md to be NER-ed)
geoextent -b -t --text-method none path/to/data_dir
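
Python's datetime.date only covers years 1 through 9999, so the signed and long-year tbox strings shown above cannot be read with date.fromisoformat. Below is a minimal sketch of one way to split them into integers. The helper names parse_signed_iso_date and span_in_years are hypothetical and not part of geoextent, and the code assumes the "[-]YYYY…-MM-DD" shape seen in the examples:

```python
import re

# Hypothetical helpers, not part of geoextent. Assumes the "[-]YYYY...-MM-DD"
# shape shown in the tbox examples (at least four year digits, optional sign).
_SIGNED_DATE = re.compile(r"^(-?\d{4,})-(\d{2})-(\d{2})$")


def parse_signed_iso_date(value: str) -> tuple[int, int, int]:
    """Split a signed, possibly long-year ISO 8601 date into (year, month, day)."""
    match = _SIGNED_DATE.match(value)
    if match is None:
        raise ValueError(f"not a signed ISO 8601 date: {value!r}")
    year, month, day = (int(part) for part in match.groups())
    return year, month, day


def span_in_years(tbox: list[str]) -> int:
    """Difference between the end year and the start year of a tbox."""
    start_year = parse_signed_iso_date(tbox[0])[0]
    end_year = parse_signed_iso_date(tbox[1])[0]
    return end_year - start_year


print(parse_signed_iso_date("-9750-01-01"))                # (-9750, 1, 1)
print(span_in_years(["-9750-01-01", "1950-01-01"]))        # 11700
```

Because signed ISO 8601 years use astronomical numbering (which includes a year 0), plain integer subtraction gives the span directly.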

For each matched place / date / period, geoextent also emits standoff char_start / char_end offsets into the (NFC-normalised) source so external tools can highlight matches independently:

from geoextent.lib import extent
result = extent.from_text("Sediment cores in Berlin span the Holocene.",
                          bbox=True, tbox=True,
                          ner_ambiguity="top")
src = result["source_text"]
for rec in result["place_names"] + result["date_entities"]:
    s, e = rec["char_start"], rec["char_end"]
    print(f"{rec.get('kind', 'place'):6} {src[s:e]!r} {rec.get('gazetteer_url') or rec.get('start')}")

See the text-extraction guide for examples and gotchas, or the highlighting guide for the offset contract and a JS/Java re-encoding recipe.
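
Since the offsets index into the NFC-normalised text, a consumer has to normalise its own copy the same way before slicing. JavaScript and Java strings count UTF-16 code units rather than code points, so offsets may need re-encoding there. The following is a rough sketch of that conversion, assuming the offsets are Python str (code point) indices. The helper to_utf16_offsets is hypothetical, and the highlighting guide remains the authority on the actual contract:

```python
import unicodedata


def to_utf16_offsets(text: str, start: int, end: int) -> tuple[int, int]:
    """Convert code-point offsets into UTF-16 code-unit offsets (hypothetical helper)."""

    def units(prefix: str) -> int:
        # Every UTF-16 code unit is two bytes in the BOM-less "utf-16-le" encoding.
        return len(prefix.encode("utf-16-le")) // 2

    return units(text[:start]), units(text[:end])


# Decomposed "é" (e + combining accent) plus an emoji outside the Basic Multilingual Plane.
raw = "Cafe\u0301 \U0001F30D near Berlin"
src = unicodedata.normalize("NFC", raw)  # normalise first, as the offsets assume NFC

start = src.index("Berlin")
end = start + len("Berlin")
print((start, end), to_utf16_offsets(src, start, end))  # (12, 18) (13, 19)
```

The emoji occupies one code point but two UTF-16 code units, which is why the converted offsets shift by one.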

See the CLI guide for all options.
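
The --annotate brackets output in the CLI examples wraps each match as [[text|kind]]. If that annotated line needs to be consumed by a script, a regular expression is one option. This is a sketch based only on the sample line above: the function names are hypothetical, and escaping rules for brackets or pipes inside matched text are not handled:

```python
import re

# Assumption: annotations look like [[matched text|kind]], as in the CLI example.
_ANNOTATION = re.compile(r"\[\[(?P<text>[^\[\]|]+)\|(?P<kind>[^\[\]|]+)\]\]")


def read_annotations(annotated: str) -> list[tuple[str, str]]:
    """Return (matched text, kind) pairs in order of appearance."""
    return [(m["text"], m["kind"]) for m in _ANNOTATION.finditer(annotated)]


def strip_annotations(annotated: str) -> str:
    """Recover the plain source text by keeping only the matched text."""
    return _ANNOTATION.sub(lambda m: m["text"], annotated)


line = (
    "Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; "
    "resurvey on [[2024-05-12|date]]"
)
print(read_annotations(line))
print(strip_annotations(line))
```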

Python API

import geoextent.lib.extent as geoextent

# From file
result = geoextent.fromFile('data.geojson', bbox=True, tbox=True)

# From directory
result = geoextent.fromDirectory('data/', bbox=True, tbox=True)

# From directory with parallel extraction (0 = auto-detect CPU cores)
result = geoextent.fromDirectory('data/', bbox=True, tbox=True, workers=0)

# From repository (single or multiple)
result = geoextent.fromRemote('10.5281/zenodo.4593540', bbox=True)

identifiers = ['10.5281/zenodo.4593540', '10.25532/OPARA-581']
result = geoextent.fromRemote(identifiers, bbox=True)
print(result['bbox'])  # Merged bounding box covering all resources

See the API documentation and examples.
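
Conceptually, a merged bounding box is the smallest axis-aligned rectangle that contains every input box. The sketch below illustrates that idea only and is not geoextent's implementation: merge_bboxes is a hypothetical name, the coordinates are made up, and each box is assumed to be [min_x, min_y, max_x, max_y] in a shared CRS without antimeridian crossing. Check the API documentation for the axis order geoextent actually returns.

```python
from typing import Iterable, Sequence


def merge_bboxes(bboxes: Iterable[Sequence[float]]) -> list[float]:
    """Union of axis-aligned boxes given as [min_x, min_y, max_x, max_y]."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("need at least one bounding box")
    # Transpose the boxes so each coordinate position can be reduced on its own.
    min_xs, min_ys, max_xs, max_ys = zip(*boxes)
    return [min(min_xs), min(min_ys), max(max_xs), max(max_ys)]


# Illustrative values, not real extraction results.
print(merge_bboxes([[7.6, 51.9, 7.7, 52.0], [4.3, 51.2, 6.6, 53.2]]))
# [4.3, 51.2, 7.7, 53.2]
```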

What Can I Do With geoextent?

  • Extract Spatial Extents - Get bounding boxes or convex hulls from geospatial files
  • Process Research Data - Extract extents from Zenodo, Figshare, Dryad, PANGAEA, OSF, DataONE, SEANOE, UKCEH, GBIF, DEIMS-SDR, NFDI4Earth, GitHub, GitLab, any STAC catalog, and more
  • Batch Processing - Process directories or multiple repositories in one call
  • Add Location Context - Automatic placename lookup for your data
  • Flexible Output - Export as GeoJSON, WKT, or WKB for use in other tools
  • Interactive Visualization - Open extracted extents in geojson.io with one command
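
To illustrate what the GeoJSON and WKT forms of a bounding box look like, here is a small standard-library sketch. It does not call geoextent; bbox_to_wkt and bbox_to_geojson are hypothetical helpers, the coordinates are made up, and the [min_x, min_y, max_x, max_y] ordering is an assumption:

```python
import json


def _ring(bbox):
    """Closed, counter-clockwise exterior ring for an axis-aligned box."""
    min_x, min_y, max_x, max_y = bbox
    return [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first vertex to close the ring
    ]


def bbox_to_wkt(bbox) -> str:
    coords = ", ".join(f"{x} {y}" for x, y in _ring(bbox))
    return f"POLYGON (({coords}))"


def bbox_to_geojson(bbox) -> dict:
    return {"type": "Polygon", "coordinates": [[list(pt) for pt in _ring(bbox)]]}


box = [4.3, 51.2, 7.7, 53.2]  # illustrative values
print(bbox_to_wkt(box))
print(json.dumps(bbox_to_geojson(box)))
```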


Development

This project was developed as part of the DFG-funded research project Opening Reproducible Research (o2r, https://o2r.info).

# Install dev, test, and docs dependencies (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[dev,test,docs]"

# Run tests (parallel execution enabled by default with -n auto)
pytest

# Run tests with specific number of workers
pytest -n 4

# Disable parallel execution for debugging
pytest -n 0

# Format code
black geoextent/ tests/

# Install the git pre-commit hooks
pre-commit install

See the development guide for detailed instructions.

Showcase Notebooks

Interactive Jupyter notebooks demonstrating geoextent are available in the showcase/ directory.

To run the notebooks:

cd showcase
pip install -r requirements.txt
pip install -e ..  # install geoextent from local checkout
jupyter lab

Contributing

Contributions are welcome! Please use the issue tracker to report bugs or suggest features, and submit pull requests for code or documentation improvements.

Citation

If you use geoextent in your research, please cite:

Nüst, Daniel; Garzón, Sebastian and Qamaz, Yousef. (2021, May 11). o2r-project/geoextent (Version v0.7.1). Zenodo. https://doi.org/10.5281/zenodo.3925693

License

This software is published under the MIT license. See the LICENSE file for details.

This documentation is published under a Creative Commons CC0 1.0 Universal License.

Bundled third-party material

  • geoextent/lib/data/periods.json — the named-time-period gazetteer used by the text/NER source. Derived from the International Chronostratigraphic Chart (ICS / IUGS, GTS2020 vocabulary), distributed by CGI-IUGS at https://github.com/CGI-IUGS/timescale-data and dedicated to the public domain under CC0-1.0 (https://creativecommons.org/publicdomain/zero/1.0/). The file embeds the upstream commit SHA, build timestamp, and full attribution string in its metadata block; run geoextent --list-periods to read it.
  • The DOI regex and helper functions in geoextent/lib/helpfunctions.py are derived from idutils (© 2015-2018 CERN; © 2018 Alan Rubin) under BSD-3-Clause, as noted inline.
