A package to extract geospatial extent from files and directories
geoextent
Python library for extracting geospatial and temporal extents from files and directories.
Key Capabilities:
- Extract spatial extents (bounding boxes, convex hulls) and temporal extents
- Support for 10+ file formats (GeoJSON, CSV, Shapefile, GeoTIFF, GeoPackage, GPX, GML, KML, FlatGeobuf, Esri File Geodatabase, LAS/LAZ point clouds) plus world files
- Plain-text inputs via spaCy named entity recognition + place and time-period gazetteers; recognises calendar dates, decade/century envelopes, ranges, and named geological periods (ICS GTS2020)
- Journal article landing-page support for OJS (with the ojsGeo plugin), Janeway (with janeway_geometadata), Pensoft, and GeoScienceWorld; reads JSON-LD spatialCoverage, Dublin Core DC.SpatialCoverage (GeoJSON / WKT), DC.box, ISO 19139 EX_GeographicBoundingBox, and ICBM / geo.position centroids, and lifts the article DOI out of the HTML head so --ext-metadata works on plain article URLs (docs)
- Direct integration with 35 research repositories (Zenodo, PANGAEA, OSF, Figshare, 4TU.ResearchData, Dryad, GFZ, RADAR, Arctic Data Center, DataONE, B2SHARE, MDI-DE, GDI-DE, NFDI4Earth, SEANOE, GeoScienceWorld, UKCEH, GBIF, DEIMS-SDR, HALO DB, GitHub, GitLab, Software Heritage, Dataverse [Harvard, DataverseNL, DataverseNO, UNC, UVA, Recherche Data Gouv, ioerDATA, heiDATA, Edmond], Pensoft, TU Dresden Opara, Senckenberg, BGR, BAW, Mendeley Data), Wikidata, any STAC catalog, and any CKAN instance (e.g. data.gov.uk, GovData.de, data.gov.au, data.gov.ie)
- Process single files, directories, or multiple repositories in one call
- Command-line interface and Python API
- Export as GeoJSON, WKT, or WKB
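To make the difference between the two spatial extent types concrete, here is a minimal sketch in plain Python. It is not geoextent's implementation (which builds on GDAL); it only illustrates the concepts on a handful of made-up coordinate pairs, using Andrew's monotone chain algorithm for the hull. The function names are hypothetical.

```python
def bounding_box(points):
    """Return [min_x, min_y, max_x, max_y] for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs), max(ys)]


def _cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise (or straight) turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Made-up sample coordinates, for illustration only.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0), (2.0, 5.0)]
print(bounding_box(points))  # [0.0, 0.0, 4.0, 5.0]
print(convex_hull(points))   # five vertices; the interior point (2.0, 1.0) is dropped
```

The bounding box always has four corners, while the hull follows the data more tightly, which is why a hull is usually the smaller geometry for irregularly shaped datasets.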
📖 Full Documentation | 📦 PyPI | 🚀 Quick Start | 📓 EarthCube 2021 Article
Installation
pip install geoextent
Requirements: Python 3.10+ and GDAL 3.11.x
See the installation guide for system dependencies and Docker setup.
Quick Start
Command Line
# Extract from a file
geoextent -b -t tests/testdata/geojson/muenster_ring_zeit.geojson
# Extract from research repository
python -m geoextent -b -t https://doi.org/10.5281/zenodo.4593540
# Extract merged bbox from multiple local files
geoextent -b -t tests/testdata/geojson/muenster_ring_zeit.geojson tests/testdata/csv/cities_NL.csv
# Extract from multiple repositories (returns merged geometry)
python -m geoextent -b 10.5281/zenodo.123 10.25532/OPARA-456
# Extract convex hull from multiple Wikidata items and open in geojson.io.
# --convex-hull keeps the GeoJSON payload under the 150 KB URL-fragment limit
# of the geojsonio wrapper; the anonymous-gist fallback for larger payloads
# is no longer reachable since GitHub requires auth for gist creation.
# See the text-extraction guide for details.
python -m geoextent -b --convex-hull --geojsonio Q64 Q35 Q60786916
# Parallel extraction from a directory (auto-detect CPU cores)
geoextent -p -b -t path/to/geodata_directory
# Parallel extraction with 4 workers
geoextent -p 4 -b -t path/to/geodata_directory
# Extract place names from free text — spaCy NER + Nominatim by default,
# no API key required. Install the optional extra and English model once:
# pip install geoextent[nlp] && python -m spacy download en_core_web_sm
geoextent -b --text "Field campaigns in Berlin and Paris"
echo "Workshops in Tokyo and London" | geoextent -b -
geoextent -b notes.md
# Keep the highest-ranked gazetteer match instead of dropping ambiguous names
geoextent -b --ner-ambiguity top --text "Field campaigns in Berlin and Paris"
# Administrative boundaries: Nominatim returns the polygon of areal features,
# so a state name resolves to its bounding polygon rather than a centroid.
geoextent -b --ner-ambiguity top --text "Field campaign in Saxony"
# Force the centroid instead with --place-geometry point
geoextent -b --ner-ambiguity top --place-geometry point --text "Field campaign in Saxony"
# Extract a temporal extent from text — calendar dates, decades, centuries,
# ranges, and named geological time periods (ICS GTS2020 bundled gazetteer)
geoextent -t --text "Monitoring ran between 2010 and 2015"
# → "tbox": ["2010-01-01", "2015-12-31"]
geoextent -t --text "Sediment cores from the Holocene"
# → "tbox": ["-9750-01-01", "1950-01-01"] (signed ISO 8601: years before 1 BCE
# are prefixed with `-`; deep-time periods like the Mesozoic produce
# long-year strings such as "-251900050-01-01")
geoextent -b -t --text "Pleistocene cores near Berlin re-surveyed on 2024-05-12"
# Show the source text with matched place names and periods highlighted
geoextent -b -t --annotate brackets \
--text "Sediment cores in Berlin span the Holocene; resurvey on 2024-05-12"
# → ...JSON...
# → ---annotated source (brackets)---
# → Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; resurvey on [[2024-05-12|date]]
# Disable text extraction (e.g. when processing directories of structured
# data and you don't want README.md to be NER-ed)
geoextent -b -t --text-method none path/to/data_dir
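The signed, long-year date strings shown in the tbox examples above cannot be parsed with Python's datetime module, which only supports years 1 to 9999. The following is a minimal sketch of how a consumer might handle them, assuming the strings always have the shape [sign]YYYY...-MM-DD as in the examples; parse_signed_date and tbox_span_years are hypothetical helpers, not part of geoextent.

```python
import re

# Optional sign, four or more year digits, then month and day.
_SIGNED_DATE = re.compile(r"^([+-]?\d{4,})-(\d{2})-(\d{2})$")


def parse_signed_date(value):
    """Split a signed ISO 8601 style date string into (year, month, day) integers."""
    match = _SIGNED_DATE.match(value)
    if match is None:
        raise ValueError(f"not a signed ISO 8601 date: {value!r}")
    year, month, day = (int(part) for part in match.groups())
    return year, month, day


def tbox_span_years(tbox):
    """Difference between the end year and the start year of a [start, end] pair."""
    start, end = (parse_signed_date(value) for value in tbox)
    return end[0] - start[0]


print(parse_signed_date("-251900050-01-01"))            # (-251900050, 1, 1)
print(tbox_span_years(["-9750-01-01", "1950-01-01"]))   # 11700
```

Keeping the year as a plain integer avoids the datetime range limit and is enough for sorting, filtering, and computing spans.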
For each matched place / date / period, geoextent also emits standoff
char_start / char_end offsets into the (NFC-normalised) source so
external tools can highlight matches independently:
from geoextent.lib import extent

result = extent.from_text("Sediment cores in Berlin span the Holocene.",
                          bbox=True, tbox=True,
                          ner_ambiguity="top")
src = result["source_text"]
# Slice the source with each record's standoff offsets to recover the matched text
for rec in result["place_names"] + result["date_entities"]:
    s, e = rec["char_start"], rec["char_end"]
    print(f"{rec.get('kind', 'place'):6} {src[s:e]!r} → {rec.get('gazetteer_url') or rec.get('start')}")
See the text-extraction guide for examples and gotchas, or the highlighting guide for the offset contract and a JS/Java re-encoding recipe.
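Why a re-encoding step is needed for JavaScript or Java consumers can be sketched as follows. This assumes the offsets count Unicode code points, as Python string indices do, while JavaScript and Java strings index UTF-16 code units; the highlighting guide remains the authoritative description of the offset contract. to_utf16_offsets is a hypothetical helper, and the sample text is made up.

```python
import unicodedata


def to_utf16_offsets(text, char_start, char_end):
    """Convert code-point offsets into UTF-16 code-unit offsets."""
    def utf16_units(prefix):
        # Each UTF-16 code unit is two bytes; characters outside the BMP take two units.
        return len(prefix.encode("utf-16-le")) // 2

    return utf16_units(text[:char_start]), utf16_units(text[:char_end])


# Offsets refer to the NFC-normalised text, so normalise before slicing.
source = unicodedata.normalize("NFC", "🌍 Cores in Berlin")
start = source.index("Berlin")
end = start + len("Berlin")

print((start, end))                          # (11, 17)
print(to_utf16_offsets(source, start, end))  # (12, 18)
```

The emoji occupies one code point but two UTF-16 code units, so every offset after it shifts by one on the JavaScript or Java side.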
See the CLI guide for all options.
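The bracket-annotated line printed by --annotate brackets can be post-processed with a small regular expression. The sketch below assumes the [[text|kind]] shape seen in the example output above and nothing more (for instance, it does not handle matched text that itself contains brackets or a pipe); parse_bracket_annotations is a hypothetical helper.

```python
import re

# Matches [[text|kind]] where neither part contains brackets or a pipe.
_ANNOTATION = re.compile(r"\[\[(?P<text>[^\[\]|]+)\|(?P<kind>[^\[\]|]+)\]\]")


def parse_bracket_annotations(annotated):
    """Return (plain_text, [(matched_text, kind), ...]) for an annotated line."""
    matches = [(m.group("text"), m.group("kind")) for m in _ANNOTATION.finditer(annotated)]
    plain = _ANNOTATION.sub(lambda m: m.group("text"), annotated)
    return plain, matches


line = ("Sediment cores in [[Berlin|place]] span the [[Holocene|period]]; "
        "resurvey on [[2024-05-12|date]]")
plain, matches = parse_bracket_annotations(line)
print(plain)
print(matches)
```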
Python API
import geoextent.lib.extent as geoextent
# From file
result = geoextent.fromFile('data.geojson', bbox=True, tbox=True)
# From directory
result = geoextent.fromDirectory('data/', bbox=True, tbox=True)
# From directory with parallel extraction (0 = auto-detect CPU cores)
result = geoextent.fromDirectory('data/', bbox=True, tbox=True, workers=0)
# From repository (single or multiple)
result = geoextent.fromRemote('10.5281/zenodo.4593540', bbox=True)
identifiers = ['10.5281/zenodo.4593540', '10.25532/OPARA-581']
result = geoextent.fromRemote(identifiers, bbox=True)
print(result['bbox']) # Merged bounding box covering all resources
See the API documentation and examples.
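What "merged bounding box" means can be sketched in a few lines. This is an illustration rather than geoextent's own code, and it assumes each box is a four-element list of two minimum values followed by two maximum values in a consistent axis order, with no box crossing the antimeridian. merge_bboxes and the sample coordinates are hypothetical.

```python
def merge_bboxes(bboxes):
    """Return the smallest box [min_a, min_b, max_a, max_b] that covers all inputs."""
    if not bboxes:
        raise ValueError("need at least one bounding box")
    return [
        min(box[0] for box in bboxes),
        min(box[1] for box in bboxes),
        max(box[2] for box in bboxes),
        max(box[3] for box in bboxes),
    ]


# Two made-up boxes, for illustration only.
boxes = [
    [7.6, 51.9, 7.7, 52.0],
    [4.3, 51.8, 6.9, 53.2],
]
print(merge_bboxes(boxes))  # [4.3, 51.8, 7.7, 53.2]
```

Because the merge only takes minima and maxima, the result can cover large areas that none of the individual resources touch; a convex hull is the tighter alternative when that matters.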
What Can I Do With geoextent?
- Extract Spatial Extents - Get bounding boxes or convex hulls from geospatial files
- Process Research Data - Extract extents from Zenodo, Figshare, Dryad, PANGAEA, OSF, DataONE, SEANOE, UKCEH, GBIF, DEIMS-SDR, NFDI4Earth, GitHub, GitLab, any STAC catalog, and more
- Batch Processing - Process directories or multiple repositories in one call
- Add Location Context - Automatic placename lookup for your data
- Flexible Output - Export as GeoJSON, WKT, or WKB for use in other tools
- Interactive Visualization - Open extracted extents in geojson.io with one command
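For readers unfamiliar with the output formats named above, the following sketch shows how one bounding box maps to a WKT string and a GeoJSON geometry. It uses only the standard library and does not reflect how geoextent builds its output; the [min_x, min_y, max_x, max_y] order and the helper names are assumptions made for the example.

```python
import json


def bbox_to_ring(bbox):
    """Closed, counter-clockwise ring of corner points for [min_x, min_y, max_x, max_y]."""
    min_x, min_y, max_x, max_y = bbox
    return [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first corner to close the ring
    ]


def bbox_to_wkt(bbox):
    """WKT polygon for the box."""
    coords = ", ".join(f"{x} {y}" for x, y in bbox_to_ring(bbox))
    return f"POLYGON (({coords}))"


def bbox_to_geojson(bbox):
    """GeoJSON Polygon geometry (a plain dict) for the box."""
    return {"type": "Polygon", "coordinates": [[list(p) for p in bbox_to_ring(bbox)]]}


bbox = [4.3, 51.8, 7.7, 53.2]  # made-up values
print(bbox_to_wkt(bbox))
print(json.dumps(bbox_to_geojson(bbox)))
```

WKB carries the same geometry in binary form and is normally produced by a geometry library rather than by hand.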
Documentation
- Quick Start Guide - Get started in minutes
- Installation Guide - System dependencies, Docker setup
- Examples - Common usage patterns with code
- CLI Reference - Command-line options
- Python API - Function signatures and parameters
- Core Features - Essential features for everyday use
- Advanced Features - Specialized options
- Content Providers - Repository integration details
- Supported Formats - File format details
- Development Guide - Contributing and testing
Development
This project was developed as part of the DFG-funded research project Opening Reproducible Research (o2r, https://o2r.info).
# Install dev and test dependencies
pip install -e .[dev,test,docs]
# Run tests (parallel execution enabled by default with -n auto)
pytest
# Run tests with specific number of workers
pytest -n 4
# Disable parallel execution for debugging
pytest -n 0
# Format code
black geoextent/ tests/
pre-commit install
See the development guide for detailed instructions.
Showcase Notebooks
Interactive Jupyter notebooks demonstrating geoextent are available in the showcase/ directory:
- NFDI4Earth Knowledge Hub × geoextent: Queries the NFDI4Earth Knowledge Hub SPARQL endpoint to map NFDI4Earth-labelled and harvested repositories to geoextent providers, analyses dataset spatial/temporal metadata coverage, and demonstrates live extraction with geoextent.fromRemote().
- Exploring Research Data Repositories with geoextent: EarthCube 2021 case study analysing Zenodo records.
To run the notebooks:
cd showcase
pip install -r requirements.txt
pip install -e .. # install geoextent from local checkout
jupyter lab
Contributing
Contributions are welcome! Please use the issue tracker to report bugs or suggest features, and submit pull requests for code or documentation improvements.
Citation
If you use geoextent in your research, please cite:
Nüst, Daniel; Garzón, Sebastian and Qamaz, Yousef. (2021, May 11). o2r-project/geoextent (Version v0.7.1). Zenodo. https://doi.org/10.5281/zenodo.3925693
License
This software is published under the MIT license. See the LICENSE file for details.
This documentation is published under a Creative Commons CC0 1.0 Universal License.
Bundled third-party material
- geoextent/lib/data/periods.json: the named-time-period gazetteer used by the text/NER source. Derived from the International Chronostratigraphic Chart (ICS / IUGS, GTS2020 vocabulary), distributed by CGI-IUGS at https://github.com/CGI-IUGS/timescale-data and dedicated to the public domain under CC0-1.0 (https://creativecommons.org/publicdomain/zero/1.0/). The file embeds the upstream commit SHA, build timestamp, and full attribution string in its metadata block; run geoextent --list-periods to read it.
- The DOI regex and helper functions in geoextent/lib/helpfunctions.py are derived from idutils (© 2015-2018 CERN; © 2018 Alan Rubin) under BSD-3-Clause, as noted inline.
File details
Details for the file geoextent-0.13.0.tar.gz.
File metadata
- Download URL: geoextent-0.13.0.tar.gz
- Upload date:
- Size: 17.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1063b368de0928503b42998c6b9edfd18dca512872c2290b4fca9d7f468e1974 |
| MD5 | 27d66f3c98232892b31e5b0ae74ad266 |
| BLAKE2b-256 | 99b94f4c50fd569c62e04028024690c6d3c7766948c2d7297bfe378253d938e4 |
File details
Details for the file geoextent-0.13.0-py3-none-any.whl.
File metadata
- Download URL: geoextent-0.13.0-py3-none-any.whl
- Upload date:
- Size: 314.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a0b0e94c996e2d0c7b8d91f875a379163b7a406fb83a0d530f29c324a615919a |
| MD5 | ca6210e58778139f8c31d3d718b7c23d |
| BLAKE2b-256 | f9e2e999f8b8f8e22997f7a59c00a0288d6700db3f5dd54681888c8dfb7aba94 |
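A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. The sketch below is a generic recipe, not something shipped with geoextent; the helper names and the local file path in the usage comment are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """True if the file's SHA256 equals the published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (the path is a hypothetical local download location):
# ok = matches_published_digest(
#     "geoextent-0.13.0.tar.gz",
#     "1063b368de0928503b42998c6b9edfd18dca512872c2290b4fca9d7f468e1974",
# )
```

pip can perform the same check automatically when a requirements file pins hashes and is installed with --require-hashes.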