
An automation tool for harvesting and processing geodata from the web

Project description

Geodata-Harvester

Automate geodata harvesting from the web and jumpstart your analysis with a ready-made set of spatiotemporal processed maps and data tables.



The Geodata-Harvester Python package offers reusable and automated workflows for data extraction from a wide range of geospatial and environmental data sources. User-provided data is auto-completed with a suitable set of spatially and temporally aligned covariates as a ready-made dataset for machine learning and environmental models. In addition, all requested data layer maps are automatically extracted and aligned for a specific region and time period.

For the R-package wrapper of the Geodata-Harvester, please visit the GitHub dataharvesteR project.


💡 Introduction

An enormous amount of national and global space-time data is free and accessible. Examples include numerous satellite platforms, weather data, and the Soil and Landscape Grid of Australia. Many of these datasets have a temporal dimension, so for any point in Australia you can extract time series of remote sensing and weather data together with soil and terrain site variables. Time series covariates usually need further post-processing to extract meaning, e.g. computing temporal means or aggregating in time. All of this is non-trivial. A workflow in which a user enters one or more points and gets back a tidy data frame of data cube variables would help people understand the value of these data and jumpstart their analysis. This project contributes processing tools for finding, extracting and converting these key data layers.
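
As an illustration of the kind of temporal post-processing mentioned above, the following minimal sketch aggregates a daily time series into monthly means using only the Python standard library. The function name monthly_means and the sample values are hypothetical; this is not part of the Geodata-Harvester API.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(series):
    """Aggregate (date, value) pairs into a {(year, month): mean} dict."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[(day.year, day.month)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Made-up daily rainfall values (mm) for a single location.
rain = [
    (date(2022, 1, 1), 2.0),
    (date(2022, 1, 2), 4.0),
    (date(2022, 2, 1), 10.0),
]
print(monthly_means(rain))
```

The same grouping idea extends to other aggregation periods (e.g. seasons or years) by changing the bucket key.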

Developed as part of the Agricultural Research Federation (AgReFed), Geodata-Harvester is open-source software that allows users to jumpstart their analysis with a suitable set of spatially and temporally aligned raster maps and dataframes.

🌍 Data Sources

A detailed list of all available layers and their description can be found in Data Overview.

The following main data sources are currently implemented:

  • Soil and Landscape Grid of Australia (SLGA)
  • SILO Climate Database
  • National Digital Elevation Model (DEM) 1 Second Hydrologically Enforced
  • Digital Earth Australia (DEA) Geoscience Earth Observations
  • GSKY Data Server for DEA Geoscience Earth Observations
  • Radiometric Data
  • Google Earth Engine Data (GEE account needed); for an overview, see Earth_Engine_Data_Overview.

🔄 Functionality

The main goal of the Geodata-Harvester is to provide researchers with reusable workflows for automatic data extraction and processing:

  1. Retrieve: Given a set of locations, automatically access and download data from a diverse range of geospatial and soil data sources (APIs)
  2. Process: Spatial and temporal processing, conversion to dataframes and custom raster-files
  3. Output: Ready-made dataset for machine learning (training set and prediction mapping)

Geodata-Harvester is designed as a modular and maintainable project in the form of a multi-stage pipeline by providing explicit boundaries among tasks. To encourage interaction and experimentation with the pipeline, multiple frontend notebooks and use case scenarios are provided.

🌟 Key Features

The geodata-harvester package provides the following core features:

For more details about all functionalities, please consult the API reference documentation.

🔧 Installation

Geodata-Harvester can be run on cloud servers (e.g., in a JupyterHub environment) or on your local machine. Example notebooks for importing and using the package can be found in the folder notebooks. The package can be installed via PyPI or Conda:

Conda or Mamba

The package geodata-harvester is available via the conda-forge channel:

conda install geodata-harvester -c conda-forge

Note that the geodata-harvester package is imported with an underscore:

import geodata_harvester

PyPI

Installation via PyPI requires GDAL to be installed in your environment beforehand (see, e.g., pypi.org/project/GDAL/installation guide). Once GDAL is installed, you can install geodata-harvester via

pip install geodata-harvester

The geodata-harvester library can then be imported via

import geodata_harvester

Google Earth Engine extension

Optionally you can include Google Earth Engine (GEE) data in Geodata-Harvester (see Settings_Overview). GEE requires a Google account and a GEE authorization. If this is your first time using GEE, please follow these instructions and authorise Geodata-Harvester to use the Google Earth Engine API. See a preview of the process here.

NOTE: You only have to perform this authorisation once per “connection”. You may need to repeat it if you use an incognito window.

Local development

If you would like to develop Geodata-Harvester locally, we recommend setting up a virtual environment for the installation, e.g., via conda miniforge (see environment.yaml for dependencies), and forking the Geodata-harvester repo. To install only the latest development version, use:

pip install git+https://github.com/Sydney-Informatics-Hub/geodata-harvester

Workshop Cloud Sandbox

As a playground for workshop training sessions and for testing the Geodata-Harvester, we provide a pre-installed cloud Python JupyterLab environment, which does not require any local installation. For login instructions and how to access the sandbox, please visit our Python workshop page.

The Jupyter environment is hosted on the ARDC Nectar Research Cloud in partnership with AgReFed and Australian Research Data Commons (ARDC). Note that this sandbox is currently hosted for test purposes only and generated data is not permanently stored.

The Geodata-Harvester can also be easily installed on other cloud services (e.g., Google Colab, Azure Notebooks).

⚙️ Settings Overview

The Geodata-Harvester is controlled by a settings file in YAML format. The settings file contains all user-defined settings for the data extraction and processing. A detailed settings overview is provided in Settings_Overview. Example settings files are provided alongside the notebooks in the folder notebooks/settings.

Alternatively, a settings file can also be created via the interactive widget panels, as demonstrated in the notebook example_harvest_with_widgets.ipynb.
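
To see which example settings files are available in a cloned copy of the repo, a small helper like the one below can list them. This is a minimal sketch using only the standard library; the function name list_settings_files is hypothetical, and the default path assumes the script runs from the repo root, with notebooks/settings as described above.

```python
from pathlib import Path


def list_settings_files(folder="notebooks/settings"):
    """Return the YAML settings files found in `folder`, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        # Folder missing (e.g. not run from the repo root): nothing to list.
        return []
    return sorted(p for p in root.iterdir() if p.suffix in {".yaml", ".yml"})


for path in list_settings_files():
    print(path.name)
```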

🚀 How to get started

You may now invoke the geodata-harvester directly from a Python terminal with:

import geodata_harvester as gdh
gdh.harvest.run(PATH_TO_SETTINGS_YAMLFILE)

Note the subtle but important difference in use of an underscore _ to import the package and the use of a dash - to install it!
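
In a script, it can help to check the settings path before starting a harvest. The following is a minimal sketch: the wrapper name run_harvest is hypothetical, and the only package call it uses is gdh.harvest.run as shown above. Whatever that call returns is passed through unchanged.

```python
from pathlib import Path


def run_harvest(settings_file):
    """Check that the settings file exists, then pass it to gdh.harvest.run."""
    path = Path(settings_file)
    if not path.is_file():
        raise FileNotFoundError(f"Settings file not found: {path}")
    # Imported lazily so that a wrong path is reported before the package loads.
    import geodata_harvester as gdh

    return gdh.harvest.run(str(path))
```

Calling run_harvest with the path to one of the example settings files then starts the harvest exactly as the two-line snippet above does.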

To get started, some example workflows are provided as Jupyter notebooks:

  1. Clone the geodata-harvester repo to your local machine or cloud server. Alternatively, download the package as a zip folder from the geodata-harvester GitHub page and unzip the folder. This will download the geodata-harvester package including the example notebooks, settings files and example input data.

  2. Define the options and user settings in the settings file; see, for example, the settings file settings_harvest.yaml.

  3. Run a Jupyter notebook in the notebooks folder, such as example_harvest.ipynb.

  4. The notebook will run the geodata-harvester with the settings file and download/process all the requested data. The final data is saved in the folder results_example_harvest in the current working directory as specified in the settings file. There you can find the generated data table results.csv and the downloaded georeferenced .tif files (open with, e.g., rasterio, QGIS or ArcGIS). A summary of all generated images is provided in the table download_summary.csv.
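
After a run, the output folder can be inspected with a few lines of standard-library Python. The sketch below assumes the folder and file names mentioned in step 4 (results_example_harvest, results.csv, .tif files). The helper name summarise_results is hypothetical, and it makes no assumption about the column names in the table or about whether the .tif files sit in subfolders.

```python
import csv
from pathlib import Path


def summarise_results(outdir="results_example_harvest"):
    """Return (column names, row count, GeoTIFF file names) for a results folder."""
    folder = Path(outdir)
    with open(folder / "results.csv", newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []
    # Search recursively, since the .tif files may be stored in subfolders.
    tifs = sorted(p.name for p in folder.rglob("*.tif"))
    return columns, len(rows), tifs


# Only print a summary if a harvest has already been run in this directory.
if (Path("results_example_harvest") / "results.csv").is_file():
    columns, n_rows, tifs = summarise_results()
    print(f"{n_rows} rows, {len(columns)} columns, {len(tifs)} GeoTIFF files")
```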

A step-by-step tutorial on how to use the individual modules of the Geodata-Harvester is provided in the notebook example_harvest_stepwise.ipynb.

To include Google Earth Engine (GEE) data in Geodata-Harvester, please follow the instructions in the notebook example_harvest_withGEE.ipynb. Note that this requires a GEE account and authorisation (see Google Earth Engine extension).

If you would like to learn more about the Geodata-Harvester, please also visit our Workshop webpage.

✅ Testing

Test functions are included in the tests folder. Note that due to the nature of this package, these tests require an internet connection and may fail if the data source API servers are not available or the data source API has changed. To run automated tests with pytest, first install pytest with

pip install pytest

and then run:

cd tests
pytest ./

or test individual modules with, e.g.,

cd tests
pytest test_getdata_dea.py
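
Because the tests depend on remote servers, it can be useful to check that a server is reachable before interpreting a failing test. The following is a minimal sketch using only the standard library; the function name can_connect is hypothetical, and the host you pass in (the server of the data source you want to test) is something you need to look up yourself.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

A False result points to a network or server problem rather than a bug in the package. A True result only shows that the server accepts connections, not that its API is unchanged.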

➕ How to add new data source modules

The Geodata-Harvester is designed to be extensible: new data sources can be added as Python modules (for examples, see the getdata_*.py modules). If you would like to add a new data source, please follow the adding new data source guidelines.

We recommend forking the geodata-harvester repo and developing new modules in a local environment. If you would like to contribute your data source module to the geodata-harvester package, please visit the geodata-harvester contribution guidelines.

📚 Code reference API

Auto-generated API reference documentation is available here.

🤝 Contributions

We welcome any contribution to the geodata-harvester, from feedback and bug reports via GitHub Issues and use-case examples contributed as notebooks to improvements to the source code and new or updated data source modules.

For more details about how to contribute to the development, please visit the Geodata-Harvester contribution guidelines.

👏 Attribution and Acknowledgments

This software was developed by the Sydney Informatics Hub, a core research facility of the University of Sydney, as part of the Data Harvesting project for the Agricultural Research Federation (AgReFed).

Acknowledgments are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub.

If you make use of this software for your research project, please include the following acknowledgment:

“This research was supported by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney, and the Agricultural Research Federation (AgReFed).”

AgReFed is supported by the Australian Research Data Commons (ARDC) and the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS).

📄 License

Copyright 2023 The University of Sydney

This is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License (LGPL version 3) as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program (see LICENSE). If not, see https://www.gnu.org/licenses/.


