Libraries and command-line scripts for performing ecohydrology data preparation workflows.

EcohydroLib {#index}
=======================

This software is provided free of charge under the New BSD License. Please see
the following license information:

Copyright (c) 2013, University of North Carolina at Chapel Hill

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
- Neither the name of the University of North Carolina at Chapel Hill nor
the names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Authors
-------
Brian Miles <brian_miles@unc.edu>

Lawrence E. Band <lband@email.unc.edu>

For questions or support, contact [Brian Miles](mailto:brian_miles@unc.edu).

Funding
-------
This work was supported by the following NSF grants:

- Award no. 1239678 EAGER: Collaborative Research: Interoperability
Testbed-Assessing a Layered Architecture for Integration of
Existing Capabilities

- Award no. 0940841 DataNet Federation Consortium.

- Award no. 1148090 Collaborative Research: SI2-SSI: An Interactive Software
Infrastructure for Sustaining Collaborative Innovation in the
Hydrologic Sciences

Introduction
------------
EcohydroLib provides a series of Python scripts for performing
ecohydrology data preparation workflows. Workflow sub-components are
orchestrated via a metadata persistence store provided by the
ecohydrolib.metadata package. These scripts are built on top of a
series of task-oriented APIs, defined in the Python package, for
manipulating the geospatial data needed to run ecohydrology models,
such as digital elevation model (DEM), soils, land cover, and
vegetation leaf area index data. These data can be drawn both from
national spatial data infrastructure (NLCD, SSURGO) and from custom
local datasets.

A metadata store is used to orchestrate a series of workflow scripts
used to prepare data for an ecohydrology model. The current
implementation uses the Python ConfigParser key-value storage
mechanism to persist metadata to disk; however, conceivably any
key-value store could be used. The metadata contains information
related to the study area (e.g. bounding box, spatial reference, DEM
resolution), as well as provenance information for each spatial data
layer imported, and a processing history that records the parameters
used to invoke each workflow script; provenance information is
represented as a subset of Dublin Core attributes
(http://dublincore.org). When using the workflow scripts in a
stand-alone environment, the metadata store provides information
necessary to understand where ecohydrology input data came from, and
what transformations have been made to the data. When the workflow
scripts are integrated into a data grid or workflow environment
(e.g. iRODS), the metadata store can serve as a staging area for
metadata and provenance information that will be registered into the
formal workflow environment.
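The key-value persistence described above can be illustrated with a minimal sketch built on Python's standard configparser module (named ConfigParser in Python 2). The section names, keys, values, and the `record_step` helper are hypothetical and chosen only for illustration; they are not EcohydroLib's actual metadata schema or API.

```python
import configparser
import io


def record_step(store, script, args):
    """Append one processing-history entry to the (hypothetical) store."""
    if not store.has_section("history"):
        store.add_section("history")
    step = len(store.options("history")) + 1
    store.set("history", "step_%d" % step, "%s %s" % (script, " ".join(args)))


store = configparser.ConfigParser()

# Study-area metadata (illustrative keys and values, not EcohydroLib's own).
store.add_section("study_area")
store.set("study_area", "bbox_wgs84", "-76.77 39.28 -76.72 39.33")
store.set("study_area", "dem_resolution_m", "30")

# Provenance for one data layer, as a small subset of Dublin Core attributes.
store.add_section("provenance_dem")
store.set("provenance_dem", "dc.title", "Example DEM")
store.set("provenance_dem", "dc.source", "http://example.org/dem")

# Record how a script was invoked; the argument is a placeholder.
record_step(store, "RegisterDEM.py", ["example_argument"])

# Persist to a file-like object; a real store would write to disk.
buffer = io.StringIO()
store.write(buffer)

# Reading the text back recovers the same key-value pairs.
restored = configparser.ConfigParser()
restored.read_string(buffer.getvalue())
print(restored.get("study_area", "bbox_wgs84"))
```

Because the store is plain key-value text, any other key-value backend could stand in for it, as the paragraph above suggests.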

The fundamental operation for any ecohydrology modeling workflow is to
define the study region of interest (ROI). In EcohydroLib the ROI is
simply defined as a bounding box of WGS84 latitude and longitude
coordinates (e.g. coordinates for the upper-left and lower-right
corners). For workflows using the National Hydrography Dataset (NHD),
the ROI bounding box can be derived using catchment polygons
associated with the stream reaches upstream of a particular gage. The
user begins by picking a streamflow discharge gage listed in the NHD
dataset. EcohydroLib can then determine the stream reaches upstream
of the gage, and then select the catchment polygons associated with
each upstream reach. From these polygons, the bounding box of the
land area draining through the streamflow gage can easily be
calculated.
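The final step, deriving a bounding box from a set of catchment polygons, can be sketched in a few lines of plain Python. This is only an illustration of the idea: the `bounding_box` function and the coordinates are made up, and EcohydroLib itself works on NHD catchment data rather than on hand-written vertex lists.

```python
def bounding_box(polygons):
    """Return (min_lon, min_lat, max_lon, max_lat) enclosing all polygons.

    Each polygon is a list of (lon, lat) vertex tuples in WGS84.
    """
    lons = [lon for polygon in polygons for lon, _ in polygon]
    lats = [lat for polygon in polygons for _, lat in polygon]
    if not lons:
        raise ValueError("at least one vertex is required")
    return (min(lons), min(lats), max(lons), max(lats))


# Two made-up catchment polygons (illustrative coordinates only).
catchments = [
    [(-76.75, 39.30), (-76.74, 39.31), (-76.73, 39.30)],
    [(-76.76, 39.29), (-76.75, 39.30), (-76.74, 39.28)],
]
bbox = bounding_box(catchments)
print(bbox)
```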

Once the ROI is known, EcohydroLib can extract datasets (DEM, soils,
etc.) specific to the study area. Some of these datasets are
extracted from static local copies of national spatial data
(e.g. NLCD), while others are retrieved via web services interfaces
from federal agency data centers (e.g. SSURGO soils data from USDA) or
from third-party data centers (GeoBrain's DEM Explorer). However, it
is also possible for the user to register their own custom data for a
given datatype (e.g. local LIDAR-based DEM).

![Fig. 1 Ecohydrology model data preparation workflow software stack depicting EcohydroLib's role as an intermediary between raw data, derived data subsets, and specific ecohydrology models](EcohydroLib-Architecture.png)

Source code
-----------
Source code can be found at: https://github.com/selimnairb/EcohydroLib.

Documentation can be found at: http://pythonhosted.org/ecohydrolib

Installation
------------
Detailed installation instructions can be found in the RHESSysWorkflows
documentation.

The following instructions should only be used by advanced users.

Using easy_install:

    easy_install --script-dir /path/to/install/scripts EcohydroLib

Using pip:

    pip install EcohydroLib

It is recommended that you install the workflow scripts in a location
distinct from where the Python package will be installed. This is
accomplished by specifying the --script-dir option to easy_install
(see above).

Note: pyspatialite 3.0.1, needed for GHCNDSetup.py and
GetGHCNDailyClimateData*.py, currently fails to build under easy_install/pip.
Until this is fixed by the pyspatialite developer, I have removed
pyspatialite from the dependency list. If you need to use GHCND data, you
can install pyspatialite manually using the following steps (this can be
done before or after installing EcohydroLib):

- Download pyspatialite from https://pypi.python.org/pypi/pyspatialite/3.0.1
- Apply the following patch to pyspatialite's setup.py:
- Install pyspatialite
- Install EcohydroLib as described above

Required runtime software
-------------------------
Python 2.7

Libraries:
- libxml2
- libxslt
- libproj
- libgeos

Binaries:
- GDAL/OGR 1.9 or later (throughout)
- SQLite3 (throughout)
- Seven Zip (if using NHDPlusV2Setup/NHDPlusV2Setup.py)
- Spatialite (if using GHCNDSetup.py/GetGHCNDailyClimateData*.py)
- Unix find (if using NHDPlusV2Setup/NHDPlusV2Setup.py)

Data stored locally
-------------------
- NLCD 2006 raster (http://www.mrlc.gov/nlcd06_data.php)
- HYDRO1k North America dataset (http://eros.usgs.gov/#/Find_Data/Products_and_Data_Available/gtopo30/hydro/namerica)
- NHDPlus V2 dataset (http://www.horizon-systems.com/NHDPlus/NHDPlusV2_home.php)

NHDPlus V2 database setup
-------------------------
Before EcohydroLib is able to extract a study area ROI using the NHDPlus
dataset, it is necessary to have a local copy of the NHDPlus dataset.
Owing to the large size of the NHDPlus dataset, these data are
distributed as a series of compressed archives broken into several
regions for the continental U.S. There are two choices for obtaining
NHDPlus in a format usable by EcohydroLib (as several SQLite3
databases). A national-scale dataset (i.e. covering the entire
continental U.S.) is available for download here:

http://...

Once downloaded, extract the archive and record its location in your
EcohydroLib configuration file; see the 'Configuration files' section
below.

If you wish to build your own copy of the database (i.e. for a subset
of the U.S.), a script for building the dataset from downloaded
NHDPlus V2 7z archives is provided in
bin/NHDPlusV2Setup/NHDPlusV2Setup.py. The following NHDPlus V2
datasets are required:

- NHDPlusV21_NationalData_GageInfo_02.7z
- NHDPlusV21_NationalData_GageLoc_01.7z
- NHDPlusV21_NationalData_Gage_Smooth_01.7z
- NHDPlusV21_??_??_NHDPlusAttributes_??.7z
- NHDPlusV21_??_??_NHDPlusCatchment_??.7z
- NHDPlusV21_??_??_NHDSnapshot_??.7z

Note that the NHDPlusAttributes, NHDPlusCatchment, and NHDSnapshot
data are released as regional subsets (due to the large size and
complexity of the data). NHDPlusV2Setup.py can build its NHDPlus
SQLite3 databases for any number of regions; all data for the desired
number of regions will be combined into a single database.
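Before running NHDPlusV2Setup.py, it can be useful to confirm that every required archive is present in the download directory. The sketch below matches file names against the patterns listed above using the standard fnmatch module. The `missing_archives` helper and the example file names are hypothetical and are not part of EcohydroLib.

```python
import fnmatch

# File name patterns copied from the list of required archives above.
REQUIRED_PATTERNS = (
    "NHDPlusV21_NationalData_GageInfo_02.7z",
    "NHDPlusV21_NationalData_GageLoc_01.7z",
    "NHDPlusV21_NationalData_Gage_Smooth_01.7z",
    "NHDPlusV21_??_??_NHDPlusAttributes_??.7z",
    "NHDPlusV21_??_??_NHDPlusCatchment_??.7z",
    "NHDPlusV21_??_??_NHDSnapshot_??.7z",
)


def missing_archives(filenames, patterns=REQUIRED_PATTERNS):
    """Return the patterns that match none of the given file names."""
    return [
        pattern for pattern in patterns
        if not fnmatch.filter(filenames, pattern)
    ]


# Made-up directory listing in which the NHDSnapshot archive is absent.
example_files = [
    "NHDPlusV21_NationalData_GageInfo_02.7z",
    "NHDPlusV21_NationalData_GageLoc_01.7z",
    "NHDPlusV21_NationalData_Gage_Smooth_01.7z",
    "NHDPlusV21_XX_00_NHDPlusAttributes_00.7z",
    "NHDPlusV21_XX_00_NHDPlusCatchment_00.7z",
]
missing = missing_archives(example_files)
print(missing)
```

In practice the list of file names would come from something like `os.listdir()` on the directory holding the downloaded archives.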

Once you've decided which NHDPlusV2 regions you wish to build a
database for, simply download the relevant 7z archives from the
NHDPlusV2 web site (see above), and store the archives in a single
directory. NHDPlusV2Setup.py will unpack these archives into a
specified output location and then will process the unarchived files
into the following databases:

- Catchment.sqlite (a spatial dataset containing all catchment polygons
in the selected NHD region(s));
- GageLoc.sqlite (a spatial dataset containing streamflow gage points
for the national NHD dataset);
- NHDPlusDB.sqlite (a tabular dataset containing other NHD data needed
by EcohydroLib).

Make sure to edit your configuration file to include the absolute
paths of these files (see below).

For national NHD coverage, Catchment.sqlite is over 8 GB, and
NHDPlusDB.sqlite is over 2 GB, so you will need a kernel and
filesystem that have large file support to build and use these
datasets. Also, it may take over an hour to create these datasets; 8
GB of memory or more is recommended to build the datasets efficiently.
However, database setup is a one-time process, and you can use
databases created on one machine on other machines, provided SQLite3
is installed. NHDPlusV2Setup.py creates each database with the
indices needed by EcohydroLib, so lookups are very fast.
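To see why the indices matter, the following sketch builds a small in-memory SQLite3 table, adds an index, and asks SQLite how it would execute a lookup. The table, column, and index names are invented for this example and do not reflect the actual schema created by NHDPlusV2Setup.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table relating gage identifiers to reach codes.
conn.execute("CREATE TABLE gage_reach (gage_id TEXT, reach_code TEXT)")
conn.executemany(
    "INSERT INTO gage_reach VALUES (?, ?)",
    [("gage_%d" % i, "reach_%d" % i) for i in range(1000)],
)

# With this index, a lookup by gage_id avoids scanning the whole table.
conn.execute("CREATE INDEX idx_gage_reach_gage_id ON gage_reach (gage_id)")

query = "SELECT reach_code FROM gage_reach WHERE gage_id = ?"

# The last column of each EXPLAIN QUERY PLAN row describes the strategy.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("gage_42",)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)

reach = conn.execute(query, ("gage_42",)).fetchone()[0]
print(reach)
conn.close()
```

The query plan text should name the index, which indicates that SQLite searches the index instead of reading every row.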

HYDRO1k North America
---------------------
To use the HYDRO1k basin shapefile, you must first uncompress
na_bas.e00.gz to na_bas.e00. Then you must convert the e00 (Arc
interchange file) to a shapefile using a tool such as ArcGIS.
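The decompression step can be done with any gzip tool. As one option, here is a small sketch using only the Python standard library; the `gunzip` helper is hypothetical, and the demonstration runs on a throwaway file so that it is self-contained. The conversion from e00 to a shapefile is not covered here.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path, dst_path):
    """Decompress a gzip file to dst_path in chunks, then return dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Round-trip demonstration on a temporary file with placeholder contents.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "sample.e00.gz")
with gzip.open(archive, "wb") as handle:
    handle.write(b"placeholder contents\n")

output = gunzip(archive, os.path.join(workdir, "sample.e00"))
with open(output, "rb") as handle:
    content = handle.read()
shutil.rmtree(workdir)
```

For the HYDRO1k data the equivalent call would be `gunzip("na_bas.e00.gz", "na_bas.e00")`, run from the directory containing the download.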

GHCN Climate Data
-----------------
To download daily climate data from the NCDC Global Historical
Climatology Network (GHCN) dataset, you must first create the spatialite
database that EcohydroLib uses to find climate stations using
spatial queries. This database is created using
bin/GHCNDSetup/GHCNDSetup.py. The output from the script will be a
spatialite database. Make sure to edit your configuration file and
set PATH_OF_STATION_DB to the absolute path of this spatialite
database (see below).

Configuration files
-------------------
Many of the command line scripts (including NHDPlusV2Setup.py) require
a configuration file to specify the locations of executables and datasets
required by the ecohydrology workflow libraries. The configuration
file can be specified via the environment variable
ECOHYDROLIB_CFG or via a command line option. Here is an example
configuration file:

    [GDAL/OGR]
    PATH_OF_OGR2OGR = /Library/Frameworks/GDAL.framework/Versions/Current/Programs/ogr2ogr
    PATH_OF_GDAL_RASTERIZE = /Library/Frameworks/GDAL.framework/Versions/Current/Programs/gdal_rasterize
    PATH_OF_GDAL_WARP = /Library/Frameworks/GDAL.framework/Versions/Current/Programs/gdalwarp
    PATH_OF_GDAL_TRANSLATE = /Library/Frameworks/GDAL.framework/Versions/Current/Programs/gdal_translate

    [NHDPLUS2]

    [SOLIM]

    [NLCD]

    [HYDRO1k]
    HYDRO1k_BAS_LAYER_NAME = na_bas_polygon

    [GHCND]

    [UTIL]
    PATH_OF_FIND = /usr/bin/find
    PATH_OF_SEVEN_ZIP = /opt/local/bin/7z
    PATH_OF_SQLITE = /opt/local/bin/sqlite3

If you create your initial configuration file by copying and pasting
from this documentation, make sure to remove any leading spaces from
each line of the file.
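As a rough illustration of how a script might locate and read such a file, the sketch below resolves the path from a command line value or the ECOHYDROLIB_CFG environment variable and then parses a fragment of the example configuration with the standard configparser module. The `resolve_config_path` helper is hypothetical, and giving the command line option precedence over the environment variable is an assumption made for this example, not documented EcohydroLib behavior.

```python
import configparser
import os

# Fragment of the example configuration shown above.
EXAMPLE_CONFIG = """
[HYDRO1k]
HYDRO1k_BAS_LAYER_NAME = na_bas_polygon

[UTIL]
PATH_OF_FIND = /usr/bin/find
PATH_OF_SEVEN_ZIP = /opt/local/bin/7z
PATH_OF_SQLITE = /opt/local/bin/sqlite3
"""


def resolve_config_path(cli_path=None, environ=None):
    """Pick the config path: command line value first, then ECOHYDROLIB_CFG.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    path = cli_path or environ.get("ECOHYDROLIB_CFG")
    if not path:
        raise ValueError("no configuration file specified")
    return path


config = configparser.ConfigParser()
# A real script would call config.read(resolve_config_path(...)) instead.
config.read_string(EXAMPLE_CONFIG)

find_path = config.get("UTIL", "PATH_OF_FIND")
layer_name = config.get("HYDRO1k", "HYDRO1k_BAS_LAYER_NAME")
print(find_path, layer_name)
```

Note that configparser treats option names case-insensitively by default, while section names such as `UTIL` remain case-sensitive.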

How to use - Typical workflows
------------------------------
A workflow using data from large-scale spatial data infrastructure
will consist of running the following scripts in this order:

1. GetNHDStreamflowGageIdentifiersAndLocation.py
2. GetCatchmentShapefileForNHDStreamflowGage.py
3. GetBoundingboxFromStudyAreaShapefile.py
4. GetDEMExplorerDEMForBoundingbox.py
5. GetNLCDForDEMExtent.py
6. GetSSURGOFeaturesForBoundingbox.py
7. GenerateSoilPropertyRastersFromSSURGO.py or GenerateSoilPropertyRastersFromSOLIM.py

The first 4 steps must be run in this order; the remaining workflow
components can be run in any order. Other workflow components,
e.g. to register a custom dataset, can be substituted for the remaining
workflow components as well (as indicated above). See the
documentation for each script for invocation details.
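The ordering rule stated above can be expressed as a simple check on a planned sequence of scripts. The `is_valid_order` helper is hypothetical and encodes only the constraint described in this section; individual scripts may have further prerequisites that are described in their own documentation.

```python
# The four scripts that, per the text above, must run first and in this order.
REQUIRED_PREFIX = [
    "GetNHDStreamflowGageIdentifiersAndLocation.py",
    "GetCatchmentShapefileForNHDStreamflowGage.py",
    "GetBoundingboxFromStudyAreaShapefile.py",
    "GetDEMExplorerDEMForBoundingbox.py",
]

# The remaining scripts of the first workflow listed above.
REMAINING = [
    "GetNLCDForDEMExtent.py",
    "GetSSURGOFeaturesForBoundingbox.py",
    "GenerateSoilPropertyRastersFromSSURGO.py",
]


def is_valid_order(sequence):
    """Return True if the plan starts with the required scripts, in order."""
    sequence = list(sequence)
    return sequence[:len(REQUIRED_PREFIX)] == REQUIRED_PREFIX


# Same plan with the NLCD and SSURGO feature steps swapped: still valid.
plan_swapped = REQUIRED_PREFIX + [REMAINING[1], REMAINING[0], REMAINING[2]]

# Plan that reverses the first four steps: not valid.
plan_reversed = REQUIRED_PREFIX[::-1] + REMAINING

print(is_valid_order(REQUIRED_PREFIX + REMAINING))
print(is_valid_order(plan_swapped))
print(is_valid_order(plan_reversed))
```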

A workflow collecting data appropriate for large-scale land surface
process models may consist of running the following scripts in the
following order:

1. GetCatchmentShapefileForHYDRO1kBasins.py
2. GetBoundingboxFromStudyAreaShapefile.py
3. GetHYDRO1kDEMForBoundingbox.py
4. GetGHCNDailyClimateDataForBoundingboxCentroid.py or GetGHCNDailyClimateDataForStationsInBoundingbox.py

A workflow using custom local data sources will consist of running the
following scripts in this order:

1. RegisterDEM.py
2. RegisterGage.py
3. RegisterRaster.py
4. GetSSURGOFeaturesForBoundingbox.py

A workflow using a custom streamflow gage, but with standard spatial data (NED, NLCD, SSURGO),
could consist of running the following scripts in this order:

1. RegisterStudyAreaShapefile.py
2. GetBoundingboxFromStudyAreaShapefile.py
3. GetDEMExplorerDEMForBoundingbox.py
4. GetNLCDForDEMExtent.py
5. GetSSURGOFeaturesForBoundingbox.py
6. GenerateSoilPropertyRastersFromSSURGO.py
