Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

Project Status: Active. The project has reached a stable, usable state and is being actively developed. pyOpenSci Peer-Reviewed.

harmonize-wq

Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource, with tools to query and retrieve data using Python or R. Given the variety of data and data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it into a more analytic-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible, water-quality-specific framework to help:

  • Identify differences in data units (including speciation and basis)
  • Identify differences in sampling or analytic methods
  • Resolve data errors using transparent assumptions
  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
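
As a minimal sketch of the first item in the list above, the snippet below counts the distinct units reported for each characteristic in a small hand-made table. The column names follow WQP result naming, the rows are invented for illustration, and this is plain pandas rather than a harmonize_wq function.

```python
import pandas as pd

# Hypothetical rows; real data would come from a WQP query
df = pd.DataFrame(
    {
        "CharacteristicName": [
            "Temperature, water",
            "Temperature, water",
            "Depth, Secchi disk depth",
            "Depth, Secchi disk depth",
        ],
        "ResultMeasure/MeasureUnitCode": ["deg C", "deg F", "m", "m"],
    }
)

# Distinct units reported for each characteristic
units_by_char = (
    df.groupby("CharacteristicName")["ResultMeasure/MeasureUnitCode"]
    .unique()
    .apply(sorted)
)

# Characteristics reported in more than one unit need conversion before comparison
mixed_units = units_by_char[units_by_char.apply(len) > 1]
print(mixed_units)
```

Here only "Temperature, water" is flagged, because it is reported in both deg C and deg F.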

For complete documentation, see the docs. For more complete tutorials, see the demos.

Quick Start

harmonize_wq can be installed using pip:

python3 -m pip install harmonize-wq

To install the latest development version of harmonize_wq using pip:

pip install git+https://github.com/USEPA/harmonize-wq.git
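
One way to confirm that either install worked is to ask Python for the installed distribution's version. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of harmonize_wq.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "harmonize-wq" is the distribution name used in the pip command above
print(installed_version("harmonize-wq"))
```

A result of `None` means the package is not installed in the active environment.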

Example Workflow

dataretrieval Query for a geojson

import dataretrieval.wqp as wqp
from harmonize_wq import wrangle

# File for area of interest
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'

# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'

# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow

Harmonize results

from harmonize_wq import harmonize

# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
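
To illustrate what bringing results to a common unit involves, the sketch below converts a few hand-made Secchi depth and temperature values to meters and degrees Celsius with standard conversion factors. The helper names, the supported units, and the sample values are assumptions for illustration; this is not how harmonize_wq performs the conversion internally.

```python
def to_meters(value, unit):
    """Convert a length to meters; units are a small illustrative subset."""
    factors = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}
    if unit not in factors:
        raise ValueError(f"Unrecognized length unit: {unit!r}")
    return value * factors[unit]


def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius."""
    if unit == "deg C":
        return value
    if unit == "deg F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"Unrecognized temperature unit: {unit!r}")


# Hypothetical (value, unit) pairs as different organizations might report them
secchi = [(1.5, "m"), (10.0, "ft")]
temperature = [(20.0, "deg C"), (68.0, "deg F")]

print([round(to_meters(v, u), 3) for v, u in secchi])        # [1.5, 3.048]
print([round(to_celsius(v, u), 1) for v, u in temperature])  # [20.0, 20.0]
```

Raising on an unrecognized unit is a design choice in this sketch: it surfaces unexpected values instead of silently passing them through.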

Clean results

from harmonize_wq import clean

# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned
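
The datetime step combines a date, a time, and a time zone code into a single UTC timestamp. A minimal standard-library sketch of that idea follows. The offset table, the `to_utc` helper, and the sample values are assumptions for illustration, and the package's own lookup may differ.

```python
from datetime import datetime, timedelta, timezone

# Assumed UTC offsets for a few time zone codes; a real lookup would need more
UTC_OFFSET_HOURS = {"EST": -5, "EDT": -4, "CST": -6, "CDT": -5, "UTC": 0}


def to_utc(date_str, time_str, tz_code):
    """Combine date, time, and time zone code into a UTC-aware datetime."""
    offset = timezone(timedelta(hours=UTC_OFFSET_HOURS[tz_code]))
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=offset).astimezone(timezone.utc)


# Hypothetical sample taken at 10:30 Central Standard Time
stamp = to_utc("2020-01-15", "10:30:00", "CST")
print(stamp.isoformat())  # 2020-01-15T16:30:00+00:00
```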

Transform results from long to wide format

Many columns in the dataframe are characteristic-specific; that is, they hold different values for the same sample depending on the characteristic. To ensure one row per sample after the transformation, these columns must either be split, generating a new column for each characteristic that has values, or moved out of the table if they are not being used.

from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)

The number of columns in the resulting table is greatly reduced:

| Output Column | Type | Source | Changes |
| --- | --- | --- | --- |
| MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA |
| Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and UTC |
| ActivityIdentifier | Defines row | ActivityIdentifier | NA |
| OrganizationIdentifier | Defines row | OrganizationIdentifier | NA |
| OrganizationFormalName | Metadata | OrganizationFormalName | NA |
| ProviderName | Metadata | ProviderName | NA |
| StartDate | Metadata | ActivityStartDate | Preserves date where time is NaT |
| Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | Standardized to meters |
| Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to meters |
| QA_Secchi | QA | NA | Harmonization processing quality issues |
| Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to degrees Celsius |
| QA_Temperature | QA | NA | Harmonization processing quality issues |
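
The long-to-wide idea itself can be illustrated with a generic pandas pivot on a toy table. This is a sketch of the concept, not the approach used by the `wrangle` functions, and the rows and shortened characteristic names are invented.

```python
import pandas as pd

# Hypothetical long-format rows: one row per sample and characteristic
long_df = pd.DataFrame(
    {
        "ActivityIdentifier": ["A1", "A1", "A2"],
        "CharacteristicName": ["Temperature", "Secchi", "Temperature"],
        "ResultMeasureValue": [20.0, 1.5, 22.5],
    }
)

# One row per sample, one column per characteristic
wide_df = long_df.pivot(
    index="ActivityIdentifier",
    columns="CharacteristicName",
    values="ResultMeasureValue",
)
print(wide_df)
```

Sample A2 has no Secchi result, so that cell is NaN in the wide table.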

Issue Tracker

harmonize_wq is under development. Please report bugs and enhancement ideas using issues.

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

