Extensible Environmental Data Preprocessing Framework

EnvDataPrep: High-performance Environmental Data Pre-processing

Why EnvDataPrep?

EnvDataPrep aims to help environmental scientists overcome common challenges in handling environmental datasets, including:

  • Insufficient disk space for massive datasets
  • Complex conversions between different file formats
  • Time-consuming geospatial operations
  • And more ...

EnvDataPrep is designed for high performance and simple syntax. By leveraging vectorized operations, parallelism, and industry-standard libraries like NumPy, Xarray, and netCDF4, it wraps heavy, complex data preparation tasks in efficient, easy-to-use APIs.

Core Capabilities

Currently, the main functionality is subsetting netCDF files. Typical satellite products (e.g., TROPOMI NO2) and model simulations (e.g., WRF outputs) can shrink by a large fraction (e.g., 90%) when you keep only the data fields you need.
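
How much space a subset saves depends on the product and on the variables you keep, so it is worth measuring on your own files. The sketch below is not part of EnvDataPrep; `size_reduction` is a hypothetical helper that uses only the standard library to compare the total size of the original files with the total size of their subsets.

```python
import os


def size_reduction(original_paths, subset_paths):
    """Return the fraction of bytes saved, e.g. 0.9 for a 90% reduction."""
    original_bytes = sum(os.path.getsize(p) for p in original_paths)
    subset_bytes = sum(os.path.getsize(p) for p in subset_paths)
    if original_bytes == 0:
        raise ValueError("original files are empty or no paths were given")
    return 1.0 - subset_bytes / original_bytes
```

Pass the list of input files and the list of files written to the output directory; a return value of 0.9 means the subsets take up one tenth of the original space.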

Example of subsetting netCDF files

import glob
import os

import envdataprep as edp

# Directories
INPUT_DIR = "path/to/input/directory"
OUTPUT_DIR = "path/to/output/directory"

# Input files (using TROPOMI NO2 satellite products as an example)
input_files = glob.glob(os.path.join(INPUT_DIR, "S5P*.nc"))

# Explore all available variables
all_vars = edp.list_netcdf_vars(input_files[0])
print(*all_vars, sep="\n")

# Select the variables to keep
selected_vars = [
    "PRODUCT/latitude",
    "PRODUCT/longitude",
    "PRODUCT/time_utc",
    "PRODUCT/qa_value",
    "PRODUCT/nitrogendioxide_tropospheric_column",
    "PRODUCT/nitrogendioxide_tropospheric_column_precision",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/solar_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/viewing_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/latitude_bounds",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/longitude_bounds",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_altitude",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/eastward_wind",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/northward_wind",
]

# Subset the netCDF files in parallel
edp.subset_netcdf(
    nc_input=input_files,
    output_dir=OUTPUT_DIR,
    keep_vars=selected_vars,
    workers=8,
)
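
Typing every variable path by hand gets tedious when a product has many fields. As a sketch, and assuming `edp.list_netcdf_vars` returns plain path strings like the ones shown above, the selection can be built from shell-style patterns with the standard-library `fnmatch` module. `select_vars` and `example_vars` are hypothetical names used for illustration and are not part of EnvDataPrep.

```python
from fnmatch import fnmatchcase


def select_vars(all_vars, patterns):
    """Return variable paths matching any pattern, in their original order."""
    return [
        name for name in all_vars
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hand-written stand-in for a variable listing, shaped like the paths above
example_vars = [
    "PRODUCT/latitude",
    "PRODUCT/longitude",
    "PRODUCT/qa_value",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/solar_zenith_angle",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_altitude",
]

# Note: "*" in fnmatch also matches "/", so a pattern can span nested groups
selected = select_vars(
    example_vars,
    ["PRODUCT/l*itude", "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/*"],
)
print(*selected, sep="\n")
```

The resulting list can be passed as `keep_vars` in place of a hand-written one. Printing it first is a cheap way to confirm that the patterns picked up exactly the fields you intended.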

Installation

Requirements: Python 3.12+.

Install from PyPI:

pip install envdataprep

⚠️ Disclaimer

Due to the massive scale and inherent diversity of environmental data, some edge cases may remain unexplored. For critical research or production workflows, it is strongly recommended to manually validate processed outputs.

If you encounter any discrepancies or unexpected behavior, please open an issue.
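
One inexpensive first check is to confirm that every variable you asked for is present in each output file. The sketch below assumes that subset files keep the same variable paths as their inputs and are written to `OUTPUT_DIR` with a `.nc` extension; neither is confirmed here, so adjust as needed. `missing_vars` is a hypothetical helper, not part of EnvDataPrep, and it checks presence only, not data values.

```python
def missing_vars(requested, available):
    """Return requested variable paths that are absent, in request order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Usage sketch (assumes outputs keep the input variable paths and live in
# OUTPUT_DIR as *.nc files; glob, os, edp, and selected_vars are the names
# from the subsetting example):
# for path in glob.glob(os.path.join(OUTPUT_DIR, "*.nc")):
#     missing = missing_vars(selected_vars, edp.list_netcdf_vars(path))
#     if missing:
#         print(f"{path}: missing {missing}")
```

An empty result for every file only shows that the variables exist. Comparing a few values against the original files is still advisable for critical work.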

License

This project is licensed under the MIT License.

Download files

Source Distribution

envdataprep-0.1.2.tar.gz (16.3 kB)

Uploaded Source

Built Distribution

envdataprep-0.1.2-py3-none-any.whl (18.4 kB)

Uploaded Python 3

File details

Details for the file envdataprep-0.1.2.tar.gz.

File metadata

  • Download URL: envdataprep-0.1.2.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for envdataprep-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       2bf04e397a9b2117d4401f1f0f9b1ad20eefed6e1b7510a9c678b25e4b6f161c
MD5          f229108d353ad213d6da6c593ea6eb04
BLAKE2b-256  f46acb00e3d098468241d555c597cc74c994bf2db7702ccd12cff9f8824eee8b

File details

Details for the file envdataprep-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: envdataprep-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 18.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for envdataprep-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       31691ad2eb05e6e3dd3e45cb71fe7eeff35fd7f8f313f5c3692c67c13cf26c40
MD5          1daae92ba1cfea9336aa4b174a52850b
BLAKE2b-256  9bb7bb8d848da593416fac6321249a375870d5d298139c5100bb22a2c8059d5d
