Extensible Environmental Data Preprocessing Framework
EnvDataPrep: High-performance Environmental Data Pre-processing
Why EnvDataPrep?
EnvDataPrep aims to help environmental scientists overcome common challenges in handling environmental datasets, including:
- Insufficient disk space for massive datasets
- Complex conversions between different file formats
- Time-consuming geospatial operations
- And more ...
EnvDataPrep is designed for high performance and syntax simplicity. By leveraging vectorized operations, parallelism, and industry-standard libraries like NumPy, Xarray, and netCDF4, it streamlines heavy and complex data preparation tasks into efficient and easy-to-use APIs.
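To illustrate the kind of per-file parallelism mentioned above, here is a minimal, generic sketch using only the standard library. It is not EnvDataPrep's actual implementation, which is not shown in this description. The helper name `run_per_file` and its parameters are hypothetical, and a thread pool is used for simplicity; a process pool may suit CPU-heavy work better.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_per_file(
    task: Callable[[str], str],
    paths: list[str],
    workers: int = 4,
) -> dict[str, str]:
    """Apply `task` to every path using a pool of workers.

    Hypothetical helper for illustration only. Returns a mapping of
    path -> task result, or path -> "failed: <reason>" if the task raised.
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one job per file and remember which future belongs to which path.
        futures = {pool.submit(task, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One bad file should not abort the whole batch.
                results[path] = f"failed: {exc}"
    return results
```

Collecting failures per file, rather than letting the first exception propagate, keeps a long batch run from stopping at a single corrupt input.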
Core Capabilities
Currently, the main functionality is subsetting netCDF files. Typical satellite products (e.g., TROPOMI NO2) and model simulations (e.g., WRF outputs) can shrink by a large fraction (e.g., 90%) when you keep only the data fields you need.
Example of subsetting netCDF files
import glob
import os
import envdataprep as edp
# Directories
INPUT_DIR = "path/to/input/directory"
OUTPUT_DIR = "path/to/output/directory"
# Input files (using TROPOMI NO2 satellite products as an example)
input_files = glob.glob(os.path.join(INPUT_DIR, "S5P*.nc"))
# Explore all available variables
all_vars = edp.list_netcdf_vars(input_files[0])
print(*all_vars, sep="\n")
# Select the variables to keep
selected_vars = [
    "PRODUCT/latitude",
    "PRODUCT/longitude",
    "PRODUCT/time_utc",
    "PRODUCT/qa_value",
    "PRODUCT/nitrogendioxide_tropospheric_column",
    "PRODUCT/nitrogendioxide_tropospheric_column_precision",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/solar_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/viewing_zenith_angle",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/latitude_bounds",
    "PRODUCT/SUPPORT_DATA/GEOLOCATIONS/longitude_bounds",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_altitude",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/eastward_wind",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/northward_wind",
]
# Subset the netCDF files in parallel
edp.subset_netcdf(
    nc_input=input_files,
    output_dir=OUTPUT_DIR,
    keep_vars=selected_vars,
    workers=8,
)
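After subsetting, it can be useful to measure how much disk space was actually saved for your own data. The sketch below uses only the standard library. The helper names `total_size_bytes` and `size_reduction` are hypothetical and not part of EnvDataPrep, and the sketch assumes the subset files are written to the output directory with a `.nc` extension.

```python
from pathlib import Path


def total_size_bytes(directory: str, pattern: str = "*.nc") -> int:
    """Sum the sizes (in bytes) of all files in `directory` matching `pattern`."""
    return sum(
        path.stat().st_size
        for path in Path(directory).glob(pattern)
        if path.is_file()
    )


def size_reduction(input_dir: str, output_dir: str, pattern: str = "*.nc") -> float:
    """Return the fraction of disk space saved (0.9 means 90% smaller).

    Assumes (hypothetically) that every output file matches `pattern`.
    """
    before = total_size_bytes(input_dir, pattern)
    after = total_size_bytes(output_dir, pattern)
    if before == 0:
        raise ValueError(f"No files matching {pattern!r} found in {input_dir}")
    return 1.0 - after / before
```

With the directories from the example above, `print(f"{size_reduction(INPUT_DIR, OUTPUT_DIR):.1%} smaller")` would report the overall saving, provided the input directory contains only the files that were subset.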
Installation
Requirements: Python 3.12+.
Install from PyPI:
pip install envdataprep
⚠️ Disclaimer
Due to the massive scale and inherent diversity of environmental data, some edge cases may remain unexplored. For critical research or production workflows, it is strongly recommended to manually validate processed outputs.
If you encounter any discrepancies or unexpected behavior, please open an issue.
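One simple validation step is to confirm that a subset file contains the variables you asked for. The sketch below is plain Python; `compare_variables` is a hypothetical helper, not part of EnvDataPrep. It assumes that `edp.list_netcdf_vars` returns variable names as strings in the same `GROUP/variable` path style used for `keep_vars`, which this description suggests but does not state explicitly.

```python
def compare_variables(
    found: list[str],
    expected: list[str],
) -> tuple[list[str], list[str]]:
    """Return (missing, extra) variable names.

    `found` is the listing from an output file, `expected` is the keep list.
    Leading slashes and surrounding whitespace are ignored when comparing.
    """
    def normalize(name: str) -> str:
        return name.strip().lstrip("/")

    found_set = {normalize(name) for name in found}
    expected_set = {normalize(name) for name in expected}
    missing = sorted(expected_set - found_set)  # requested but not present
    extra = sorted(found_set - expected_set)    # present but not requested
    return missing, extra
```

For example, `compare_variables(edp.list_netcdf_vars(output_file), selected_vars)` with a hypothetical `output_file` path should give an empty `missing` list. Entries in `extra` are not necessarily errors, since a subsetting tool may keep supporting variables such as coordinates. Checking names does not replace comparing the data values themselves against the original file.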
License
This project is licensed under the MIT License.
File details
Details for the file envdataprep-0.1.2.tar.gz.
File metadata
- Download URL: envdataprep-0.1.2.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2bf04e397a9b2117d4401f1f0f9b1ad20eefed6e1b7510a9c678b25e4b6f161c |
| MD5 | f229108d353ad213d6da6c593ea6eb04 |
| BLAKE2b-256 | f46acb00e3d098468241d555c597cc74c994bf2db7702ccd12cff9f8824eee8b |
File details
Details for the file envdataprep-0.1.2-py3-none-any.whl.
File metadata
- Download URL: envdataprep-0.1.2-py3-none-any.whl
- Upload date:
- Size: 18.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 31691ad2eb05e6e3dd3e45cb71fe7eeff35fd7f8f313f5c3692c67c13cf26c40 |
| MD5 | 1daae92ba1cfea9336aa4b174a52850b |
| BLAKE2b-256 | 9bb7bb8d848da593416fac6321249a375870d5d298139c5100bb22a2c8059d5d |