DSS Pollution Extraction


A comprehensive Python package for analyzing pollution data from NetCDF files, developed at the Alfred Wegener Institute (AWI). This package provides tools for temporal aggregations, spatial extractions, visualizations, and health threshold analysis of atmospheric pollution data.

🌟 Features

📊 Data Analysis

  • Multi-pollutant support: Black Carbon (BC), NO₂, PM₂.₅, PM₁₀
  • Temporal aggregations: Monthly, seasonal, annual, and custom period averages
  • Spatial extractions: Point-based, polygon-based, and NUTS3 region analysis
  • Statistical analysis: Comprehensive data statistics and quality control
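
As a rough illustration of what these temporal aggregations compute, here is a minimal pandas sketch for a single time series. This is generic pandas, not the package's internal API; the package itself operates on gridded xarray Datasets.

```python
# Minimal sketch: monthly, seasonal, and annual means of a synthetic
# daily PM2.5 series (values and names are illustrative).
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pm25 = pd.Series(np.random.default_rng(0).uniform(5, 30, len(idx)), index=idx)

monthly_avg = pm25.resample("MS").mean()   # 12 monthly means
annual_avg = pm25.mean()                   # one annual mean
# Meteorological seasons: Q-NOV quarters correspond to DJF, MAM, JJA, SON
seasonal_avg = pm25.groupby(idx.to_period("Q-NOV").quarter).mean()
```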

๐Ÿ—บ๏ธ Visualization

  • Spatial maps: Interactive and publication-ready maps with Cartopy support
  • Time series plots: Domain averages and location-specific trends
  • Seasonal cycles: Annual pattern analysis and visualization
  • Distribution analysis: Histograms and box plots for data exploration
  • Spatial statistics: Mean, maximum, minimum, and standard deviation maps
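
The spatial-map style of output can be sketched with plain matplotlib (no Cartopy, so the example runs without geographic dependencies; the field and labels are synthetic):

```python
# Hedged sketch of a publication-style spatial map: a gridded field
# rendered with a colorbar, saved to PNG for scripted workflows.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for batch use
import matplotlib.pyplot as plt
import numpy as np

field = np.random.default_rng(2).uniform(5, 30, size=(50, 60))  # μg/m³ grid
fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(field, cmap="viridis", origin="lower")
fig.colorbar(im, ax=ax, label="PM2.5 (μg/m³)")
ax.set_title("Annual mean PM2.5")
fig.savefig("spatial_map.png", dpi=150)
plt.close(fig)
```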

📤 Data Export

  • Multiple formats: NetCDF, GeoTIFF, CSV, GeoJSON, Shapefile
  • Flexible subsetting: Temporal and spatial data subsetting
  • Batch processing: Multi-file analysis workflows
  • Compression support: Optimized file sizes for large datasets
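
The CSV export path amounts to flattening a gridded (time, y, x) field into long-format rows, which is what most statistics tools expect. A minimal sketch with pandas (column names and values are illustrative, not the package's API):

```python
# Flatten a tiny (time, y, x) cube into long-format rows and write CSV.
import numpy as np
import pandas as pd

times = pd.date_range("2020-01-01", periods=2, freq="D")
ys, xs = [3210000, 3211000], [4321000, 4322000]
values = np.arange(8, dtype=float).reshape(2, 2, 2)  # shape (time, y, x)

rows = [
    {"time": t, "y": y, "x": x, "pm25": values[i, j, k]}
    for i, t in enumerate(times)
    for j, y in enumerate(ys)
    for k, x in enumerate(xs)
]
df = pd.DataFrame(rows)           # 8 rows, one per (time, y, x) cell
df.to_csv("pm25_data.csv", index=False)
```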

๐Ÿฅ Health Analysis

  • WHO guidelines: Air quality threshold analysis
  • EU standards: Compliance checking with European regulations
  • Exceedance mapping: Spatial distribution of threshold violations
  • Health impact assessment: Tools for public health research
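
The threshold analysis reduces to comparing an annual-mean grid against guideline values. A minimal numpy sketch using the PM₂.₅ thresholds from the package's default configuration (WHO annual 5 μg/m³, EU annual 25 μg/m³); the grid values are synthetic:

```python
# Count and map grid cells exceeding WHO/EU annual PM2.5 thresholds.
import numpy as np

annual_mean = np.array([[3.0, 6.5],
                        [12.0, 27.0]])  # μg/m³ per grid cell (synthetic)

WHO_ANNUAL_PM25 = 5.0
EU_ANNUAL_PM25 = 25.0

who_mask = annual_mean > WHO_ANNUAL_PM25        # exceedance map
who_cells = int(who_mask.sum())                 # 3 cells exceed WHO
eu_cells = int((annual_mean > EU_ANNUAL_PM25).sum())  # 1 cell exceeds EU
```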

🚀 Quick Start

Installation

# Install from PyPI (when available)
pip install dss-pollution-extraction

# Or install from source
git clone https://github.com/MuhammadShafeeque/dss-pollution-extraction.git
cd dss-pollution-extraction
pip install -e .

Basic Usage

from pollution_extraction import PollutionAnalyzer

# Initialize analyzer
analyzer = PollutionAnalyzer("your_pollution_data.nc", pollution_type="pm25")

# Print dataset summary
analyzer.print_summary()

# Create visualizations
analyzer.plot_map(time_index=0, save_path="spatial_map.png")
analyzer.plot_time_series(save_path="time_series.png")
analyzer.plot_seasonal_cycle(save_path="seasonal_cycle.png")

# Temporal analysis
monthly_avg = analyzer.get_monthly_averages()
annual_avg = analyzer.get_annual_averages()

# Spatial extraction
point_locations = [(4321000, 3210000), (4500000, 3400000)]
point_data = analyzer.extract_at_points(point_locations)

# Export data
analyzer.export_to_geotiff("pm25_annual.tif", aggregation_method="mean")
analyzer.export_to_csv("pm25_data.csv")

๐Ÿ“ Project Structure

dss-pollution-extraction/
โ”œโ”€โ”€ pollution_extraction/           # Main package
โ”‚   โ”œโ”€โ”€ core/                      # Core functionality
โ”‚   โ”‚   โ”œโ”€โ”€ data_reader.py         # NetCDF data reading
โ”‚   โ”‚   โ”œโ”€โ”€ temporal_aggregator.py # Time-based analysis
โ”‚   โ”‚   โ”œโ”€โ”€ spatial_extractor.py   # Spatial data extraction
โ”‚   โ”‚   โ”œโ”€โ”€ data_visualizer.py     # Plotting and visualization
โ”‚   โ”‚   โ””โ”€โ”€ data_exporter.py       # Multi-format data export
โ”‚   โ”œโ”€โ”€ analyzer.py                # Main analysis interface
โ”‚   โ”œโ”€โ”€ cli.py                     # Command-line interface
โ”‚   โ”œโ”€โ”€ config.py                  # Configuration management
โ”‚   โ”œโ”€โ”€ utils.py                   # Utility functions
โ”‚   โ””โ”€โ”€ examples.py                # Usage examples
โ”œโ”€โ”€ examples/                      # Example data and scripts
โ”‚   โ”œโ”€โ”€ notebooks/                 # Jupyter notebooks
โ”‚   โ”œโ”€โ”€ scripts/                   # Example Python scripts
โ”‚   โ””โ”€โ”€ data/                      # Sample datasets
โ”œโ”€โ”€ tests/                         # Unit tests
โ”œโ”€โ”€ docs/                          # Documentation
โ””โ”€โ”€ config/                        # Configuration files

๐Ÿ› ๏ธ Requirements

Core Dependencies

  • Python: ≥3.8
  • xarray: ≥2022.3.0 (NetCDF data handling)
  • pandas: ≥1.4.0 (Data manipulation)
  • numpy: ≥1.21.0 (Numerical operations)
  • geopandas: ≥0.10.0 (Spatial data)
  • rioxarray: ≥0.11.0 (Raster I/O)
  • matplotlib: ≥3.5.0 (Plotting)
  • seaborn: ≥0.11.0 (Statistical visualization)
  • cartopy: ≥0.20.0 (Geographic projections)

Optional Dependencies

  • Jupyter: Interactive notebooks
  • Plotly: Interactive visualizations
  • Folium: Web-based mapping
  • Numba: Performance optimization

📋 Supported Data Formats

Input Formats

  • NetCDF4 (.nc): Primary format for atmospheric data
  • Coordinate Systems: LAEA Europe projection, Geographic (WGS84)
  • Temporal Resolution: Daily, monthly, annual data
  • Spatial Resolution: High-resolution gridded data

Output Formats

  • NetCDF4: Processed data with metadata preservation
  • GeoTIFF: Raster format for GIS applications
  • CSV: Tabular data for statistical analysis
  • GeoJSON: Vector format for web applications
  • Shapefile: Standard GIS vector format

๐ŸŒ Supported Pollutants

Pollutant Variable Name Units Description
Black Carbon (BC) BC_downscaled 10โปโต m Aerosol Optical Depth
Nitrogen Dioxide (NOโ‚‚) no2_downscaled ฮผg/mยณ Surface concentration
PMโ‚‚.โ‚… PM2p5_downscaled ฮผg/mยณ Fine particulate matter
PMโ‚โ‚€ PM10_downscaled ฮผg/mยณ Coarse particulate matter

📚 Examples

Temporal Analysis

# Monthly averages for specific years
monthly_data = analyzer.get_monthly_averages(years=[2019, 2020, 2021])

# Seasonal patterns
seasonal_data = analyzer.get_seasonal_averages()

# Custom period analysis
custom_periods = [("2020-03-01", "2020-05-31")]  # COVID lockdown
lockdown_data = analyzer.get_custom_period_averages(custom_periods)

Spatial Extraction

# Extract data for monitoring stations
stations = [(4321000, 3210000), (4500000, 3400000)]
station_data = analyzer.extract_at_points(stations)

# Extract data for administrative regions
nuts3_data = analyzer.extract_for_nuts3("nuts3_regions.shp")

# Extract data for custom polygons
polygon_data = analyzer.extract_for_polygons("study_areas.shp")

Health Threshold Analysis

# WHO and EU guideline analysis
thresholds = analyzer.config.get_health_thresholds("pm25")
annual_avg = analyzer.get_annual_averages()

# Calculate exceedances
who_exceeded = (annual_avg > thresholds["who_annual"]).sum()
eu_exceeded = (annual_avg > thresholds["eu_annual"]).sum()

Batch Processing

# Process multiple files
file_patterns = ["data_2019.nc", "data_2020.nc", "data_2021.nc"]

for file_path in file_patterns:
    with PollutionAnalyzer(file_path, pollution_type="no2") as analyzer:
        # Annual analysis
        annual_avg = analyzer.get_annual_averages()

        # Export results
        year = file_path.split("_")[1].split(".")[0]
        analyzer.export_to_geotiff(f"no2_annual_{year}.tif")

๐Ÿ–ฅ๏ธ Command Line Interface

The package includes a comprehensive CLI for common operations:

# Basic dataset information
pollution-cli info pollution_data.nc --type pm25

# Create visualizations
pollution-cli plot pollution_data.nc --type no2 --time-series --spatial-map

# Export data to different formats
pollution-cli export pollution_data.nc --format geotiff --output pm25_annual.tif

# Health threshold analysis
pollution-cli health-analysis pollution_data.nc --type pm25 --who-guidelines

# Batch processing
pollution-cli batch-process data/*.nc --type bc --monthly --annual

🔧 Configuration

Default Configuration

The package uses sensible defaults but can be customized via configuration files:

# config/user_config.yaml
visualization:
  default_colormap: "viridis"
  figure_size: [12, 8]
  dpi: 300

export:
  compression: "lzw"
  nodata_value: -9999

health_thresholds:
  pm25:
    who_annual: 5.0      # μg/m³
    eu_annual: 25.0      # μg/m³

Environment Variables

export POLLUTION_CONFIG_PATH="/path/to/your/config.yaml"
export POLLUTION_OUTPUT_DIR="/path/to/output"
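
One plausible way such variables would be consumed inside a script, falling back to the defaults shown earlier (the exact lookup logic is an assumption; only the variable names come from the README):

```python
# Read the configuration path and output directory from the environment,
# with fallbacks to the documented defaults.
import os
from pathlib import Path

config_path = os.environ.get("POLLUTION_CONFIG_PATH", "config/user_config.yaml")
output_dir = Path(os.environ.get("POLLUTION_OUTPUT_DIR", "./output"))
output_dir.mkdir(parents=True, exist_ok=True)  # ensure the target exists
```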

🧪 Testing

Run the test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=pollution_extraction

# Run specific test categories
pytest tests/test_spatial_extractor.py

📖 Documentation

Comprehensive documentation is available:

  • API Reference: Detailed function and class documentation
  • Tutorials: Step-by-step guides for common use cases
  • Examples: Jupyter notebooks with real-world applications
  • Installation Guide: Platform-specific installation instructions

# Build documentation locally
cd docs/
make html

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Setting up the development environment
  • Running tests and quality checks
  • Submitting pull requests
  • Coding standards and guidelines

Development Setup

git clone https://github.com/MuhammadShafeeque/dss-pollution-extraction.git
cd dss-pollution-extraction

# Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev,docs,jupyter]"

# Run pre-commit hooks
pre-commit install

📊 Performance

The package is optimized for large datasets:

  • Chunked processing: Memory-efficient handling of large NetCDF files
  • Parallel computation: Multi-core processing with Dask
  • Optimized I/O: Compressed output formats
  • Caching: Intelligent caching of intermediate results
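
The chunked-processing idea can be sketched in pure numpy: aggregate over time in fixed-size chunks so only one chunk is ever in memory, which is the same pattern xarray and Dask apply to large NetCDF files (the function and data here are illustrative, not the package's implementation):

```python
# Streaming time-mean over an iterable of (time, y, x) chunks.
import numpy as np

def chunked_time_mean(chunks):
    """Accumulate a running sum over chunks, then divide by the total count."""
    total, count = None, 0
    for chunk in chunks:
        s = chunk.sum(axis=0)
        total = s if total is None else total + s
        count += chunk.shape[0]
    return total / count

rng = np.random.default_rng(1)
data = rng.uniform(0, 10, size=(365, 4, 4))
# Feed the year in 30-day chunks instead of loading it all at once
chunks = (data[i:i + 30] for i in range(0, 365, 30))
result = chunked_time_mean(chunks)  # equals data.mean(axis=0)
```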

Benchmarks

  • 10GB NetCDF file: ~5 minutes for annual aggregation
  • 50 monitoring stations: <30 seconds for time series extraction
  • NUTS3 regions (1,000+ polygons): ~2 minutes for spatial aggregation

๐Ÿ› Known Issues & Limitations

  • Coordinate Systems: Currently optimized for LAEA Europe projection
  • Memory Usage: Very large files (>20GB) may require chunking configuration
  • Projection Support: Limited support for non-European coordinate systems

See our Issues page for current bugs and feature requests.

📄 License

This project is licensed under the MIT License. See LICENSE file for details.

👥 Authors & Acknowledgments

Main Developer:

  • Muhammad Shafeeque
Institution:

  • Alfred Wegener Institute (AWI) - Helmholtz Centre for Polar and Marine Research

Acknowledgments

  • European Space Agency (ESA) for satellite data
  • Copernicus Atmosphere Monitoring Service (CAMS)
  • The xarray and pandas development communities
  • Contributors to the atmospheric science Python ecosystem

📞 Support & Contact

📈 Roadmap

Version 1.1 (Planned)

  • Additional coordinate system support
  • Real-time data processing
  • Machine learning integration
  • Performance optimizations

Version 1.2 (Future)

  • Cloud processing support
  • Additional pollutant types
  • Web-based dashboard
  • API service

๐Ÿท๏ธ Citation

If you use this package in your research, please cite:

@software{shafeeque2024dss,
  title={DSS Pollution Extraction: A Python Package for Atmospheric Pollution Data Analysis},
  author={Shafeeque, Muhammad},
  year={2024},
  institution={Alfred Wegener Institute},
  url={https://github.com/MuhammadShafeeque/dss-pollution-extraction}
}

Keywords: atmospheric pollution, air quality, NetCDF, spatial analysis, temporal analysis, PM2.5, NO2, black carbon, Python, xarray, environmental data science

For questions, suggestions, or collaborations, please contact the development team at AWI.
