Preprocessing tools for satellite imagery analysis

GeoPre: Geospatial Data Processing Toolkit

GeoPre is a Python library designed to streamline common geospatial data operations, offering a unified interface for handling raster and vector datasets. It simplifies preprocessing tasks essential for GIS analysis, machine learning workflows, and remote sensing applications.

Key Features

  • Data Scaling:

    • Normalization (Z-Score) and Min-Max scaling for raster bands.
    • Prepares data for ML models while preserving geospatial metadata.
  • CRS Management:

    • Retrieve and compare Coordinate Reference Systems (CRS) across raster (Rasterio/Xarray) and vector (GeoPandas) datasets.
    • Ensure consistency between datasets with automated CRS checks.
  • Reprojection:

    • Reproject vector data (GeoDataFrames) and raster data (Rasterio/Xarray) to any target CRS.
    • Supports EPSG codes, WKT, and Proj4 strings.
  • No-Data Masking:

    • Handle missing values in raster datasets (NumPy/Xarray) with flexible masking.
    • Integrates seamlessly with raster metadata for error-free workflows.
  • Cloud Masking:

    • Identify and mask clouds in Sentinel-2 and Landsat imagery.
    • Supports multiple methods: QA bands, scene classification layers (SCL), probability bands, and OmniCloudMask AI-based detection.
    • Optionally mask cloud shadows for improved accuracy.
  • Band Stacking:

    • Stack multiple raster bands from a folder into a single multi-band raster for analysis.
    • Supports automatic band detection and resampling for different resolutions.

Supported Data Types

  • Raster: NumPy arrays, Rasterio DatasetReader, Xarray DataArray (via rioxarray).
  • Vector: GeoPandas GeoDataFrame.

Benefits of GeoPre

  • Unified Workflow: Eliminates boilerplate code by providing consistent functions for raster and vector data.
  • Interoperability: Bridges gaps between GeoPandas, Rasterio, and Xarray, ensuring smooth data transitions.
  • Robust Error Handling: Automatically detects CRS mismatches and missing metadata to prevent silent failures.
  • Efficiency: Optimized reprojection and masking operations reduce preprocessing time for large datasets.
  • ML-Ready Outputs: Scaling functions preserve data structure, making outputs directly usable in machine learning pipelines.

Ideal for researchers and developers working with geospatial data, GeoPre enhances productivity by standardizing preprocessing steps and ensuring compatibility across diverse geospatial tools.

Installation

Install the package by running the following command:

pip install geopre

This will automatically install all required dependencies.

Usage

1. Data Scaling

Z-Score Scaling

Description: This method centers the data around zero by subtracting the mean and dividing by the standard deviation. It is useful for machine learning models that are sensitive to feature scale or outliers, and it can standardize a band of pixel values before clustering or classification.

Parameters:

  • data (numpy.ndarray): Input array to normalize.

Returns:

  • numpy.ndarray: Standardized data with mean 0 and standard deviation 1.

Min_Max_Scaling

Description: This method scales the pixel values to a fixed range, typically [0, 1] or [-1, 1]. It is ideal when you want to preserve the relative spacing of values. For example, 16-bit GeoTIFF pixel values (0 to 65535) can be scaled to [0, 1].

Parameters:

  • data (numpy.ndarray): Input array to normalize.

Returns:

  • numpy.ndarray: Scaled data with values between 0 and 1, or -1 and 1.

Example:

import numpy as np
import geopre as gp

data = np.array([[10, 20, 30], [40, 50, 60]])
z_scaled = gp.Z_score_scaling(data)
minmax_scaled = gp.Min_Max_Scaling(data)
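
For readers who want to see the arithmetic behind both scalers, the following is a minimal NumPy sketch of the formulas described above. It is not GeoPre's implementation: the helper names `z_score` and `min_max` and the `feature_range` parameter are hypothetical, and how GeoPre treats NaN values or constant-valued bands is not documented here.

```python
import numpy as np

def z_score(data: np.ndarray) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation (hypothetical helper)."""
    data = data.astype(np.float64)
    # Note: a constant band has std == 0 and would divide by zero here.
    return (data - data.mean()) / data.std()

def min_max(data: np.ndarray, feature_range=(0.0, 1.0)) -> np.ndarray:
    """Linearly map the smallest value to `low` and the largest to `high` (hypothetical helper)."""
    low, high = feature_range
    data = data.astype(np.float64)
    unit = (data - data.min()) / (data.max() - data.min())  # values in [0, 1]
    return unit * (high - low) + low

band = np.array([[10, 20, 30], [40, 50, 60]])
z = z_score(band)                                  # mean 0, standard deviation 1
mm = min_max(band)                                 # values in [0, 1]
mm_sym = min_max(band, feature_range=(-1.0, 1.0))  # values in [-1, 1]
```

Both helpers keep the array shape unchanged, which is what lets the scaled band be written back with the original raster metadata.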

2. CRS Management

get_crs

Description: Retrieve CRS from geospatial data objects.

Parameters:

  • data: GeoPandas GeoDataFrames (vector), Rasterio DatasetReaders (raster) or Xarray DataArrays with rio accessor (raster)

Returns:

  • pyproj.CRS: Coordinate reference system or None if undefined

compare_crs

Description: Compare CRS between raster and vector datasets.

Parameters:

  • raster_obj (DatasetReader/xarray.DataArray): Raster data source.
  • vector_gdf (gpd.GeoDataFrame): Vector data source.

Returns:

dict: Comparison results with keys:

  • raster_crs: Formatted CRS string
  • vector_crs: Formatted CRS string
  • same_crs: Boolean comparison result
  • error: Exception message if any

Example:

import geopandas as gpd
import rasterio
import geopre as gp

vector = gpd.read_file("data.shp")
raster = rasterio.open("image.tif")

print(gp.get_crs(vector))  # EPSG:4326
print(gp.compare_crs(raster, vector))  # CRS comparison results
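
Because `compare_crs` returns a dictionary rather than raising, a calling script has to inspect the keys itself. The sketch below shows one way to turn that result into a hard failure before further processing. `require_same_crs` is a hypothetical helper, not part of GeoPre, and the sample dictionaries only mimic the keys listed above. The exact CRS string format and the value of `error` when nothing went wrong (assumed here to be `None`) are assumptions.

```python
def require_same_crs(result: dict) -> None:
    """Raise if a compare_crs-style result reports an error or a CRS mismatch."""
    if result.get("error"):
        raise RuntimeError(f"CRS comparison failed: {result['error']}")
    if not result.get("same_crs"):
        raise ValueError(
            f"CRS mismatch: raster is {result.get('raster_crs')}, "
            f"vector is {result.get('vector_crs')}"
        )

# Hand-written stand-ins for the dictionary shape documented above.
matching = {"raster_crs": "EPSG:4326", "vector_crs": "EPSG:4326",
            "same_crs": True, "error": None}
mismatched = {"raster_crs": "EPSG:32633", "vector_crs": "EPSG:4326",
              "same_crs": False, "error": None}

require_same_crs(matching)  # passes silently

try:
    require_same_crs(mismatched)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
```

Failing early like this keeps a mismatch from surfacing later as an empty clip or a misaligned overlay.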

3. Reprojection

reproject_data

Description: Reproject geospatial data to a target CRS.

Parameters:

  • data: GeoDataFrames (vector reprojection), or Rasterio datasets (returns array + metadata), or Xarray objects (rioxarray reprojection)
  • target_crs: CRS to reproject to (EPSG code/WKT/proj4 string)

Returns:

  • Reprojected data in format matching input type

Example:

import geopandas as gpd
import rasterio
import rioxarray
import geopre as gp

# Vector reprojection
vector = gpd.read_file("data.shp")
reprojected_vector = gp.reproject_data(vector, "EPSG:3857")

# Raster reprojection (Rasterio): returns the reprojected array and its metadata
with rasterio.open("input.tif") as src:
    array, metadata = gp.reproject_data(src, "EPSG:32633")

# Xarray reprojection (xr.open_rasterio was removed from recent xarray releases; use rioxarray)
da = rioxarray.open_rasterio("image.tif")
reprojected_da = gp.reproject_data(da, "EPSG:4326")

4. No-Data Masking

mask_raster_data

Description: Mask no-data values in raster datasets. Handles both rasterio (numpy) and rioxarray (xarray) workflows.

Parameters:

  • data: Raster data (numpy.ndarray or xarray.DataArray)
  • profile: Rasterio metadata dict (required for numpy arrays)
  • no_data_value: Override for metadata's nodata value
  • return_mask: Whether to return boolean mask

Returns:

  • Masked data array. For numpy inputs, returns a tuple (masked_array, profile). For xarray inputs, returns a DataArray.

Example:

import rasterio
import rioxarray
import geopre as gp

# Rasterio workflow
with rasterio.open("data.tif") as src:
    data = src.read(1)
    masked, profile = gp.mask_raster_data(data, src.profile)

# rioxarray workflow (xr.open_rasterio was removed from recent xarray releases)
da = rioxarray.open_rasterio("data.tif")
masked_da = gp.mask_raster_data(da)
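
To make the idea of no-data masking concrete, here is a minimal NumPy sketch of what masking a single band can look like. It is an illustration under assumptions, not GeoPre's implementation: `mask_nodata` is a hypothetical helper, the profile is a hand-written dictionary with a `nodata` key like the one Rasterio exposes, and replacing masked pixels with NaN is one choice among several.

```python
import numpy as np

def mask_nodata(data, profile, no_data_value=None):
    """Return (masked float array, boolean mask) for one band (hypothetical helper)."""
    # An explicit override wins over the value stored in the metadata.
    nodata = profile.get("nodata") if no_data_value is None else no_data_value
    if nodata is None:
        raise ValueError("No nodata value in the profile and none supplied")
    mask = data == nodata              # True where the pixel is no-data
    masked = data.astype(np.float64)   # float copy so NaN can be stored
    masked[mask] = np.nan
    return masked, mask

band = np.array([[-9999, 12, 15], [18, -9999, 21]])
profile = {"nodata": -9999}            # stand-in for src.profile

masked, mask = mask_nodata(band, profile)
valid_mean = np.nanmean(masked)        # statistics now ignore no-data pixels
```

Without the mask, the two -9999 pixels would dominate the band mean; with it, only the four valid pixels contribute.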

5. Cloud Masking

mask_clouds_S2

Description: Masks clouds and optionally shadows in a Sentinel-2 raster image using various methods.

Parameters:

  • image_path (str): Path to the input raster image.
  • output_path (str, optional): Path to save the masked output raster. Defaults to the same directory as the input with '_masked' appended to the filename.
  • method (str, optional): The method for masking. Options are:
    • 'auto': Automatically chooses the best available method.
    • 'qa': Uses the QA60 band to mask clouds. WARNING: QA60 was deprecated after 2022-01-25, so results for images acquired after that date may be wrong.
    • 'probability': Uses the cloud probability band MSK_CLDPRB with a threshold for masking.
    • 'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection. This may take a long time on large images.
    • 'scl': Leverages the Scene Classification Layer (SCL) for masking.
    • 'standard': Similar to 'auto', but avoids the OmniCloudMask method.
  • mask_shadows (bool): Whether to mask cloud shadows. Defaults to False.
  • threshold (int, optional): Cloud probability threshold (if using a cloud probability band), from 0 to 100. Defaults to 20.
  • qa60_idx (int, optional): Index of the QA60 band (1-based). Auto-detected if not provided.
  • qa60_path (str, optional): Path to the QA60 band (if in a separate file).
  • prob_band_idx (int, optional): Index of the cloud probability band (1-based). Auto-detected if not provided.
  • prob_band_path (str, optional): Path to the cloud probability band (if in a separate file).
  • scl_idx (int, optional): Index of the SCL band (1-based). Auto-detected if not provided.
  • scl_path (str, optional): Path to the SCL band (if in a separate file).
  • red_idx, green_idx, nir_idx (int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
  • nodata_value (float): Value for no-data regions. Defaults to np.nan.

Returns:

  • (str): The path to the saved masked output raster.

Example:

import geopre as gp

output_s2 = gp.mask_clouds_S2("sentinel2_image.tif", method='auto', mask_shadows=True)
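
The band-based methods above reduce to simple per-pixel tests. The NumPy sketch below illustrates three of them on tiny hand-made arrays, using the class codes and bit positions from the Sentinel-2 product documentation (SCL classes 8, 9, and 10 for clouds and 3 for cloud shadow; QA60 bits 10 and 11 for opaque and cirrus clouds). Which classes GeoPre actually masks, and whether its threshold comparison is inclusive, are not stated in this README, so treat those choices as assumptions.

```python
import numpy as np

# Tiny stand-in arrays; real bands would be read from the image.
scl = np.array([[4, 8, 9], [3, 10, 6]], dtype=np.uint8)
cloud_prob = np.array([[5, 60, 95], [10, 25, 0]], dtype=np.uint8)  # MSK_CLDPRB, 0-100
qa60 = np.array([[0, 1 << 10, 0], [0, 1 << 11, 0]], dtype=np.uint16)
red = np.array([[0.10, 0.80, 0.90], [0.02, 0.40, 0.05]])

# 'scl': cloud classes, optionally plus cloud shadow (class 3).
mask_shadows = True
scl_classes = [8, 9, 10] + ([3] if mask_shadows else [])
scl_mask = np.isin(scl, scl_classes)

# 'probability': pixels at or above the threshold (assumed inclusive).
threshold = 20
prob_mask = cloud_prob >= threshold

# 'qa': bit 10 = opaque cloud, bit 11 = cirrus.
qa_mask = (qa60 & ((1 << 10) | (1 << 11))) != 0

# Apply a mask by writing the no-data value into flagged pixels.
red_masked = np.where(scl_mask, np.nan, red)
```

The three masks disagree on some pixels here by construction, which mirrors why an 'auto' mode that picks among available bands can be convenient.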

mask_clouds_landsat

Description:
Masks clouds and optionally shadows in a Landsat raster image using various methods.

Parameters:

  • image_path (str): Path to the input multi-band raster image.
  • output_path (str, optional): Path to save the masked output raster. Defaults to the same directory as the input with _masked suffix.
  • method (str): The method for masking. Options are:
    • 'auto': Automatically chooses the best available method.
    • 'qa': Uses the QA_PIXEL band to mask clouds.
    • 'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection.
  • mask_shadows (bool): Whether to mask cloud shadows. Defaults to False.
  • qa_pixel_path (str, optional): Path to the separate QA_PIXEL raster file.
  • qa_pixel_idx (int, optional): Index of the QA_PIXEL band (1-based).
  • confidence_threshold (str, optional): Confidence threshold for cloud masking (e.g., 'Low', 'Medium', 'High'). Defaults to 'High'. WARNING: according to the official Landsat documentation, the confidence bands are still under development; always use the default 'High' until further notice.
  • red_idx, green_idx, nir_idx (int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
  • nodata_value (float): Value for no-data regions. Defaults to np.nan.

Returns:

  • (str): The path to the saved masked output raster.

Example:

import geopre as gp

output_landsat = gp.mask_clouds_landsat("landsat_image.tif", method='auto', mask_shadows=True)
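
To show what the 'qa' method involves, the sketch below decodes a QA_PIXEL band with NumPy using the Landsat Collection 2 bit layout (bit 3 for cloud, bit 4 for cloud shadow, bits 8-9 for cloud confidence with 1 = low, 2 = medium, 3 = high). `landsat_cloud_mask` is a hypothetical helper, and combining the cloud bit with a minimum confidence level is an assumption about how a threshold could be applied, not a description of GeoPre's internals.

```python
import numpy as np

CLOUD_BIT = 3
SHADOW_BIT = 4
CONFIDENCE_SHIFT = 8  # bits 8-9 hold the cloud confidence
CONFIDENCE_LEVELS = {"Low": 1, "Medium": 2, "High": 3}

def landsat_cloud_mask(qa_pixel, mask_shadows=False, confidence_threshold="High"):
    """Boolean mask, True where a pixel should be masked (hypothetical helper)."""
    cloud = ((qa_pixel >> CLOUD_BIT) & 1) == 1
    confidence = (qa_pixel >> CONFIDENCE_SHIFT) & 0b11
    mask = cloud & (confidence >= CONFIDENCE_LEVELS[confidence_threshold])
    if mask_shadows:
        mask |= ((qa_pixel >> SHADOW_BIT) & 1) == 1
    return mask

qa_pixel = np.array([
    [1 << 6, (1 << 3) | (3 << 8)],  # clear; cloud with high confidence
    [1 << 4, (1 << 3) | (1 << 8)],  # cloud shadow; cloud with low confidence
], dtype=np.uint16)

clouds_only = landsat_cloud_mask(qa_pixel)
clouds_and_shadows = landsat_cloud_mask(qa_pixel, mask_shadows=True)
```

With the default 'High' threshold, the low-confidence cloud pixel is left unmasked, and the shadow pixel is only flagged when `mask_shadows=True`.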

6. Band Stacking

stack_bands

Description:
Stacks multiple raster bands from a folder into a single multi-band raster. Also supports .SAFE folders.

Parameters:

  • input_path (str or Path): Path to the folder containing band files.
  • required_bands (list of str): List of band name identifiers (e.g., ["B4", "B3", "B2"]).
  • output_path (str or Path, optional): Path to save the stacked raster. Defaults to "stacked.tif" in the input folder.
  • resolution (float, optional): Target resolution for resampling. Defaults to the highest available resolution.

Returns:

  • (str): The path to the saved stacked output raster.

Example:

import geopre as gp

stacked_image = gp.stack_bands("/path/to/folder/containing/bands", ["B4", "B3", "B2"])
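
Stacking bands of different resolutions requires bringing them onto a common grid first. The NumPy sketch below shows the idea with a 10 m band and a 20 m band covering the same area. It uses nearest-neighbour upsampling purely for illustration: the resampling method GeoPre applies is not stated in this README, `upsample_nearest` is a hypothetical helper, and a real workflow would also update the raster transform and metadata.

```python
import numpy as np

def upsample_nearest(band, factor):
    """Repeat each pixel `factor` times along both axes (nearest-neighbour)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

# Stand-ins: a 10 m band (4x4 pixels) and a 20 m band (2x2 pixels) over the same area.
b4_10m = np.arange(16, dtype=np.float32).reshape(4, 4)
b11_20m = np.array([[100, 200], [300, 400]], dtype=np.float32)

b11_10m = upsample_nearest(b11_20m, factor=2)  # now 4x4, matching the 10 m grid
stacked = np.stack([b4_10m, b11_10m])          # shape: (bands, rows, cols)
```

The resulting array follows the (bands, rows, cols) axis order that Rasterio uses when writing multi-band rasters.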

Examples

We provide two example Jupyter notebooks demonstrating the usage of GeoPre:

Contributing

  1. Fork the repository

    Click the "Fork" button at the top-right of this repository to create your copy.

  2. Create your feature branch

    git checkout -b feature/your-feature
    
  3. Commit changes

    git commit -am 'Add some feature'
    
  4. Push to branch

    git push origin feature/your-feature
    
  5. Open a Pull Request

    Navigate to the Pull Requests tab in the original repository and click "New Pull Request" to submit your changes.

License

This project is licensed under the MIT License. See LICENSE for more information.

Authors

Liang Zhongyou – GitHub Profile

Matteo Gobbi Frattini – GitHub Profile
