Preprocessing tools for satellite imagery analysis
Project description
GeoPre: Geospatial Data Processing Toolkit
GeoPre is a Python library designed to streamline common geospatial data operations, offering a unified interface for handling raster and vector datasets. It simplifies preprocessing tasks essential for GIS analysis, machine learning workflows, and remote sensing applications.
Key Features
-
Data Scaling:
- Normalization (Z-Score) and Min-Max scaling for raster bands.
- Prepares data for ML models while preserving geospatial metadata.
-
CRS Management:
- Retrieve and compare Coordinate Reference Systems (CRS) across raster (Rasterio/Xarray) and vector (GeoPandas) datasets.
- Ensure consistency between datasets with automated CRS checks.
-
Reprojection:
- Reproject vector data (GeoDataFrames) and raster data (Rasterio/Xarray) to any target CRS.
- Supports EPSG codes, WKT, and Proj4 strings.
-
No-Data Masking:
- Handle missing values in raster datasets (NumPy/Xarray) with flexible masking.
- Integrates seamlessly with raster metadata for error-free workflows.
-
Cloud Masking:
- Identify and mask clouds in Sentinel-2 and Landsat imagery.
- Supports multiple methods: QA bands, scene classification layers (SCL), probability bands, and OmniCloudMask AI-based detection.
- Optionally mask cloud shadows for improved accuracy.
-
Water Masking (NEW in v0.3.0):
- Mask water bodies in Sentinel-2 and Landsat imagery using the NDWI index.
- Tries to automatically detects relevant bands (Green/NIR).
-
Band Stacking:
- Stack multiple raster bands from a folder into a single multi-band raster for analysis.
- Supports automatic band detection and resampling for different resolutions.
Supported Data Types
- Raster: NumPy arrays, Rasterio
DatasetReader, XarrayDataArray(via rioxarray). - Vector: GeoPandas
GeoDataFrame.
Benefits of GeoPre
- Unified Workflow: Eliminates boilerplate code by providing consistent functions for raster and vector data.
- Interoperability: Bridges gaps between GeoPandas, Rasterio, and Xarray, ensuring smooth data transitions.
- Robust Error Handling: Automatically detects CRS mismatches and missing metadata to prevent silent failures.
- Efficiency: Optimized reprojection and masking operations reduce preprocessing time for large datasets.
- ML-Ready Outputs: Scaling functions preserve data structure, making outputs directly usable in machine learning pipelines.
Ideal for researchers and developers working with geospatial data, GeoPre enhances productivity by standardizing preprocessing steps and ensuring compatibility across diverse geospatial tools.
Installation
GeoPre is available on PyPI and can be installed with:
pip install geopre
This will automatically install all required dependencies.
Usage
1. Data Scaling
Z-Score Scaling
Description:This method centers the data around zero by subtracting the mean and dividing by the standard deviation, which is useful for machine learning models sensitive to outliers and can standardize a band of pixel values for clustering/classification.
Parameters:
- data (numpy.ndarray): Input array to normalize.
Returns:
- numpy.ndarray: Standardized data with mean 0 and standard deviation 1.
Min_Max_Scaling
Description: This method scales the pixel values to a fixed range, typically [0, 1] or [-1, 1]. Ideal when you want to preserve the relative range of values. For GeoTIFF image values (e.g., 0 to 65535), scale them to [0, 1].
Parameters:
- data (numpy.ndarray): Input array to normalize.
Returns:
- numpy.ndarray: Scaled data with values between 0 and 1, or -1 and 1.
Example:
import numpy as np
import geopre as gp
data = np.array([[10, 20, 30], [40, 50, 60]])
z_scaled = gp.Z_score_scaling(data)
minmax_scaled = gp.Min_Max_Scaling(data)
2. CRS Management
get_crs
Description: Retrieve CRS from geospatial data objects.
Parameters:
- data: GeoPandas GeoDataFrames (vector), Rasterio DatasetReaders (raster) or Xarray DataArrays with rio accessor (raster)
Returns:
- pyproj.CRS: Coordinate reference system or None if undefined
compare_crs
Description: Compare CRS between raster and vector datasets.
Parameters:
- raster_obj (DatasetReader/xarray.DataArray): Raster data source.
- vector_gdf (gpd.GeoDataFrame): Vector data source.
Returns:
dict: Comparison results with keys:
- raster_crs: Formatted CRS string
- vector_crs: Formatted CRS string
- same_crs: Boolean comparison result
- error: Exception message if any
Example:
import geopandas as gpd
import rasterio
import geopre as gp
vector = gpd.read_file("data.shp")
raster = rasterio.open("image.tif")
print(gp.get_crs(vector)) # EPSG:4326
print(gp.compare_crs(raster, vector)) # CRS comparison results
3. Reprojection
reproject_data
Description: Reproject geospatial data to target CRS.
Parameters:
- data: GeoDataFrames (vector reprojection), or Rasterio datasets (returns array + metadata), or Xarray objects (rioxarray reprojection)
- target_crs: CRS to reproject to (EPSG code/WKT/proj4 string)
Returns:
- Reprojected data in format matching input type
Example:
import rasterio
import xarray as xr
import geopre as gp
# Vector reprojection
reprojected_vector = gp.reproject_data(vector, "EPSG:3857")
# Raster reprojection (Rasterio)
with rasterio.open("input.tif") as src:
array, metadata = gp.reproject_data(src, "EPSG:32633")
# Xarray reprojection
da = xr.open_rasterio("image.tif")
reprojected_da = gp.reproject_data(da, "EPSG:4326")
4. No-Data Masking
mask_raster_data
Description: Mask no-data values in raster datasets. Handles both rasterio (numpy) and rioxarray (xarray) workflows.
Parameters:
- data: Raster data (numpy.ndarray or xarray.DataArray)
- profile: Rasterio metadata dict (required for numpy arrays)
- no_data_value: Override for metadata's nodata value
- return_mask: Whether to return boolean mask
Returns:
- Masked data array. For numpy inputs, returns tuple:(masked_array, profile). For xarray, returns DataArray.
Example:
import xarray as xr
import rasterio
import geopre as gp
# Rasterio workflow
with rasterio.open("data.tif") as src:
data = src.read(1)
masked, profile = gp.mask_raster_data(data, src.profile)
# rioxarray workflow
da = xr.open_rasterio("data.tif")
masked_da = gp.mask_raster_data(da)
5. Cloud Masking
mask_clouds_S2
Description: Masks clouds and optionally shadows in a Sentinel-2 raster image using various methods.
Parameters:
image_path(str): Path to the input raster image.output_path(str, optional): Path to save the masked output raster. Defaults to the same directory as the input with '_masked' appended to the filename.method(str, optional): The method for masking. Options are:'auto': Automatically chooses the best available method.'qa': Uses the QA60 band to mask clouds. WARNING: QA60 is masked between 2022-01-25 and 2024-02-28. Results for images in that date range could be wrong'probability': Uses the cloud probability band MSK_CLDPRB with a threshold for masking.'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection. Might take a long time for big images'scl': Leverages the Scene Classification Layer (SCL) for masking.'standard': Similar to 'auto', but avoids the OmniCloudMask method.
mask_shadows(bool): Whether to mask cloud shadows. Defaults toFalse.threshold(int, optional): Cloud probability threshold (if using a cloud probability band), from 0 to 100. Defaults to20.qa60_idx(int, optional): Index of the QA60 band (1-based). Auto-detected if not provided.qa60_path(str, optional): Path to the QA60 band (if in a separate file).prob_band_idx(int, optional): Index of the cloud probability band (1-based). Auto-detected if not provided.prob_band_path(str, optional): Path to the cloud probability band (if in a separate file).scl_idx(int, optional): Index of the SCL band (1-based). Auto-detected if not provided.scl_path(str, optional): Path to the SCL band (if in a separate file).red_idx,green_idx,nir_idx(int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.nodata_value(float): Value for no-data regions. Defaults tonp.nan.
Returns:
- (str): The path to the saved masked output raster.
Example:
import geopre as gp
output_s2 = gp.mask_clouds_S2("sentinel2_image.tif", method='auto', mask_shadows=True)
mask_clouds_landsat
Description:
Masks clouds and optionally shadows in a Landsat raster image using various methods.
Parameters:
image_path(str): Path to the input multi-band raster image.output_path(str, optional): Path to save the masked output raster. Defaults to the same directory as the input with_maskedsuffix.method(str): The method for masking. Options are:'auto': Automatically chooses the best available method.'qa': Uses the QA_PIXEL band to mask clouds.'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection.
mask_shadows(bool): Whether to mask cloud shadows. Defaults toFalse.qa_pixel_path(str, optional): Path to the separate QA_PIXEL raster file.qa_pixel_idx(int, optional): Index of the QA_PIXEL band (1-based).confidence_threshold(str, optional): Confidence threshold for cloud masking (e.g.,'Low','Medium','High'). Defaults to'High'. WARNING: as per the Landsat official documentation, the confidence bands are still under development, always use the default 'High' untill further notice. Sourcered_idx,green_idx,nir_idx(int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.nodata_value(float): Value for no-data regions. Defaults tonp.nan.
Returns
- (str): The path to the saved masked output raster.
Example
import geopre as gp
output_landsat = gp.mask_clouds_landsat("landsat_image.tif", method='auto', mask_shadows=True)
6. Water Masking
mask_water_S2
Description:
Masks water areas in Sentinel-2 imagery using the NDWI (Normalized Difference Water Index). Automatically detects Green and NIR bands based on band descriptions (e.g., B3, B8).
Parameters:
image_path(str): Path to the input raster image.output_path(str, optional): Output path. If not specified, adds_water_maskedsuffix.ndwi_threshold(float, optional): Threshold for NDWI. Default is 0.01.nodata_value(float, optional): Value for masked (non-water) pixels. Default isnp.nan.green_idx,nir_idx(int, optional): Index of Green/NIR bands (1-based). Auto-detected if not provided.green_path,nir_path(str, optional): If bands are stored in separate files.
Returns:
- (str): Path to the saved water-masked output raster.
Example:
import geopre as gp
masked_s2 = gp.mask_water_S2("sentinel_image.tif", ndwi_threshold=0.05)
mask_water_landsat
Description:
Same functionality as mask_water_S2, adapted for Landsat band naming (e.g., B3, B5, SR_B3, SR_B5).
Parameters:
image_path(str): Path to the input raster image.output_path(str, optional): Output path. If not specified, adds_water_maskedsuffix.ndwi_threshold(float, optional): Threshold for NDWI. Default is 0.01.nodata_value(float, optional): Value for masked (non-water) pixels. Default isnp.nan.green_idx,nir_idx(int, optional): Index of Green/NIR bands (1-based). Auto-detected if not provided.green_path,nir_path(str, optional): If bands are stored in separate files.
Returns:
- (str): Path to the saved water-masked output raster.
Example:
import geopre as gp
masked_landsat = gp.mask_water_landsat("landsat_image.tif", ndwi_threshold=0.05)
7. Band Stacking
stack_bands
Description:
Stacks multiple raster bands from a folder into a single multi-band raster. Support also .SAFE folders.
Parameters
input_path(str or Path): Path to the folder containing band files.required_bands(list of str): List of band name identifiers (e.g.,["B4", "B3", "B2"]).output_path(str or Path, optional): Path to save the stacked raster. Defaults to"stacked.tif"in the input folder.resolution(float, optional): Target resolution for resampling. Defaults to the highest available resolution.
Returns
- (str): The path to the saved stacked output raster.
Example
import geopre as gp
stacked_image = gp.stack_bands("/path/to/folder/containing/bands", ["B4", "B3", "B2"])
Examples
We provide two example Jupyter notebooks demonstrating the usage of GeoPre:
- example_usage.ipynb – Demonstrates scaling, reprojecting, and masking operations.
- example_usage_2.ipynb – Covers cloud masking, water masking and band stacking.
Contributing
-
Fork the repository
Click the "Fork" button at the top-right of this repository to create your copy.
-
Create your feature branch
git checkout -b feature/your-feature
-
Commit changes
git commit -am 'Add some feature'
-
Push to branch
git push origin feature/your-feature
-
Open a Pull Request
Navigate to the Pull Requests tab in the original repository and click "New Pull Request" to submit your changes.
Changelog
See the full release notes in the CHANGELOG.md.
License
This project is licensed under the MIT License. See LICENSE for more information.
Author
Liang Zhongyou – GitHub Profile
Matteo Gobbi Frattini – GitHub Profile
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file geopre-0.3.0.tar.gz.
File metadata
- Download URL: geopre-0.3.0.tar.gz
- Upload date:
- Size: 24.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2d90c1f27ff6a83e7e90b3ed2c78892e7a255bcc6000a5f92ef057091197bfbc
|
|
| MD5 |
69e9374e19d5e2b574f5593cc0162d5e
|
|
| BLAKE2b-256 |
732494abd4cfb0f2e1c4879f540f4833f4b52956c5a92245e36035cb8de3bf6d
|
File details
Details for the file geopre-0.3.0-py3-none-any.whl.
File metadata
- Download URL: geopre-0.3.0-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6ad051542cc593a7a46f9f45ec775eafb60fabad554d24f89b9d9bc8ef6b16af
|
|
| MD5 |
0089c56fc808acc08a3621eff1883ca6
|
|
| BLAKE2b-256 |
7d6c62e554595ce16a09698a191fea194079f20aa278ee00bf052cb417d32116
|