utility for checking the distribution of values in intermediate variables in model runs
check-distribution
A Python utility for analyzing and logging the statistical distribution of values in arrays and raster images during model runs. It is particularly useful for debugging geospatial data processing pipelines and monitoring intermediate variables in scientific computing workflows.
Author
Gregory H. Halverson (they/them)
gregory.h.halverson@jpl.nasa.gov
NASA Jet Propulsion Laboratory 329G
Features
- 📊 Statistical Analysis: Automatically computes min, max, mean, and NaN proportion
- 🎨 Color-Coded Logging: Uses colored output to highlight warnings and important values
- 🗺️ Raster Support: Native support for Raster objects from the rasters package
- ⚠️ Blank Detection: Optionally raises errors when outputs are completely blank
- 📈 Smart Output: Detailed distribution for categorical data, statistics for continuous data
- 🕒 Temporal Tracking: Optional date and location parameters for tracking values over time
Installation
From PyPI
pip install check-distribution
From Source
git clone https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution.git
cd check-distribution
pip install -e .
Development Installation
pip install -e ".[dev]"
Quick Start
import numpy as np
from check_distribution import check_distribution
# Analyze a simple array
data = np.random.rand(100, 100)
check_distribution(data, "temperature")
Detailed Usage
Basic Usage
The check_distribution function analyzes array or raster data and logs statistical information:
from check_distribution import check_distribution
import numpy as np
# Create sample data
temperature = np.random.normal(25, 5, (100, 100))
# Basic check
check_distribution(temperature, "temperature")
Output:
variable temperature min: 10.234 mean: 25.123 max: 39.876 nan: 0.00% (nan)
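The figures in that log line can be reproduced by hand. The sketch below is not the package's implementation: `summarize` is a hypothetical helper that uses NumPy's NaN-aware reductions and assumes the array holds at least one non-NaN value.

```python
import numpy as np


def summarize(array):
    """Return min, mean, max, and NaN proportion of a numeric array.

    Hypothetical helper for illustration; not part of check-distribution.
    """
    values = np.asarray(array, dtype=float)
    return {
        "min": float(np.nanmin(values)),    # NaN-aware minimum
        "mean": float(np.nanmean(values)),  # mean of the non-NaN values
        "max": float(np.nanmax(values)),    # NaN-aware maximum
        "nan": float(np.mean(np.isnan(values))),  # fraction of NaN cells
    }


sample = np.array([[1.0, 2.0], [np.nan, 6.0]])
stats = summarize(sample)
print(
    f"min: {stats['min']:.3f} mean: {stats['mean']:.3f} "
    f"max: {stats['max']:.3f} nan: {stats['nan']:.2%}"
)
# min: 1.000 mean: 3.000 max: 6.000 nan: 25.00%
```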
Working with Categorical Data
For arrays with fewer than 10 unique values, the function displays counts for each value:
# Land cover classification
land_cover = np.array([
    [1, 1, 2, 3],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
    [3, 3, 3, 0]
])
check_distribution(land_cover, "land_cover")
Output:
variable land_cover (int64) with 4 unique values
* 0: 1
* 1: 3
* 2: 5
* 3: 7
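The counts in that listing can be cross-checked with NumPy alone. This is a minimal sketch that makes no assumption about how the package computes them internally; `value_counts` is a name chosen for illustration.

```python
import numpy as np

land_cover = np.array([
    [1, 1, 2, 3],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
    [3, 3, 3, 0],
])

# np.unique returns the sorted distinct values and, on request, their counts
values, counts = np.unique(land_cover, return_counts=True)
value_counts = {int(v): int(c) for v, c in zip(values, counts)}

for value, count in value_counts.items():
    print(f"* {value}: {count}")
# * 0: 1
# * 1: 3
# * 2: 5
# * 3: 7
```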
Adding Temporal Context
Track variable distributions over time:
from datetime import date
# Daily temperature data
for day in range(1, 8):
    temp = np.random.normal(20 + day, 3, (100, 100))
    check_distribution(
        temp,
        "temperature",
        date_UTC=date(2025, 1, day)
    )
Output:
variable temperature on 2025-01-01 min: 15.234 mean: 21.123 max: 27.876 nan: 0.00% (nan)
variable temperature on 2025-01-02 min: 16.123 mean: 22.456 max: 28.234 nan: 0.00% (nan)
...
Adding Spatial Context
Include location information for spatial data:
# Process multiple tiles
tiles = ["h08v05", "h09v05", "h08v06"]
for tile in tiles:
    ndvi = np.random.rand(2400, 2400)
    check_distribution(
        ndvi,
        "NDVI",
        date_UTC=date(2025, 6, 15),
        target=tile
    )
Output:
variable NDVI on 2025-06-15 at h08v05 min: 0.001 mean: 0.512 max: 0.999 nan: 0.00% (nan)
variable NDVI on 2025-06-15 at h09v05 min: 0.003 mean: 0.498 max: 0.997 nan: 2.34% (nan)
...
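The "on <date> at <target>" part of these log lines follows a simple pattern. The sketch below imitates it with a hypothetical `log_prefix` helper; it is inferred from the sample output above and is not taken from the package's source.

```python
from datetime import date


def log_prefix(variable, date_UTC=None, target=None):
    """Compose the leading part of a log line (hypothetical helper)."""
    parts = [f"variable {variable}"]
    if date_UTC is not None:
        # date objects format as ISO 8601; strings pass through unchanged
        parts.append(f"on {date_UTC}")
    if target is not None:
        parts.append(f"at {target}")
    return " ".join(parts)


prefix = log_prefix("NDVI", date_UTC=date(2025, 6, 15), target="h08v05")
print(prefix)  # variable NDVI on 2025-06-15 at h08v05
```

Because the date is only interpolated into the string, a `date` object and an ISO-formatted string such as "2025-06-15" give the same prefix in this sketch.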
Working with Raster Objects
The function natively supports Raster objects from the rasters package:
from rasters import Raster
# Load a raster
dem = Raster("elevation.tif")
# Check distribution (automatically uses raster's nodata value)
check_distribution(dem, "elevation")
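For plain NumPy arrays there is no nodata attribute to consult, so a sentinel value has to be converted to NaN before the statistics are meaningful. The sketch below shows that general idea only; `mask_nodata` and the sentinel -9999 are assumptions for illustration, and the rasters package is not used.

```python
import numpy as np


def mask_nodata(array, nodata):
    """Return a float copy of `array` with `nodata` cells set to NaN.

    Hypothetical helper; not part of check-distribution or rasters.
    """
    values = np.array(array, dtype=float)  # np.array copies by default
    values[values == nodata] = np.nan
    return values


elevation = np.array([[120.0, -9999.0], [135.0, 150.0]])
masked = mask_nodata(elevation, nodata=-9999.0)

print(np.nanmean(masked))         # 135.0
print(np.mean(np.isnan(masked)))  # 0.25
```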
Detecting Blank Outputs
By default, the function raises BlankOutputError if an array is completely NaN:
from check_distribution import check_distribution, BlankOutputError
# This will raise an error
try:
    blank_data = np.full((100, 100), np.nan)
    check_distribution(blank_data, "missing_data")
except BlankOutputError as e:
    print(f"Error: {e}")
Output:
Error: variable missing_data is a blank image
To allow blank outputs:
blank_data = np.full((100, 100), np.nan)
check_distribution(blank_data, "missing_data", allow_blank=True)
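The blank check described in this section amounts to testing whether every element is NaN. The sketch below reproduces that logic with a stand-in exception so it runs without the package installed; `BlankImageError` and `ensure_not_blank` are hypothetical names, not the package's API.

```python
import numpy as np


class BlankImageError(ValueError):
    """Stand-in for the package's BlankOutputError (hypothetical)."""


def ensure_not_blank(array, variable, allow_blank=False):
    """Raise if every element is NaN, unless blanks are allowed.

    Returns True when the array is blank and False otherwise.
    """
    is_blank = bool(np.all(np.isnan(np.asarray(array, dtype=float))))
    if is_blank and not allow_blank:
        raise BlankImageError(f"variable {variable} is a blank image")
    return is_blank


blank = np.full((3, 3), np.nan)
print(ensure_not_blank(blank, "missing_data", allow_blank=True))  # True

try:
    ensure_not_blank(blank, "missing_data")
except BlankImageError as error:
    message = str(error)
    print(message)  # variable missing_data is a blank image
```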
Handling Arrays with NaN Values
The function automatically detects and reports NaN proportions:
# Create data with NaN values
data = np.random.rand(100, 100)
data[data < 0.2] = np.nan
check_distribution(data, "partial_data")
Output:
variable partial_data min: 0.201 mean: 0.612 max: 0.999 nan: 20.15% (nan)
High NaN proportions (>50%) are highlighted in yellow, and 100% NaN triggers a red warning.
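The two thresholds can be written as a small decision function. This is a sketch of the rule as documented here, not the package's code; `nan_highlight` is a hypothetical helper and the color names are plain strings rather than terminal color codes.

```python
import numpy as np


def nan_highlight(array):
    """Map the NaN proportion to a highlight color (hypothetical helper)."""
    proportion = float(np.mean(np.isnan(np.asarray(array, dtype=float))))
    if proportion == 1.0:
        return proportion, "red"     # completely NaN
    if proportion > 0.5:
        return proportion, "yellow"  # more than half NaN
    return proportion, "none"


mostly_missing = np.array([np.nan, np.nan, np.nan, 1.0])
print(nan_highlight(mostly_missing))        # (0.75, 'yellow')
print(nan_highlight(np.full(4, np.nan)))    # (1.0, 'red')
print(nan_highlight(np.array([1.0, 2.0])))  # (0.0, 'none')
```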
Detecting All-Zero Arrays
Arrays with all zeros are automatically flagged:
zeros = np.zeros((100, 100))
check_distribution(zeros, "empty_result", allow_blank=True)
Output (warning):
variable empty_result all zeros min: 0.000 mean: 0.000 max: 0.000 nan: 0.00% (nan)
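An all-zero array is a different condition from a blank (all-NaN) one: every value is present, but none carries signal. The following sketch shows one way to test for it with NumPy; `is_all_zeros` is a hypothetical helper, not a function exported by the package.

```python
import numpy as np


def is_all_zeros(array):
    """True when the array is non-empty and every element equals zero."""
    values = np.asarray(array)
    return values.size > 0 and bool(np.all(values == 0))


zeros = np.zeros((100, 100))
print(is_all_zeros(zeros))  # True

# A single NaN is enough to fail the test, because NaN == 0 is False
with_nan = zeros.copy()
with_nan[0, 0] = np.nan
print(is_all_zeros(with_nan))  # False
```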
Complete Example: Processing Pipeline
import numpy as np
from datetime import date
from check_distribution import check_distribution, BlankOutputError
def process_satellite_image(tile_id, acquisition_date):
    """Example processing pipeline with distribution checks."""
    # Load raw data (simulated here with random digital numbers)
    raw_dn = np.random.randint(0, 16384, (2400, 2400))
    check_distribution(raw_dn, "raw_DN", acquisition_date, tile_id)
    # Convert to reflectance
    reflectance = raw_dn * 0.0001
    check_distribution(reflectance, "reflectance", acquisition_date, tile_id)
    # Calculate NDVI (the small constant guards against division by zero)
    nir = reflectance * np.random.uniform(0.8, 1.2, reflectance.shape)
    red = reflectance * np.random.uniform(0.1, 0.3, reflectance.shape)
    ndvi = (nir - red) / (nir + red + 1e-10)
    try:
        check_distribution(ndvi, "NDVI", acquisition_date, tile_id)
    except BlankOutputError:
        print(f"Warning: NDVI calculation failed for {tile_id}")
        return None
    # Apply cloud mask (pixels where the mask equals 1 are set to NaN)
    cloud_mask = np.random.choice([0, 1], reflectance.shape, p=[0.15, 0.85])
    check_distribution(cloud_mask, "cloud_mask", acquisition_date, tile_id)
    ndvi_masked = ndvi.copy()
    ndvi_masked[cloud_mask == 1] = np.nan
    check_distribution(ndvi_masked, "NDVI_masked", acquisition_date, tile_id)
    return ndvi_masked
# Run pipeline
result = process_satellite_image("h08v05", date(2025, 6, 15))
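It helps to know what the final check should report before reading the log. In the pipeline above, pixels where cloud_mask equals 1 are set to NaN, and the mask is drawn with p=[0.15, 0.85], so roughly 85% of NDVI_masked is expected to be NaN. The sketch below confirms that with NumPy alone, using a smaller array and a seeded generator (both choices are assumptions made for a quick, repeatable run).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable
shape = (200, 200)

# Stand-in NDVI values; only the masking step matters here
ndvi = rng.uniform(-1.0, 1.0, shape)
cloud_mask = rng.choice([0, 1], size=shape, p=[0.15, 0.85])

ndvi_masked = ndvi.copy()
ndvi_masked[cloud_mask == 1] = np.nan

nan_fraction = float(np.mean(np.isnan(ndvi_masked)))
mask_fraction = float(np.mean(cloud_mask == 1))
print(f"masked pixels: {mask_fraction:.2%}, NaN pixels: {nan_fraction:.2%}")
```

The two fractions are identical by construction. If the intent were about 15% cloud cover instead, swapping the probabilities to p=[0.85, 0.15] would mask about 15% of the pixels.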
API Reference
check_distribution()
check_distribution(
    image: Union[Raster, np.ndarray],
    variable: str,
    date_UTC: Union[date, str] = None,
    target: str = None,
    allow_blank: bool = False
)
Parameters:
- image (Raster or np.ndarray): The array or raster to analyze
- variable (str): Name of the variable for logging purposes
- date_UTC (date or str, optional): Date associated with the data
- target (str, optional): Location or tile identifier
- allow_blank (bool, optional): If False (default), raises BlankOutputError for completely NaN arrays
Raises:
- BlankOutputError: When the array is completely NaN and allow_blank=False
Behavior:
- < 10 unique values: Lists each unique value with its count
- ≥ 10 unique values: Shows min, mean, max, and NaN percentage
- All zeros: Logs a warning
- Negative minimums: Highlighted in red
- Non-positive maximums: Highlighted in red
- High NaN proportion (>50%): Highlighted in yellow
- Complete NaN (100%): Highlighted in red; raises BlankOutputError unless allow_blank=True
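The first two rules and the two red-highlight rules on minimum and maximum can be expressed compactly. The sketch below is an illustration of the documented behavior under stated assumptions, not the package's code: `report_mode` and `red_flags` are hypothetical helpers, and the examples avoid NaN because how the package counts NaN among unique values is not stated here.

```python
import numpy as np


def report_mode(array, max_categories=10):
    """Pick 'counts' for fewer than `max_categories` unique values."""
    unique_count = len(np.unique(np.asarray(array)))
    return "counts" if unique_count < max_categories else "statistics"


def red_flags(array):
    """List the documented red-highlight conditions that apply."""
    values = np.asarray(array, dtype=float)
    flags = []
    if np.nanmin(values) < 0:
        flags.append("negative minimum")
    if np.nanmax(values) <= 0:
        flags.append("non-positive maximum")
    return flags


print(report_mode(np.arange(9)))         # counts
print(report_mode(np.arange(10)))        # statistics
print(red_flags(np.array([-2.0, 3.0])))  # ['negative minimum']
print(red_flags(np.array([-2.0, -1.0])))
# ['negative minimum', 'non-positive maximum']
```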
BlankOutputError
Exception raised when an array is completely blank (all NaN values).
from check_distribution import BlankOutputError
Dependencies
- numpy: Array operations and statistics
- colored-logging: Colored console output
- rasters: Raster data support
Use Cases
Scientific Computing
- Monitor intermediate variables in physics simulations
- Validate numerical solver outputs
- Track convergence in iterative algorithms
Geospatial Processing
- Verify satellite image calibration
- Monitor vegetation indices calculation
- Validate atmospheric correction results
- Track data quality across tiles and dates
Machine Learning
- Inspect feature distributions during preprocessing
- Monitor activation values in neural networks
- Validate data augmentation pipelines
Data Quality Control
- Detect data loading errors
- Identify processing failures
- Monitor data completeness over time
Development
Running Tests
# Run all tests
make test
# Run with coverage
pytest --cov=check_distribution tests/
# Run specific test file
pytest tests/test_check_distribution.py
Building
# Build distribution
make build
# Install locally
pip install -e .
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Apache License 2.0
Citation
If you use this software in your research, please cite:
@software{halverson2025checkdistribution,
author = {Halverson, Gregory H.},
title = {check-distribution: A Python utility for analyzing array distributions},
year = {2025},
url = {https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution}
}
Related Projects
- rasters: Geospatial raster processing
- colored-logging: Enhanced logging with colors