A utility for checking the distribution of values in intermediate variables during model runs.

check-distribution

A Python utility for analyzing and logging the statistical distribution of values in arrays and raster images during model runs. Particularly useful for debugging geospatial data processing pipelines and monitoring intermediate variables in scientific computing workflows.

Author

Gregory H. Halverson (they/them)
gregory.h.halverson@jpl.nasa.gov
NASA Jet Propulsion Laboratory 329G

Features

  • 📊 Statistical Analysis: Automatically computes min, max, mean, and NaN proportion
  • 🎨 Color-Coded Logging: Uses colored output to highlight warnings and important values
  • 🗺️ Raster Support: Native support for Raster objects from the rasters package
  • ⚠️ Blank Detection: Optionally raises errors when outputs are completely blank
  • 📈 Smart Output: Detailed distribution for categorical data, statistics for continuous data
  • 🕒 Temporal Tracking: Optional date and location parameters for tracking values over time

Installation

From PyPI

pip install check-distribution

From Source

git clone https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution.git
cd check-distribution
pip install -e .

Development Installation

pip install -e ".[dev]"

Quick Start

import numpy as np
from check_distribution import check_distribution

# Analyze a simple array
data = np.random.rand(100, 100)
check_distribution(data, "temperature")

Detailed Usage

Basic Usage

The check_distribution function analyzes array or raster data and logs statistical information:

from check_distribution import check_distribution
import numpy as np

# Create sample data
temperature = np.random.normal(25, 5, (100, 100))

# Basic check
check_distribution(temperature, "temperature")

Output:

variable temperature min: 10.234 mean: 25.123 max: 39.876 nan: 0.00% (nan)

Working with Categorical Data

For arrays with fewer than 10 unique values, the function displays counts for each value:

# Land cover classification
land_cover = np.array([
    [1, 1, 2, 3],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
    [3, 3, 3, 0]
])

check_distribution(land_cover, "land_cover")

Output:

variable land_cover (int64) with 4 unique values
* 0: 1
* 1: 3
* 2: 5
* 3: 7

Adding Temporal Context

Track variable distributions over time:

from datetime import date

# Daily temperature data
for day in range(1, 8):
    temp = np.random.normal(20 + day, 3, (100, 100))
    check_distribution(
        temp, 
        "temperature",
        date_UTC=date(2025, 1, day)
    )

Output:

variable temperature on 2025-01-01 min: 15.234 mean: 21.123 max: 27.876 nan: 0.00% (nan)
variable temperature on 2025-01-02 min: 16.123 mean: 22.456 max: 28.234 nan: 0.00% (nan)
...

Adding Spatial Context

Include location information for spatial data:

# Process multiple tiles
tiles = ["h08v05", "h09v05", "h08v06"]

for tile in tiles:
    ndvi = np.random.rand(2400, 2400)
    check_distribution(
        ndvi,
        "NDVI",
        date_UTC=date(2025, 6, 15),
        target=tile
    )

Output:

variable NDVI on 2025-06-15 at h08v05 min: 0.001 mean: 0.512 max: 0.999 nan: 0.00% (nan)
variable NDVI on 2025-06-15 at h09v05 min: 0.003 mean: 0.498 max: 0.997 nan: 2.34% (nan)
...

Working with Raster Objects

The function natively supports Raster objects from the rasters package:

from rasters import Raster

# Load a raster
dem = Raster("elevation.tif")

# Check distribution (automatically uses raster's nodata value)
check_distribution(dem, "elevation")

Detecting Blank Outputs

By default, the function raises BlankOutputError if an array is completely NaN:

from check_distribution import check_distribution, BlankOutputError

# This will raise an error
try:
    blank_data = np.full((100, 100), np.nan)
    check_distribution(blank_data, "missing_data")
except BlankOutputError as e:
    print(f"Error: {e}")

Output:

Error: variable missing_data is a blank image

To allow blank outputs:

blank_data = np.full((100, 100), np.nan)
check_distribution(blank_data, "missing_data", allow_blank=True)

Handling Arrays with NaN Values

The function automatically detects and reports NaN proportions:

# Create data with NaN values
data = np.random.rand(100, 100)
data[data < 0.2] = np.nan

check_distribution(data, "partial_data")

Output:

variable partial_data min: 0.201 mean: 0.612 max: 0.999 nan: 20.15% (nan)

High NaN proportions (>50%) are highlighted in yellow, and 100% NaN triggers a red warning.
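The proportion being compared against these thresholds is the same one reported in the log line. A minimal sketch of that check, computed directly with NumPy (not the package's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 100))
data[data < 0.6] = np.nan  # roughly 60% of pixels become NaN

# fraction of NaN pixels, as reported in the "nan: XX.XX%" field
nan_proportion = np.count_nonzero(np.isnan(data)) / data.size
print(f"nan: {nan_proportion:.2%}")  # above 50%, so this log line would be yellow
```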

Detecting All-Zero Arrays

Arrays with all zeros are automatically flagged:

zeros = np.zeros((100, 100))
check_distribution(zeros, "empty_result", allow_blank=True)

Output (warning):

variable empty_result all zeros min: 0.000 mean: 0.000 max: 0.000 nan: 0.00% (nan)

Complete Example: Processing Pipeline

import numpy as np
from datetime import date
from check_distribution import check_distribution, BlankOutputError

def process_satellite_image(tile_id, acquisition_date):
    """Example processing pipeline with distribution checks."""
    
    # Load raw data
    raw_dn = np.random.randint(0, 16384, (2400, 2400))
    check_distribution(raw_dn, "raw_DN", acquisition_date, tile_id)
    
    # Convert to reflectance
    reflectance = raw_dn * 0.0001
    check_distribution(reflectance, "reflectance", acquisition_date, tile_id)
    
    # Calculate NDVI
    nir = reflectance * np.random.uniform(0.8, 1.2, reflectance.shape)
    red = reflectance * np.random.uniform(0.1, 0.3, reflectance.shape)
    ndvi = (nir - red) / (nir + red + 1e-10)
    
    try:
        check_distribution(ndvi, "NDVI", acquisition_date, tile_id)
    except BlankOutputError:
        print(f"Warning: NDVI calculation failed for {tile_id}")
        return None
    
    # Apply cloud mask (1 = cloudy, ~15% of pixels)
    cloud_mask = np.random.choice([0, 1], reflectance.shape, p=[0.85, 0.15])
    check_distribution(cloud_mask, "cloud_mask", acquisition_date, tile_id)
    
    ndvi_masked = ndvi.copy()
    ndvi_masked[cloud_mask == 1] = np.nan
    check_distribution(ndvi_masked, "NDVI_masked", acquisition_date, tile_id)
    
    return ndvi_masked

# Run pipeline
result = process_satellite_image("h08v05", date(2025, 6, 15))

API Reference

check_distribution()

check_distribution(
    image: Union[Raster, np.ndarray],
    variable: str,
    date_UTC: Union[date, str] = None,
    target: str = None,
    allow_blank: bool = False
)

Parameters:

  • image (Raster or np.ndarray): The array or raster to analyze
  • variable (str): Name of the variable for logging purposes
  • date_UTC (date or str, optional): Date associated with the data
  • target (str, optional): Location or tile identifier
  • allow_blank (bool, optional): If False (default), raises BlankOutputError for completely NaN arrays

Raises:

  • BlankOutputError: When the array is completely NaN and allow_blank=False

Behavior:

  • < 10 unique values: Lists each unique value with its count
  • ≥ 10 unique values: Shows min, mean, max, and NaN percentage
  • All zeros: Logs a warning
  • Negative minimums: Highlighted in red
  • Non-positive maximums: Highlighted in red
  • High NaN proportion (>50%): Highlighted in yellow
  • Complete NaN (100%): Highlighted in red and may raise error
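The categorical/continuous branching above can be sketched in plain NumPy. This is an illustrative approximation of the behavior, not the package's actual implementation; `summarize` is a hypothetical helper introduced here:

```python
import numpy as np

def summarize(image: np.ndarray) -> str:
    """Approximate the branching behavior described above (illustrative only)."""
    # drop NaN before counting unique values in floating-point arrays
    if np.issubdtype(image.dtype, np.floating):
        values = image[~np.isnan(image)]
    else:
        values = image.ravel()
    unique, counts = np.unique(values, return_counts=True)
    if unique.size < 10:
        # categorical branch: one line per unique value with its count
        return "\n".join(f"* {v}: {c}" for v, c in zip(unique, counts))
    # continuous branch: summary statistics plus NaN percentage
    nan_fraction = np.count_nonzero(np.isnan(image)) / image.size
    return (f"min: {np.nanmin(image):.3f} mean: {np.nanmean(image):.3f} "
            f"max: {np.nanmax(image):.3f} nan: {nan_fraction:.2%}")
```

Applied to the land-cover array from the earlier example, this prints one count line per class; applied to a continuous field, it prints the statistics line.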

BlankOutputError

Exception raised when an array is completely blank (all NaN values).

from check_distribution import BlankOutputError

Dependencies

  • numpy: Array operations and statistics
  • colored-logging: Colored console output
  • rasters: Raster data support

Use Cases

Scientific Computing

  • Monitor intermediate variables in physics simulations
  • Validate numerical solver outputs
  • Track convergence in iterative algorithms

Geospatial Processing

  • Verify satellite image calibration
  • Monitor vegetation indices calculation
  • Validate atmospheric correction results
  • Track data quality across tiles and dates

Machine Learning

  • Inspect feature distributions during preprocessing
  • Monitor activation values in neural networks
  • Validate data augmentation pipelines

Data Quality Control

  • Detect data loading errors
  • Identify processing failures
  • Monitor data completeness over time

Development

Running Tests

# Run all tests
make test

# Run with coverage
pytest --cov=check_distribution tests/

# Run specific test file
pytest tests/test_check_distribution.py

Building

# Build distribution
make build

# Install locally
pip install -e .

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache License 2.0

Citation

If you use this software in your research, please cite:

@software{halverson2025checkdistribution,
  author = {Halverson, Gregory H.},
  title = {check-distribution: A Python utility for analyzing array distributions},
  year = {2025},
  url = {https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution}
}
