A utility for checking the distribution of values in intermediate variables during model runs.

check-distribution

A Python utility for analyzing and logging the statistical distribution of values in arrays and raster images during model runs. Particularly useful for debugging geospatial data processing pipelines and monitoring intermediate variables in scientific computing workflows.

Author

Gregory H. Halverson (they/them)
gregory.h.halverson@jpl.nasa.gov
NASA Jet Propulsion Laboratory 329G

Features

  • 📊 Statistical Analysis: Automatically computes min, max, mean, and NaN proportion
  • 🎨 Color-Coded Logging: Uses colored output to highlight warnings and important values
  • 🗺️ Raster Support: Native support for Raster objects from the rasters package
  • ⚠️ Blank Detection: Optionally raises errors when outputs are completely blank
  • 📈 Smart Output: Detailed distribution for categorical data, statistics for continuous data
  • 🕒 Temporal Tracking: Optional date and location parameters for tracking values over time

Installation

From PyPI

pip install check-distribution

From Source

git clone https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution.git
cd check-distribution
pip install -e .

Development Installation

pip install -e ".[dev]"

Quick Start

import numpy as np
from check_distribution import check_distribution

# Analyze a simple array
data = np.random.rand(100, 100)
check_distribution(data, "temperature")

Detailed Usage

Basic Usage

The check_distribution function analyzes array or raster data and logs statistical information:

from check_distribution import check_distribution
import numpy as np

# Create sample data
temperature = np.random.normal(25, 5, (100, 100))

# Basic check
check_distribution(temperature, "temperature")

Output:

variable temperature min: 10.234 mean: 25.123 max: 39.876 nan: 0.00% (nan)

Working with Categorical Data

For arrays with fewer than 10 unique values, the function displays counts for each value:

# Land cover classification
land_cover = np.array([
    [1, 1, 2, 3],
    [1, 2, 2, 3],
    [2, 2, 3, 3],
    [3, 3, 3, 0]
])

check_distribution(land_cover, "land_cover")

Output:

variable land_cover (int64) with 4 unique values
* 0: 1
* 1: 3
* 2: 5
* 3: 7

Adding Temporal Context

Track variable distributions over time:

from datetime import date

# Daily temperature data
for day in range(1, 8):
    temp = np.random.normal(20 + day, 3, (100, 100))
    check_distribution(
        temp, 
        "temperature",
        date_UTC=date(2025, 1, day)
    )

Output:

variable temperature on 2025-01-01 min: 15.234 mean: 21.123 max: 27.876 nan: 0.00% (nan)
variable temperature on 2025-01-02 min: 16.123 mean: 22.456 max: 28.234 nan: 0.00% (nan)
...

Adding Spatial Context

Include location information for spatial data:

# Process multiple tiles
tiles = ["h08v05", "h09v05", "h08v06"]

for tile in tiles:
    ndvi = np.random.rand(2400, 2400)
    check_distribution(
        ndvi,
        "NDVI",
        date_UTC=date(2025, 6, 15),
        target=tile
    )

Output:

variable NDVI on 2025-06-15 at h08v05 min: 0.001 mean: 0.512 max: 0.999 nan: 0.00% (nan)
variable NDVI on 2025-06-15 at h09v05 min: 0.003 mean: 0.498 max: 0.997 nan: 2.34% (nan)
...

Working with Raster Objects

The function natively supports Raster objects from the rasters package:

from rasters import Raster

# Load a raster
dem = Raster("elevation.tif")

# Check distribution (automatically uses raster's nodata value)
check_distribution(dem, "elevation")

Detecting Blank Outputs

By default, the function raises BlankOutputError if an array is completely NaN:

from check_distribution import check_distribution, BlankOutputError

# This will raise an error
try:
    blank_data = np.full((100, 100), np.nan)
    check_distribution(blank_data, "missing_data")
except BlankOutputError as e:
    print(f"Error: {e}")

Output:

Error: variable missing_data is a blank image

To allow blank outputs:

blank_data = np.full((100, 100), np.nan)
check_distribution(blank_data, "missing_data", allow_blank=True)

Handling Arrays with NaN Values

The function automatically detects and reports NaN proportions:

# Create data with NaN values
data = np.random.rand(100, 100)
data[data < 0.2] = np.nan

check_distribution(data, "partial_data")

Output:

variable partial_data min: 0.201 mean: 0.612 max: 0.999 nan: 20.15% (nan)

High NaN proportions (>50%) are highlighted in yellow, and 100% NaN triggers a red warning.
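The proportion being compared against these thresholds is the same one reported in the log line. A minimal sketch of that check, computed directly with NumPy (not the package's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 100))
data[data < 0.6] = np.nan  # roughly 60% of pixels become NaN

# fraction of NaN pixels, as reported in the "nan: XX.XX%" field
nan_proportion = np.count_nonzero(np.isnan(data)) / data.size
print(f"nan: {nan_proportion:.2%}")  # above 50%, so this log line would be yellow
```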

Detecting All-Zero Arrays

Arrays with all zeros are automatically flagged:

zeros = np.zeros((100, 100))
check_distribution(zeros, "empty_result", allow_blank=True)

Output (warning):

variable empty_result all zeros min: 0.000 mean: 0.000 max: 0.000 nan: 0.00% (nan)

Complete Example: Processing Pipeline

import numpy as np
from datetime import date
from check_distribution import check_distribution, BlankOutputError

def process_satellite_image(tile_id, acquisition_date):
    """Example processing pipeline with distribution checks."""
    
    # Load raw data
    raw_dn = np.random.randint(0, 16384, (2400, 2400))
    check_distribution(raw_dn, "raw_DN", acquisition_date, tile_id)
    
    # Convert to reflectance
    reflectance = raw_dn * 0.0001
    check_distribution(reflectance, "reflectance", acquisition_date, tile_id)
    
    # Calculate NDVI
    nir = reflectance * np.random.uniform(0.8, 1.2, reflectance.shape)
    red = reflectance * np.random.uniform(0.1, 0.3, reflectance.shape)
    ndvi = (nir - red) / (nir + red + 1e-10)
    
    try:
        check_distribution(ndvi, "NDVI", acquisition_date, tile_id)
    except BlankOutputError:
        print(f"Warning: NDVI calculation failed for {tile_id}")
        return None
    
    # Apply cloud mask (1 = cloudy, ~15% of pixels)
    cloud_mask = np.random.choice([0, 1], reflectance.shape, p=[0.85, 0.15])
    check_distribution(cloud_mask, "cloud_mask", acquisition_date, tile_id)
    
    ndvi_masked = ndvi.copy()
    ndvi_masked[cloud_mask == 1] = np.nan
    check_distribution(ndvi_masked, "NDVI_masked", acquisition_date, tile_id)
    
    return ndvi_masked

# Run pipeline
result = process_satellite_image("h08v05", date(2025, 6, 15))

API Reference

check_distribution()

check_distribution(
    image: Union[Raster, np.ndarray],
    variable: str,
    date_UTC: Union[date, str] = None,
    target: str = None,
    allow_blank: bool = False
)

Parameters:

  • image (Raster or np.ndarray): The array or raster to analyze
  • variable (str): Name of the variable for logging purposes
  • date_UTC (date or str, optional): Date associated with the data
  • target (str, optional): Location or tile identifier
  • allow_blank (bool, optional): If False (default), raises BlankOutputError for completely NaN arrays

Raises:

  • BlankOutputError: When the array is completely NaN and allow_blank=False

Behavior:

  • < 10 unique values: Lists each unique value with its count
  • ≥ 10 unique values: Shows min, mean, max, and NaN percentage
  • All zeros: Logs a warning
  • Negative minimums: Highlighted in red
  • Non-positive maximums: Highlighted in red
  • High NaN proportion (>50%): Highlighted in yellow
  • Complete NaN (100%): Highlighted in red and may raise error
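The categorical/continuous branching above can be sketched in plain NumPy. This is an illustrative approximation of the behavior, not the package's actual implementation; `summarize` is a hypothetical helper introduced here:

```python
import numpy as np

def summarize(image: np.ndarray) -> str:
    """Approximate the branching behavior described above (illustrative only)."""
    # drop NaN before counting unique values in floating-point arrays
    if np.issubdtype(image.dtype, np.floating):
        values = image[~np.isnan(image)]
    else:
        values = image.ravel()
    unique, counts = np.unique(values, return_counts=True)
    if unique.size < 10:
        # categorical branch: one line per unique value with its count
        return "\n".join(f"* {v}: {c}" for v, c in zip(unique, counts))
    # continuous branch: summary statistics plus NaN percentage
    nan_fraction = np.count_nonzero(np.isnan(image)) / image.size
    return (f"min: {np.nanmin(image):.3f} mean: {np.nanmean(image):.3f} "
            f"max: {np.nanmax(image):.3f} nan: {nan_fraction:.2%}")
```

Applied to the land-cover array from the earlier example, this prints one count line per class; applied to a continuous field, it prints the statistics line.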

BlankOutputError

Exception raised when an array is completely blank (all NaN values).

from check_distribution import BlankOutputError

Dependencies

  • numpy: Array operations and statistics
  • colored-logging: Colored console output
  • rasters: Raster data support

Use Cases

Scientific Computing

  • Monitor intermediate variables in physics simulations
  • Validate numerical solver outputs
  • Track convergence in iterative algorithms

Geospatial Processing

  • Verify satellite image calibration
  • Monitor vegetation indices calculation
  • Validate atmospheric correction results
  • Track data quality across tiles and dates

Machine Learning

  • Inspect feature distributions during preprocessing
  • Monitor activation values in neural networks
  • Validate data augmentation pipelines

Data Quality Control

  • Detect data loading errors
  • Identify processing failures
  • Monitor data completeness over time

Development

Running Tests

# Run all tests
make test

# Run with coverage
pytest --cov=check_distribution tests/

# Run specific test file
pytest tests/test_check_distribution.py

Building

# Build distribution
make build

# Install locally
pip install -e .

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache License 2.0

Citation

If you use this software in your research, please cite:

@software{halverson2025checkdistribution,
  author = {Halverson, Gregory H.},
  title = {check-distribution: A Python utility for analyzing array distributions},
  year = {2025},
  url = {https://github.com/JPL-Evapotranspiration-Algorithms/check-distribution}
}
