
Nonparametric distributional regression using LightGBM


DistributionRegressor

Nonparametric distributional regression using LightGBM. Predicts full probability distributions p(y|x) instead of just point estimates.

Documentation | PyPI | Examples

Overview

DistributionRegressor provides a robust way to predict complete probability distributions over continuous targets. Unlike standard regression that outputs a single value, this package allows you to:

  • Predict full probability distributions (arbitrary shapes: multimodal, skewed, etc.)
  • Quantify uncertainty with prediction intervals read directly off the distribution
  • Obtain point predictions (mean, mode/peak, quantiles)

It uses a CDF-based approach:

  1. Discretizes the target space into a grid of threshold points.
  2. Learns the conditional CDF F(τ|x) = P(Y ≤ τ | X = x) using binary targets and logistic loss.
  3. Enforces monotonicity via LightGBM's monotone constraints on the threshold feature.
  4. Recovers the PMF by differencing the predicted CDF.

This approach is fast and stable, and requires minimal tuning.
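The CDF-to-PMF recovery in step 4 can be sketched in a few lines of NumPy (the CDF values below are made up for illustration; they are not library output):

```python
import numpy as np

# Hypothetical values of F(tau_j | x) on a 6-point threshold grid for one sample.
cdf = np.array([0.0, 0.1, 0.35, 0.7, 0.9, 1.0])

# The monotone constraint keeps the CDF non-decreasing, so differencing
# yields non-negative probability mass per bin that sums to F(tau_max).
pmf = np.diff(cdf, prepend=0.0)
```

Because the monotone constraint guarantees `cdf` never decreases, `pmf` is a valid probability mass function without any post-hoc clipping.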

Installation

pip install distribution-regressor

Quick Start

import numpy as np
from distribution_regressor import DistributionRegressor

# 1. Initialize
model = DistributionRegressor(
    n_bins=50,              # Resolution of the distribution grid
    n_estimators=100,       # Number of boosting trees
)

# 2. Train
# X: (n_samples, n_features), y: (n_samples,)
model.fit(X_train, y_train)

# 3. Predict Points
y_mean = model.predict(X_test)               # Mean (Expected Value)
y_mode = model.predict_mode(X_test)          # Mode (Most likely value / Peak)
y_median = model.predict_quantile(X_test, 0.5)

# 4. Predict Intervals & Uncertainty
# 10th and 90th percentiles (an 80% prediction interval)
lower = model.predict_quantile(X_test, 0.1)
upper = model.predict_quantile(X_test, 0.9)

# 5. Predict Full Distribution
grids, dists, offsets = model.predict_distribution(X_test)
# grids: (n_samples, n_bins) - Per-sample grid points
# dists: (n_samples, n_bins) - Probability mass for each sample
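To make the point predictions above concrete, here is a minimal sketch of how a mean, mode, and quantile can be read off a single (grid, PMF) pair. The arrays are synthetic stand-ins, not library output, and this mirrors rather than reproduces what `predict`, `predict_mode`, and `predict_quantile` compute internally:

```python
import numpy as np

# Synthetic grid and PMF for one sample (illustrative values only).
grid = np.linspace(-2.0, 2.0, 9)
pmf = np.array([0.02, 0.08, 0.15, 0.25, 0.2, 0.15, 0.1, 0.04, 0.01])
pmf = pmf / pmf.sum()                     # normalize to a valid PMF

mean = (grid * pmf).sum()                 # expected value
mode = grid[np.argmax(pmf)]               # peak of the distribution
cdf = np.cumsum(pmf)
median = grid[np.searchsorted(cdf, 0.5)]  # first grid point with CDF >= 0.5
```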

Key Parameters

DistributionRegressor(
    n_bins=50,              # Number of grid points (higher = more resolution, more RAM)
    use_base_model=False,   # If True, learns residual CDF around a base LGBM prediction
    monte_carlo_training=False,  # If True, sample grid points instead of full expansion
    mc_samples=5,           # MC sample points per observation (when MC enabled)
    mc_resample_freq=100,   # Resample grid points every N trees (lower = better coverage)
    n_estimators=100,       # LightGBM trees
    learning_rate=0.1,      # Learning rate
    random_state=42,        # Seed
    **kwargs                # Passed to LGBMRegressor (e.g., max_depth, num_leaves)
)
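To see why the Monte Carlo options exist, a back-of-the-envelope comparison of training-set size under full grid expansion versus MC threshold sampling (the sample count is illustrative, not a library default):

```python
# Full expansion grows the training set by a factor of n_bins;
# monte_carlo_training uses only mc_samples thresholds per observation.
n_samples, n_bins, mc_samples = 100_000, 50, 5

full_rows = n_samples * n_bins    # rows with full expansion
mc_rows = n_samples * mc_samples  # rows with Monte Carlo sampling
print(full_rows, mc_rows)         # -> 5000000 500000
```

Resampling the thresholds every `mc_resample_freq` trees lets the boosting run still cover the whole grid over time, which is why a lower resample frequency improves coverage.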

How It Works

The model learns the conditional CDF using binary classification:

  1. Grid Creation: A grid of n_bins threshold points is created covering the range of y.
  2. Binary Targets: For each training sample (x_i, y_i) and threshold τ_j, the target is z_ij = 1{y_i ≤ τ_j} — simply whether y_i falls below the threshold.
  3. Single Model: A single LightGBM model is trained with cross-entropy loss on (x_i, τ_j) → z_ij, with a monotone increasing constraint on τ_j to ensure a valid CDF.
  4. Prediction: At inference, the model predicts F(τ|x) for all grid points, then differences the CDF to recover the probability mass function.
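Steps 1 and 2 amount to a dataset expansion: each training row is repeated once per threshold, with the threshold appended as an extra feature and a binary label attached. A self-contained sketch of that expansion (synthetic data; shapes and variable names are illustrative, not the package's internals):

```python
import numpy as np

# Synthetic training data: 4 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = rng.normal(size=4)

# Step 1: threshold grid covering the range of y.
n_bins = 5
taus = np.linspace(y.min(), y.max(), n_bins)

# Step 2: repeat each sample once per threshold, append tau as a feature.
X_rep = np.repeat(X, n_bins, axis=0)      # (4*5, 3)
tau_col = np.tile(taus, len(y))[:, None]  # (4*5, 1)
X_expanded = np.hstack([X_rep, tau_col])  # (4*5, 4)

# Binary target: z_ij = 1 if y_i <= tau_j, else 0.
z = (np.repeat(y, n_bins) <= tau_col.ravel()).astype(int)
```

A single classifier trained on `X_expanded -> z` with a monotone increasing constraint on the last column then yields the CDF estimate of step 3; note that for each sample the labels are non-decreasing in `tau`, which is exactly the structure the constraint enforces.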

Example Visualization

import matplotlib.pyplot as plt

# Predict distribution for a single sample
grids, dists, offsets = model.predict_distribution(X_test[0:1])

plt.plot(grids[0], dists[0], label='Predicted PMF')
plt.axvline(y_test[0], color='r', linestyle='--', label='True Value')
plt.legend()
plt.show()

Citation

@software{distributionregressor2025,
  title={DistributionRegressor: Nonparametric Distributional Regression},
  author={Gabor Gulyas},
  year={2025},
  url={https://github.com/guyko81/DistributionRegressor}
}

License

MIT License
