
A Python library for Imbalanced Regression with SMOGN, stratified CV, and utility-based metrics.


imbreg


imbreg is a Python library for imbalanced regression. It handles datasets with missing values, applies synthetic over-sampling with SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise), evaluates predictive models with utility-based metrics, and manages stratified cross-validation partitioning.


Key Features

  • SMOGN Resampling: Generates synthetic examples for extreme minority values in continuous domains using the SMOGN strategy (a combination of SmoteR interpolation and GaussNoise perturbation).
  • Stratified Partitioning: Implements stratified cross-validation (CV) partitioning so that extreme values are evenly distributed across folds.
  • Robust Data Imputation: Integrates scikit-learn's IterativeImputer in a way that prevents data leakage between training and test partitions.
  • Utility-based Metrics: Computes metrics designed for imbalanced regression:
    • Utility-based F1-Score ($\beta$-measure).
    • SERA (Squared Error Relevance Area).
  • Dataset Loading (KEEL/CSV/ARFF): A data loader that infers categorical variables, caps decimals, maps ranges, and cleans noisy values automatically.
  • Data Visualization: Built-in 2D and 3D plotting modules (using Plotly and Seaborn) for analyzing the relevance of the target variable and the impact of noise on its distribution.
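
To make the SMOGN idea from the feature list concrete, here is a minimal standard-library sketch of how one synthetic case could be generated. It is not imbreg's implementation: the function name `smogn_like_sample`, its parameters, and the simplifications (numeric features only, a single noise scale instead of per-feature standard deviations) are assumptions made for illustration. A neighbour closer than half the median neighbour distance is treated as "safe" and interpolated with (SmoteR); otherwise the seed is perturbed with Gaussian noise.

```python
import math
import random
import statistics


def smogn_like_sample(seed, neighbours, rng, pert=0.02):
    """Create one synthetic case from a rare seed case.

    seed and each neighbour are (features, target) pairs, where features is
    a list of floats. Returns (features, target, branch_name).
    Illustrative only; hypothetical helper, not part of imbreg.
    """
    x_seed, y_seed = seed
    dists = [math.dist(x_seed, x_nb) for x_nb, _ in neighbours]
    # Neighbours closer than half the median distance count as "safe".
    max_dist = statistics.median(dists) / 2
    idx = rng.randrange(len(neighbours))
    x_nb, y_nb = neighbours[idx]

    if dists[idx] < max_dist:
        # SmoteR branch: interpolate between the seed and the neighbour.
        frac = rng.random()
        x_new = [a + frac * (b - a) for a, b in zip(x_seed, x_nb)]
        d_seed = math.dist(x_new, x_seed)
        d_nb = math.dist(x_new, x_nb)
        total = d_seed + d_nb
        # Target: average of both targets, weighted by closeness to each parent.
        y_new = y_seed if total == 0 else (d_nb * y_seed + d_seed * y_nb) / total
        return x_new, y_new, "smoter"

    # GaussNoise branch: the neighbour is too far away, so jitter the seed.
    scale = min(max_dist, pert)
    x_new = [a + rng.gauss(0, scale) for a in x_seed]
    y_new = y_seed + rng.gauss(0, scale)
    return x_new, y_new, "gauss"
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, for example `smogn_like_sample(([0.0, 0.0], 10.0), [([0.1, 0.0], 11.0), ([1.0, 0.0], 12.0), ([1.0, 1.0], 13.0)], random.Random(0))`.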

Requirements and Installation

To use this library, ensure you have Python 3.9 or higher installed. The library is available on PyPI:

pip install imbreg

Quickstart Guide

Examples

Check the examples/ directory in the repository for ready-to-run scripts:

  • quickstart.py: Minimal example of all features using synthetic data.
  • plot_examples.py: How to use the plotting functions.
  • generate_cv_partitions.py & evaluate_models.py: Full cross-validation pipeline.
  • evaluate_external_predictions.py: Utility to evaluate arbitrary external predictions against ground truth using the library's relevance metrics.

Here is a quick snippet of how to use the core functions:

1. Generate Partitions (Cross-Validation)

The cv_partitions function reads your original dataset, cleans it, imputes missing values, and applies SMOGN oversampling automatically in each repetition.

from imbreg import cv_partitions

cv_partitions(
    ds_name="my_dataset.csv",
    ds_location="raw_data/",
    times=1,                 # Number of repetitions
    folds=10,                # Number of partitions (k-fold)
    strat=True,              # Enable stratification
    smogn=True,              # Apply SMOGN during training
    impute=True,             # Impute missing values (NaNs)
    out_dir="Output/"        # Output directory for raw data partitions
)
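
As a rough illustration of what stratification means for a continuous target, the sketch below sorts the samples by target value and deals each consecutive block of `n_folds` samples across the folds in random order. This is one common approach and only an assumption about the general idea; the helper `stratified_regression_folds` is hypothetical and does not reflect how cv_partitions is implemented.

```python
import random


def stratified_regression_folds(y, n_folds, seed=0):
    """Return a fold index for each sample so every fold spans the target range.

    Hypothetical helper for illustration; not part of imbreg.
    """
    rng = random.Random(seed)
    # Indices of the samples, ordered from smallest to largest target.
    order = sorted(range(len(y)), key=lambda i: y[i])
    assignment = [None] * len(y)
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]
        # Each block of neighbouring target values is spread over distinct folds.
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)
        for idx, fold in zip(block, fold_ids):
            assignment[idx] = fold
    return assignment


# Example: 18 ordinary targets plus two extreme ones.
targets = [float(i) for i in range(18)] + [100.0, 120.0]
folds = stratified_regression_folds(targets, n_folds=5)
```

Because the two extreme targets fall in the same sorted block, they always end up in different folds, which is the property plain random k-fold splitting does not guarantee.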

2. Evaluate Predictions

Once the folds have been written to disk, you can train the models automatically and retrieve a results summary containing the SERA and F1 metrics.

from imbreg import evaluate_folds

results = evaluate_folds(
    output_dir="Output/",    # Directory containing the generated folds
    dataset="my_dataset",
    model_type="rf",         # 'rf' (Random Forest), 'et' (Extra Trees), 'xgb' (XGBoost)
    n_reps=1,
    n_folds=10,
    use_imputation=True,
    use_smogn=True,
    thr_rel=0.8              # Relevance threshold to define "rare" cases
)

# You can export these results to a flat structure using the built-in exporter
from imbreg.validation import export_experiment_summaries
export_experiment_summaries(results, output_dir="Results/", dataset_name="my_dataset", flat_output=True)
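
For readers unfamiliar with SERA, the following sketch shows the usual definition: for each relevance cutoff t in [0, 1], sum the squared errors of the cases whose relevance is at least t, then integrate that curve over t. The function name `sera`, the `steps` parameter, and the trapezoidal approximation are assumptions for illustration; relevance values are supplied by the caller, and imbreg's own metric may differ in its details.

```python
def sera(y_true, y_pred, relevance, steps=1000):
    """Approximate the Squared Error Relevance Area with the trapezoidal rule.

    relevance[i] is the relevance of y_true[i], a value in [0, 1].
    Hypothetical helper for illustration; not part of imbreg.
    """
    sq_err = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]

    def ser(cutoff):
        # Squared error restricted to cases at least as relevant as the cutoff.
        return sum(e for e, r in zip(sq_err, relevance) if r >= cutoff)

    values = [ser(k / steps) for k in range(steps + 1)]
    width = 1 / steps
    # Area under the SER curve between cutoff 0 and cutoff 1.
    return sum((a + b) / 2 * width for a, b in zip(values, values[1:]))


# The large error sits on the most relevant case, so it dominates the score.
score = sera([1.0, 2.0, 10.0], [1.5, 2.0, 7.0], [0.0, 0.5, 1.0])
```

Errors on low-relevance cases only contribute for small cutoffs, so they weigh far less than errors on rare cases; lower SERA is better.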

3. Visualize the Data

Analyze the relevance curve of your target variable:

import matplotlib.pyplot as plt
from imbreg import read_dataset, phi_control, plot_target_distribution

# Load dataset and create relevance control structure
df = read_dataset("my_dataset.csv", "raw_data/")
ctrl = phi_control(df["y"].values, method="extremes")

# Visualize distribution vs relevance
fig = plot_target_distribution(df, target_col="y", phi_ctrl=ctrl, thr_rel=0.8)
plt.show()
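
The snippet below is a simplified, standard-library sketch of what a relevance function built from distribution extremes can look like: relevance is 0 at the median and rises linearly to 1 at the boxplot fences. It is an assumption made for illustration, not the behaviour of phi_control, which may use different control points and a smooth interpolation; the names `make_relevance` and `coef` are hypothetical.

```python
import statistics


def make_relevance(y, coef=1.5):
    """Build a piecewise-linear relevance function from boxplot statistics.

    Hypothetical helper for illustration; not part of imbreg.
    """
    q1, med, q3 = statistics.quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - coef * iqr, q3 + coef * iqr

    def phi(value):
        # Pick the fence on the same side of the median as the value.
        bound = high if value >= med else low
        span = abs(bound - med)
        if span == 0:
            return 0.0 if value == med else 1.0
        return min(1.0, abs(value - med) / span)

    return phi


# Values whose relevance reaches the threshold are treated as "rare".
phi = make_relevance([1, 2, 3, 4, 5, 6, 7, 8, 9])
rare = [v for v in (5, 9, 12, 20) if phi(v) >= 0.8]
```

With a threshold such as thr_rel=0.8, only values far from the median count as rare, which is the role the relevance threshold plays in the plots and metrics.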

Project Structure

imbreg/
│
├── data_loader.py    # I/O functions (CSV/KEEL) and imputation wrappers
├── metrics.py        # Mathematical evaluation functions (Utility F1, SERA, Bumps)
├── models.py         # Training and prediction wrappers (RF, ET, XGBoost)
├── plots.py          # Advanced visualizations (Histograms, Scatters, Prediction Error)
├── resampling.py     # Core engine for the SMOGN strategy (SmoteR + GaussNoise)
├── stratification.py # Phi function (relevance) and K-Folds generators
├── utils.py          # Math operations, distance metrics, and internal helpers
└── validation.py     # Cross-validation evaluation pipeline and result export

Datasets

A sample dataset (Datasets/servo/) is included for quick testing. Additional regression datasets can be downloaded from KEEL.

Folder Architecture for Experiments

When running the full pipeline (e.g., examples/evaluate_models.py), the project keeps each kind of output in its own directory:

  • Output/: Stores all heavy, raw data partitions generated by cross-validation and SMOGN.
  • Results/: A flat, clean directory containing only the final .txt and .csv summary metrics.
  • Plots/: Directory where generated visualizations and figures are saved.

Testing

The project includes a suite of unit tests implemented with pytest that covers:

  • Parser resilience against null values and troublesome column formats.
  • Numerical correctness when flattening array dimensions.
  • Robustness against memory leaks and crashes on empty variables.

To run the test suite locally:

python -m pytest tests/ -v

Author: Gabriel Oliveros


