A Python library for Imbalanced Regression with SMOGN, stratified CV, and utility-based metrics.
imbreg
imbreg is a Python library for the imbalanced regression problem. It processes datasets with missing values, applies synthetic over-sampling with SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise), evaluates predictive models with utility-based metrics, and manages stratified cross-validation partitioning.
Key Features
- SMOGN Resampling (DIBS): Generates synthetic examples for extreme minority values in continuous domains using the DIBS strategy (a combination of SmoteR interpolation and GaussNoise perturbation).
- Stratified Partitioning: Implements purely stratified cross-validation (CV) algorithms to ensure that extreme values are evenly distributed across folds.
- Robust Data Imputation: Native integration with iterative algorithms (Scikit-Learn IterativeImputer) that prevents data leakage between training and test partitions.
- Advanced Utility-based Metrics: Precise calculation of specialized metrics for imbalanced regression:
  - Utility-based F1-Score ($\beta$-measure).
  - SERA (Squared Error Relevance Area).
- Dataset Loading (KEEL/CSV/ARFF): A smart data loader that infers categorical variables, caps decimals, maps ranges, and cleans noisy values automatically.
- Data Visualization: Built-in 2D and 3D plotting modules (using Plotly, Seaborn) to visually analyze the relevance of the target variable and the impact of noise/distribution.
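The combination of SmoteR interpolation and Gaussian-noise perturbation named in the first feature can be illustrated with a minimal sketch. It is not imbreg's implementation: the function `synthesize` and its parameters `max_dist` and `pert` are hypothetical names, and the fixed noise scale is a simplification. The idea is that a rare seed case is interpolated with a neighbour when the neighbour is close enough to be trusted, and is jittered with Gaussian noise otherwise.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def synthesize(seed_x, seed_y, nbr_x, nbr_y, max_dist, rng, pert=0.02):
    """Create one synthetic case from a rare seed and one of its neighbours.

    Close neighbour   -> SmoteR-style interpolation between seed and neighbour.
    Distant neighbour -> Gaussian-noise perturbation of the seed alone.
    `max_dist` and `pert` are illustrative parameters, not imbreg's API.
    """
    if euclidean(seed_x, nbr_x) < max_dist:
        # Pick a random point on the segment from the seed to the neighbour.
        t = rng.random()
        new_x = [s + t * (n - s) for s, n in zip(seed_x, nbr_x)]
        # Target: average of both targets, weighted by closeness to the new point.
        d_seed = euclidean(new_x, seed_x)
        d_nbr = euclidean(new_x, nbr_x)
        total = d_seed + d_nbr
        new_y = seed_y if total == 0 else (d_nbr * seed_y + d_seed * nbr_y) / total
    else:
        # Neighbour is too far to trust interpolation: jitter the seed instead.
        new_x = [s + rng.gauss(0.0, pert) for s in seed_x]
        new_y = seed_y + rng.gauss(0.0, pert)
    return new_x, new_y


rng = random.Random(0)
print(synthesize([0.0, 0.0], 10.0, [1.0, 1.0], 20.0, max_dist=5.0, rng=rng))
print(synthesize([0.0, 0.0], 10.0, [100.0, 100.0], 20.0, max_dist=5.0, rng=rng))
```

The first call interpolates because the neighbour is within `max_dist`; the second falls back to noise around the seed. A fuller version would scale the noise by each feature's spread and handle categorical features separately.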
Requirements and Installation
imbreg requires Python 3.9 or higher. Its dependencies come from the standard data science stack, including the scikit-learn, XGBoost, Plotly, Seaborn, and Matplotlib packages referenced in this README.

```shell
pip install imbreg
```
Quickstart Guide
Here is a quick snippet of how to use the core functions:
1. Generate Partitions (Cross-Validation)
The cv_partitions function reads your original dataset, cleans it, imputes missing values, and applies SMOGN over-sampling in each repetition.

```python
from imbreg import cv_partitions

cv_partitions(
    ds_name="my_dataset.csv",
    ds_location="raw_data/",
    times=1,            # Number of repetitions
    folds=10,           # Number of partitions (k-fold)
    strat=True,         # Enable stratification
    smogn=True,         # Apply SMOGN during training
    impute=True,        # Impute missing values (NaNs)
    out_dir="Output/",  # Output directory for raw data partitions
)
```
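To make the stratification idea concrete, here is a minimal sketch of one common way to stratify folds on a continuous target. It is an assumption for illustration, not necessarily the algorithm behind `strat=True`, and `stratified_regression_folds` is a hypothetical helper. Samples are sorted by target value and each block of k consecutive samples is dealt across the k folds.

```python
import random


def stratified_regression_folds(y, k, seed=0):
    """Assign each sample index to one of k folds, balancing the target range.

    Sorts samples by target value, walks the sorted list in blocks of k,
    and deals each block's members to the k folds in random order, so every
    fold receives cases from the low, middle, and high ends of y.
    """
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        block = order[start:start + k]
        fold_ids = list(range(k))
        rng.shuffle(fold_ids)
        # A short final block simply fills fewer folds.
        for idx, fold in zip(block, fold_ids):
            folds[fold].append(idx)
    return folds


# Toy target: mostly values near 1, plus three rare extremes.
y = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 25.0, 30.0, 1.05, 27.5]
for i, fold in enumerate(stratified_regression_folds(y, k=2)):
    print(i, sorted(y[j] for j in fold))
```

With a plain random split, all three extreme values could land in the same fold. Dealing sorted blocks guarantees that neighbouring target values are separated.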
2. Evaluate Predictions
Once the folds have been written to disk, evaluate_folds trains the selected model on them and returns a results summary containing the SERA and F1 metrics.

```python
from imbreg import evaluate_folds
from imbreg.validation import export_experiment_summaries

results = evaluate_folds(
    output_dir="Output/",  # Directory containing the generated folds
    dataset="my_dataset",
    model_type="rf",       # 'rf' (Random Forest), 'et' (Extra Trees), 'xgb' (XGBoost)
    n_reps=1,
    n_folds=10,
    use_imputation=True,
    use_smogn=True,
    thr_rel=0.8,           # Relevance threshold to define "rare" cases
)

# You can export these results to a flat structure using the built-in exporter
export_experiment_summaries(
    results, output_dir="Results/", dataset_name="my_dataset", flat_output=True
)
```
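For readers unfamiliar with SERA, the following sketch shows how the metric is usually defined. It is not imbreg's code; the function name `sera`, its signature, and the `steps` parameter are assumptions. For each relevance threshold t in [0, 1], SER_t sums the squared errors of the cases whose relevance is at least t, and SERA is the area under that curve, approximated here with the trapezoidal rule.

```python
def sera(y_true, y_pred, relevance, steps=1000):
    """Approximate SERA: the area under the SER_t curve for t in [0, 1].

    SER_t is the sum of squared errors over the cases whose relevance is >= t.
    `relevance` holds precomputed phi(y) values in [0, 1], one per case.
    """
    sq_err = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]

    def ser(t):
        return sum(e for e, r in zip(sq_err, relevance) if r >= t)

    # Evaluate SER_t on an evenly spaced grid of thresholds.
    curve = [ser(i / steps) for i in range(steps + 1)]
    width = 1.0 / steps
    # Trapezoidal rule over consecutive grid points.
    return sum((a + b) / 2.0 * width for a, b in zip(curve, curve[1:]))


y_true = [1.0, 2.0, 3.0]
# Same squared error (4.0), once on a fully relevant case, once on an irrelevant one.
print(sera(y_true, [1.0, 2.0, 5.0], relevance=[0.0, 0.0, 1.0]))
print(sera(y_true, [3.0, 2.0, 3.0], relevance=[0.0, 1.0, 1.0]))
```

The same error costs far more when it falls on a highly relevant case, because that case stays in the sum for every threshold. Lower SERA is better.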
3. Visualize the Data
Analyze the relevance curve of your target variable:
```python
import matplotlib.pyplot as plt
from imbreg import read_dataset, phi_control, plot_target_distribution

# Load dataset and create relevance control structure
df = read_dataset("my_dataset.csv", "raw_data/")
ctrl = phi_control(df["y"].values, method="extremes")

# Visualize distribution vs relevance
fig = plot_target_distribution(df, target_col="y", phi_ctrl=ctrl, thr_rel=0.8)
plt.show()
```
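As intuition for what a relevance function built from the extremes of the target looks like, here is a simplified stand-in. It is not the `phi_control` implementation: `linear_relevance` is a hypothetical helper that uses straight-line interpolation and the standard 1.5 × IQR boxplot fences, both of which are assumptions. Relevance is 0 at the median and rises to 1 at or beyond the fences.

```python
import statistics


def linear_relevance(y):
    """Build phi(y): 0 at the median, rising linearly to 1 at the boxplot fences."""
    q1, med, q3 = statistics.quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    def phi(v):
        # Distance from the median, relative to the distance to the nearest fence.
        span = (high - med) if v >= med else (med - low)
        dist = abs(v - med)
        if span == 0:
            # Degenerate spread: only the median itself counts as common.
            return 0.0 if dist == 0 else 1.0
        return min(1.0, dist / span)

    return phi


y = list(range(1, 12)) + [100]
phi = linear_relevance(y)
print([round(phi(v), 2) for v in (6.5, 12, 100)])
```

Under this sketch, a threshold such as `thr_rel=0.8` would mark as rare every case whose relevance is at least 0.8, meaning values that lie most of the way from the median to a fence or beyond it.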
Project Structure
imbreg/
│
├── data_loader.py # I/O functions (CSV/KEEL) and imputation wrappers
├── metrics.py # Mathematical evaluation functions (Utility F1, SERA, Bumps)
├── models.py # Training and prediction wrappers (RF, ET, XGBoost)
├── plots.py # Advanced visualizations (Histograms, Scatters, Prediction Error)
├── resampling.py # Core engine for the DIBS strategy (SMOGN for regression)
├── stratification.py # Phi function (relevance) and K-Folds generators
├── utils.py # Math operations, distance metrics, and internal helpers
└── validation.py # Cross-validation evaluation pipeline and result export
Folder Architecture for Experiments
When running the full validation pipeline, the project enforces a clean separation of concerns:
- Output/: Stores all heavy, raw data partitions generated by cross-validation and SMOGN.
- Results/: A flat, clean directory containing only the final .txt and .csv summary metrics.
Author: Gabriel Oliveros
File details
Details for the file imbreg-0.1.0.tar.gz.
File metadata
- Download URL: imbreg-0.1.0.tar.gz
- Upload date:
- Size: 33.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad42d7121c047834fb49a591910bb4280bf2cc913c1c41361328398ed75ce231 |
| MD5 | 17d7be3440d6c6482fe3128ab5e58b9d |
| BLAKE2b-256 | c50d88a57dde11905430f3d733983bde8e3a15e9614eff2c7d3945753bf6ce64 |
File details
Details for the file imbreg-0.1.0-py3-none-any.whl.
File metadata
- Download URL: imbreg-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 84c44afc702f3af43bd6733e1ef41e4bf80016a6ade3ee79e1ed2e4e10731c71 |
| MD5 | fff0e242b0eb0a7c66ca1bff554a3505 |
| BLAKE2b-256 | 6faa983a736b17826d93ea90a7e97e1833ee382172f35c226015b3b081819ea8 |