A Python package for imbalanced learning with ensemble learning.
Rich documentation is available on Read the Docs.
Shape Penalized Decision Forests
Shape Penalized Decision Forests is a package for training ensemble classifiers tailored to imbalanced datasets. It provides two primary implementations:
- SPBoDF (Shape Penalized Boosting Decision Forest): A boosting ensemble method that builds multiple trees sequentially, adjusting the weights of samples to focus on harder-to-classify instances.
- SPBaDF (Shape Penalized Bagging Decision Forest): A bagging ensemble method that builds multiple trees independently on bootstrap samples of the dataset, improving robustness and reducing overfitting.
Both implementations use the concept of Surface-to-Volume Regularization (SVR) to penalize irregular decision boundaries, thus improving generalization and addressing challenges associated with imbalanced datasets.
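For some intuition about the surface-to-volume idea, the sketch below computes the ratio for a single axis-aligned box, the kind of region a tree leaf carves out of feature space. It is only a geometric illustration under that assumption, not the penalty as implemented in this package, and the helper name `surface_to_volume` is hypothetical.

```python
import math


def surface_to_volume(side_lengths):
    """Surface-to-volume ratio of an axis-aligned box (illustrative helper)."""
    volume = math.prod(side_lengths)
    # Each axis contributes two opposite faces; a face's size is the
    # volume divided by the side length along that axis.
    surface = sum(2 * volume / s for s in side_lengths)
    return surface / volume


# Two 2-D regions with the same area: a square and a thin sliver.
square = surface_to_volume([1.0, 1.0])
sliver = surface_to_volume([4.0, 0.25])
print(square, sliver)
```

The sliver has the larger ratio even though both regions cover the same area, which is the sense in which a surface-to-volume penalty discourages thin, irregular regions.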
Key Features
- Boosting and Bagging: Two ensemble approaches tailored for classification tasks.
- Shape Penalization: Incorporates a novel regularization technique to control decision boundary complexity.
- Imbalanced Data Handling: Designed with class imbalance in mind, using weighting and bootstrapping techniques.
- Scikit-learn Compatible: Implements BaseEstimator and ClassifierMixin, so it integrates with the Scikit-learn ecosystem.
- Customizability: Hyperparameters such as the number of trees, shape penalty, and maximal leaves are configurable for fine-tuning.
Installation
- Downloading Locally and Installing
  git clone https://www.github.com/yuvrajiro/imbalanced-spdf.git
  cd imbalanced-spdf
- Install dependencies:
  pip install -r requirements.txt
- Install the package:
  pip install -e .
- Using pip install from GitHub
  pip install git+https://www.github.com/yuvrajiro/imbalanced-spdf.git
- Using pip install from PyPI
  pip install imbalanced-spdf
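After any of the installation routes above, one way to confirm that the distribution is visible to the current interpreter is to query its metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("imbalanced-spdf"))
```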
Usage
1. SPBoDF (Boosting)
Example
import numpy as np
from imbalanced_spdf.ensemble import SPBoDF
# Generate synthetic data
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, 100)
# Initialize and fit SPBoDF
boosting_model = SPBoDF(n_trees=50, weight=2, pen=1.0, random_state=42)
boosting_model.fit(X_train, y_train)
# Predict
X_test = np.random.rand(20, 5)
y_pred = boosting_model.predict(X_test)
print("Predictions:", y_pred)
2. SPBaDF (Bagging)
Example
import numpy as np
from imbalanced_spdf.ensemble import SPBaDF
# Generate synthetic data
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, 100)
# Initialize and fit SPBaDF
bagging_model = SPBaDF(n_trees=50, weight=2, pen=1.0, random_state=42)
bagging_model.fit(X_train, y_train)
# Predict
X_test = np.random.rand(20, 5)
y_pred = bagging_model.predict(X_test)
print("Predictions:", y_pred)
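The examples above only print predictions. On imbalanced data, plain accuracy can look good even when the minority class is never predicted, so a class-aware metric is usually more informative. The following is a minimal pure-Python sketch of balanced accuracy (the mean of per-class recall); the helper name `balanced_accuracy` and the toy labels are illustrative assumptions, not part of this package.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions where the true label equals this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy case: nine majority samples, one minority sample,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

The same function can be applied to the `y_pred` returned by either model once true test labels are available.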
API Reference
SPBoDF
A boosting ensemble classifier that uses SVR-regularized trees to handle imbalanced datasets.
Parameters:
- n_trees (int): Number of trees in the ensemble (default: 40).
- weight (float): Weight for the minority class to address imbalance (default: 1).
- pen (float): Regularization penalty controlling decision boundary complexity (default: 0).
- maximal_leaves (int or float, optional): Maximum leaves per tree. Defaults to 2 * sqrt(n_samples) * 0.3333.
- random_state (int): Random seed for reproducibility (default: 23).
Methods:
- fit(X, y): Fits the ensemble on the training data.
- predict(X): Predicts the labels of the test data.
SPBaDF
A bagging ensemble classifier that uses SVR-regularized trees to improve robustness.
Parameters:
- n_trees (int): Number of trees in the ensemble (default: 40).
- weight (float): Weight for the minority class to address imbalance (default: 1).
- pen (float): Regularization penalty controlling decision boundary complexity (default: 0).
- maximal_leaves (int or float, optional): Maximum leaves per tree. Defaults to 2 * sqrt(n_samples) * 0.3333.
- random_state (int): Random seed for reproducibility (default: 23).
Methods:
- fit(X, y): Fits the ensemble on the training data.
- predict(X): Predicts the labels of the test data.
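To get a feel for the documented default of maximal_leaves, the sketch below simply evaluates the expression 2 * sqrt(n_samples) * 0.3333 for a few training-set sizes. The helper name `default_maximal_leaves` is hypothetical, and whether the package rounds or truncates the value internally is not stated here, so the raw value is shown.

```python
import math


def default_maximal_leaves(n_samples):
    """Evaluate the documented default expression; no rounding is applied."""
    return 2 * math.sqrt(n_samples) * 0.3333


for n in (100, 400, 10_000):
    print(n, round(default_maximal_leaves(n), 2))
```

Because the expression grows with the square root of the sample count, the leaf budget grows slowly: multiplying the training-set size by 100 multiplies the default by only 10.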
How It Works
- SPBoDF (Boosting):
  - Trees are built sequentially, with sample weights updated after each iteration to focus on misclassified samples.
  - Regularization (SVR) penalizes irregular decision boundaries to avoid overfitting.
- SPBaDF (Bagging):
  - Trees are built independently on bootstrap samples of the training data.
  - Each tree focuses on non-constant feature subsets, improving robustness and generalization.
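The two schemes described above can be sketched in a few lines of standard-library Python. The exact update rule used by SPBoDF is not given here, so the reweighting below uses the classic AdaBoost update as a stand-in; both helper names, `reweight` and `bootstrap_indices`, are hypothetical and not part of the package.

```python
import math
import random


def reweight(weights, y_true, y_pred):
    """One AdaBoost-style step: upweight misclassified samples, then normalize."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return [w / norm for w in updated], alpha


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement, one draw set per bagged tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


# Boosting: four equally weighted samples, the last one misclassified.
new_weights, alpha = reweight([0.25] * 4, [0, 0, 1, 1], [0, 0, 1, 0])
print(new_weights)

# Bagging: indices of one bootstrap sample from a five-row dataset.
print(bootstrap_indices(5, random.Random(42)))
```

In the boosting step the single misclassified sample ends up carrying half of the total weight, so the next tree is pushed to get it right. In the bagging step some indices repeat and others are left out, which is what makes the trees differ from one another.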
Dataset Details
License
This project is licensed under the MIT License. See the LICENSE file for details.
Citation
If you are using this package in your research, please consider citing the following paper:
Shape Penalized Decision Forests for Imbalanced Data Classification: Rahul Goswami, Aindrila Garai, Payel Sadhukhan, Palash Ghosh, Tanujit Chakraborty
References
- Zhu, Y., Li, C., & Dunson, D. B. (2023). "Classification Trees for Imbalanced Data: Surface-to-Volume Regularization." Journal of the American Statistical Association.