Resampling strategies for regression
Resreg is a Python package for resampling imbalanced distributions in regression problems.
If you find resreg useful, please cite the following article:
Gado, J.E., Beckham, G.T., and Payne, C.M. (2020). Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. J. Chem. Inf. Model. 60(8), 4098-4107.
If you use RO, RU, SMOTER, GN, or WERCS methods, also cite
Branco, P., Torgo, L., and Ribeiro, R.P. (2019). Pre-processing approaches for imbalanced distributions in regression. Neurocomputing. 343, 76-99.
If you use REBAGG, also cite
Branco, P., Torgo, L., and Ribeiro, R.P. (2018). REBAGG: Resampled bagging for imbalanced regression. In 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications. pp. 67-81.
If you use precision, recall, or F1-score for regression, also cite
Torgo, L. and Ribeiro, R.P. (2009). Precision and recall for regression. In International Conference on Discovery Science. pp. 332-346.
Installation
Preferably, install from GitHub source. The use of a virtual environment is strongly advised.
git clone https://github.com/jafetgado/resreg.git
cd resreg
pip install -r requirements.txt
python setup.py install
Or, install with pip (less preferred)
pip install resreg
Prerequisites
Python 3
Numpy
Scipy
Pandas
Scikit-learn
Usage
A regression dataset (X, y) can be resampled to mitigate the imbalance in the distribution with any of six strategies: random oversampling, random undersampling, SMOTER, Gaussian noise, WERCS, or REBAGG.
Random oversampling (RO): randomly oversample rare values selected by the user via a relevance function.
Random undersampling (RU): randomly undersample abundant values.
SMOTER: randomly undersample abundant values; oversample rare values by interpolation between nearest neighbors.
Gaussian noise (GN): randomly undersample abundant values; oversample rare values by adding Gaussian noise.
WERCS: resample the dataset by selecting instances using user-specified relevance values as weights.
REBAGG: train an ensemble of Scikit-learn base learners on independently resampled bootstrap subsets of the dataset.
See the tutorial for more details.
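To illustrate the ideas behind relevance functions and SMOTER-style interpolation (this sketch is not the resreg API; the function names `relevance_sigmoid` and `smoter_like_oversample` and their parameters are hypothetical), a minimal NumPy/scikit-learn version might look like:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def relevance_sigmoid(y, center, scale):
    # Map targets to [0, 1]; values well above `center` approach 1 (rare/relevant).
    # `center` and `scale` are illustrative parameters, not resreg's signature.
    return 1.0 / (1.0 + np.exp(-(y - center) / scale))

def smoter_like_oversample(X, y, relevance, threshold=0.5, k=3, n_new=50, seed=None):
    # Core idea of SMOTER, simplified: create synthetic examples by
    # interpolating between a rare instance and one of its nearest rare neighbors.
    rng = np.random.default_rng(seed)
    rare = relevance > threshold
    X_rare, y_rare = X[rare], y[rare]
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_rare))).fit(X_rare)
    _, idx = nn.kneighbors(X_rare)
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.integers(len(X_rare))
        j = rng.choice(idx[i][1:])           # a nearest rare neighbor (not itself)
        frac = rng.random()                  # interpolation fraction in [0, 1)
        X_new.append(X_rare[i] + frac * (X_rare[j] - X_rare[i]))
        y_new.append(y_rare[i] + frac * (y_rare[j] - y_rare[i]))
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```

The synthetic targets are convex combinations of rare targets, so the resampled set gains density exactly in the rare region of the distribution.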
Examples
import resreg
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Resample training set with random oversampling
relevance = resreg.sigmoid_relevance(y_train, cl=None, ch=np.percentile(y_train, 90))
X_train, y_train = resreg.random_oversampling(X_train, y_train, relevance, relevance_threshold=0.5,
over='balance')
# Fit regressor to resampled training set
reg = RandomForestRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
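Performance on the rare values can then be assessed with precision and recall for regression (Torgo and Ribeiro, 2009). As a simplified, self-contained illustration of the idea (not resreg's evaluation API; `regression_precision_recall` and its arguments are hypothetical, and this binarized form omits the utility weighting of the full definition):

```python
import numpy as np

def regression_precision_recall(y_true, y_pred, relevance_fn, threshold=0.5):
    # An instance counts as a "rare event" when its relevance exceeds `threshold`.
    # Precision: of the instances predicted to be rare, how many truly are.
    # Recall: of the truly rare instances, how many were predicted as rare.
    true_rare = relevance_fn(np.asarray(y_true)) > threshold
    pred_rare = relevance_fn(np.asarray(y_pred)) > threshold
    tp = np.sum(true_rare & pred_rare)
    precision = tp / max(pred_rare.sum(), 1)
    recall = tp / max(true_rare.sum(), 1)
    return precision, recall
```

Unlike overall RMSE, these metrics reveal whether a model resampled toward the rare region actually predicts rare values better.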