Resampling strategies for regression
Resreg is a Python package for resampling imbalanced distributions in regression problems.
If you find resreg useful, please cite the following article:
- Gado, J.E., Beckham, G.T., and Payne, C.M. (2020). Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. J. Chem. Inf. Model. 60(8), 4098-4107.
If you use the RO, RU, SMOTER, GN, or WERCS methods, also cite:
- Branco, P., Torgo, L., and Ribeiro, R.P. (2019). Pre-processing approaches for imbalanced distributions in regression. Neurocomputing. 343, 76-99.
If you use REBAGG, also cite:
- Branco, P., Torgo, L., and Ribeiro, R.P. (2018). REBAGG: Resampled bagging for imbalanced regression. In 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications. pp 67-81.
If you use precision, recall, or F1-score for regression, also cite:
- Torgo, L. and Ribeiro, R.P. (2009). Precision and recall for regression. In International Conference on Discovery Science. pp 332-346.
Preferably, install from the GitHub source. The use of a virtual environment is strongly advised.
git clone https://github.com/jafetgado/resreg.git
cd resreg
pip install -r requirements.txt
python setup.py install
Or, install with pip (less preferred):
pip install resreg
- Python 3
A regression dataset (X, y) can be resampled to mitigate imbalance in the distribution of target values with any of six strategies: random oversampling, random undersampling, SMOTER, Gaussian noise, WERCS, or REBAGG.
- Random oversampling (RO): randomly oversample rare values selected by the user via a relevance function.
- Random undersampling (RU): randomly undersample abundant values.
- SMOTER: randomly undersample abundant values; oversample rare values by interpolation between nearest neighbors.
- Gaussian noise (GN): randomly undersample abundant values; oversample rare values by adding Gaussian noise.
- WERCS: resample the dataset by selecting instances using user-specified relevance values as weights.
- REBAGG: train an ensemble of scikit-learn base learners on independently resampled bootstrap subsets of the dataset.
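The ideas behind these strategies can be sketched with plain NumPy. The block below is only an illustration under stated assumptions, not resreg's implementation: the function names (`rare_mask`, `oversample_rare`, `undersample_abundant`, `interpolate`, `weighted_select`) are hypothetical, the logistic relevance function is a stand-in, and "balancing" is assumed to mean equalizing the rare and abundant counts.

```python
import numpy as np


def rare_mask(relevance, threshold=0.5):
    """Mark samples whose relevance reaches the threshold as rare."""
    return np.asarray(relevance) >= threshold


def oversample_rare(X, y, relevance, threshold=0.5, seed=0):
    """RO-style sketch: duplicate rare samples until they match the abundant count."""
    rng = np.random.default_rng(seed)
    rare = rare_mask(relevance, threshold)
    rare_idx, abundant_idx = np.flatnonzero(rare), np.flatnonzero(~rare)
    if len(rare_idx) == 0:
        return X, y  # nothing to oversample
    n_extra = max(len(abundant_idx) - len(rare_idx), 0)
    extra = rng.choice(rare_idx, size=n_extra, replace=True)
    keep = np.concatenate([abundant_idx, rare_idx, extra])
    return X[keep], y[keep]


def undersample_abundant(X, y, relevance, threshold=0.5, seed=0):
    """RU-style sketch: drop abundant samples until they match the rare count."""
    rng = np.random.default_rng(seed)
    rare = rare_mask(relevance, threshold)
    rare_idx, abundant_idx = np.flatnonzero(rare), np.flatnonzero(~rare)
    n_keep = min(len(rare_idx), len(abundant_idx))
    kept = rng.choice(abundant_idx, size=n_keep, replace=False)
    keep = np.concatenate([kept, rare_idx])
    return X[keep], y[keep]


def interpolate(x1, y1, x2, y2, t):
    """SMOTER-style sketch: synthetic sample on the segment between two rare samples."""
    return x1 + t * (x2 - x1), y1 + t * (y2 - y1)


def weighted_select(X, y, relevance, size, seed=0):
    """WERCS-style sketch: draw samples with probability proportional to relevance."""
    rng = np.random.default_rng(seed)
    p = np.asarray(relevance, dtype=float)
    idx = rng.choice(len(y), size=size, replace=True, p=p / p.sum())
    return X[idx], y[idx]


# Toy data: high target values (above the 90th percentile) are treated as rare.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = rng.normal(loc=50.0, scale=10.0, size=200)
ch = np.percentile(y, 90)
relevance = 1.0 / (1.0 + np.exp(-(y - ch)))  # stand-in relevance, equals 0.5 at ch

X_over, y_over = oversample_rare(X, y, relevance)
X_under, y_under = undersample_abundant(X, y, relevance)
X_w, y_w = weighted_select(X, y, relevance, size=100)
```

In this sketch, oversampling grows the dataset and undersampling shrinks it, while the weighted selection returns whatever size is requested. The Gaussian noise strategy would replace the interpolation step with a random perturbation of a rare sample.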
See the tutorial for more details.
import numpy as np
import resreg
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Resample training set with random oversampling
# (relevance is computed on y_train so that it matches the training set)
relevance = resreg.sigmoid_relevance(y_train, cl=None, ch=np.percentile(y_train, 90))
X_train, y_train = resreg.random_oversampling(X_train, y_train, relevance, relevance_threshold=0.5, over='balance')

# Fit regressor to resampled training set and predict on the held-out test set
reg = RandomForestRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)