Data Science Utility Functions
ds_utils
Install
pip install ds_utils
Development notes
This library has been developed using nbdev. Any change or PR has to be made directly to the notebooks that create each module. All tests must pass.
ml.RegressorCV
Class
Overview
The RegressorCV class in the ml.py module of the ds_utils library trains an estimator using cross-validation and records metrics for each fold. It also stores the model trained on each fold of the cross-validation process. The final prediction is the median of the predictions from the stored models.
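The per-fold training and median aggregation described above can be sketched roughly as follows. This is an illustrative stand-in, not the library's internal code; the base estimator, synthetic data, and variable names are hypothetical:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Train one clone of the base regressor per fold and keep all of them
models = []
for train_idx, _ in KFold(n_splits=5).split(X):
    models.append(clone(Ridge()).fit(X[train_idx], y[train_idx]))

# Final prediction: median across the five fold models
X_new = rng.normal(size=(10, 3))
preds = np.median([m.predict(X_new) for m in models], axis=0)
```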
Initialization
RegressorCV(base_reg, cv=5, groups=None, verbose=False, n_bins_stratify=None)
Parameters
base_reg : Estimator object implementing 'fit'. The object to use to fit the data.
cv : int or cross-validation generator, default=5. Determines the cross-validation splitting strategy.
groups : array-like of shape (n_samples,), default=None. Group labels for the samples used while splitting the dataset into train/test set.
n_bins_stratify : int, default=None. Number of bins to use for stratification.
verbose : bool, default=False. Whether or not to print metrics.
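A common way to stratify cross-validation splits for a continuous target, which is presumably what n_bins_stratify enables, is to bin the target into quantiles and stratify on the bin labels. A minimal sketch of that idea (illustrative only; the exact mechanism inside RegressorCV may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = pd.Series(rng.gamma(2.0, 2.0, size=200))  # skewed continuous target
X = np.zeros((200, 1))                        # placeholder feature matrix

# Bin the continuous target into quantile bins, then stratify on the bins
n_bins_stratify = 10
y_binned = pd.qcut(y, q=n_bins_stratify, labels=False)
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y_binned))
```

Each fold then contains a similar distribution of target values, which stabilizes per-fold metrics on skewed targets.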
Attributes
cv_results_ : Dictionary containing the results of the cross-validation, including the fold number, the regressor, and the metrics.
oof_train_ : Series containing the out-of-fold predictions on the training set.
oof_score_ : The R2 score calculated using the out-of-fold predictions.
oof_mape_ : The Mean Absolute Percentage Error calculated using the out-of-fold predictions.
oof_rmse_ : The Root Mean Squared Error calculated using the out-of-fold predictions.
metrics_ : The summary of metrics calculated during the fitting process.
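The three out-of-fold metrics above can all be computed with standard scikit-learn helpers. A small sketch with made-up numbers (not the library's code):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])     # training targets
oof_pred = np.array([2.8, 5.3, 2.4, 6.6])   # out-of-fold predictions

oof_score = r2_score(y_true, oof_pred)                         # R2
oof_mape = mean_absolute_percentage_error(y_true, oof_pred)    # MAPE
oof_rmse = np.sqrt(mean_squared_error(y_true, oof_pred))       # RMSE
```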
Methods
fit(self, X, y, **kwargs)
Trains the base regressor on the provided data using cross-validation and stores the results.
predict(self, X)
Predicts the target variable for the provided features using the median value of the predictions from the models trained during cross-validation.
Example
from ds_utils.ml import RegressorCV
from sklearn.ensemble import RandomForestRegressor
# Initialize the RegressorCV with a base regressor and cross-validation strategy
reg_cv = RegressorCV(base_reg=RandomForestRegressor(), cv=5, verbose=True)
# Fit the RegressorCV to the training data
reg_cv.fit(X_train, y_train)
# Predict the target variable for new data
predictions = reg_cv.predict(X_new)
# Get the summary of recorded metrics
metrics = reg_cv.metrics_
ml.RegressorTimeSeriesCV
Class
Overview
The RegressorTimeSeriesCV class in the ml.py module of the ds_utils library trains a base regressor using time series cross-validation and records metrics for each fold.
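Time series cross-validation keeps the temporal order: each fold trains on the past and evaluates on the next window. A minimal sketch of recording per-fold results, in the spirit of cv_results_ (illustrative only; the estimator, data, and dictionary keys are stand-ins):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = np.arange(120, dtype=float).reshape(-1, 1)          # time-ordered feature
y = 10 + 0.5 * X.ravel() + rng.normal(scale=1.0, size=120)

cv_results = []
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    reg = clone(Ridge()).fit(X[train_idx], y[train_idx])
    cv_results.append({
        "fold": fold,
        "regressor": reg,
        "mape": mean_absolute_percentage_error(y[test_idx], reg.predict(X[test_idx])),
    })
```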
Initialization
RegressorTimeSeriesCV(base_reg, cv=5, verbose=False, catboost_use_eval_set=False)
Parameters
base_reg : Estimator object implementing 'fit'. The object to use to fit the data.
cv : int, default=5. Determines the cross-validation splitting strategy.
verbose : bool, default=False. Whether or not to print metrics.
catboost_use_eval_set : bool, default=False. Whether or not to use eval_set in CatBoostRegressor.
Attributes
cv_results_ : List containing the results of the cross-validation, including fold number, regressor, train and test indices, and metrics.
metrics_ : The summary of metrics calculated during the fitting process.
y_test_last_fold_ : The true target variable values of the last fold.
y_pred_last_fold_ : The predicted target variable values of the last fold.
Methods
fit(self, X, y, sample_weight=None)
Trains the base regressor on the provided data using time series cross-validation and stores the results.
predict(self, X)
Predicts the target variable for the provided features using the base regressor trained on the full data.
Example
from ds_utils.ml import RegressorTimeSeriesCV
from sklearn.ensemble import RandomForestRegressor
# Initialize the RegressorTimeSeriesCV with a base regressor and cross-validation strategy
reg_tscv = RegressorTimeSeriesCV(base_reg=RandomForestRegressor(), cv=5, verbose=True)
# Fit the RegressorTimeSeriesCV to the training data
reg_tscv.fit(X_train, y_train)
# Predict the target variable for new data
predictions = reg_tscv.predict(X_new)
# Get the summary of recorded metrics
metrics = reg_tscv.metrics_
ml.KNNRegressor
Class
Overview
The KNNRegressor class in the ml.py module of the ds_utils library is an extension of the KNeighborsRegressor from scikit-learn, with modifications to the predict method to allow different calculations for predictions and to optionally return the indices of the nearest neighbors.
Initialization
The initialization parameters are the same as those of the KNeighborsRegressor from scikit-learn. Refer to the official documentation for details on the parameters.
Methods
predict(self, X, return_match_index=False, pred_calc='mean')
Predicts the target variable for the provided features and allows different calculations for predictions.
Parameters
X
: array-like of shape (n_samples, n_features). Test samples.return_match_index
: bool, default=False. Whether to return the index of the nearest matched neighbor.pred_calc
: str, default=‘mean’. The calculation to use for predictions. Possible values are ‘mean’ and ‘median’.
Returns
y_pred : array of shape (n_samples,) or (n_samples, n_outputs). The predicted target variable.
nearest_matched_index : array of shape (n_samples,). The index of the nearest matched neighbor. Returned only if return_match_index=True.
neigh_ind : array of shape (n_samples, n_neighbors). Indices of the neighbors in the training set. Returned only if return_match_index=True.
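The median-based prediction and the neighbor indices can both be reproduced with the plain scikit-learn KNeighborsRegressor, which is presumably what KNNRegressor builds on. A hedged sketch (not the library's implementation; data and names are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = X_train.sum(axis=1)

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
X_new = rng.normal(size=(4, 2))

# Indices of the k nearest training samples for each query row
neigh_ind = knn.kneighbors(X_new, return_distance=False)
# Median (instead of the default mean) over the neighbors' targets
y_pred_median = np.median(y_train[neigh_ind], axis=1)
# Nearest matched neighbor = closest index in each row
nearest_matched_index = neigh_ind[:, 0]
```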
Example
from ds_utils.ml import KNNRegressor
# Initialize the KNNRegressor with specific parameters
knn_reg = KNNRegressor(n_neighbors=3)
# Fit the KNNRegressor to the training data
knn_reg.fit(X_train, y_train)
# Predict the target variable for new data and return the index of the nearest matched neighbor
predictions, nearest_matched_index, neigh_ind = knn_reg.predict(X_new, return_match_index=True, pred_calc='median')
ml.AutoRegressor
Class
Overview
The AutoRegressor class is designed for performing automated regression tasks, including preprocessing and model fitting. It supports several regression algorithms and allows for easy comparison of their performance on a given dataset. The class provides various methods for model evaluation, feature importance, and visualization.
Initialization
ar = AutoRegressor(
num_cols,
cat_cols,
target_col,
data=None,
train=None,
test=None,
random_st=42,
log_target=False,
estimator='catboost',
imputer_strategy='simple',
use_catboost_native_cat_features=False,
ohe_min_freq=0.05,
scale_numeric_data=False,
scale_categoric_data=False,
scale_target=False
)
Parameters
- num_cols: list
- List of numerical columns in the dataset.
- cat_cols: list
- List of categorical columns in the dataset.
- target_col: str
- Target column name in the dataset.
- data: pd.DataFrame, optional (default=None)
- Input dataset containing both features and target column.
- train: pd.DataFrame, optional (default=None)
- Training dataset containing both features and target column. Used if data is not provided.
- test: pd.DataFrame, optional (default=None)
- Testing dataset containing both features and target column. Used if data is not provided.
- random_st: int, optional (default=42)
- Random state for reproducibility.
- log_target: bool, optional (default=False)
- If the logarithm of the target variable should be used.
- estimator: str or estimator object, optional (default='catboost')
- String or estimator object. Options are 'catboost', 'random_forest', and 'linear'.
- imputer_strategy: str, optional (default='simple')
- Imputation strategy for missing values. Options are 'simple' and 'knn'.
- use_catboost_native_cat_features: bool, optional (default=False)
- If the native CatBoost categorical feature handling should be used.
- ohe_min_freq: float, optional (default=0.05)
- Minimum frequency for OneHotEncoder to consider a category in categorical columns.
- scale_numeric_data: bool, optional (default=False)
- If numeric data should be scaled using StandardScaler.
- scale_categoric_data: bool, optional (default=False)
- If categorical data (after one-hot encoding) should be scaled using StandardScaler.
- scale_target: bool, optional (default=False)
- If the target variable should be scaled using StandardScaler.
Methods
fit_report(self)
Fits the model to the training data and prints the R2 Score, RMSE, and MAPE on the test data.
test_binary_column(self, binary_column)
Tests the significance of a binary column on the target variable and returns the p-value of a Mann–Whitney U test.
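The underlying test compares the target distribution between the two groups of the binary column using scipy. A sketch with synthetic data (illustrative only; the group construction inside test_binary_column may differ):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Target values for the two groups of a hypothetical binary column
target = np.concatenate([rng.normal(10, 1, 100), rng.normal(12, 1, 100)])
flag = np.array([0] * 100 + [1] * 100)

# Nonparametric test: do the two groups have different target distributions?
stat, p_value = mannwhitneyu(target[flag == 0], target[flag == 1])
```

A small p-value suggests the binary column is associated with a shift in the target distribution.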
get_coefficients(self)
Returns the coefficients of the model if the estimator is linear, otherwise returns feature importances.
get_feature_importances(self)
Returns the feature importances of the model.
get_shap(self, return_shap_values=False)
Generates and plots SHAP values for the model and returns the SHAP values if return_shap_values is True.
plot_importance(self, feat_imp, graph_title="Model feature importance")
Plots the feature importances provided in feat_imp with the specified graph_title.
Example Usage
# Initialize AutoRegressor
ar = AutoRegressor(num_cols, cat_cols, target_col, data)
# Fit the model and print the report
ar.fit_report()
# Get and plot feature importances
feat_imp = ar.get_feature_importances()
ar.plot_importance(feat_imp)