Handy machine learning tools in the spirit of scikit-learn.
extrakit-learn
Machine learning components built to extend scikit-learn. All components follow scikit-learn's estimator API, so they work interchangeably with scikit-learn components. It is mostly a collection of tools that have been useful in Kaggle competitions. extrakit-learn is in no way affiliated with scikit-learn, only inspired by it.
Installation
pip install xklearn
Components
- CategoryEncoder - Like scikit's LabelEncoder but supports NaNs and unseen values.
- CountEncoder - Categorical feature engineering on a column based on value counts.
- TargetEncoder - Categorical feature engineering on a column based on target means.
- MultiColumnEncoder - Apply a column encoder to multiple columns.
- FoldEstimator - K-fold on scikit estimator wrapped into an estimator.
- FoldLightGBM - K-fold on LGBM wrapped into an estimator.
- FoldXGBoost - K-fold on XGBoost wrapped into an estimator.
- StackClassifier - Stack an ensemble of classifiers with a meta classifier.
- StackRegressor - Stack an ensemble of regressors with a meta regressor.
- compress_dataframe - Reduce memory of a Pandas dataframe.
Hierarchy
xklearn
│
├── preprocessing
│ ├── CategoryEncoder
│ ├── CountEncoder
│ ├── TargetEncoder
│ └── MultiColumnEncoder
│
├── models
│ ├── FoldEstimator
│ ├── FoldLightGBM
│ ├── FoldXGBoost
│ ├── StackClassifier
│ └── StackRegressor
│
└── utils
Example
from xklearn.models import FoldEstimator
CategoryEncoder
Wraps scikit's LabelEncoder, allowing missing and unseen values to be handled.
Arguments
unseen
- Strategy for handling unseen values. See replacement strategies below for options.
missing
- Strategy for handling missing values. See replacement strategies below for options.
Replacement strategies
'encode'
- Replace value with -1.
'nan'
- Replace value with np.nan.
'error'
- Raise ValueError.
Example
from xklearn.preprocessing import CategoryEncoder
...
ce = CategoryEncoder(unseen='nan', missing='nan')
X[:, 0] = ce.fit_transform(X[:, 0])
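The replacement strategies above can be illustrated with a small library-free sketch. It is not xklearn's implementation; the helper names (is_missing, fit_categories, transform) are hypothetical and only mirror the documented 'encode', 'nan' and 'error' options.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fit_categories(values):
    """Map each distinct non-missing value to an integer label (sorted order)."""
    seen = {v for v in values if not is_missing(v)}
    return {v: i for i, v in enumerate(sorted(seen))}

def replacement(strategy, kind):
    """Return the fill value for a strategy, or raise for 'error'."""
    if strategy == 'encode':
        return -1
    if strategy == 'nan':
        return math.nan
    if strategy == 'error':
        raise ValueError(f"{kind} value encountered")
    raise ValueError(f"unknown strategy: {strategy!r}")

def transform(values, mapping, unseen='nan', missing='nan'):
    """Encode values, handling missing and unseen entries separately."""
    out = []
    for v in values:
        if is_missing(v):
            out.append(replacement(missing, 'missing'))
        elif v not in mapping:
            out.append(replacement(unseen, 'unseen'))
        else:
            out.append(mapping[v])
    return out

mapping = fit_categories(['cat', 'dog', None, 'cat'])
encoded = transform(['dog', 'bird', None], mapping, unseen='encode', missing='nan')
print(mapping)   # {'cat': 0, 'dog': 1}
print(encoded)   # [1, -1, nan]
```

Here 'bird' was not present during fitting, so it takes the unseen strategy, while None takes the missing strategy.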
CountEncoder
Replaces categorical values with their value counts from the training data. Classes with a count of one, and classes first encountered during prediction, are encoded as either 1 or NaN, depending on the selected strategy.
Arguments
unseen
- Strategy for handling unseen values. See replacement strategies below for options.
missing
- Strategy for handling missing values. See replacement strategies below for options.
Replacement strategies
'one'
- Replace value with 1.
'nan'
- Replace value with np.nan.
'error'
- Raise ValueError.
Example
from xklearn.preprocessing import CountEncoder
...
ce = CountEncoder(unseen='one')
X[:, 0] = ce.fit_transform(X[:, 0])
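As a rough illustration of count encoding, the sketch below uses only the standard library. The function count_encode is a hypothetical helper, not part of xklearn, and it covers only the handling of unseen values.

```python
import math
from collections import Counter

def count_encode(train, test, unseen='one'):
    """Replace each value in `test` with its frequency in `train`."""
    if unseen not in ('one', 'nan', 'error'):
        raise ValueError(f"unknown strategy: {unseen!r}")
    counts = Counter(train)
    encoded = []
    for value in test:
        if value in counts:
            encoded.append(counts[value])
        elif unseen == 'one':
            encoded.append(1)
        elif unseen == 'nan':
            encoded.append(math.nan)
        else:  # unseen == 'error'
            raise ValueError(f"unseen value: {value!r}")
    return encoded

train = ['a', 'b', 'a', 'c', 'a']
print(count_encode(train, ['a', 'c', 'z']))  # [3, 1, 1]
```

With unseen='one', the unseen value 'z' is indistinguishable from 'c', which appeared exactly once in training.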
TargetEncoder
Performs target mean encoding of categorical features with optional smoothing.
Arguments
smoothing
- Smoothing weight.
unseen
- Strategy for handling unseen values. See replacement strategies below for options.
missing
- Strategy for handling missing values. See replacement strategies below for options.
Replacement strategies
'global'
- Replace value with global target mean.
'nan'
- Replace value with np.nan.
'error'
- Raise ValueError.
Example
from xklearn.preprocessing import TargetEncoder
...
te = TargetEncoder(smoothing=10)
X[:, 0] = te.fit_transform(X[:, 0], y)
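To make the role of the smoothing weight concrete, here is a minimal sketch of one common smoothing scheme, in which each category mean is pulled toward the global mean in proportion to the weight. This is an assumption for illustration; the exact formula xklearn uses is not stated here and may differ. The helper names are hypothetical.

```python
from collections import defaultdict

def fit_target_means(categories, targets, smoothing=10.0):
    """Return ({category: smoothed target mean}, global mean).

    Assumed formula: (sum_c + smoothing * global_mean) / (n_c + smoothing).
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

def transform(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean ('global')."""
    return [encoding.get(c, global_mean) for c in categories]

cats = ['a', 'a', 'b', 'b', 'b', 'b']
y = [1, 1, 0, 0, 0, 1]
encoding, global_mean = fit_target_means(cats, y, smoothing=2.0)
print(encoding['a'])                                  # 0.75
print(transform(['a', 'z'], encoding, global_mean))   # [0.75, 0.5]
```

With smoothing=0 the encoding reduces to the plain per-category mean (1.0 for 'a'); larger weights shrink rare categories more strongly toward the global mean of 0.5.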
MultiColumnEncoder
Applies a column encoder over multiple columns.
Arguments
enc
- Base encoder that will be applied to the selected columns.
columns
- Column selection, either bool-mask, indices or None (default=None).
Example
from xklearn.preprocessing import CountEncoder
from xklearn.preprocessing import MultiColumnEncoder
...
columns = [1, 3, 4]
enc = CountEncoder()
mce = MultiColumnEncoder(enc, columns)
X = mce.fit_transform(X)
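The column selection can be pictured with the following library-free sketch, which applies one column-wise encoder to a subset of columns of a row-major table. The names encode_columns and count_encode_column are hypothetical, and treating columns=None as "all columns" is an assumption rather than documented behaviour.

```python
from collections import Counter

def count_encode_column(column):
    """Toy column encoder: replace each value with its frequency."""
    counts = Counter(column)
    return [counts[v] for v in column]

def encode_columns(rows, encode, columns=None):
    """Apply `encode` to the selected columns and return a new table."""
    n_cols = len(rows[0])
    if columns is None:
        selected = list(range(n_cols))          # assumption: None means all
    elif all(isinstance(c, bool) for c in columns):
        selected = [i for i, keep in enumerate(columns) if keep]  # bool mask
    else:
        selected = list(columns)                # explicit indices
    out = [list(row) for row in rows]
    for j in selected:
        encoded = encode([row[j] for row in rows])
        for i, value in enumerate(encoded):
            out[i][j] = value
    return out

rows = [['x', 10, 'u'], ['y', 20, 'u'], ['x', 30, 'v']]
by_index = encode_columns(rows, count_encode_column, columns=[0, 2])
by_mask = encode_columns(rows, count_encode_column, columns=[True, False, True])
print(by_index)  # [[2, 10, 2], [1, 20, 2], [2, 30, 1]]
```

The index list and the boolean mask select the same columns here, so both calls give the same result.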
FoldEstimator
An estimator wrapper that automatically runs cross-validation with the selected folding method when fit is called. After fitting, it can optionally be used as a stacked ensemble of the k fold estimators.
Arguments
est
- Base estimator.
fold
- Cross-validation splitter object, e.g. KFold or StratifiedKFold.
metric
- Evaluation metric.
refit_full
- Flag indicating post-fit behaviour. True refits a single model on the full data; False keeps the models trained on the individual folds as a stacked ensemble.
verbose
- Flag for printing fold scores during fit.
Example
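from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xklearn.models import FoldEstimator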
...
base = RandomForestRegressor(n_estimators=10)
fold = KFold(n_splits=5)
est = FoldEstimator(base, fold=fold, metric=mean_squared_error, verbose=1)
est.fit(X_train, y_train)
est.predict(X_test)
Output:
Finished fold 1 with score: 200.8023
Finished fold 2 with score: 261.2365
Finished fold 3 with score: 169.2404
Finished fold 4 with score: 186.7915
Finished fold 5 with score: 205.0894
Finished with a total score of: 204.6813
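The fold-then-ensemble idea can be sketched without any libraries. The code below is not xklearn's implementation: MeanRegressor is a hypothetical toy estimator, the folds are contiguous and unshuffled, and averaging the fold models' predictions is an assumed way of combining them when no full refit is done.

```python
class MeanRegressor:
    """Toy estimator (hypothetical): always predicts the training-target mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop

def fit_folds(make_estimator, X, y, n_splits, metric):
    """Fit one model per fold and score it on that fold's held-out part."""
    models, scores = [], []
    for train_idx, valid_idx in kfold_indices(len(X), n_splits):
        model = make_estimator().fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in valid_idx])
        scores.append(metric([y[i] for i in valid_idx], preds))
        models.append(model)
    return models, scores

def predict_ensemble(models, X):
    """Average the predictions of the fold models (assumed combination rule)."""
    per_model = [m.predict(X) for m in models]
    return [sum(col) / len(col) for col in zip(*per_model)]

X = [[i] for i in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
models, scores = fit_folds(MeanRegressor, X, y, n_splits=3,
                           metric=mean_squared_error)
prediction = predict_ensemble(models, [[10]])
print(scores)      # [9.25, 0.25, 9.25]
print(prediction)  # [2.5]
```

Each score is computed only on data the corresponding model did not see, which is what makes the per-fold numbers usable as a validation estimate.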
FoldLightGBM
An estimator wrapper that automatically runs cross-validation on a LightGBM model with the selected folding method when fit is called. After fitting, it can optionally be used as a stacked ensemble of the k fold estimators.
Arguments
lgbm
- Base estimator.
fold
- Cross-validation splitter object, e.g. KFold or StratifiedKFold.
metric
- Evaluation metric.
fit_params
- Dictionary of parameters passed to the fit method.
refit_full
- Flag indicating post-fit behaviour. True refits a single model on the full data; False keeps the models trained on the individual folds as a stacked ensemble.
refit_params
- Dictionary of parameters passed to the refit when refit_full=True.
verbose
- Flag for printing fold scores during fit.
Example
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from xklearn.models import FoldLightGBM
...
base = LGBMClassifier(n_estimators=1000)
fold = KFold(n_splits=5)
fit_params = {'eval_metric': 'auc',
              'early_stopping_rounds': 50,
              'verbose': 0}
fold_lgbm = FoldLightGBM(base,
                         fold=fold,
                         metric=roc_auc_score,
                         fit_params=fit_params,
                         verbose=1)
fold_lgbm.fit(X_train, y_train)
fold_lgbm.predict(X_test)
Output:
Finished fold 1 with score: 0.9114
Finished fold 2 with score: 0.9265
Finished fold 3 with score: 0.9419
Finished fold 4 with score: 0.9189
Finished fold 5 with score: 0.9152
Finished with a total score of: 0.9225
FoldXGBoost
An estimator wrapper that automatically runs cross-validation on an XGBoost model with the selected folding method when fit is called. After fitting, it can optionally be used as a stacked ensemble of the k fold estimators.
Arguments
xgb
- Base estimator.
fold
- Cross-validation splitter object, e.g. KFold or StratifiedKFold.
metric
- Evaluation metric.
fit_params
- Dictionary of parameters passed to the fit method.
refit_full
- Flag indicating post-fit behaviour. True refits a single model on the full data; False keeps the models trained on the individual folds as a stacked ensemble.
refit_params
- Dictionary of parameters passed to the refit when refit_full=True.
verbose
- Flag for printing fold scores during fit.
Example
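from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from xklearn.models import FoldXGBoost
...
# 'reg:squarederror' replaces 'reg:linear', which newer XGBoost releases deprecate.
base = XGBRegressor(objective="reg:squarederror", random_state=42)
fold = KFold(n_splits=5)
# XGBoost has no built-in 'mse' eval metric; 'rmse' is the closest one.
fit_params = {'eval_metric': 'rmse',
              'early_stopping_rounds': 5,
              'verbose': 0}
fold_xgb = FoldXGBoost(base,
                       fold=fold,
                       metric=mean_squared_error,
                       fit_params=fit_params,
                       verbose=1)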
fold_xgb.fit(X_train, y_train)
fold_xgb.predict(X_test)
Output:
Finished fold 1 with score: 3212.8362
Finished fold 2 with score: 2179.7843
Finished fold 3 with score: 2707.8460
Finished fold 4 with score: 2988.6643
Finished fold 5 with score: 3281.4299
Finished with a total score of: 3274.9001
StackClassifier
Ensemble classifier that stacks an ensemble of classifiers by using their outputs as input features.
Arguments
clfs
- List of classifiers that form the ensemble.
meta_clf
- Meta classifier that stacks the predictions of the ensemble.
keep_features
- Flag to train the meta classifier on the original features too.
refit
- Flag to retrain the ensemble of classifiers during fit.
Example
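from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xklearn.models import StackClassifier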
...
meta_clf = RidgeClassifier()
ensemble = [RandomForestClassifier(), KNeighborsClassifier(), SVC()]
stack_clf = StackClassifier(clfs=ensemble, meta_clf=meta_clf, refit=True)
stack_clf.fit(X_train, y_train)
y_ = stack_clf.predict(X_test)
StackRegressor
Ensemble regressor that stacks an ensemble of regressors by using their outputs as input features.
Arguments
regs
- List of regressors that form the ensemble.
meta_reg
- Meta regressor that stacks the predictions of the ensemble.
drop_first
- Drop first class probability to avoid multi-collinearity.
keep_features
- Flag to train the meta regressor on the original features too.
refit
- Flag to retrain the ensemble of regressors during fit.
Example
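from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xklearn.models import StackRegressor
...
meta_reg = Ridge()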
ensemble = [RandomForestRegressor(), KNeighborsRegressor(), SVR()]
stack_reg = StackRegressor(regs=ensemble, meta_reg=meta_reg, refit=True)
stack_reg.fit(X_train, y_train)
y_ = stack_reg.predict(X_test)
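The data flow behind stacking can be sketched as follows: each base model's predictions become one input column for the meta model, and the keep_features idea appends the original features to those columns. The helper build_meta_features is hypothetical and only shows how the meta model's input could be assembled, not how xklearn does it internally.

```python
def build_meta_features(base_predictions, X=None):
    """Build one row per sample: [pred_model_1, ..., pred_model_k].

    base_predictions: list with one prediction list per base model.
    X: optional original feature rows to append (the keep_features idea).
    """
    rows = [list(preds) for preds in zip(*base_predictions)]
    if X is not None:
        rows = [meta + list(orig) for meta, orig in zip(rows, X)]
    return rows

# Predictions of two hypothetical base regressors for three samples.
base_predictions = [[1.0, 2.0, 3.0],
                    [1.5, 2.5, 2.0]]
X = [[10, 0], [20, 1], [30, 0]]

meta_only = build_meta_features(base_predictions)
meta_plus_original = build_meta_features(base_predictions, X)
print(meta_only)           # [[1.0, 1.5], [2.0, 2.5], [3.0, 2.0]]
print(meta_plus_original)  # [[1.0, 1.5, 10, 0], [2.0, 2.5, 20, 1], [3.0, 2.0, 30, 0]]
```

The meta model is then fit on these rows against the original targets, so it learns how much to trust each base model.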
compress_dataframe
Reduces the memory usage of a Pandas dataframe by finding columns that use larger data types than necessary.
Arguments
df
- Dataframe for memory reduction.
verbose
- Flag for printing result of memory reduction.
Example
from xklearn.utils import compress_dataframe
...
train = compress_dataframe(train, verbose=1)
Output:
Dataframe memory decreased to 169.60 MB (64.6% reduction)
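The underlying idea, choosing the narrowest type that still holds every value in a column, can be sketched for integer columns without any libraries. The names below are hypothetical and the type names merely follow NumPy's conventions; how compress_dataframe actually selects types is not described here.

```python
# (type name, minimum, maximum, bytes per value), ordered narrowest first.
INT_TYPES = [
    ('int8', -2**7, 2**7 - 1, 1),
    ('int16', -2**15, 2**15 - 1, 2),
    ('int32', -2**31, 2**31 - 1, 4),
    ('int64', -2**63, 2**63 - 1, 8),
]

def smallest_int_type(values):
    """Return (name, bytes per value) of the narrowest type fitting all values."""
    low, high = min(values), max(values)
    for name, type_min, type_max, size in INT_TYPES:
        if type_min <= low and high <= type_max:
            return name, size
    raise OverflowError("values do not fit in int64")

def memory_reduction(values, original_size=8):
    """Fraction of memory saved when moving from `original_size` bytes per value."""
    _, size = smallest_int_type(values)
    return 1 - size / original_size

ages = [0, 17, 42, 99]
print(smallest_int_type(ages))        # ('int8', 1)
print(smallest_int_type([0, 70000]))  # ('int32', 4)
print(memory_reduction(ages))         # 0.875
```

A column of small integers stored as int64 would shrink by 87.5% in this toy calculation; float and object columns need separate handling that this sketch does not cover.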