Handy machine learning tools in the spirit of scikit-learn.

extrakit-learn

Machine learning components built to extend scikit-learn. All components use scikit's object API, so they work interchangeably with scikit components. It is mostly a collection of tools that have been useful for Kaggle competitions. extrakit-learn is not affiliated with scikit-learn in any way; it is only inspired by it.

Installation

pip install xklearn

Components

  • CategoryEncoder - Like scikit's LabelEncoder but supports missing and unseen values.
  • CountEncoder - Categorical feature engineering based on value counts.
  • TargetEncoder - Categorical feature engineering based on target means.
  • MultiColumnEncoder - Apply a column encoder to multiple columns.
  • FoldEstimator - K-fold on scikit estimator wrapped into an estimator.
  • FoldLGBM - K-fold on LGBM wrapped into an estimator.
  • StackingClassifier - Stack an ensemble of classifiers with a meta classifier.
  • StackingRegressor - Stack an ensemble of regressors with a meta regressor.

Hierarchy

xklearn
│
├── preprocessing
│   ├── CategoryEncoder
│   ├── CountEncoder
│   ├── TargetEncoder
│   └── MultiColumnEncoder
│
└── models
    ├── FoldEstimator
    ├── FoldLGBM
    ├── StackingClassifier
    └── StackingRegressor

Example
from xklearn.models import FoldEstimator

CategoryEncoder

Wraps scikit's LabelEncoder, allowing missing and unseen values to be handled.

Arguments

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'encode' - Replace value with -1.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example:

from xklearn.preprocessing import CategoryEncoder
...

ce = CategoryEncoder(unseen='nan', missing='nan')
X[:, 0] = ce.fit_transform(X[:, 0])
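
To make the replacement strategies concrete, here is a minimal sketch of how unseen values could be handled with plain NumPy. It is not xklearn's implementation: the helper name encode_with_unseen is hypothetical, and missing values are left out for brevity.

```python
import numpy as np


def encode_with_unseen(train, test, unseen="encode"):
    """Label-encode `test` using the classes found in `train`.

    Hypothetical helper for illustration only, not part of xklearn.
    """
    classes = np.unique(train)                 # sorted unique training classes
    idx = np.searchsorted(classes, test)       # candidate position of each value
    idx = np.clip(idx, 0, len(classes) - 1)    # keep positions within bounds
    seen = classes[idx] == test                # True where the value was seen in training

    if unseen == "error" and not seen.all():
        raise ValueError("unseen values: %r" % list(np.unique(test[~seen])))

    # 'encode' maps unseen values to -1, 'nan' maps them to np.nan.
    fill = -1.0 if unseen == "encode" else np.nan
    return np.where(seen, idx.astype(float), fill)


train = np.array(["b", "a", "c", "a"])
test = np.array(["a", "c", "z"])               # "z" never appears in train

encoded = encode_with_unseen(train, test, unseen="encode")
with_nan = encode_with_unseen(train, test, unseen="nan")
```

The result is returned as floats so that the same array can hold np.nan when the 'nan' strategy is selected.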

CountEncoder

Replaces each categorical value with the number of times it occurred in the training data. Classes with a count of one, and classes first encountered during prediction, are encoded as either one or NaN depending on the chosen strategy.

Arguments

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'one' - Replace value with 1.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example:

from xklearn.preprocessing import CountEncoder
...

ce = CountEncoder(unseen='one')
X[:, 0] = ce.fit_transform(X[:, 0])
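
The idea behind count encoding can be sketched in a few lines of NumPy. This is an illustration under assumptions, not xklearn's code: count_encode is a hypothetical helper, and only the 'one' and 'nan' strategies are covered.

```python
import numpy as np


def count_encode(train, test, unseen="one"):
    """Replace each value in `test` by its count in `train`.

    Hypothetical helper for illustration only, not part of xklearn.
    """
    classes, counts = np.unique(train, return_counts=True)
    lookup = dict(zip(classes.tolist(), counts.tolist()))
    fill = 1.0 if unseen == "one" else np.nan      # value used for unseen classes
    return np.array([lookup.get(v, fill) for v in test.tolist()], dtype=float)


train = np.array(["a", "b", "a", "c", "a", "b"])
test = np.array(["a", "b", "c", "z"])              # "z" never appears in train

counts = count_encode(train, test)
counts_nan = count_encode(train, test, unseen="nan")
```

With the 'one' strategy, the class "c" (seen once) and the unseen class "z" receive the same value, which matches the description above.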

TargetEncoder

Performs target mean encoding of categorical features with optional smoothing.

Arguments

smoothing - Smoothing weight.

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'one' - Replace value with 1.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example:

from xklearn.preprocessing import TargetEncoder
...

te = TargetEncoder(smoothing=10)
X[:, 0] = te.fit_transform(X[:, 0], y)
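
The README does not state which smoothing formula is used. As a rough sketch, one common scheme pulls each category mean toward the global target mean with weight smoothing: (n * category_mean + smoothing * global_mean) / (n + smoothing). The helper smoothed_target_means below is hypothetical and assumes that scheme.

```python
import numpy as np


def smoothed_target_means(x, y, smoothing=10.0):
    """Return {category: smoothed target mean}.

    Assumes additive smoothing toward the global mean; this is an
    illustrative choice, not necessarily what xklearn implements.
    """
    global_mean = y.mean()
    means = {}
    for cat in np.unique(x):
        mask = x == cat
        n = mask.sum()                                  # rows in this category
        # y[mask].sum() equals n * category_mean
        means[cat.item()] = (y[mask].sum() + smoothing * global_mean) / (n + smoothing)
    return means


x = np.array(["a", "a", "a", "b"])
y = np.array([1.0, 1.0, 0.0, 0.0])

raw = smoothed_target_means(x, y, smoothing=0.0)        # plain category means
smooth = smoothed_target_means(x, y, smoothing=2.0)     # shrunk toward 0.5
```

Under this scheme, rare categories such as "b" move furthest toward the global mean, which is the usual motivation for smoothing.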

MultiColumnEncoder

Applies a column encoder over multiple columns.

Arguments

enc - Base encoder that will be applied to the selected columns.

columns - Column selection: a bool mask, a list of indices, or None (default=None).

Example:

from xklearn.preprocessing import CountEncoder
from xklearn.preprocessing import MultiColumnEncoder
...

columns = [1, 3, 4]
enc = CountEncoder()

mce = MultiColumnEncoder(enc, columns)
X = mce.fit_transform(X)
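
A minimal sketch of the same pattern using only scikit-learn pieces is shown below. The helper fit_transform_columns is hypothetical, and treating None as "all columns" is an assumption; the point is that each column gets its own independently fitted clone of the base encoder.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import LabelEncoder


def fit_transform_columns(enc, X, columns=None):
    """Fit a fresh clone of `enc` on each selected column of `X`.

    Hypothetical helper for illustration only, not part of xklearn.
    Returns the transformed copy of X and the fitted encoder per column.
    """
    X = X.copy()
    all_cols = np.arange(X.shape[1])
    # Indexing with a bool mask or with indices both yield column indices.
    cols = all_cols if columns is None else all_cols[np.asarray(columns)]
    fitted = {}
    for col in cols:
        fitted[int(col)] = clone(enc)                   # one encoder per column
        X[:, col] = fitted[int(col)].fit_transform(X[:, col])
    return X, fitted


X = np.array([["x", "red", "s"],
              ["y", "blue", "m"],
              ["x", "red", "l"]], dtype=object)

X_enc, encoders = fit_transform_columns(LabelEncoder(), X, columns=[0, 1])
```

Keeping the fitted encoders per column makes it possible to apply the same mapping to new data later.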

FoldEstimator

Wraps K-fold cross validation into an estimator: calling fit automatically cross-validates the base estimator over the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

Arguments

est - Base estimator.

fold - Cross validation folding object, e.g. KFold or StratifiedKFold.

metric - Evaluation metric.

ensemble - Flag indicating post-fit behaviour. True makes it a stacked ensemble of the fold estimators; False refits the base estimator on the full data.

verbose - Flag for printing intermediate scores during fit.

Example:

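from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xklearn.models import FoldEstimator
...

base = RandomForestRegressor(n_estimators=10)
fold = KFold(n_splits=5)

est = FoldEstimator(base, fold=fold, metric=mean_squared_error, verbose=1)

est.fit(X_train, y_train)
est.predict(X_test)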
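
For readers who want to see what happens conceptually during fit, here is a rough sketch built from scikit-learn primitives. The helpers fit_folds and predict_ensemble are hypothetical, and averaging the fold models' predictions is an assumption about how the ensemble combines them; the README does not specify this.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def fit_folds(est, fold, metric, X, y):
    """Fit one clone of `est` per fold; return the models and validation scores."""
    models, scores = [], []
    for train_idx, valid_idx in fold.split(X, y):
        model = clone(est).fit(X[train_idx], y[train_idx])
        scores.append(metric(y[valid_idx], model.predict(X[valid_idx])))
        models.append(model)
    return models, scores


def predict_ensemble(models, X):
    """Average the predictions of the per-fold models (assumed combination rule)."""
    return np.mean([m.predict(X) for m in models], axis=0)


X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
models, scores = fit_folds(LinearRegression(), KFold(n_splits=5),
                           mean_squared_error, X, y)
y_pred = predict_ensemble(models, X)
```

With ensemble=False the alternative would be a single clone fitted on all of X and y once the fold scores have been computed.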

FoldLGBM

Wraps K-fold cross validation of an LGBM model into an estimator: calling fit automatically cross-validates the model over the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

Arguments

lgbm - Base estimator.

fold - Cross validation folding object, e.g. KFold or StratifiedKFold.

metric - Evaluation metric.

fit_params - Dictionary of parameters that should be fed to the fit method.

ensemble - Flag indicating post-fit behaviour. True makes it a stacked ensemble of the fold estimators; False refits the base estimator on the full data.

refit_params - Dictionary of parameters that should be fed to the refit if ensemble=False.

verbose - Flag for printing intermediate scores during fit.

Example:

from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from xklearn.models import FoldLGBM
...

base = LGBMClassifier(n_estimators=1000)
fold = KFold(n_splits=5)
fit_params = {'eval_metric': 'auc',
              'early_stopping_rounds': 50,
              'verbose': 0}

fold_lgbm = FoldLGBM(base,
                     fold=fold,
                     metric=roc_auc_score,
                     fit_params=fit_params,
                     verbose=1)

fold_lgbm.fit(X_train, y_train)
fold_lgbm.predict(X_test)

StackingClassifier

Ensemble classifier that stacks an ensemble of classifiers by using their outputs as input features.

Arguments

clfs - List of classifiers that make up the ensemble.

meta_clf - Meta classifier that stacks the predictions of the ensemble.

keep_features - Flag to train the meta classifier on the original features too.

refit - Flag to retrain the ensemble of classifiers.

Example:

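from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xklearn.models import StackingClassifier
...

meta_clf = RidgeClassifier()
ensemble = [RandomForestClassifier(), KNeighborsClassifier(), SVC()]

stack_clf = StackingClassifier(clfs=ensemble, meta_clf=meta_clf, refit=True)

stack_clf.fit(X_train, y_train)
y_ = stack_clf.predict(X_test)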
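
The stacking idea itself can be sketched with scikit-learn alone. The version below trains the meta classifier on out-of-fold predictions from cross_val_predict, which is one common way to do it; whether xklearn does the same is not stated here, so treat this as an assumption rather than a description of the library.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

clfs = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]

# Out-of-fold predictions keep the meta classifier from seeing outputs
# that the base classifiers produced on their own training rows.
meta_train = np.column_stack(
    [cross_val_predict(clone(c), X_train, y_train, cv=5) for c in clfs]
)
meta_clf = LogisticRegression().fit(meta_train, y_train)

# Refit each base classifier on all training data before predicting.
fitted = [clone(c).fit(X_train, y_train) for c in clfs]
meta_test = np.column_stack([c.predict(X_test) for c in fitted])
y_pred = meta_clf.predict(meta_test)
```

Each base classifier contributes one column to the meta classifier's input, so the meta model learns how much to trust each member of the ensemble.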

StackingRegressor

Ensemble regressor that stacks an ensemble of regressors by using their outputs as input features.

Arguments

regs - List of regressors that make up the ensemble.

meta_reg - Meta regressor that stacks the predictions of the ensemble.

keep_features - Flag to train the meta regressor on the original features too.

refit - Flag to retrain the ensemble of regressors.

Example:

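from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xklearn.models import StackingRegressor
...

# scikit-learn's ridge regressor is named Ridge
meta_reg = Ridge()
ensemble = [RandomForestRegressor(), KNeighborsRegressor(), SVR()]

stack_reg = StackingRegressor(regs=ensemble, meta_reg=meta_reg, refit=True)

stack_reg.fit(X_train, y_train)
y_ = stack_reg.predict(X_test)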
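
The effect of keep_features can be illustrated with a short sketch. The helper meta_features is hypothetical and only shows how the meta regressor's input could be assembled; for brevity the base regressors predict on their own training rows, where out-of-fold predictions would usually be preferable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=120, n_features=4, noise=1.0, random_state=0)
X_train, X_test, y_train = X[:90], X[90:], y[:90]

regs = [LinearRegression().fit(X_train, y_train),
        KNeighborsRegressor().fit(X_train, y_train)]


def meta_features(regs, X, keep_features):
    """Stack base predictions column-wise, optionally next to the original features."""
    preds = np.column_stack([r.predict(X) for r in regs])
    return np.hstack([X, preds]) if keep_features else preds


plain = meta_features(regs, X_train, keep_features=False)   # predictions only
wide = meta_features(regs, X_train, keep_features=True)     # features + predictions

meta_reg = Ridge().fit(wide, y_train)
y_pred = meta_reg.predict(meta_features(regs, X_test, keep_features=True))
```

With two base regressors and four original features, the meta regressor sees two columns without keep_features and six columns with it.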

Source Distribution

xklearn-0.0.3.tar.gz (12.5 kB)
