Handy machine learning tools in the spirit of scikit-learn.

extrakit-learn

Machine learning components built to extend scikit-learn. All components follow scikit-learn's object API, so they work interchangeably with scikit-learn components. The package is mostly a collection of tools that have been useful in Kaggle competitions. extrakit-learn is not affiliated with scikit-learn in any way; it is only inspired by it.

Installation

pip install xklearn

Components

Hierarchy

xklearn
│
├── preprocessing
│   ├── CategoryEncoder
│   ├── CountEncoder
│   ├── TargetEncoder
│   └── MultiColumnEncoder
│
├── models
│   ├── FoldEstimator
│   ├── FoldLightGBM
│   ├── FoldXGBoost
│   ├── StackClassifier
│   └── StackRegressor
│
└── utils
    └── compress_dataframe

Example

from xklearn.models import FoldEstimator

CategoryEncoder

Wraps scikit-learn's LabelEncoder and adds handling for missing and unseen values.

Arguments

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'encode' - Replace value with -1.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example

from xklearn.preprocessing import CategoryEncoder
...

ce = CategoryEncoder(unseen='nan', missing='nan')
X[:, 0] = ce.fit_transform(X[:, 0])
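
The replacement strategies can be illustrated with a minimal pure-Python sketch. This is not xklearn's implementation: the function names, the use of None to mark missing values, and the default strategies are assumptions made for illustration only.

```python
import math


def fit_classes(values):
    """Map each non-missing training value to an integer code, in sorted order."""
    classes = sorted({v for v in values if v is not None})
    return {c: i for i, c in enumerate(classes)}


def _replacement(strategy, kind):
    """Return the replacement for a missing or unseen value, or raise."""
    if strategy == "encode":
        return -1
    if strategy == "nan":
        return math.nan
    if strategy == "error":
        raise ValueError(f"{kind} value encountered")
    raise ValueError(f"unknown strategy: {strategy!r}")


def transform(values, mapping, unseen="encode", missing="encode"):
    """Encode values; None is treated as missing (an assumption of this sketch)."""
    out = []
    for v in values:
        if v is None:
            out.append(_replacement(missing, "missing"))
        elif v not in mapping:
            out.append(_replacement(unseen, "unseen"))
        else:
            out.append(mapping[v])
    return out


mapping = fit_classes(["red", "green", None, "red"])  # {'green': 0, 'red': 1}
codes = transform(["green", "blue", None], mapping, unseen="encode", missing="nan")
# codes is [0, -1, nan]: "blue" was unseen during fit, None is missing
```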

CountEncoder

Replaces categorical values with their value counts observed during training. Classes with a count of one, and classes not seen before prediction, are both encoded as either 1 or NaN.

Arguments

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'one' - Replace value with 1.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example

from xklearn.preprocessing import CountEncoder
...

ce = CountEncoder(unseen='one')
X[:, 0] = ce.fit_transform(X[:, 0])
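
As a rough illustration of count encoding, the following pure-Python sketch counts values at fit time and looks them up at transform time. It is not xklearn's code; the function names are hypothetical and only the unseen strategy is modelled.

```python
import math
from collections import Counter


def fit_counts(values):
    """Count how often each category occurs in the training data."""
    return Counter(values)


def transform_counts(values, counts, unseen="one"):
    """Replace each value with its training count, applying the unseen strategy."""
    out = []
    for v in values:
        if v in counts:
            out.append(counts[v])
        elif unseen == "one":
            out.append(1)
        elif unseen == "nan":
            out.append(math.nan)
        elif unseen == "error":
            raise ValueError(f"unseen value: {v!r}")
        else:
            raise ValueError(f"unknown strategy: {unseen!r}")
    return out


counts = fit_counts(["a", "b", "a", "c", "a"])
encoded = transform_counts(["a", "c", "z"], counts, unseen="one")
# encoded is [3, 1, 1]: "c" (seen once) and "z" (never seen) both map to 1
```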

TargetEncoder

Performs target mean encoding of categorical features with optional smoothing.

Arguments

smoothing - Smoothing weight.

unseen - Strategy for handling unseen values. See replacement strategies below for options.

missing - Strategy for handling missing values. See replacement strategies below for options.

Replacement strategies

'global' - Replace value with global target mean.

'nan' - Replace value with np.nan.

'error' - Raise ValueError.

Example

from xklearn.preprocessing import TargetEncoder
...

te = TargetEncoder(smoothing=10)
X[:, 0] = te.fit_transform(X[:, 0], y)
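
The source does not state the smoothing formula. One common choice, assumed here, is additive smoothing that pulls each category mean toward the global mean: (n * category_mean + smoothing * global_mean) / (n + smoothing), where n is the category count. The sketch below uses that assumed formula with hypothetical function names and may differ from what xklearn does.

```python
from collections import defaultdict


def fit_target_means(categories, targets, smoothing=0.0):
    """Compute a smoothed target mean per category (assumed additive smoothing)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # sums[c] equals n * category_mean, so this matches the formula above.
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean


def transform_target(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean ('global')."""
    return [encoding.get(c, global_mean) for c in categories]


cats = ["a", "a", "b", "b", "b", "c"]
y = [1, 0, 1, 1, 1, 0]
encoding, global_mean = fit_target_means(cats, y, smoothing=2)
encoded = transform_target(["a", "z"], encoding, global_mean)
# "a" has raw mean 0.5 and is pulled toward the global mean 2/3, giving 7/12
```

A larger smoothing weight moves rare categories closer to the global mean, which reduces the risk of overfitting to categories with few samples.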

MultiColumnEncoder

Applies a column encoder over multiple columns.

Arguments

enc - Base encoder that will be applied to the selected columns.

columns - Column selection, either bool-mask, indices or None (default=None).

Example

from xklearn.preprocessing import CountEncoder
from xklearn.preprocessing import MultiColumnEncoder
...

columns = [1, 3, 4]
enc = CountEncoder()

mce = MultiColumnEncoder(enc, columns)
X = mce.fit_transform(X)
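
The idea of applying one encoder across several columns can be sketched in plain Python. Everything here is hypothetical: the table is a list of rows, the encoder is a plain function, and treating columns=None as "all columns" is an assumption, since the source does not say what None selects.

```python
from collections import Counter


def count_encode(column):
    """Toy column encoder: replace each value with its count in the column."""
    counts = Counter(column)
    return [counts[v] for v in column]


def encode_columns(rows, encode, columns=None):
    """Apply `encode` to each selected column of a list-of-rows table."""
    n_cols = len(rows[0])
    if columns is None:
        selected = range(n_cols)  # assumption: None selects every column
    elif all(isinstance(c, bool) for c in columns):
        selected = [i for i, keep in enumerate(columns) if keep]  # bool mask
    else:
        selected = columns  # explicit indices
    out = [list(r) for r in rows]
    for j in selected:
        encoded = encode([r[j] for r in rows])
        for i, value in enumerate(encoded):
            out[i][j] = value
    return out


rows = [["x", 1.5, "u"], ["y", 2.0, "u"], ["x", 0.5, "v"]]
by_index = encode_columns(rows, count_encode, columns=[0, 2])
by_mask = encode_columns(rows, count_encode, columns=[True, False, True])
# Both give [[2, 1.5, 2], [1, 2.0, 2], [2, 0.5, 1]]; column 1 is untouched
```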

FoldEstimator

Wraps K-fold cross-validation into an estimator: calling fit automatically cross-validates the base estimator using the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

Arguments

est - Base estimator.

fold - Cross-validation folding object, e.g. KFold or StratifiedKFold.

metric - Evaluation metric.

refit_full - Flag indicating post-fit behaviour. True refits the base estimator on the full data; False keeps the estimators trained on the different folds as a stacked ensemble.

verbose - Flag for printing fold scores during fit.

Example

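from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xklearn.models import FoldEstimator
...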

base = RandomForestRegressor(n_estimators=10)
fold = KFold(n_splits=5)

est = FoldEstimator(base, fold=fold, metric=mean_squared_error, verbose=1)

est.fit(X_train, y_train)
est.predict(X_test)

Output:

Finished fold 1 with score: 200.8023
Finished fold 2 with score: 261.2365
Finished fold 3 with score: 169.2404
Finished fold 4 with score: 186.7915
Finished fold 5 with score: 205.0894
Finished with a total score of: 204.6813
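
The fold-then-ensemble behaviour can be sketched without any dependencies. This is a simplified illustration, not xklearn's implementation: the toy MeanRegressor, the helper names, the unshuffled contiguous folds, and averaging the fold estimators' predictions are all assumptions.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs over contiguous, unshuffled folds."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


class MeanRegressor:
    """Toy base estimator: always predicts the training-target mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def fit_folds(make_est, X, y, n_splits, metric, refit_full=False):
    """Score one estimator per fold; keep them, or refit once on all data."""
    scores, fold_ests = [], []
    for train, test in kfold_indices(len(X), n_splits):
        est = make_est().fit([X[i] for i in train], [y[i] for i in train])
        preds = est.predict([X[i] for i in test])
        scores.append(metric([y[i] for i in test], preds))
        fold_ests.append(est)
    if refit_full:
        return scores, [make_est().fit(X, y)]
    return scores, fold_ests


def predict_ensemble(ests, X):
    """Average the predictions of all estimators (assumed combination rule)."""
    preds = [est.predict(X) for est in ests]
    return [sum(col) / len(col) for col in zip(*preds)]


X = [[i] for i in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
scores, ests = fit_folds(MeanRegressor, X, y, n_splits=3, metric=mse)
# scores is [9.25, 0.25, 9.25]; ests holds one fitted estimator per fold
```

With refit_full=True the same function returns a single estimator trained on all samples, so predict_ensemble works unchanged in both modes.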

FoldLightGBM

Wraps K-fold cross-validation of a LightGBM model into an estimator: calling fit automatically cross-validates the model using the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

Arguments

lgbm - Base estimator.

fold - Cross-validation folding object, e.g. KFold or StratifiedKFold.

metric - Evaluation metric.

fit_params - Dictionary of parameters that should be fed to the fit method.

refit_full - Flag indicating post-fit behaviour. True refits the base estimator on the full data; False keeps the estimators trained on the different folds as a stacked ensemble.

refit_params - Dictionary of parameters that should be fed to the full refit, i.e. when refit_full=True.

verbose - Flag for printing fold scores during fit.

Example

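from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from xklearn.models import FoldLightGBM
...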

base = LGBMClassifier(n_estimators=1000)
fold = KFold(n_splits=5)
fit_params = {'eval_metric': 'auc',
              'early_stopping_rounds': 50,
              'verbose': 0}
              
fold_lgbm = FoldLightGBM(base, 
                         fold=fold, 
                         metric=roc_auc_score,
                         fit_params=fit_params,
                         verbose=1)
               
fold_lgbm.fit(X_train, y_train)
fold_lgbm.predict(X_test)

Output:

Finished fold 1 with score: 0.9114
Finished fold 2 with score: 0.9265
Finished fold 3 with score: 0.9419
Finished fold 4 with score: 0.9189
Finished fold 5 with score: 0.9152
Finished with a total score of: 0.9225

FoldXGBoost

Wraps K-fold cross-validation of an XGBoost model into an estimator: calling fit automatically cross-validates the model using the selected folding method. After fit, it can optionally be used as a stacked ensemble of the k fold estimators.

Arguments

xgb - Base estimator.

fold - Cross-validation folding object, e.g. KFold or StratifiedKFold.

metric - Evaluation metric.

fit_params - Dictionary of parameters that should be fed to the fit method.

refit_full - Flag indicating post-fit behaviour. True refits the base estimator on the full data; False keeps the estimators trained on the different folds as a stacked ensemble.

refit_params - Dictionary of parameters that should be fed to the full refit, i.e. when refit_full=True.

verbose - Flag for printing fold scores during fit.

Example

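from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from xklearn.models import FoldXGBoost
...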

# Squared-error regression objective
base = XGBRegressor(objective="reg:squarederror", random_state=42)
fold = KFold(n_splits=5)
# Monitor root mean squared error during training
fit_params = {'eval_metric': 'rmse',
              'early_stopping_rounds': 5,
              'verbose': 0}
              
fold_xgb = FoldXGBoost(base, 
                       fold=fold, 
                       metric=mean_squared_error,
                       fit_params=fit_params,
                       verbose=1)
               
fold_xgb.fit(X_train, y_train)
fold_xgb.predict(X_test)

Output:

Finished fold 1 with score: 3212.8362
Finished fold 2 with score: 2179.7843
Finished fold 3 with score: 2707.8460
Finished fold 4 with score: 2988.6643
Finished fold 5 with score: 3281.4299
Finished with a total score of: 3274.9001

StackClassifier

Ensemble classifier that stacks a set of classifiers by feeding their outputs to a meta classifier as input features.

Arguments

clfs - List of classifiers that make up the ensemble.

meta_clf - Meta classifier that stacks the predictions of the ensemble.

keep_features - Flag to train the meta classifier on the original features too.

refit - Flag to retrain the ensemble of classifiers during fit.

Example

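from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xklearn.models import StackClassifier
...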

meta_clf = RidgeClassifier()
ensemble = [RandomForestClassifier(), KNeighborsClassifier(), SVC()]

stack_clf = StackClassifier(clfs=ensemble, meta_clf=meta_clf, refit=True)

stack_clf.fit(X_train, y_train)
y_ = stack_clf.predict(X_test)

StackRegressor

Ensemble regressor that stacks a set of regressors by feeding their outputs to a meta regressor as input features.

Arguments

regs - List of regressors that make up the ensemble.

meta_reg - Meta regressor that stacks the predictions of the ensemble.

drop_first - Drop first class probability to avoid multi-collinearity.

keep_features - Flag to train the meta regressor on the original features too.

refit - Flag to retrain the ensemble of regressors during fit.

Example

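from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xklearn.models import StackRegressor
...

# Ridge regression as the meta regressor
meta_reg = Ridge()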
ensemble = [RandomForestRegressor(), KNeighborsRegressor(), SVR()]

stack_reg = StackRegressor(regs=ensemble, meta_reg=meta_reg, refit=True)

stack_reg.fit(X_train, y_train)
y_ = stack_reg.predict(X_test)
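
Both stacking components rely on the same core step: turning base-model outputs into input features for the meta model. A minimal sketch of that step follows. It is not xklearn's code; ScaleModel and build_meta_features are hypothetical names, and placing the original features before the predictions when keep_features is set is an assumption.

```python
class ScaleModel:
    """Toy, already-fitted base model: predicts factor * first feature."""

    def __init__(self, factor):
        self.factor = factor

    def predict(self, X):
        return [self.factor * row[0] for row in X]


def build_meta_features(models, X, keep_features=False):
    """Build the meta model's input: one column per base-model prediction."""
    preds = [m.predict(X) for m in models]
    meta = [list(col) for col in zip(*preds)]  # transpose to one row per sample
    if keep_features:
        # Prepend the original features to the stacked predictions.
        meta = [list(row) + m for row, m in zip(X, meta)]
    return meta


X = [[1.0, 5.0], [2.0, 6.0]]
models = [ScaleModel(2), ScaleModel(10)]
meta = build_meta_features(models, X)
meta_full = build_meta_features(models, X, keep_features=True)
# meta is [[2.0, 10.0], [4.0, 20.0]]
# meta_full is [[1.0, 5.0, 2.0, 10.0], [2.0, 6.0, 4.0, 20.0]]
```

The meta model is then fit on these rows in place of, or alongside, the original features.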

compress_dataframe

Reduces the memory usage of a pandas dataframe by finding columns that use larger variable types than necessary.

Arguments

df - Dataframe for memory reduction.

verbose - Flag for printing result of memory reduction.

Example

from xklearn.utils import compress_dataframe
...

train = compress_dataframe(train, verbose=1)

Output:

Dataframe memory decreased to 169.60 MB (64.6% reduction)
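
The underlying idea of downcasting can be shown for integer columns with a small dependency-free sketch. This is only an illustration of how a smaller type might be chosen, not what compress_dataframe actually does; the helper name and the restriction to signed integers are assumptions.

```python
# Signed integer types from narrowest to widest, with their value ranges.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer type that fits all values."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in int64")


small = smallest_int_dtype([0, 100, -5])  # 'int8'
medium = smallest_int_dtype([0, 40000])   # 'int32', since 40000 > 32767
```

With pandas, a column could then be converted with astype using the returned type name.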
