SciKIt-learn Pre-processing Pipeline in PAndas

Skippa

Read more in the introductory blog post on Towards Data Science.

Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.

So it is basically the same idea as sklearn-pandas, but a different (and hopefully better) way to achieve it.
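
To make the motivation concrete, here is a minimal sketch using plain scikit-learn (no Skippa involved; the data and column names are made up for illustration). With default settings a scikit-learn transformer returns a NumPy array, so column labels have to be re-attached by hand before a later step can refer to columns by name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame({'y': [1.0, 16.0, 1000.0], 'z': [0.4, 2.0, 8.7]})

# With default settings the result is a plain NumPy array: the column names are gone.
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# Re-attaching the labels manually is the bookkeeping a dataframe-preserving pipeline avoids.
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(scaled_df.columns.tolist())
```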

Installation

pip install skippa

Optionally, if you want to use the Gradio app functionality:

pip install skippa[gradio]

Basic usage

Import the Skippa class and the columns helper function:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns

Get some data

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

Define your pipeline:

pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)
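
The dtype-based selectors used above (dtype_include='number', dtype_include='category') are similar in spirit to pandas' own select_dtypes. The sketch below uses only pandas to show which columns such a selection would pick up once the string columns have been cast to category. It illustrates the idea and is not a description of Skippa's internal implementation.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})

# Cast the string columns to pandas' categorical dtype, mirroring the .cast(...) step.
casted = df.astype({'x': 'category', 'x2': 'category'})

# Select columns by dtype rather than by name.
numeric_cols = casted.select_dtypes(include='number').columns.tolist()
category_cols = casted.select_dtypes(include='category').columns.tolist()

print(numeric_cols)
print(category_cols)
```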

Then use it for fitting and predicting like this:

pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)
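
For comparison, a roughly equivalent pipeline in plain scikit-learn might look like the sketch below. It is an approximation written for illustration, not what Skippa does internally: the columns are grouped up front with a ColumnTransformer, and the intermediate results are arrays, not dataframes. The names numeric, categorical, preprocess and sk_pipe are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
y = np.array([0, 0, 1])

# Numeric columns: median imputation followed by standard scaling.
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: most-frequent imputation followed by one-hot encoding.
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns not listed here ('q' and 'date') are dropped by default.
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([('preprocess', preprocess), ('model', LogisticRegression())])
sk_pipe.fit(df, y)

proba = sk_pipe.predict_proba(df)
print(proba.shape)
```

Note that the column lists have to be fixed before any step runs, which is the kind of up-front bookkeeping the chained style above is meant to avoid.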

If you want details on your model, use:

model = pipe.get_model()
print(model.coef_)
print(model.intercept_)

(de)serialization

You can also save and load your model pipelines (for deployment). Note that dill is used for serialization and deserialization, because joblib and pickle don't provide enough support.

pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
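
This README does not say which limitation of pickle motivated the choice of dill. One well-known difference, shown here purely as an illustration and not as the confirmed reason, is that the standard pickle module cannot serialize lambda functions, whereas dill can. The lambda named double is hypothetical, and the dill import is guarded so the snippet still runs where dill is not installed.

```python
import pickle

try:
    import dill  # third-party package; Skippa uses it for saving pipelines
except ImportError:
    dill = None

# Hypothetical ad-hoc function of the kind often passed to apply-style steps.
double = lambda v: v * 2

# pickle stores functions by reference (module + name), which fails for a lambda.
try:
    pickle.dumps(double)
    pickle_ok = True
except (pickle.PicklingError, AttributeError):
    pickle_ok = False
print('pickle can serialize the lambda:', pickle_ok)

# dill serializes the function itself; recurse=True stores only the globals
# the function actually refers to instead of the whole global dictionary.
if dill is not None:
    restored = dill.loads(dill.dumps(double, recurse=True))
    print('dill round trip:', restored(21))
```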

See the ./examples directory for more examples.

To Do

  • Support pandas assign for creating new columns based on existing columns
  • Support cast / astype transformer
  • Support for .apply transformer: wrapper around pandas.DataFrame.apply
  • Check how GridSearch (or other param search) works with Skippa
  • Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
  • Support PCA transformer
  • Facilitate random seed in Skippa object that is dispatched to all downstream operations
  • fit-transform evaluates lazily, so casting to category and then selecting category columns doesn't work; each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
  • Investigate whether Skippa can directly extend sklearn's Pipeline, using the getitem trick
  • Use sklearn's new dataframe output setting
  • Validation of pipeline steps
  • Input validation in transformers
  • Transformer for replacing values (pandas .replace)
  • Support arbitrary transformer (if column-preserving)
  • Eliminate the need to call columns explicitly

Credits

History

0.1.16 (2023-08-17)

  • Bugfix: missing _replace_none attribute for SimpleImputer with strategy='constant'

0.1.15 (2022-11-18)

  • Fix: when saving a pipeline, include dependencies in dill serialization.

0.1.14 (2022-05-13)

  • Bugfix in .assign: shouldn't have columns
  • Bugfix in imputer: explicit missing_values arg leads to issues
  • Used space-titanic data in examples
  • Logo added :)

0.1.13 (2022-04-08)

  • Bugfix in imputer: using strategy='constant' threw a TypeError when used on string columns

0.1.12 (2022-02-07)

  • Gradio & dependencies are not installed by default, but are declared an optional extra in setup

0.1.11 (2022-01-13)

  • Example added for hyperparameter tuning with Hyperopt

0.1.10 (2021-12-28)

  • Added support for PCA (including example)
  • Gradio app support extended to regression
  • Minor cleanup and improvements

0.1.9 (2021-12-24)

  • Added support for automatic creation of Gradio app for model inspection
  • Added example with Gradio app

0.1.8 (2021-12-23)

  • Removed print statement in SkippaSimpleImputer
  • Added unit tests

0.1.7 (2021-12-20)

  • Fixed issue where GridSearchCV (or hyperparameter search in general) did not work on a Skippa pipeline
  • Example added using GridSearch

0.1.6 (2021-12-17)

  • Docs, setup, readme updates
  • Updated .apply() method so that it accepts a columns specifier

0.1.5 (2021-12-13)

  • Fixes for readthedocs

0.1.4 (2021-12-13)

  • Cleanup/fix in examples/full-pipeline.py

0.1.3 (2021-12-10)

  • Added .apply() transformer for pandas.DataFrame.apply() functionality
  • Documentation and examples update

0.1.2 (2021-11-28)

  • Added .assign() transformer for pandas.DataFrame.assign() functionality
  • Added .cast() transformer (with aliases .astype() & .as_type()) for pandas.DataFrame.astype functionality

0.1.1 (2021-11-22)

  • Fixes and documentation.

0.1.0 (2021-11-19)

  • First release on PyPI.
