SciKIt-learn Pre-processing Pipeline in PAndas

Skippa

Read more in the introductory blog post on Towards Data Science.

Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.

So basically the same idea as sklearn-pandas, but a different (and hopefully better) way to achieve it.
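To illustrate the problem being addressed, here is a minimal sketch using plain scikit-learn (not Skippa). It assumes scikit-learn's default output configuration, in which a transformer returns a NumPy array, so the column names have to be restored by hand before later steps can refer to them. The small dataframe is made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data only
df = pd.DataFrame({'y': [1.0, 16.0, 1000.0], 'z': [0.4, 4.55, 8.7]})

# With scikit-learn's default settings, the result is a plain NumPy array:
# the column labels 'y' and 'z' are no longer attached to the data.
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# To keep referring to columns by name, the dataframe has to be rebuilt manually.
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```

Doing this rebuild after every step is the kind of bookkeeping a dataframe-preserving pipeline is meant to take off your hands.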

Installation

pip install skippa

Optional, if you want to use the gradio app functionality:

pip install "skippa[gradio]"

Basic usage

Import the Skippa class and the columns helper function:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns

Get some data:

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

Define your pipeline:

pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)

and use it for fitting / predicting like this:

pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)
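For intuition, the pre-processing steps of the pipeline above can be approximated in plain pandas. This is only a rough sketch of what each step means, not Skippa's implementation: the names of the one-hot columns and the exact dtypes Skippa produces are not confirmed here, and the variable names (`out`, `num_cols`, `cat_cols`) are introduced just for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})

# .select(...): keep only the listed columns
out = df[['x', 'x2', 'y', 'z']].copy()

# .cast(..., 'category')
out = out.astype({'x': 'category', 'x2': 'category'})

# Column selection by dtype, as columns(dtype_include=...) does conceptually
num_cols = out.select_dtypes(include='number').columns
cat_cols = out.select_dtypes(include='category').columns

# .impute(..., strategy='median') for numeric columns
for col in num_cols:
    out[col] = out[col].fillna(out[col].median())

# .impute(..., strategy='most_frequent') for categorical columns
for col in cat_cols:
    out[col] = out[col].fillna(out[col].mode().iloc[0])

# .scale(..., type='standard'): zero mean, unit variance
# (ddof=0 matches scikit-learn's StandardScaler, which uses the population std)
for col in num_cols:
    out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)

# .onehot(...): one indicator column per category level
out = pd.get_dummies(out, columns=['x', 'x2'])

print(out.columns.tolist())
```

Note that this sketch computes the medians, means and standard deviations on the same data it transforms. A fitted pipeline instead stores those statistics at fit time and reuses them when predicting on new data, which is the main reason to prefer a pipeline over ad hoc pandas code.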

If you want details on your model, use:

model = pipe.get_model()
print(model.coef_)
print(model.intercept_)

(De)serialization

And of course you can save and load your model pipelines (for deployment). N.B. dill is used for serialization and deserialization, because joblib and pickle don't provide enough support.

pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
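The text above does not spell out which support is missing from pickle. One well-known limitation, shown below with only the standard library, is that pickle cannot serialize lambda functions, whereas dill can. Whether this is the specific reason in Skippa's case is an assumption; it would matter for pipelines that hold user-supplied functions. The helper `try_pickle` is a hypothetical name used only for this sketch.

```python
import math
import pickle


def try_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function that can be looked up by name in an importable module pickles fine,
# because pickle stores only a reference to it.
named_ok = try_pickle(math.sqrt)

# A lambda has no importable name, so pickle cannot store a reference to it.
lambda_ok = try_pickle(lambda v: v * 2)

print(named_ok, lambda_ok)
```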

See the ./examples directory for more examples.

To Do

  • Support pandas assign for creating new columns based on existing columns
  • Support cast / astype transformer
  • Support for .apply transformer: wrapper around pandas.DataFrame.apply
  • Check how GridSearch (or other param search) works with Skippa
  • Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
  • Support PCA transformer
  • Facilitate random seed in Skippa object that is dispatched to all downstream operations
  • fit-transform does lazy evaluation: casting to category and then selecting category columns doesn't work. Each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
  • Investigate if Skippa can directly extend sklearn's Pipeline -> using getitem trick
  • Use sklearn's new dataframe output setting
  • Validation of pipeline steps
  • Input validation in transformers
  • Transformer for replacing values (pandas .replace)
  • Support arbitrary transformer (if column-preserving)
  • Eliminate the need to call columns explicitly

Credits

History

0.1.16 (2023-08-17)

  • Bugfix: missing _replace_none attribute for SimpleImputer with strategy='constant'

0.1.15 (2022-11-18)

  • Fix: when saving a pipeline, include dependencies in dill serialization.

0.1.14 (2022-05-13)

  • Bugfix in .assign: shouldn't have columns
  • Bugfix in imputer: explicit missing_values arg leads to issues
  • Used space-titanic data in examples
  • Logo added :)

0.1.13 (2022-04-08)

  • Bugfix in imputer: using strategy='constant' threw a TypeError when used on string columns

0.1.12 (2022-02-07)

  • Gradio & dependencies are not installed by default, but are declared an optional extra in setup

0.1.11 (2022-01-13)

  • Example added for hyperparameter tuning with Hyperopt

0.1.10 (2021-12-28)

  • Added support for PCA (including example)
  • Gradio app support extended to regression
  • Minor cleanup and improvements

0.1.9 (2021-12-24)

  • Added support for automatic creation of Gradio app for model inspection
  • Added example with Gradio app

0.1.8 (2021-12-23)

  • Removed print statement in SkippaSimpleImputer
  • Added unit tests

0.1.7 (2021-12-20)

  • Fixed issue where GridSearchCV (or hyperparameter search in general) did not work on a Skippa pipeline
  • Example added using GridSearch

0.1.6 (2021-12-17)

  • Docs, setup, readme updates
  • Updated .apply() method so that it accepts a columns specifier

0.1.5 (2021-12-13)

  • Fixes for readthedocs

0.1.4 (2021-12-13)

  • Cleanup/fix in examples/full-pipeline.py

0.1.3 (2021-12-10)

  • Added .apply() transformer for pandas.DataFrame.apply() functionality
  • Documentation and examples update

0.1.2 (2021-11-28)

  • Added .assign() transformer for pandas.DataFrame.assign() functionality
  • Added .cast() transformer (with aliases .astype() & .as_type()) for pandas.DataFrame.astype functionality

0.1.1 (2021-11-22)

  • Fixes and documentation.

0.1.0 (2021-11-19)

  • First release on PyPI.
