SciKIt-learn Pre-processing Pipeline in PAndas
Skippa
Read more in the introduction blog on towardsdatascience
Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.
Skippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.
So basically the same idea as scikit-pandas, but a different (and hopefully better) way to achieve it.
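To illustrate the problem being solved, here is a minimal sketch using plain scikit-learn only (no Skippa). With scikit-learn's default output setting, a transformer returns a NumPy array, so column names are lost and have to be restored by hand before the next step can refer to columns by name. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this illustration
df = pd.DataFrame({'y': [1.0, 16.0, 1000.0], 'z': [0.4, 2.0, 8.7]})

# With default settings, scikit-learn transformers return a NumPy array,
# so the column names of the input dataframe are no longer available
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# Restoring the dataframe format manually after every step
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(scaled_df.columns.tolist())
```

Doing this re-wrapping after each transformation is the bookkeeping that a dataframe-preserving pipeline aims to take off your hands.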
Installation
pip install skippa
Optional, if you want to use the gradio app functionality:
pip install skippa[gradio]
Basic usage
Import the Skippa class and the columns helper function:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from skippa import Skippa, columns
Get some data
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])
Define your pipeline:
pipe = (
    Skippa()
    .select(columns(['x', 'x2', 'y', 'z']))
    .cast(columns(['x', 'x2']), 'category')
    .impute(columns(dtype_include='number'), strategy='median')
    .impute(columns(dtype_include='category'), strategy='most_frequent')
    .scale(columns(dtype_include='number'), type='standard')
    .onehot(columns(['x', 'x2']))
    .model(LogisticRegression())
)
and use it for fitting / predicting like this:
pipe.fit(X=df, y=y)
predictions = pipe.predict_proba(df)
If you want details on your model, use:
model = pipe.get_model()
print(model.coef_)
print(model.intercept_)
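For comparison, the same steps could be sketched in plain scikit-learn roughly as follows. This is an approximation for illustration, not a description of what Skippa does internally: the grouping into a numeric and a categorical branch, the step names, and the variable names (sk_pipe, target) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same feature columns as in the example above
df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
target = np.array([0, 0, 1])

# Numeric branch: median imputation followed by standard scaling
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical branch: most-frequent imputation followed by one-hot encoding
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns are routed to a branch up front; steps cannot refer to
# intermediate dataframe columns by name
preprocess = ColumnTransformer([
    ('num', numeric_steps, make_column_selector(dtype_include='number')),
    ('cat', categorical_steps, ['x', 'x2']),
])

sk_pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])
sk_pipe.fit(df, target)
proba = sk_pipe.predict_proba(df)
print(proba.shape)
```

The nested structure works, but each column has to be assigned to a branch before any transformation runs, whereas the chained Skippa calls above select columns step by step.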
(de)serialization
And of course you can save and load your model pipelines (for deployment).
N.B. dill is used for serialization and deserialization because joblib and pickle don't provide enough support.
pipe.save('./models/my_skippa_model_pipeline.dill')
...
my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
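One well-known limitation of the standard pickle module is that it cannot serialize lambda functions, which are a natural fit for steps such as .apply() or .assign(). Whether this is the specific reason Skippa relies on dill is an assumption here; the sketch below only demonstrates the pickle limitation using the standard library. The helper can_pickle is hypothetical and not part of Skippa.

```python
import pickle


def can_pickle(obj):
    """Hypothetical helper: report whether the stdlib pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda such as one might pass to an apply-style step
double = lambda value: value * 2

# Plain data is picklable, a lambda is not
print(can_pickle({'strategy': 'median'}))
print(can_pickle(double))
```

dill extends pickle to cover objects like these, which is why pipelines that carry user-defined functions are often saved with it.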
See the ./examples directory for more examples:
- 01-standard-pipeline.py
- 02-preprocessing-only.py
- 03a-gridsearch.py
- 03b-hyperopt.py
- 04-gradio-app.py
- 05-PCA.py
To Do
- Support pandas assign for creating new columns based on existing columns
- Support cast / astype transformer
- Support for .apply transformer: wrapper around pandas.DataFrame.apply
- Check how GridSearch (or other param search) works with Skippa
- Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
- Support PCA transformer
- Facilitate random seed in Skippa object that is dispatched to all downstream operations
- fit-transform does lazy evaluation, so casting to category and then selecting category columns doesn't work. Each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
- Investigate if Skippa can directly extend sklearn's Pipeline -> using getitem trick
- Use sklearn's new dataframe output setting
- Validation of pipeline steps
- Input validation in transformers
- Transformer for replacing values (pandas .replace)
- Support arbitrary transformer (if column-preserving)
- Eliminate the need to call columns explicitly
Credits
- Skippa is powered by Data Science Lab Amsterdam
- This project structure is based on the audreyr/cookiecutter-pypackage project template.
History
0.1.16 (2023-08-17)
- Bugfix: missing _replace_none attribute for SimpleImputer with strategy='constant'
0.1.15 (2022-11-18)
- Fix: when saving a pipeline, include dependencies in dill serialization.
0.1.14 (2022-05-13)
- Bugfix in .assign: shouldn't have columns
- Bugfix in imputer: explicit missing_values arg leads to issues
- Used space-titanic data in examples
- Logo added :)
0.1.13 (2022-04-08)
- Bugfix in imputer: using strategy='constant' threw a TypeError when used on string columns
0.1.12 (2022-02-07)
- Gradio & dependencies are not installed by default, but are declared an optional extra in setup
0.1.11 (2022-01-13)
- Example added for hyperparameter tuning with Hyperopt
0.1.10 (2021-12-28)
- Added support for PCA (including example)
- Gradio app support extended to regression
- Minor cleanup and improvements
0.1.9 (2021-12-24)
- Added support for automatic creation of Gradio app for model inspection
- Added example with Gradio app
0.1.8 (2021-12-23)
- Removed print statement in SkippaSimpleImputer
- Added unit tests
0.1.7 (2021-12-20)
- Fixed issue that GridSearchCV (or hyperparam in general) did not work on Skippa pipeline
- Example added using GridSearch
0.1.6 (2021-12-17)
- Docs, setup, readme updates
- Updated .apply() method so that it accepts a columns specifier
0.1.5 (2021-12-13)
- Fixes for readthedocs
0.1.4 (2021-12-13)
- Cleanup/fix in examples/full-pipeline.py
0.1.3 (2021-12-10)
- Added .apply() transformer for pandas.DataFrame.apply() functionality
- Documentation and examples update
0.1.2 (2021-11-28)
- Added .assign() transformer for pandas.DataFrame.assign() functionality
- Added .cast() transformer (with aliases .astype() & .as_type()) for pandas.DataFrame.astype functionality
0.1.1 (2021-11-22)
- Fixes and documentation.
0.1.0 (2021-11-19)
- First release on PyPI.