SciKIt-learn Pre-processing Pipeline in PAndas
Skippa
Read more in the introductory blog post on Towards Data Science.
Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.
Skippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.
So basically the same idea as scikit-pandas, but a different (and hopefully better) way to achieve it.
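To make the "preserving pandas dataframe format" idea concrete, here is a minimal sketch of a scikit-learn transformer wrapped so that it returns a DataFrame with the original column names. `FrameScaler` is a hypothetical helper written for this illustration; it is not part of Skippa and is not meant to describe how Skippa is implemented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


class FrameScaler:
    """Hypothetical helper (not Skippa's API): scale chosen columns, return a DataFrame."""

    def __init__(self, cols):
        self.cols = list(cols)
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # Fit the underlying scikit-learn scaler on the selected columns only
        self.scaler.fit(X[self.cols])
        return self

    def transform(self, X):
        out = X.copy()
        # Write scaled values back under the original column names and index
        out[self.cols] = self.scaler.transform(X[self.cols])
        return out


df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1.0, 16.0, 1000.0]})
scaled = FrameScaler(["y"]).fit(df).transform(df)
print(scaled.dtypes)  # still a DataFrame, so later steps can refer to 'x' and 'y' by name
```

Because the output is still a DataFrame, a following step can select columns by name or dtype instead of by array position.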
Installation
pip install skippa
Optional, if you want to use the gradio app functionality:
pip install skippa[gradio]
Basic usage
Import the Skippa class and the columns helper function:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from skippa import Skippa, columns
Get some data
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])
Define your pipeline:
pipe = (
    Skippa()
    .select(columns(['x', 'x2', 'y', 'z']))
    .cast(columns(['x', 'x2']), 'category')
    .impute(columns(dtype_include='number'), strategy='median')
    .impute(columns(dtype_include='category'), strategy='most_frequent')
    .scale(columns(dtype_include='number'), type='standard')
    .onehot(columns(['x', 'x2']))
    .model(LogisticRegression())
)
and use it for fitting / predicting like this:
pipe.fit(X=df, y=y)
predictions = pipe.predict_proba(df)
If you want details on your model, use:
model = pipe.get_model()
print(model.coef_)
print(model.intercept_)
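For comparison, the pipeline in this section could be written in plain scikit-learn roughly as follows. This is a sketch under assumptions, not a statement of what Skippa does internally: the step names (`num`, `cat`, `preprocess`, `model`) are arbitrary, the columns are listed by hand instead of being selected by dtype, and `handle_unknown="ignore"` is an added choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "x": ["a", "b", "c"],
    "x2": ["m", "n", "m"],
    "y": [1, 16, 1000],
    "z": [0.4, None, 8.7],
})
y = np.array([0, 0, 1])

# Numeric branch: median imputation followed by standard scaling
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: most-frequent imputation followed by one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Columns are assigned to branches by name; unlisted columns are dropped by default
preprocess = ColumnTransformer([
    ("num", numeric, ["y", "z"]),
    ("cat", categorical, ["x", "x2"]),
])

clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(df, y)
proba = clf.predict_proba(df)
print(proba.shape)
```

By default the `ColumnTransformer` hands a NumPy array (or sparse matrix) to the next step, so column names are not available between steps. Keeping those names available is the convenience the chained Skippa syntax aims for.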
(de)serialization
And of course you can save and load your model pipelines (for deployment).
N.B. dill is used for serialization/deserialization because joblib and pickle don't provide enough support.
pipe.save('./models/my_skippa_model_pipeline.dill')
...
my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
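The text above does not say which objects joblib and pickle fail on. One plausible case (an assumption, not confirmed by the source) is a pipeline step that holds a lambda, such as a function passed to an apply-style transformer. The sketch below shows the standard-library pickle rejecting a lambda while dill, if installed, round-trips it. It uses only `pickle.dumps`, `dill.dumps`, and `dill.loads`, and does not touch Skippa itself.

```python
import pickle

# A lambda standing in for a user-supplied function inside a pipeline step
double = lambda value: value * 2


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


pickle_ok = can_pickle(double)
print("pickle handles the lambda:", pickle_ok)

try:
    import dill  # third-party; only needed for the second half of the sketch

    restored = dill.loads(dill.dumps(double))
    dill_result = restored(21)
    print("dill round-trip result:", dill_result)
except ImportError:
    dill_result = None
    print("dill is not installed; skipping the round-trip")
```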
See the ./examples directory for more examples:
- 01-standard-pipeline.py
- 02-preprocessing-only.py
- 03a-gridsearch.py
- 03b-hyperopt.py
- 04-gradio-app.py
- 05-PCA.py
To Do
- Support pandas assign for creating new columns based on existing columns
- Support cast / astype transformer
- Support for .apply transformer: wrapper around pandas.DataFrame.apply
- Check how GridSearch (or other param search) works with Skippa
- Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
- Support PCA transformer
- Facilitate random seed in Skippa object that is dispatched to all downstream operations
- fit-transform does lazy evaluation: casting to category and then selecting category columns doesn't work. Each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
- Investigate if Skippa can directly extend sklearn's Pipeline -> using getitem trick
- Use sklearn's new dataframe output setting
- Validation of pipeline steps
- Input validation in transformers
- Transformer for replacing values (pandas .replace)
- Support arbitrary transformer (if column-preserving)
- Eliminate the need to call columns explicitly
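The lazy-evaluation item in the list above (cast to category, then select category columns) can be illustrated with plain pandas. This sketch only demonstrates the general effect of resolving a dtype-based selection against the original frame versus the transformed one. It makes no claim about how Skippa resolves columns internally.

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 16, 1000]})

# Resolved against the original frame: 'x' has not been cast yet, so nothing matches
selected_before = df.select_dtypes(include="category").columns.tolist()

# Resolved against the output of the cast step: 'x' is now a category column
casted = df.astype({"x": "category"})
selected_after = casted.select_dtypes(include="category").columns.tolist()

print(selected_before)
print(selected_after)
```

A dtype-based selector therefore has to be evaluated on the output of the previous step, which is what that To Do item asks for.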
Credits
- Skippa is powered by Data Science Lab Amsterdam
- This project structure is based on the audreyr/cookiecutter-pypackage project template.
History
0.1.16 (2023-08-17)
- Bugfix: missing _replace_none attribute for SimpleImputer with strategy='constant'
0.1.15 (2022-11-18)
- Fix: when saving a pipeline, include dependencies in dill serialization.
0.1.14 (2022-05-13)
- Bugfix in .assign: shouldn't have columns
- Bugfix in imputer: explicit missing_values arg leads to issues
- Used space-titanic data in examples
- Logo added :)
0.1.13 (2022-04-08)
- Bugfix in imputer: using strategy='constant' threw a TypeError when used on string columns
0.1.12 (2022-02-07)
- Gradio & dependencies are not installed by default, but are declared an optional extra in setup
0.1.11 (2022-01-13)
- Example added for hyperparameter tuning with Hyperopt
0.1.10 (2021-12-28)
- Added support for PCA (including example)
- Gradio app support extended to regression
- Minor cleanup and improvements
0.1.9 (2021-12-24)
- Added support for automatic creation of Gradio app for model inspection
- Added example with Gradio app
0.1.8 (2021-12-23)
- Removed print statement in SkippaSimpleImputer
- Added unit tests
0.1.7 (2021-12-20)
- Fixed issue that GridSearchCV (or hyperparam in general) did not work on Skippa pipeline
- Example added using GridSearch
0.1.6 (2021-12-17)
- Docs, setup, readme updates
- Updated .apply() method so that it accepts a columns specifier
0.1.5 (2021-12-13)
- Fixes for readthedocs
0.1.4 (2021-12-13)
- Cleanup/fix in examples/full-pipeline.py
0.1.3 (2021-12-10)
- Added .apply() transformer for pandas.DataFrame.apply() functionality
- Documentation and examples update
0.1.2 (2021-11-28)
- Added .assign() transformer for pandas.DataFrame.assign() functionality
- Added .cast() transformer (with aliases .astype() & .as_type()) for pandas.DataFrame.astype functionality
0.1.1 (2021-11-22)
- Fixes and documentation.
0.1.0 (2021-11-19)
- First release on PyPI.