sk-transformers
A collection of various scikit-learn transformers for all kinds of preprocessing and feature engineering steps 🛠
Introduction
Every tabular dataset is different, and every column needs to be treated differently. scikit-learn already offers a nice collection of dataset transformers, but the possibilities for data transformation are endless. This project aims to provide a broad collection of data transformers that can easily be used together with scikit-learn - either in a pipeline or on their own. See the usage chapter for some examples.
The idea is simple. It is like a well-equipped toolbox 🧰: You always find the tool you need and sometimes you get inspired by seeing a tool you did not know before. Please feel free to contribute your tools and ideas.
Installation
If you are using pip, you can install the package with the following command:
pip install sk-transformers
If you are using Poetry, you can install the package with the following command:
poetry add sk-transformers
Installing dependencies
With pip:
pip install -r requirements.txt
With Poetry:
poetry install
Usage
Let's assume you want to use one of NumPy's mathematical functions to sum up the values of column foo and column bar. You could use the MathExpressionTransformer.
import pandas as pd
from sk_transformers import MathExpressionTransformer
X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
transformer = MathExpressionTransformer([("foo", "np.sum", "bar", {"axis": 0})])
transformer.fit_transform(X).to_numpy()
array([[1, 4, 5],
[2, 5, 7],
[3, 6, 9]])
Even though we only pass one tuple to the transformer in this example, the idea - as with most other transformers - is to simplify preprocessing by allowing operations on multiple columns at the same time. In this case, the MathExpressionTransformer has created an extra column with the name foo_sum_bar.
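To make the result above easier to follow, here is the equivalent computation in plain pandas and NumPy (a sketch for illustration only; the column name foo_sum_bar follows the naming pattern shown above and is set by hand here):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# np.sum with axis=0 sums the two columns element-wise,
# producing the values of the new column foo_sum_bar.
X["foo_sum_bar"] = np.sum([X["foo"].to_numpy(), X["bar"].to_numpy()], axis=0)

print(X.to_numpy())
# [[1 4 5]
#  [2 5 7]
#  [3 6 9]]
```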
In the next example, we additionally add the MapTransformer. Together with scikit-learn's pipelines, it looks like this:
import pandas as pd
from sk_transformers import MapTransformer, MathExpressionTransformer
from sklearn.pipeline import Pipeline
X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
map_step = MapTransformer([("foo", lambda x: x + 100)])
sum_step = MathExpressionTransformer([("foo", "np.sum", "bar", {"axis": 0})])
pipeline = Pipeline([("map_step", map_step), ("sum_step", sum_step)])
pipeline.fit_transform(X)
foo bar foo_sum_bar
0 101 4 105
1 102 5 107
2 103 6 109
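For comparison, the two pipeline steps above can be sketched directly in pandas and NumPy (illustration only, without the library; the column name foo_sum_bar is set by hand to match the transformer's naming pattern):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# map_step: apply the lambda to every value of the foo column.
X["foo"] = X["foo"].map(lambda x: x + 100)

# sum_step: element-wise sum of foo and bar into a new column.
X["foo_sum_bar"] = np.sum([X["foo"].to_numpy(), X["bar"].to_numpy()], axis=0)

print(X)
#    foo  bar  foo_sum_bar
# 0  101    4          105
# 1  102    5          107
# 2  103    6          109
```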
Contributing
We're all kind of in the same boat: preprocessing and feature engineering in data science is very individual - every feature is different and must be handled and processed differently. Yet we all face the same problems: sometimes date columns have to be changed, sometimes strings have to be formatted, sometimes durations have to be calculated, etc. There is a huge number of preprocessing possibilities, but we all use the same tools.
scikit-learn's pipelines help formalize these functions. So why not share these so-called transformers with others? The goal of this open-source project is to collect useful preprocessing pipeline steps. Let us all gather what we use for preprocessing and share it, so we can all benefit from each other's work and save a lot of time. If you have a preprocessing step that you use regularly, please feel free to contribute it to this project. The idea is that this is not only a toolbox but also an inspiration for what is possible - maybe you have not thought of a preprocessing step like this before.
Please check out the guide on how to contribute to this project.