sk-transformers

A collection of various pandas & scikit-learn compatible transformers for all kinds of preprocessing and feature engineering steps 🛠


Introduction

Every tabular dataset is different. Every column needs to be treated differently. Pandas is already great! And scikit-learn has a nice collection of dataset transformers. But the possibilities of data transformation are infinite. This project tries to provide a broad collection of data transformers that can be easily used together with scikit-learn, either in a pipeline or on their own. See the usage chapter for some examples.

The idea is simple. It is like a well-equipped toolbox 🧰: You always find the tool you need and sometimes you get inspired by seeing a tool you did not know before. Please feel free to contribute your tools and ideas.

Check out some examples in the Jupyter notebook.

Installation

If you are using pip, you can install the package with the following command:

pip install sk-transformers

If you are using Poetry, you can install the package with the following command:

poetry add sk-transformers
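
To confirm that the installation worked, you can ask Python for the installed version of the distribution. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of sk-transformers.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if sk-transformers is installed, otherwise None.
print(installed_version("sk-transformers"))
```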

Installing dependencies

With pip:

pip install -r requirements.txt

With Poetry:

poetry install

Available transformers

| Module | Transformer | Description |
| --- | --- | --- |
| Datetime transformer | DateColumnsTransformer | Splits a date column into multiple columns. |
| Datetime transformer | DurationCalculatorTransformer | Calculates the duration between two given dates. |
| Encoder transformer | MeanEncoderTransformer | Scikit-learn API for the feature-engine MeanEncoder. |
| Generic transformer | AggregateTransformer | Uses the Pandas groupby and aggregate methods to apply a function to a column grouped by another column. |
| Generic transformer | AllowedValuesTransformer | Replaces values that are not in a list with another value. |
| Generic transformer | ColumnDropperTransformer | Drops columns from a dataframe using the Pandas drop method. |
| Generic transformer | ColumnEvalTransformer | Provides the possibility to use Pandas methods on columns. |
| Generic transformer | DtypeTransformer | Converts a column to a different dtype. |
| Generic transformer | FunctionsTransformer | A plain wrapper around sklearn.preprocessing.FunctionTransformer. |
| Generic transformer | LeftJoinTransformer | Uses the Pandas merge function to perform a left join based on a column of a dataframe and the index of another dataframe. The right dataframe is essentially a lookup table. |
| Generic transformer | MapTransformer | Iterates over all columns in the features list and applies the given callback to each column using the pandas.Series.map method. |
| Generic transformer | NaNTransformer | Replaces NaN values with a specified value. Internally, the Pandas fillna method is used. |
| Generic transformer | QueryTransformer | Applies a list of queries to a dataframe. If it operates on a dataset used for supervised learning, this transformer should be applied to the dataframe containing X and y. |
| Generic transformer | ValueIndicatorTransformer | Adds a column to a dataframe indicating whether a value is equal to a specified value. |
| Generic transformer | ValueReplacerTransformer | Uses the Pandas replace method to replace values in a column. |
| Number transformer | MathExpressionTransformer | Applies an operation to a column and a given value or column. The operation can be any operation from the numpy or operator package. |
| Number transformer | GeoDistanceTransformer | Calculates the distance in kilometers between two places on the earth using their latitudes and longitudes. |
| String transformer | EmailTransformer | Transforms an email address into multiple features. |
| String transformer | IPAddressEncoderTransformer | Encodes IPv4 and IPv6 address strings to a float representation. |
| String transformer | PhoneTransformer | Transforms a phone number into multiple features. |
| String transformer | StringSimilarityTransformer | Calculates the similarity between two strings using the gestalt pattern matching algorithm from the SequenceMatcher class. |
| String transformer | StringSlicerTransformer | Slices all entries of specified string features using the slice() function. |
| String transformer | StringSplitterTransformer | Splits a string column into multiple columns based on the occurrence of a character. |
| String transformer | StringCombinationTransformer | Concatenates two string columns after ordering them alphabetically first. |
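
To give a feel for one of the techniques named above, the gestalt pattern matching similarity mentioned for the StringSimilarityTransformer can be computed directly with `difflib.SequenceMatcher` from the Python standard library. This is only a sketch of the underlying measure, not the transformer's implementation; the helper name `gestalt_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def gestalt_similarity(left: str, right: str) -> float:
    """Return a similarity score between 0.0 (nothing shared) and 1.0 (identical)."""
    # ratio() is 2 * M / T, where M is the number of matching characters
    # and T is the total number of characters in both strings.
    return SequenceMatcher(None, left, right).ratio()


print(gestalt_similarity("abcd", "bcde"))  # 0.75 (3 matching characters, 8 in total)
print(gestalt_similarity("abc", "abc"))    # 1.0
```

A score like this can serve as a numeric feature for pairs of string columns, for example when comparing a name field against an email address.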

Usage

Let's assume you want to use one of NumPy's mathematical functions to sum up the values of column foo and column bar. You could use the MathExpressionTransformer.

import pandas as pd
from sk_transformers import MathExpressionTransformer

X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
transformer = MathExpressionTransformer([("foo", "np.sum", "bar", {"axis": 0})])
transformer.fit_transform(X).to_numpy()
array([[1, 4, 5],
       [2, 5, 7],
       [3, 6, 9]])

In this example we pass only one tuple to the transformer. As with most other transformers, however, the idea is to simplify preprocessing by making it possible to operate on multiple columns at the same time. In this case, the MathExpressionTransformer has created an extra column with the name foo_sum_bar.
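
To make the idea of "one tuple per new column" concrete, here is a rough plain pandas/NumPy sketch of the computation. It is an assumption about what the result corresponds to, not the library's actual implementation; the `specs` list, its tuple layout, and the column name `foo_prod_bar` are hypothetical and chosen for illustration only.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Hypothetical spec list: (left column, NumPy function, right column, new column name).
specs = [
    ("foo", np.sum, "bar", "foo_sum_bar"),
    ("foo", np.prod, "bar", "foo_prod_bar"),
]

result = X.copy()
for left, func, right, new_name in specs:
    # Stack both columns into a (2, n_rows) array and reduce along axis 0,
    # which combines the two columns row by row.
    stacked = np.stack([X[left].to_numpy(), X[right].to_numpy()])
    result[new_name] = func(stacked, axis=0)

print(result)
```

Each entry in the list yields one additional column, so several derived features can be described in a single place.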

The next example additionally uses the MapTransformer. Combined in a scikit-learn pipeline, it looks like this:

import pandas as pd
from sk_transformers import MathExpressionTransformer
from sk_transformers import MapTransformer
from sklearn.pipeline import Pipeline

X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
map_step = MapTransformer([("foo", lambda x: x + 100)])
sum_step = MathExpressionTransformer([("foo", "np.sum", "bar", {"axis": 0})])
pipeline = Pipeline([("map_step", map_step), ("sum_step", sum_step)])
pipeline.fit_transform(X)
   foo  bar  foo_sum_bar
0  101    4          105
1  102    5          107
2  103    6          109
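
For comparison, the same two steps can be written with scikit-learn's own `FunctionTransformer`, at the cost of one hand-written function per step. This is a sketch for illustration only and makes no claim about how sk-transformers is implemented; the function names `add_hundred_to_foo` and `add_foo_sum_bar` are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_hundred_to_foo(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the dataframe with 100 added to every value in foo."""
    return df.assign(foo=df["foo"] + 100)


def add_foo_sum_bar(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the dataframe with a new column holding foo + bar."""
    return df.assign(foo_sum_bar=df["foo"] + df["bar"])


X = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
pipeline = Pipeline(
    [
        ("map_step", FunctionTransformer(add_hundred_to_foo)),
        ("sum_step", FunctionTransformer(add_foo_sum_bar)),
    ]
)
result = pipeline.fit_transform(X)
print(result)
```

The declarative tuples used in the example above replace these per-step functions, which is where the convenience comes from when many columns are involved.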

Contributing

We're all in the same boat. Preprocessing and feature engineering in data science are highly individual: every feature is different and must be handled and processed differently. Yet we all face the same problems: sometimes date columns have to be changed, sometimes strings have to be formatted, sometimes durations have to be calculated, and so on. The number of preprocessing possibilities is huge, but we all use the same tools.

scikit-learn's pipelines help to use formalized functions. So why not also share these so-called transformers with others? The goal of this open-source project is to collect useful preprocessing pipeline steps. Let us all collect what we used for preprocessing and share it with others. This way we can all benefit from each other's work and save a lot of time. So if you have a preprocessing step that you use regularly, please feel free to contribute it to this project. The idea is that this is not only a toolbox but also an inspiration for what is possible. Maybe you will find a preprocessing step you had not thought about before.

Please check out the guide on how to contribute to this project.
