
Project description


Pipeline Optimizer is a Python library that aims to simplify and automate the machine learning pipeline, from preprocessing and testing to deployment. By providing a reusable infrastructure, the library allows you to manage custom preprocessing functions and reuse them effortlessly during the deployment of your project. This is particularly useful when dealing with a large number of custom functions.

The library currently features a single class, SequentialTransformer, which allows you to add custom preprocessing functions using a simple decorator. The class also builds on scikit-learn's TransformerMixin, making it compatible with the widely used scikit-learn ecosystem.

Installation

pip install pipeline_optimizer

SequentialTransformer

SequentialTransformer is a class that stores a list of preprocessing steps and applies them sequentially to input data. You can easily add a custom preprocessing function to its memory using the @add_step decorator. The class also provides methods to transform the input data, save the transformer to disk, and load it for later use.

Here's a quick demonstration of how to use the SequentialTransformer class:

Step 1: Import necessary libraries

import pandas as pd
from pipeline_optimizer import SequentialTransformer, add_step

Step 2: Load your dataset

data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50]
})

labels = pd.Series([0, 1, 0, 1, 1])

Step 3: Define preprocessing functions and add them to the pipeline

pipe = SequentialTransformer()

@add_step(pipe)
def drop_column(df: pd.DataFrame, col: str = "B") -> pd.DataFrame:
    return df.drop(columns=[col])

@add_step(pipe)
def multiply(df: pd.DataFrame, col: str = "A", multiplier: float = 2) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the caller's DataFrame in place
    df[col] = df[col] * multiplier
    return df

Step 4: Transform the input data

After applying the preprocessing functions, the SequentialTransformer will drop column "B" and multiply column "A" by 2.

transformed_data = pipe.transform(data)
print(transformed_data)

Output:


    A   C
0   2  10
1   4  20
2   6  30
3   8  40
4  10  50

Step 5: Save the transformer object

pipe.save("transformer.pkl")

Step 6: Load the saved transformer and apply it to deployment data

You can load the saved transformer with the pickle module and apply it to new data at deployment time.

import pickle

# Load the saved transformer
with open("transformer.pkl", "rb") as f:
    loaded_pipe = pickle.load(f)

# Deployment data
deployment_data = pd.DataFrame({
    "A": [6],
    "B": [3],
    "C": [60]
})

# Transform the deployment data using the loaded transformer
transformed_deployment_data = loaded_pipe.transform(deployment_data)
print(transformed_deployment_data)

Output:

   A   C
0  12  60
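If you are curious how a decorator-driven transformer like this can be put together, here is a minimal, self-contained sketch built on scikit-learn's TransformerMixin. It illustrates the general pattern only and is not the library's actual source; the names MiniSequentialTransformer, toy_pipe, and drop_b are invented for this example.

```python
import pickle

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MiniSequentialTransformer(BaseEstimator, TransformerMixin):
    """Toy version: keeps an ordered list of functions and applies them in turn."""

    def __init__(self):
        self.steps = []

    def fit(self, X, y=None):
        return self  # stateless -- nothing to learn from the data

    def transform(self, X):
        out = X
        for func in self.steps:
            out = func(out)
        return out

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)


def add_step(pipe):
    """Decorator that registers a function as the next step of `pipe`."""
    def decorator(func):
        pipe.steps.append(func)
        return func
    return decorator


# Quick check on a toy frame
toy_pipe = MiniSequentialTransformer()

@add_step(toy_pipe)
def drop_b(df):
    return df.drop(columns=["B"])

result = toy_pipe.transform(pd.DataFrame({"A": [1, 2], "B": [3, 4]}))
```

Because the decorator returns the function unchanged, the functions stay usable on their own, and pickling the transformer only stores references to module-level functions, which is why the saved object can be reloaded later.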

Integration with scikit-learn Pipeline

A noteworthy feature of the SequentialTransformer is that it can be seamlessly integrated with scikit-learn's Pipeline class. This further simplifies the preprocessing and deployment processes, enabling you to create an end-to-end machine learning pipeline that combines custom preprocessing steps with scikit-learn estimators.

By incorporating the SequentialTransformer into an sklearn Pipeline, you can benefit from the full range of features provided by scikit-learn, such as cross-validation, grid search, and model evaluation.
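As an illustration, cross-validating the whole pipeline becomes a one-liner. The sketch below uses scikit-learn's FunctionTransformer as a stand-in for the preprocessing step so it runs without pipeline_optimizer installed; an initialized SequentialTransformer should slot into the same "preprocessor" position. The toy data and the drop_b function are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data: 20 rows, three numeric features, binary labels
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "A": rng.normal(size=20),
    "B": rng.normal(size=20),
    "C": rng.normal(size=20),
})
y = np.array([0, 1] * 10)

def drop_b(df):
    return df.drop(columns=["B"])

clf = Pipeline([
    ("preprocessor", FunctionTransformer(drop_b)),  # stand-in for SequentialTransformer
    ("lda", LinearDiscriminantAnalysis()),
])

# 5-fold cross-validation over the full pipeline,
# so the preprocessing runs inside every fold
scores = cross_val_score(clf, X, y, cv=5)
```

Running the preprocessing inside each fold (rather than once up front) is what prevents information from the held-out fold leaking into training.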

Here's a quick example of how to integrate an initialized SequentialTransformer with an sklearn Pipeline:

pipe = SequentialTransformer()

@add_step(pipe)
def drop_column(df: pd.DataFrame, col: str = "B") -> pd.DataFrame:
    return df.drop(columns=[col])

@add_step(pipe)
def multiply(df: pd.DataFrame, col: str = "A", multiplier: float = 2) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the caller's DataFrame in place
    df[col] = df[col] * multiplier
    return df

# Create an sklearn pipeline combining the custom SequentialTransformer
# with a Linear Discriminant Analysis estimator
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pipeline = Pipeline([
    ("preprocessor", pipe),  # the SequentialTransformer initialized above, with steps added
    ("lda", LinearDiscriminantAnalysis())
])

# Fit the pipeline on the toy dataset from earlier
pipeline.fit_transform(data, labels)

Comparison with scikit-learn

When working with custom preprocessing functions using the scikit-learn library, you would typically define a custom class that inherits from TransformerMixin and implement fit and transform methods for each function. This can be time-consuming and may lead to code duplication.
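To make the comparison concrete, here is roughly what that boilerplate looks like: one TransformerMixin subclass per preprocessing function, each with its own fit and transform. The class names ColumnDropper and ColumnMultiplier are hypothetical, invented for this sketch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# One full class per preprocessing function -- the duplication referred to above
class ColumnDropper(BaseEstimator, TransformerMixin):
    def __init__(self, col="B"):
        self.col = col

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        return X.drop(columns=[self.col])


class ColumnMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, col="A", multiplier=2):
        self.col = col
        self.multiplier = multiplier

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        X = X.copy()
        X[self.col] = X[self.col] * self.multiplier
        return X


frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
dropped = ColumnDropper().fit_transform(frame)
doubled = ColumnMultiplier().fit_transform(dropped)
```

Every new preprocessing function means another class with the same fit/transform scaffolding, which is the repetition the decorator approach is meant to avoid.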

Alternatively, you can use scikit-learn's FunctionTransformer to create transformers from user-defined functions. However, using FunctionTransformer can become unwieldy when you have many preprocessing functions, as you need to create an instance of FunctionTransformer for each function and manage them individually.

Here's an example of how you would use FunctionTransformer to accomplish the same preprocessing steps as in the previous example:

from sklearn.preprocessing import FunctionTransformer

# Define the preprocessing functions
def drop_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    return df.drop(columns=[col])

def multiply(df: pd.DataFrame, col: str, multiplier: float) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the input in place
    df[col] = df[col] * multiplier
    return df

# Create FunctionTransformer instances for each function
drop_column_transformer = FunctionTransformer(drop_column, kw_args={"col": "B"})
multiply_transformer = FunctionTransformer(multiply, kw_args={"col": "A", "multiplier": 2})

# Apply the preprocessing functions to the toy dataset
data_dropped = drop_column_transformer.transform(data)
data_transformed = multiply_transformer.transform(data_dropped)

As you can see, using FunctionTransformer requires creating separate instances for each preprocessing function and managing them individually. This approach can become cumbersome when dealing with a large number of custom functions. In contrast, the SequentialTransformer class in the Pipeline Optimizer library provides a more streamlined and efficient way to manage and apply multiple preprocessing functions.

With the Pipeline Optimizer library, you can easily define preprocessing functions and add them to the SequentialTransformer pipeline using the @add_step decorator. This approach is more concise and allows you to reuse your preprocessing functions across different projects effortlessly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pipeline_optimizer-0.1.7.tar.gz (5.5 kB, source)

Built Distribution

pipeline_optimizer-0.1.7-py3-none-any.whl (5.8 kB, Python 3)
