Project description
Pipeline Optimizer is a Python library that aims to simplify and automate the machine learning pipeline, from preprocessing and testing to deployment. By providing a reusable infrastructure, the library allows you to manage custom preprocessing functions and reuse them effortlessly during the deployment of your project. This is particularly useful when dealing with a large number of custom functions.
The library currently features a single class called SequentialTransformer, which allows you to add custom preprocessing functions using a simple decorator. The class also integrates with scikit-learn's TransformerMixin, making it compatible with the widely used scikit-learn library.
Installation
pip install pipeline_optimizer
SequentialTransformer
SequentialTransformer is a class that stores a list of preprocessing steps and applies them sequentially to input data. You can easily add a custom preprocessing function to its memory using the @add_step decorator. The class also provides methods to transform the input data, save the transformer to disk, and load it for later use.
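Under the hood, such a class can be quite small. The sketch below is not the library's actual source, just a minimal illustration of how a step list plus an @add_step decorator could fit together (the real class also mixes in scikit-learn's TransformerMixin, omitted here to keep the sketch dependency-free):

```python
import pickle


class SequentialTransformer:
    """Minimal sketch: stores preprocessing functions and applies them in order."""

    def __init__(self):
        self.steps = []

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        # Apply each registered step in the order it was added
        for step in self.steps:
            X = step(X)
        return X

    def save(self, path):
        # Persist the whole transformer, steps included, with pickle
        with open(path, "wb") as f:
            pickle.dump(self, f)


def add_step(pipe):
    """Decorator factory: registers the decorated function on `pipe`."""
    def decorator(func):
        pipe.steps.append(func)
        return func
    return decorator


# Quick check with plain lists instead of DataFrames
pipe = SequentialTransformer()

@add_step(pipe)
def increment(xs):
    return [x + 1 for x in xs]

result = pipe.transform([1, 2, 3])
print(result)  # [2, 3, 4]
```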
Here's a quick demonstration of how to use the SequentialTransformer class:
Step 1: Import necessary libraries
import pandas as pd
from pipeline_optimizer import SequentialTransformer, add_step
Step 2: Load your dataset
data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "C": [10, 20, 30, 40, 50]
})
labels = pd.Series([0, 1, 0, 1, 1])
Step 3: Define preprocessing functions and add them to the pipeline
pipe = SequentialTransformer()

@add_step(pipe)
def drop_column(df: pd.DataFrame, col: str = "B") -> pd.DataFrame:
    return df.drop(columns=[col])

@add_step(pipe)
def multiply(df: pd.DataFrame, col: str = "A", multiplier: float = 2) -> pd.DataFrame:
    df[col] = df[col] * multiplier
    return df
Step 4: Transform the input data
After applying the preprocessing functions, the SequentialTransformer will drop column "B" and multiply column "A" by 2.
transformed_data = pipe.transform(data)
print(transformed_data)
Output:
A C
0 2 10
1 4 20
2 6 30
3 8 40
4 10 50
Step 5: Save the transformer object
pipe.save("transformer.pkl")
Step 6: Load the saved transformer and apply it to deployment data
You can load the saved transformer using the pickle module and apply it to new deployment data to preprocess it.
import pickle
# Load the saved transformer
with open("transformer.pkl", "rb") as f:
    loaded_pipe = pickle.load(f)

# Deployment data
deployment_data = pd.DataFrame({
    "A": [6],
    "B": [3],
    "C": [60]
})
# Transform the deployment data using the loaded transformer
transformed_deployment_data = loaded_pipe.transform(deployment_data)
print(transformed_deployment_data)
Output:
A C
0 12 60
Integration with scikit-learn Pipeline
A noteworthy feature of the SequentialTransformer is that it can be seamlessly integrated with scikit-learn's Pipeline class. This further simplifies the preprocessing and deployment processes, enabling you to create an end-to-end machine learning pipeline that combines custom preprocessing steps with scikit-learn estimators.
By incorporating the SequentialTransformer into an sklearn Pipeline, you can benefit from the full range of features provided by scikit-learn, such as cross-validation, grid search, and model evaluation.
Here's a quick example of how to integrate an initialized SequentialTransformer with an sklearn Pipeline:
pipe = SequentialTransformer()

@add_step(pipe)
def drop_column(df: pd.DataFrame, col: str = "B") -> pd.DataFrame:
    return df.drop(columns=[col])

@add_step(pipe)
def multiply(df: pd.DataFrame, col: str = "A", multiplier: float = 2) -> pd.DataFrame:
    df[col] = df[col] * multiplier
    return df
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Create an sklearn pipeline with the custom SequentialTransformer and a Linear Discriminant Analysis
pipeline = Pipeline([
    ("preprocessor", pipe),  # Ensure the SequentialTransformer has been initialized and steps have been added
    ("lda", LinearDiscriminantAnalysis())
])

# Fit the pipeline on the toy data and labels from earlier
pipeline.fit_transform(data, labels)
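As a concrete illustration of the cross-validation point above, the snippet below runs 5-fold cross-validation over an entire pipeline, preprocessing included. It uses a plain FunctionTransformer as a stand-in preprocessor so the example runs without pipeline_optimizer installed; a SequentialTransformer would slot into the same position. The toy dataset and estimator are only for illustration:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Stand-in preprocessor: drop column "B" (a SequentialTransformer slots in the same way)
preprocessor = FunctionTransformer(lambda df: df.drop(columns=["B"]))

# Toy dataset, larger than the earlier one so 5-fold CV has enough rows per class
X = pd.DataFrame({
    "A": list(range(20)),
    "B": [0] * 20,
    "C": [v * v for v in range(20)],
})
y = pd.Series([0, 1] * 10)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("lda", LinearDiscriminantAnalysis()),
])

# Cross-validation re-runs the preprocessing inside each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(len(scores))  # 5
```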
Comparison with scikit-learn
When working with custom preprocessing functions in the scikit-learn library, you would typically define a custom class that inherits from TransformerMixin and implements fit and transform methods for each function. This can be time-consuming and may lead to code duplication.
Alternatively, you can use scikit-learn's FunctionTransformer to create transformers from user-defined functions. However, this can become unwieldy when you have many preprocessing functions, as you need to create an instance of FunctionTransformer for each function and manage them individually.
Here's an example of how you would use FunctionTransformer to accomplish the same preprocessing steps as in the previous example:
from sklearn.preprocessing import FunctionTransformer

# Define the preprocessing functions
def drop_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    return df.drop(columns=[col])

def multiply(df: pd.DataFrame, col: str, multiplier: float) -> pd.DataFrame:
    df[col] = df[col] * multiplier
    return df
# Create FunctionTransformer instances for each function
drop_column_transformer = FunctionTransformer(drop_column, kw_args={"col": "B"})
multiply_transformer = FunctionTransformer(multiply, kw_args={"col": "A", "multiplier": 2})
# Apply the preprocessing functions to the toy dataset
data_dropped = drop_column_transformer.transform(data)
data_transformed = multiply_transformer.transform(data_dropped)
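The two FunctionTransformer instances above can also be chained with scikit-learn's make_pipeline, but each function still needs its own wrapper object. A self-contained sketch (the multiply step copies the frame first so the caller's data isn't mutated, a small deviation from the version above):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    return df.drop(columns=[col])

def multiply(df: pd.DataFrame, col: str, multiplier: float) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the caller's frame
    df[col] = df[col] * multiplier
    return df

# One FunctionTransformer wrapper per function, chained in order
preprocessing = make_pipeline(
    FunctionTransformer(drop_column, kw_args={"col": "B"}),
    FunctionTransformer(multiply, kw_args={"col": "A", "multiplier": 2}),
)

data = pd.DataFrame({"A": [1, 2], "B": [5, 4], "C": [10, 20]})
result = preprocessing.fit_transform(data)
print(result)
```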
As you can see, using FunctionTransformer requires creating a separate instance for each preprocessing function and managing them individually. This approach can become cumbersome when dealing with a large number of custom functions. In contrast, the SequentialTransformer class in the Pipeline Optimizer library provides a more streamlined and efficient way to manage and apply multiple preprocessing functions.
With the Pipeline Optimizer library, you can easily define preprocessing functions and add them to the SequentialTransformer pipeline using the @add_step decorator. This approach is more concise and allows you to reuse your preprocessing functions across different projects effortlessly.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file pipeline_optimizer-0.1.7.tar.gz.
File metadata
- Download URL: pipeline_optimizer-0.1.7.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.8.8 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | b670e5ddeecf6e10de370aee05dd6a75c86be57d7a001b87488b8a2a30b96505
MD5 | 2ab6f46dd5864745456bc8e19f092312
BLAKE2b-256 | 3aa6976c146b93769af91e3e9d0b3928dad7a7f0acc054421d72d2f4494cafef
File details
Details for the file pipeline_optimizer-0.1.7-py3-none-any.whl.
File metadata
- Download URL: pipeline_optimizer-0.1.7-py3-none-any.whl
- Upload date:
- Size: 5.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.8.8 Windows/10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 14a1b8c0c45a9ee7194bb91056168dbdcd6eb2073a4477f9d4486b5c481cd687
MD5 | b42a5aa8d21db1fed190392e07d6d783
BLAKE2b-256 | 7c631fa2bebbdbf942c9b0369e0c4ff5cee1419ca1e302bb14a8ebd568ae5aa7