A Python package for generation and optimization of machine learning pipelines using an evolutionary algorithm
PipeGenie
Table of Contents
- Introduction
- Features
- Installation
- Usage
- PipeGenie's Internals
- Tutorials
- FAQ
- Contributors
- Citing PipeGenie
Introduction
PipeGenie is a powerful Python tool for automating the generation and optimization of machine learning pipelines. By leveraging grammar-guided genetic programming (G3P), PipeGenie enables both novices and experts to efficiently create robust machine learning workflows. The intuitive design ensures compatibility with the widely-used scikit-learn library and its pipeline API, making it accessible for a broad audience.
Whether you're a data scientist looking to streamline your workflow or a developer exploring machine learning, PipeGenie simplifies the process of pipeline optimization, allowing you to focus on achieving the best results. With support for both classification and regression tasks, PipeGenie is a versatile tool that can help you tackle a wide range of machine learning challenges.
Features
- Automated Pipeline Generation: Simplifies the creation of machine learning pipelines.
- Grammar-Guided Genetic Programming: Uses a set of predefined rules to evolve pipelines efficiently.
- Compatibility: Seamlessly integrates with scikit-learn.
- User-Friendly: Minimal machine learning knowledge required to get started.
- Support for Classification and Regression: Capable of handling both types of tasks.
- Future Plans: Support for time series.
Installation
Requirements
- Windows: Python ≥ 3.10, < 3.13
- macOS/Linux: Python ≥ 3.10, < 3.13
pip and setuptools should be up to date:
python -m pip install --upgrade pip setuptools
Installing the Package
Before installing PipeGenie, it is recommended to create a virtual environment to avoid conflicts with other packages. To create a virtual environment, you can use the following command:
python -m venv pipegenie-env
After creating the virtual environment, activate it (on Windows, use pipegenie-env\Scripts\activate instead):
source pipegenie-env/bin/activate
You can install PipeGenie using pip:
pip install pipegenie
You can also install PipeGenie from source by cloning the repository, navigating to the root directory of the repository, and running:
pip install .
or installing it in editable mode:
pip install -e .
If you see a Python version error, ensure your system meets the minimum requirements.
Usage
NOTE (Windows): PipeGenie uses multiprocessing to speed up the evolutionary process and an evaluation timeout to avoid long-running pipelines. When the evaluation timeout is reached, the worker process is terminated and the pipeline is considered invalid. Windows has issues terminating these processes, so warnings and errors may be raised each time a termination occurs, even though the task still completes and all related files are generated. For Windows compatibility, you also need to guard your script with a conditional block that checks whether it is being run as the main program, and call multiprocessing.freeze_support() at the very beginning of that block:
...
from sklearn.datasets import load_iris
import multiprocessing  # Import the multiprocessing module

if __name__ == '__main__':
    # Call freeze_support() first; it is a no-op outside frozen Windows executables
    multiprocessing.freeze_support()

    # Load the dataset
    X, y = load_iris(return_X_y=True)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    ...
Classification Example
You can try the following code to generate a model for a classification task using the PipegenieClassifier class:
from pipegenie.classification import PipegenieClassifier
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import balanced_accuracy_score
from sklearn.datasets import load_iris
# Load the dataset
X, y = load_iris(return_X_y=True)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Define the grammar file
grammar_file = "tutorials/sample-grammar-classification.xml"
# Create an instance of the PipegenieClassifier
model = PipegenieClassifier(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    n_jobs=5,
    seed=42,
    outdir="sample-results",
)
# Fit the classifier
model.fit(X_train, y_train)
# Predict using the classifier
y_pred = model.predict(X_test)
# Evaluate the classifier using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")
# Evaluate the classifier using another metric
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}")
You can also check the following Jupyter notebook for a more detailed example: Classification Example.
Regression Example
You can try the following code to generate a model for a regression task using the PipegenieRegressor class:
from pipegenie.regression import PipegenieRegressor
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import root_mean_squared_error
from sklearn.datasets import load_diabetes
# Load the dataset
X, y = load_diabetes(return_X_y=True)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)
# Define the grammar file
grammar_file = "tutorials/sample-grammar-regression.xml"
# Create an instance of the PipegenieRegressor
model = PipegenieRegressor(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    maximization=False,
    n_jobs=5,
    seed=9,
    outdir="sample-results",
)
# Fit the regressor
model.fit(X_train, y_train)
# Predict using the regressor
y_pred = model.predict(X_test)
# Evaluate the regressor using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")
# Evaluate the regressor using another metric
print(f"Root mean squared error: {root_mean_squared_error(y_test, y_pred)}")
You can also check the following Jupyter notebook for a more detailed example: Regression Example.
Saving and Loading Models
Once you have trained a model, you may want to save it to disk for later use. To save a model, you can use the provided save_model method:
model.save_model("model.pkl")
This method will save the internal ensemble created after the evolutionary process.
In case you want to save the model manually, you can use the following code:
import pickle
with open("model.pkl", "wb") as file:
    pickle.dump(model.ensemble, file)
To load a model, you can use the pickle module:
import pickle
with open("model.pkl", "rb") as file:
    ensemble = pickle.load(file)
PipeGenie's Internals
Grammar Definition
To use PipeGenie, you first need to define the grammar that will be used to generate the pipelines. The grammar is an XML or EBNF-like file containing the rules used to generate the pipelines.
Full examples of grammar files can be found in the tutorials folder. To modify the grammar, you can create your own grammar file or modify the existing ones. An explanation of how to change the grammar file can be found here.
XML format
The grammar file should be structured as follows:
- <grammar> </grammar>: The root element of the file. All components of the grammar must be within these tags.
- <root-symbol> </root-symbol>: The initial symbol of the grammar. Derivations start from this symbol to create an algorithm flow and its corresponding hyperparameters. Should be written within the <grammar> tags.
- <non-terminals> </non-terminals>: Contains all the production rule tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.
- <non-terminal> </non-terminal>: Specifies a context-free grammar non-terminal symbol. Should be written within the <non-terminals> tags. Its attribute is:
  - name: Name of the antecedent of the production rule.
- <production-rule> </production-rule>: Specifies the consequent of a production rule. Terminals and non-terminals can be used in the production rule and should be separated by a ;. The symbols of the consequent must use the same names as the <terminal> and <non-terminal> tags. To specify that an algorithm has hyperparameters, either terminals (a list of hyperparameter <terminal> tags separated by ;) or a non-terminal (the name of the non-terminal tag that contains the hyperparameters) should be written between parentheses (e.g., RandomForestClassifier(rf::n_estimators;rf::max_depth) or RandomForestClassifier(rf_hyperparameters)). Should be written within the <non-terminal> tags.
- <terminals> </terminals>: Contains both the algorithm and hyperparameter tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.
- <terminal> </terminal>: Specifies a terminal symbol of the grammar, which can be an algorithm or a hyperparameter. Should be written within the <terminals> tags. To define an algorithm, the attributes are:
  - name: Identifier name of the algorithm, used in its string representation.
  - type: Category of the algorithm for grouping purposes. This string can contain any value, allowing grouping of terminals with the same string value. Optional.
  - code: Package and name of the algorithm to import for the evolutionary process (e.g., sklearn.ensemble.RandomForestClassifier). Any algorithm compatible with the scikit-learn API can be used; otherwise, a custom wrapper should be created to adapt the API.
  To define a hyperparameter, the attributes are:
  - name: Identifier name of the hyperparameter. Should be written as algorithm_id::hyperparameter_name (e.g., rf::n_estimators) and must be unique for each hyperparameter. algorithm_id can be any string identifying the algorithm the hyperparameter belongs to; it avoids conflicts between hyperparameters with the same name in different algorithms. hyperparameter_name must match the hyperparameter name in the algorithm's constructor.
  - type: Data type of the hyperparameter (int, float, categorical, bool, estimator, tuple, or union). For a fixed value, use the prefix fix_.
  - lower: Lower limit for numerical hyperparameters.
  - upper: Upper limit for numerical hyperparameters.
  - log: Whether the hyperparameter should be sampled on a log scale. Only used for numerical hyperparameters.
  - step: Step used when sampling the hyperparameter. Only used for numerical hyperparameters.
  - values: Possible values for categorical hyperparameters. Only used for categorical hyperparameters.
  - estimators: Points to a <non-terminal> tag containing the algorithms that can be used as hyperparameter values. Only used for estimator hyperparameters.
  - value: Value for fixed hyperparameters. Only used for fixed hyperparameters.
  - default: Default value for the hyperparameter.
- <tuple> </tuple>: Specifies a tuple of values for a hyperparameter (e.g., MLPClassifier::hidden_layer_sizes). Should be written within a <terminal> tag whose type is tuple. Its attributes are the same as those of a hyperparameter <terminal> tag, except for the name attribute.
- <union> </union>: Specifies a union of value types that can be used for a hyperparameter (e.g., RandomForestClassifier::max_features can be a categorical, integer, or float hyperparameter). Should be written within a <terminal> tag whose type is union. Its attributes are the same as those of a hyperparameter <terminal> tag, except for the name attribute.
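Putting these elements together, a minimal grammar file might look as follows. This is an illustrative sketch only: the element structure follows the description above, but the algorithm choice, hyperparameter ranges, and defaults are placeholder values, not PipeGenie's shipped grammar; refer to the files in the tutorials folder for working grammars.

```xml
<!-- Illustrative sketch: a minimal classification grammar.
     Hyperparameter ranges and defaults are placeholder values. -->
<grammar>
  <root-symbol>workflow</root-symbol>
  <non-terminals>
    <non-terminal name="workflow">
      <production-rule>classifier</production-rule>
    </non-terminal>
    <non-terminal name="classifier">
      <production-rule>RandomForestClassifier(rf::n_estimators;rf::max_depth)</production-rule>
    </non-terminal>
  </non-terminals>
  <terminals>
    <terminal name="RandomForestClassifier" type="classifier"
              code="sklearn.ensemble.RandomForestClassifier"/>
    <terminal name="rf::n_estimators" type="int" lower="10" upper="200" default="100"/>
    <terminal name="rf::max_depth" type="int" lower="2" upper="20" default="10"/>
  </terminals>
</grammar>
```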
EBNF-like format
The main components of the grammar are described below.
- @root: Specifies the initial symbol of the grammar. Derivations start from this symbol to create an algorithm flow and its corresponding hyperparameters. The identifier that follows @root must correspond to a non-terminal rule defined in the grammar.
- @algorithm: Defines an algorithm that can be used by the evolutionary process to generate pipelines. The syntax of this directive is:
  @algorithm <name> <code> <type>
  - name: Identifier used in the grammar to reference the algorithm when defining production rules.
  - code: Full Python import path of the algorithm implementation (e.g., sklearn.ensemble.RandomForestClassifier). The algorithm must be compatible with the scikit-learn API; otherwise, a wrapper should be implemented to adapt the API.
  - type: Category used to group algorithms with similar functionality. This attribute is optional and is only used for organizational purposes.
- @hyperparameter: Defines a hyperparameter that can be optimized during the evolutionary process. The syntax of this directive is:
  @hyperparameter <name> <type> [attributes]
  - name: Identifier of the hyperparameter, in the format algorithm_id::hyperparameter_name. The algorithm_id avoids conflicts between hyperparameters with the same name in different algorithms, while hyperparameter_name must match the parameter name expected by the algorithm's constructor.
  - type: Data type of the hyperparameter. Supported types are int, float, categorical, bool, estimator, tuple, and union. For fixed values, the prefix fix_ can be used.
  - attributes: Additional key=value properties that define how the hyperparameter is sampled or configured. The following attributes can be used depending on the hyperparameter type:
    - lower: Lower bound for numerical hyperparameters.
    - upper: Upper bound for numerical hyperparameters.
    - log: Indicates that the hyperparameter should be sampled on a logarithmic scale. Only used for numerical hyperparameters.
    - step: Sampling step size for numerical hyperparameters.
    - values: Possible values for categorical hyperparameters.
    - estimators: Reference to a non-terminal rule that defines the algorithms that can be used as estimator hyperparameters.
    - value: Value of a fixed hyperparameter.
    - default: Default value of the hyperparameter.
- Non-terminal rules: Define the production rules used to generate valid pipelines. Each production represents a possible expansion of the non-terminal symbol. A non-terminal rule follows the format:
  non_terminal =
      production1
      production2
      ...
- Production rules: Specify the sequence of terminals and non-terminals that replace a non-terminal symbol. Elements within a production are separated by ;. Production rules can contain algorithm identifiers, hyperparameter identifiers, or other non-terminals defined in the grammar. To specify that an algorithm has hyperparameters, the hyperparameters must be written between parentheses after the algorithm name, e.g., algorithm(algorithm_hp); the content inside the parentheses can be a non-terminal rule that groups those hyperparameters.
- Hyperparameter groups: Hyperparameters belonging to a specific algorithm can be grouped using a non-terminal rule, allowing production rules to reference the entire hyperparameter set through a single identifier:
  algorithm_hp =
      id::hyperparameter1;id::hyperparameter2;
      ...
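As an illustrative sketch (placeholder algorithm and hyperparameter ranges, not PipeGenie's shipped grammar), a minimal grammar in this format might look like:

```text
@root workflow

@algorithm RandomForestClassifier sklearn.ensemble.RandomForestClassifier classifier
@hyperparameter rf::n_estimators int lower=10 upper=200 default=100
@hyperparameter rf::max_depth int lower=2 upper=20 default=10

workflow =
    RandomForestClassifier(rf_hp)

rf_hp =
    rf::n_estimators;rf::max_depth
```

The working grammars in the tutorials folder are the authoritative reference for the exact syntax.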
Outputs
After running the code, you will find a folder named sample-results (or the name you specified in the outdir parameter) containing the following files:
- config.txt: Contains the configuration (parameters passed to the constructor) used for the evolutionary process. An example of this file can be found here.
- best_pipeline.txt: Contains the best pipeline found by the evolutionary process, its fitness value and its predictions. An example of this file can be found here.
- ensemble.txt: Contains the ensemble of pipelines found by the evolutionary process, their individual fitness values, the global fitness value, the weights of the pipelines in the ensemble and the predictions of the ensemble. An example of this file can be found here.
- evolution.txt: Contains the evolution of the population during the evolutionary process. For each generation, it shows the number of evaluated individuals and the minimum, maximum, average, and standard deviation of the population's fitness values, the elite's fitness values, the population's pipeline sizes, and the elite's pipeline sizes. At the end of the process, it shows the total time in seconds. An example of this file can be found here.
- individuals.tsv: Contains all the individuals evaluated during the evolutionary process. For each individual, it shows the pipeline, its fitness value and the time it took to evaluate it. If the individual was invalid, it will show the reason. An example of this file can be found here.
For more information about the outputs, you can check the Understanding PipeGenie Outputs tutorial.
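Since individuals.tsv is tab-separated, it can be inspected with standard tooling. The sketch below summarizes a log of this shape using only Python's standard library; the column names (pipeline, fitness, eval_time, invalid_reason) and the inline sample are assumptions for illustration, so check your own file's header before reusing it.

```python
import csv
import io

# Inline stand-in for an individuals.tsv file; the column names are assumed
sample = (
    "pipeline\tfitness\teval_time\tinvalid_reason\n"
    "StandardScaler;SVC\t0.95\t1.2\t\n"
    "PCA;RandomForestClassifier\t0.97\t3.4\t\n"
    "MinMaxScaler;MLPClassifier\t\t\tevaluation timeout\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Keep only individuals that were evaluated successfully
valid = [r for r in rows if not r["invalid_reason"]]

# Highest-fitness individual among the valid ones
best = max(valid, key=lambda r: float(r["fitness"]))

print(f"{len(valid)}/{len(rows)} valid, best fitness {best['fitness']}")
# -> 2/3 valid, best fitness 0.97
```

To analyze a real run, replace the inline sample with open("sample-results/individuals.tsv").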
FAQ
Q: What is grammar-guided genetic programming (G3P)?
A: G3P is a form of genetic programming that uses a grammar to guide the evolution of programs, ensuring they adhere to predefined structures and rules.
Q: Can I use PipeGenie with other machine learning libraries?
A: PipeGenie is designed to work with scikit-learn, so any library that is API-compatible with scikit-learn should work. If you want to use PipeGenie with a different library, you may need to make wrapper classes to adapt the API.
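As a hypothetical sketch of such a wrapper (both class names below are made up for illustration), an adapter mainly needs the fit, predict, get_params, and set_params methods that the scikit-learn API expects:

```python
class ThirdPartyModel:
    """Stand-in for a model from a library with a non-scikit-learn API."""

    def train(self, inputs, targets):
        # Toy behavior: remember the mean of the targets
        self._mean = sum(targets) / len(targets)

    def infer(self, inputs):
        return [self._mean for _ in inputs]


class ThirdPartyWrapper:
    """Adapts ThirdPartyModel to the scikit-learn estimator interface."""

    def __init__(self):
        self._model = ThirdPartyModel()

    def fit(self, X, y):
        self._model.train(X, y)
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self._model.infer(X)

    def get_params(self, deep=True):
        return {}  # this toy wrapper exposes no hyperparameters

    def set_params(self, **params):
        return self


wrapper = ThirdPartyWrapper()
wrapper.fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(wrapper.predict([[3], [4]]))  # -> [2.0, 2.0]
```

A real wrapper would also expose its tunable hyperparameters through get_params/set_params so that the grammar can optimize them.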
Q: Why does the save_model method only save the ensemble and not the Pipegenie object?
A: The ensemble is the most important part of the Pipegenie object, as it is responsible for making predictions. The Pipegenie object is only used to create and train the ensemble, so saving it is not necessary; moreover, due to limitations in the pickle module, it is not possible to pickle the Pipegenie object directly. If you want to save the full Pipegenie object (the ensemble together with the loaded grammar and configuration), you should use another serialization library such as dill or cloudpickle.
Contributors
- Rafael Barbudo Lunar
- Mario Berrios Carmona
- Carlos García Martínez
- Vincent Gardies
- Ángel Fuentes Almoguera
- Aurora Ramírez Quesada
Citing PipeGenie
If you are using PipeGenie in your research or project, please cite using the following BibTeX entry:
@article{BARBUDO2024111292,
  title = {Grammar-based evolutionary approach for automated workflow composition with domain-specific operators and ensemble diversity},
  journal = {Applied Soft Computing},
  volume = {153},
  pages = {111292},
  year = {2024},
  issn = {1568-4946},
  doi = {10.1016/j.asoc.2024.111292},
  url = {https://www.sciencedirect.com/science/article/pii/S1568494624000668},
  author = {Rafael Barbudo and Aurora Ramírez and José Raúl Romero},
}
License
This project is distributed under the "MIT License – HUMAINS Research Group Attribution Variant", based on the MIT License with the additional requirement that proper attribution be given to the original authors and the HUMAINS Research Group (University of Córdoba, Spain). See the LICENSE file for details.