A Python package for generation and optimization of machine learning pipelines using an evolutionary algorithm
PipeGenie
Table of Contents
- Introduction
- Features
- Installation
- Usage
- PipeGenie's Internals
- Tutorials
- FAQ
- Contributors
- Citing PipeGenie
Introduction
PipeGenie is a powerful Python tool for automating the generation and optimization of machine learning pipelines. By leveraging grammar-guided genetic programming (G3P), PipeGenie enables both novices and experts to efficiently create robust machine learning workflows. The intuitive design ensures compatibility with the widely-used scikit-learn library and its pipeline API, making it accessible for a broad audience.
Whether you're a data scientist looking to streamline your workflow or a developer exploring machine learning, PipeGenie simplifies the process of pipeline optimization, allowing you to focus on achieving the best results. With support for both classification and regression tasks, PipeGenie is a versatile tool that can help you tackle a wide range of machine learning challenges.
Features
- Automated Pipeline Generation: Simplifies the creation of machine learning pipelines.
- Grammar-Guided Genetic Programming: Uses a set of predefined rules to evolve pipelines efficiently.
- Compatibility: Seamlessly integrates with scikit-learn.
- User-Friendly: Minimal machine learning knowledge required to get started.
- Support for Classification and Regression: Capable of handling both types of tasks.
- Future Plans: Support for time series.
Installation
Requirements
- Windows: Python ≥ 3.10, < 3.13
- macOS/Linux: Python ≥ 3.10, < 3.13
pip and setuptools should be up to date:
python -m pip install --upgrade pip setuptools
Installing the Package
Before installing PipeGenie, it is recommended to create a virtual environment to avoid conflicts with other packages. To create a virtual environment, you can use the following command:
python -m venv pipegenie-env
After creating the virtual environment, activate it (on Windows, use pipegenie-env\Scripts\activate instead):
source pipegenie-env/bin/activate
You can install PipeGenie using pip:
pip install pipegenie
You can also install PipeGenie from source by cloning the repository, navigating to the root directory of the repository, and running:
pip install .
or installing it in editable mode:
pip install -e .
If you see a Python version error, ensure your system meets the minimum requirements.
Usage
NOTE (Windows): PipeGenie uses multiprocessing to speed up the evolutionary process and an evaluation timeout to avoid long-running pipelines. When the evaluation timeout is reached, the worker process is terminated and the pipeline is considered invalid. Windows has issues terminating these processes, so warnings and errors may be raised each time a termination occurs, even though the task still completes and all related files are generated. For Windows compatibility, you also need to guard your script with a conditional block that checks whether it is being run as the main program, and call multiprocessing.freeze_support() at the very beginning of that block:
...
from sklearn.datasets import load_iris
import multiprocessing  # Import the multiprocessing module

if __name__ == '__main__':
    # Call freeze_support() first; it is a no-op outside frozen Windows executables
    multiprocessing.freeze_support()

    # Load the dataset
    X, y = load_iris(return_X_y=True)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    ...
Classification Example
You can try the following code to generate a model for a classification task using the PipegenieClassifier class:
from pipegenie.classification import PipegenieClassifier
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import balanced_accuracy_score
from sklearn.datasets import load_iris
# Load the dataset
X, y = load_iris(return_X_y=True)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Define the grammar file
grammar_file = "tutorials/sample-grammar-classification.xml"
# Create an instance of the PipegenieClassifier
model = PipegenieClassifier(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    n_jobs=5,
    seed=42,
    outdir="sample-results",
)
# Fit the classifier
model.fit(X_train, y_train)
# Predict using the classifier
y_pred = model.predict(X_test)
# Evaluate the classifier using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")
# Evaluate the classifier using another metric
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}")
You can also check the following Jupyter notebook for a more detailed example: Classification Example.
Regression Example
You can try the following code to generate a model for a regression task using the PipegenieRegressor class:
from pipegenie.regression import PipegenieRegressor
from pipegenie.model_selection import train_test_split
from pipegenie.metrics import root_mean_squared_error
from sklearn.datasets import load_diabetes
# Load the dataset
X, y = load_diabetes(return_X_y=True)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)
# Define the grammar file
grammar_file = "tutorials/sample-grammar-regression.xml"
# Create an instance of the PipegenieRegressor
model = PipegenieRegressor(
    grammar=grammar_file,
    generations=5,
    pop_size=50,
    elite_size=5,
    maximization=False,
    n_jobs=5,
    seed=9,
    outdir="sample-results",
)
# Fit the regressor
model.fit(X_train, y_train)
# Predict using the regressor
y_pred = model.predict(X_test)
# Evaluate the regressor using the default scoring method
print(f"Model score: {model.score(X_test, y_test)}")
# Evaluate the regressor using another metric
print(f"Root mean squared error: {root_mean_squared_error(y_test, y_pred)}")
You can also check the following Jupyter notebook for a more detailed example: Regression Example.
Saving and Loading Models
Once you have trained a model, you may want to save it to disk for later use. To save a model, you can use the provided save_model method:
model.save_model("model.pkl")
This method will save the internal ensemble created after the evolutionary process.
In case you want to save the model manually, you can use the following code:
import pickle
with open("model.pkl", "wb") as file:
    pickle.dump(model.ensemble, file)
To load a model, you can use the pickle module:
import pickle
with open("model.pkl", "rb") as file:
    ensemble = pickle.load(file)
PipeGenie's Internals
Grammar Definition
To use PipeGenie, you first need to define the grammar that will be used to generate the pipelines. The grammar is an XML or EBNF-like file containing the rules used to generate the pipelines.
Full examples of grammar files can be found in the tutorials folder. To modify the grammar, you can create your own grammar file or modify the existing ones. An explanation of how to change the grammar file can be found here.
XML format
The grammar file should be structured as follows:
- <grammar> </grammar>: The root element of the file. All components of the grammar must be within these tags.
- <root-symbol> </root-symbol>: The initial symbol of the grammar. Derivations start from this symbol to create an algorithm flow and its corresponding hyperparameters. Should be written within the <grammar> tags.
- <non-terminals> </non-terminals>: Contains all the production rule tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.
- <non-terminal> </non-terminal>: Specifies a context-free grammar non-terminal symbol. Should be written within the <non-terminals> tags. Its attribute is:
  - name: Name of the antecedent of the production rule.
- <production-rule> </production-rule>: Specifies the consequent of a production rule. Terminals and non-terminals can be used in the production rule and should be separated by a ;. The symbols of the consequent must use the same names as the <terminal> and <non-terminal> tags. To specify that an algorithm has hyperparameters, either terminals (a list of hyperparameter <terminal> tags separated by ;) or a non-terminal (the name of the non-terminal tag that contains the hyperparameters) should be written between parentheses (e.g., RandomForestClassifier(rf::n_estimators;rf::max_depth) or RandomForestClassifier(rf_hyperparameters)). Should be written within the <non-terminal> tags.
- <terminals> </terminals>: Contains both the algorithm and hyperparameter tags that can be used by the evolutionary process to generate a pipeline. Should be written within the <grammar> tags.
- <terminal> </terminal>: Specifies a terminal symbol of the grammar, which can be an algorithm or a hyperparameter. Should be written within the <terminals> tags. To define an algorithm, the attributes are:
  - name: Identifier name of the algorithm, used in its string representation.
  - type: Category of the algorithm for grouping purposes. This string can contain any value, allowing grouping of terminals with the same string value. Optional.
  - code: Package and name of the algorithm to import for the evolutionary process (e.g., sklearn.ensemble.RandomForestClassifier). Any algorithm compatible with the scikit-learn API can be used; otherwise, a custom wrapper should be created to adapt the API.
  To define a hyperparameter, the attributes are:
  - name: Identifier name of the hyperparameter. Should be written as algorithm_id::hyperparameter_name (e.g., rf::n_estimators) and must be unique for each hyperparameter. algorithm_id can be any string identifying the algorithm the hyperparameter belongs to; it avoids conflicts between hyperparameters with the same name in different algorithms. hyperparameter_name must match the hyperparameter name in the algorithm's constructor.
  - type: Data type of the hyperparameter (int, float, categorical, bool, estimator, tuple, or union). For a fixed value, use the prefix fix_.
  - lower: Lower limit for numerical hyperparameters.
  - upper: Upper limit for numerical hyperparameters.
  - log: Whether the hyperparameter should be sampled on a log scale. Only used for numerical hyperparameters.
  - step: Step used when sampling the hyperparameter. Only used for numerical hyperparameters.
  - values: Possible values for categorical hyperparameters. Only used for categorical hyperparameters.
  - estimators: Points to a <non-terminal> tag containing the algorithms that can be used as hyperparameter values. Only used for estimator hyperparameters.
  - value: Value for fixed hyperparameters. Only used for fixed hyperparameters.
  - default: Default value for the hyperparameter.
- <tuple> </tuple>: Specifies a tuple of values for a hyperparameter (e.g., MLPClassifier::hidden_layer_sizes). Should be written within a <terminal> tag whose type is tuple. Its attributes are the same as those of a hyperparameter <terminal> tag, except for the name attribute.
- <union> </union>: Specifies a union of value types that can be used for a hyperparameter (e.g., RandomForestClassifier::max_features can be a categorical, integer, or float hyperparameter). Should be written within a <terminal> tag whose type is union. Its attributes are the same as those of a hyperparameter <terminal> tag, except for the name attribute.
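Putting these elements together, a minimal grammar file might look as follows. This is an illustrative sketch only: the element structure follows the description above, but the algorithm choice, hyperparameter ranges, and defaults are placeholder values, not PipeGenie's shipped grammar; refer to the files in the tutorials folder for working grammars.

```xml
<!-- Illustrative sketch: a minimal classification grammar.
     Hyperparameter ranges and defaults are placeholder values. -->
<grammar>
  <root-symbol>workflow</root-symbol>
  <non-terminals>
    <non-terminal name="workflow">
      <production-rule>classifier</production-rule>
    </non-terminal>
    <non-terminal name="classifier">
      <production-rule>RandomForestClassifier(rf::n_estimators;rf::max_depth)</production-rule>
    </non-terminal>
  </non-terminals>
  <terminals>
    <terminal name="RandomForestClassifier" type="classifier"
              code="sklearn.ensemble.RandomForestClassifier"/>
    <terminal name="rf::n_estimators" type="int" lower="10" upper="200" default="100"/>
    <terminal name="rf::max_depth" type="int" lower="2" upper="20" default="10"/>
  </terminals>
</grammar>
```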
EBNF-like format
The main components of the grammar are described below.
- @root: Specifies the initial symbol of the grammar. Derivations start from this symbol to create an algorithm flow and its corresponding hyperparameters. The identifier that follows @root must correspond to a non-terminal rule defined in the grammar.
- @algorithm: Defines an algorithm that can be used by the evolutionary process to generate pipelines. The syntax of this directive is:
  @algorithm <name> <code> <type>
  - name: Identifier used in the grammar to reference the algorithm when defining production rules.
  - code: Full Python import path of the algorithm implementation (e.g., sklearn.ensemble.RandomForestClassifier). The algorithm must be compatible with the scikit-learn API; otherwise, a wrapper should be implemented to adapt the API.
  - type: Category used to group algorithms with similar functionality. This attribute is optional and is only used for organizational purposes.
- @hyperparameter: Defines a hyperparameter that can be optimized during the evolutionary process. The syntax of this directive is:
  @hyperparameter <name> <type> [attributes]
  - name: Identifier of the hyperparameter, in the format algorithm_id::hyperparameter_name. The algorithm_id avoids conflicts between hyperparameters with the same name in different algorithms, while hyperparameter_name must match the parameter name expected by the algorithm's constructor.
  - type: Data type of the hyperparameter. Supported types are int, float, categorical, bool, estimator, tuple, and union. For fixed values, the prefix fix_ can be used.
  - attributes: Additional key=value properties that define how the hyperparameter is sampled or configured. The following attributes can be used depending on the hyperparameter type:
    - lower: Lower bound for numerical hyperparameters.
    - upper: Upper bound for numerical hyperparameters.
    - log: Indicates that the hyperparameter should be sampled on a logarithmic scale. Only used for numerical hyperparameters.
    - step: Sampling step size for numerical hyperparameters.
    - values: Possible values for categorical hyperparameters.
    - estimators: Reference to a non-terminal rule that defines the algorithms that can be used as estimator hyperparameters.
    - value: Value of a fixed hyperparameter.
    - default: Default value of the hyperparameter.
- Non-terminal rules: Define the production rules used to generate valid pipelines. Each production represents a possible expansion of the non-terminal symbol. A non-terminal rule follows the format:
  non_terminal =
      production1
      production2
      ...
- Production rules: Specify the sequence of terminals and non-terminals that replace a non-terminal symbol. Elements within a production are separated by ;. Production rules can contain algorithm identifiers, hyperparameter identifiers, or other non-terminals defined in the grammar. To specify that an algorithm has hyperparameters, the hyperparameters must be written between parentheses after the algorithm name, e.g., algorithm(algorithm_hp); the content inside the parentheses can be a non-terminal rule that groups those hyperparameters.
- Hyperparameter groups: Hyperparameters belonging to a specific algorithm can be grouped using a non-terminal rule, allowing production rules to reference the entire hyperparameter set through a single identifier:
  algorithm_hp =
      id::hyperparameter1;id::hyperparameter2;
      ...
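As an illustrative sketch (placeholder algorithm and hyperparameter ranges, not PipeGenie's shipped grammar), a minimal grammar in this format might look like:

```text
@root workflow

@algorithm RandomForestClassifier sklearn.ensemble.RandomForestClassifier classifier
@hyperparameter rf::n_estimators int lower=10 upper=200 default=100
@hyperparameter rf::max_depth int lower=2 upper=20 default=10

workflow =
    RandomForestClassifier(rf_hp)

rf_hp =
    rf::n_estimators;rf::max_depth
```

The working grammars in the tutorials folder are the authoritative reference for the exact syntax.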
Outputs
After running the code, you will find a folder named sample-results (or the name you specified in the outdir parameter) containing the following files:
- config.txt: Contains the configuration (parameters passed to the constructor) used for the evolutionary process. An example of this file can be found here.
- best_pipeline.txt: Contains the best pipeline found by the evolutionary process, its fitness value and its predictions. An example of this file can be found here.
- ensemble.txt: Contains the ensemble of pipelines found by the evolutionary process, their individual fitness values, the global fitness value, the weights of the pipelines in the ensemble and the predictions of the ensemble. An example of this file can be found here.
- evolution.txt: Contains the evolution of the population during the evolutionary process. For each generation, it shows the number of evaluated individuals and the minimum, maximum, average, and standard deviation of the population's fitness values, the elite's fitness values, the population's pipeline sizes, and the elite's pipeline sizes. At the end of the process, it shows the total time in seconds. An example of this file can be found here.
- individuals.tsv: Contains all the individuals evaluated during the evolutionary process. For each individual, it shows the pipeline, its fitness value and the time it took to evaluate it. If the individual was invalid, it will show the reason. An example of this file can be found here.
For more information about the outputs, you can check the Understanding PipeGenie Outputs tutorial.
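Since individuals.tsv is tab-separated, it can be inspected with standard tooling. The sketch below summarizes a log of this shape using only Python's standard library; the column names (pipeline, fitness, eval_time, invalid_reason) and the inline sample are assumptions for illustration, so check your own file's header before reusing it.

```python
import csv
import io

# Inline stand-in for an individuals.tsv file; the column names are assumed
sample = (
    "pipeline\tfitness\teval_time\tinvalid_reason\n"
    "StandardScaler;SVC\t0.95\t1.2\t\n"
    "PCA;RandomForestClassifier\t0.97\t3.4\t\n"
    "MinMaxScaler;MLPClassifier\t\t\tevaluation timeout\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Keep only individuals that were evaluated successfully
valid = [r for r in rows if not r["invalid_reason"]]

# Highest-fitness individual among the valid ones
best = max(valid, key=lambda r: float(r["fitness"]))

print(f"{len(valid)}/{len(rows)} valid, best fitness {best['fitness']}")
# -> 2/3 valid, best fitness 0.97
```

To analyze a real run, replace the inline sample with open("sample-results/individuals.tsv").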
FAQ
Q: What is grammar-guided genetic programming (G3P)?
A: G3P is a form of genetic programming that uses a grammar to guide the evolution of programs, ensuring they adhere to predefined structures and rules.
Q: Can I use PipeGenie with other machine learning libraries?
A: PipeGenie is designed to work with scikit-learn, so any library that is API-compatible with scikit-learn should work. If you want to use PipeGenie with a different library, you may need to make wrapper classes to adapt the API.
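As a hypothetical sketch of such a wrapper (both class names below are made up for illustration), an adapter mainly needs the fit, predict, get_params, and set_params methods that the scikit-learn API expects:

```python
class ThirdPartyModel:
    """Stand-in for a model from a library with a non-scikit-learn API."""

    def train(self, inputs, targets):
        # Toy behavior: remember the mean of the targets
        self._mean = sum(targets) / len(targets)

    def infer(self, inputs):
        return [self._mean for _ in inputs]


class ThirdPartyWrapper:
    """Adapts ThirdPartyModel to the scikit-learn estimator interface."""

    def __init__(self):
        self._model = ThirdPartyModel()

    def fit(self, X, y):
        self._model.train(X, y)
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self._model.infer(X)

    def get_params(self, deep=True):
        return {}  # this toy wrapper exposes no hyperparameters

    def set_params(self, **params):
        return self


wrapper = ThirdPartyWrapper()
wrapper.fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(wrapper.predict([[3], [4]]))  # -> [2.0, 2.0]
```

A real wrapper would also expose its tunable hyperparameters through get_params/set_params so that the grammar can optimize them.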
Q: Why does the save_model method only save the ensemble and not the Pipegenie object?
A: The ensemble is the most important part of the Pipegenie object, as it is responsible for making predictions. The Pipegenie object is only used to create and train the ensemble, so saving it is not necessary; moreover, due to limitations in the pickle module, it is not possible to pickle the Pipegenie object directly. If you want to save the full Pipegenie object (the ensemble together with the loaded grammar and configuration), you should use another serialization library such as dill or cloudpickle.
Contributors
- Rafael Barbudo Lunar
- Mario Berrios Carmona
- Carlos García Martínez
- Vincent Gardies
- Ángel Fuentes Almoguera
- Aurora Ramírez Quesada
Citing PipeGenie
If you are using PipeGenie in your research or project, please cite using the following BibTeX entry:
@article{BARBUDO2024111292,
  title = {Grammar-based evolutionary approach for automated workflow composition with domain-specific operators and ensemble diversity},
  journal = {Applied Soft Computing},
  volume = {153},
  pages = {111292},
  year = {2024},
  issn = {1568-4946},
  doi = {10.1016/j.asoc.2024.111292},
  url = {https://www.sciencedirect.com/science/article/pii/S1568494624000668},
  author = {Rafael Barbudo and Aurora Ramírez and José Raúl Romero},
}
License
This project is distributed under the "MIT License – HUMAINS Research Group Attribution Variant", based on the MIT License with the additional requirement that proper attribution be given to the original authors and the HUMAINS Research Group (University of Córdoba, Spain). See the LICENSE file for details.