A modular preprocessing package for Pandas and Polars DataFrames

ReciPies 🥧

A declarative pipeline for reproducible ML preprocessing

Modern machine learning (ML) workflows live or die by their data-preprocessing steps. Yet in Python, a language with a rich ecosystem for data science and ML, these steps are often scattered across ad-hoc scripts or opaque Scikit-Learn (sklearn) snippets that are hard to read, audit, or reuse. ReciPies provides a concise, human-readable, and fully reproducible way to declare, execute, and share preprocessing pipelines, adhering to Configuration as Code principles. It lets users describe transformations as a recipe made of ordered steps (e.g., imputing, encoding, normalizing) applied to variables identified by semantic roles (predictor, outcome, ID, timestamp, etc.). Recipes can be prepped (trained) once and baked many times, with training data and new data kept cleanly separated, which prevents data leakage by construction.

Under the hood, ReciPies targets both Pandas and Polars backends for performance and flexibility, and it is easily extensible: users can register custom steps with minimal boilerplate. Each recipe is serializable to JSON/YAML for provenance tracking, collaboration, and publication, and integrates smoothly with downstream modeling libraries. By packaging preprocessing as clear, declarative objects, ReciPies lowers the cognitive load of feature engineering, improves reproducibility, and makes methodological choices explicit, benefiting individual researchers, engineering teams, and peer reviewers alike.

The backend can be either a Polars or a Pandas DataFrame. The design of this package is inspired by the R package recipes. Please check the documentation for more details.

Installation

Using pip

You can install ReciPies from pip using:

pip install recipies

Using uv

You can install ReciPies using uv (the unified package manager) with the following command:

uv add recipies

Note that the package is called recipies on pip.
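
To confirm that the distribution is installed under the name recipies, a small check along these lines may help. It relies only on the standard library; the helper name installed_version is made up for this illustration and is not part of ReciPies.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "recipies" is the distribution name used on pip.
# Prints a version string if the package is installed, otherwise None.
print(installed_version("recipies"))
```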

Developer / Editable install

# with conda (optional)
conda env update -f environment.yml
conda activate ReciPies
# with pip
pip install -e .
# with uv (create and activate a virtual environment, then install into it)
uv venv && source .venv/bin/activate
uv pip install -e .

Getting Started

Here's a simple example of using ReciPies:

# Import necessary libraries
import polars as pl
import numpy as np
from datetime import datetime, MINYEAR
from recipies import Ingredients, Recipe
from recipies.selector import all_numeric_predictors, all_predictors
from recipies.step import StepSklearn, StepHistorical, Accumulator, StepImputeFill
from sklearn.impute import MissingIndicator

# Set up random state for reproducible results
rand_state = np.random.RandomState(42)

# Create time columns for two different groups
timecolumn = pl.concat(
    [
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True),
    ]
)

# Create sample DataFrame
df = pl.DataFrame(
    {
        "id": [1] * 6 + [2] * 4,
        "time": timecolumn,
        "y": rand_state.normal(size=(10,)),
        "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
        "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
        "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
        "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
    }
)

# Introduce some missing values
df = df.with_columns(pl.when(pl.int_range(pl.len()).is_in([1, 2, 4, 7])).then(None).otherwise(pl.col("x1")).alias("x1"))

df2 = df.clone()

# Create Ingredients and Recipe
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])

rec.add_step(StepSklearn(MissingIndicator(features="all"), sel=all_predictors()))
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))

# Apply the recipe to the ingredients
df = rec.prep()

# Apply the recipe to a new DataFrame (e.g., test set)
df2 = rec.bake(df2)
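
To make the last two steps of the example more concrete, here is a plain-Python sketch of what a group-wise forward fill followed by a running ("historical") mean might compute. This is an illustration under assumptions, not ReciPies' implementation: it assumes that both steps operate within each group given by groups=["id"], and that Accumulator.MEAN yields the mean of all values seen so far in the sequence. The helper names forward_fill and running_mean are hypothetical.

```python
from itertools import groupby


def forward_fill(values):
    """Replace each None with the most recent non-missing value, if any."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def running_mean(values):
    """At each position, the mean of all non-missing values seen so far."""
    out, total, count = [], 0.0, 0
    for v in values:
        if v is not None:
            total += v
            count += 1
        out.append(total / count if count else None)
    return out


# Toy data: rows are assumed to be sorted by id and then by time.
ids = [1, 1, 1, 2, 2]
x1 = [4.0, None, 8.0, None, 6.0]

filled, hist = [], []
for _, rows in groupby(zip(ids, x1), key=lambda r: r[0]):
    group_values = forward_fill([v for _, v in rows])
    filled.extend(group_values)              # values never cross group boundaries
    hist.extend(running_mean(group_values))  # running mean restarts per group
```

In this sketch the first value of group 2 stays missing, because there is no earlier observation within that group to carry forward.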

Core Concepts

Below is a schematic overview of ReciPies' architecture. We 1) load a Pandas or Polars (training) DataFrame, then 2) wrap it in an Ingredients object that maintains column role information (i.e., what each column does in the dataset). Next, we 3) define a Recipe consisting of multiple Steps that operate on selected columns. Finally, we 4) prep the Recipe on the training data and 5) bake it on new data. We can then 6) run our ML pipeline on the train and test data.

[Figure: ReciPies flowchart]

The main building blocks of ReciPies are:
  • Ingredients: A wrapper around DataFrames that maintains column role information, ensuring data semantics are preserved during transformations.
  • Recipe: A collection of processing steps that can be applied to Ingredients objects to create reproducible data pipelines.
  • Step: Individual data transformation operations that understand column roles and can work with both Polars and Pandas backends.
  • Selector: Utilities for selecting columns based on their roles or other criteria.
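
The separation between prep and bake can be illustrated with a toy, standard-library-only sketch. It is not ReciPies' code: the classes ToyRecipe and ScaleStep and their fit/transform methods are hypothetical stand-ins, meant only to show how parameters learned from training data are reused, unchanged, on new data.

```python
from statistics import mean, pstdev


class ScaleStep:
    """Toy standardization step: learns mean/std in fit, reuses them in transform."""

    def __init__(self, column):
        self.column = column
        self.mean_ = None
        self.std_ = None

    def fit(self, rows):
        values = [r[self.column] for r in rows]
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid division by zero

    def transform(self, rows):
        return [{**r, self.column: (r[self.column] - self.mean_) / self.std_} for r in rows]


class ToyRecipe:
    """Ordered list of steps, fitted once on training data."""

    def __init__(self, train_rows):
        self.train_rows = train_rows
        self.steps = []

    def add_step(self, step):
        self.steps.append(step)
        return self

    def prep(self):
        # Fit every step on the training data and return the transformed rows.
        rows = self.train_rows
        for step in self.steps:
            step.fit(rows)
            rows = step.transform(rows)
        return rows

    def bake(self, new_rows):
        # Apply the already-fitted steps; nothing is re-estimated from new data.
        rows = new_rows
        for step in self.steps:
            rows = step.transform(rows)
        return rows


train = [{"x1": 2.0}, {"x1": 4.0}, {"x1": 6.0}]
test = [{"x1": 10.0}]

toy = ToyRecipe(train).add_step(ScaleStep("x1"))
train_out = toy.prep()
test_out = toy.bake(test)
```

Because bake never calls fit, statistics of the test rows cannot influence the transformation, which is the sense in which leakage is prevented by construction.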

Backend Support

ReciPies supports both Polars and Pandas backends:

  • Polars: High-performance DataFrame library with lazy evaluation
  • Pandas: Traditional DataFrame library with extensive ecosystem support

The package automatically detects the backend and provides a consistent API regardless of the underlying DataFrame implementation.

Examples

Check out the examples/ directory for Jupyter notebooks demonstrating various use cases of ReciPies. Check out the benchmarks/ directory for performance comparisons between Polars and Pandas backends.

Contributing

Contributions are welcome! Please see our contributing guidelines and open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.

How to cite

If you use this code in your research, please cite the following publication, which uses ReciPies extensively to create a customisable preprocessing pipeline (a standalone paper is in preparation):

@inproceedings{vandewaterYetAnotherICUBenchmark2024,
  title = {Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML},
  shorttitle = {Yet Another ICU Benchmark},
  booktitle = {The Twelfth International Conference on Learning Representations},
  author = {van de Water, Robin and Schmidt, Hendrik Nils Aurel and Elbers, Paul and Thoral, Patrick and Arnrich, Bert and Rockenschaub, Patrick},
  year = {2024},
  month = oct,
  urldate = {2024-02-19},
  langid = {english},
}

This paper can also be found on arXiv.
