A modular preprocessing package for Pandas and Polars DataFrames
ReciPies 🥧
A declarative pipeline for reproducible ML preprocessing
Modern machine learning (ML) workflows live or die by their data‑preprocessing steps, yet in Python—a language with a
rich ecosystem for data science and ML—these steps are often scattered across ad‑hoc scripts or opaque Scikit-Learn
(sklearn) snippets that are hard to read, audit, or reuse. ReciPies provides a concise, human‑readable, and fully
reproducible way to declare, execute, and share preprocessing pipelines, adhering to Configuration as Code principles.
It lets users describe transformations as a recipe made of ordered steps (e.g., imputing, encoding, normalizing)
applied to variables identified by semantic roles (predictor, outcome, ID, time stamp, etc.). Recipes can be prepped
(trained) once, baked many times, and cleanly separated between training and new data—preventing data leakage by
construction. Under the hood, ReciPies targets both Pandas and Polars backends for performance and flexibility, and
it is easily extensible: users can register custom steps with minimal boilerplate. Each recipe is serializable to
JSON/YAML for provenance tracking, collaboration, and publication, and integrates smoothly with downstream modeling
libraries. By packaging preprocessing as clear, declarative objects, ReciPies lowers the cognitive load of feature
engineering, improves reproducibility, and makes methodological choices explicit, benefiting individual researchers,
engineering teams, and peer reviewers alike.
ReciPies works with either Polars or Pandas DataFrames as its backend. Its design is inspired by the R package recipes. Please check the documentation for more details.
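The prep-once, bake-many idea described above can be illustrated without ReciPies at all. The following is a minimal plain-Python sketch, not the library's implementation; the class `ToyScaleStep` and its methods are hypothetical names chosen for illustration. It learns scaling statistics from the training data only and reuses them on new data, which is what keeps test-set information out of the fitted parameters.

```python
from statistics import mean, pstdev


class ToyScaleStep:
    """Hypothetical standardization step: fit on training data, reuse on new data."""

    def __init__(self):
        self.center = None
        self.scale = None

    def prep(self, values):
        # Learn parameters from the training values only, then transform them.
        self.center = mean(values)
        self.scale = pstdev(values) or 1.0  # fall back to 1.0 for constant columns
        return self.bake(values)

    def bake(self, values):
        # Apply the already-learned parameters; nothing is re-estimated here.
        if self.center is None:
            raise RuntimeError("call prep() on training data before bake()")
        return [(v - self.center) / self.scale for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

step = ToyScaleStep()
train_scaled = step.prep(train)  # statistics come from `train` only
test_scaled = step.bake(test)    # `test` never influences center or scale
```

Because `bake` only reads the stored parameters, the same fitted step can be applied to any number of new datasets with identical results for identical inputs.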
Installation
Using pip
You can install ReciPies from pip using:
pip install recipies
Using uv
You can add ReciPies to a project managed with uv (a Python package and project manager) using the following command:
uv add recipies
Note that the package is called recipies on pip.
Developer / Editable install
# with conda (optional)
conda create -n ReciPies python=3.12
conda activate ReciPies
Then, from the root of the repository, run with pip:
pip install -e .
Or with uv (if you have uv installed):
uv venv && source .venv/bin/activate
uv pip install -e .
Getting Started
Here's a simple example of using ReciPies:
# Import necessary libraries
import polars as pl
import numpy as np
from datetime import datetime, MINYEAR
from recipies import Ingredients, Recipe
from recipies.selector import all_numeric_predictors, all_predictors
from recipies.step import StepSklearn, StepHistorical, Accumulator, StepImputeFill
from sklearn.impute import MissingIndicator
# Set up random state for reproducible results
rand_state = np.random.RandomState(42)
# Create time columns for two different groups
timecolumn = pl.concat(
    [
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True),
    ]
)
# Create sample DataFrame
df = pl.DataFrame(
    {
        "id": [1] * 6 + [2] * 4,
        "time": timecolumn,
        "y": rand_state.normal(size=(10,)),
        "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
        "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
        "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
        "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
    }
)
# Introduce some missing values
df = df.with_columns(pl.when(pl.int_range(pl.len()).is_in([1, 2, 4, 7])).then(None).otherwise(pl.col("x1")).alias("x1"))
df2 = df.clone()
# Create Ingredients and Recipe
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])
rec.add_step(StepSklearn(MissingIndicator(features="all"), sel=all_predictors()))
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix="mean_hist"))
# Apply the recipe to the ingredients
df = rec.prep()
# Apply the recipe to a new DataFrame (e.g., test set)
df2 = rec.bake(df2)
Core Concepts
The overall workflow of ReciPies is as follows. We 1) load a Pandas or Polars (training) DataFrame, then 2) wrap it in an Ingredients object that maintains column role information (i.e., what purpose each column serves in the dataset). Next, we 3) define a Recipe consisting of multiple Steps that operate on selected columns. Finally, we 4) prep the Recipe on the training data and 5) bake it on new data. We can then 6) run our ML pipeline on the train and test data.
- Ingredients: A wrapper around DataFrames that maintains column role information, ensuring data semantics are preserved during transformations.
- Recipe: A collection of processing steps that can be applied to Ingredients objects to create reproducible data pipelines.
- Step: Individual data transformation operations that understand column roles and can work with both Polars and Pandas backends.
- Selector: Utilities for selecting columns based on their roles or other criteria.
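To make the role concept concrete, here is a small plain-Python sketch of role-based column selection. It only mimics the idea: the dictionary layout and the helper `select_by_role` are assumptions made for illustration and are not the ReciPies `Ingredients` or selector API. The column names follow the Getting Started example.

```python
# Hypothetical role table for the columns used in the Getting Started example.
roles = {
    "id": "group",
    "time": "sequence",
    "y": "outcome",
    "x1": "predictor",
    "x2": "predictor",
    "x3": "predictor",
    "x4": "predictor",
}

# Hypothetical column types, used to narrow a selection further.
dtypes = {"id": int, "time": str, "y": float, "x1": float, "x2": int, "x3": str, "x4": str}


def select_by_role(roles, role, dtypes=None, numeric_only=False):
    """Return column names with the given role, optionally numeric ones only."""
    selected = [col for col, r in roles.items() if r == role]
    if numeric_only:
        selected = [col for col in selected if dtypes[col] in (int, float)]
    return selected


predictors = select_by_role(roles, "predictor")
numeric_predictors = select_by_role(roles, "predictor", dtypes, numeric_only=True)
```

Selecting by role rather than by name means a step definition does not need to change when columns are added or renamed, as long as their roles are recorded.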
Backend Support
ReciPies supports both Polars and Pandas backends:
- Polars: High-performance DataFrame library with lazy evaluation
- Pandas: Traditional DataFrame library with extensive ecosystem support
The package automatically detects the backend and provides a consistent API regardless of the underlying DataFrame implementation.
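One common way to offer a single API over several DataFrame libraries is to inspect the type of the incoming object and dispatch accordingly. The sketch below shows that general pattern using only the standard library; `detect_backend` is a hypothetical helper, and this is not necessarily how ReciPies detects the backend internally.

```python
def detect_backend(obj):
    """Guess the DataFrame library from the top-level module of the object's type."""
    top_level = type(obj).__module__.split(".")[0]
    if top_level in ("pandas", "polars"):
        return top_level
    raise TypeError(f"unsupported DataFrame type: {type(obj)!r}")
```

With the real libraries installed, passing a `polars.DataFrame` or a `pandas.DataFrame` should yield "polars" or "pandas" respectively, since both types are defined in modules under those top-level packages. Checking the module name avoids importing a library the user has not installed.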
Examples
Check out the examples/ directory for Jupyter notebooks demonstrating various use cases of ReciPies.
Check out the benchmarks/ directory for performance comparisons between Polars and Pandas backends.
Contributing
Contributions are welcome! Please see our contributing guidelines and open an issue or submit a pull request on the GitHub repository.
License
This project is licensed under the MIT License. See the LICENSE file for details.
How to cite
If you use this code in your research, please cite the following publication, which uses ReciPys extensively to create a customisable preprocessing pipeline (a standalone paper is in preparation):
@inproceedings{vandewaterYetAnotherICUBenchmark2024,
title = {Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML},
shorttitle = {Yet Another ICU Benchmark},
booktitle = {The Twelfth International Conference on Learning Representations},
author = {van de Water, Robin and Schmidt, Hendrik Nils Aurel and Elbers, Paul and Thoral, Patrick and Arnrich, Bert and Rockenschaub, Patrick},
year = {2024},
month = oct,
urldate = {2024-02-19},
langid = {english},
}
This paper can also be found on arXiv.