A Python package for preprocessing tabular data
📦 pretab
pretab is a modular, extensible, and scikit-learn-compatible preprocessing library for tabular data. It supports all sklearn transformers out of the box, and extends functionality with a rich set of custom encoders, splines, and neural basis expansions.
✨ Features
- 🔢 Numerical preprocessing via:
  - Polynomial and spline expansions: B-splines, natural cubic splines, thin plate splines, tensor product splines, P-splines
  - Neural-inspired basis: RBF, ReLU, Sigmoid, Tanh
  - Custom binning: rule-based or tree-based
  - Piecewise Linear Encoding (PLE)
- 🌤 Categorical preprocessing:
  - Ordinal encodings
  - One-hot encodings
  - Language embeddings (pretrained vectorizers)
  - Custom encoders like OneHotFromOrdinalTransformer
- 🔧 Composable pipeline interface:
  - Fully compatible with sklearn.pipeline.Pipeline and sklearn.compose.ColumnTransformer
  - Accepts all sklearn-native transformers and parameters seamlessly
- 🧠 Smart preprocessing:
  - Automatically detects feature types (categorical vs numerical)
  - Supports both pandas.DataFrame and numpy.ndarray inputs
- 🧪 Comprehensive test coverage
- 🤝 Community-driven and open to contributions
💠 Installation
Install via pip:
pip install pretab
Or install in editable mode for development:
git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e .
🚀 Quickstart
import pandas as pd
import numpy as np
from pretab.preprocessor import Preprocessor
# Simulated tabular dataset
df = pd.DataFrame({
    "age": np.random.randint(18, 65, size=100),
    "income": np.random.normal(60000, 15000, size=100).astype(int),
    "job": np.random.choice(["nurse", "engineer", "scientist", "teacher", "artist", "manager"], size=100),
    "city": np.random.choice(["Berlin", "Munich", "Hamburg", "Cologne"], size=100),
    "experience": np.random.randint(0, 40, size=100),
})
y = np.random.randn(100, 1)
# Optional feature-specific preprocessing config
config = {
    "age": "ple",
    "income": "rbf",
    "experience": "quantile",
    "job": "one-hot",
    "city": "none",
}
# Initialize Preprocessor
preprocessor = Preprocessor(
    feature_preprocessing=config,
    task="regression",
)
# Fit and transform the data into a dictionary of feature arrays
X_dict = preprocessor.fit_transform(df, y)
# Optionally get a stacked array instead of a dictionary
X_array = preprocessor.transform(df, return_array=True)
# Get feature metadata
preprocessor.get_feature_info(verbose=True)
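To make the difference between the dictionary output and the stacked array concrete, here is a minimal NumPy sketch. It does not call pretab: the dictionary below is a hypothetical stand-in, and it assumes each feature maps to a 2-D array with one row per sample. Whether return_array=True orders columns exactly this way is also an assumption.

```python
import numpy as np

# Hypothetical stand-in for a dictionary of per-feature arrays
# (assumption: each value has shape (n_samples, n_columns_for_feature)).
X_dict = {
    "age": np.zeros((4, 3)),
    "income": np.ones((4, 2)),
    "city": np.full((4, 1), 2.0),
}

# Stack the per-feature blocks side by side into one design matrix.
X_stacked = np.concatenate(list(X_dict.values()), axis=1)

# Record which column range each feature occupies in the stacked array.
offsets = {}
start = 0
for name, block in X_dict.items():
    offsets[name] = (start, start + block.shape[1])
    start += block.shape[1]

print(X_stacked.shape)
print(offsets)
```

Keeping the per-feature column ranges around is handy when a downstream model needs to treat each feature's expansion as a group.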
🪰 Included Transformers
pretab includes both sklearn-native and custom-built transformers:
🌈 Splines
- CubicSplineTransformer
- NaturalCubicSplineTransformer
- PSplineTransformer
- TensorProductSplineTransformer
- ThinPlateSplineTransformer
🧠 Feature Maps
- RBFExpansionTransformer
- ReLUExpansionTransformer
- SigmoidExpansionTransformer
- TanhExpansionTransformer
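For intuition about what a feature map of this kind computes, the following is a minimal NumPy sketch of a Gaussian RBF expansion. It is not pretab's implementation: the function name rbf_expand, the evenly spaced centers, and the gamma parameter are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_expand(x, centers, gamma=1.0):
    """Map a 1-D feature to Gaussian bumps, one column per center."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)              # (n_samples, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)  # (1, n_centers)
    # Broadcasting yields an (n_samples, n_centers) matrix.
    return np.exp(-gamma * (x - centers) ** 2)

x = np.array([0.0, 0.5, 1.0])
centers = np.linspace(0.0, 1.0, 3)  # assumed: evenly spaced centers
features = rbf_expand(x, centers, gamma=4.0)
print(features.shape)
```

Each column peaks at 1 where the input equals that column's center and decays with distance, so a linear model on these columns can represent smooth nonlinear effects of the original feature.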
📊 Encodings and Binning
- PLETransformer
- CustomBinTransformer
- OneHotFromOrdinalTransformer
- ContinuousOrdinalTransformer
- LanguageEmbeddingTransformer
🔧 Utilities
- NoTransformer
- ToFloatTransformer

Plus: any sklearn transformer can be passed directly with full support for hyperparameters.
Using Transformers
The transformers follow the standard sklearn.preprocessing fit/transform workflow. For example, using PLE:
import numpy as np
from pretab.transformers import PLETransformer
x = np.random.randn(100, 1)
y = np.random.randn(100, 1)
x_ple = PLETransformer(n_bins=15, task="regression").fit_transform(x, y)
assert x_ple.shape[1] == 15
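To show what a piecewise linear encoding produces, here is a minimal NumPy sketch of a clipped variant. It is independent of PLETransformer: the function name and the hand-picked bin edges are assumptions, whereas the transformer above takes a task argument and may derive its bins from the target.

```python
import numpy as np

def piecewise_linear_encode(x, edges):
    """Encode a 1-D feature into len(edges) - 1 columns.

    For each bin, the value is 0 below the bin, 1 above it, and the
    fractional position inside it. Edges must be strictly increasing.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # (n_samples, 1)
    edges = np.asarray(edges, dtype=float)
    lower = edges[:-1].reshape(1, -1)               # (1, n_bins)
    width = np.diff(edges).reshape(1, -1)           # (1, n_bins)
    return np.clip((x - lower) / width, 0.0, 1.0)

edges = np.array([0.0, 1.0, 2.0, 4.0])  # assumed, hand-picked bin edges
encoded = piecewise_linear_encode([0.5, 3.0, 5.0], edges)
print(encoded)
# [[0.5 0.  0. ]
#  [1.  1.  0.5]
#  [1.  1.  1. ]]
```

With n_bins bins the output has n_bins columns, which matches the shape check in the example above.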
For splines, the penalty matrices can be extracted via .get_penalty_matrix():
import numpy as np
from pretab.transformers import ThinPlateSplineTransformer
x = np.random.randn(100, 1)
tp = ThinPlateSplineTransformer(n_basis=15)
x_tp = tp.fit_transform(x)
assert x_tp.shape[1] == 15
penalty = tp.get_penalty_matrix()
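A penalty matrix is typically used in a penalized least-squares fit, where the coefficients solve (BᵀB + λP)β = Bᵀy. The sketch below illustrates that use with NumPy only. It makes several assumptions: the design matrix B is random stand-in data rather than a real spline basis, P is a second-order difference penalty in the P-spline style (not the thin plate penalty), and the penalty is taken to be a square matrix with one row per basis function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_basis = 100, 15
B = rng.normal(size=(n_samples, n_basis))  # stand-in for a spline design matrix
y = rng.normal(size=n_samples)

# Second-order difference penalty: P = D^T D, with D of shape (n_basis - 2, n_basis).
D = np.diff(np.eye(n_basis), n=2, axis=0)
P = D.T @ D

def penalized_fit(B, y, P, lam):
    """Solve (B^T B + lam * P) beta = B^T y."""
    return np.linalg.solve(B.T @ B + lam * P, B.T @ y)

def roughness(beta, P):
    """Quadratic penalty beta^T P beta."""
    return float(beta @ P @ beta)

beta_unpenalized = penalized_fit(B, y, P, lam=0.0)
beta_smooth = penalized_fit(B, y, P, lam=100.0)
print(roughness(beta_unpenalized, P), roughness(beta_smooth, P))
```

Increasing λ trades goodness of fit for smoother coefficients, so the penalty term of the λ = 100 solution is never larger than that of the unpenalized one.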
🧪 Running Tests
pytest --maxfail=2 --disable-warnings -v
🤝 Contributing
pretab is community-driven! Whether you’re fixing bugs, adding new encoders, or improving the docs — contributions are welcome.
git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e ".[dev]"
Then create a pull request 🚀
📄 License
MIT License. See LICENSE for details.
❤️ Acknowledgements
pretab builds on the strengths of: