
A Python package for preprocessing tabular data


📦 pretab

pretab is a modular, extensible, and scikit-learn-compatible preprocessing library for tabular data. It supports all sklearn transformers out of the box, and extends functionality with a rich set of custom encoders, splines, and neural basis expansions.


✨ Features

  • 🔢 Numerical preprocessing via:

    • Polynomial and spline expansions: B-splines, natural cubic splines, thin plate splines, tensor product splines, P-splines
    • Neural-inspired basis: RBF, ReLU, Sigmoid, Tanh
    • Custom binning: rule-based or tree-based
    • Piecewise Linear Encoding (PLE)
  • 🌤 Categorical preprocessing:

    • Ordinal encodings
    • One-hot encodings
    • Language embeddings (pretrained vectorizers)
    • Custom encoders like OneHotFromOrdinalTransformer
  • 🔧 Composable pipeline interface:

    • Fully compatible with sklearn.pipeline.Pipeline and sklearn.compose.ColumnTransformer
    • Accepts all sklearn-native transformers and parameters seamlessly
  • 🧠 Smart preprocessing:

    • Automatically detects feature types (categorical vs numerical)
    • Supports both pandas.DataFrame and numpy.ndarray inputs
  • 🧪 Comprehensive test coverage

  • 🤝 Community-driven and open to contributions


💠 Installation

Install via pip:

pip install pretab

Or install in editable mode for development:

git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e .

🚀 Quickstart

import pandas as pd
import numpy as np
from pretab.preprocessor import Preprocessor

# Simulated tabular dataset
df = pd.DataFrame({
    "age": np.random.randint(18, 65, size=100),
    "income": np.random.normal(60000, 15000, size=100).astype(int),
    "job": np.random.choice(["nurse", "engineer", "scientist", "teacher", "artist", "manager"], size=100),
    "city": np.random.choice(["Berlin", "Munich", "Hamburg", "Cologne"], size=100),
    "experience": np.random.randint(0, 40, size=100)
})

y = np.random.randn(100, 1)

# Optional feature-specific preprocessing config
config = {
    "age": "ple",
    "income": "rbf",
    "experience": "quantile",
    "job": "one-hot",
    "city": "none"
}

# Initialize Preprocessor
preprocessor = Preprocessor(
    feature_preprocessing=config,
    task="regression"
)

# Fit and transform the data into a dictionary of feature arrays
X_dict = preprocessor.fit_transform(df, y)

# Optionally get a stacked array instead of a dictionary
X_array = preprocessor.transform(df, return_array=True)

# Get feature metadata
preprocessor.get_feature_info(verbose=True)

🧰 Included Transformers

pretab includes both sklearn-native and custom-built transformers:

🌈 Splines

  • CubicSplineTransformer
  • NaturalCubicSplineTransformer
  • PSplineTransformer
  • TensorProductSplineTransformer
  • ThinPlateSplineTransformer

🧠 Feature Maps

  • RBFExpansionTransformer
  • ReLUExpansionTransformer
  • SigmoidExpansionTransformer
  • TanhExpansionTransformer

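Conceptually, these feature maps expand a scalar input into several basis activations. A minimal numpy sketch of an RBF expansion (illustrative only, not pretab's internal implementation; the centers and bandwidth are assumptions):

```python
import numpy as np

def rbf_expand(x, centers, gamma=1.0):
    """Map each scalar in x to exp(-gamma * (x - c)^2) for each center c."""
    x = np.asarray(x).reshape(-1, 1)              # (n, 1)
    centers = np.asarray(centers).reshape(1, -1)  # (1, k)
    return np.exp(-gamma * (x - centers) ** 2)    # (n, k)

x = np.linspace(-2, 2, 100)
centers = np.linspace(-2, 2, 15)  # 15 basis functions
X_rbf = rbf_expand(x, centers, gamma=2.0)
print(X_rbf.shape)  # (100, 15)
```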
📊 Encodings and Binning

  • PLETransformer
  • CustomBinTransformer
  • OneHotFromOrdinalTransformer
  • ContinuousOrdinalTransformer
  • LanguageEmbeddingTransformer

🔧 Utilities

  • NoTransformer
  • ToFloatTransformer

Plus: any sklearn transformer can be passed directly with full support for hyperparameters.
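This interoperability follows from the standard estimator contract: any object exposing `fit` and `transform` (typically via `BaseEstimator` and `TransformerMixin`) plugs in. A hypothetical minimal transformer, shown only to illustrate that contract:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogShiftTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: log(x + shift), just to show the contract."""

    def __init__(self, shift=1.0):
        self.shift = shift

    def fit(self, X, y=None):
        return self  # stateless; nothing to learn

    def transform(self, X):
        return np.log(np.asarray(X, dtype=float) + self.shift)

X = np.array([[0.0], [1.0], [np.e - 1.0]])
print(LogShiftTransformer().fit_transform(X).round(3))  # [[0.], [0.693], [1.]]
```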

Using Transformers

Using the transformers follows the standard sklearn.preprocessing workflow, e.g. fitting a PLETransformer:

import numpy as np
from pretab.transformers import PLETransformer

x = np.random.randn(100, 1)
y = np.random.randn(100, 1)

x_ple = PLETransformer(n_bins=15, task="regression").fit_transform(x, y)

assert x_ple.shape[1] == 15

For splines, the penalty matrix can be extracted via .get_penalty_matrix():

import numpy as np
from pretab.transformers import ThinPlateSplineTransformer

x = np.random.randn(100, 1)

tp = ThinPlateSplineTransformer(n_basis=15)

x_tp = tp.fit_transform(x)

assert x_tp.shape[1] == 15

penalty = tp.get_penalty_matrix()

🧪 Running Tests

pytest --maxfail=2 --disable-warnings -v

🤝 Contributing

pretab is community-driven! Whether you’re fixing bugs, adding new encoders, or improving the docs — contributions are welcome.

git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e ".[dev]"

Then create a pull request 🚀


📄 License

MIT License. See LICENSE for details.


❤️ Acknowledgements

pretab builds on the strengths of:

