
A Python package for preprocessing tabular data


📦 pretab

pretab is a modular, extensible, and scikit-learn-compatible preprocessing library for tabular data. It supports all sklearn transformers out of the box, and extends functionality with a rich set of custom encoders, splines, and neural basis expansions.


✨ Features

  • 🔢 Numerical preprocessing via:

    • Polynomial and spline expansions: B-splines, natural cubic splines, thin plate splines, tensor product splines, P-splines
    • Neural-inspired basis: RBF, ReLU, Sigmoid, Tanh
    • Custom binning: rule-based or tree-based
    • Piecewise Linear Encoding (PLE)
  • 🌤 Categorical preprocessing:

    • Ordinal encodings
    • One-hot encodings
    • Language embeddings (pretrained vectorizers)
    • Custom encoders like OneHotFromOrdinalTransformer
  • 🔧 Composable pipeline interface:

    • Fully compatible with sklearn.pipeline.Pipeline and sklearn.compose.ColumnTransformer
    • Accepts all sklearn-native transformers and parameters seamlessly
  • 🧠 Smart preprocessing:

    • Automatically detects feature types (categorical vs numerical)
    • Supports both pandas.DataFrame and numpy.ndarray inputs
  • 🧪 Comprehensive test coverage

  • 🤝 Community-driven and open to contributions
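To give a feel for what the numerical preprocessing above produces, here is rule-based quantile binning with one-hot bin membership in plain numpy. This is an illustration of the idea only, not pretab's implementation (see CustomBinTransformer below for the real thing):

```python
import numpy as np

def quantile_bin_onehot(x, n_bins=4):
    """Rule-based binning: quantile bin edges, then one-hot bin membership."""
    x = np.asarray(x, dtype=float).ravel()
    # Interior edges at the 1/n_bins, 2/n_bins, ... quantiles
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    ids = np.digitize(x, edges)          # bin index per sample
    return np.eye(n_bins)[ids]           # one column per bin

x = np.random.default_rng(0).normal(size=100)
X = quantile_bin_onehot(x, n_bins=4)
print(X.shape)  # (100, 4)
```

Each row is all zeros except a single 1 marking the bin the sample falls into.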


💠 Installation

Install via pip:

pip install pretab

Or install in editable mode for development:

git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e .

🚀 Quickstart

import pandas as pd
from pretab import Preprocessor

df = pd.DataFrame({
    "age": [22, 35, 46, 59],
    "income": [40000, 52000, 98000, 87000],
    "job": ["nurse", "engineer", "scientist", "teacher"]
})

# Optional feature-specific config
config = {
    "age": "ple",
    "income": "rbf",
    "job": "one-hot"
}

preprocessor = Preprocessor(
    feature_preprocessing=config,
    task="regression"
)

# Fit and transform
X_dict = preprocessor.fit_transform(df)

# Optionally get stacked array
X_array = preprocessor.transform(df, return_dict=False)

# Get feature info
preprocessor.get_feature_info()

🪰 Included Transformers

pretab includes both sklearn-native and custom-built transformers:

🌈 Splines

  • CubicSplineTransformer
  • NaturalCubicSplineTransformer
  • PSplineTransformer
  • TensorProductSplineTransformer
  • ThinPlateSplineTransformer
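As intuition for what these spline transformers produce, here is a plain-numpy truncated-power cubic spline basis. This sketches the general technique only; pretab's transformers may differ in basis choice and knot placement:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic basis: 1, x, x^2, x^3, plus (x - k)_+^3 per knot."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    powers = np.hstack([x ** p for p in range(4)])
    truncated = np.clip(x - np.asarray(knots), 0.0, None) ** 3
    return np.hstack([powers, truncated])

x = np.linspace(0, 1, 50)
B = cubic_spline_basis(x, knots=[0.25, 0.5, 0.75])
print(B.shape)  # (50, 7): 4 polynomial columns + 3 truncated columns
```

A downstream linear model on B can then fit a smooth curve that is piecewise cubic between the knots.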

🧠 Feature Maps

  • RBFExpansionTransformer
  • ReLUExpansionTransformer
  • SigmoidExpansionTransformer
  • TanhExpansionTransformer
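These feature maps share one pattern: each scalar is compared against a set of fixed centers or thresholds through an activation. A minimal numpy sketch of the RBF case (illustrative only; pretab's transformer will differ in how centers and widths are chosen):

```python
import numpy as np

def rbf_expand(x, centers, gamma=1.0):
    """Map each scalar to one feature per center: exp(-gamma * (x - c)^2)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-gamma * (x - np.asarray(centers)) ** 2)

x = np.array([0.0, 0.5, 1.0])
basis = rbf_expand(x, centers=np.linspace(0, 1, 5))
print(basis.shape)  # (3, 5): one column per center
```

Swapping the Gaussian for max(0, x - c), sigmoid, or tanh gives the ReLU, Sigmoid, and Tanh variants of the same idea.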

📊 Encodings and Binning

  • PLETransformer
  • CustomBinTransformer
  • OneHotFromOrdinalTransformer
  • ContinuousOrdinalTransformer
  • LanguageEmbeddingTransformer

🔧 Utilities

  • NoTransformer
  • ToFloatTransformer

In addition, any sklearn transformer can be passed directly, with full support for its hyperparameters.

Using Transformers

The transformers follow the standard sklearn fit/transform API. For example, using PLE:

import numpy as np
from pretab.transformers import PLETransformer

x = np.random.randn(100, 1)
y = np.random.randn(100, 1)

# PLE is target-aware, so fit_transform takes both x and y
x_ple = PLETransformer(n_bins=15, task="regression").fit_transform(x, y)

assert x_ple.shape[1] == 15
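For intuition, piecewise linear encoding maps a scalar to per-bin "progress" values: bins entirely below the value read 1, bins above it read 0, and the bin containing the value reads a fraction. A toy version with fixed edges (pretab's PLETransformer instead derives its bin edges from the data, which is presumably why fit_transform takes y):

```python
import numpy as np

def ple_encode(x, edges):
    """Piecewise linear encoding against fixed bin edges."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    lo, hi = edges[:-1], edges[1:]
    # Fraction of each bin that x has "passed through", clipped to [0, 1]
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

edges = np.array([0.0, 1.0, 2.0, 3.0])
print(ple_encode([1.5], edges))  # [[1.0, 0.5, 0.0]]
```

Unlike one-hot binning, the encoding is continuous in x, which preserves ordering information across bins.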

For splines, a penalty matrix can be extracted via .get_penalty_matrix():

import numpy as np
from pretab.transformers import ThinPlateSplineTransformer

x = np.random.randn(100, 1)

tp = ThinPlateSplineTransformer(n_basis=15)

x_tp = tp.fit_transform(x)

assert x_tp.shape[1] == 15

penalty = tp.get_penalty_matrix()
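A penalty matrix of this kind is typically used to regularize the spline fit: with basis matrix B and penalty S, penalized least squares solves (BᵀB + λS)β = Bᵀy. A sketch with a random stand-in basis and an identity placeholder for S (in real use, substitute the transformed features and tp.get_penalty_matrix()):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(100, 15))   # stand-in for a spline basis matrix
S = np.eye(15)                   # placeholder for tp.get_penalty_matrix()
y = rng.normal(size=100)

lam = 1.0  # smoothing strength: larger values shrink the fit harder
beta = np.linalg.solve(B.T @ B + lam * S, B.T @ y)
print(beta.shape)  # (15,)
```

Sweeping lam trades off fidelity to the data against smoothness, which is the core mechanism behind P-splines and thin plate splines.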

🧪 Running Tests

pytest --maxfail=2 --disable-warnings -v

🤝 Contributing

pretab is community-driven! Whether you’re fixing bugs, adding new encoders, or improving the docs — contributions are welcome.

git clone https://github.com/OpenTabular/pretab.git
cd pretab
pip install -e ".[dev]"

Then create a pull request 🚀


📄 License

MIT License. See LICENSE for details.


❤️ Acknowledgements

pretab builds on the strengths of:

