Kernel density integral transformation
kditransform
The kernel-density integral transformation (McCarter, 2023, TMLR), like min-max scaling and quantile transformation, maps continuous features to the range [0, 1].
It achieves a happy balance between these two transforms, preserving the shape of the input distribution like min-max scaling, while nonlinearly attenuating the effect of outliers like quantile transformation.
It can also be used to discretize features, offering a data-driven alternative to univariate clustering or K-bins discretization.
You can tune the interpolation $\alpha$ between 0 (quantile transform) and $\infty$ (min-max transform), but a good default is $\alpha=1$, which is equivalent to using scipy.stats.gaussian_kde(bw_method=1). This is an easy way to improve performance on a lot of supervised learning problems. See this notebook for example usage and the paper for a detailed description of the method.
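As a rough illustration of this limiting behavior (a sketch, not from the package documentation; the extreme alpha values below are arbitrary choices, and the agreement is only approximate), small alpha should roughly recover the quantile transform and large alpha should roughly recover min-max scaling:

# Illustrative sketch of the limiting behavior of alpha (approximate, not exact).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from kditransform import KDITransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))  # skewed data with a long right tail

Y_quantile = QuantileTransformer(n_quantiles=1000).fit_transform(X)
Y_minmax = MinMaxScaler().fit_transform(X)
Y_small = KDITransformer(alpha=1e-3).fit_transform(X)  # should be close to the quantile transform
Y_large = KDITransformer(alpha=1e3).fit_transform(X)   # should be close to min-max scaling

print(np.max(np.abs(Y_small - Y_quantile)))  # expected to be small
print(np.max(np.abs(Y_large - Y_minmax)))    # expected to be small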
Installation
Installation from PyPI
pip install kditransform
Installation from source
After cloning this repo, install the dependencies from the command line, then install kditransform; optionally, run the test suite with pytest:
pip install -r requirements.txt
pip install -e .
pytest
Usage
kditransform.KDITransformer is a drop-in replacement for sklearn.preprocessing.QuantileTransformer. When alpha (defaults to 1.0) is small, our method behaves like the QuantileTransformer; when alpha is large, it behaves like sklearn.preprocessing.MinMaxScaler.
import numpy as np
from kditransform import KDITransformer
X = np.random.uniform(size=(500, 1))
kdt = KDITransformer(alpha=1.)
Y = kdt.fit_transform(X)
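Because KDITransformer follows the standard scikit-learn transformer interface, it should also slot into a Pipeline; the dataset and classifier below are arbitrary illustrative choices, not part of this package:

# Hypothetical example: KDITransformer inside a scikit-learn Pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from kditransform import KDITransformer

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(KDITransformer(alpha=1.), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())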
kditransform.KDIDiscretizer offers an API based on sklearn.preprocessing.KBinsDiscretizer. It encodes each feature ordinally, similarly to KBinsDiscretizer(encode='ordinal').
import numpy as np
from kditransform import KDIDiscretizer

rng = np.random.default_rng(1)
N = 1000  # total number of samples drawn from the three-component mixture
x1 = rng.normal(1, 0.75, size=int(0.55*N))
x2 = rng.normal(4, 1, size=int(0.3*N))
x3 = rng.uniform(0, 20, size=int(0.15*N))
X = np.sort(np.r_[x1, x2, x3]).reshape(-1, 1)
kdd = KDIDiscretizer()
T = kdd.fit_transform(X)
When initialized as KDIDiscretizer(enable_predict_proba=True), it can also output one-hot encodings and probabilistic one-hot encodings of single-feature input data.
kdd = KDIDiscretizer(enable_predict_proba=True).fit(X)
P = kdd.predict(X) # one-hot encoding
P = kdd.predict_proba(X) # probabilistic one-hot encoding
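As a quick sanity check (a small sketch using only NumPy on the outputs above), you can inspect how many bins were discovered and the shape of the probabilistic encoding:

# T holds the ordinal bin codes from fit_transform above;
# P holds the probabilistic one-hot encoding from predict_proba.
n_bins = len(np.unique(T))
print(n_bins)    # number of bins found for the single feature
print(P.shape)   # expected: (n_samples, n_bins)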
Citing this method
If you use this tool, please cite KDITransform using the following reference to our TMLR paper:
In BibTeX format:
@article{mccarter2023the,
  title={The Kernel Density Integral Transformation},
  author={Calvin McCarter},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=6OEcDKZj5j},
  note={}
}
Usage with TabPFN
TabPFN is a meta-learned Transformer model for tabular classification. In the TabPFN paper, features are preprocessed by concatenating z-scored and power-transformed features. After simply adding KDITransform'ed features to this concatenation, I observed improvements on the reported benchmarks. In particular, on the 30 test datasets in OpenML-CC18, mean AUC OVO increases from 0.8943 to 0.8950; on the subset of 18 numerical datasets in Table 1 of the TabPFN paper, mean AUC OVO increases from 0.9335 to 0.9344.
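The exact TabPFN preprocessing pipeline is not reproduced here, but the idea of appending KDITransform'ed copies of the features can be sketched as follows (the helper function and the train/test matrices are hypothetical placeholders):

# Sketch of augmenting a feature matrix with KDI-transformed copies of each column.
# This mirrors the idea described above, not the exact TabPFN preprocessing code.
import numpy as np
from kditransform import KDITransformer

def add_kdi_features(X_train, X_test, alpha=1.0):
    kdt = KDITransformer(alpha=alpha).fit(X_train)
    X_train_aug = np.hstack([X_train, kdt.transform(X_train)])
    X_test_aug = np.hstack([X_test, kdt.transform(X_test)])
    return X_train_aug, X_test_aug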