A lightweight library for encoding categorical features in your dataset with robust k-fold target statistics in training.
target statistic encoding
Install
from pypi
pip install target_statistic_encoding
from source
python -m pip install git+https://github.com/CircArgs/target_statistic_encoding.git
What?
There are many ways to convert categorical features to numeric ones, from one-hot encoding to embeddings. Target statistic methods are another option: they replace each category level with a statistic of the target feature computed over the rows that share that level.
Why?
Even within this simple technique, implementations vary. Some, such as CatBoost, use a time-mimicking approach to gain robustness against target leakage. One issue with this approach is that, while it introduces some variation into the encoding, for some samples the statistic may be excessively biased. For this reason, this small package takes a different approach: it uses stratified folds of the training set and aggregates target statistics on each fold independently.
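To make the fold-based idea concrete, here is a minimal sketch of out-of-fold target mean encoding in plain pandas. It is not this package's implementation: the function name `out_of_fold_mean_encode` is hypothetical, the folds are shuffled rather than stratified, and the choice to encode each fold from the statistics of the other folds is an assumption about how the folds are used.

```python
import numpy as np
import pandas as pd


def out_of_fold_mean_encode(data, cat_var, target_var, n_splits=5, seed=0):
    """Encode `cat_var` using target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits near-equal folds (shuffled, not stratified).
    folds = rng.permutation(np.arange(len(data)) % n_splits)
    encoded = pd.Series(np.nan, index=data.index, dtype=float)
    for fold in range(n_splits):
        in_fold = folds == fold
        rest = data.loc[~in_fold]
        # Per-level statistics never see the target values of the rows being encoded.
        level_means = rest.groupby(cat_var)[target_var].mean()
        overall_mean = rest[target_var].mean()
        # Levels absent from the other folds fall back to the overall mean (assumption).
        encoded.loc[in_fold] = (
            data.loc[in_fold, cat_var].map(level_means).fillna(overall_mean).to_numpy()
        )
    return encoded


toy = pd.DataFrame(
    {"X1": list("aabbccaabb"), "target": [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]}
)
toy["X1_encoded"] = out_of_fold_mean_encode(toy, "X1", "target", n_splits=5)
```

Because a row's own target never contributes to its encoded value, the encoding cannot simply memorize the label, which is the leakage the fold-based approach is meant to limit.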
Benefits of this implementation
- stratified split target statistic helps prevent target leakage thus making your models more robust
- credibility factor allows categories with low support to be ignored additionally making your models more robust
- clean api
- variety of target statistic functions in addition to allowing custom implemented ones
- easy productionalization - everything is 100% serializable with pickle
ex.
# save for the prod/test time environment
pd.to_pickle(cat2num, "cat2num_for_production.pkl")
# read into the prod environment
cat2num = pd.read_pickle("cat2num_for_production.pkl")
...
model.predict(cat2num.transform(prod_data))
How?
This is a simple utility library; see the example notebook for a sample operation.
Keep in mind that this is simply an example. The example target is random, so no real signal is expected.
API
Instantiate
Init signature:
Cat2Num(
cat_vars: List[str],
target_var: str,
stat_func: target_statistic_encoding.stat_funcs.stat_funcs._StatFunc = <function mean.<locals>.stat_func at 0x7fea58a85950>,
)
Args:
cat_vars (List[str]): a list of strings representing the categorical features to be encoded
target_var (str): string of the name of the target feature in `data`
stat_func (optional Function(*args, **kwargs) -> Function({pd.Series, pd.DataFrameGroupBy}) -> {float, pd.Series})): function which returns a closure to aggregate statistics over target - default stat_funcs.mean()
fit
Prefer `.fit_transform` on your training set.
Note: running `.fit` followed by `.transform` on your training set is not equivalent to simply running `.fit_transform`. There will be no differentiation amongst category statistics as they will all be mapped to the mean.
cat2num.fit_transform(
data: pandas.core.frame.DataFrame,
split: str = None,
n_splits: int = 5,
credibility: Union[float, int] = 0,
drop: bool = False,
suffix: str = '_Cat2Num',
inplace: bool = False,
)
Args:
data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
split (str): name of a column in the data to use for folding the data.
- if this is used then n_splits is ignored
n_splits (int): number of splits to use for target statistic
credibility (float or int):
- if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
- if int must be >=0 as number of records in fitting data level must exist within to be credible
- levels not above this threshold will be given the overall target mean
drop (bool): drop the original columns
suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`
Returns:
the passed dataframe with encoded columns added if inplace is `False` else `None`
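The credibility rule described in the arguments above can be sketched in plain pandas as follows. This is an illustration rather than the package's code: the helper name `credible_level_means` is hypothetical, and reading a float as a fraction of the fitting rows and an int as a row count is an interpretation of the argument description.

```python
import pandas as pd


def credible_level_means(data, cat_var, target_var, credibility=0):
    """Per-level target means; levels not above the threshold get the overall mean."""
    if isinstance(credibility, float):
        if not 0 <= credibility <= 1:
            raise ValueError("a float credibility must be in [0, 1]")
        # Assumed reading: a float is a fraction of the fitting rows.
        threshold = credibility * len(data)
    else:
        if credibility < 0:
            raise ValueError("an int credibility must be >= 0")
        # Assumed reading: an int is a number of records.
        threshold = credibility
    grouped = data.groupby(cat_var)[target_var]
    counts = grouped.count()
    means = grouped.mean()
    overall_mean = data[target_var].mean()
    # Keep a level's own mean only when its support is strictly above the threshold.
    return means.where(counts > threshold, overall_mean)


toy = pd.DataFrame(
    {"X1": list("aaaaabbbbc"), "target": [1, 1, 0, 1, 0, 0, 0, 1, 0, 1]}
)
by_count = credible_level_means(toy, "X1", "target", credibility=1)
by_fraction = credible_level_means(toy, "X1", "target", credibility=0.2)
```

In this toy data level `c` appears only once, so with either threshold it is mapped to the overall target mean while `a` and `b` keep their own means.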
cat2num.fit(
data: pandas.core.frame.DataFrame,
credibility: Union[float, int] = 0,
)
Args:
data (pd.DataFrame): pandas dataframe with categorical features to fit numeric target statistic from
credibility (float or int):
- if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
- if int must be >=0 as number of records in fitting data level must exist within to be credible
- levels not above this threshold will be given the overall target mean
Returns:
fit Cat2Num instance
Use `.transform` on your non-training set
cat2num.transform(
data: pandas.core.frame.DataFrame,
drop: bool = False,
suffix: str = '_Cat2Num',
inplace: bool = False,
)
Args:
data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
drop (bool): drop the original columns
suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`
Returns:
the passed dataframe with encoded columns added if inplace is `False` else `None`
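The split between fitting on training data and transforming other data can be pictured with the sketch below. It uses plain pandas rather than `Cat2Num` itself; the helper names `fit_level_means` and `apply_level_means` are hypothetical, and mapping levels that were never seen during fitting to the overall training mean is an assumption.

```python
import pandas as pd


def fit_level_means(train, cat_var, target_var):
    """Learn one target mean per level, plus the overall mean, from training data."""
    level_means = train.groupby(cat_var)[target_var].mean()
    overall_mean = train[target_var].mean()
    return level_means, overall_mean


def apply_level_means(data, cat_var, level_means, overall_mean, suffix="_Cat2Num"):
    """Add an encoded column to a copy of `data`; no target column is needed."""
    out = data.copy()
    # Levels unseen during fitting fall back to the overall mean (assumption).
    out[cat_var + suffix] = out[cat_var].map(level_means).fillna(overall_mean)
    return out


train = pd.DataFrame({"X1": ["a", "a", "b", "b"], "target": [1, 0, 1, 1]})
prod_data = pd.DataFrame({"X1": ["a", "b", "z"]})

level_means, overall_mean = fit_level_means(train, "X1", "target")
encoded = apply_level_means(prod_data, "X1", level_means, overall_mean)
```

The fitted mapping is the only state carried over to the production data, which is why the fitted object can be pickled and reused at prediction time.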
Custom target statistic functions
You may optionally opt for a target statistic based on a statistic other than the mean although this is usually unwanted/unnecessary.
Several are included and you can implement your own with a few considerations.
Given:
- mean (`target_statistic_encoding.stat_funcs.Mean()`) - the default
- median (`target_statistic_encoding.stat_funcs.Median()`)
- std (`target_statistic_encoding.stat_funcs.Std()`)
- var (`target_statistic_encoding.stat_funcs.Var()`)
- quantile (`target_statistic_encoding.stat_funcs.Quantile(quantile=0.5)`)
Implement your own:
You may optionally implement your own target statistic function. It must be a callable that operates on the `pandas.core.groupby.DataFrameGroupBy` type, i.e. the result of a `pandas.DataFrame.groupby`,
e.g. something akin to:
X1 | target |
---|---|
a | 0.287356 |
b | 0.298795 |
c | 0.336879 |
d | 0.287037 |
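As a rough illustration of the shape such a function could take, the sketch below follows the factory-returning-a-closure pattern from the `stat_func` argument description. The name `Range` and the max-minus-min statistic are hypothetical, and whether `Cat2Num` accepts a plain closure like this without further wrapping is not confirmed here.

```python
import pandas as pd


def Range():
    """Hypothetical factory: returns a closure computing max - min of the target."""

    def stat_func(target):
        # Works on a pd.Series (returns a scalar) and on a groupby object
        # (returns one value per level), since both expose .max() and .min().
        return target.max() - target.min()

    return stat_func


toy = pd.DataFrame({"X1": list("aabbc"), "target": [0.1, 0.5, 0.2, 0.9, 0.4]})
stat = Range()
per_level = stat(toy.groupby("X1"))  # DataFrame indexed by X1 with a target column
overall = stat(toy["target"])  # single float over the whole column
```

Here `per_level` has the same layout as the table above: one row per level of `X1` and a single `target` column.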