Skip to main content

A lightweight library for encoding categorical features in your dataset with robust k-fold target statistics in training.

Project description

target statistic encoding


Table of contents:

Install

from pypi

pip install target_statistic_encoding

from source

python -m pip install git+https://github.com/CircArgs/target_statistic_encoding.git

What?

There are many means to convert categorical features to numeric ones from one-hot to embeddings. Then there are target statistic methods. These methods take statistics based on the target feature.

Why?

Even within this simple technique there is variation in implementations. Some implement a time-mimicking approach such as Catboost to gain robustness over target leakage. However, one issue with this approach is that while it introduces some variation to the encoding, for a some samples the statistic is possibly excessively biased. This small package takes a different approach for this reason. Instead, it uses stratified folds of the training set and aggregates target statistics on each fold independently.

Benefits of this implementation

  • stratified split target statistic helps prevent target leakage thus making your models more robust
  • credibility factor allows categories with low support to be ignored additionally making your models more robust
  • clean api
  • variety of target statistic functions in addition to allowing custom implemented ones
  • easy productionalization - everything is 100% serializable with pickle ex.
    #save for prod/test time environment
    pd.to_pickle(cat2num, "cat2num_for_production.pkl")
    
    #read into prod env
    cat2num=pd.read_pickle("cat2num_for_production.pkl")
    ...
    model.predict(cat2num.transform(prod_data))
    

How?

This is just a simple utility library that performs the following sample operation: See this example notebook

keep in mind this is simply an example. The example target is random here so no real signal is expected example usage

API

Instantiate

Init signature:
Cat2Num(
    cat_vars: List[str],
    target_var: str,
    stat_func: target_statistic_encoding.stat_funcs.stat_funcs._StatFunc = <function mean.<locals>.stat_func at 0x7fea58a85950>,
)
Args:
    cat_vars (List[str]): a list of strings representing the categorical features to be encoded
    target_var (str): string of the name of the target feature in `data`
    stat_func (optional Function(*args, **kwargs) -> Function({pd.Series, pd.DataFrameGroupBy}) -> {float, pd.Series})): function which returns a closure to aggregate statistics over target - default stat_funcs.mean()

fit

prefer.fit_transform on your training set

Note: running .fit followed by .transform on your training set is not equivalent to simply running .fit_transform. There wil be no differentiation amongst category statistics as they will all be mapped to the mean.

cat2num.fit_transform(
    data: pandas.core.frame.DataFrame,
    split: str = None,
    n_splits: int = 5,
    credibility: Union[float, int] = 0,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    split (str): name of a column to use in the data for folding the data.
        - if this is use then n_splits is ignored
    n_splits (int): number of splits to use for target statistic
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`
cat2num.fit(
    data: pandas.core.frame.DataFrame,
    credibility: Union[float, int] = 0,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to fit numeric target statistic from
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean

Returns:
    fit Cat2Num instance

use .transform on your non-training set

cat2num.transform(
    data: pandas.core.frame.DataFrame,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`

Custom target statistic functions

You may optionally opt for a target statistic based on a statistic other than the mean although this is usually unwanted/unnecessary.

Several are included and you can implement your own with a few considerations.

Given:

  • mean (target_statistic_encoding.stat_funcs.Mean()) - the default
  • median (target_statistic_encoding.stat_funcs.Median())
  • std (target_statistic_encoding.stat_funcs.Std())
  • var (target_statistic_encoding.stat_funcs.Var())
  • quantile (target_statistic_encoding.stat_funcs.Quantile(quantile=0.5))

Implement your own:

You may optionally implement your own target statistic function. It must be a callable that operates on the pandas.core.groupby.DataFrameGroupby type i.e. the result of a pandas.DataFrame.groupby e.g.: something akin to

target
X1
a 0.287356
b 0.298795
c 0.336879
d 0.287037

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

target_statistic_encoding-0.1.2.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file target_statistic_encoding-0.1.2.tar.gz.

File metadata

  • Download URL: target_statistic_encoding-0.1.2.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.5 CPython/3.7.6 Linux/5.3.0-53-generic

File hashes

Hashes for target_statistic_encoding-0.1.2.tar.gz
Algorithm Hash digest
SHA256 fde456ef5354be9ee157a1598fa8e832cbcc611bf9115d702cc0ec6ca8a97c10
MD5 9fbe58ca2cbd12232076b79448c2c37b
BLAKE2b-256 14b1961188086950b6894c2202495e6ef08242a093ec793fe8ef01935fafb80c

See more details on using hashes here.

File details

Details for the file target_statistic_encoding-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for target_statistic_encoding-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a5ce5ebc41df682400d526f656c21f04428417ec9bbb07475b56085cd86a9e2b
MD5 9a37bd3c119345ab30dec59b61e1e9eb
BLAKE2b-256 4868c29879392d44384fc4fac22faea1671950f62bfbe52d9c028c7e864d4bfe

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page