Skip to main content

A lightweight library for encoding categorical features in your dataset with robust k-fold target statistics in training.

Project description

target statistic encoding


Table of contents:

Install

from pypi

pip install target-statistic-encoding

from source

python -m pip install git+https://github.com/CircArgs/target_statistic_encoding.git

What?

There are many means to convert categorical features to numeric ones from one-hot to embeddings. Then there are target statistic methods. These methods take statistics based on the target feature.

Why?

Even within this simple technique there is variation in implementations. Some implement a time-mimicking approach such as Catboost to gain robustness over target leakage. However, one issue with this approach is that while it introduces some variation to the encoding, for a some samples the statistic is possibly excessively biased. This small package takes a different approach for this reason. Instead, it uses stratified folds of the training set and aggregates target statistics on each fold independently.

Benefits of this implementation

  • stratified split target statistic helps prevent target leakage thus making your models more robust
  • credibility factor allows categories with low support to be ignored additionally making your models more robust
  • clean api
  • variety of target statistic functions in addition to allowing custom implemented ones
  • easy productionalization - everything is 100% serializable with pickle ex.
    #save for prod/test time environment
    pd.to_pickle(cat2num, "cat2num_for_production.pkl")
    
    #read into prod env
    cat2num=pd.read_pickle("cat2num_for_production.pkl")
    ...
    model.predict(cat2num.transform(prod_data))
    

How?

This is just a simple utility library that performs the following sample operation: See this example notebook

keep in mind this is simply an example. The example target is random here so no real signal is expected example usage

API

Instantiate

Init signature:
Cat2Num(
    cat_vars: List[str],
    target_var: str,
    stat_func: target_statistic_encoding.stat_funcs.stat_funcs._StatFunc = <function mean.<locals>.stat_func at 0x7fea58a85950>,
)
Args:
    cat_vars (List[str]): a list of strings representing the categorical features to be encoded
    target_var (str): string of the name of the target feature in `data`
    stat_func (optional Function(*args, **kwargs) -> Function({pd.Series, pd.DataFrameGroupBy}) -> {float, pd.Series})): function which returns a closure to aggregate statistics over target - default stat_funcs.mean()

fit

prefer.fit_transform on your training set

Note: running .fit followed by .transform on your training set is not equivalent to simply running .fit_transform. There wil be no differentiation amongst category statistics as they will all be mapped to the mean.

cat2num.fit_transform(
    data: pandas.core.frame.DataFrame,
    split: str = None,
    n_splits: int = 5,
    credibility: Union[float, int] = 0,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    split (str): name of a column to use in the data for folding the data.
        - if this is use then n_splits is ignored
    n_splits (int): number of splits to use for target statistic
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`
cat2num.fit(
    data: pandas.core.frame.DataFrame,
    credibility: Union[float, int] = 0,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to fit numeric target statistic from
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean

Returns:
    fit Cat2Num instance

use .transform on your non-training set

cat2num.transform(
    data: pandas.core.frame.DataFrame,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`

Custom target statistic functions

You may optionally opt for a target statistic based on a statistic other than the mean although this is usually unwanted/unnecessary.

Several are included and you can implement your own with a few considerations.

Given:

  • mean (target_statistic_encoding.stat_funcs.Mean()) - the default
  • median (target_statistic_encoding.stat_funcs.Median())
  • std (target_statistic_encoding.stat_funcs.Std())
  • var (target_statistic_encoding.stat_funcs.Var())
  • quantile (target_statistic_encoding.stat_funcs.Quantile(quantile=0.5))

Implement your own:

You may optionally implement your own target statistic function. It must be a callable that operates on the pandas.core.groupby.DataFrameGroupby type i.e. the result of a pandas.DataFrame.groupby e.g.: something akin to

target
X1
a 0.287356
b 0.298795
c 0.336879
d 0.287037

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

target_statistic_encoding-0.1.4.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file target_statistic_encoding-0.1.4.tar.gz.

File metadata

  • Download URL: target_statistic_encoding-0.1.4.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.5 CPython/3.7.6 Linux/5.3.0-59-generic

File hashes

Hashes for target_statistic_encoding-0.1.4.tar.gz
Algorithm Hash digest
SHA256 979290a121c88c0f03ff6788d4a69e6542d8e2721393b42e1e7d185a65c85164
MD5 db373fa4c1d5b9e0e0267274f108b24c
BLAKE2b-256 a730e7f7f74beac578b0e61153e68a6d1a78574b910cbf9fcd38e2c91a12cb93

See more details on using hashes here.

File details

Details for the file target_statistic_encoding-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for target_statistic_encoding-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 952d6d751bd7c53bd597194ea19a54c106b8993d1aa24c1aaac837fa07c3e469
MD5 8be8c48fc230440fe1c3d53944480465
BLAKE2b-256 806e00c179eb348cbc8d1d2fbd506b8cdb06f98cbbdc06196782161f98898245

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page