Automatic Bucketizing of Features with Optimal Association

This is a work in progress.

AutoCarver

AutoCarver is a powerful set of tools designed for binary classification problems. It enhances the feature engineering process and helps improve the performance of binary classification models by providing:

  1. Discretizers: Discretization of qualitative (ordinal or not) and quantitative features
  2. AutoCarver: Bucketization of qualitative features that maximizes association with a binary target feature
  3. FeatureSelector: Feature selection that maximizes association with the binary target while offering control over inter-feature association.

Install

AutoCarver can be installed from PyPI:

pip install autocarver

Quick-Start Examples

Setting up Samples, initiating Pipeline

AutoCarver is able to test the robustness of buckets on a dev sample X_dev.

One of the great advantages of the AutoCarver package is its seamless integration with scikit-learn pipelines, making it incredibly convenient for production-level implementations. By leveraging scikit-learn's pipeline functionality, AutoCarver can be effortlessly incorporated into the end-to-end machine learning workflow.

# defining training and testing sets
X_train, y_train = ...  # used to fit the AutoCarver and the model
X_dev, y_dev = ...  # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
X_test, y_test = ...  # used to evaluate the final model's performances

pipe = []  # initiating as an empty list that will be filled along the feature engineering
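
As a minimal sketch of how these samples could be built (the use of train_test_split, the names X and y, and the split ratios are illustrative assumptions, not part of AutoCarver):

from sklearn.model_selection import train_test_split

# split off a held-out test sample, then split the remainder into train and dev samples
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
X_train, X_dev, y_train, y_dev = train_test_split(X_learn, y_learn, test_size=0.25, stratify=y_learn)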

Quickly build basic buckets with Discretizer

The AutoCarver.Discretizers is a user-friendly tool that enables the discretization of various types of data into basic buckets. With this package, users can easily transform qualitative, qualitative ordinal, and quantitative data into discrete categories for further analysis and modeling.

TODO: add info from QualitativeDiscretizer and QuantitativeDiscretizer. TODO: add StringConverter. Discretizer is the combination of QualitativeDiscretizer and QuantitativeDiscretizer.

The following parameters must be set for Discretizer:

  • quanti_features, list of column names of quantitative data to discretize
  • quali_features, list of column names of qualitative and qualitative ordinal data to discretize
  • min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
    • For qualitative data: minimal frequency of a modality; less frequent modalities are grouped into the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
    • For qualitative ordinal data: less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
    • For quantitative data: equivalent to the inverse of QuantitativeDiscretizer's q parameter. Number of quantiles to initially cut the feature in. Values more frequent than min_freq will be set as their own group and the remaining frequency will be cut into proportionally fewer quantiles (1/min_freq := max(round(non_frequent * 1/min_freq), 1)); see the worked example after this list.
  • values_orders, dict of qualitative ordinal features matched to the order of their modalities
    • For qualitative ordinal data: dict of feature names and the GroupedList of their ordered modalities. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
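
For instance, with min_freq=0.02, a quantitative feature is first cut into 1/0.02 = 50 quantiles; if values each more frequent than 2% together cover 30% of the sample, the remaining 70% is cut into max(round(0.70 * 50), 1) = 35 quantiles.
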
from AutoCarver.Discretizers import Discretizer

quanti_features = ['amount', 'distance', 'length', 'height']  # quantitative features to be discretized
quali_features = ['age', 'type', 'grade', 'city']  # qualitative features to be discretized

# specifying orders of qualitative ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
discretizer = Discretizer(quanti_features=quanti_features, quali_features=quali_features, min_freq=0.02, values_orders=values_orders)
discretizer.fit_transform(X_train, y_train)
discretizer.transform(X_dev)

# storing built buckets
values_orders.update(discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe += [('Discretizer', discretizer)]

Overall, the Discretizers package provides a straightforward and efficient solution for discretizing qualitative, qualitative ordinal, and quantitative data into simple buckets. By transforming data into discrete categories, it enables researchers, analysts, and data scientists to gain insights, perform statistical analyses, and build models on discretized data.

For more details and further functionalities, look into the AutoCarver.Discretizers README.

For qualitative features, unknown modalities passed to Discretizer.transform (that were not seen by Discretizer.fit) are automatically grouped into the default_value='__OTHER__' modality.
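
A minimal sketch of this behavior, reusing the fitted discretizer from above (the unseen modality name is an illustrative assumption):

# 'unknown_city' was never seen during Discretizer.fit on X_train
X_unseen = X_dev.copy()
X_unseen['city'] = 'unknown_city'
X_unseen = discretizer.transform(X_unseen)
print(X_unseen['city'].value_counts())  # unseen values should now be counted under '__OTHER__'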

By default, samples are modified in place rather than copied (recommended for large datasets). Use copy=True if you want a new DataFrame to be returned.

Maximize target association of features' buckets with AutoCarver

All features need to be discretized via a Discretizer first, so that AutoCarver can group their modalities.

All specified features can then automatically be carved into an association-maximizing grouping of their modalities while reducing their number. The following parameters must be set for AutoCarver:

  • values_orders, dict of all features matched to the order of their modalities
  • sort_by, association measure used to find the optimal group modality combination.
    • Use sort_by='cramerv' for more modalities, less robust.
    • Use sort_by='tschuprowt' for more robust modalities.
    • Tip: a combination of features carved with sort_by='cramerv' and features carved with sort_by='tschuprowt' can sometimes prove better than either alone (see the sketch after the example below).
  • max_n_mod, maximum number of modalities for the carved features (excluding numpy.nan). All possible combinations of up to max_n_mod groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).
  • keep_nans, whether or not to try grouping missing values with non-missing values. Use keep_nans=True if you want numpy.nan to remain as a specific modality.

from AutoCarver.AutoCarver import AutoCarver

# initiating AutoCarver
auto_carver = AutoCarver(values_orders=values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)

# fitting on training sample, a test sample can be specified to evaluate carving robustness
auto_carver.fit_transform(X_train, y_train, X_test=X_dev, y_test=y_dev)
auto_carver.transform(X_dev)

# append the auto_carver to the feature engineering pipeline
pipe += [('AutoCarver', auto_carver)]
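
As a sketch of the tip above (assuming fit_transform returns the transformed sample; variable names are illustrative), one could carve a copy of the training data per association measure and compare the resulting buckets downstream:

# one carver per association measure, each fitted on its own copy of the data
carver_cramerv = AutoCarver(values_orders=values_orders, sort_by='cramerv', max_n_mod=5)
X_train_cramerv = carver_cramerv.fit_transform(X_train.copy(), y_train)

carver_tschuprowt = AutoCarver(values_orders=values_orders, sort_by='tschuprowt', max_n_mod=5)
X_train_tschuprowt = carver_tschuprowt.fit_transform(X_train.copy(), y_train)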

Cherry picking the most target-associated features with FeatureSelector

The following parameters must be set for FeatureSelector:

  • features, list of candidate features by column name
  • n_best, number of features to select
  • sample_size=1, size of the sampled batches of features; sampling speeds up computation. By default, all features are evaluated at once. With sample_size=0.5, FeatureSelector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:]. Should be set in ]0, 1].
    • Tip: for a DataFrame of 100 000 rows, sample_size could be set so that len(features)*sample_size equals 100-200 (for instance, with 1 000 candidate features, sample_size=0.15 gives batches of 150 features).
  • measures, list of FeatureSelector's association measures to be evaluated. Ranks features based on last measure of the list.
    • For qualitative data implemented association measures are chi2_measure, cramerv_measure, tschuprowt_measure
    • For quantitative data implemented association measures are kruskal_measure, R_measure and implemented outlier metrics are zscore_measure, iqr_measure
  • filters, list of FeatureSelector's filters used to put aside features.
    • For qualitative data implemented correlation-based filters are cramerv_filter, tschuprowt_filter
    • For quantitative data implemented linear filters are spearman_filter, pearson_filter and vif_filter for multicollinearity filtering

TODO: add default measures and filters; add ranking according to several measures; note that non-selected columns are filtered out.

TODO: add pictures; note that it does not make sense to use zscore_measure as the last measure.

from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import tschuprowt_measure, cramerv_measure, cramerv_filter, tschuprowt_filter, measure_filter

features = quanti_features + quali_features  # after AutoCarver, everything is qualitative

measures = [cramerv_measure, tschuprowt_measure]  # measures of interest (the last one is used for ranking)
filters = [tschuprowt_filter, measure_filter]  # filtering out by inter-feature correlation

# select the best 25 most target associated qualitative features
quali_selector = FeatureSelector(
    features=features,  # features to select from
    n_best=25,  # best 25 features
    measures=measures, filters=filters,   # selected measures and filters
    thresh_mode=0.9,  # filters out features with more than 90% of their mode
    thresh_nan=0.9,  # filters out features with more than 90% of missing values
    thresh_corr=0.5,  # filters out features with spearman greater than 0.5 with a better feature
    name_measure='cramerv_measure', thresh_measure=0.06,  # filters out features with cramerv_measure lower than 0.06
    verbose=True  # displays statistics
)
X_train = quali_selector.fit_transform(X_train, y_train)
X_dev = quali_selector.transform(X_dev)

# append the selector to the feature engineering pipeline
pipe += [('QualiFeatureSelector', quali_selector)]

Storing, reusing the AutoCarver

The Discretizer and AutoCarver steps can be stored in a Pipeline, which can then be stored as a pickle file.

from pickle import dump
from sklearn.pipeline import Pipeline

# storing fitted Discretizer, AutoCarver and FeatureSelector in a Pipeline
pipe = Pipeline(pipe)

# storing as pickle file
dump(pipe, open('my_pipe.pkl', 'wb'))

The stored Pipeline can then be used to transform new datasets (see the restore example in the Detailed Examples section below).

Detailed Examples

StringConverter Example

from AutoCarver.Converters import StringConverter

stringer = StringConverter(features=quali_features)
X_train = stringer.fit_transform(X_train)
X_dev = stringer.transform(X_dev)

# append the string converter to the feature engineering pipeline
pipe.steps.append(('StringConverter', stringer))

Discretizers Examples

The following examples detail the two building blocks of Discretizer: QualitativeDiscretizer and QuantitativeDiscretizer.

QualitativeDiscretizer Example

TODO: add StringConverter

QualitativeDiscretizer enables the transformation of qualitative data into statistically relevant categories, facilitating model robustness.

  • Qualitative Data consists of categorical variables without any inherent order
  • Qualitative Ordinal Data consists of categorical variables with a predefined order or hierarchy

The following parameters must be set for QualitativeDiscretizer:

  • features, list of column names of qualitative and qualitative ordinal data to discretize
  • min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
    • For qualitative data: minimal frequency of a modality; less frequent modalities are grouped into the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
    • For qualitative ordinal data: less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
  • values_orders, dict of qualitative ordinal features matched to the order of their modalities
    • For qualitative ordinal data: dict of feature names and the GroupedList of their ordered modalities. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.

from AutoCarver.Discretizers import QualitativeDiscretizer

quali_features = ['age', 'type', 'grade', 'city']  # qualitative features to be discretized

# specifying orders of qualitative ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
quali_discretizer = QualitativeDiscretizer(features=quali_features, min_freq=0.02, values_orders=values_orders)
quali_discretizer.fit_transform(X_train, y_train)
quali_discretizer.transform(X_dev)

# storing built buckets
values_orders.update(quali_discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe.steps.append(('QualitativeDiscretizer', quali_discretizer))

QualitativeDiscretizer ensures that the ordinal nature of the data is preserved during the discretization process, resulting in meaningful and interpretable categories.

At this step, numpy.nan are kept as their own modality (TODO: not all of them are).

QuantitativeDiscretizer Example

TODO: replace q with min_freq.

QuantitativeDiscretizer enables the transformation of quantitative data into automatically determined intervals of ranges of values, facilitating model robustness.

  • Quantitative Data consists of continuous and discrete numerical variables.

The following parameters must be set for QuantitativeDiscretizer:

  • features, list of column names of quantitative data to discretize
  • q, should be set from 20 (faster, increased stability) to 50 (more precise, decreased stability).
    • For quantitative data: number of quantiles to initially cut the feature in. Values more frequent than 1/q will be set as their own group and the remaining frequency will be cut into proportionally fewer quantiles (q := max(round(non_frequent * q), 1)); see the worked example after this list.
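
For instance, with q=40, values each more frequent than 1/40 = 2.5% become their own group; if those frequent values cover 30% of the sample, the remaining 70% is cut into max(round(0.70 * 40), 1) = 28 quantiles.
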
from AutoCarver.Discretizers import QuantitativeDiscretizer

quanti_features = ['amount', 'distance', 'length', 'height']  # quantitative features to be discretized

# pre-processing of features into categorical ordinal features
quanti_discretizer = QuantitativeDiscretizer(features=quanti_features, q=40)
quanti_discretizer.fit_transform(X_train, y_train)
quanti_discretizer.transform(X_dev)

# storing built buckets
values_orders.update(quanti_discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe.steps.append(('QuantitativeDiscretizer', quanti_discretizer))

At this step, all numpy.nan are kept as their own modality.

The stored Pipeline can be restored and applied to new data:

from pickle import load

# restoring the pipeline
pipe = load(open('my_pipe.pkl', 'rb'))

# applying pipe to a validation set or in production
X_val = pipe.transform(X_val)

TODO: add before/after picture.

FeatureSelector Examples

Quantitative data

from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import zscore_measure, iqr_measure, kruskal_measure, R_measure, measure_filter, spearman_filter

measures = [zscore_measure, iqr_measure, kruskal_measure, R_measure]  # measures of interest (the last one is used for ranking)
filters = [measure_filter, spearman_filter]  # filtering out by inter-feature correlation

# select the best 25 most target associated quantitative features
quanti_selector = FeatureSelector(
    features=quanti_features,  # features to select from
    n_best=25,  # best 25 features
    measures=measures, filters=filters,   # selected measures and filters
    thresh_mode=0.9,  # filters out features with more than 90% of their mode
    thresh_nan=0.9,  # filters out features with more than 90% of missing values
    thresh_corr=0.5,  # filters out features with spearman greater than 0.5 with a better feature
    name_measure='R_measure', thresh_measure=0.06,  # filters out features with R_measure lower than 0.06
    verbose=True  # displays statistics
)
X_train = quanti_selector.fit_transform(X_train, y_train)
X_dev = quanti_selector.transform(X_dev)

# append the selector to the feature engineering pipeline
pipe.steps.append(('QuantiFeatureSelector', quanti_selector))

TODO: add how to build custom measures and filters for FeatureSelector.

Converters Examples

CrossConverter

from AutoCarver.Converters import CrossConverter

# qualitative and quantitative features should be discretized (and bucketized with AutoCarver)
to_cross = quali_features + quanti_features

cross_converter = CrossConverter(to_cross)
X_train = cross_converter.fit_transform(X_train, y_train)
X_dev = cross_converter.transform(X_dev)

# append the crosser to the feature engineering pipeline
pipe += [('CrossConverter', cross_converter)]

# storing the names of the newly built qualitative features
quali_features_built = cross_converter.new_features
print(f"Qualitative features built: total {len(quali_features_built)}")
