
Automatic Carving of Features with Optimal Association

Reason this release was yanked:

test on py37 to py311


AutoCarver

AutoCarver is an approach for maximising a qualitative feature's association with a binary target feature while reducing its number of distinct modalities. It can also be used to discretize quantitative features, which are first cut into quantiles.

Modalities/values of features are carved (regrouped) following an order that depends on the feature's type:

  • Qualitative features are grouped based on target rate per modality.
  • Qualitative ordinal features are grouped based on the specified modality order.
  • Quantitative features are grouped based on the order of their values.
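
As an illustration of the first case, the sketch below orders the modalities of a qualitative feature by target rate. It is a minimal pure-Python example, not AutoCarver's own code: the function name order_by_target_rate and the sample data are hypothetical.

```python
from collections import defaultdict

def order_by_target_rate(values, targets):
    """Return modalities sorted by increasing mean of the binary target."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for value, target in zip(values, targets):
        counts[value] += 1
        positives[value] += target
    # target rate = share of positive targets observed for each modality
    rates = {value: positives[value] / counts[value] for value in counts}
    return sorted(rates, key=rates.get), rates

# hypothetical sample: a 'city' feature and its binary target
city = ['Paris', 'Lyon', 'Paris', 'Nice', 'Lyon', 'Nice', 'Paris', 'Nice']
target = [1, 0, 1, 0, 0, 1, 0, 0]

order, rates = order_by_target_rate(city, target)
print(order)
```

Once modalities are ordered this way, only neighbouring modalities need to be considered for regrouping, exactly as for ordinal and quantitative features.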

AutoCarver uses Tschuprow's T or Cramér's V to find the optimal carving (regrouping) of modalities/values of features.
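
Both measures are derived from the chi-squared statistic of the contingency table between a feature and the target. The sketch below uses their textbook definitions in pure Python; it is not AutoCarver's implementation, and the function names and the sample table are illustrative assumptions.

```python
from math import sqrt

def chi2_statistic(table):
    """Pearson chi-squared statistic and sample size of a contingency table.

    `table` is a list of rows (one per modality), each row holding the
    counts per target class. Every row and column total must be non-zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    chi2, n = chi2_statistic(table)
    r, c = len(table), len(table[0])
    return sqrt(chi2 / (n * min(r - 1, c - 1)))

def tschuprows_t(table):
    chi2, n = chi2_statistic(table)
    r, c = len(table), len(table[0])
    return sqrt(chi2 / (n * sqrt((r - 1) * (c - 1))))

# hypothetical counts: 3 modalities (rows) x binary target (columns)
table = [[90, 10], [60, 40], [30, 70]]
print(cramers_v(table), tschuprows_t(table))
```

With a binary target, Cramér's V divides chi-squared by n only, whereas Tschuprow's T also divides by the square root of the number of modalities minus one. T therefore penalises groupings with many modalities, and the two measures coincide when only two modalities remain.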

AutoCarver is an sklearn transformer.

Only implemented for binary classification problems.

Install

AutoCarver can be installed from PyPI:

pip install --upgrade autocarver

Complete Example

Setting up Samples

AutoCarver tests the robustness of carvings on a specific sample. For this purpose, the use of an out-of-time sample is recommended.

# defining training and testing sets
X_train, y_train = ...
X_test, y_test = ...
X_val, y_val = ...
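
One possible way to build an out-of-time sample is to hold out the most recent observations using a date cutoff. The sketch below is only an illustration in pure Python: the 'date' field, the sample rows and the out_of_time_split helper are hypothetical and not part of AutoCarver.

```python
from datetime import date

def out_of_time_split(rows, cutoff, date_key='date'):
    """Split rows into those observed before `cutoff` and those on or after it."""
    in_time = [row for row in rows if row[date_key] < cutoff]
    out_of_time = [row for row in rows if row[date_key] >= cutoff]
    return in_time, out_of_time

# hypothetical observations, each with an observation date
rows = [
    {'date': date(2022, 3, 1), 'amount': 120.0, 'target': 0},
    {'date': date(2022, 9, 15), 'amount': 80.0, 'target': 1},
    {'date': date(2023, 1, 10), 'amount': 45.0, 'target': 0},
    {'date': date(2023, 6, 30), 'amount': 300.0, 'target': 1},
]

# older rows are used for fitting, the most recent ones for testing robustness
train_rows, test_rows = out_of_time_split(rows, cutoff=date(2023, 1, 1))
```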

Formatting features to be carved

All features need to be discretized via a Discretizer so that AutoCarver can group their modalities. The following parameters must be set for Discretizer:

  • min_freq, should be set from 0.01 (more precise) to 0.05 (faster, increased stability).

    • For qualitative features: minimal frequency of a modality; less frequent modalities are grouped into the default_value='__OTHER__' modality.
    • For qualitative ordinal features: less frequent modalities are grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
  • q, should be set from 10 (faster) to 20 (more precise).

    • For quantitative features: number of quantiles used to initially cut the feature. Values more frequent than 1/q are set as their own group, and the remaining frequency is cut into proportionally fewer quantiles (q:=max(round(non_frequent * q), 1)).
  • values_orders

    • For qualitative ordinal features: dict mapping feature names to the GroupedList of their values. Modalities less frequent than min_freq are automatically grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
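
The quantile adjustment described for q can be worked through on a small example. The sketch below only illustrates the formula q:=max(round(non_frequent * q), 1) in pure Python; the remaining_quantiles helper and the sample values are hypothetical, and the Discretizer's internal handling may differ.

```python
from collections import Counter

def remaining_quantiles(values, q):
    """Find over-represented values and the quantile count left for the rest."""
    n = len(values)
    counts = Counter(values)
    # values more frequent than 1/q are kept as their own group
    frequent = sorted(v for v, count in counts.items() if count / n > 1 / q)
    # share of observations not covered by those frequent values
    non_frequent = 1 - sum(counts[v] for v in frequent) / n
    return frequent, max(round(non_frequent * q), 1)

# hypothetical feature: 40% of observations equal 0, the rest are all distinct
values = [0] * 40 + list(range(1, 61))
frequent, new_q = remaining_quantiles(values, q=10)
print(frequent, new_q)
```

Here the value 0 covers 40% of the sample, so it becomes its own group and the remaining 60% is cut into 6 quantiles instead of 10.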

At this step, all numpy.nan are kept as their own modality.

For qualitative features, unknown modalities passed to Discretizer.transform (that were not passed to Discretizer.fit) are automatically grouped into the default_value='__OTHER__' modality.

By default, samples are modified in place and not copied (recommended for large datasets). Use copy=True if you want a new DataFrame to be returned.
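
The behaviour for rare and unknown modalities can be pictured with the following pure-Python sketch. It is an assumption-laden illustration rather than the Discretizer's code: the helper names are hypothetical, the comparison with min_freq is assumed to be inclusive, and numpy.nan handling is left out.

```python
from collections import Counter

DEFAULT_VALUE = '__OTHER__'

def fit_frequent_modalities(values, min_freq):
    """Learn the modalities whose frequency reaches min_freq on the fitting sample."""
    n = len(values)
    return {v for v, count in Counter(values).items() if count / n >= min_freq}

def group_rare_modalities(values, known):
    """Replace any modality that was rare or unseen at fit time by DEFAULT_VALUE."""
    return [v if v in known else DEFAULT_VALUE for v in values]

# hypothetical training sample: 'C' is below min_freq, 'Z' never appears
train = ['A'] * 60 + ['B'] * 38 + ['C'] * 2
known = fit_frequent_modalities(train, min_freq=0.05)

grouped = group_rare_modalities(['A', 'C', 'Z'], known)
print(grouped)
```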

from AutoCarver.Discretizers import Discretizer

# specifying features to be carved
quantitatives = ['amount', 'distance', 'length', 'height']
qualitatives = ['age', 'type', 'grade', 'city']

# specifying orders of categorical ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
discretizer = Discretizer(quantitatives, qualitatives, min_freq=0.02, q=20, values_orders=values_orders)
discretizer.fit_transform(X_train, y_train)
discretizer.transform(X_test)

# updating features' values orders (at this step every feature is qualitative ordinal)
values_orders = discretizer.values_orders

Automatic Carving of features

All specified features can now be carved automatically into a grouping of their modalities that maximises association with the target while reducing their number. The following parameters must be set for AutoCarver:

  • sort_by, association measure used to find the optimal combination of modality groups.

    • Use sort_by='cramerv' for more modalities, less robust.
    • Use sort_by='tschuprowt' for more robust modalities.
    • Tip: a combination of features carved with sort_by='cramerv' and sort_by='tschuprowt' can sometimes prove to be better than only one of those.
  • max_n_mod, maximum number of modalities for the carved features (excluding numpy.nan). All possible combinations of fewer than max_n_mod groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).

At this step, all numpy.nan are grouped with the best non-NaN value (once the other modalities have been grouped). Use keep_nans=True if you want numpy.nan to remain as a specific modality.
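
To give an idea of the search space controlled by max_n_mod, the sketch below enumerates every way of cutting an ordered list of modalities into groups of consecutive modalities. It is a hypothetical illustration, not AutoCarver's code: the consecutive_groupings helper is an assumed name, and the bounds (from 2 groups up to max_n_mod groups included) are an assumption.

```python
from itertools import combinations

def consecutive_groupings(ordered_modalities, max_n_mod):
    """Yield every split of the ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(ordered_modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # choosing n_groups - 1 cut positions defines n_groups consecutive groups
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [ordered_modalities[start:end]
                   for start, end in zip(bounds, bounds[1:])]

groupings = list(consecutive_groupings(['A', 'B', 'C', 'D'], max_n_mod=3))
for grouping in groupings:
    print(grouping)
```

Each candidate grouping would then be scored with the chosen association measure, and the number of candidates grows quickly with both the number of modalities and max_n_mod, which is why lower values are faster.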

from AutoCarver.AutoCarver import AutoCarver

# initiating AutoCarver
auto_carver = AutoCarver(values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)

# fitting on training sample 
# a test sample can be specified to evaluate carving robustness
auto_carver.fit_transform(X_train, y_train, X_test, y_test)

# applying transformation on test sample
auto_carver.transform(X_test)

Storing, reusing an AutoCarver

The Discretizer and AutoCarver steps can be stored in a Pipeline, which can then be stored as a pickle file.

from pickle import dump
from sklearn.pipeline import Pipeline

# storing Discretizer
pipe = [('Discretizer', discretizer)]

# storing fitted AutoCarver in a Pipeline
pipe += [('AutoCarver', auto_carver)]
pipe = Pipeline(pipe)

# storing as pickle file (the with block closes the file once written)
with open('my_pipe.pkl', 'wb') as file:
    dump(pipe, file)

The stored Pipeline can then be used to transform new datasets.

from pickle import load

# restoring the pipeline (the with block closes the file once read)
with open('my_pipe.pkl', 'rb') as file:
    pipe = load(file)

# applying pipe to a validation set or in production
X_val = pipe.transform(X_val)


