# AutoCarver

Automatic bucketizing of features with optimal association.
AutoCarver is an approach for maximising a qualitative feature's association with a binary target feature while reducing its number of distinct modalities. It can also be used to discretize quantitative features, which are first cut into quantiles.
Modalities/values of features are carved (regrouped) according to an order defined by their type:
- Qualitative features grouped based on target rate per modality.
- Qualitative ordinal features grouped based on specified modality order.
- Quantitative features grouped based on the order of their values.
Tschuprow's T or Cramér's V is used to find the optimal carving (regrouping) of the modalities/values of each feature.
AutoCarver is a scikit-learn transformer. It is only implemented for binary classification problems.
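Both measures normalise Pearson's chi-squared statistic of the modality-by-target contingency table. For reference, a minimal standalone sketch of the two measures (this is not AutoCarver's internal code, and the toy contingency table is made up):

```python
import numpy as np


def chi2_stat(crosstab: np.ndarray) -> float:
    """Pearson's chi-squared statistic of a contingency table."""
    n = crosstab.sum()
    expected = np.outer(crosstab.sum(axis=1), crosstab.sum(axis=0)) / n
    return float(((crosstab - expected) ** 2 / expected).sum())


def cramerv(crosstab: np.ndarray) -> float:
    """Cramér's V: chi2 normalised by n * (min(rows, cols) - 1)."""
    r, c = crosstab.shape
    return float(np.sqrt(chi2_stat(crosstab) / (crosstab.sum() * (min(r, c) - 1))))


def tschuprowt(crosstab: np.ndarray) -> float:
    """Tschuprow's T: chi2 normalised by n * sqrt((rows - 1) * (cols - 1))."""
    r, c = crosstab.shape
    return float(np.sqrt(chi2_stat(crosstab) / (crosstab.sum() * np.sqrt((r - 1) * (c - 1)))))


# made-up contingency table: 3 modalities (rows) x binary target (columns)
tab = np.array([[30, 10], [20, 40], [25, 25]])
print(f"Cramér's V: {cramerv(tab):.3f}, Tschuprow's T: {tschuprowt(tab):.3f}")
```

With more modalities (rows) than target classes (columns), Tschuprow's T penalises the extra modalities harder than Cramér's V, which is consistent with it being described below as the more robust choice.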
## Install

AutoCarver can be installed from PyPI:

```console
pip install --upgrade autocarver
```
## Complete Example

### Setting up Samples

AutoCarver tests the robustness of carvings on a specific sample. For this purpose, the use of an out-of-time sample is recommended.
```python
# defining training and testing sets
X_train, y_train = ...
X_test, y_test = ...
X_val, y_val = ...
```
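For instance, an out-of-time sample can be obtained by splitting on a date column. A hypothetical sketch (the column names and cut-off date are made up):

```python
import pandas as pd

# made-up dataset with an observation date
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-01', '2021-07-15', '2022-02-01', '2022-06-30']),
    'amount': [120.0, 40.5, 310.2, 88.8],
    'default': [0, 1, 0, 1],
})

# observations before the cut-off are used for fitting,
# later ones form the out-of-time sample
train = df[df['date'] < '2022-01-01']
out_of_time = df[df['date'] >= '2022-01-01']
print(len(train), len(out_of_time))  # 2 2
```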
### Formatting features to be carved

All features need to be discretized via a `Discretizer` so `AutoCarver` can group their modalities. The following parameters must be set for `Discretizer`:

- `min_freq`, should be set from 0.01 (more precise) to 0.05 (faster, increased stability).
  - For qualitative features: minimal frequency of a modality; less frequent modalities are grouped into the `default_value='__OTHER__'` modality.
  - For qualitative ordinal features: less frequent modalities are grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values (as specified in the `values_orders` dictionary).
- `q`, should be set from 10 (faster) to 20 (more precise).
  - For quantitative features: number of quantiles to initially cut the feature into. Values more frequent than `1/q` are set as their own group, and the remaining frequency is cut into proportionally fewer quantiles (`q := max(round(non_frequent * q), 1)`).
- `values_orders`
  - For qualitative ordinal features: a `dict` mapping feature names to a `GroupedList` of their ordered values. Modalities less frequent than `min_freq` are automatically grouped with the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
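To illustrate the `q` adjustment above: if one over-frequent value holds 40% of the sample with `q=10`, the remaining 60% is cut into `max(round(0.6 * 10), 1) = 6` quantiles. A hypothetical helper mirroring that formula (not part of the package's API):

```python
def adjusted_q(frequent_share: float, q: int) -> int:
    """Quantile count for the remainder, once over-frequent values
    (share > 1/q of the sample) have been set aside as their own groups."""
    non_frequent = 1.0 - frequent_share
    return max(round(non_frequent * q), 1)


print(adjusted_q(0.4, 10))   # 6 quantiles for the remaining 60%
print(adjusted_q(0.99, 20))  # never fewer than 1 quantile
```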
At this step, all `numpy.nan` are kept as their own modality.

For qualitative features, unknown modalities passed to `Discretizer.transform` (that were not passed to `Discretizer.fit`) are automatically grouped into the `default_value='__OTHER__'` modality.

By default, samples are modified in place rather than copied (recommended for large datasets). Use `copy=True` if you want a new `DataFrame` to be returned.
```python
from AutoCarver.Discretizers import Discretizer

# specifying features to be carved
quantitatives = ['amount', 'distance', 'length', 'height']
qualitatives = ['age', 'type', 'grade', 'city']

# specifying orders of categorical ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
discretizer = Discretizer(quantitatives, qualitatives, min_freq=0.02, q=20, values_orders=values_orders)
discretizer.fit_transform(X_train, y_train)
discretizer.transform(X_test)

# updating features' values orders (at this step, every feature is qualitative ordinal)
values_orders = discretizer.values_orders
```
### Automatic Carving of features

All specified features can now be automatically carved into an association-maximising grouping of their modalities, reducing their number. The following parameters must be set for `AutoCarver`:

- `sort_by`, association measure used to find the optimal combination of grouped modalities.
  - Use `sort_by='cramerv'` for more modalities, less robust.
  - Use `sort_by='tschuprowt'` for more robust modalities.
  - Tip: a combination of features carved with `sort_by='cramerv'` and features carved with `sort_by='tschuprowt'` can sometimes prove better than either measure alone.
- `max_n_mod`, maximum number of modalities for the carved features (excluding `numpy.nan`). All possible combinations of up to `max_n_mod` groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).
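To get a feel for the search space: assuming carved groups are contiguous in the modality order (every feature is qualitative ordinal at this step), splitting `n` modalities into exactly `k` groups gives `C(n-1, k-1)` candidates. A back-of-envelope counting helper (illustrative only, not part of AutoCarver's API):

```python
from math import comb


def n_carvings(n_modalities: int, max_n_mod: int) -> int:
    """Number of ways to split n ordered modalities into at most
    max_n_mod contiguous groups: sum over k of C(n - 1, k - 1)."""
    return sum(comb(n_modalities - 1, k - 1) for k in range(1, max_n_mod + 1))


# moving max_n_mod from 4 to 6 grows the number of tested combinations quickly
for max_n_mod in (4, 5, 6):
    print(max_n_mod, n_carvings(20, max_n_mod))
```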
At this step, all `numpy.nan` are grouped with the best non-NaN modality (after grouping). Use `keep_nans=True` if you want `numpy.nan` to remain as a specific modality.
```python
from AutoCarver.AutoCarver import AutoCarver

# initiating AutoCarver
auto_carver = AutoCarver(values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)

# fitting on training sample
# a test sample can be specified to evaluate carving robustness
auto_carver.fit_transform(X_train, y_train, X_test, y_test)

# applying transformation on test sample
auto_carver.transform(X_test)
```
### Storing, reusing an AutoCarver

The `Discretizer` and `AutoCarver` steps can be stored in a `Pipeline`, which can then be stored as a `pickle` file.
```python
from pickle import dump

from sklearn.pipeline import Pipeline

# storing Discretizer
pipe = [('Discretizer', discretizer)]

# storing fitted AutoCarver in a Pipeline
pipe += [('AutoCarver', auto_carver)]
pipe = Pipeline(pipe)

# storing as pickle file
dump(pipe, open('my_pipe.pkl', 'wb'))
```
The stored `Pipeline` can then be used to transform new datasets.
```python
from pickle import load

# restoring the pipeline
pipe = load(open('my_pipe.pkl', 'rb'))

# applying pipe to a validation set or in production
X_val = pipe.transform(X_val)
```