Automatic Bucketizing of Features with Optimal Association
This is a work in progress.
AutoCarver
AutoCarver is a powerful set of tools designed for binary classification problems. It offers a range of functionalities to enhance the feature engineering process and improve the performance of binary classification models. It provides:
- Discretizers: Discretization of qualitative (ordinal or not) and quantitative features
- AutoCarver: Bucketization of qualitative features that maximizes association with a binary target feature
- FeatureSelector: Feature selection that maximizes association with a binary target while offering control over inter-feature association.
Install
AutoCarver can be installed from PyPI:
pip install autocarver
Quick-Start Examples
Setting up Samples
AutoCarver is able to test the robustness of buckets on a dev sample X_dev.
# defining training and testing sets
X_train, y_train = ... # used to fit the AutoCarver and the model
X_dev, y_dev = ... # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
X_test, y_test = ... # used to evaluate the final model's performances
Initiating Pipeline
One of the great advantages of the AutoCarver package is its seamless integration with scikit-learn pipelines, making it convenient for production-level implementations. By leveraging scikit-learn's pipeline functionality, AutoCarver can be incorporated into the end-to-end machine learning workflow.
from sklearn.pipeline import Pipeline
pipe = Pipeline(steps=[]) # Pipeline requires a steps argument; steps are appended below
Quickly build basic buckets with Discretizer
AutoCarver.Discretizers is a user-friendly tool that enables the discretization of various types of data into basic buckets. With this package, users can easily transform qualitative, qualitative ordinal, and quantitative data into discrete categories for further analysis and modeling.
TODO: add info from QualitativeDiscretizer and QuantitativeDiscretizer
TODO: add stringconverter
Discretizer is the combination of QualitativeDiscretizer and QuantitativeDiscretizer.
The following parameters must be set for Discretizer:
- quantitative_features, list of column names of quantitative data to discretize
- qualitative_features, list of column names of qualitative and qualitative ordinal data to discretize
- min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
  - For qualitative data: minimal frequency of a modality. Less frequent modalities are grouped in the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
  - For qualitative ordinal data: less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
  - For quantitative data: equivalent to the inverse of QuantitativeDiscretizer's q parameter, the number of quantiles to initially cut the feature in. Values more frequent than min_freq will be set as their own group and the remaining frequency will be cut into proportionally fewer quantiles (1/min_freq := max(round(non_frequent * 1/min_freq), 1)).
- values_orders, dict of qualitative ordinal features matched to the order of their modalities
  - For qualitative ordinal data: dict of features values and GroupedList of their values. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
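To make the min_freq rule for unordered qualitative data concrete, here is a minimal pure-Python sketch. It is not AutoCarver's implementation: the helper names group_rare_modalities and order_by_target_rate, the sample data, and the ascending sort direction are assumptions chosen for illustration.

```python
from collections import Counter

DEFAULT_VALUE = "__OTHER__"  # same label as the default_value described above


def group_rare_modalities(values, min_freq, default_value=DEFAULT_VALUE):
    """Map each modality to itself, or to default_value when it is rarer than min_freq."""
    counts = Counter(values)
    n = len(values)
    return {
        modality: modality if count / n >= min_freq else default_value
        for modality, count in counts.items()
    }


def order_by_target_rate(grouped_values, target):
    """Order modalities by the mean of the binary target inside each bucket."""
    totals, positives = Counter(), Counter()
    for modality, label in zip(grouped_values, target):
        totals[modality] += 1
        positives[modality] += label
    # ascending order is an assumption made for this sketch
    return sorted(totals, key=lambda m: positives[m] / totals[m])


# made-up sample: 100 rows, two frequent cities and two rare ones
cities = ["Paris"] * 60 + ["Lyon"] * 38 + ["Nice"] + ["Lille"]
target = [1] * 30 + [0] * 30 + [1] * 10 + [0] * 28 + [1, 1]

mapping = group_rare_modalities(cities, min_freq=0.02)
grouped = [mapping[city] for city in cities]
order = order_by_target_rate(grouped, target)
print(Counter(grouped), order)
```

With min_freq=0.02, the two cities that each cover 1% of the rows fall into the shared default modality, and the remaining buckets are ordered by their target rate.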
from AutoCarver.Discretizers import Discretizer
quanti_features = ['amount', 'distance', 'length', 'height'] # quantitative features to be discretized
quali_features = ['age', 'type', 'grade', 'city'] # qualitative features to be discretized
# specifying orders of qualitative ordinal features
values_orders = {
'age': ['0-18', '18-30', '30-50', '50+'],
'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}
# pre-processing of features into categorical ordinal features
discretizer = Discretizer(quantitative_features=quanti_features, qualitative_features=quali_features, min_freq=0.02, values_orders=values_orders)
discretizer.fit_transform(X_train, y_train)
discretizer.transform(X_dev)
# storing built buckets
values_orders.update(discretizer.values_orders)
# append the discretizer to the feature engineering pipeline
pipe.steps.append(['Discretizer', discretizer])
Overall, the Discretizers package provides a straightforward and efficient solution for discretizing qualitative, qualitative ordinal, and quantitative data into simple buckets. By transforming data into discrete categories, it enables researchers, analysts, and data scientists to gain insights, perform statistical analyses, and build models on discretized data.
For more details and further functionalities, look into the AutoCarver.Discretizers README.
For qualitative features, unknown modalities passed to Discretizer.transform (that were not passed to Discretizer.fit) are automatically grouped to the default_value='__OTHER__' modality.
By default, samples are modified and not copied (recommended for large datasets). Use copy=True if you want a new DataFrame to be returned.
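The two behaviours described above (unknown modalities falling back to the default value, and in-place modification unless a copy is requested) can be pictured with plain Python lists standing in for DataFrame columns. The function transform_modalities is hypothetical and only mirrors the described semantics; it does not call AutoCarver.

```python
DEFAULT_VALUE = "__OTHER__"  # same label as the default_value described above


def transform_modalities(values, known_modalities, copy=False):
    """Replace modalities unseen during fit by DEFAULT_VALUE.

    With copy=False the input list itself is modified, with copy=True a new list is returned.
    """
    output = list(values) if copy else values
    for i, value in enumerate(output):
        if value not in known_modalities:
            output[i] = DEFAULT_VALUE
    return output


train = ["Paris", "Lyon", "Paris"]
known = set(train)  # modalities seen when fitting

dev = ["Lyon", "Nice"]  # "Nice" was never seen when fitting
copied = transform_modalities(dev, known, copy=True)  # dev is left untouched here
in_place = transform_modalities(dev, known)  # dev itself is modified here
print(copied, in_place is dev)
```

Modifying in place avoids duplicating a large dataset in memory, which matches the recommendation above for large datasets.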
Maximize target association of features' buckets with AutoCarver
All features need to be discretized via a Discretizer so AutoCarver can group their modalities. All specified features can then automatically be carved into an association-maximizing grouping of their modalities while reducing their number. The following parameters must be set for AutoCarver:
- values_orders, dict of all features matched to the order of their modalities
- sort_by, association measure used to find the optimal group modality combination.
  - Use sort_by='cramerv' for more modalities, less robust.
  - Use sort_by='tschuprowt' for more robust modalities.
  - Tip: a combination of features carved with sort_by='cramerv' and sort_by='tschuprowt' can sometimes prove to be better than only one of those.
- max_n_mod, maximum number of modalities for the carved features (excluding numpy.nan). All possible combinations of less than max_n_mod groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).
- keep_nans, whether or not to try grouping missing values with non-missing values. Use keep_nans=True if you want numpy.nan to remain as a specific modality.
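The interplay of sort_by and max_n_mod can be sketched in pure Python. The code below uses the textbook definitions of Cramér's V and Tschuprow's T and enumerates groupings of consecutive modalities; it is an illustration under assumptions, not AutoCarver's code. In particular, the function names, the made-up counts, the restriction to consecutive groups, and reading max_n_mod as "at most" are all choices made for this sketch.

```python
from itertools import combinations
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-square of a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (observed - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, observed in enumerate(row)
    )


def cramerv(table):
    """Cramér's V: chi2 normalised by n * min(rows - 1, columns - 1)."""
    n, r, c = sum(map(sum, table)), len(table), len(table[0])
    return sqrt(chi2_statistic(table) / (n * min(r - 1, c - 1)))


def tschuprowt(table):
    """Tschuprow's T: chi2 normalised by n * sqrt((rows - 1) * (columns - 1))."""
    n, r, c = sum(map(sum, table)), len(table), len(table[0])
    return sqrt(chi2_statistic(table) / (n * sqrt((r - 1) * (c - 1))))


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


def best_grouping(modalities, counts, max_n_mod, measure):
    """Return the grouping with the highest association to the binary target.

    counts maps each modality to (n_target_0, n_target_1).
    """
    def table_of(grouping):
        return [[sum(counts[m][k] for m in group) for k in (0, 1)] for group in grouping]

    return max(consecutive_groupings(modalities, max_n_mod), key=lambda g: measure(table_of(g)))


grades = ["A", "B", "C", "D"]
counts = {"A": (45, 5), "B": (44, 6), "C": (12, 38), "D": (9, 41)}  # made-up counts
best_v = best_grouping(grades, counts, max_n_mod=3, measure=cramerv)
best_t = best_grouping(grades, counts, max_n_mod=3, measure=tschuprowt)
print(len(best_v), best_t)
```

With a binary target, Cramér's V divides chi2 by n only, and chi2 never decreases when a group is split further, so V tends to favour the largest allowed number of groups. Tschuprow's T also divides by sqrt(rows - 1), which penalises extra groups; on this sample it keeps two groups where V keeps three.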
from AutoCarver.AutoCarver import AutoCarver
# initiating AutoCarver
auto_carver = AutoCarver(values_orders=values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
auto_carver.fit_transform(X_train, y_train, X_dev, y_dev)
auto_carver.transform(X_dev)
# append the auto_carver to the feature engineering pipeline
pipe.steps.append(['AutoCarver', auto_carver])
Cherry picking the most target-associated features with FeatureSelector
The following parameters must be set for FeatureSelector:
- features, list of candidate features by column name
- n_best, number of features to select
- sample_size=1, size of the sampled list of features, which speeds up computation. By default, all features are used. For sample_size=0.5, FeatureSelector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:]. Should be set in ]0, 1].
  - Tip: for a DataFrame of 100 000 rows, sample_size could be set such that len(features)*sample_size equals 100-200.
- measures, list of FeatureSelector's association measures to be evaluated. Ranks features based on the last measure of the list.
  - For qualitative data, implemented association measures are chi2_measure, cramerv_measure, tschuprowt_measure
  - For quantitative data, implemented association measures are kruskal_measure, R_measure and implemented outlier metrics are zscore_measure, iqr_measure
- filters, list of FeatureSelector's filters used to put aside features.
  - For qualitative data, implemented correlation-based filters are cramerv_filter, tschuprowt_filter
  - For quantitative data, implemented linear filters are spearman_filter, pearson_filter and vif_filter for multicollinearity filtering
TODO: add by default measures and filters + add ranking according to several measures + say that it filters out non-selected columns
TODO: add pictures; say that it does not make sense to use zscore_measure as last measure
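The effect of sample_size can be pictured as cutting the candidate list into chunks that are searched one after the other. The helper feature_chunks below is hypothetical; it reproduces the sample_size=0.5 example given above, but the rounding used for other values is an assumption and may differ from what FeatureSelector does.

```python
def feature_chunks(features, sample_size):
    """Split the candidate features into consecutive chunks of len(features) * sample_size."""
    if not 0 < sample_size <= 1:
        raise ValueError("sample_size must be in ]0, 1]")
    # rounding rule is an assumption made for this sketch
    chunk_len = max(1, round(len(features) * sample_size))
    return [features[i:i + chunk_len] for i in range(0, len(features), chunk_len)]


features = [f"feature_{i}" for i in range(6)]
halves = feature_chunks(features, sample_size=0.5)
print(halves)

# following the tip above: aim for roughly 150 features per chunk
many_features = [f"feature_{i}" for i in range(1000)]
chunks = feature_chunks(many_features, sample_size=150 / len(many_features))
print(len(chunks), len(chunks[0]))
```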
from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import tschuprowt_measure, cramerv_measure, cramerv_filter, tschuprowt_filter, measure_filter
features = quanti_features + quali_features # after AutoCarver, everything is qualitative
measures = [cramerv_measure, tschuprowt_measure] # measures of interest (the last one is used for ranking)
filters = [tschuprowt_filter, measure_filter] # filtering out by inter-feature correlation
# select the best 25 most target associated qualitative features
quali_selector = FeatureSelector(
features=features, # features to select from
n_best=25, # best 25 features
measures=measures, filters=filters, # selected measures and filters
thresh_mode=0.9, # filters out features with more than 90% of their mode
thresh_nan=0.9, # filters out features with more than 90% of missing values
thresh_corr=0.5, # filters out features with Tschuprow's T greater than 0.5 with a better feature
name_measure='cramerv_measure', thresh_measure=0.06, # filters out features with cramerv_measure lower than 0.06
verbose=True # displays statistics
)
X_train = quali_selector.fit_transform(X_train, y_train)
X_dev = quali_selector.transform(X_dev)
# append the selector to the feature engineering pipeline
pipe.steps.append(['QualiFeatureSelector', quali_selector])
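The comment on thresh_corr above describes dropping a feature when it is too strongly associated with a better-ranked one. A minimal sketch of that idea, assuming the features are already ranked by the last measure, is shown below. The function filter_correlated, the feature names, and the pairwise association values are hypothetical and only illustrate the principle.

```python
def filter_correlated(ranked_features, association, thresh_corr, n_best):
    """Keep features in rank order, skipping any too associated with a kept, better feature."""
    kept = []
    for feature in ranked_features:  # best feature first
        if all(association(feature, better) <= thresh_corr for better in kept):
            kept.append(feature)
    return kept[:n_best]


# made-up pairwise associations between three features
pairwise = {
    frozenset(("income", "salary")): 0.9,
    frozenset(("income", "age")): 0.2,
    frozenset(("salary", "age")): 0.3,
}


def lookup(feature_a, feature_b):
    """Symmetric lookup of the made-up association between two features."""
    return pairwise[frozenset((feature_a, feature_b))]


ranked = ["income", "salary", "age"]  # already sorted by target association
selected = filter_correlated(ranked, lookup, thresh_corr=0.5, n_best=25)
print(selected)
```

Here "salary" is put aside because its association with the better-ranked "income" exceeds the threshold, while "age" is kept.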
Storing, reusing the AutoCarver
The Discretizer and AutoCarver steps can be stored in a Pipeline, which can then be stored as a pickle file.
from pickle import dump
from sklearn.pipeline import Pipeline
# storing fitted Discretizer and AutoCarver in a Pipeline
pipe = Pipeline([('Discretizer', discretizer), ('AutoCarver', auto_carver)])
# storing as pickle file (the context manager closes the file once written)
with open('my_pipe.pkl', 'wb') as file:
    dump(pipe, file)
The stored Pipeline can then be used to transform new datasets.
Detailed Examples
StringConverter Example
from AutoCarver.Converters import StringConverter
stringer = StringConverter(features=quali_features)
X_train = stringer.fit_transform(X_train)
X_dev = stringer.transform(X_dev)
# append the string converter to the feature engineering pipeline
pipe.steps.append(['StringConverter', stringer])
Discretizers Examples
AutoCarver.Discretizers is a user-friendly tool that enables the discretization of various types of data into basic buckets. With this package, users can easily transform qualitative, qualitative ordinal, and quantitative data into discrete categories for further analysis and modeling.
QualitativeDiscretizer Example
TODO: add StringConverter
QualitativeDiscretizer enables the transformation of qualitative data into statistically relevant categories, facilitating model robustness.
- Qualitative Data consists of categorical variables without any inherent order
- Qualitative Ordinal Data consists of categorical variables with a predefined order or hierarchy
The following parameters must be set for QualitativeDiscretizer:
- features, list of column names of qualitative and qualitative ordinal data to discretize
- min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
  - For qualitative data: minimal frequency of a modality. Less frequent modalities are grouped in the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
  - For qualitative ordinal data: less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
- values_orders, dict of qualitative ordinal features matched to the order of their modalities
  - For qualitative ordinal data: dict of features values and GroupedList of their values. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
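For ordinal data, the grouping of rare modalities with a neighbour can be sketched as follows. This is one possible reading of the rule above, using only the "smallest frequency" criterion; the function group_rare_ordinal and the counts are hypothetical, and the "closest target rate" variant is not shown.

```python
def group_rare_ordinal(order, counts, min_freq):
    """Merge modalities rarer than min_freq with their least frequent adjacent group."""
    total = sum(counts[m] for m in order)
    groups = [[m] for m in order]  # ordered groups of modalities
    sizes = [counts[m] for m in order]  # number of rows in each group
    while len(groups) > 1:
        rarest = min(range(len(groups)), key=sizes.__getitem__)
        if sizes[rarest] / total >= min_freq:
            break  # every group is frequent enough
        # only the inferior and superior neighbours are candidates
        neighbours = [j for j in (rarest - 1, rarest + 1) if 0 <= j < len(groups)]
        closest = min(neighbours, key=sizes.__getitem__)
        lo, hi = sorted((rarest, closest))
        groups[lo:hi + 1] = [groups[lo] + groups[hi]]
        sizes[lo:hi + 1] = [sizes[lo] + sizes[hi]]
    return groups


grade_order = ["A", "B", "C", "D", "J", "K", "NN"]
grade_counts = {"A": 300, "B": 250, "C": 200, "D": 220, "J": 10, "K": 5, "NN": 15}  # made-up counts
grade_groups = group_rare_ordinal(grade_order, grade_counts, min_freq=0.02)
print(grade_groups)
```

Because groups are only ever merged with an adjacent group, the order given in values_orders is preserved.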
from AutoCarver.Discretizers import QualitativeDiscretizer
quali_features = ['age', 'type', 'grade', 'city'] # qualitative features to be discretized
# specifying orders of qualitative ordinal features
values_orders = {
'age': ['0-18', '18-30', '30-50', '50+'],
'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}
# pre-processing of features into categorical ordinal features
quali_discretizer = QualitativeDiscretizer(features=quali_features, min_freq=0.02, values_orders=values_orders)
quali_discretizer.fit_transform(X_train, y_train)
quali_discretizer.transform(X_dev)
# storing built buckets
values_orders.update(quali_discretizer.values_orders)
# append the discretizer to the feature engineering pipeline
pipe.steps.append(['QualitativeDiscretizer', quali_discretizer])
QualitativeDiscretizer ensures that the ordinal nature of the data is preserved during the discretization process, resulting in meaningful and interpretable categories.
At this step, all numpy.nan are kept as their own modality. TODO: not all of them
QuantitativeDiscretizer Example
TODO: change q for min_freq
QuantitativeDiscretizer enables the transformation of quantitative data into automatically determined intervals of values, facilitating model robustness.
- Quantitative Data consists of continuous and discrete numerical variables.
The following parameters must be set for QuantitativeDiscretizer:
- features, list of column names of quantitative data to discretize
- q, should be set from 20 (faster, increased stability) to 50 (more precise, decreased stability).
  - For quantitative data: number of quantiles to initially cut the feature in. Values more frequent than 1/q will be set as their own group and the remaining frequency will be cut into proportionally fewer quantiles (q := max(round(non_frequent * q), 1)).
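The adjustment q := max(round(non_frequent * q), 1) can be worked through on a small example. The helper initial_quantile_plan below is hypothetical and only covers the bookkeeping described above (which values get their own group, and how many quantiles remain); it does not compute the quantile cuts themselves.

```python
from collections import Counter


def initial_quantile_plan(values, q):
    """Return the over-represented values and the quantile count left for the rest."""
    n = len(values)
    counts = Counter(values)
    # values more frequent than 1/q become their own group
    frequent = {value for value, count in counts.items() if count / n > 1 / q}
    # share of rows not covered by those frequent values
    non_frequent = sum(count for value, count in counts.items() if value not in frequent) / n
    remaining_q = max(round(non_frequent * q), 1)
    return sorted(frequent), remaining_q


# made-up feature: half of the rows are 0, the other half are all distinct
amounts = [0] * 500 + list(range(1, 501))
frequent_values, remaining_q = initial_quantile_plan(amounts, q=20)
print(frequent_values, remaining_q)
```

Here the value 0 covers 50% of the rows, well above 1/q = 5%, so it becomes its own group and the remaining half of the data is cut into 10 quantiles instead of 20.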
from AutoCarver.Discretizers import QuantitativeDiscretizer
quanti_features = ['amount', 'distance', 'length', 'height'] # quantitative features to be discretized
# pre-processing of features into categorical ordinal features
quanti_discretizer = QuantitativeDiscretizer(features=quanti_features, q=40)
quanti_discretizer.fit_transform(X_train, y_train)
quanti_discretizer.transform(X_dev)
# storing built buckets
values_orders.update(quanti_discretizer.values_orders)
# append the discretizer to the feature engineering pipeline
pipe.steps.append(['QuantitativeDiscretizer', quanti_discretizer])
At this step, all numpy.nan are kept as their own modality.
from pickle import load
# restoring the pipeline
with open('my_pipe.pkl', 'rb') as file:
    pipe = load(file)
# applying pipe to a validation set or in production
X_val = pipe.transform(X_val)
TODO: add before after picture
FeatureSelector Examples
Quantitative data
from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import zscore_measure, iqr_measure, kruskal_measure, R_measure, measure_filter, spearman_filter
measures = [zscore_measure, iqr_measure, kruskal_measure, R_measure] # measures of interest (the last one is used for ranking)
filters = [measure_filter, spearman_filter] # filtering out by inter-feature correlation
# select the best 25 most target associated quantitative features
quanti_selector = FeatureSelector(
features=quanti_features, # features to select from
n_best=25, # best 25 features
measures=measures, filters=filters, # selected measures and filters
thresh_mode=0.9, # filters out features with more than 90% of their mode
thresh_nan=0.9, # filters out features with more than 90% of missing values
thresh_corr=0.5, # filters out features with spearman greater than 0.5 with a better feature
name_measure='R_measure', thresh_measure=0.06, # filters out features with R_measure lower than 0.06
verbose=True # displays statistics
)
X_train = quanti_selector.fit_transform(X_train, y_train)
X_dev = quanti_selector.transform(X_dev)
# append the selector to the feature engineering pipeline
pipe.steps.append(['QuantiFeatureSelector', quanti_selector])
TODO: add how to build on FeatureSelector measures and filters
Hashes for AutoCarver-4.4.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 8fd098c86db1e45ac5807cd4846a2ef33a022c847b6152bb28614995ef2954ab
MD5 | ae446416f1aebe1754d322c2c4cabf20
BLAKE2b-256 | cd3cbfaeeb38bb22069cf60e54cbb0c9ddc15d411432d75fc3d6ceecccab87f2