Automatic Bucketizing of Features with Optimal Association

This is a work in progress.

AutoCarver

AutoCarver is a powerful set of tools designed for binary classification problems. It offers a range of functionalities to enhance the feature engineering process and improve the performance of binary classification models. It provides:

Discretizers: Discretization of qualitative (ordinal or not) and quantitative features
AutoCarver: Bucketization of qualitative features that maximizes association with a binary target feature
FeatureSelector: Feature selection that maximizes association with a binary target while offering control over inter-feature association.

Install

AutoCarver can be installed from PyPI:

pip install autocarver

Quick-Start Examples

Setting up Samples

AutoCarver is able to test the robustness of buckets on a dev sample X_dev.

# defining training and testing sets
X_train, y_train = ...  # used to fit the AutoCarver and the model
X_dev, y_dev = ...  # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
X_test, y_test = ...  # used to evaluate the final model's performances
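
The three samples above can come from any splitting routine. As a minimal sketch, assuming plain Python lists stand in for the DataFrames and that a 60/20/20 split is acceptable, a shuffled three-way split could look like this (`three_way_split`, `dev_size` and `test_size` are hypothetical names, not part of AutoCarver):

```python
import random


def three_way_split(rows, targets, dev_size=0.2, test_size=0.2, seed=0):
    """Shuffle indices once, then slice them into train/dev/test parts."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(indices) * test_size)
    n_dev = int(len(indices) * dev_size)
    parts = {
        "test": indices[:n_test],
        "dev": indices[n_test:n_test + n_dev],
        "train": indices[n_test + n_dev:],
    }
    # each part is a (rows, targets) pair restricted to its own indices
    return {
        name: ([rows[i] for i in idx], [targets[i] for i in idx])
        for name, idx in parts.items()
    }


# toy data: ten rows with one quantitative and one qualitative feature
rows = [{"amount": i, "city": "Paris" if i % 2 else "Lyon"} for i in range(10)]
targets = [i % 2 for i in range(10)]

splits = three_way_split(rows, targets)
X_train, y_train = splits["train"]
X_dev, y_dev = splits["dev"]
X_test, y_test = splits["test"]
```

With real DataFrames, any equivalent splitter works as long as the dev and test samples stay disjoint from the training sample.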

Initiating Pipeline

One of the great advantages of the AutoCarver package is its seamless integration with scikit-learn pipelines, making it incredibly convenient for production-level implementations. By leveraging scikit-learn's pipeline functionality, AutoCarver can be effortlessly incorporated into the end-to-end machine learning workflow.

from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[])  # `steps` is required; transformers are appended to it below

Quickly build basic buckets with Discretizer

The AutoCarver.Discretizers is a user-friendly tool that enables the discretization of various types of data into basic buckets. With this package, users can easily transform qualitative, qualitative ordinal, and quantitative data into discrete categories for further analysis and modeling.

TODO: add info from QualitativeDiscretizer and QuantitativeDiscretizer

TODO: add StringConverter

Discretizer is the combination of QualitativeDiscretizer and QuantitativeDiscretizer.

The following parameters must be set for Discretizer:

quantitative_features, list of column names of quantitative data to discretize
qualitative_features, list of column names of qualitative and qualitative ordinal data to discretize
min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
- For qualitative data: Minimal frequency of a modality; less frequent modalities are grouped in the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
- For qualitative ordinal data: Less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
- For quantitative data: Equivalent to the inverse of QuantitativeDiscretizer's q parameter, so 1/min_freq is the number of quantiles to initially cut the feature into. Values more frequent than min_freq are set as their own group and the remaining frequency is cut into proportionally fewer quantiles (1/min_freq:=max(round(non_frequent * 1/min_freq), 1)).
values_orders, dict of qualitative ordinal features matched to the order of their modalities
- For qualitative ordinal data: dict of features and GroupedList of their values. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
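
The min_freq rule for non-ordinal qualitative data can be illustrated with a short standard-library sketch. This is not AutoCarver's implementation: the function names are hypothetical, the ascending ordering by target rate is an assumption, and plain lists stand in for DataFrame columns. Values unseen at fit time fall back to the default modality.

```python
from collections import Counter, defaultdict

DEFAULT_VALUE = "__OTHER__"


def fit_qualitative_buckets(values, targets, min_freq=0.02):
    """Group rare modalities, then order the buckets by mean target."""
    counts = Counter(values)
    n = len(values)
    frequent = {v for v, c in counts.items() if c / n >= min_freq}

    # modalities rarer than min_freq are replaced by the default value
    grouped = [v if v in frequent else DEFAULT_VALUE for v in values]

    # mean target per bucket
    sizes = Counter(grouped)
    sums = defaultdict(float)
    for v, y in zip(grouped, targets):
        sums[v] += y
    rates = {v: sums[v] / sizes[v] for v in sizes}

    # assumption: buckets are ordered by ascending target rate
    return sorted(rates, key=rates.get)


def transform_qualitative(values, order):
    """Map values that were rare or unseen at fit time to the default value."""
    known = set(order)
    return [v if v in known else DEFAULT_VALUE for v in values]


cities = ["Paris"] * 50 + ["Lyon"] * 45 + ["Nice"] * 3 + ["Brest"] * 2
target = [1] * 10 + [0] * 40 + [1] * 27 + [0] * 18 + [1] * 3 + [1, 0]

order = fit_qualitative_buckets(cities, target, min_freq=0.05)
new_values = transform_qualitative(["Paris", "Nice", "Marseille"], order)
```

Here `Nice` and `Brest` each fall below the 5% threshold, so they share the `__OTHER__` bucket, and `Marseille`, never seen during fitting, lands there as well.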

from AutoCarver.Discretizers import Discretizer

quanti_features = ['amount', 'distance', 'length', 'height']  # quantitative features to be discretized
quali_features = ['age', 'type', 'grade', 'city']  # qualitative features to be discretized

# specifying orders of qualitative ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
discretizer = Discretizer(quantitative_features=quanti_features, qualitative_features=quali_features, min_freq=0.02, values_orders=values_orders)
discretizer.fit_transform(X_train, y_train)
discretizer.transform(X_dev)

# storing built buckets
values_orders.update(discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe.steps.append(['Discretizer', discretizer])

Overall, the Discretizers package provides a straightforward and efficient solution for discretizing qualitative, qualitative ordinal, and quantitative data into simple buckets. By transforming data into discrete categories, it enables researchers, analysts, and data scientists to gain insights, perform statistical analyses, and build models on discretized data.

For more details and further functionalities, see the AutoCarver.Discretizers README.

For qualitative features, unknown modalities passed to Discretizer.transform (that were not passed to Discretizer.fit) are automatically grouped to the default_value='__OTHER__' modality.

By default, samples are modified in place and not copied (recommended for large datasets). Use copy=True if you want a new DataFrame to be returned.

Maximize target association of features' buckets with AutoCarver

All features need to be discretized via a Discretizer so that AutoCarver can group their modalities.

All specified features can then be carved automatically into a grouping of their modalities that maximizes association with the target while reducing the number of modalities. The following parameters must be set for AutoCarver:

values_orders, dict of all features matched to the order of their modalities
sort_by, association measure used to find the optimal grouping of modalities.
- Use sort_by='cramerv' for more modalities, less robust.
- Use sort_by='tschuprowt' for more robust modalities.
- Tip: a combination of features carved with sort_by='cramerv' and sort_by='tschuprowt' can sometimes prove to be better than only one of those.
max_n_mod, maximum number of modalities for the carved features (excluding numpy.nan). All possible combinations of fewer than max_n_mod groups of modalities will be tested. Should be set from 4 (faster) to 6 (more precise).
keep_nans, whether or not to try grouping missing values with non-missing values. Use keep_nans=True if you want numpy.nan to remain a separate modality.
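
To make sort_by and max_n_mod more concrete, here is a minimal standard-library sketch of the idea, not AutoCarver's actual code. It assumes that only consecutive modalities of an ordered feature are merged, that up to max_n_mod groups (inclusive) are tried, and that the input is a crosstab with one row per modality and one column per target value. All function names are hypothetical. It uses the textbook definitions: Cramér's V divides chi-square by n * min(r - 1, c - 1), Tschuprow's T by n * sqrt((r - 1)(c - 1)).

```python
from itertools import combinations
from math import sqrt


def chi2(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(
        (obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def cramerv(table):
    n = sum(map(sum, table))
    return sqrt(chi2(table) / (n * (min(len(table), len(table[0])) - 1)))


def tschuprowt(table):
    n = sum(map(sum, table))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return sqrt(chi2(table) / (n * sqrt(dof)))


def consecutive_groupings(n_modalities, max_n_mod):
    """Yield every split of ordered modalities into 2..max_n_mod consecutive groups."""
    for n_groups in range(2, max_n_mod + 1):
        for cuts in combinations(range(1, n_modalities), n_groups - 1):
            bounds = (0, *cuts, n_modalities)
            yield [range(a, b) for a, b in zip(bounds, bounds[1:])]


def carve(table, max_n_mod, measure):
    """Return (best score, best grouping of row indices) for the given measure."""
    n_cols = len(table[0])
    best = None
    for groups in consecutive_groupings(len(table), max_n_mod):
        # merge the crosstab rows of each group
        merged = [[sum(table[i][j] for i in g) for j in range(n_cols)] for g in groups]
        score = measure(merged)
        if best is None or score > best[0]:
            best = (score, [list(g) for g in groups])
    return best


# toy crosstab: 5 ordered modalities x (target=0, target=1)
table = [[90, 10], [85, 15], [60, 40], [55, 45], [20, 80]]
score_v, groups_v = carve(table, max_n_mod=3, measure=cramerv)
score_t, groups_t = carve(table, max_n_mod=3, measure=tschuprowt)
```

With a binary target, Cramér's V reduces to sqrt(chi2 / n), which never decreases when a group is split further, whereas Tschuprow's T also divides by the fourth root of (r - 1) and therefore favors fewer groups. This is consistent with the guidance above that 'cramerv' yields more modalities and 'tschuprowt' more robust ones.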

from AutoCarver.AutoCarver import AutoCarver

# initiating AutoCarver
auto_carver = AutoCarver(values_orders=values_orders, sort_by='cramerv', max_n_mod=5, verbose=True)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
auto_carver.fit_transform(X_train, y_train, X_dev, y_dev)
auto_carver.transform(X_dev)

# append the auto_carver to the feature engineering pipeline
pipe.steps.append(['AutoCarver', auto_carver])

Cherry picking the most target-associated features with FeatureSelector

The following parameters must be set for FeatureSelector:

features, list of candidate features by column name
n_best, number of features to select
sample_size=1, size of the sampled list of features; smaller values speed up computation. By default, all features are used. For sample_size=0.5, FeatureSelector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:]. Should be set in ]0, 1].
- Tip: for a DataFrame of 100 000 rows, sample_size could be set such that len(features)*sample_size is between 100 and 200.
measures, list of FeatureSelector's association measures to be evaluated. Ranks features based on the last measure of the list.
- For qualitative data, implemented association measures are chi2_measure, cramerv_measure, tschuprowt_measure
- For quantitative data, implemented association measures are kruskal_measure, R_measure and implemented outlier metrics are zscore_measure, iqr_measure
filters, list of FeatureSelector's filters used to put aside features.
- For qualitative data, implemented correlation-based filters are cramerv_filter, tschuprowt_filter
- For quantitative data, implemented linear filters are spearman_filter, pearson_filter, and vif_filter for multicollinearity filtering
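
The interplay of sample_size, the ranking measure and the filters can be sketched in plain Python. This is an illustration under stated assumptions rather than FeatureSelector's code: measures and pairwise associations are taken as precomputed dictionaries, the chunk boundaries reproduce the sample_size=0.5 example above, and `feature_chunks` and `select` are hypothetical helpers.

```python
from math import ceil


def feature_chunks(features, sample_size=1.0):
    """Split features into ceil(1 / sample_size) consecutive chunks."""
    if not 0 < sample_size <= 1:
        raise ValueError("sample_size must be in ]0, 1]")
    n_chunks = ceil(1 / sample_size)
    bounds = [i * len(features) // n_chunks for i in range(n_chunks + 1)]
    return [features[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def select(features, measure, association, n_best, thresh_measure, thresh_corr):
    """Rank by target association, then drop features too close to a better one."""
    ranked = sorted(
        (f for f in features if measure[f] >= thresh_measure),
        key=measure.get,
        reverse=True,
    )
    kept = []
    for feature in ranked:
        # keep the feature only if no better-ranked kept feature is too associated
        if all(association.get(frozenset((feature, k)), 0.0) <= thresh_corr for k in kept):
            kept.append(feature)
    return kept[:n_best]


# hypothetical, precomputed target associations and inter-feature associations
measure = {"amount": 0.30, "distance": 0.25, "length": 0.05, "height": 0.20}
association = {
    frozenset(("amount", "distance")): 0.8,
    frozenset(("amount", "height")): 0.1,
    frozenset(("distance", "height")): 0.3,
}

chunks = feature_chunks(list(measure), sample_size=0.5)
selected = select(list(measure), measure, association,
                  n_best=25, thresh_measure=0.06, thresh_corr=0.5)
```

In this toy case `length` is dropped for its low target association and `distance` for being too associated with the better-ranked `amount`, which leaves `amount` and `height`.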

TODO: add default measures and filters + add ranking according to several measures + say that it filters out non-selected columns

TODO: add pictures + say that it does not make sense to use zscore_measure as the last measure

from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import tschuprowt_measure, cramerv_measure, cramerv_filter, tschuprowt_filter, measure_filter

features = quanti_features + quali_features  # after AutoCarver, everything is qualitative

measures = [cramerv_measure, tschuprowt_measure]  # measures of interest (the last one is used for ranking)
filters = [tschuprowt_filter, measure_filter]  # filtering out by inter-feature correlation

# select the 25 most target-associated qualitative features
quali_selector = FeatureSelector(
    features=features,  # features to select from
    n_best=25,  # best 25 features
    measures=measures, filters=filters,   # selected measures and filters
    thresh_mode=0.9,  # filters out features with more than 90% of their mode
    thresh_nan=0.9,  # filters out features with more than 90% of missing values
    thresh_corr=0.5,  # filters out features with an association greater than 0.5 with a better feature
    name_measure='cramerv_measure', thresh_measure=0.06,  # filters out features with cramerv_measure lower than 0.06
    verbose=True  # displays statistics
)
X_train = quali_selector.fit_transform(X_train, y_train)
X_dev = quali_selector.transform(X_dev)

# append the selector to the feature engineering pipeline
pipe.steps.append(['QualiFeatureSelector', quali_selector])

Storing, reusing the AutoCarver

The fitted Discretizer and AutoCarver steps can be gathered in a Pipeline, which can then be stored as a pickle file.

from pickle import dump
from sklearn.pipeline import Pipeline

# gathering the fitted Discretizer and AutoCarver as pipeline steps
steps = [('Discretizer', discretizer)]
steps += [('AutoCarver', auto_carver)]
pipe = Pipeline(steps)

# storing as pickle file (the context manager closes the file)
with open('my_pipe.pkl', 'wb') as file:
    dump(pipe, file)

The stored Pipeline can then be used to transform new datasets.

Detailed Examples

StringConverter Example

from AutoCarver.Converters import StringConverter

stringer = StringConverter(features=quali_features)
X_train = stringer.fit_transform(X_train)
X_dev = stringer.transform(X_dev)

# append the string converter to the feature engineering pipeline
pipe.steps.append(['StringConverter', stringer])

Discretizers Examples

QualitativeDiscretizer Example

TODO: add StringConverter

QualitativeDiscretizer enables the transformation of qualitative data into statistically relevant categories, facilitating model robustness.

Qualitative Data consists of categorical variables without any inherent order
Qualitative Ordinal Data consists of categorical variables with a predefined order or hierarchy

The following parameters must be set for QualitativeDiscretizer:

features, list of column names of qualitative and qualitative ordinal data to discretize
min_freq, should be set from 0.01 (more precise, decreased stability) to 0.05 (faster, increased stability).
- For qualitative data: Minimal frequency of a modality; less frequent modalities are grouped in the default_value='__OTHER__' modality. Values are ordered based on y_train bucket mean.
- For qualitative ordinal data: Less frequent modalities are grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values (specified in the values_orders dictionary).
values_orders, dict of qualitative ordinal features matched to the order of their modalities
- For qualitative ordinal data: dict of features and GroupedList of their values. Modalities less frequent than min_freq are automatically grouped to the closest modality (smallest frequency or closest target rate), between the superior and inferior values.
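
For ordinal data, grouping to the "closest modality" can be sketched as follows. This is only an illustration: it assumes that "closest" means the adjacent modality with the closest target rate (the text also mentions smallest frequency as a criterion), and it works on per-modality counts instead of a DataFrame. `group_rare_ordinal` is a hypothetical helper, not the library's API.

```python
def group_rare_ordinal(order, counts, events, min_freq=0.02):
    """Merge modalities rarer than min_freq into an adjacent modality.

    order:  modality labels in their ordinal order
    counts: number of rows per modality
    events: number of rows with target == 1 per modality
    """
    groups = [[m] for m in order]
    sizes = [counts[m] for m in order]
    hits = [events[m] for m in order]
    n = sum(sizes)

    def rate(k):
        return hits[k] / sizes[k] if sizes[k] else 0.0

    while len(groups) > 1:
        # the rarest remaining group is handled first
        i = min(range(len(groups)), key=sizes.__getitem__)
        if sizes[i] / n >= min_freq:
            break
        # only the inferior and superior neighbours are candidates
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(groups)]
        j = min(neighbours, key=lambda k: abs(rate(k) - rate(i)))
        lo, hi = sorted((i, j))
        groups[lo:hi + 1] = [groups[lo] + groups[hi]]
        sizes[lo:hi + 1] = [sizes[lo] + sizes[hi]]
        hits[lo:hi + 1] = [hits[lo] + hits[hi]]
    return groups


order = ["A", "B", "C", "D"]
counts = {"A": 300, "B": 10, "C": 400, "D": 290}
events = {"A": 30, "B": 5, "C": 160, "D": 29}

groups = group_rare_ordinal(order, counts, events, min_freq=0.02)
```

Grade `B` covers 1% of the rows and has a 50% target rate, so it joins its neighbour `C` (40%) and not `A` (10%). The order of the modalities is preserved.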

from AutoCarver.Discretizers import QualitativeDiscretizer

quali_features = ['age', 'type', 'grade', 'city']  # qualitative features to be discretized

# specifying orders of qualitative ordinal features
values_orders = {
    'age': ['0-18', '18-30', '30-50', '50+'],
    'grade': ['A', 'B', 'C', 'D', 'J', 'K', 'NN']
}

# pre-processing of features into categorical ordinal features
quali_discretizer = QualitativeDiscretizer(features=quali_features, min_freq=0.02, values_orders=values_orders)
quali_discretizer.fit_transform(X_train, y_train)
quali_discretizer.transform(X_dev)

# storing built buckets
values_orders.update(quali_discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe.steps.append(['QualitativeDiscretizer', quali_discretizer])

QualitativeDiscretizer ensures that the ordinal nature of the data is preserved during the discretization process, resulting in meaningful and interpretable categories.

At this step, numpy.nan are kept as their own modality (TODO: not all of them).

QuantitativeDiscretizer Example

TODO: change q for min_freq

QuantitativeDiscretizer enables the transformation of quantitative data into automatically determined intervals of values, facilitating model robustness.

Quantitative Data consists of continuous and discrete numerical variables.

The following parameters must be set for QuantitativeDiscretizer:

features, list of column names of quantitative data to discretize
q, should be set from 20 (faster, increased stability) to 50 (more precise, decreased stability).
- For quantitative data: Number of quantiles to initially cut the feature into. Values more frequent than 1/q are set as their own group and the remaining frequency is cut into proportionally fewer quantiles (q:=max(round(non_frequent * q), 1)).
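
The rule for q can be worked through on a small example. The sketch below is a standard-library approximation under stated assumptions: it uses a simple empirical quantile rule, works on a list instead of a DataFrame column, and `quantile_cuts` is a hypothetical name. It isolates over-represented values, then applies q := max(round(non_frequent * q), 1) to the rest.

```python
from collections import Counter


def quantile_cuts(values, q=20):
    """Return (values kept as their own group, upper bounds of the quantile buckets)."""
    n = len(values)
    counts = Counter(values)

    # values more frequent than 1/q become their own group
    frequent = sorted(v for v, c in counts.items() if c / n > 1 / q)
    rest = sorted(v for v in values if counts[v] / n <= 1 / q)
    if not rest:
        return frequent, []

    # the remaining frequency gets proportionally fewer quantiles
    non_frequent = len(rest) / n
    q_rest = max(round(non_frequent * q), 1)

    # upper bound of each bucket except the last, which is open-ended
    cuts = sorted({rest[len(rest) * k // q_rest - 1] for k in range(1, q_rest)})
    return frequent, cuts


# 40% of the rows are exactly 0, the other 60% are spread over 1..60
values = [0] * 40 + list(range(1, 61))
frequent, cuts = quantile_cuts(values, q=10)
```

With q=10, the value 0 (40% of rows) exceeds 1/q and is isolated; the remaining 60% of rows get max(round(0.6 * 10), 1) = 6 buckets, delimited here by the cuts 10, 20, 30, 40 and 50.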

from AutoCarver.Discretizers import QuantitativeDiscretizer

quanti_features = ['amount', 'distance', 'length', 'height']  # quantitative features to be discretized

# pre-processing of features into categorical ordinal features
quanti_discretizer = QuantitativeDiscretizer(features=quanti_features, q=40)
quanti_discretizer.fit_transform(X_train, y_train)
quanti_discretizer.transform(X_dev)

# storing built buckets
values_orders.update(quanti_discretizer.values_orders)

# append the discretizer to the feature engineering pipeline
pipe.steps.append(['QuantitativeDiscretizer', quanti_discretizer])

At this step, all numpy.nan are kept as their own modality.

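A Pipeline stored as a pickle file (see Storing, reusing the AutoCarver) can later be restored and applied to new data:

from pickle import load

# restoring the pipeline (the context manager closes the file)
with open('my_pipe.pkl', 'rb') as file:
    pipe = load(file)

# applying pipe to a validation set or in production
X_val = pipe.transform(X_val)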

TODO: add before after picture

FeatureSelector Examples

Quantitative data

from AutoCarver.FeatureSelector import FeatureSelector
from AutoCarver.FeatureSelector import zscore_measure, iqr_measure, kruskal_measure, R_measure, measure_filter, spearman_filter

measures = [zscore_measure, iqr_measure, kruskal_measure, R_measure]  # measures of interest (the last one is used for ranking)
filters = [measure_filter, spearman_filter]  # filtering out by inter-feature correlation

# select the 25 most target-associated quantitative features
quanti_selector = FeatureSelector(
    features=quanti_features,  # features to select from
    n_best=25,  # best 25 features
    measures=measures, filters=filters,   # selected measures and filters
    thresh_mode=0.9,  # filters out features with more than 90% of their mode
    thresh_nan=0.9,  # filters out features with more than 90% of missing values
    thresh_corr=0.5,  # filters out features with spearman greater than 0.5 with a better feature
    name_measure='R_measure', thresh_measure=0.06,  # filters out features with R_measure lower than 0.06
    verbose=True  # displays statistics
)
X_train = quanti_selector.fit_transform(X_train, y_train)
X_dev = quanti_selector.transform(X_dev)

# append the selector to the feature engineering pipeline
pipe.steps.append(['QuantiFeatureSelector', quanti_selector])

FeatureSelector TODO: add how to build on measures and filters

Release history

6.0.5

Jan 19, 2024

6.0.4 yanked

Jan 16, 2024

Reason this release was yanked:

non working continuous discretizer

6.0.3 yanked

Jan 9, 2024

Reason this release was yanked:

non-working selectors

6.0.2 yanked

Dec 22, 2023

Reason this release was yanked:

non working selectors

5.4.9

Dec 2, 2023

5.4.8

Nov 20, 2023

5.4.7

Nov 19, 2023

5.4.6

Nov 18, 2023

5.4.5

Nov 18, 2023

5.4.4

Nov 16, 2023

5.4.3

Nov 15, 2023

5.4.2

Nov 12, 2023

5.4.1

Nov 11, 2023

5.4.0

Nov 2, 2023

5.3.4

Nov 1, 2023

5.3.3

Nov 1, 2023

5.3.2

Nov 1, 2023

5.3.0

Oct 26, 2023

5.2.2

Oct 2, 2023

5.2.1

Sep 22, 2023

5.2.0

Aug 7, 2023

5.1.9

Jul 30, 2023

5.1.8

Jul 20, 2023

5.1.7

Jul 17, 2023

5.1.6

Jul 17, 2023

5.1.5

Jul 17, 2023

5.1.4

Jul 17, 2023

5.1.3

Jul 16, 2023

5.1.2

Jul 16, 2023

5.1.1

Jul 16, 2023

5.1.0

Jul 16, 2023

5.0.9

Jul 16, 2023

5.0.8

Jul 15, 2023

5.0.7

Jul 15, 2023

5.0.6 yanked

Jul 15, 2023

5.0.5 yanked

Jul 15, 2023

5.0.4 yanked

Jul 14, 2023

5.0.3 yanked

Jul 14, 2023

5.0.2 yanked

Jul 14, 2023

5.0.1 yanked

Jul 13, 2023

5.0.0 yanked

Jul 13, 2023

This version

4.4.1

Jun 12, 2023

4.4.0

Jun 11, 2023

4.3.2 yanked

Jun 9, 2023

4.3.1 yanked

May 22, 2023

4.3.0 yanked

May 22, 2023

4.2.1 yanked

May 17, 2023

4.2.0 yanked

May 17, 2023

4.1.0 yanked

May 15, 2023

4.0.1 yanked

May 14, 2023

4.0.0 yanked

Apr 12, 2023

3.1.0 yanked

Feb 8, 2023

Reason this release was yanked:

test on py37 to py311

3.0.12 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.11 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.10 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.9 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.8 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.7 yanked

Jan 30, 2023

Reason this release was yanked:

test on py37 to py311

3.0.6 yanked

Jan 28, 2023

Reason this release was yanked:

test on py37 to py311

3.0.5 yanked

Jan 28, 2023

Reason this release was yanked:

Github

3.0.4 yanked

Jan 28, 2023

Reason this release was yanked:

Github

3.0.3 yanked

Jan 28, 2023

Reason this release was yanked:

Github

3.0.2 yanked

Jan 28, 2023

Reason this release was yanked:

Github

3.0.1 yanked

Jan 28, 2023

Reason this release was yanked:

nans in regression

3.0.0 yanked

Jan 28, 2023

Reason this release was yanked:

missing VIF

2.1.0 yanked

Jan 27, 2023

Reason this release was yanked:

added FeatureSelector

2.0.8 yanked

Jan 27, 2023

Reason this release was yanked:

wrong denominator for target rates

2.0.7 yanked

Jan 27, 2023

Reason this release was yanked:

SettingWithCopyWarning

2.0.6 yanked

Jan 27, 2023

Reason this release was yanked:

removing prints

2.0.5 yanked

Jan 24, 2023

Reason this release was yanked:

not stable pandas apply

2.0.4 yanked

Jan 9, 2023

Reason this release was yanked:

corrected string formatting in discretizers

2.0.3 yanked

Jan 9, 2023

Reason this release was yanked:

corrected verbosity

2.0.2 yanked

Jan 9, 2023

Reason this release was yanked:

corrected import

2.0.1 yanked

Jan 9, 2023

Reason this release was yanked:

corrected __init__ file

1.1.0 yanked

Jan 6, 2023

Reason this release was yanked:

updated computation of association via crosstabs directly

1.0.3 yanked

Jan 5, 2023

Reason this release was yanked:

incorrect grouping of NaNs

1.0.2 yanked

Jan 5, 2023

Reason this release was yanked:

corrected Discretizers

1.0.1 yanked

Jan 5, 2023

Reason this release was yanked:

wrong Discretizers import

1.0.0 yanked

Jan 5, 2023

Reason this release was yanked:

init file not working

0.0.1 yanked

Jan 5, 2023

Reason this release was yanked:

missing association measure

Download files

Source Distribution

AutoCarver-4.4.1.tar.gz (30.5 kB)

Uploaded Jun 12, 2023 Source

Built Distribution

AutoCarver-4.4.1-py3-none-any.whl (32.4 kB)

Uploaded Jun 12, 2023 Python 3

Hashes for AutoCarver-4.4.1.tar.gz

Algorithm	Hash digest
SHA256	`fea73be3aae4ae08b3c9e8f4552413c63495c737c5249263174bc6da08c2a5f0`
MD5	`b2013b3301cc9603994b2c6dc1072671`
BLAKE2b-256	`40484a82227d31f8f31cced8e8f012166db390578be80c432d1115fc169bd2d7`

Hashes for AutoCarver-4.4.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`8fd098c86db1e45ac5807cd4846a2ef33a022c847b6152bb28614995ef2954ab`
MD5	`ae446416f1aebe1754d322c2c4cabf20`
BLAKE2b-256	`cd3cbfaeeb38bb22069cf60e54cbb0c9ddc15d411432d75fc3d6ceecccab87f2`