
scorescanner streamlines the exploration and quantification of relationships between features and the target when building predictive Machine Learning models.


ScoreScanner

ScoreScanner is a Python library designed to accelerate and simplify the process of understanding and quantifying the relationship between features and the target variable in the context of predictive Machine Learning modeling.

Key Features

Preprocessing

  • Outlier Identification & Replacement: Automatically detecting and replacing outliers.
  • Supervised Binning of Continuous Variables: Converting continuous variables into categorical ones using supervised binning techniques for better interpretability.

Feature Analysis

  • Univariate Feature Importance: Identifying the features with the strongest impact on the target variable using statistical measures.
  • Divergent Category Identification: Pinpointing the categories whose target distribution diverges most, measured with the Jensen-Shannon divergence, for deeper insight into the data.
  • Feature Clustering: Clustering the Cramér's V correlation matrix to group related features.

Feature Selection

  • Multicollinearity Elimination: Reducing multicollinearity so that the model's predictors are less redundant, which improves the stability and interpretability of the model.
  • Identifying Correlated Variable Subgroups: Automatically grouping correlated variables, facilitating a nuanced interpretation of feature importance through the mean of absolute Shapley values.

Logistic Regression

  • Logistic Regression Report: Generating detailed logistic regression reports that give a clear view of how each independent variable influences the target.

Installation

To install ScoreScanner, you can use pip:

pip install scorescanner

Quick Tutorial

To start, let's import the "Adult" dataset from UCI, aimed at classifying individuals based on whether their income exceeds $50K/year.

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
# The raw file has no header row and puts a space after each comma
adult_data = pd.read_csv(url, names=columns, skipinitialspace=True)

Preprocessing

Now, we propose two preprocessing steps:

  • First, identifying outliers and replacing them with a constant sentinel value.
  • Second, applying optimal binning of continuous variables, which includes creating unique categories for outliers and missing values.

We can incorporate both steps into a Scikit-learn pipeline:

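from sklearn.pipeline import Pipeline
# Assumed import path for the ScoreScanner transformers; adjust if your version differs
from scorescanner.preprocessing import outlierdetector, multioptbinning

# Target
target = 'income'
# Numerical features (every int64 column except the target)
num_features = [col for col in columns if adult_data[col].dtype == 'int64' and col != target]
# Sentinel value used to replace outliers
outlier_value = -999.001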


# Defining the pipeline steps
pipeline_steps = [
    ('outlier_detection', outlierdetector(
        columns=num_features,
        method="IQR",
        replacement_method="constant",
        replacement_value=outlier_value,
    )),
    ('optimal_binning', multioptbinning(
        variables=num_features,
        target=target,
        target_dtype="multiclass",
        outlier_value=outlier_value,
    ))
]

# Creating the pipeline
data_preprocessing_pipeline = Pipeline(steps=pipeline_steps)

# Fitting the pipeline on the data
data_preprocessing_pipeline.fit(adult_data)

# Transforming the data 
adult_data_binned = data_preprocessing_pipeline.transform(adult_data)
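
To make the outlier step more concrete, here is a minimal pandas sketch of the IQR rule with constant replacement. It is an illustration only, not ScoreScanner's implementation: the helper name replace_iqr_outliers and the 1.5 multiplier are assumptions chosen for the example.

```python
import pandas as pd


def replace_iqr_outliers(series: pd.Series, replacement: float, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with a constant (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    # Cast to float so the sentinel also fits in an integer column
    return series.astype(float).mask(is_outlier, replacement)


# Toy data: 99 lies far above the upper fence and gets replaced
hours = pd.Series([38, 40, 40, 41, 42, 99])
cleaned = replace_iqr_outliers(hours, replacement=-999.001)
print(cleaned.tolist())
```

Using a sentinel far outside the observed range lets a later binning step place all outliers in a category of their own.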

Univariate Feature Importance

Now, we can identify the most impactful features on the target variable using the univariate importance method:

from scorescanner.utils.statistical_metrics import (
    univariate_feature_importance,
    univariate_category_importance,
    calculate_cramers_v_matrix,
    cluster_corr_matrix
)

# Target variable and features list
target = 'income'
features = [col for col in columns if col != target]

# Calculate univariate feature importance
univariate_importance = univariate_feature_importance(
    df=adult_data_binned, features=features, target_var=target, method="cramerv"
)

# Display the univariate feature importance
univariate_importance
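
For readers unfamiliar with the "cramerv" method, the following sketch computes Cramér's V between two categorical series from their contingency table. It is the plain textbook formula without bias correction; whether ScoreScanner applies a correction is not stated here, and the function name cramers_v is hypothetical.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence of the two variables
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# Toy data: one perfectly associated pair and one independent pair
x = pd.Series(["a", "a", "b", "b"])
v_dependent = cramers_v(x, pd.Series(["no", "no", "yes", "yes"]))
v_independent = cramers_v(x, pd.Series(["no", "yes", "no", "yes"]))
print(v_dependent, v_independent)
```

The value ranges from 0 (no association) to 1 (perfect association), which makes features with different numbers of categories comparable.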


Identifying Highly Divergent Categories from target

Now, we can identify the categories that diverge most from the target:

univariate_category_importance(
    df=adult_data_binned, categorical_vars=features, target_var=target
)[0:30]
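
As a rough illustration of the metric behind this ranking, the sketch below computes the Jensen-Shannon divergence between two discrete distributions with NumPy. Comparing a category's target distribution with the overall target distribution, the base-2 logarithm, and the function name js_divergence are assumptions made for the example, not a description of ScoreScanner's internals.

```python
import numpy as np


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms where a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))


# Made-up numbers: overall target shares vs. target shares inside one category
overall = [0.76, 0.24]
category = [0.55, 0.45]
print(js_divergence(category, overall))

identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1, 0], [0, 1])
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (distributions with no overlap).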


Visualisation

Now, we can visualize the most important measures and statistical metrics of a variable in a bar plot:

from scorescanner.utils.plotting import (
    generate_bar_plot,
    plot_woe,
    plot_js,
    plot_corr_matrix
)
fig = generate_bar_plot(
    df=adult_data_binned,
    feature="relationship",
    target_var=target,
    cat_ref=None,
)
fig.show()

The right axis represents the percentage, allowing us to visualize the evolution of each target modality across all bins.

We can also plot the Weight of Evidence (WoE) or the Jensen-Shannon divergence of each category:

fig = plot_woe(
    df=adult_data_binned, feature="relationship", target_var=target, cat_ref=None
)
fig.show()

fig = plot_js(
    df=adult_data_binned, feature="relationship", target_var=target
)
fig.show()
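
For reference, the Weight of Evidence of a category is commonly defined as the log ratio between its share of non-events and its share of events. The sketch below uses that convention on toy data; the sign convention, the function name weight_of_evidence, and the choice of event class are assumptions and may differ from what plot_woe displays.

```python
import numpy as np
import pandas as pd


def weight_of_evidence(df: pd.DataFrame, feature: str, target: str, event) -> pd.Series:
    """WoE per category: ln(share of non-events / share of events)."""
    is_event = df[target] == event
    events = is_event.groupby(df[feature]).sum()
    non_events = (~is_event).groupby(df[feature]).sum()
    dist_event = events / events.sum()
    dist_non_event = non_events / non_events.sum()
    # Categories with zero events or zero non-events yield infinite values
    return np.log(dist_non_event / dist_event)


# Toy data: category "b" contains more events than category "a"
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "label": [1, 0, 0, 1, 1, 0],
})
woe = weight_of_evidence(toy, feature="group", target="label", event=1)
print(woe)
```

Under this convention a positive WoE marks a category where non-events are over-represented, and a negative WoE marks one where events are over-represented.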

Feature Clustering

Finally, we can compute the Cramér's V matrix between features, cluster it, and plot the result:

corr_matrix = calculate_cramers_v_matrix(df=adult_data_binned, sampling=False)
corr_matrix_clustered = cluster_corr_matrix(corr_matrix=corr_matrix, threshold=1.7)
plot_corr_matrix(corr_matrix_clustered)
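
One common way to cluster an association matrix is hierarchical clustering on the distance 1 - association, followed by reordering rows and columns by cluster label. The sketch below shows that idea with SciPy. It is an assumption about the general technique, not a description of cluster_corr_matrix: the name cluster_order is hypothetical, and its threshold is a cut on the linkage distance, which may not match the meaning of threshold=1.7 above.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_order(corr: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Reorder a symmetric association matrix so clustered variables sit together."""
    dist = 1.0 - corr.to_numpy()  # high association means small distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    labels = fcluster(linkage(condensed, method="average"), t=threshold, criterion="distance")
    order = np.argsort(labels, kind="stable")
    return corr.iloc[order, order]


# Toy matrix: A and C are strongly associated, and so are B and D
names = ["A", "B", "C", "D"]
toy_corr = pd.DataFrame(
    [[1.0, 0.1, 0.9, 0.1],
     [0.1, 1.0, 0.1, 0.8],
     [0.9, 0.1, 1.0, 0.1],
     [0.1, 0.8, 0.1, 1.0]],
    index=names, columns=names,
)
reordered = cluster_order(toy_corr, threshold=0.5)
print(list(reordered.columns))
```

After reordering, strongly associated variables form visible blocks along the diagonal of the plotted matrix.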
