scorescanner streamlines the exploration and quantification of relationships between features and the target in the context of predictive Machine Learning models.

ScoreScanner

ScoreScanner is a Python library designed to accelerate and simplify the process of understanding and quantifying the relationship between features and the target variable in the context of predictive Machine Learning modeling.

Key Features
Installation
Quick Tutorial

Key Features

Preprocessing

Outlier Identification & Replacement: Automatically detecting and replacing outliers.
Supervised Binning of Continuous Variables: Converting continuous variables into categorical ones using supervised binning techniques for better interpretability.

Feature Analysis

Univariate Feature Importance: Identifying the most impactful features on the target variable using statistical measures.
Divergent Category Identification: Pinpointing the categories that diverge most from the target, measured with the Jensen-Shannon divergence.
Feature Clustering: Clustering the Cramér's V correlation matrix to group related features.

Feature Selection

Multicollinearity Elimination: Reducing multicollinearity so that the model's predictors are as independent as possible, which improves the stability and interpretability of the model.
Identifying Correlated Variable Subgroups: Automatically grouping correlated variables, facilitating a nuanced interpretation of feature importance through the mean of absolute Shapley values.

Logistic Regression

Logistic Regression Report: Generating detailed logistic regression reports that show how each independent variable influences the target.

Installation

To install ScoreScanner, you can use pip:

pip install scorescanner

Quick Tutorial

To start, let's import the "Adult" dataset from UCI, aimed at classifying individuals based on whether their income exceeds $50K/year.

import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
adult_data = pd.read_csv(url, names=columns)

Preprocessing

Now, we propose two preprocessing steps:

First, identifying outliers and replacing them with an extreme sentinel value.
Second, applying optimal binning to the continuous variables, which creates dedicated categories for outliers and missing values.

We can incorporate both steps into a Scikit-learn pipeline:

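from sklearn.pipeline import Pipeline
# Import path assumed from the package layout; adjust if your installed version differs
from scorescanner.preprocessing import multioptbinning, outlierdetector

# Target
target = 'income'
# Numerical features: every int64 column except the target
num_features = [col for col in columns if adult_data[col].dtype == 'int64' and col != target]
# Sentinel value used to replace outliers
outlier_value = -999.001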


# Defining the pipeline steps
pipeline_steps = [
    ('outlier_detection', outlierdetector(
        columns=num_features,
        method="IQR",
        replacement_method="constant",
        replacement_value=outlier_value,
    )),
    ('optimal_binning', multioptbinning(
        variables=num_features,
        target=target,
        target_dtype="multiclass",
        outlier_value=outlier_value,
    ))
]

# Creating the pipeline
data_preprocessing_pipeline = Pipeline(steps=pipeline_steps)

# Fitting the pipeline on the data
data_preprocessing_pipeline.fit(adult_data)

# Transforming the data 
adult_data_binned = data_preprocessing_pipeline.transform(adult_data)
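
For intuition, the IQR rule named in the first pipeline step can be sketched with the standard library alone. This is not scorescanner's implementation: the function name replace_iqr_outliers, the 1.5 multiplier (the conventional Tukey fence) and the "inclusive" quartile method are assumptions made for illustration, and the sample values are invented.

```python
import statistics


def replace_iqr_outliers(values, replacement, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with a constant."""
    # Quartiles of the sample; "inclusive" treats the data as the full population
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v if lower <= v <= upper else replacement for v in values]


# Hypothetical hours-per-week sample with two extreme values (99 and 1)
hours = [40, 38, 45, 50, 40, 35, 99, 40, 42, 1]
cleaned = replace_iqr_outliers(hours, replacement=-999.001)
print(cleaned)
```

Replacing outliers with a value far outside the normal range, rather than dropping the rows, lets the later binning step isolate them in a bin of their own.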

Univariate Feature Importance

Now, we can identify the most impactful features on the target variable using the univariate importance method:

from scorescanner.utils.statistical_metrics import (
    univariate_feature_importance,
    univariate_category_importance,
    calculate_cramers_v_matrix,
    cluster_corr_matrix
)

# Target variable and features list
target = 'income'
# Compare with != rather than "not in", which would do a substring test on the target name
features = [col for col in columns if col != target]

# Calculate univariate feature importance
univariate_importance = univariate_feature_importance(
    df=adult_data_binned, features=features, target_var=target, method="cramerv"
)

# Display the univariate feature importance
univariate_importance
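
To make the "cramerv" method more concrete, here is a minimal, dependency-free sketch of Cramér's V between two categorical variables. It uses the plain textbook formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), without bias correction; whether scorescanner applies a correction is not stated in the source, so treat the cramers_v helper below as an illustration rather than the library's code.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two equally long sequences of category labels."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed counts per (row, column) cell
    row_totals = Counter(x)
    col_totals = Counter(y)

    # Pearson chi-square statistic over every cell of the contingency table
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * min_dim)) if min_dim > 0 else 0.0


# Toy data: the feature fully determines the target, so V is at its maximum
feature = ["a", "a", "b", "b"]
income = ["<=50K", "<=50K", ">50K", ">50K"]
print(cramers_v(feature, income))
```

Values close to 0 indicate no association with the target and values close to 1 a strong association, which is what makes the statistic usable as a univariate importance score for binned features.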


Identifying Highly Divergent Categories from target

Now, we can identify the categories that diverge most from the target:

univariate_category_importance(
    df=adult_data_binned, categorical_vars=features, target_var=target
)[0:30]
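
As a rough illustration of the metric involved, the Jensen-Shannon divergence between two discrete distributions can be computed as below. The sketch assumes that a category's target distribution is compared with the overall target distribution; the source does not spell out the exact comparison scorescanner makes, and both the js_divergence helper and the example proportions are hypothetical.

```python
from math import log2


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so between 0 and 1) of two discrete distributions."""
    # Mixture distribution halfway between p and q
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical proportions of (<=50K, >50K): overall, then within one category
overall = [0.76, 0.24]
category = [0.55, 0.45]
print(js_divergence(category, overall))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and bounded, so scores remain comparable across categories even when one of the target classes is absent from a category.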


Visualisation

Now, we can visualize the most important measures and statistical metrics of a variable in a bar plot:

from scorescanner.utils.plotting import (
    generate_bar_plot,
    plot_woe,
    plot_js,
    plot_corr_matrix
)

fig = generate_bar_plot(
    df=adult_data_binned,
    feature="relationship",
    target_var=target,
    cat_ref=None,
)
fig.show()

The right axis represents the percentage, allowing us to visualize the evolution of each target modality across all bins.

We can also focus on the Weight of Evidence or the Jensen-Shannon metrics.

fig = plot_woe(
    df=adult_data_binned, feature="relationship", target_var=target, cat_ref=None
)
fig.show()

fig = plot_js(
    df=adult_data_binned, feature="relationship", target_var=target
)
fig.show()
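
For readers unfamiliar with the Weight of Evidence, the sketch below computes it per category for a binary target using one common convention, WoE = ln(share of events / share of non-events). Sign conventions differ between references, and the source does not say which one plot_woe follows or how cat_ref enters the calculation, so the weight_of_evidence helper and its toy data are assumptions for illustration only.

```python
from collections import Counter
from math import log


def weight_of_evidence(categories, targets, event):
    """WoE per category: ln(share of events / share of non-events)."""
    events = Counter(c for c, t in zip(categories, targets) if t == event)
    non_events = Counter(c for c, t in zip(categories, targets) if t != event)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe = {}
    for cat in set(categories):
        # WoE is undefined when a category has no events or no non-events
        if events[cat] == 0 or non_events[cat] == 0:
            continue
        woe[cat] = log(
            (events[cat] / total_events) / (non_events[cat] / total_non_events)
        )
    return woe


# Toy data: "Husband" is over-represented among >50K, "Own-child" among <=50K
relationship = ["Husband"] * 4 + ["Own-child"] * 4
income = [">50K", ">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"]
woe = weight_of_evidence(relationship, income, event=">50K")
print(woe)
```

Under this convention a positive value means the category is over-represented among events, a negative value means it is under-represented, and a value near zero means the category carries little information about the target.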

Feature Clustering

corr_matrix = calculate_cramers_v_matrix(df=adult_data_binned, sampling=False)
corr_matrix_clustered = cluster_corr_matrix(corr_matrix=corr_matrix, threshold=1.7) 
plot_corr_matrix(corr_matrix_clustered)

Release history

0.1.3: Apr 28, 2024
0.1.1: Feb 12, 2024 (this version)
0.1.0: Feb 7, 2024
