
Explainable Boosted Scoring


xbooster 🚀

A scorecard-format classification framework built on XGBoost logistic regression. xbooster converts an XGBoost logistic regression model into a logarithmic (points-based) scoring system.

In addition, it provides a suite of interpretability tools to understand the model's behavior, which can be instrumental for model testing and expert validation.

The interpretability suite includes:

  • Granular boosted tree statistics, including metrics such as Weight of Evidence (WOE) and Information Value (IV) for splits 🌳
  • Tree visualization with customizations 🎨
  • Global and local feature importance 📊
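
For reference, the split-level WOE and IV follow the standard credit-scoring definitions (sign conventions vary across sources). A minimal sketch with made-up counts, not xbooster's internal implementation:

```python
import math

# Hypothetical counts of non-events (goods) and events (bads)
# falling into one split, versus the totals in the sample
good_i, bad_i = 90, 10
good_total, bad_total = 800, 200

dist_good = good_i / good_total  # share of all goods landing in this split
dist_bad = bad_i / bad_total     # share of all bads landing in this split

woe = math.log(dist_good / dist_bad)  # Weight of Evidence of the split
iv = (dist_good - dist_bad) * woe     # the split's contribution to Information Value

print(f"WOE={woe:.3f}, IV={iv:.4f}")
```

A positive WOE under this convention means the split is richer in goods than the overall sample; IV sums these contributions across splits.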

xbooster also supports scorecard deployment via SQL 📦.

Installation ⤵

Install the package using pip:

pip install xbooster

Usage 📝

Here's a quick example of how to use xbooster to construct a scorecard for an XGBoost model:

import pandas as pd
import xgboost as xgb
from xbooster.constructor import XGBScorecardConstructor
from sklearn.model_selection import train_test_split

# Load data and train XGBoost model
url = (
    "https://github.com/xRiskLab/xBooster/raw/main/examples/data/credit_data.parquet"
)
dataset = pd.read_parquet(url)

features = [
    "external_risk_estimate",
    "revolving_utilization_of_unsecured_lines",
    "account_never_delinq_percent",
    "net_fraction_revolving_burden",
    "num_total_cc_accounts",
    "average_months_in_file",
]

target = "is_bad"

X, y = dataset[features], dataset[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the XGBoost model
best_params = {
    'n_estimators': 100,
    'learning_rate': 0.55,
    'max_depth': 1,
    'min_child_weight': 10,
    'grow_policy': "lossguide",
    'early_stopping_rounds': 5
}
model = xgb.XGBClassifier(**best_params, random_state=62)
# early_stopping_rounds requires a validation set to be passed to fit()
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# Initialize XGBScorecardConstructor
scorecard_constructor = XGBScorecardConstructor(model, X_train, y_train)
scorecard_constructor.construct_scorecard()

# Print the scorecard
print(scorecard_constructor.scorecard)

Next, we can assign points to the scorecard and evaluate its Gini score on the test set:

from sklearn.metrics import roc_auc_score

# Create scoring points
xgb_scorecard_with_points = scorecard_constructor.create_points(
    pdo=50, target_points=600, target_odds=50
)
# Make predictions using the scorecard
credit_scores = scorecard_constructor.predict_score(X_test)

# Higher scores mean lower risk, so negate the scores for the AUC calculation
gini = roc_auc_score(y_test, -credit_scores) * 2 - 1
print(f"Test Gini score: {gini:.2%}")
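
The pdo, target_points, and target_odds arguments define the standard PDO (points to double the odds) scaling used in credit scorecards. A sketch of the textbook formula, not necessarily xbooster's exact internals:

```python
import math

pdo, target_points, target_odds = 50, 600, 50

# Points added per unit of log-odds, and the intercept that pins
# the target odds to the target score
factor = pdo / math.log(2)
offset = target_points - factor * math.log(target_odds)

def points(odds):
    """Score assigned to a given good:bad odds ratio."""
    return offset + factor * math.log(odds)

print(points(target_odds))      # the target odds map to ~600 points
print(points(2 * target_odds))  # doubling the odds adds ~pdo points
```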

We can also visualize the score distributions for events and non-events.

from xbooster import explainer

explainer.plot_score_distribution(
    y_test,
    credit_scores,
    n_bins=30,
    figsize=(8, 3),
    dpi=100
)

We can further examine feature importances.

Below, we visualize the global feature importances using Points as the metric:

from xbooster import explainer

explainer.plot_importance(
    scorecard_constructor,
    metric='Points',
    method='global',
    normalize=True,
    figsize=(3, 3)
)

Alternatively, we can calculate local feature importances, which are particularly useful for boosters with a depth greater than 1.

explainer.plot_importance(
    scorecard_constructor,
    metric='Likelihood',
    method='local',
    normalize=True,
    color='#ffd43b',
    edgecolor='#1e1e1e',
    figsize=(3, 3)
)

Finally, we can generate a scorecard in SQL format.

sql_query = scorecard_constructor.generate_sql_query(table_name='my_table')
print(sql_query)

Parameters 🛠

xbooster.constructor - XGBoost Scorecard Constructor

Description

A class for generating a scorecard from a trained XGBoost model. The methodology is inspired by the NVIDIA GTC Talk "Machine Learning in Retail Credit Risk" by Paul Edwards.

Methods

  1. extract_leaf_weights() -> pd.DataFrame:

    • Extracts the leaf weights from the booster's trees and returns a DataFrame.
    • Returns:
      • pd.DataFrame: DataFrame containing the extracted leaf weights.
  2. extract_decision_nodes() -> pd.DataFrame:

    • Extracts the split (decision) nodes from the booster's trees and returns a DataFrame.
    • Returns:
      • pd.DataFrame: DataFrame containing the extracted split (decision) nodes.
  3. construct_scorecard() -> pd.DataFrame:

    • Constructs a scorecard based on a booster.
    • Returns:
      • pd.DataFrame: The constructed scorecard.
  4. create_points(pdo=50, target_points=600, target_odds=19, precision_points=0, score_type='XAddEvidence') -> pd.DataFrame:

    • Creates a points card from a scorecard.
    • Parameters:
      • pdo (int, optional): The points to double the odds. Default is 50.
      • target_points (int, optional): The standard scorecard points. Default is 600.
      • target_odds (int, optional): The standard scorecard odds. Default is 19.
      • precision_points (int, optional): The points decimal precision. Default is 0.
      • score_type (str, optional): The log-odds to use for the points card. Default is 'XAddEvidence'.
    • Returns:
      • pd.DataFrame: The points card.
  5. predict_score(X: pd.DataFrame) -> pd.Series:

    • Predicts the score for a given dataset using the constructed scorecard.
    • Parameters:
      • X (pd.DataFrame): Features of the dataset.
    • Returns:
      • pd.Series: Predicted scores.
  6. sql_query (property):

    • Property that returns the SQL query for deploying the scorecard.
    • Returns:
      • str: The SQL query for deploying the scorecard.
  7. generate_sql_query(table_name: str = "my_table") -> str:

    • Converts a scorecard into an SQL format.
    • Parameters:
      • table_name (str): The name of the input table in SQL.
    • Returns:
      • str: The final SQL query for deploying the scorecard.
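
Conceptually, the generated SQL scores a record by summing one CASE expression per tree. A minimal Python analogue for depth-1 trees, with made-up thresholds and point values for illustration (the actual query comes from generate_sql_query):

```python
def score_tree(value, threshold, points_below, points_else):
    # Mirrors one SQL expression: CASE WHEN x < t THEN p1 ELSE p2 END
    # (missing values fall through to the ELSE branch in this sketch)
    return points_below if value is not None and value < threshold else points_else

# A hypothetical record scored against two hypothetical depth-1 trees
record = {
    "external_risk_estimate": 72,
    "revolving_utilization_of_unsecured_lines": 0.4,
}
total_points = (
    score_tree(record["external_risk_estimate"], 63.5, -12, 25)
    + score_tree(record["revolving_utilization_of_unsecured_lines"], 0.58, 18, -20)
)
print(total_points)
```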

xbooster.explainer - XGBoost Scorecard Explainer

This module provides functionalities for explaining XGBoost scorecards, including methods to extract split information, build interaction splits, visualize tree structures, plot feature importances, and more.

Methods:

  1. extract_splits_info(features: str) -> list:

    • Extracts split information from the DetailedSplit feature.
    • Inputs:
      • features (str): A string containing split information.
    • Outputs:
      • Returns a list of tuples containing split information (feature, sign, value).
  2. build_interactions_splits(scorecard_constructor: Optional[XGBScorecardConstructor] = None, dataframe: Optional[pd.DataFrame] = None) -> pd.DataFrame:

    • Builds interaction splits from the XGBoost scorecard.
    • Inputs:
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • dataframe (Optional[pd.DataFrame]): The dataframe containing split information.
    • Outputs:
      • Returns a pandas DataFrame containing interaction splits.
  3. split_and_count(scorecard_constructor: Optional[XGBScorecardConstructor] = None, dataframe: Optional[pd.DataFrame] = None, label_column: Optional[str] = None) -> pd.DataFrame:

    • Splits the dataset and counts events for each split.
    • Inputs:
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • dataframe (Optional[pd.DataFrame]): The dataframe containing features and labels.
      • label_column (Optional[str]): The label column in the dataframe.
    • Outputs:
      • Returns a pandas DataFrame containing split information and event counts.
  4. plot_importance(scorecard_constructor: Optional[XGBScorecardConstructor] = None, metric: str = "Likelihood", normalize: bool = True, method: Optional[str] = None, dataframe: Optional[pd.DataFrame] = None, **kwargs: Any) -> None:

    • Plots the importance of features based on the XGBoost scorecard.
    • Inputs:
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • metric (str): Metric to plot ("Likelihood" (default), "NegLogLikelihood", "IV", or "Points").
      • normalize (bool): Whether to normalize the importance values (default: True).
      • method (Optional[str]): The method to use for plotting the importance ("global" or "local").
      • dataframe (Optional[pd.DataFrame]): The dataframe containing features and labels.
      • fontfamily (str): The font family to use for the plot (default: "Monospace").
      • fontsize (int): The font size to use for the plot (default: 12).
      • dpi (int): The DPI of the plot (default: 100).
      • title (str): The title of the plot (default: "Feature Importance").
      • **kwargs (Any): Additional Matplotlib parameters.
  5. plot_score_distribution(y_true: pd.Series = None, y_pred: pd.Series = None, n_bins: int = 25, scorecard_constructor: Optional[XGBScorecardConstructor] = None, **kwargs: Any):

    • Plots the distribution of predicted scores based on actual labels.
    • Inputs:
      • y_true (pd.Series): The true labels.
      • y_pred (pd.Series): The predicted scores.
      • n_bins (int): Number of bins for histogram (default: 25).
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • **kwargs (Any): Additional Matplotlib parameters.
  6. plot_local_importance(scorecard_constructor: Optional[XGBScorecardConstructor] = None, metric: str = "Likelihood", normalize: bool = True, dataframe: Optional[pd.DataFrame] = None, **kwargs: Any) -> None:

    • Plots the local importance of features based on the XGBoost scorecard.
    • Inputs:
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • metric (str): Metric to plot ("Likelihood" (default), "NegLogLikelihood", "IV", or "Points").
      • normalize (bool): Whether to normalize the importance values (default: True).
      • dataframe (Optional[pd.DataFrame]): The dataframe containing features and labels.
      • fontfamily (str): The font family to use for the plot (default: "Arial").
      • fontsize (int): The font size to use for the plot (default: 12).
      • boxstyle (str): The rounding box style to use for the plot (default: "round").
      • title (str): The title of the plot (default: "Local Feature Importance").
      • **kwargs (Any): Additional parameters to pass to the matplotlib function.
  7. plot_tree(tree_index: int, scorecard_constructor: Optional[XGBScorecardConstructor] = None, show_info: bool = True) -> None:

    • Plots the tree structure.
    • Inputs:
      • tree_index (int): Index of the tree to plot.
      • scorecard_constructor (Optional[XGBScorecardConstructor]): The XGBoost scorecard constructor.
      • show_info (bool): Whether to show additional information (default: True).
      • **kwargs (Any): Additional Matplotlib parameters.

Contributing 🤝

Contributions are welcome! For bug reports or feature requests, please open an issue.

For code contributions, please open a pull request.

Version

Current version: 0.2.2

Changelog

[0.1.0] - 2024-02-14

  • Initial release

[0.2.0] - 2024-05-03

  • Added tree visualization class (explainer.py)
  • Updated the local explanation algorithm for models with a depth > 1 (explainer.py)
  • Added a categorical preprocessor (_utils.py)

[0.2.1] - 2024-05-03

  • Updates of dependencies

[0.2.2] - 2024-05-08

  • Updated the explainer.py module to improve kwargs handling; other minor changes.

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.
