Weight Of Evidence Transformer and LogisticRegression model with scikit-learn API

WOE-Scoring

Monotone Weight Of Evidence (WOE) Transformer and LogisticRegression model with scikit-learn API. Optimized for performance and stability.

Features

  • WOE Transformation: Convert categorical and numerical features to Weight of Evidence encoding
  • Automated Feature Selection: Multiple algorithms for optimal feature selection
  • Binning Strategies: Smart binning with monotonicity constraints
  • Sklearn Compatibility: Follows scikit-learn's API standards
  • Performance Optimized: Parallel processing and vectorized operations
  • SQL Export: Generate SQL for model deployment
  • Scorecard Generation: Create credit scorecards with customizable scaling

Installation

pip install woe-scoring

Quickstart

  1. Install the package:
pip install woe-scoring
  2. Use WOETransformer:
import pandas as pd
from woe_scoring import WOETransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

cat_cols = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
]

encoder = WOETransformer(
    max_bins=8,
    min_pct_group=0.1,
    diff_woe_threshold=0.1,
    cat_features=cat_cols,
    special_cols=special_cols,
    n_jobs=-1,
    merge_type="chi2",
)

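# Fit bins and WOE values on the training data, then persist them
encoder.fit(train, train["Survived"])
encoder.save_to_file("train_dict.json")

# Optional: reload the saved bins and recompute WOE values on them
encoder.load_woe_iv_dict("train_dict.json")
encoder.refit(train, train["Survived"])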

enc_train = encoder.transform(train)
enc_test = encoder.transform(test)

model = LogisticRegression()
model.fit(enc_train, train["Survived"])
test_proba = model.predict_proba(enc_test)[:, 1]
  3. Use CreateModel:
import pandas as pd
from woe_scoring import CreateModel
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

model = CreateModel(
    max_vars=5,
    special_cols=special_cols,
    selection_method="sfs",
    model_type="sklearn",
    gini_threshold=5.0,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced",
    cv=3,
)
model.fit(train, train["Survived"])
test_proba = model.predict_proba(test[model.feature_names_])

print(model.coef_, model.intercept_)
print(model.feature_names_)

Detailed Documentation

WOETransformer

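The WOETransformer converts categorical and numerical features into Weight of Evidence (WOE) values. For each bin of a feature, WOE is the logarithm of the ratio between that bin's share of non-events and its share of events (some references use the inverse ratio), so it captures how strongly the bin separates the two classes. Summing these contrasts across bins gives the feature's Information Value (IV), a common measure of its predictive power.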
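
As a rough illustration of the quantities involved, the sketch below derives WOE and IV for already-binned data using only the standard library. It assumes the convention WOE = ln(share of non-events / share of events); the sign convention inside woe-scoring may differ. The helper `woe_iv_by_bin` and its `eps` smoothing are hypothetical and not part of the package.

```python
import math
from collections import Counter

def woe_iv_by_bin(bins, target, eps=0.5):
    """Return ({bin: woe}, iv) for a binary target (1 = event).

    eps is a smoothing count added to every cell so that empty
    cells do not lead to log(0); it is an illustrative choice.
    """
    events = Counter(b for b, t in zip(bins, target) if t == 1)
    non_events = Counter(b for b, t in zip(bins, target) if t == 0)
    labels = sorted(set(bins))
    total_events = sum(events.values()) + eps * len(labels)
    total_non_events = sum(non_events.values()) + eps * len(labels)

    woe, iv = {}, 0.0
    for label in labels:
        event_share = (events[label] + eps) / total_events
        non_event_share = (non_events[label] + eps) / total_non_events
        woe[label] = math.log(non_event_share / event_share)
        # Each bin contributes a non-negative term to the IV.
        iv += (non_event_share - event_share) * woe[label]
    return woe, iv

# Toy data: one event among 6 "male" rows, three events among 4 "female" rows.
bins = ["male"] * 6 + ["female"] * 4
target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
woe, iv = woe_iv_by_bin(bins, target)
```

Under this convention a positive WOE marks a bin with relatively few events, and a negative WOE marks a bin where events are over-represented.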

WOETransformer(
    max_bins=10,               # Maximum number of bins for each feature
    min_pct_group=0.05,        # Minimum share of observations in each bin
    n_jobs=1,                  # Number of parallel jobs
    prefix="WOE_",             # Prefix for transformed features
    merge_type="chi2",         # Bin merging strategy ('chi2', 'woe', 'monotonic')
    cat_features=None,         # List of categorical features
    special_cols=None,         # Columns to exclude from transformation
    cat_features_threshold=0,  # Threshold for auto-identifying categorical features
    diff_woe_threshold=0.05,   # Minimum WOE difference between bins
    safe_original_data=False   # Whether to keep original features
)
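
The role of `diff_woe_threshold` can be pictured with a small sketch: neighbouring bins whose WOE values differ by less than the threshold are merged until every remaining pair is sufficiently distinct. This shows the general idea only and is not the package's actual merging algorithm; `merge_similar_bins` and the count-based input layout are assumptions made for the example.

```python
import math

def woe(events, non_events, total_events, total_non_events):
    # WOE of one bin, using the ln(non-event share / event share) convention.
    return math.log((non_events / total_non_events) / (events / total_events))

def merge_similar_bins(counts, diff_woe_threshold=0.05):
    """Merge adjacent bins with near-identical WOE.

    counts: ordered list of (events, non_events) per bin, all counts > 0.
    """
    counts = list(counts)
    total_e = sum(e for e, _ in counts)
    total_n = sum(n for _, n in counts)
    while len(counts) > 1:
        woes = [woe(e, n, total_e, total_n) for e, n in counts]
        diffs = [abs(woes[i + 1] - woes[i]) for i in range(len(woes) - 1)]
        # Pick the most similar neighbouring pair.
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= diff_woe_threshold:
            break  # every neighbouring pair is distinct enough
        (e1, n1), (e2, n2) = counts[i], counts[i + 1]
        counts[i:i + 2] = [(e1 + e2, n1 + n2)]
    return counts

merged = merge_similar_bins([(10, 90), (11, 89), (30, 70), (50, 50)], 0.2)
```

In this toy input the first two bins have almost the same event rate, so they collapse into one bin while the other two are left alone.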

Key Methods

  • fit(data, target): Calculates optimal bins and WOE values
  • transform(data): Converts features to WOE values
  • save_to_file(path): Saves binning information to a JSON file
  • load_woe_iv_dict(path): Loads binning information from a JSON file
  • refit(data, target): Updates WOE values for existing bins with new data

CreateModel

The CreateModel class combines feature selection, model training, and model evaluation:

CreateModel(
    selection_method='rfe',    # Feature selection method ('rfe', 'sfs', 'iv')
    model_type='sklearn',      # Model implementation ('sklearn', 'statsmodel')
    max_vars=None,             # Maximum number of features to select
    special_cols=None,         # Columns to include as-is
    unused_cols=None,          # Columns to exclude
    n_jobs=1,                  # Number of parallel jobs
    gini_threshold=5.0,        # Minimum Gini score to keep a feature
    iv_threshold=0.05,         # Minimum IV threshold for feature selection
    corr_threshold=0.5,        # Correlation threshold for feature selection
    min_pct_group=0.05,        # Minimum share of observations in each group
    random_state=None,         # Random seed for reproducibility
    class_weight='balanced',   # Class weighting strategy
    direction='forward',       # Direction for sequential feature selection
    cv=3,                      # Cross-validation folds
    l1_exp_scale=4,            # Exponent scale for L1 regularization
    l1_grid_size=20,           # Grid size for L1 regularization search
    scoring='roc_auc'          # Performance metric
)
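
The default `gini_threshold=5.0` suggests that Gini is expressed in percent. Assuming the usual definition Gini = 2 * AUC - 1, scaled by 100, the metric can be computed from labels and scores as in the standard-library sketch below. The helper `gini_percent` is hypothetical, and how CreateModel computes or applies the threshold internally is not shown here.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini_percent(labels, scores):
    # Gini coefficient in percent: 0 is random ranking, 100 is perfect.
    return 100.0 * (2.0 * auc(labels, scores) - 1.0)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
gini = gini_percent(labels, scores)
```

The pairwise AUC is quadratic in the number of rows, which is fine for a demonstration but too slow for large datasets.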

Key Methods

  • fit(data, target): Selects features and fits model
  • predict(data): Makes binary predictions
  • predict_proba(data): Returns probability predictions
  • save_reports(path): Saves model reports
  • generate_sql(encoder): Generates SQL for model deployment
  • save_scorecard(encoder, path, ...): Creates credit scorecard

Advanced Usage

Generating SQL for Deployment

from woe_scoring import CreateModel, WOETransformer

# First fit the WOE transformer and model
# (`train` is a DataFrame with a binary "target" column)
encoder = WOETransformer()
encoder.fit(train, train["target"])
train_woe = encoder.transform(train)

model = CreateModel()
model.fit(train_woe, train["target"])

# Generate SQL query for scoring
sql_query = model.generate_sql(encoder)
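
The exact SQL emitted by `generate_sql` is not reproduced here. To give a feel for how a WOE model can be scored inside a database, the following sketch builds a `CASE` expression for one numeric feature from bin edges and WOE values. The function `woe_case_sql`, the bin layout, and the example values are all hypothetical; only the `WOE_` prefix mirrors the transformer's default.

```python
def woe_case_sql(column, upper_bounds, woes, missing_woe=0.0):
    """Build a SQL CASE expression mapping a numeric column to WOE values.

    upper_bounds: ascending, exclusive upper edges for all bins but the last.
    woes: one WOE value per bin, so len(woes) == len(upper_bounds) + 1.
    """
    if len(woes) != len(upper_bounds) + 1:
        raise ValueError("expected one more WOE value than upper bounds")
    lines = ["CASE", f"  WHEN {column} IS NULL THEN {missing_woe}"]
    # Conditions are evaluated in order, so each WHEN covers one bin.
    for bound, value in zip(upper_bounds, woes):
        lines.append(f"  WHEN {column} < {bound} THEN {value}")
    lines.append(f"  ELSE {woes[-1]}")
    lines.append(f"END AS WOE_{column}")
    return "\n".join(lines)

sql = woe_case_sql("Age", [18, 40], [0.45, -0.1, -0.3])
print(sql)
```

A full scoring query would multiply each such expression by its model coefficient, add the intercept, and apply the logistic function.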

Creating a Scorecard

# Save a credit scorecard to Excel
model.save_scorecard(
    encoder=encoder,
    path="output_dir",
    base_scorecard_points=600,  # Base score
    odds=50,                    # Base odds
    points_to_double_odds=20    # Points to double the odds
)
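
The three scaling arguments match the conventional scorecard parameterisation: a base score awarded at given base odds, plus a fixed number of points that doubles the odds. A minimal sketch of that arithmetic follows, assuming the standard formulas factor = PDO / ln(2) and offset = base - factor * ln(odds), with odds expressed as good:bad. Whether `save_scorecard` uses exactly this parameterisation is an assumption, and the helper names are hypothetical.

```python
import math

def scaling(base_points=600, odds=50, points_to_double_odds=20):
    """Return (factor, offset) for score = offset + factor * ln(odds)."""
    factor = points_to_double_odds / math.log(2)
    offset = base_points - factor * math.log(odds)
    return factor, offset

def score_from_odds(good_bad_odds, factor, offset):
    # Higher good:bad odds yield a higher score.
    return offset + factor * math.log(good_bad_odds)

factor, offset = scaling()
base = score_from_odds(50, factor, offset)      # score at the base odds
doubled = score_from_odds(100, factor, offset)  # score at twice the base odds
```

With the values from the example above, odds of 50:1 map to 600 points and odds of 100:1 map to 620 points.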

Customizing Binning for Categorical Features

# Specify categorical features and their treatment
encoder = WOETransformer(
    cat_features=["education", "marital_status", "occupation"],
    max_bins=5,                 # Max bins for categorical features
    diff_woe_threshold=0.1,     # Merge bins with similar WOE values
    min_pct_group=0.05          # Minimum population percentage per bin
)
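
One way to picture `min_pct_group` for categorical features: categories whose share of rows falls below the threshold are pooled so that no bin is too small. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not the merging logic WOETransformer actually applies, and `pool_rare_categories` is a hypothetical helper.

```python
from collections import Counter

def pool_rare_categories(values, min_pct_group=0.05, pooled_label="OTHER"):
    """Replace categories rarer than min_pct_group with a single label."""
    counts = Counter(values)
    n = len(values)
    rare = {c for c, k in counts.items() if k / n < min_pct_group}
    return [pooled_label if v in rare else v for v in values]

# "pilot" and "diver" each cover 2% of rows, below the 5% threshold.
occupation = ["clerk"] * 50 + ["engineer"] * 46 + ["pilot"] * 2 + ["diver"] * 2
pooled = pool_rare_categories(occupation, min_pct_group=0.05)
```

Note that the pooled group itself can still fall below the threshold, as it does here at 4%, so a real implementation needs a further merge step.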

Performance Optimization

The library is optimized for performance with:

  • Vectorized operations for fast transformation
  • Parallel processing for binning and feature selection
  • Efficient memory usage for large datasets
  • Optimized algorithms for binning and feature selection

License

This project is licensed under the MIT License - see the LICENSE file for details.
