Weight Of Evidence Transformer and LogisticRegression model with scikit-learn API

WOE-Scoring

Monotone Weight Of Evidence (WOE) Transformer and LogisticRegression model with scikit-learn API. Optimized for performance and stability.

Features

  • WOE Transformation: Convert categorical and numerical features to Weight of Evidence encoding
  • Automated Feature Selection: Multiple algorithms for optimal feature selection
  • Automated Feature Generation: Automatically create and select high-quality ratio and interaction features
  • Binning Strategies: Smart binning with monotonicity constraints
  • Sklearn Compatibility: Follows scikit-learn's API standards
  • Performance Optimized: Parallel processing and vectorized operations
  • SQL Export: Generate SQL for model deployment
  • Scorecard Generation: Create credit scorecards with customizable scaling

Installation

pip install woe-scoring

Quickstart

  1. Install the package:
pip install woe-scoring
  2. Use WOETransformer:
import pandas as pd
from woe_scoring import WOETransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

cat_cols = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
]

encoder = WOETransformer(
    max_bins=8,
    min_pct_group=0.1,
    diff_woe_threshold=0.1,
    cat_features=cat_cols,
    special_cols=special_cols,
    n_jobs=-1,
    merge_type="chi2",
    generate_features=True,  # Enable feature generation
    max_generated_features=10,
)

encoder.fit(train, train["Survived"])
encoder.save_to_file("train_dict.json")

encoder.load_woe_iv_dict("train_dict.json")
encoder.refit(train, train["Survived"])

enc_train = encoder.transform(train)
enc_test = encoder.transform(test)

model = LogisticRegression()
model.fit(enc_train, train["Survived"])
test_proba = model.predict_proba(enc_test)[:, 1]
  3. Use CreateModel:
import pandas as pd
from woe_scoring import CreateModel
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

model = CreateModel(
    max_vars=5,
    special_cols=special_cols,
    selection_method="sfs",
    model_type="sklearn",
    gini_threshold=5.0,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced",
    cv=3,
)
model.fit(train, train["Survived"])
test_proba = model.predict_proba(test[model.feature_names_])

print(model.coef_, model.intercept_)
print(model.feature_names_)

Detailed Documentation

WOETransformer

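The WOETransformer converts categorical and numerical features into Weight of Evidence (WOE) values. For each bin of a feature, WOE is the logarithm of the ratio between the bin's share of events and its share of non-events (the sign convention varies between implementations), so it captures how strongly that bin separates the two classes.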
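To make the WOE idea concrete, here is a minimal standard-library sketch that computes WOE per bin and the resulting Information Value (IV) by hand. It is not the library's implementation: the function name `woe_iv`, the `ln(event share / non-event share)` sign convention, and the 0.5 floor for empty cells are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def woe_iv(bins, target):
    """Return ({bin: woe}, iv) for a binned feature and a binary target (1 = event)."""
    events = Counter(b for b, t in zip(bins, target) if t == 1)
    non_events = Counter(b for b, t in zip(bins, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        # Assumed smoothing: a 0.5 floor keeps the logarithm finite for empty cells.
        event_share = max(events[b], 0.5) / total_events
        non_event_share = max(non_events[b], 0.5) / total_non_events
        woe[b] = math.log(event_share / non_event_share)
        # IV sums the share difference weighted by WOE over all bins.
        iv += (event_share - non_event_share) * woe[b]
    return woe, iv


bins = ["low"] * 4 + ["high"] * 4
target = [1, 1, 1, 0, 1, 0, 0, 0]
woe, iv = woe_iv(bins, target)
print(woe, iv)
```

In this toy data the "low" bin holds three of the four events and one of the four non-events, so its WOE is ln(3), and the "high" bin mirrors it with ln(1/3).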

WOETransformer(
    max_bins=10,               # Maximum number of bins for each feature
    min_pct_group=0.05,        # Minimum share of observations per bin (0.05 = 5%)
    n_jobs=1,                  # Number of parallel jobs
    prefix="WOE_",             # Prefix for transformed features
    merge_type="chi2",         # Bin merging strategy ('chi2', 'woe', 'monotonic')
    cat_features=None,         # List of categorical features
    special_cols=None,         # Columns to exclude from transformation
    cat_features_threshold=0,  # Threshold for auto-identifying categorical features
    diff_woe_threshold=0.05,   # Minimum WOE difference between bins
    safe_original_data=False,  # Whether to keep original features
    generate_features=False,   # Whether to generate new features
    max_generated_features=20  # Max number of generated features to select
)

Key Methods

  • fit(data, target): Calculates optimal bins and WOE values
  • transform(data): Converts features to WOE values
  • save_to_file(path): Saves binning information to a JSON file
  • load_woe_iv_dict(path): Loads binning information from a JSON file
  • refit(data, target): Updates WOE values for existing bins with new data

CreateModel

The CreateModel class combines feature selection, model training, and model evaluation:

CreateModel(
    selection_method='rfe',    # Feature selection method ('rfe', 'sfs', 'iv')
    model_type='sklearn',      # Model implementation ('sklearn', 'statsmodel')
    max_vars=None,             # Maximum number of features to select
    special_cols=None,         # Columns to include as-is
    unused_cols=None,          # Columns to exclude
    n_jobs=1,                  # Number of parallel jobs
    gini_threshold=5.0,        # Minimum Gini score to keep a feature
    iv_threshold=0.05,         # Minimum IV threshold for feature selection
    corr_threshold=0.5,        # Correlation threshold for feature selection
    min_pct_group=0.05,        # Minimum share of observations per group (0.05 = 5%)
    random_state=None,         # Random seed for reproducibility
    class_weight='balanced',   # Class weighting strategy
    direction='forward',       # Direction for sequential feature selection
    cv=3,                      # Cross-validation folds
    l1_exp_scale=4,            # Exponent scale for L1 regularization
    l1_grid_size=20,           # Grid size for L1 regularization search
    scoring='roc_auc'          # Performance metric
)

Key Methods

  • fit(data, target): Selects features and fits model
  • predict(data): Makes binary predictions
  • predict_proba(data): Returns probability predictions
  • save_reports(path): Saves model reports
  • generate_sql(encoder): Generates SQL for model deployment
  • save_scorecard(encoder, path, ...): Creates credit scorecard

Advanced Usage

Automated Feature Generation

WOE-Scoring can automatically generate and select high-quality features from your data:

encoder = WOETransformer(
    generate_features=True,    # Enable feature generation
    max_generated_features=20, # Select top 20 new features
    n_jobs=-1
)
encoder.fit(X, y)

This process:

  1. Creates ratio features from all pairs of numeric columns
  2. Calculates statistical aggregations (mean) for numeric columns grouped by categorical columns
  3. Calculates the Gini score for all new features
  4. Selects the top max_generated_features
  5. Adds them to the dataset and proceeds with WOE binning
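The ratio-and-rank part of the steps above can be sketched in plain Python. This is an illustration only, not the library's code: the helper names `gini` and `top_ratio_features`, the `a_div_b` naming scheme, the one-ratio-per-unordered-pair choice, the zero-denominator handling, and ranking by absolute Gini are all assumptions. Gini is computed as 2 * AUC - 1.

```python
from itertools import combinations


def gini(scores, target):
    """Gini = 2 * AUC - 1, with AUC computed from pairwise comparisons."""
    pos = [s for s, t in zip(scores, target) if t == 1]
    neg = [s for s, t in zip(scores, target) if t == 0]
    # Each (event, non-event) pair counts 1 if ranked correctly, 0.5 on ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 2.0 * wins / (len(pos) * len(neg)) - 1.0


def top_ratio_features(columns, target, max_generated_features):
    """Build a / b for each pair of numeric columns and keep the strongest ones."""
    candidates = {}
    for a, b in combinations(columns, 2):
        # Assumption: rows with a zero denominator get 0.0; a real pipeline might use NaN.
        ratio = [x / y if y != 0 else 0.0 for x, y in zip(columns[a], columns[b])]
        candidates[f"{a}_div_{b}"] = ratio
    ranked = sorted(
        candidates,
        key=lambda name: abs(gini(candidates[name], target)),
        reverse=True,
    )
    return {name: candidates[name] for name in ranked[:max_generated_features]}


columns = {
    "income": [10, 20, 30, 40],
    "debt": [8, 10, 6, 4],
    "age": [10, 80, 100, 50],
}
target = [1, 1, 0, 0]
selected = top_ratio_features(columns, target, max_generated_features=2)
print(list(selected))
```

Here `income_div_age` has a Gini of 0 and is dropped, while the other two ratios separate the classes perfectly and are kept.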

Generating SQL for Deployment

from woe_scoring import CreateModel, WOETransformer

# First fit the WOE transformer and model
encoder = WOETransformer()
encoder.fit(train, train["target"])
train_woe = encoder.transform(train)

model = CreateModel()
model.fit(train_woe, train["target"])

# Generate SQL query for scoring
sql_query = model.generate_sql(encoder)

Creating a Scorecard

# Save a credit scorecard to Excel
model.save_scorecard(
    encoder=encoder,
    path="output_dir",
    base_scorecard_points=600,  # Base score
    odds=50,                    # Base odds
    points_to_double_odds=20    # Points to double the odds
)
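The three scaling parameters correspond to the conventional scorecard scaling, in which the score is a linear function of log-odds. The sketch below shows that standard formula; whether `save_scorecard` uses exactly this formula is an assumption, and the helper names `scorecard_scaling` and `score_from_odds` are hypothetical.

```python
import math


def scorecard_scaling(base_scorecard_points, odds, points_to_double_odds):
    """Return (factor, offset) so that score = offset + factor * ln(odds)."""
    factor = points_to_double_odds / math.log(2)
    offset = base_scorecard_points - factor * math.log(odds)
    return factor, offset


def score_from_odds(current_odds, factor, offset):
    """Map odds to scorecard points using the linear log-odds scaling."""
    return offset + factor * math.log(current_odds)


factor, offset = scorecard_scaling(600, 50, 20)
print(round(score_from_odds(50, factor, offset)))   # 600 at the base odds
print(round(score_from_odds(100, factor, offset)))  # 620 once the odds double
```

With these settings, an applicant at odds of 50 receives 600 points, and each doubling of the odds adds 20 points.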

Customizing Binning for Categorical Features

# Specify categorical features and their treatment
encoder = WOETransformer(
    cat_features=["education", "marital_status", "occupation"],
    max_bins=5,                 # Max bins for categorical features
    diff_woe_threshold=0.1,     # Merge bins with similar WOE values
    min_pct_group=0.05          # Minimum population percentage per bin
)
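As a rough illustration of what merging bins with similar WOE values means, the sketch below repeatedly merges the closest pair of adjacent bins until every neighbouring pair differs by at least the threshold. This is not the library's algorithm (its `merge_type` strategies are not reproduced here); the function names, the `(events, non_events)` bin representation, and the 0.5 floor are assumptions.

```python
import math


def bin_woe(events, non_events, total_events, total_non_events):
    """WOE of one bin as ln(event share / non-event share), with a 0.5 floor."""
    event_share = max(events, 0.5) / total_events
    non_event_share = max(non_events, 0.5) / total_non_events
    return math.log(event_share / non_event_share)


def merge_similar_bins(bins, diff_woe_threshold):
    """bins: ordered list of (events, non_events) tuples. Returns the merged list."""
    bins = list(bins)
    total_e = sum(e for e, _ in bins)
    total_n = sum(n for _, n in bins)
    while len(bins) > 1:
        woes = [bin_woe(e, n, total_e, total_n) for e, n in bins]
        diffs = [abs(woes[i + 1] - woes[i]) for i in range(len(bins) - 1)]
        # Pick the adjacent pair whose WOE values are closest.
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= diff_woe_threshold:
            break  # every adjacent pair is already distinct enough
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins


merged = merge_similar_bins([(10, 90), (10, 88), (40, 60)], diff_woe_threshold=0.1)
print(merged)
```

The first two bins differ in WOE by only about 0.02, so they are merged, while the third bin stays separate.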

Performance Optimization

The library is optimized for performance with:

  • Vectorized operations for fast transformation
  • Parallel processing for binning and feature selection
  • Efficient memory usage for large datasets
  • Optimized algorithms for binning and feature selection

License

This project is licensed under the MIT License - see the LICENSE file for details.
