
End-to-End Automated Risk Scoring Platform for Credit, Fraud, and Churn Prediction


🚀 RiskX - End-to-End Automated Risk Scoring Platform

v0.1.0 | Production-Ready Core | Credit • Fraud • Churn Risk Scoring

Python 3.8+ | MIT License | PyPI

RiskX is a comprehensive, production-ready platform for automated risk scoring. Built for financial institutions, fintech companies, and data scientists working on credit scoring, fraud detection, and customer churn prediction.


🎯 What is RiskX?

RiskX provides an end-to-end automated workflow for risk scoring:

  1. Data Loading - Multi-source data ingestion (CSV, Excel, SQL, APIs, Cloud)
  2. Data Cleaning - Automated quality checks and preprocessing
  3. Feature Engineering - Risk-specific features (WOE/IV, RFM, behavioral)
  4. ML Training - AutoML with multiple algorithms (LR, RF, XGBoost, LightGBM)
  5. Scoring - Real-time and batch scoring with interpretability
  6. Monitoring - Model performance and data drift detection (coming soon)

✨ Key Features

🔥 What's Working NOW (v0.1.0)

1. Multi-Source Data Loading

Load data from 8+ different sources:

  • CSV, Excel, JSON, Parquet files
  • SQL databases (via SQLAlchemy)
  • REST APIs
  • Cloud data lakes (Azure, AWS, GCP)
  • Pandas DataFrames

2. Automated Data Cleaning

7 powerful cleaning methods:

  • Missing value imputation (6 strategies)
  • Outlier detection and handling (IQR, Z-score, clipping)
  • Type validation and correction
  • Categorical encoding (label, one-hot)
  • Feature scaling (standard, min-max)
  • Duplicate removal
  • Full automated pipeline with auto_clean()

3. Risk-Specific Feature Engineering

Create 50+ features automatically:

  • WOE (Weight of Evidence) & IV (Information Value)
  • Optimal binning (quantile, uniform, kmeans)
  • RFM analysis (Recency, Frequency, Monetary)
  • Behavioral features from transactions
  • Time-based features (11 datetime extractions)
  • Ratio and interaction features
  • Full automated pipeline with auto_features()
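
Of the features above, WOE/IV is the classic scorecard workhorse: per bin, WOE_i = ln((good_i/good_total)/(bad_i/bad_total)), and IV = Σ (good%_i − bad%_i)·WOE_i. As a concept check — this is plain pandas/NumPy illustrating the math, not RiskX's internal implementation (the library's own API, compute_woe_iv, appears in the examples below):

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target, n_bins=5):
    """Weight of Evidence per quantile bin and total Information Value.

    WOE_i = ln((good_i / good_total) / (bad_i / bad_total))
    IV    = sum_i (good%_i - bad%_i) * WOE_i
    """
    bins = pd.qcut(df[feature], q=n_bins, duplicates="drop")
    grouped = df.groupby(bins, observed=True)[target].agg(total="count", bad="sum")
    grouped["good"] = grouped["total"] - grouped["bad"]
    eps = 0.5  # smoothing so an empty bin cannot divide by zero
    good_pct = (grouped["good"] + eps) / (grouped["good"].sum() + eps * len(grouped))
    bad_pct = (grouped["bad"] + eps) / (grouped["bad"].sum() + eps * len(grouped))
    grouped["woe"] = np.log(good_pct / bad_pct)
    iv = float(((good_pct - bad_pct) * grouped["woe"]).sum())
    return grouped[["good", "bad", "woe"]], iv
```

A common rule of thumb reads IV below 0.02 as unpredictive, 0.1-0.3 as medium strength, and above 0.5 as suspiciously strong (possible target leakage).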

4. AutoML Training

Train and compare 4 algorithms:

  • Logistic Regression
  • Random Forest
  • XGBoost
  • LightGBM

Plus, on top of training:

  • Automatic best model selection
  • Model calibration (isotonic, sigmoid)
  • Ensemble methods (voting, stacking)
  • Hyperparameter optimization (Optuna)
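
The calibration step (isotonic, sigmoid) remaps a model's raw scores to probabilities that match observed event rates — important when the score feeds a 300-850 scale or a pricing decision. A sketch of how this plausibly works under the hood, using scikit-learn's CalibratedClassifierCV (an assumption about RiskX's internals; the synthetic data is purely illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# cv=5: the base model is fit on folds, and an isotonic regression maps its
# out-of-fold scores onto observed event rates; method="sigmoid" gives Platt scaling.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method="isotonic",
    cv=5,
)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]
```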

5. Production-Ready Scoring

API-ready scoring engine:

  • Real-time single predictions
  • Batch scoring
  • Score range: 300-850 (configurable)
  • Risk ratings: Excellent, Very Good, Good, Fair, Poor
  • Reason codes for interpretability
  • Score interpretation and recommendations
  • API specification export
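
Mapping a default probability onto a 300-850 range is conventionally done with log-odds ("points to double the odds") scaling. RiskX's exact formula is not documented here, so the base_score, base_odds, and pdo values below are illustrative assumptions, not the library's defaults:

```python
import numpy as np

def probability_to_score(p_default, base_score=600, base_odds=50, pdo=20,
                         score_min=300, score_max=850):
    """Scorecard scaling: base_score points at base_odds (good:bad) odds,
    with pdo points added each time the odds double. Clipped to the range."""
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    odds_good = (1.0 - p_default) / p_default  # good:bad odds
    score = offset + factor * np.log(odds_good)
    return float(np.clip(score, score_min, score_max))
```

Lower default probability means higher score; anything falling outside the configured range is clipped to the floor or ceiling.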

🚀 Quick Start

Installation

# Core installation (pandas, numpy, scikit-learn)
pip install riskx

# Full installation (includes XGBoost, LightGBM, Optuna, etc.)
pip install riskx[full]

# ML only (XGBoost, LightGBM, Optuna)
pip install riskx[ml]

# Data sources (SQL, APIs, Parquet, Excel)
pip install riskx[data]

Basic Usage

from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine
from riskx import RiskAutoModel, ScoringEngine

# 1. Load data
connector = RiskDataConnector()
data = connector.from_csv("loan_applications.csv")

# 2. Clean data (automated)
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="default")

# 3. Engineer features (automated)
feature_engine = RiskFeatureEngine()
data_features = feature_engine.auto_features(data_clean, target="default")

# 4. Train models (AutoML)
model = RiskAutoModel()
X = data_features.drop("default", axis=1)
y = data_features["default"]
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# 5. Score new applications
scorer = ScoringEngine(model.get_best_model())
new_application = {
    "income": 75000,
    "credit_history_years": 8,
    "debt_to_income": 0.25,
    "age": 35
}
result = scorer.score_single(new_application)

print(f"Credit Score: {result['score']}")
print(f"Rating: {result['rating']}")
print(f"Risk Level: {result['risk_level']}")
print(f"Reason Codes: {result['reason_codes']}")

Output:

Credit Score: 742
Rating: Very Good
Risk Level: Low
Reason Codes: [
    {'code': 'RC1', 'feature': 'credit_history_years', 'importance': 0.35},
    {'code': 'RC2', 'feature': 'debt_to_income', 'importance': 0.28},
    {'code': 'RC3', 'feature': 'income', 'importance': 0.22}
]

📖 Detailed Examples

Example 1: Credit Scoring Pipeline

from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine, RiskAutoModel, ScoringEngine

# Load credit application data
connector = RiskDataConnector()
data = connector.from_sql(
    connection_string="postgresql://user:pass@localhost/credit_db",
    query="SELECT * FROM applications WHERE created_date >= '2024-01-01'"
)

# Auto-clean
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="approved")

print(f"Cleaned {len(data_clean)} records")

# Feature engineering with WOE/IV
feature_engine = RiskFeatureEngine()

# Compute WOE/IV for key features
woe_df, iv = feature_engine.compute_woe_iv(data_clean, 'annual_income', 'approved', n_bins=10)
print(f"Information Value: {iv:.4f}")

# Auto-generate all features
data_features = feature_engine.auto_features(data_clean, target='approved')

# Train models
model = RiskAutoModel()
X = data_features.drop('approved', axis=1)
y = data_features['approved']

results = model.train_auto(
    X, y,
    algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'],
    metric='auc'
)

# Get best model
best_model = model.get_best_model()
print(f"Best model AUC: {model.best_score:.4f}")

# Calibrate for better probabilities
calibrated_model = model.calibrate_model(X, y, method='isotonic')

# Score new applications
scorer = ScoringEngine(calibrated_model)
new_apps = [
    {"annual_income": 50000, "debt_ratio": 0.35, "age": 28},
    {"annual_income": 120000, "debt_ratio": 0.15, "age": 42}
]

for app in new_apps:
    score = scorer.score_single(app)
    print(f"Score: {score['score']}, Rating: {score['rating']}")

Example 2: Fraud Detection

from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load transaction data from API
connector = RiskDataConnector()
transactions = connector.from_api(
    url="https://api.example.com/transactions",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    params={"days": 90}
)

# Create behavioral features
feature_engine = RiskFeatureEngine()
behavioral_features = feature_engine.behavioral_features(
    df=transactions,
    customer_id='customer_id',
    time_column='transaction_date',
    value_column='amount'
)

# Features include: recency, frequency, monetary, velocity
print(behavioral_features.head())

# Train fraud detection model
model = RiskAutoModel()
X = behavioral_features.drop('is_fraud', axis=1)
y = behavioral_features['is_fraud']

results = model.train_auto(X, y, algorithms=['rf', 'xgboost'])

Example 3: Churn Prediction

from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load customer data from Data Lake
connector = RiskDataConnector()
customers = connector.from_datalake(
    path="abfss://container@account.dfs.core.windows.net/customers/",
    storage_options={
        "account_name": "your_account",
        "account_key": "your_key"
    }
)

# Time-based features
feature_engine = RiskFeatureEngine()
customers_with_time = feature_engine.time_features(customers, 'last_activity_date')

# Transaction aggregations (building on the time features above)
customers_with_trans = feature_engine.transaction_features(
    customers_with_time,
    group_by='customer_id',
    agg_columns=['purchase_amount', 'login_count', 'support_tickets']
)

# Ratio features (e.g., support_tickets / login_count)
customers_final = feature_engine.ratio_features(
    customers_with_trans,
    numerator_cols=['support_tickets'],
    denominator_cols=['login_count']
)

# Train churn model
model = RiskAutoModel()
X = customers_final.drop('churned', axis=1)
y = customers_final['churned']

results = model.train_auto(X, y, algorithms=['lightgbm', 'xgboost'])

🔧 Advanced Features

Hyperparameter Optimization

from riskx import RiskAutoModel

model = RiskAutoModel()

# Optimize XGBoost hyperparameters with Optuna
best_params = model.optimize_hyperparameters(
    X_train, y_train,
    algorithm='xgboost',
    n_trials=50
)

print(f"Best parameters: {best_params}")

Ensemble Models

from riskx import RiskAutoModel

model = RiskAutoModel()

# Train multiple models
model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# Create voting ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Or stacking ensemble
stacked_ensemble = model.create_ensemble(X, y, method='stacking')

Batch Scoring

from riskx import ScoringEngine
import pandas as pd

scorer = ScoringEngine(model)

# Score thousands of applications at once
applications_df = pd.read_csv("new_applications.csv")
scored_df = scorer.score_batch(applications_df)

# Results include score, probability, rating, risk_level for each row
scored_df[['score', 'rating', 'risk_level']].head()

Custom Score Binning

from riskx import ScoringEngine

scorer = ScoringEngine(model)

# Custom score bins
custom_bins = {
    'Excellent': (750, 850),
    'Good': (650, 749),
    'Fair': (550, 649),
    'Poor': (300, 549)
}

scorer.set_custom_bins(custom_bins)

📊 API Reference

RiskDataConnector

Load data from multiple sources:

connector = RiskDataConnector()

# CSV files
data = connector.from_csv("data.csv")

# Excel files
data = connector.from_excel("data.xlsx", sheet_name="Sheet1")

# SQL databases
data = connector.from_sql("postgresql://localhost/db", "SELECT * FROM table")

# REST APIs
data = connector.from_api("https://api.example.com/data")

# JSON files
data = connector.from_json("data.json")

# Parquet files
data = connector.from_parquet("data.parquet")

# Cloud data lakes (Azure, AWS, GCP)
data = connector.from_datalake("s3://bucket/path/")

# Pandas DataFrame
data = connector.from_dataframe(df)

RiskCleaner

7 cleaning methods:

cleaner = RiskCleaner()

# Data quality profiling
profile = cleaner.profile(df)

# Missing value handling
df_clean = cleaner.clean_missing(df, strategy='auto')  # auto, mean, median, mode, forward, drop, fill

# Outlier handling
df_clean = cleaner.clean_outliers(df, method='iqr')  # iqr, zscore, clip

# Type validation
df_clean = cleaner.clean_types(df, type_map={'age': 'int', 'income': 'float'})

# Categorical encoding
df_encoded = cleaner.encode_categorical(df, columns=['category'], method='onehot')

# Feature scaling
df_scaled = cleaner.normalize(df, columns=['income', 'age'], method='standard')

# Duplicate removal
df_unique = cleaner.remove_duplicates(df)

# Full automated pipeline
df_clean = cleaner.auto_clean(df, target_column='default')

RiskFeatureEngine

Create risk-specific features:

engine = RiskFeatureEngine()

# WOE/IV calculation
woe_df, iv = engine.compute_woe_iv(df, 'income', 'default', n_bins=10)

# Optimal binning
df_binned = engine.auto_bin(df, 'age', n_bins=10, method='quantile')

# Behavioral features (RFM)
behavioral = engine.behavioral_features(df, 'customer_id', 'date', 'amount')

# Transaction aggregations
trans_features = engine.transaction_features(df, 'customer_id', ['amount', 'count'])

# Time features (11 extractions)
time_features = engine.time_features(df, 'transaction_date')

# Ratio features
ratio_features = engine.ratio_features(df, ['revenue'], ['cost'])

# Interaction features
interaction_features = engine.interaction_features(df, ['age', 'income'])

# Full automated pipeline
all_features = engine.auto_features(df, target='default')

RiskAutoModel

AutoML training:

model = RiskAutoModel()

# Train multiple algorithms
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'])

# Get best model
best = model.get_best_model()

# Calibrate model
calibrated = model.calibrate_model(X, y, method='isotonic')

# Create ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Hyperparameter optimization
best_params = model.optimize_hyperparameters(X, y, algorithm='xgboost', n_trials=50)

# Predictions
probs = model.predict_proba(X_test)

# Save/load
model.save_model("model.pkl")
model.load_model("model.pkl")

ScoringEngine

Production scoring:

scorer = ScoringEngine(model, score_min=300, score_max=850)

# Single prediction
result = scorer.score_single({'income': 50000, 'age': 30})
# Returns: {score, probability, rating, risk_level, reason_codes, timestamp}

# Batch scoring
df_scored = scorer.score_batch(df)

# Score interpretation
interpretation = scorer.interpret_score(720)
# Returns: {score, rating, risk_level, recommendation, approval_probability, suggested_interest_rate, percentile}

# Custom bins
scorer.set_custom_bins({'Excellent': (750, 850), 'Good': (650, 749)})

# API specification
api_spec = scorer.export_api_spec()

# Generate scorecard
scorecard = scorer.generate_scorecard(feature_weights)

# Simulate scores (for testing)
simulated = scorer.simulate_score_distribution(n_samples=10000)

🎓 Use Cases

✅ Credit Scoring

  • Personal loan approvals
  • Credit card applications
  • Mortgage underwriting
  • SME lending

✅ Fraud Detection

  • Transaction fraud
  • Identity fraud
  • Account takeover detection
  • Payment fraud

✅ Churn Prediction

  • Customer retention
  • Subscription cancellation risk
  • Product abandonment
  • Service discontinuation

✅ Risk Management

  • Portfolio risk assessment
  • Credit risk monitoring
  • Operational risk scoring
  • Compliance risk evaluation

๐Ÿ—๏ธ Architecture

RiskX Architecture
─────────────────

Data Sources → Data Connector → Data Cleaner → Feature Engine → AutoML → Scoring Engine → API/Batch Output
                    ↓               ↓              ↓             ↓           ↓
                  CSV/SQL       Profiling      WOE/IV      XGBoost    Real-time Score
                  Excel/API     Imputation   Behavioral   LightGBM     + Reason Codes
                  Parquet       Outliers      RFM         Ensemble     + Ratings
                  Cloud         Encoding      Time        Calibrated   + Risk Levels

📦 What's Included

✅ Core Modules (v0.1.0 - Production Ready)

  1. riskx.core.data_connector - Multi-source data loading (8+ sources)
  2. riskx.core.data_cleaner - Automated data cleaning (7 methods)
  3. riskx.core.feature_engineering - Risk features (WOE/IV, RFM, behavioral)
  4. riskx.core.model_auto - AutoML training (4 algorithms)
  5. riskx.core.scoring_engine - Production scoring (real-time + batch)

โณ Coming Soon

  1. riskx.core.monitoring - PSI, CSI, drift detection
  2. riskx.core.explainability - SHAP, LIME interpretability
  3. riskx.deployment - Cloud deployment (Azure, AWS, GCP)
  4. riskx.pipelines - End-to-end orchestration
  5. riskx.cli - Command-line interface
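
PSI (Population Stability Index), listed under the upcoming monitoring module, compares a feature's (or score's) current distribution to its development baseline: PSI = Σ (actual%_i − expected%_i) · ln(actual%_i / expected%_i) over bins fixed on the baseline's quantiles. A minimal NumPy sketch of that standard formula (independent of whatever API riskx.core.monitoring will expose):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline sample (expected)
    and a recent sample (actual), binned on the baseline's quantiles."""
    breakpoints = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    breakpoints[0], breakpoints[-1] = -np.inf, np.inf  # catch out-of-range values
    e_counts, _ = np.histogram(expected, bins=breakpoints)
    a_counts, _ = np.histogram(actual, bins=breakpoints)
    # Clip proportions so an empty bin cannot produce log(0)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A widely used rule of thumb treats PSI below 0.1 as stable, 0.1-0.25 as moderate drift worth watching, and above 0.25 as a major shift that usually triggers retraining.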

🔬 Technical Details

Dependencies

Core (required):

  • pandas >= 1.3.0
  • numpy >= 1.21.0
  • scikit-learn >= 1.0.0

Optional (recommended):

  • xgboost >= 1.5.0
  • lightgbm >= 3.3.0
  • optuna >= 2.10.0
  • shap >= 0.40.0
  • sqlalchemy >= 1.4.0
  • requests >= 2.26.0
  • pyarrow >= 6.0.0

Performance

  • Training: Optimized with multi-threading (n_jobs=-1)
  • Scoring: Real-time latency < 10ms
  • Batch Scoring: 10,000+ records/second
  • Memory: Efficient column-oriented storage

💼 Production Deployment

# Save trained model
model.save_model("production_model.pkl")

# Load in production
from riskx import RiskAutoModel, ScoringEngine

model = RiskAutoModel()
model.load_model("production_model.pkl")

scorer = ScoringEngine(model)

# API endpoint example (FastAPI)
from fastapi import FastAPI
app = FastAPI()

@app.post("/score")
def score_application(features: dict):
    result = scorer.score_single(features)
    return result

📜 License

MIT License - see LICENSE file for details


๐Ÿ‘จโ€๐Ÿ’ป Author

Idriss Bado
Email: idrissbadoolivier@gmail.com
GitHub: @idrissbado


๐Ÿ™ Acknowledgments

Built with โค๏ธ for the risk modeling and financial ML community.


📞 Support


Ready to revolutionize your risk scoring? Install RiskX today!

pip install riskx[full]
