
End-to-End Automated Risk Scoring Platform for Credit, Fraud, and Churn Prediction


🚀 RiskX - End-to-End Automated Risk Scoring Platform

v0.1.0 | Production-Ready Core | Credit • Fraud • Churn Risk Scoring

Python 3.8+ | License: MIT | PyPI version 0.1.0

RiskX is a comprehensive, production-ready platform for automated risk scoring. Built for financial institutions, fintech companies, and data scientists working on credit scoring, fraud detection, and customer churn prediction.


🎯 What is RiskX?

RiskX provides an end-to-end automated workflow for risk scoring:

  1. Data Loading - Multi-source data ingestion (CSV, Excel, SQL, APIs, Cloud)
  2. Data Cleaning - Automated quality checks and preprocessing
  3. Feature Engineering - Risk-specific features (WOE/IV, RFM, behavioral)
  4. ML Training - AutoML with multiple algorithms (LR, RF, XGBoost, LightGBM)
  5. Scoring - Real-time and batch scoring with interpretability
  6. Monitoring - Model performance and data drift detection (coming soon)

✨ Key Features

🔥 What's Working NOW (v0.1.0)

1. Multi-Source Data Loading

Load data from 8+ different sources:

  • CSV, Excel, JSON, Parquet files
  • SQL databases (via SQLAlchemy)
  • REST APIs
  • Cloud data lakes (Azure, AWS, GCP)
  • Pandas DataFrames

2. Automated Data Cleaning

7 powerful cleaning methods:

  • Missing value imputation (6 strategies)
  • Outlier detection and handling (IQR, Z-score, clipping)
  • Type validation and correction
  • Categorical encoding (label, one-hot)
  • Feature scaling (standard, min-max)
  • Duplicate removal
  • Full automated pipeline with auto_clean()

3. Risk-Specific Feature Engineering

Create 50+ features automatically:

  • WOE (Weight of Evidence) & IV (Information Value) (see the sketch after this list)
  • Optimal binning (quantile, uniform, kmeans)
  • RFM analysis (Recency, Frequency, Monetary)
  • Behavioral features from transactions
  • Time-based features (11 datetime extractions)
  • Ratio and interaction features
  • Full automated pipeline with auto_features()
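
For intuition, the textbook WOE/IV calculation looks like the sketch below. RiskX's built-in compute_woe_iv handles binning and edge cases for you; the quantile binning and smoothing constant here are illustrative assumptions, and sign conventions vary between implementations.

import numpy as np
import pandas as pd

def woe_iv(df, feature, target, n_bins=10):
    # Textbook WOE/IV on quantile bins; target == 1 marks the "bad" (event) class
    bins = pd.qcut(df[feature], q=n_bins, duplicates="drop")
    grouped = df.groupby(bins, observed=True)[target].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    dist_good = grouped["good"] / grouped["good"].sum()
    dist_bad = grouped["bad"] / grouped["bad"].sum()
    grouped["woe"] = np.log((dist_good + 1e-6) / (dist_bad + 1e-6))  # small constant avoids log(0)
    grouped["iv"] = (dist_good - dist_bad) * grouped["woe"]
    return grouped, grouped["iv"].sum()

As a common rule of thumb, an IV above roughly 0.1 is read as a medium-strength predictor and above 0.3 as strong.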

4. AutoML Training

Train and compare 4 algorithms:

  • Logistic Regression
  • Random Forest
  • XGBoost
  • LightGBM
  • Automatic best model selection
  • Model calibration (isotonic, sigmoid)
  • Ensemble methods (voting, stacking)
  • Hyperparameter optimization (Optuna)

5. Production-Ready Scoring

API-ready scoring engine:

  • Real-time single predictions
  • Batch scoring
  • Score range: 300-850 (configurable; see the mapping sketch after this list)
  • Risk ratings: Excellent, Very Good, Good, Fair, Poor
  • Reason codes for interpretability
  • Score interpretation and recommendations
  • API specification export

🚀 Quick Start

Installation

# Core installation (pandas, numpy, scikit-learn)
pip install riskx

# Full installation (includes XGBoost, LightGBM, Optuna, etc.)
pip install riskx[full]

# ML only (XGBoost, LightGBM, Optuna)
pip install riskx[ml]

# Data sources (SQL, APIs, Parquet, Excel)
pip install riskx[data]
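
Note: some shells (zsh in particular) treat square brackets as glob patterns, so quote the extras if the commands above fail:

pip install "riskx[full]"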

Basic Usage

from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine
from riskx import RiskAutoModel, ScoringEngine

# 1. Load data
connector = RiskDataConnector()
data = connector.from_csv("loan_applications.csv")

# 2. Clean data (automated)
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="default")

# 3. Engineer features (automated)
feature_engine = RiskFeatureEngine()
data_features = feature_engine.auto_features(data_clean, target="default")

# 4. Train models (AutoML)
model = RiskAutoModel()
X = data_features.drop("default", axis=1)
y = data_features["default"]
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# 5. Score new applications
scorer = ScoringEngine(model.get_best_model())
new_application = {
    "income": 75000,
    "credit_history_years": 8,
    "debt_to_income": 0.25,
    "age": 35
}
result = scorer.score_single(new_application)

print(f"Credit Score: {result['score']}")
print(f"Rating: {result['rating']}")
print(f"Risk Level: {result['risk_level']}")
print(f"Reason Codes: {result['reason_codes']}")

Output:

Credit Score: 742
Rating: Very Good
Risk Level: Low
Reason Codes: [
    {'code': 'RC1', 'feature': 'credit_history_years', 'importance': 0.35},
    {'code': 'RC2', 'feature': 'debt_to_income', 'importance': 0.28},
    {'code': 'RC3', 'feature': 'income', 'importance': 0.22}
]

📖 Detailed Examples

Example 1: Credit Scoring Pipeline

from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine, RiskAutoModel, ScoringEngine

# Load credit application data
connector = RiskDataConnector()
data = connector.from_sql(
    connection_string="postgresql://user:pass@localhost/credit_db",
    query="SELECT * FROM applications WHERE created_date >= '2024-01-01'"
)

# Auto-clean
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="approved")

print(f"Cleaned {len(data_clean)} records")

# Feature engineering with WOE/IV
feature_engine = RiskFeatureEngine()

# Compute WOE/IV for key features
woe_df, iv = feature_engine.compute_woe_iv(data_clean, 'annual_income', 'approved', n_bins=10)
print(f"Information Value: {iv:.4f}")

# Auto-generate all features
data_features = feature_engine.auto_features(data_clean, target='approved')

# Train models
model = RiskAutoModel()
X = data_features.drop('approved', axis=1)
y = data_features['approved']

results = model.train_auto(
    X, y,
    algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'],
    metric='auc'
)

# Get best model
best_model = model.get_best_model()
print(f"Best model AUC: {model.best_score:.4f}")

# Calibrate for better probabilities
calibrated_model = model.calibrate_model(X, y, method='isotonic')

# Score new applications
scorer = ScoringEngine(calibrated_model)
new_apps = [
    {"annual_income": 50000, "debt_ratio": 0.35, "age": 28},
    {"annual_income": 120000, "debt_ratio": 0.15, "age": 42}
]

for app in new_apps:
    score = scorer.score_single(app)
    print(f"Score: {score['score']}, Rating: {score['rating']}")

Example 2: Fraud Detection

from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load transaction data from API
connector = RiskDataConnector()
transactions = connector.from_api(
    url="https://api.example.com/transactions",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    params={"days": 90}
)

# Create behavioral features
feature_engine = RiskFeatureEngine()
behavioral_features = feature_engine.behavioral_features(
    df=transactions,
    customer_id='customer_id',
    time_column='transaction_date',
    value_column='amount'
)

# Features include: recency, frequency, monetary, velocity
print(behavioral_features.head())

# Train fraud detection model
model = RiskAutoModel()
X = behavioral_features.drop('is_fraud', axis=1)
y = behavioral_features['is_fraud']

results = model.train_auto(X, y, algorithms=['rf', 'xgboost'])
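
Fraud labels are usually heavily imbalanced, so it is worth checking ranking quality on a held-out split rather than on the training data. Here is a minimal sketch continuing from the X and y above, using scikit-learn (a core dependency) and the predict_proba method shown in the API reference; the two-column probability output is an assumption borrowed from the scikit-learn convention.

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stratified split keeps the rare fraud class represented in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RiskAutoModel()
model.train_auto(X_train, y_train, algorithms=['rf', 'xgboost'])

probs = model.predict_proba(X_test)
probs = probs[:, 1] if getattr(probs, "ndim", 1) == 2 else probs  # assumes sklearn-style output
print(f"Held-out AUC: {roc_auc_score(y_test, probs):.4f}")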

Example 3: Churn Prediction

from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel

# Load customer data from Data Lake
connector = RiskDataConnector()
customers = connector.from_datalake(
    path="abfss://container@account.dfs.core.windows.net/customers/",
    storage_options={
        "account_name": "your_account",
        "account_key": "your_key"
    }
)

# Time-based features
feature_engine = RiskFeatureEngine()
customers_with_time = feature_engine.time_features(customers, 'last_activity_date')

# Transaction aggregations
customers_with_trans = feature_engine.transaction_features(
    customers,
    group_by='customer_id',
    agg_columns=['purchase_amount', 'login_count', 'support_tickets']
)

# Ratio features (e.g., support_tickets / login_count)
customers_final = feature_engine.ratio_features(
    customers_with_trans,
    numerator_cols=['support_tickets'],
    denominator_cols=['login_count']
)

# Train churn model
model = RiskAutoModel()
X = customers_final.drop('churned', axis=1)
y = customers_final['churned']

results = model.train_auto(X, y, algorithms=['lightgbm', 'xgboost'])

🔧 Advanced Features

Hyperparameter Optimization

from riskx import RiskAutoModel

model = RiskAutoModel()

# Optimize XGBoost hyperparameters with Optuna
# (X_train, y_train: a training split of your features and labels, e.g. via sklearn's train_test_split)
best_params = model.optimize_hyperparameters(
    X_train, y_train,
    algorithm='xgboost',
    n_trials=50
)

print(f"Best parameters: {best_params}")

Ensemble Models

from riskx import RiskAutoModel

model = RiskAutoModel()

# Train multiple models
model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])

# Create voting ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Or stacking ensemble
stacked_ensemble = model.create_ensemble(X, y, method='stacking')

Batch Scoring

from riskx import ScoringEngine
import pandas as pd

scorer = ScoringEngine(model)

# Score thousands of applications at once
applications_df = pd.read_csv("new_applications.csv")
scored_df = scorer.score_batch(applications_df)

# Results include score, probability, rating, risk_level for each row
scored_df[['score', 'rating', 'risk_level']].head()

Custom Score Binning

from riskx import ScoringEngine

scorer = ScoringEngine(model)

# Custom score bins
custom_bins = {
    'Excellent': (750, 850),
    'Good': (650, 749),
    'Fair': (550, 649),
    'Poor': (300, 549)
}

scorer.set_custom_bins(custom_bins)

📊 API Reference

RiskDataConnector

Load data from multiple sources:

connector = RiskDataConnector()

# CSV files
data = connector.from_csv("data.csv")

# Excel files
data = connector.from_excel("data.xlsx", sheet_name="Sheet1")

# SQL databases
data = connector.from_sql("postgresql://localhost/db", "SELECT * FROM table")

# REST APIs
data = connector.from_api("https://api.example.com/data")

# JSON files
data = connector.from_json("data.json")

# Parquet files
data = connector.from_parquet("data.parquet")

# Cloud data lakes (Azure, AWS, GCP)
data = connector.from_datalake("s3://bucket/path/")

# Pandas DataFrame
data = connector.from_dataframe(df)

RiskCleaner

Data quality profiling plus 7 cleaning methods:

cleaner = RiskCleaner()

# Data quality profiling
profile = cleaner.profile(df)

# Missing value handling
df_clean = cleaner.clean_missing(df, strategy='auto')  # auto, mean, median, mode, forward, drop, fill

# Outlier handling
df_clean = cleaner.clean_outliers(df, method='iqr')  # iqr, zscore, clip

# Type validation
df_clean = cleaner.clean_types(df, type_map={'age': 'int', 'income': 'float'})

# Categorical encoding
df_encoded = cleaner.encode_categorical(df, columns=['category'], method='onehot')

# Feature scaling
df_scaled = cleaner.normalize(df, columns=['income', 'age'], method='standard')

# Duplicate removal
df_unique = cleaner.remove_duplicates(df)

# Full automated pipeline
df_clean = cleaner.auto_clean(df, target_column='default')

RiskFeatureEngine

Create risk-specific features:

engine = RiskFeatureEngine()

# WOE/IV calculation
woe_df, iv = engine.compute_woe_iv(df, 'income', 'default', n_bins=10)

# Optimal binning
df_binned = engine.auto_bin(df, 'age', n_bins=10, method='quantile')

# Behavioral features (RFM)
behavioral = engine.behavioral_features(df, 'customer_id', 'date', 'amount')

# Transaction aggregations
trans_features = engine.transaction_features(df, 'customer_id', ['amount', 'count'])

# Time features (11 extractions)
time_features = engine.time_features(df, 'transaction_date')

# Ratio features
ratio_features = engine.ratio_features(df, ['revenue'], ['cost'])

# Interaction features
interaction_features = engine.interaction_features(df, ['age', 'income'])

# Full automated pipeline
all_features = engine.auto_features(df, target='default')

RiskAutoModel

AutoML training:

model = RiskAutoModel()

# Train multiple algorithms
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'])

# Get best model
best = model.get_best_model()

# Calibrate model
calibrated = model.calibrate_model(X, y, method='isotonic')

# Create ensemble
ensemble = model.create_ensemble(X, y, method='voting')

# Hyperparameter optimization
best_params = model.optimize_hyperparameters(X, y, algorithm='xgboost', n_trials=50)

# Predictions
probs = model.predict_proba(X_test)

# Save/load
model.save_model("model.pkl")
model.load_model("model.pkl")

ScoringEngine

Production scoring:

scorer = ScoringEngine(model, score_min=300, score_max=850)

# Single prediction
result = scorer.score_single({'income': 50000, 'age': 30})
# Returns: {score, probability, rating, risk_level, reason_codes, timestamp}

# Batch scoring
df_scored = scorer.score_batch(df)

# Score interpretation
interpretation = scorer.interpret_score(720)
# Returns: {score, rating, risk_level, recommendation, approval_probability, suggested_interest_rate, percentile}

# Custom bins
scorer.set_custom_bins({'Excellent': (750, 850), 'Good': (650, 749)})

# API specification
api_spec = scorer.export_api_spec()

# Generate scorecard
scorecard = scorer.generate_scorecard(feature_weights)

# Simulate scores (for testing)
simulated = scorer.simulate_score_distribution(n_samples=10000)

🎓 Use Cases

✅ Credit Scoring

  • Personal loan approvals
  • Credit card applications
  • Mortgage underwriting
  • SME lending

✅ Fraud Detection

  • Transaction fraud
  • Identity fraud
  • Account takeover detection
  • Payment fraud

✅ Churn Prediction

  • Customer retention
  • Subscription cancellation risk
  • Product abandonment
  • Service discontinuation

✅ Risk Management

  • Portfolio risk assessment
  • Credit risk monitoring
  • Operational risk scoring
  • Compliance risk evaluation

๐Ÿ—๏ธ Architecture

RiskX Architecture
─────────────────

Data Sources → Data Connector → Data Cleaner → Feature Engine → AutoML → Scoring Engine → API/Batch Output
                    ↓               ↓              ↓             ↓           ↓
                  CSV/SQL       Profiling      WOE/IV      XGBoost    Real-time Score
                  Excel/API     Imputation   Behavioral   LightGBM     + Reason Codes
                  Parquet       Outliers      RFM         Ensemble     + Ratings
                  Cloud         Encoding      Time        Calibrated   + Risk Levels
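
The stages above compose naturally into a single reusable function. Here is a minimal sketch wired together from the APIs shown elsewhere in this README (the CSV path, the 'default' target column, and the algorithm list are placeholders):

from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine, RiskAutoModel, ScoringEngine

def build_scorer(train_csv, target="default"):
    # Data -> cleaning -> features -> AutoML -> scoring engine, end to end
    data = RiskDataConnector().from_csv(train_csv)
    data = RiskCleaner().auto_clean(data, target_column=target)
    data = RiskFeatureEngine().auto_features(data, target=target)
    model = RiskAutoModel()
    model.train_auto(data.drop(target, axis=1), data[target],
                     algorithms=['logistic', 'rf', 'xgboost'])
    return ScoringEngine(model.get_best_model())

# scorer = build_scorer("loan_applications.csv")
# scored = scorer.score_batch(new_applications_df)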

📦 What's Included

✅ Core Modules (v0.1.0 - Production Ready)

  1. riskx.core.data_connector - Multi-source data loading (8+ sources)
  2. riskx.core.data_cleaner - Automated data cleaning (7 methods)
  3. riskx.core.feature_engineering - Risk features (WOE/IV, RFM, behavioral)
  4. riskx.core.model_auto - AutoML training (4 algorithms)
  5. riskx.core.scoring_engine - Production scoring (real-time + batch)

โณ Coming Soon

  1. riskx.core.monitoring - PSI, CSI, drift detection
  2. riskx.core.explainability - SHAP, LIME interpretability
  3. riskx.deployment - Cloud deployment (Azure, AWS, GCP)
  4. riskx.pipelines - End-to-end orchestration
  5. riskx.cli - Command-line interface

🔬 Technical Details

Dependencies

Core (required):

  • pandas >= 1.3.0
  • numpy >= 1.21.0
  • scikit-learn >= 1.0.0

Optional (recommended):

  • xgboost >= 1.5.0
  • lightgbm >= 3.3.0
  • optuna >= 2.10.0
  • shap >= 0.40.0
  • sqlalchemy >= 1.4.0
  • requests >= 2.26.0
  • pyarrow >= 6.0.0

Performance

  • Training: Optimized with multi-threading (n_jobs=-1)
  • Scoring: Real-time latency < 10ms (see the timing sketch after this list)
  • Batch Scoring: 10,000+ records/second
  • Memory: Efficient column-oriented storage
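
These figures depend heavily on the model and hardware, so treat them as indicative. A quick way to measure both numbers for your own model (assuming a trained RiskAutoModel from the earlier examples; the payload fields and new_applications.csv are placeholders):

import time
import pandas as pd
from riskx import ScoringEngine

scorer = ScoringEngine(model)  # 'model' is a trained RiskAutoModel

# Single-prediction latency
payload = {"income": 75000, "credit_history_years": 8, "debt_to_income": 0.25, "age": 35}
n = 1000
start = time.perf_counter()
for _ in range(n):
    scorer.score_single(payload)
print(f"Avg latency: {(time.perf_counter() - start) / n * 1000:.2f} ms per call")

# Batch throughput
applications_df = pd.read_csv("new_applications.csv")
start = time.perf_counter()
scored_df = scorer.score_batch(applications_df)
print(f"Throughput: {len(scored_df) / (time.perf_counter() - start):,.0f} rows/second")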

💼 Production Deployment

# Save trained model
model.save_model("production_model.pkl")

# Load in production
from riskx import RiskAutoModel, ScoringEngine

model = RiskAutoModel()
model.load_model("production_model.pkl")

scorer = ScoringEngine(model)

# API endpoint example (FastAPI)
from fastapi import FastAPI
app = FastAPI()

@app.post("/score")
def score_application(features: dict):
    result = scorer.score_single(features)
    return result
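
To actually serve that endpoint, save the snippet above as, say, main.py and run it with an ASGI server such as uvicorn (fastapi and uvicorn are not RiskX dependencies, so install them separately):

pip install fastapi uvicorn
uvicorn main:app --host 0.0.0.0 --port 8000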

📜 License

MIT License - see LICENSE file for details


👨‍💻 Author

Idriss Bado
Email: idrissbadoolivier@gmail.com
GitHub: @idrissbado


๐Ÿ™ Acknowledgments

Built with ❤️ for the risk modeling and financial ML community.


📞 Support


Ready to revolutionize your risk scoring? Install RiskX today!

pip install riskx[full]

