End-to-End Automated Risk Scoring Platform for Credit, Fraud, and Churn Prediction
RiskX - End-to-End Automated Risk Scoring Platform
v0.1.0 | Production-Ready Core | Credit • Fraud • Churn Risk Scoring
RiskX is a comprehensive, production-ready platform for automated risk scoring. Built for financial institutions, fintech companies, and data scientists working on credit scoring, fraud detection, and customer churn prediction.
What is RiskX?
RiskX provides an end-to-end automated workflow for risk scoring:
- Data Loading - Multi-source data ingestion (CSV, Excel, SQL, APIs, Cloud)
- Data Cleaning - Automated quality checks and preprocessing
- Feature Engineering - Risk-specific features (WOE/IV, RFM, behavioral)
- ML Training - AutoML with multiple algorithms (LR, RF, XGBoost, LightGBM)
- Scoring - Real-time and batch scoring with interpretability
- Monitoring - Model performance and data drift detection (coming soon)
Key Features
What's Working NOW (v0.1.0)
1. Multi-Source Data Loading
Load data from 8+ different sources:
- CSV, Excel, JSON, Parquet files
- SQL databases (via SQLAlchemy)
- REST APIs
- Cloud data lakes (Azure, AWS, GCP)
- Pandas DataFrames
2. Automated Data Cleaning
7 powerful cleaning methods:
- Missing value imputation (6 strategies)
- Outlier detection and handling (IQR, Z-score, clipping)
- Type validation and correction
- Categorical encoding (label, one-hot)
- Feature scaling (standard, min-max)
- Duplicate removal
- Full automated pipeline with auto_clean()
3. Risk-Specific Feature Engineering
Create 50+ features automatically:
- WOE (Weight of Evidence) & IV (Information Value); standard definitions are sketched after this list
- Optimal binning (quantile, uniform, kmeans)
- RFM analysis (Recency, Frequency, Monetary)
- Behavioral features from transactions
- Time-based features (11 datetime extractions)
- Ratio and interaction features
- Full automated pipeline with auto_features()
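For reference, here is a minimal sketch of the standard WOE/IV definitions in plain pandas. This is the textbook formulation, not RiskX's internal implementation; the helper name woe_iv is hypothetical, and the target is assumed binary with 1 marking the bad/event class.
import numpy as np
import pandas as pd

def woe_iv(feature, target, n_bins=10):
    bins = pd.qcut(feature, q=n_bins, duplicates="drop")
    grouped = pd.DataFrame({"bin": bins, "bad": target}).groupby("bin", observed=True)["bad"]
    bad = grouped.sum()
    good = grouped.count() - bad
    pct_good = (good / good.sum()).clip(lower=1e-6)   # clip avoids log(0) on empty bins
    pct_bad = (bad / bad.sum()).clip(lower=1e-6)
    woe = np.log(pct_good / pct_bad)                  # WOE_i = ln(%good_i / %bad_i)
    iv = float(((pct_good - pct_bad) * woe).sum())    # IV = sum((%good_i - %bad_i) * WOE_i)
    # Common rule of thumb: IV < 0.02 unpredictive, 0.02-0.1 weak, 0.1-0.3 medium, > 0.3 strong
    return pd.DataFrame({"woe": woe}), iv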
4. AutoML Training
Train and compare 4 algorithms:
- Logistic Regression
- Random Forest
- XGBoost
- LightGBM
- Automatic best model selection (see the sketch after this list)
- Model calibration (isotonic, sigmoid)
- Ensemble methods (voting, stacking)
- Hyperparameter optimization (Optuna)
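Conceptually, automatic best model selection boils down to a cross-validated metric comparison. A minimal sketch with scikit-learn, given a feature matrix X and binary target y (illustrative only; RiskX's internal selection logic is not shown here):
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare candidates by cross-validated AUC and keep the top scorer
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, n_jobs=-1),
}
cv_auc = {name: cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
          for name, est in candidates.items()}
best_name = max(cv_auc, key=cv_auc.get)
print(best_name, round(cv_auc[best_name], 4))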
5. Production-Ready Scoring
API-ready scoring engine:
- Real-time single predictions
- Batch scoring
- Score range: 300-850 (configurable; see the scaling sketch after this list)
- Risk ratings: Excellent, Very Good, Good, Fair, Poor
- Reason codes for interpretability
- Score interpretation and recommendations
- API specification export
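How a default probability becomes a 300-850 score is configurable; one common convention is scaled log-odds (points-to-double-odds). The sketch below illustrates that convention only; the base_score, base_odds, and pdo values are illustrative assumptions, not RiskX's actual parameters.
import numpy as np

def probability_to_score(p_default, base_score=600, base_odds=50, pdo=20,
                         score_min=300, score_max=850):
    # A good:bad odds ratio of base_odds maps to base_score, and every
    # doubling of the odds adds pdo points; the result is clipped to the range.
    odds = (1 - p_default) / max(p_default, 1e-9)
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return float(np.clip(offset + factor * np.log(odds), score_min, score_max))

probability_to_score(0.02)   # ~599 with these illustrative parameters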
Quick Start
Installation
# Core installation (pandas, numpy, scikit-learn)
pip install riskx
# Full installation (includes XGBoost, LightGBM, Optuna, etc.)
pip install riskx[full]
# ML only (XGBoost, LightGBM, Optuna)
pip install riskx[ml]
# Data sources (SQL, APIs, Parquet, Excel)
pip install riskx[data]
Basic Usage
from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine
from riskx import RiskAutoModel, ScoringEngine
# 1. Load data
connector = RiskDataConnector()
data = connector.from_csv("loan_applications.csv")
# 2. Clean data (automated)
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="default")
# 3. Engineer features (automated)
feature_engine = RiskFeatureEngine()
data_features = feature_engine.auto_features(data_clean, target="default")
# 4. Train models (AutoML)
model = RiskAutoModel()
X = data_features.drop("default", axis=1)
y = data_features["default"]
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])
# 5. Score new applications
scorer = ScoringEngine(model.get_best_model())
new_application = {
"income": 75000,
"credit_history_years": 8,
"debt_to_income": 0.25,
"age": 35
}
result = scorer.score_single(new_application)
print(f"Credit Score: {result['score']}")
print(f"Rating: {result['rating']}")
print(f"Risk Level: {result['risk_level']}")
print(f"Reason Codes: {result['reason_codes']}")
Output:
Credit Score: 742
Rating: Very Good
Risk Level: Low
Reason Codes: [
{'code': 'RC1', 'feature': 'credit_history_years', 'importance': 0.35},
{'code': 'RC2', 'feature': 'debt_to_income', 'importance': 0.28},
{'code': 'RC3', 'feature': 'income', 'importance': 0.22}
]
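The reason codes rank the features that most influenced the score. One common way to derive such codes is to take the top-k features by importance, as in the hedged sketch below; this assumes the selected model exposes scikit-learn-style feature_importances_ and is not necessarily how RiskX computes them.
best = model.get_best_model()
importances = dict(zip(X.columns, best.feature_importances_))   # tree-based models only
top3 = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:3]
reason_codes = [{"code": f"RC{i + 1}", "feature": name, "importance": round(val, 2)}
                for i, (name, val) in enumerate(top3)]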
Detailed Examples
Example 1: Credit Scoring Pipeline
from riskx import RiskDataConnector, RiskCleaner, RiskFeatureEngine, RiskAutoModel, ScoringEngine
# Load credit application data
connector = RiskDataConnector()
data = connector.from_sql(
connection_string="postgresql://user:pass@localhost/credit_db",
query="SELECT * FROM applications WHERE created_date >= '2024-01-01'"
)
# Auto-clean
cleaner = RiskCleaner()
data_clean = cleaner.auto_clean(data, target_column="approved")
print(f"Cleaned {len(data_clean)} records")
# Feature engineering with WOE/IV
feature_engine = RiskFeatureEngine()
# Compute WOE/IV for key features
woe_df, iv = feature_engine.compute_woe_iv(data_clean, 'annual_income', 'approved', n_bins=10)
print(f"Information Value: {iv:.4f}")
# Auto-generate all features
data_features = feature_engine.auto_features(data_clean, target='approved')
# Train models
model = RiskAutoModel()
X = data_features.drop('approved', axis=1)
y = data_features['approved']
results = model.train_auto(
X, y,
algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'],
metric='auc'
)
# Get best model
best_model = model.get_best_model()
print(f"Best model AUC: {model.best_score:.4f}")
# Calibrate for better probabilities
calibrated_model = model.calibrate_model(X, y, method='isotonic')
# Score new applications
scorer = ScoringEngine(calibrated_model)
new_apps = [
{"annual_income": 50000, "debt_ratio": 0.35, "age": 28},
{"annual_income": 120000, "debt_ratio": 0.15, "age": 42}
]
for app in new_apps:
score = scorer.score_single(app)
print(f"Score: {score['score']}, Rating: {score['rating']}")
Example 2: Fraud Detection
from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel
# Load transaction data from API
connector = RiskDataConnector()
transactions = connector.from_api(
url="https://api.example.com/transactions",
headers={"Authorization": "Bearer YOUR_TOKEN"},
params={"days": 90}
)
# Create behavioral features
feature_engine = RiskFeatureEngine()
behavioral_features = feature_engine.behavioral_features(
df=transactions,
customer_id='customer_id',
time_column='transaction_date',
value_column='amount'
)
# Features include: recency, frequency, monetary, velocity
print(behavioral_features.head())
# Train fraud detection model
model = RiskAutoModel()
X = behavioral_features.drop('is_fraud', axis=1)
y = behavioral_features['is_fraud']
results = model.train_auto(X, y, algorithms=['rf', 'xgboost'])
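For intuition, the RFM core of the behavioral features above can be approximated in plain pandas as in the sketch below (illustrative only; riskx also reports velocity-style features, and for training the is_fraud label would need to be joined back onto this per-customer table).
import pandas as pd

transactions["transaction_date"] = pd.to_datetime(transactions["transaction_date"])
snapshot = transactions["transaction_date"].max()
rfm = (transactions
       .groupby("customer_id")
       .agg(recency_days=("transaction_date", lambda d: (snapshot - d.max()).days),
            frequency=("transaction_date", "count"),
            monetary=("amount", "sum")))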
Example 3: Churn Prediction
from riskx import RiskDataConnector, RiskFeatureEngine, RiskAutoModel
# Load customer data from Data Lake
connector = RiskDataConnector()
customers = connector.from_datalake(
path="abfss://container@account.dfs.core.windows.net/customers/",
storage_options={
"account_name": "your_account",
"account_key": "your_key"
}
)
# Time-based features
feature_engine = RiskFeatureEngine()
customers_with_time = feature_engine.time_features(customers, 'last_activity_date')
# Transaction aggregations
customers_with_trans = feature_engine.transaction_features(
customers_with_time,
group_by='customer_id',
agg_columns=['purchase_amount', 'login_count', 'support_tickets']
)
# Ratio features (e.g., support_tickets / login_count)
customers_final = feature_engine.ratio_features(
customers_with_trans,
numerator_cols=['support_tickets'],
denominator_cols=['login_count']
)
# Train churn model
model = RiskAutoModel()
X = customers_final.drop('churned', axis=1)
y = customers_final['churned']
results = model.train_auto(X, y, algorithms=['lightgbm', 'xgboost'])
Advanced Features
Hyperparameter Optimization
from riskx import RiskAutoModel
model = RiskAutoModel()
# Optimize XGBoost hyperparameters with Optuna
best_params = model.optimize_hyperparameters(
X_train, y_train,
algorithm='xgboost',
n_trials=50
)
print(f"Best parameters: {best_params}")
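Under the hood this wraps an Optuna study: roughly, each trial samples a parameter set and scores it (here by cross-validated AUC). The sketch below shows the general pattern; the search space is illustrative, not the one riskx actually uses.
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    return cross_val_score(XGBClassifier(**params), X_train, y_train,
                           cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)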
Ensemble Models
from riskx import RiskAutoModel
model = RiskAutoModel()
# Train multiple models
model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost'])
# Create voting ensemble
ensemble = model.create_ensemble(X, y, method='voting')
# Or stacking ensemble
stacked_ensemble = model.create_ensemble(X, y, method='stacking')
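For intuition (not RiskX internals): a voting ensemble averages the base models' predicted probabilities, while stacking fits a meta-model on top of them. In scikit-learn terms:
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

base = [("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200))]
voting = VotingClassifier(estimators=base, voting="soft")        # average probabilities
stacking = StackingClassifier(estimators=base,                   # meta-model on base outputs
                              final_estimator=LogisticRegression())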
Batch Scoring
from riskx import ScoringEngine
import pandas as pd
scorer = ScoringEngine(model)
# Score thousands of applications at once
applications_df = pd.read_csv("new_applications.csv")
scored_df = scorer.score_batch(applications_df)
# Results include score, probability, rating, risk_level for each row
scored_df[['score', 'rating', 'risk_level']].head()
Custom Score Binning
from riskx import ScoringEngine
scorer = ScoringEngine(model)
# Custom score bins
custom_bins = {
'Excellent': (750, 850),
'Good': (650, 749),
'Fair': (550, 649),
'Poor': (300, 549)
}
scorer.set_custom_bins(custom_bins)
API Reference
RiskDataConnector
Load data from multiple sources:
connector = RiskDataConnector()
# CSV files
data = connector.from_csv("data.csv")
# Excel files
data = connector.from_excel("data.xlsx", sheet_name="Sheet1")
# SQL databases
data = connector.from_sql("postgresql://localhost/db", "SELECT * FROM table")
# REST APIs
data = connector.from_api("https://api.example.com/data")
# JSON files
data = connector.from_json("data.json")
# Parquet files
data = connector.from_parquet("data.parquet")
# Cloud data lakes (Azure, AWS, GCP)
data = connector.from_datalake("s3://bucket/path/")
# Pandas DataFrame
data = connector.from_dataframe(df)
RiskCleaner
7 cleaning methods:
cleaner = RiskCleaner()
# Data quality profiling
profile = cleaner.profile(df)
# Missing value handling
df_clean = cleaner.clean_missing(df, strategy='auto') # auto, mean, median, mode, forward, drop, fill
# Outlier handling
df_clean = cleaner.clean_outliers(df, method='iqr') # iqr, zscore, clip
# Type validation
df_clean = cleaner.clean_types(df, type_map={'age': 'int', 'income': 'float'})
# Categorical encoding
df_encoded = cleaner.encode_categorical(df, columns=['category'], method='onehot')
# Feature scaling
df_scaled = cleaner.normalize(df, columns=['income', 'age'], method='standard')
# Duplicate removal
df_unique = cleaner.remove_duplicates(df)
# Full automated pipeline
df_clean = cleaner.auto_clean(df, target_column='default')
RiskFeatureEngine
Create risk-specific features:
engine = RiskFeatureEngine()
# WOE/IV calculation
woe_df, iv = engine.compute_woe_iv(df, 'income', 'default', n_bins=10)
# Optimal binning
df_binned = engine.auto_bin(df, 'age', n_bins=10, method='quantile')
# Behavioral features (RFM)
behavioral = engine.behavioral_features(df, 'customer_id', 'date', 'amount')
# Transaction aggregations
trans_features = engine.transaction_features(df, 'customer_id', ['amount', 'count'])
# Time features (11 extractions)
time_features = engine.time_features(df, 'transaction_date')
# Ratio features
ratio_features = engine.ratio_features(df, ['revenue'], ['cost'])
# Interaction features
interaction_features = engine.interaction_features(df, ['age', 'income'])
# Full automated pipeline
all_features = engine.auto_features(df, target='default')
RiskAutoModel
AutoML training:
model = RiskAutoModel()
# Train multiple algorithms
results = model.train_auto(X, y, algorithms=['logistic', 'rf', 'xgboost', 'lightgbm'])
# Get best model
best = model.get_best_model()
# Calibrate model
calibrated = model.calibrate_model(X, y, method='isotonic')
# Create ensemble
ensemble = model.create_ensemble(X, y, method='voting')
# Hyperparameter optimization
best_params = model.optimize_hyperparameters(X, y, algorithm='xgboost', n_trials=50)
# Predictions
probs = model.predict_proba(X_test)
# Save/load
model.save_model("model.pkl")
model.load_model("model.pkl")
ScoringEngine
Production scoring:
scorer = ScoringEngine(model, score_min=300, score_max=850)
# Single prediction
result = scorer.score_single({'income': 50000, 'age': 30})
# Returns: {score, probability, rating, risk_level, reason_codes, timestamp}
# Batch scoring
df_scored = scorer.score_batch(df)
# Score interpretation
interpretation = scorer.interpret_score(720)
# Returns: {score, rating, risk_level, recommendation, approval_probability, suggested_interest_rate, percentile}
# Custom bins
scorer.set_custom_bins({'Excellent': (750, 850), 'Good': (650, 749)})
# API specification
api_spec = scorer.export_api_spec()
# Generate scorecard
scorecard = scorer.generate_scorecard(feature_weights)
# Simulate scores (for testing)
simulated = scorer.simulate_score_distribution(n_samples=10000)
Use Cases
Credit Scoring
- Personal loan approvals
- Credit card applications
- Mortgage underwriting
- SME lending
Fraud Detection
- Transaction fraud
- Identity fraud
- Account takeover detection
- Payment fraud
Churn Prediction
- Customer retention
- Subscription cancellation risk
- Product abandonment
- Service discontinuation
Risk Management
- Portfolio risk assessment
- Credit risk monitoring
- Operational risk scoring
- Compliance risk evaluation
Architecture

Data Sources → Data Connector → Data Cleaner → Feature Engine → AutoML → Scoring Engine → API/Batch Output

- Data Sources: CSV/SQL, Excel/API, Parquet, Cloud
- Data Cleaner: Profiling, Imputation, Outliers, Encoding
- Feature Engine: WOE/IV, Behavioral, RFM, Time
- AutoML: XGBoost, LightGBM, Ensemble, Calibrated
- Scoring Engine: Real-time Score + Reason Codes + Ratings + Risk Levels
What's Included
Core Modules (v0.1.0 - Production Ready)
- riskx.core.data_connector - Multi-source data loading (8+ sources)
- riskx.core.data_cleaner - Automated data cleaning (7 methods)
- riskx.core.feature_engineering - Risk features (WOE/IV, RFM, behavioral)
- riskx.core.model_auto - AutoML training (4 algorithms)
- riskx.core.scoring_engine - Production scoring (real-time + batch)
Coming Soon
- riskx.core.monitoring - PSI, CSI, drift detection
- riskx.core.explainability - SHAP, LIME interpretability
- riskx.deployment - Cloud deployment (Azure, AWS, GCP)
- riskx.pipelines - End-to-end orchestration
- riskx.cli - Command-line interface
Technical Details
Dependencies
Core (required):
- pandas >= 1.3.0
- numpy >= 1.21.0
- scikit-learn >= 1.0.0
Optional (recommended):
- xgboost >= 1.5.0
- lightgbm >= 3.3.0
- optuna >= 2.10.0
- shap >= 0.40.0
- sqlalchemy >= 1.4.0
- requests >= 2.26.0
- pyarrow >= 6.0.0
Performance
- Training: Optimized with multi-threading (n_jobs=-1)
- Scoring: Real-time latency < 10ms
- Batch Scoring: 10,000+ records/second
- Memory: Efficient column-oriented storage
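These figures depend on hardware, model choice, and feature count; a quick sanity check on your own data looks like the sketch below (the CSV name is a placeholder, and scorer is a ScoringEngine built as shown above).
import time
import pandas as pd

applications_df = pd.read_csv("new_applications.csv")
start = time.perf_counter()
scored_df = scorer.score_batch(applications_df)
elapsed = time.perf_counter() - start
print(f"{len(scored_df) / elapsed:,.0f} records/second")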
Production Deployment
# Save trained model
model.save_model("production_model.pkl")
# Load in production
from riskx import RiskAutoModel, ScoringEngine
model = RiskAutoModel()
model.load_model("production_model.pkl")
scorer = ScoringEngine(model)
# API endpoint example (FastAPI)
from fastapi import FastAPI
app = FastAPI()
@app.post("/score")
def score_application(features: dict):
    result = scorer.score_single(features)
    return result
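To serve and call this endpoint, a typical setup assumes the file above is saved as main.py and that an ASGI server such as uvicorn is installed (neither ships with riskx):
# Start the API
uvicorn main:app --host 0.0.0.0 --port 8000
# Call it with a sample application
curl -X POST http://localhost:8000/score \
     -H "Content-Type: application/json" \
     -d '{"income": 75000, "credit_history_years": 8, "debt_to_income": 0.25, "age": 35}'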
License
MIT License - see LICENSE file for details
Author
Idriss Bado
Email: idrissbadoolivier@gmail.com
GitHub: @idrissbado
Acknowledgments
Built with ❤️ for the risk modeling and financial ML community.
Support
- Documentation: GitHub README
- Issues: GitHub Issues
- PyPI: https://pypi.org/project/riskx/
Ready to revolutionize your risk scoring? Install RiskX today!
pip install riskx[full]
Download files
File details
Details for the file riskx-0.1.0.tar.gz.
- Download URL: riskx-0.1.0.tar.gz
- Upload date:
- Size: 43.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | e1b780663160229be5d9ee015c7b67b6aff2f373f0354ab0f8c4edcda116bdc4 |
| MD5 | 70fc3e6baf57869dde4672c6f1c72903 |
| BLAKE2b-256 | 000f2da28e359f02fa90eb9887bfd7d48193f39803707f4f2d6dafcbc7b8dfe1 |
File details
Details for the file riskx-0.1.0-py3-none-any.whl.
- Download URL: riskx-0.1.0-py3-none-any.whl
- Upload date:
- Size: 38.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 36d16b2512793e60c39fe61abd6e3fb625514905b34fc4357ab760a9d533b3f0 |
| MD5 | 9a0338e530e05ef4c95e5bacecdc241c |
| BLAKE2b-256 | af173341c3d42f1a6637ca9d77c33f5c6dab8c3bf6daefe198f8ca291c24b54f |