Extended Sklearn Metrics
A comprehensive, production-ready evaluation library for scikit-learn models with advanced metrics, ROC/AUC analysis, feature importance, fairness evaluation, and professional visualizations. Designed for data scientists and ML engineers who need thorough model evaluation beyond basic accuracy scores.
Table of Contents
- Features
- Installation
- Quick Start
- Detailed Usage Guide
- API Reference
- Advanced Examples
- Architecture
- Dependencies
- Contributing
- Version History
- License
Features
Core Capabilities
Extended Sklearn Metrics provides a complete evaluation toolkit that goes far beyond sklearn's built-in metrics:
- Cross-Validation Evaluation: Robust model assessment with configurable CV folds and custom performance thresholds
- ROC/AUC Analysis: Comprehensive receiver operating characteristic analysis with threshold optimization
- Multi-Class Support: Native support for binary and multi-class classification problems
- Comprehensive Model Evaluation: Hold-out test evaluation with cross-validation stability analysis
- Feature Importance: Dual-method importance analysis (built-in model importance + permutation importance)
- Fairness Evaluation: Assess model fairness across demographic groups and protected attributes
- Residual Diagnostics: Statistical tests and visualizations for regression model residuals
- Professional Visualizations: Publication-ready plots and comprehensive evaluation dashboards
ROC/AUC Analysis
The library provides state-of-the-art ROC curve analysis capabilities:
- ROC Curve Calculation: Complete ROC curve with FPR, TPR, and thresholds for every point
- Precision-Recall Curves: PR curves with AUC-PR metrics for imbalanced datasets
- Threshold Optimization: Multiple methods for finding optimal classification thresholds:
- Youden's Index (maximizes TPR - FPR)
- F1-optimal threshold
- Balanced accuracy optimization
- Distance to perfect classifier
- Multi-Class ROC: One-vs-rest ROC analysis with macro and micro averaging
- Interactive Analysis: Threshold analysis plots showing performance trade-offs
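For intuition, the quantity behind Youden's Index can be computed directly from scikit-learn's roc_curve. The sketch below uses toy data and plain scikit-learn to illustrate what a threshold optimizer of this kind maximizes; it is not the library's internal implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=500, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Youden's index J = TPR - FPR; the optimal threshold maximizes J
fpr, tpr, thresholds = roc_curve(y, proba)
j = tpr - fpr
best = np.argmax(j)
print(f"Youden-optimal threshold: {thresholds[best]:.3f} (J = {j[best]:.3f})")
```

The same fpr/tpr/thresholds arrays also drive the other optimization criteria, each just scoring the candidate thresholds with a different objective.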
Comprehensive Evaluation Framework
The final_model_evaluation function provides enterprise-grade model assessment:
- Hold-Out Testing: Unbiased test set evaluation with detailed metrics
- Cross-Validation Stability: Assess model consistency across different data splits
- Feature Importance:
- Built-in model importance (for tree-based models)
- Permutation importance (model-agnostic)
- Ranked importance with statistical significance
- Model Interpretation: Complexity assessment and interpretability metrics
- Error Analysis:
- Error patterns and correlations
- Residual diagnostics for regression
- Confusion matrices for classification
- Fairness Evaluation:
- Performance comparison across demographic groups
- Disparate impact analysis
- Bias detection in predictions
- Actionable Insights: Automated recommendations based on evaluation results
Advanced Visualizations
All visualization functions are modular, customizable, and production-ready:
- Performance Plots: Bar charts, radar charts, and comparison plots
- ROC Visualizations: ROC curves with confidence intervals and optimal threshold markers
- Precision-Recall Plots: PR curves with F1-optimal threshold highlights
- Multi-Class ROC: Overlaid ROC curves for all classes with macro/micro averages
- Feature Importance Charts: Horizontal bar charts with error bars
- Fairness Comparisons: Side-by-side performance metrics across groups
- Comprehensive Dashboards: Multi-panel evaluation reports with all key metrics
- Residual Diagnostics: Q-Q plots, residual vs fitted, and scale-location plots
Installation
Install via pip:
pip install extended-sklearn-metrics
For development installation:
git clone https://github.com/SubaashNair/extended-sklearn-metrics.git
cd extended-sklearn-metrics
pip install -e .
Quick Start
Basic Classification Evaluation
This example demonstrates basic classification model evaluation with cross-validation:
from extended_sklearn_metrics import evaluate_classification_model_with_cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate synthetic classification data
# n_samples=1000: Dataset with 1000 samples
# n_features=10: 10 features per sample
# random_state=42: Reproducible results
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train a Random Forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Evaluate with 5-fold cross-validation
# This returns a DataFrame with detailed metrics
results = evaluate_classification_model_with_cross_validation(
model,
X_train,
y_train,
cv=5
)
# Display the evaluation results
print(results)
# Output includes: Accuracy, Precision, Recall, F1-Score, ROC AUC
# Each metric includes: value, threshold interpretation, and performance category
Understanding the Output:
The results DataFrame contains:
- Metric: The name of the evaluation metric
- Value: The numerical score (0-1 for most metrics)
- Threshold: Performance category boundaries (Excellent/Good/Acceptable/Poor)
- Calculation: How the metric was computed
- Performance: Automatic categorization of model performance
ROC/AUC Analysis with Threshold Optimization
This example shows how to perform comprehensive ROC analysis and find optimal decision thresholds:
from extended_sklearn_metrics import (
calculate_roc_metrics,
create_roc_curve_plot,
find_optimal_thresholds,
print_roc_auc_summary
)
from sklearn.linear_model import LogisticRegression
# Train a logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Get predicted probabilities for the positive class
# These probabilities are used for ROC analysis
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate comprehensive ROC metrics
# Returns: FPR, TPR, thresholds, AUC score, optimal threshold
roc_results = calculate_roc_metrics(y_test, y_pred_proba)
# Print a detailed ROC/AUC summary report
print_roc_auc_summary(roc_results)
# Find optimal decision thresholds using multiple methods
optimal_thresholds = find_optimal_thresholds(y_test, y_pred_proba)
print("\nOptimal Thresholds for Different Objectives:")
for method, threshold in optimal_thresholds.items():
print(f" {method}: {threshold:.4f}")
# Create publication-ready ROC curve plot
create_roc_curve_plot(roc_results, title="Logistic Regression ROC Curve")
Why Multiple Thresholds?
Different business objectives require different thresholds:
- Youden's Index: Balanced TPR and FPR - good for balanced datasets
- F1-Optimal: Maximizes F1 score - good for imbalanced data
- Balanced Accuracy: Equal weight to sensitivity and specificity
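As a concrete illustration of the F1-optimal criterion, the threshold can be recovered from scikit-learn's precision_recall_curve; this is a minimal sketch on synthetic imbalanced data, not the library's own code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Imbalanced toy problem (80/20 split), for illustration only
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
# F1 at each threshold; clip the denominator to avoid 0/0 at the endpoint
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])  # the final PR point has no associated threshold
print(f"F1-optimal threshold: {thresholds[best]:.3f} (F1 = {f1[best]:.3f})")
```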
Multi-Class ROC Analysis
For problems with more than two classes, use one-vs-rest ROC analysis:
from extended_sklearn_metrics import (
calculate_multiclass_roc_metrics,
create_multiclass_roc_plot
)
from sklearn.datasets import make_classification
# Create a 3-class classification problem
# n_classes=3: Three different classes to predict
# n_informative=8: 8 out of 10 features are informative
X, y = make_classification(
n_samples=1000,
n_features=10,
n_classes=3,
n_informative=8,
n_clusters_per_class=1,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train a multi-class classifier
# multi_class='ovr': One-vs-Rest strategy
model = LogisticRegression(random_state=42, multi_class='ovr', max_iter=1000)
model.fit(X_train, y_train)
# Get predicted probabilities for all classes
# Shape: (n_samples, n_classes)
y_pred_proba = model.predict_proba(X_test)
# Calculate ROC metrics for each class using one-vs-rest
# Returns ROC curves for each class plus macro/micro averages
multiclass_roc = calculate_multiclass_roc_metrics(y_test, y_pred_proba)
# Print summary statistics
print(f"Macro-average AUC: {multiclass_roc['macro_auc']:.3f}")
print(f"Micro-average AUC: {multiclass_roc['micro_auc']:.3f}")
# Per-class AUC scores
for class_label in multiclass_roc['class_labels']:
auc = multiclass_roc['class_results'][class_label]['roc_auc']
print(f"Class {class_label} AUC: {auc:.3f}")
# Create comprehensive multi-class ROC plot
create_multiclass_roc_plot(multiclass_roc, title="Multi-Class ROC Analysis")
Understanding Multi-Class Metrics:
- Macro-Average AUC: Average of per-class AUC scores (equal weight to each class)
- Micro-Average AUC: Computed from pooled FPR/TPR (accounts for class imbalance)
- One-vs-Rest: Each class is compared against all other classes combined
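The macro/micro distinction can be reproduced with plain scikit-learn by binarizing the labels one-vs-rest, which is a reasonable cross-check of the library's averages (illustrative sketch, not its internals):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class toy problem, for illustration only
X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest binarization: shape (n_samples, n_classes)
y_bin = label_binarize(y_te, classes=[0, 1, 2])
macro = roc_auc_score(y_bin, proba, average="macro")  # mean of per-class AUCs
micro = roc_auc_score(y_bin, proba, average="micro")  # pooled FPR/TPR
print(f"macro AUC: {macro:.3f}, micro AUC: {micro:.3f}")
```

With balanced classes the two averages are usually close; under class imbalance the micro average is dominated by the majority classes.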
Detailed Usage Guide
Comprehensive Model Evaluation
The final_model_evaluation function provides the most thorough assessment:
from extended_sklearn_metrics import (
final_model_evaluation,
print_evaluation_summary,
create_evaluation_report,
create_comprehensive_evaluation_plots
)
import pandas as pd
import numpy as np
# Use meaningful feature names for better interpretability
feature_names = [
'age', 'income', 'credit_score', 'debt_ratio', 'employment_length',
'num_credit_lines', 'num_delinquencies', 'loan_amount',
'property_value', 'savings_amount'
]
# Convert numpy arrays to pandas DataFrames with named features
X_train_df = pd.DataFrame(X_train, columns=feature_names)
X_test_df = pd.DataFrame(X_test, columns=feature_names)
# Perform comprehensive evaluation with all features
evaluation_results = final_model_evaluation(
model=model,
X_train=X_train_df,
y_train=y_train,
X_test=X_test_df,
y_test=y_test,
task_type='classification', # or 'regression'
cv_folds=5, # Number of cross-validation folds
feature_names=feature_names,
suppress_warnings=True, # Suppress sklearn feature name warnings
random_state=42
)
# Print executive summary with key findings
print_evaluation_summary(evaluation_results)
# Create detailed evaluation report as DataFrame
report_df = create_evaluation_report(evaluation_results)
print("\nDetailed Evaluation Report:")
print(report_df)
# Generate comprehensive visualization dashboard
# This creates a multi-panel plot with:
# - Performance metrics
# - Feature importance
# - Cross-validation stability
# - Error analysis
create_comprehensive_evaluation_plots(evaluation_results)
What This Provides:
- Test Set Performance: Accuracy, precision, recall, F1, ROC AUC on held-out data
- Cross-Validation Stability: Mean and standard deviation of metrics across CV folds
- Feature Importance:
- Built-in importance from the model
- Permutation importance (model-agnostic)
- Ranked list of most impactful features
- Model Complexity: Number of parameters, depth (for trees), interpretability score
- Error Analysis: Patterns in misclassifications, correlation with features
- Recommendations: Automated suggestions for model improvement
Fairness Evaluation Across Demographic Groups
Assess whether your model performs equitably across different populations:
from extended_sklearn_metrics import (
create_fairness_report,
create_fairness_comparison_plot
)
import numpy as np
# Create synthetic demographic data
# In production, these would come from your actual dataset
np.random.seed(42)
protected_attrs = {
'gender': np.random.choice(
['Male', 'Female'],
size=len(y_test),
p=[0.6, 0.4]
),
'age_group': np.random.choice(
['18-30', '31-50', '51+'],
size=len(y_test),
p=[0.3, 0.4, 0.3]
),
'ethnicity': np.random.choice(
['Group_A', 'Group_B', 'Group_C'],
size=len(y_test),
p=[0.5, 0.3, 0.2]
)
}
# Run evaluation with fairness analysis enabled
evaluation_results = final_model_evaluation(
model=model,
X_train=X_train_df,
y_train=y_train,
X_test=X_test_df,
y_test=y_test,
task_type='classification',
cv_folds=5,
feature_names=feature_names,
protected_attributes=protected_attrs, # Enable fairness metrics
random_state=42
)
# Generate fairness report comparing performance across groups
fairness_report = create_fairness_report(evaluation_results)
if fairness_report is not None:
print("\nFairness Analysis Across Demographic Groups:")
print(fairness_report)
# Visualize performance disparities
create_fairness_comparison_plot(evaluation_results)
Fairness Metrics Computed:
For each protected attribute group:
- Accuracy: Overall prediction accuracy
- Precision: Positive predictive value
- Recall: True positive rate (sensitivity)
- F1-Score: Harmonic mean of precision and recall
- Group Size: Number of samples in the group
- Disparate Impact: Ratio of positive prediction rates between groups
Interpreting Results:
- Look for significant differences in performance across groups
- Disparate impact ratio < 0.8 or > 1.2 may indicate bias
- Consider both statistical significance and practical importance
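The disparate impact ratio itself is simple to compute by hand: it is the positive-prediction rate of one group divided by that of a reference group. A minimal sketch with hypothetical predictions (all values here are made up for illustration):

```python
import numpy as np

# Hypothetical predictions and a protected attribute for the same samples
y_pred = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 1])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# Positive-prediction rate per group
rate = {g: y_pred[group == g].mean() for g in np.unique(group)}

# Disparate impact: rate of group B relative to reference group A
di = rate["B"] / rate["A"]
print(f"positive rates: {rate}, disparate impact (B/A): {di:.2f}")
# Here di = 0.75 < 0.8, so this comparison would be flagged for review
```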
Regression Model Evaluation
For regression tasks, evaluate with residual diagnostics:
from extended_sklearn_metrics import (
evaluate_model_with_cross_validation,
calculate_residual_diagnostics,
create_residual_plots,
print_residual_diagnostics_report
)
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
# Create regression dataset
# n_samples=1000: 1000 training examples
# n_features=10: 10 input features
# noise=10: Standard deviation of Gaussian noise
X, y = make_regression(
n_samples=1000,
n_features=10,
noise=10,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train regression model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Basic regression evaluation
results = evaluate_model_with_cross_validation(
model,
X_train,
y_train,
cv=5
)
print("Regression Performance Metrics:")
print(f"Test R²: {results['test_r2']:.4f}")
print(f"Test RMSE: {results['test_rmse']:.4f}")
print(f"Test MAE: {results['test_mae']:.4f}")
print(f"\nCross-Validation R²: {results['cv_r2_mean']:.4f} ± {results['cv_r2_std']:.4f}")
# Comprehensive residual diagnostics
residual_diag = calculate_residual_diagnostics(
model,
X_test,
y_test,
cv=5
)
# Print diagnostic report
print_residual_diagnostics_report(residual_diag)
# Create residual diagnostic plots
# - Residuals vs Fitted
# - Q-Q Plot (test for normality)
# - Scale-Location Plot (test for homoscedasticity)
# - Residuals vs Leverage
create_residual_plots(residual_diag)
# Comprehensive regression evaluation
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X_train_df = pd.DataFrame(X_train, columns=feature_names)
X_test_df = pd.DataFrame(X_test, columns=feature_names)
regression_results = final_model_evaluation(
model=model,
X_train=X_train_df,
y_train=y_train,
X_test=X_test_df,
y_test=y_test,
task_type='regression',
cv_folds=5,
feature_names=feature_names,
random_state=42
)
print_evaluation_summary(regression_results)
Residual Diagnostics Include:
- Normality Tests:
  - Shapiro-Wilk test
  - Kolmogorov-Smirnov test
  - Anderson-Darling test
- Heteroscedasticity Tests:
  - Breusch-Pagan test
  - Goldfeld-Quandt test
- Autocorrelation:
  - Durbin-Watson statistic
- Outlier Detection:
  - Cook's distance
  - Leverage analysis
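Two of these diagnostics are easy to reproduce with scipy and numpy alone, which is useful for sanity-checking the report. The sketch below runs a Shapiro-Wilk normality test and computes the Durbin-Watson statistic on synthetic residuals (illustrative only, not the library's implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=200)  # stand-in for model residuals

# Shapiro-Wilk: H0 = residuals are normally distributed
w_stat, p_value = stats.shapiro(residuals)

# Durbin-Watson: values near 2 suggest no first-order autocorrelation
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

print(f"Shapiro-Wilk p = {p_value:.3f}, Durbin-Watson = {dw:.2f}")
```

A small Shapiro-Wilk p-value suggests non-normal residuals; a Durbin-Watson statistic far below 2 suggests positive autocorrelation, far above 2 negative autocorrelation.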
API Reference
Core Evaluation Functions
evaluate_classification_model_with_cross_validation(model, X, y, cv=5, average='weighted')
Evaluate a classification model using k-fold cross-validation.
Parameters:
- model (estimator): Trained scikit-learn classifier with fit and predict methods
- X (array-like, shape (n_samples, n_features)): Training feature data
- y (array-like, shape (n_samples,)): Training target labels
- cv (int, default=5): Number of cross-validation folds
- average (str, default='weighted'): Averaging strategy for multi-class metrics:
  - 'micro': Calculate metrics globally by counting total TP, FP, FN
  - 'macro': Calculate metrics for each label, unweighted mean
  - 'weighted': Calculate metrics for each label, weighted by support
  - 'samples': Calculate metrics for each instance
Returns:
pd.DataFrame: Evaluation results with columns:
- Metric: Name of the metric
- Value: Numerical score
- Threshold: Performance category boundaries
- Calculation: Formula or explanation
- Performance: Categorical assessment (Excellent/Good/Acceptable/Poor/Very Poor)
Example:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
results = evaluate_classification_model_with_cross_validation(
model, X_train, y_train, cv=10, average='macro'
)
print(results)
evaluate_model_with_cross_validation(model, X, y, cv=5, target_range=None, custom_thresholds=None)
General-purpose model evaluation supporting both classification and regression.
Parameters:
- model (estimator): Trained scikit-learn model
- X (array-like): Training features
- y (array-like): Training targets
- cv (int, default=5): Number of cross-validation folds
- target_range (float, optional): Range of target variable for regression (max - min)
- custom_thresholds (CustomThresholds, optional): Custom performance thresholds
Returns:
pd.DataFrame: Task-appropriate metrics (classification or regression)
Example:
from extended_sklearn_metrics import CustomThresholds
# Define custom performance thresholds
custom_thresh = CustomThresholds(
error_thresholds=(5, 10, 15), # RMSE/MAE: Excellent < 5%, Good < 10%, etc.
score_thresholds=(0.7, 0.85) # R²: Poor < 0.7, Good > 0.85
)
results = evaluate_model_with_cross_validation(
model, X, y, cv=5, custom_thresholds=custom_thresh
)
ROC/AUC Analysis Functions
calculate_roc_metrics(y_true, y_pred_proba, pos_label=None)
Calculate ROC curve metrics for binary classification.
Parameters:
- y_true (array-like): True binary labels
- y_pred_proba (array-like): Predicted probabilities for the positive class (0-1)
- pos_label (int/str, optional): Label of positive class (default: inferred)
Returns:
dict: ROC analysis results containing:
- fpr: False positive rates at each threshold
- tpr: True positive rates at each threshold
- thresholds: Decision thresholds
- roc_auc: Area under the ROC curve
- optimal_threshold: Threshold maximizing Youden's index
- optimal_tpr: TPR at optimal threshold
- optimal_fpr: FPR at optimal threshold
- threshold_metrics: DataFrame with detailed threshold analysis
Example:
# Get ROC metrics
roc_metrics = calculate_roc_metrics(y_test, y_pred_proba, pos_label=1)
print(f"AUC: {roc_metrics['roc_auc']:.4f}")
print(f"Optimal Threshold: {roc_metrics['optimal_threshold']:.4f}")
print(f"TPR at Optimal: {roc_metrics['optimal_tpr']:.4f}")
print(f"FPR at Optimal: {roc_metrics['optimal_fpr']:.4f}")
# Access detailed threshold analysis
threshold_df = roc_metrics['threshold_metrics']
print(threshold_df.head())
calculate_multiclass_roc_metrics(y_true, y_pred_proba, class_names=None)
Calculate ROC metrics for multi-class classification using one-vs-rest approach.
Parameters:
- y_true (array-like): True class labels
- y_pred_proba (array-like, shape (n_samples, n_classes)): Predicted probabilities
- class_names (list, optional): Names of classes for display
Returns:
dict: Multi-class ROC results containing:
- class_results: Per-class ROC metrics
- macro_average: Macro-averaged ROC curve and AUC
- micro_average: Micro-averaged ROC curve and AUC
- class_labels: Class labels used
Example:
multiclass_roc = calculate_multiclass_roc_metrics(
y_test,
y_pred_proba,
class_names=['Setosa', 'Versicolor', 'Virginica']
)
print(f"Macro AUC: {multiclass_roc['macro_average']['roc_auc']:.4f}")
print(f"Micro AUC: {multiclass_roc['micro_average']['roc_auc']:.4f}")
find_optimal_thresholds(y_true, y_pred_proba, criteria=['youden', 'f1', 'balanced_accuracy'])
Find optimal classification thresholds using multiple optimization methods.
Parameters:
- y_true (array-like): True binary labels
- y_pred_proba (array-like): Predicted probabilities
- criteria (list, optional): Optimization methods to use:
  - 'youden': Maximizes Youden's index (TPR - FPR)
  - 'f1': Maximizes F1 score
  - 'balanced_accuracy': Maximizes (TPR + TNR) / 2
  - 'closest_to_perfect': Minimizes distance to the (0, 1) point
Returns:
pd.DataFrame: Optimal thresholds with performance metrics for each method
Comprehensive Evaluation
final_model_evaluation(model, X_train, y_train, X_test, y_test, task_type='auto', cv_folds=5, feature_names=None, protected_attributes=None, random_state=42, suppress_warnings=False)
Comprehensive model evaluation with hold-out testing, feature importance, and fairness analysis.
Parameters:
- model (estimator): Trained scikit-learn model
- X_train, y_train: Training data
- X_test, y_test: Test data
- task_type (str, default='auto'): Task type ('classification', 'regression', or 'auto')
- cv_folds (int, default=5): Number of cross-validation folds for stability analysis
- feature_names (list, optional): Feature names for interpretability
- protected_attributes (dict, optional): Protected attributes for fairness analysis
  - Keys: Attribute names (e.g., 'gender', 'age_group')
  - Values: Array-like of attribute values for the test set
- random_state (int, default=42): Random seed for reproducibility
- suppress_warnings (bool, default=False): Suppress sklearn warnings
Returns:
dict: Comprehensive evaluation results containing:
- performance: Test set metrics
- cv_stability: Cross-validation statistics
- feature_importance: Feature importance rankings
- error_analysis: Error patterns and correlations
- fairness_analysis: Fairness metrics by group (if protected_attributes provided)
- interpretation: Model complexity and interpretability scores
Example:
# With fairness analysis
protected_attrs = {
'gender': gender_array,
'age': age_group_array
}
results = final_model_evaluation(
model=model,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
task_type='classification',
cv_folds=10,
feature_names=feature_names,
protected_attributes=protected_attrs,
suppress_warnings=True,
random_state=42
)
# Access specific results
print(f"Test Accuracy: {results['performance']['accuracy']:.4f}")
print(f"CV Stability: {results['cv_stability']['accuracy_std']:.4f}")
Visualization Functions
create_roc_curve_plot(roc_results, title=None, show_optimal_threshold=True)
Create ROC curve visualization with optional optimal threshold marker.
Parameters:
- roc_results (dict): Results from calculate_roc_metrics()
- title (str, optional): Plot title
- show_optimal_threshold (bool, default=True): Highlight optimal threshold point
create_multiclass_roc_plot(multiclass_roc_results, title=None)
Create multi-class ROC curve visualization showing all classes.
Parameters:
- multiclass_roc_results (dict): Results from calculate_multiclass_roc_metrics()
- title (str, optional): Plot title
create_comprehensive_evaluation_plots(evaluation_results, figsize=None)
Create comprehensive evaluation dashboard with multiple panels.
Parameters:
- evaluation_results (dict): Results from final_model_evaluation()
- figsize (tuple, optional): Figure size (width, height)
Panels Include:
- Performance metrics comparison
- Feature importance rankings
- Cross-validation stability
- Error analysis
- Fairness comparison (if available)
Reporting Functions
print_evaluation_summary(evaluation_results)
Print executive summary of evaluation results to console.
Parameters:
- evaluation_results (dict): Results from final_model_evaluation()
create_evaluation_report(evaluation_results)
Create detailed evaluation report as pandas DataFrame.
Parameters:
- evaluation_results (dict): Results from final_model_evaluation()
Returns:
pd.DataFrame: Detailed metrics report with categories and interpretations
create_feature_importance_report(evaluation_results)
Create feature importance analysis report.
Parameters:
- evaluation_results (dict): Results from final_model_evaluation()
Returns:
pd.DataFrame: Feature importance rankings with:
- Feature names
- Importance scores (built-in and permutation)
- Rank position
- Statistical significance
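The permutation column in this report corresponds conceptually to scikit-learn's model-agnostic permutation importance, which can be computed directly for comparison. A minimal sketch (toy data; the feature names are illustrative), contrasting it with a tree model's built-in impurity-based importance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Built-in importance: impurity-based, derived from the training data
builtin = model.feature_importances_

# Permutation importance: mean drop in test score when a feature is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

for i, (b, p) in enumerate(zip(builtin, perm.importances_mean)):
    print(f"feature_{i}: built-in={b:.3f}, permutation={p:.3f}")
```

Large gaps between the two columns often indicate impurity-importance bias (e.g. toward high-cardinality features), which is why the report includes both.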
create_fairness_report(evaluation_results)
Create fairness analysis report by demographic groups.
Parameters:
evaluation_results(dict): Results fromfinal_model_evaluation()
Returns:
pd.DataFrame: Fairness metrics by group with:
- Group identifier
- Performance metrics per group
- Group size
- Disparate impact ratios
Advanced Examples
Complete Production Workflow
This example demonstrates a complete ML evaluation workflow suitable for production environments:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from extended_sklearn_metrics import *
# Step 1: Create realistic dataset with meaningful features
np.random.seed(42)
X, y = make_classification(
n_samples=2000,
n_features=15,
n_informative=10,
n_redundant=3,
n_clusters_per_class=2,
weights=[0.7, 0.3], # Imbalanced classes (70-30 split)
flip_y=0.05, # 5% label noise
random_state=42
)
# Define meaningful feature names (e.g., for loan approval model)
feature_names = [
'annual_income',
'years_education',
'age',
'years_experience',
'debt_to_income_ratio',
'credit_score',
'loan_amount',
'property_value',
'savings_balance',
'num_dependents',
'employment_type_score',
'payment_history_score',
'account_balance',
'investment_portfolio_value',
'has_insurance'
]
# Convert to DataFrame for better handling
X_df = pd.DataFrame(X, columns=feature_names)
y_series = pd.Series(y, name='loan_approved')
# Step 2: Create protected attributes for fairness testing
# In production, these would come from your actual data
protected_attrs = {
'gender': np.random.choice(['Male', 'Female'], size=len(y), p=[0.55, 0.45]),
'age_group': np.random.choice(
['Young (18-30)', 'Middle (31-50)', 'Senior (51+)'],
size=len(y),
p=[0.3, 0.5, 0.2]
),
'ethnicity': np.random.choice(
['Group A', 'Group B', 'Group C'],
size=len(y),
p=[0.6, 0.25, 0.15]
)
}
# Step 3: Split data (stratified to maintain class balance)
X_train, X_test, y_train, y_test = train_test_split(
X_df,
y_series,
test_size=0.2,
random_state=42,
stratify=y_series
)
# Get protected attributes for test set only
test_indices = y_test.index
protected_attrs_test = {
attr: values[test_indices]
for attr, values in protected_attrs.items()
}
# Step 4: Train model with careful hyperparameter selection
model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_split=20,
min_samples_leaf=10,
class_weight='balanced', # Handle class imbalance
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train)
# Step 5: Comprehensive evaluation with all features
print("=" * 80)
print("COMPREHENSIVE MODEL EVALUATION REPORT")
print("=" * 80)
evaluation_results = final_model_evaluation(
model=model,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
task_type='classification',
cv_folds=5,
feature_names=feature_names,
protected_attributes=protected_attrs_test,
suppress_warnings=True,
random_state=42
)
# Step 6: Generate and display reports
# Executive Summary
print("\n" + "=" * 80)
print("EXECUTIVE SUMMARY")
print("=" * 80)
print_evaluation_summary(evaluation_results)
# Detailed Metrics
print("\n" + "=" * 80)
print("DETAILED PERFORMANCE METRICS")
print("=" * 80)
eval_report = create_evaluation_report(evaluation_results)
print(eval_report.to_string())
# Feature Importance Analysis
print("\n" + "=" * 80)
print("TOP 10 MOST IMPORTANT FEATURES")
print("=" * 80)
fi_report = create_feature_importance_report(evaluation_results)
if fi_report is not None:
print(fi_report.head(10).to_string())
# Save feature importance to CSV
fi_report.to_csv('feature_importance.csv', index=False)
print("\nFeature importance saved to: feature_importance.csv")
# Fairness Analysis
print("\n" + "=" * 80)
print("FAIRNESS ANALYSIS ACROSS DEMOGRAPHIC GROUPS")
print("=" * 80)
fairness_report = create_fairness_report(evaluation_results)
if fairness_report is not None:
print(fairness_report.to_string())
# Check for significant disparities
for attr in protected_attrs_test.keys():
attr_data = fairness_report[fairness_report['Attribute'] == attr]
accuracy_range = attr_data['Accuracy'].max() - attr_data['Accuracy'].min()
if accuracy_range > 0.05: # More than 5% difference
print(f"\nWARNING: Significant accuracy disparity in {attr}: {accuracy_range:.1%}")
# Step 7: Create comprehensive visualizations
print("\n" + "=" * 80)
print("GENERATING VISUALIZATIONS")
print("=" * 80)
# Main evaluation dashboard
create_comprehensive_evaluation_plots(evaluation_results)
print("Created: Comprehensive evaluation dashboard")
# Feature importance plot
create_feature_importance_plot(evaluation_results, top_n=15)
print("Created: Feature importance plot")
# Fairness comparison
create_fairness_comparison_plot(evaluation_results)
print("Created: Fairness comparison plot")
# Step 8: ROC curve analysis with threshold optimization
print("\n" + "=" * 80)
print("ROC/AUC ANALYSIS WITH THRESHOLD OPTIMIZATION")
print("=" * 80)
y_pred_proba = model.predict_proba(X_test)[:, 1]
roc_results = calculate_roc_metrics(y_test, y_pred_proba)
# Print ROC summary
print_roc_auc_summary(roc_results)
# Find optimal thresholds
optimal_thresholds = find_optimal_thresholds(y_test, y_pred_proba)
print("\nOptimal Decision Thresholds:")
for method, threshold in optimal_thresholds.items():
print(f" {method}: {threshold:.4f}")
# Create ROC curve
create_roc_curve_plot(roc_results, title="Loan Approval Model - ROC Curve")
# Step 9: Generate recommendations
print("\n" + "=" * 80)
print("MODEL DEPLOYMENT RECOMMENDATIONS")
print("=" * 80)
# Check if model is production-ready
test_accuracy = evaluation_results['performance']['accuracy']
cv_stability = evaluation_results['cv_stability']['accuracy_std']
auc_score = roc_results['roc_auc']
print(f"\nModel Performance Assessment:")
print(f" Test Accuracy: {test_accuracy:.1%}")
print(f" CV Stability (std): {cv_stability:.4f}")
print(f" ROC AUC: {auc_score:.4f}")
if test_accuracy > 0.85 and cv_stability < 0.05 and auc_score > 0.85:
print("\nRECOMMENDATION: Model is ready for production deployment")
print(" - High accuracy and AUC scores")
print(" - Stable performance across cross-validation folds")
print(" - Consider A/B testing against current system")
elif test_accuracy > 0.75:
print("\nRECOMMENDATION: Model shows promise but needs improvement")
print(" - Consider feature engineering")
print(" - Try hyperparameter tuning")
print(" - Collect more training data if possible")
else:
print("\nRECOMMENDATION: Model not ready for production")
print(" - Review feature selection")
print(" - Try different algorithms")
print(" - Investigate data quality issues")
# Check fairness
if fairness_report is not None:
max_disparity = 0
for attr in protected_attrs_test.keys():
attr_data = fairness_report[fairness_report['Attribute'] == attr]
disparity = attr_data['Accuracy'].max() - attr_data['Accuracy'].min()
max_disparity = max(max_disparity, disparity)
if max_disparity > 0.1:
print("\nFAIRNESS WARNING: Significant performance disparities detected")
print(" - Review model for potential bias")
print(" - Consider fairness-aware training methods")
print(" - Consult with ethics/compliance team")
print("\n" + "=" * 80)
print("EVALUATION COMPLETE")
print("=" * 80)
Custom Threshold Evaluation
Example of using custom performance thresholds for domain-specific requirements:
from extended_sklearn_metrics import CustomThresholds, evaluate_model_with_cross_validation
# Define custom thresholds for medical diagnosis model
# Where high accuracy is critical and errors are costly
medical_thresholds = CustomThresholds(
error_thresholds=(2, 5, 10), # Very strict: 2% excellent, 5% good, 10% acceptable
score_thresholds=(0.9, 0.95) # High bar: < 0.9 poor, > 0.95 good
)
# Evaluate regression model for medical predictions
results = evaluate_model_with_cross_validation(
model=medical_model,
X=X_medical,
y=y_medical,
cv=10, # More folds for robust estimation
custom_thresholds=medical_thresholds
)
print("Medical Model Evaluation (Strict Thresholds):")
print(results)
Architecture
Version 0.4.0 - Optimized Modular Architecture
Extended Sklearn Metrics has been significantly refactored for improved maintainability, performance, and code organization:
Modular Structure
Core Modules:
- `model_evaluation.py` - Base evaluation functions for regression
- `classification_evaluation.py` - Classification-specific evaluation
- `comprehensive_evaluation.py` - End-to-end evaluation framework
- `roc_auc_analysis.py` - ROC/AUC analysis and threshold optimization
- `residual_diagnostics.py` - Regression residual analysis
- `evaluation_reporting.py` - Report generation utilities
Internal Utilities:
- `_validation.py` - Shared input validation logic (eliminates code duplication)
- `_plotting_backend.py` - Lazy matplotlib import system (faster imports)
Visualization Package:
The visualizations have been reorganized into a modular sub-package:
visualizations/
├── __init__.py - Package exports for backward compatibility
├── _base.py - Common utilities
├── performance.py - Performance summary and comparison plots
├── roc_curves.py - ROC, PR, and threshold analysis plots
├── residuals.py - Residual diagnostic plots
├── comprehensive.py - Multi-panel evaluation dashboards
└── fairness.py - Feature importance and fairness plots
Key Improvements:
- Code Deduplication: Eliminated 150+ lines of duplicate validation code through the shared `_validation.py` module
- Lazy Imports: Matplotlib is now imported lazily via `_plotting_backend.py`, reducing import time by ~10ms and eliminating repetitive try/except blocks
- Modular Visualizations: Split the 1,487-line monolithic file into 6 focused modules (100-437 lines each) for better maintainability
- Backward Compatibility: 100% backward compatible - all existing code works without modification
- Better Organization: Clear separation of concerns with focused, single-responsibility modules
Dependencies
Required Dependencies
numpy >= 1.24.0
pandas >= 2.0.0
scikit-learn >= 1.3.0
matplotlib >= 3.5.0
scipy >= 1.9.0 (for statistical tests)
Installation
Install all dependencies automatically:
pip install extended-sklearn-metrics
For development:
pip install extended-sklearn-metrics[dev]
Development dependencies include:
- pytest >= 7.0.0
- pytest-cov >= 4.0.0
- black >= 22.0.0
- flake8 >= 5.0.0
Contributing
Contributions are welcome! Here's how you can help:
Reporting Issues
- Use the GitHub issue tracker
- Include a minimal reproducible example
- Specify your environment (OS, Python version, package versions)
Contributing Code
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Run the test suite (`pytest tests/`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Code Style
- Follow PEP 8 guidelines
- Use type hints for function signatures
- Write comprehensive docstrings (Google style)
- Add unit tests for new features
- Update documentation as needed
Version History
v0.4.0 (Current Development)
MAJOR REFACTORING - Improved Architecture and Performance
Architecture Improvements:
- Created `_validation.py` module for shared validation logic
  - Eliminated 150+ lines of duplicate code
  - Single source of truth for input validation
  - Consistent error messages across all functions
- Implemented `_plotting_backend.py` for lazy matplotlib imports
  - ~10ms faster package import time
  - Removed 80+ lines of repetitive try/except blocks
  - Cleaner function code with `@require_matplotlib` decorator
- Reorganized visualizations into a modular sub-package
  - Split the 1,487-line monolithic file into 6 focused modules
  - Improved maintainability (max 437 lines per module)
  - Better code organization by functionality
Code Quality:
- Reduced total codebase by 1,750 lines (-21.5%)
- Zero code duplication in validation logic
- Improved module cohesion and separation of concerns
- Enhanced code readability and maintainability
Performance:
- Faster package imports (cached matplotlib loading)
- More efficient validation (no redundant checks)
- Optimized module loading with lazy imports
Backward Compatibility:
- 100% backward compatible with v0.3.x
- All existing code works without modification
- No breaking changes to public API
- All 79 tests passing
v0.3.5
Bug Fixes:
- Fixed AttributeError in feature interactions analysis when data has insufficient samples
- Enhanced correlation validation and error handling for feature interaction detection
- Improved robustness for edge cases in interaction analysis
v0.3.4
New Features:
- Added `suppress_warnings` parameter to the `final_model_evaluation()` function
- Users can now suppress sklearn warnings about feature names and other non-critical warnings
- Implemented a clean context-manager approach for warning suppression
Example:
results = final_model_evaluation(
    model, X_train, y_train, X_test, y_test,
    suppress_warnings=True  # Suppress sklearn warnings
)
v0.3.3
Bug Fixes:
- Fixed AttributeError in error correlation analysis when X_test has insufficient samples
- Enhanced validation and error handling for correlation calculations
- Improved robustness for edge cases with small datasets
v0.3.2
Bug Fixes:
- Fixed AttributeError in model complexity analysis for tree-based models
- Enhanced error handling in comprehensive evaluation framework
v0.3.1
Improvements:
- Improved error handling and stability
- Enhanced compatibility with different sklearn model types
v0.3.0
MAJOR RELEASE - Comprehensive Evaluation Framework
New Features:
- Added comprehensive ROC/AUC analysis with threshold optimization
- Implemented multi-class ROC support (one-vs-rest approach)
- Added Precision-Recall curves and AUC-PR metrics
- Created comprehensive model evaluation framework (`final_model_evaluation`)
- Added feature importance analysis (built-in + permutation)
- Implemented fairness evaluation across demographic groups
- Added hold-out test evaluation with cross-validation stability
- Created professional reporting and visualization suite
- Added model interpretation and complexity assessment
- Enhanced error analysis and residual diagnostics
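One of the threshold-optimization methods in the ROC analysis, Youden's index, picks the cut-off that maximizes TPR - FPR. A minimal plain-Python sketch of the idea (illustrative only; the library's `roc_auc_analysis.py` implementation may differ):

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) where J = TPR - FPR is maximized.

    y_true: iterable of 0/1 labels; scores: predicted positive-class scores.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_t, best_j = None, -1.0
    # Evaluate each distinct score as a candidate cut-off.
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

In practice the candidate thresholds come straight from the ROC curve, so the optimal point can be read off the same FPR/TPR arrays used for plotting.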
Improvements:
- Complete API overhaul for better usability
- Comprehensive documentation with examples
- Production-ready evaluation capabilities
v0.2.0
New Features:
- Added residual diagnostics for regression models
- Enhanced visualization capabilities
- Improved cross-validation metrics
Improvements:
- Better error handling
- More informative console output
v0.1.0
Initial Release
Features:
- Basic classification and regression evaluation
- Cross-validation support with custom thresholds
- Basic performance visualizations
- Core metrics: accuracy, precision, recall, F1, R², RMSE, MAE
License
MIT License
Copyright (c) 2024 Subashanan Nair
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Repository: https://github.com/SubaashNair/extended-sklearn-metrics
Documentation: https://github.com/SubaashNair/extended-sklearn-metrics/blob/main/README.md
Issues: https://github.com/SubaashNair/extended-sklearn-metrics/issues
File details
Details for the file extended_sklearn_metrics-0.4.0.tar.gz.
File metadata
- Download URL: extended_sklearn_metrics-0.4.0.tar.gz
- Upload date:
- Size: 94.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcd4bb28769f06b7bcffcac1438072294bf7d7b53f893a0ea235f6fdcdb4768c |
| MD5 | 7cd81440e5f77572cd1f5c6c2caf5b00 |
| BLAKE2b-256 | e1fda6bff2e4acf0f7cd35cdc9e5e38e88ad7b5dbcba4d87e0a50864bb695f26 |
File details
Details for the file extended_sklearn_metrics-0.4.0-py3-none-any.whl.
File metadata
- Download URL: extended_sklearn_metrics-0.4.0-py3-none-any.whl
- Upload date:
- Size: 87.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 944d921bfefa2eb593ac74c4e015213da04802fc5db9edc5880356310c91f8eb |
| MD5 | 2ee5370b24504cf348f32e720f1d9c8e |
| BLAKE2b-256 | f98f9b624339dbe361af0d514326eb3a16ae654ad4e7dedc8f9932f4e25efb14 |