volkanoban
A powerful stacking classifier framework that integrates advanced machine learning techniques, overfitting prevention, and explainability features such as LIME, SHAP, and model interpretation dashboards.
The volkanoban library is a robust stacking framework that integrates models such as Random Forest, XGBoost, LightGBM, CatBoost, and other advanced techniques like Extra Trees, Bagging, and HistGradientBoosting. It also includes deep learning methods like MLPClassifier (a multi-layer neural network). These models are combined using stacking and voting to improve accuracy and performance across various datasets.
volkanoban provides a variety of powerful functions for model training, evaluation, and explainability.
Key Features of the volkanoban Library
- Stacking Classifier: combines multiple models using a stacking approach, where a meta-learner is trained to make the final prediction based on the outputs of the base learners.
- Voting Classifier: uses soft voting to combine the predictions of multiple models, improving overall classification accuracy by leveraging the strengths of different algorithms.
- Advanced Model Support:
  Tree-based models: Random Forest, XGBoost, LightGBM, CatBoost, Extra Trees, and HistGradientBoosting.
  Gradient-based models: MLPClassifier (multi-layer perceptron).
  Bagging models: BaggingClassifier for enhanced stability and accuracy.
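As a generic illustration of how stacking and soft voting combine base learners (shown with plain scikit-learn, not volkanoban's internal implementation):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("nb", GaussianNB()),
]

# Stacking: a meta-learner (here logistic regression) is trained on the
# out-of-fold predictions of the base learners.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack_acc = stack.fit(X_train, y_train).score(X_test, y_test)

# Soft voting: predicted class probabilities are averaged across models.
voter = VotingClassifier(estimators=base_learners, voting="soft")
voting_acc = voter.fit(X_train, y_train).score(X_test, y_test)
```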
Automatic Feature Scaling
Automatically scales features when using gradient-based models (such as MLPClassifier) but skips scaling for tree-based models (such as XGBoost and LightGBM), which do not require it. The decision is made dynamically based on the models in use.
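A minimal sketch of this dynamic decision, using a hypothetical wrap_with_scaling helper (this is not volkanoban's actual API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rule: only scale-sensitive (gradient-based) estimators get a scaler.
SCALE_SENSITIVE = (MLPClassifier,)

def wrap_with_scaling(model):
    """Prepend a StandardScaler for scale-sensitive models; leave tree models as-is."""
    if isinstance(model, SCALE_SENSITIVE):
        return make_pipeline(StandardScaler(), model)
    return model

mlp = wrap_with_scaling(MLPClassifier())          # becomes a scaling pipeline
rf = wrap_with_scaling(RandomForestClassifier())  # returned unchanged
```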
Overfitting Prevention
Limiting Tree Depth: In models like Random Forest, XGBoost, LightGBM, and CatBoost, the max_depth parameter is used to control overfitting by limiting the complexity of each tree.
Regularization (L1 and L2): Implemented in XGBoost, LightGBM, and MLPClassifier using reg_alpha (L1) and reg_lambda (L2) to penalize large weights and prevent overfitting.
Subsampling and Feature Sampling: Used in XGBoost, LightGBM, and CatBoost to improve generalization by sampling both data points and features.
Learning Rate: A lower learning rate helps models like XGBoost, LightGBM, and CatBoost generalize better by controlling how much the model adjusts with each learning step.
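The same controls can be illustrated with scikit-learn's GradientBoostingClassifier, which exposes equivalent parameters (the exact values volkanoban uses are not documented here; these are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingClassifier(
    max_depth=3,         # limit tree complexity
    subsample=0.8,       # row subsampling for better generalization
    max_features=0.8,    # feature subsampling per split
    learning_rate=0.05,  # smaller adjustment per boosting round
    n_estimators=100,
    random_state=42,
)
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
```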
Data Imputation
Handles missing values using SimpleImputer and KNNImputer.
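Both imputers come from scikit-learn; a small sketch of how each fills gaps:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# SimpleImputer fills with a per-column statistic (here, the column mean).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNNImputer fills using values from the most similar rows.
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)
```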
Explainability
Provides model explainability through:
LIME: Local Interpretable Model-agnostic Explanations.
Feature Importance Visualization: Visualizes feature importance from different models.
Cross-Validation
Evaluates model performance through k-fold cross-validation, ensuring robustness and generalizability.
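A generic k-fold sketch with scikit-learn's cross_val_score (volkanoban wraps this behind its cross_validate method):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the held-out set.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
mean_score = scores.mean()
```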
Hyperparameter Tuning
Searches for optimal model parameters using GridSearchCV.
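A minimal GridSearchCV sketch with an illustrative parameter grid (the grid volkanoban searches internally is not shown here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Exhaustively evaluate every parameter combination with 3-fold CV.
param_grid = {"max_depth": [2, 4], "n_estimators": [25, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```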
Advanced Metrics
Calculates metrics like Matthews Correlation Coefficient (MCC), ROC AUC, Accuracy, Precision, Recall, and F1 Score for comprehensive performance evaluation.
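All of these metrics are available in scikit-learn; a quick worked example on a toy binary problem:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_proba = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # predicted probability of class 1

mcc = matthews_corrcoef(y_true, y_pred)   # balanced even on skewed classes
auc = roc_auc_score(y_true, y_proba)      # ranking quality of probabilities
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```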
Installation
You can install the package using pip:
pip install volkanoban
Usage Example 1: Breast Cancer Dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from volkanoban import volkanobanClassifier
# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
# Initialize the volkanobanClassifier
classifier = volkanobanClassifier()
# Train the classifier
X_train, X_test, y_train, y_test = classifier.train(X, y)
# Predict on new data
y_pred = classifier.predict(X_test)
# Evaluate model performance
num_classes = len(np.unique(y_test))
classifier.evaluate_performance(y_test, y_pred, num_classes)
# Plot ROC curve
y_pred_proba = classifier.stacking_model.predict_proba(X_test)
classifier.plot_roc_curve(y_test, y_pred_proba)
# Perform LIME analysis
classifier.lime_analysis(X_train, X_test, index=0, feature_names=data.feature_names, class_names=data.target_names)
# Perform cross-validation
classifier.cross_validate(X, y, cv=5)
# Perform hyperparameter tuning
classifier.hyperparameter_tuning(X, y)
# Plot feature importance
classifier.plot_feature_importance(X.columns)
# Run the ExplainerDashboard
classifier.run_explainer_dashboard(X_train, X_test, y_test, X.columns)
Usage Example 2: Forest Cover Type Dataset
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_covtype
from volkanoban import volkanobanClassifier
# Load the forest cover type dataset
data = fetch_covtype()
X = pd.DataFrame(data.data) # Feature matrix
y = pd.Series(data.target) # Target variable
# Initialize the volkanobanClassifier
classifier = volkanobanClassifier()
# Train the classifier
X_train, X_test, y_train, y_test = classifier.train(X, y)
# Make predictions on the test set
y_pred = classifier.predict(X_test)
# 1. Evaluate model performance
num_classes = len(np.unique(y_test))  # number of classes in the test set
classifier.evaluate_performance(y_test, y_pred, num_classes)
# 2. Plot the ROC curve with AUC
y_pred_proba = classifier.stacking_model.predict_proba(X_test)
classifier.plot_roc_curve(y_test, y_pred_proba)
# 3. Calculate extra metrics (Matthews Correlation Coefficient and ROC AUC)
classifier.extra_metrics(y_test, y_pred)
# 4. Perform LIME analysis on a single test instance
feature_names = X.columns
class_names = [str(i) for i in np.unique(y)]
classifier.lime_analysis(X_train, X_test, index=0, feature_names=feature_names, class_names=class_names)
# 5. Evaluate performance with cross-validation
classifier.cross_validate(X, y, cv=5)
# 6. Perform hyperparameter tuning with GridSearchCV
classifier.hyperparameter_tuning(X, y)
# 7. Plot feature importance across models
classifier.plot_feature_importance(X.columns)
# 8. Launch the interactive ExplainerDashboard
classifier.run_explainer_dashboard(X_train, X_test, y_test, X.columns)
Usage Example 3: Wine Dataset (Multi-class Classification)
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from volkanoban import volkanobanClassifier
# Load the wine dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names) # Feature matrix
y = pd.Series(data.target) # Target variable
# Initialize the volkanobanClassifier
classifier = volkanobanClassifier()
# Train the classifier
X_train, X_test, y_train, y_test = classifier.train(X, y)
# Make predictions on the test set
y_pred = classifier.predict(X_test)
# 1. Evaluate model performance
num_classes = len(np.unique(y_test))  # number of classes in the test set
classifier.evaluate_performance(y_test, y_pred, num_classes)
# 2. Plot the ROC curve with AUC
y_pred_proba = classifier.stacking_model.predict_proba(X_test)
classifier.plot_roc_curve(y_test, y_pred_proba)
# 3. Calculate extra metrics (Matthews Correlation Coefficient and ROC AUC)
classifier.extra_metrics(y_test, y_pred)
# 4. Perform LIME analysis on a single test instance
feature_names = X.columns
class_names = [str(i) for i in np.unique(y)]
classifier.lime_analysis(X_train, X_test, index=0, feature_names=feature_names, class_names=class_names)
# 5. Evaluate performance with cross-validation
classifier.cross_validate(X, y, cv=5)
# 6. Perform hyperparameter tuning with GridSearchCV
classifier.hyperparameter_tuning(X, y)
# 7. Plot feature importance across models
classifier.plot_feature_importance(X.columns)
# 8. Launch the interactive ExplainerDashboard
classifier.run_explainer_dashboard(X_train, X_test, y_test, feature_names)
Predict Function Description
The predict function in volkanoban generates predictions on unseen data, for single or batch inputs. It ensures that the input provided for prediction matches the feature set used during model training, maintaining consistency between training and inference.
Example Usage:
from volkanoban import volkanobanClassifier
import pandas as pd
# Initialize the classifier (it must first be trained with classifier.train(X, y))
classifier = volkanobanClassifier()
# Example input data: replace with actual feature values
input_data = {"mean radius": 14.2, "mean texture": 15.6, "mean perimeter": 89.0, "mean area": 530.0}
# Convert input data to a DataFrame matching the model's expected feature format
df_input = pd.DataFrame([input_data])
# Make predictions
y_pred = classifier.predict(df_input)
# Output the predicted class
print("Predicted class:", y_pred)
Function Descriptions
evaluate_performance
This function evaluates the model's performance using metrics like accuracy, precision, recall, F1 score, and confusion matrix. It prints a well-formatted table for easy interpretation.
Arguments:
y_true: Ground truth labels.
y_pred: Labels predicted by the model.
num_classes: Number of unique classes in the dataset.
Example Usage:
classifier.evaluate_performance(y_test, y_pred, num_classes)
lime_analysis
This function generates a LIME explanation for a specific test instance, showing how individual features influence the model's prediction.
Arguments:
X_train: The scaled training dataset.
X_test: The scaled testing dataset.
index: Index of the test instance to analyze.
feature_names: List of feature names from the dataset.
class_names: List of class names corresponding to the target variable.
Example Usage:
classifier.lime_analysis(X_train, X_test, 0, feature_names, class_names)
plot_feature_importance
This function visualizes feature importance across base models in the stacking classifier.
Arguments:
feature_names: List of feature names from the dataset.
Example Usage:
classifier.plot_feature_importance(feature_names)
run_explainer_dashboard
This function launches an interactive dashboard using explainerdashboard, allowing exploration of the model's predictions, feature importance, and more.
Arguments:
X_train: The scaled training dataset.
X_test: The scaled testing dataset.
y_test: Ground truth labels for the testing dataset.
feature_names: List of feature names from the dataset.
dashboard_title: Optional title for the dashboard.
Overfitting Prevention Strategies Applied:
- Limiting Tree Depth: in models like Random Forest, XGBoost, and LightGBM, the max_depth parameter controls overfitting by limiting the complexity of each tree.
- Regularization (L1 and L2): implemented in XGBoost, LightGBM, and MLPClassifier using reg_alpha (L1) and reg_lambda (L2) to penalize large weights and prevent overfitting.
- Early Stopping: applied in XGBoost, LightGBM, and CatBoost to stop training when the model's performance on the validation set no longer improves.
- Gradient-based Model Scaling: input features are automatically scaled for models like MLPClassifier, ensuring that gradient-based algorithms perform optimally; tree-based models like XGBoost and LightGBM, which do not require scaling, are excluded from this process.
- Cross-Validation: ensures model generalization by evaluating performance across different splits of the data, reducing the likelihood of overfitting to a specific training set.
This implementation effectively reduces the risk of overfitting while maintaining the flexibility and performance of various machine learning models in the volkanoban framework.