Fit classification models with a customized cost matrix
Project description
bbml.ensemble / BijanClassifier
Motivation
The bbml.ensemble library is a Python tool for classification problems where accuracy is not the primary concern. Instead of maximizing accuracy, it fits models according to weights or costs associated with different types of errors, assigning labels according to a specified weight/cost matrix. This also makes it a useful tool for decision-making scenarios. One significant application is in industrial settings, such as manufacturing systems, where misclassifying a good item as bad, or a bad item as good, incurs a different cost depending on the error type. In this release, BijanClassifier supports four base classification models: decision tree, random forest, gradient boosting, and logistic regression.
For binary classification problems, standard models aim to maximize the accuracy of the fitted model on the training set, treating Type I errors (false positives, FP) and Type II errors (false negatives, FN) with equal weight (symmetric cost). BijanClassifier instead assigns asymmetric cost values to each error type.
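To make the asymmetry concrete, the following minimal sketch derives the probability cutoff that minimizes expected cost in a binary problem. It assumes the cost matrix is indexed as cost_matrix[true_label][predicted_label] with zero cost on the diagonal. That orientation and the helper names min_cost_threshold and predict_label are illustrative assumptions, not the documented behavior of BijanClassifier.

```python
def min_cost_threshold(cost_matrix):
    """Return the positive-class probability cutoff that minimizes expected cost.

    Assumption: cost_matrix[true][pred], with zeros on the diagonal.
    Predicting 1 costs (1 - p) * cost_fp in expectation; predicting 0 costs
    p * cost_fn. Predicting 1 is cheaper once p >= cost_fp / (cost_fp + cost_fn).
    """
    cost_fp = cost_matrix[0][1]  # true class 0 predicted as 1
    cost_fn = cost_matrix[1][0]  # true class 1 predicted as 0
    return cost_fp / (cost_fp + cost_fn)


def predict_label(p_positive, cost_matrix):
    """Assign the label with the lower expected cost."""
    return int(p_positive >= min_cost_threshold(cost_matrix))


symmetric = [[0, 1], [1, 0]]
asymmetric = [[0, 20], [80, 0]]

print(min_cost_threshold(symmetric))   # 0.5
print(min_cost_threshold(asymmetric))  # 0.2
print(predict_label(0.3, symmetric), predict_label(0.3, asymmetric))  # 0 1
```

With symmetric costs the cutoff is the familiar 0.5. When a false negative is four times as expensive as a false positive, the cutoff drops to 0.2, so a sample with a positive-class probability of 0.3 changes label.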
Instructions
- Install: pip install bbml
- Import: from bbml.ensemble import BijanClassifier
- Call:
class bbml.ensemble.BijanClassifier(*, model_type='RF', cost_matrix=None, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)
- Parameters:
model_type : {“RF”, “DT”, “GB”, “LR”}, default=“RF”
The base classification model: RF (Random Forest), DT (Decision Tree), GB (Gradient Boosting), or LR (Logistic Regression).
cost_matrix : array-like or None, default=None
Weights/costs associated with the error types, given as an array. If None, only the standard classification model is fitted, without considering the error cost matrix.
thresholds : int or array-like, default=100
The number of decision thresholds to use when discretizing the output of the classifier method. Pass an array-like to manually specify the thresholds to use.
cv : int, float, cross-validation generator, iterable or “prefit”, default=None
Determines the cross-validation splitting strategy to train classifier.
greater_is_better : bool, default=True
Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
Note:
Depending on the selected model_type, further parameters can be set. For tree-based methods, the following parameters can be tuned. All other parameters, such as n_estimators (for RF and GB), bootstrap, and n_jobs, can be set and tuned in the same manner as for the base models in the scikit-learn library.
criterion : {“gini”, “entropy”, “log_loss”}, default=“gini”
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see Mathematical formulation.
max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
max_features : int, float or {“sqrt”, “log2”}, default=None
The number of features to consider when looking for the best split.
random_state : int, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
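The thresholds and greater_is_better parameters above suggest a search over candidate decision cutoffs. The sketch below shows one way such a search could work on a single set of predicted probabilities. It is a simplified illustration that skips cross-validation, places the grid on [0, 1], and assumes cost_matrix[true][pred]. The function names total_cost and sweep_thresholds are hypothetical and not part of bbml.

```python
def total_cost(y_true, y_pred, cost_matrix):
    """Sum cost_matrix[true][pred] over all samples (assumed orientation)."""
    return sum(cost_matrix[t][p] for t, p in zip(y_true, y_pred))


def sweep_thresholds(y_true, proba, cost_matrix, thresholds=100):
    """Return (best_threshold, best_score) over a grid of candidate cutoffs.

    thresholds: an int >= 2 (size of an evenly spaced grid on [0, 1])
    or an explicit iterable of cutoffs.
    """
    if isinstance(thresholds, int):
        grid = [i / (thresholds - 1) for i in range(thresholds)]
    else:
        grid = list(thresholds)

    best_threshold, best_score = None, float("-inf")
    for t in grid:
        y_pred = [int(p >= t) for p in proba]
        # Cost is a loss (lower is better), so negate it to obtain a score
        # where higher is better, as with greater_is_better=False.
        score = -total_cost(y_true, y_pred, cost_matrix)
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


y_true = [0, 0, 0, 1, 1, 0, 1, 1]
proba = [0.05, 0.15, 0.32, 0.25, 0.45, 0.55, 0.70, 0.90]
cost_matrix = [[0, 20], [80, 0]]

best_threshold, best_score = sweep_thresholds(
    y_true, proba, cost_matrix, thresholds=11
)
print(best_threshold, best_score)  # 0.2 -40
```

On this toy data the cheapest cutoff is 0.2, where two false positives cost 40 in total and no positives are missed.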
- Methods and attributes:
fit(X, y, **params) :
Fit the classifier.
X : {array-like, sparse matrix} of shape (n_samples, n_features). Training data.
y : array-like of shape (n_samples,). Target values.
predict(X) :
Predict the target of new samples.
X : {array-like, sparse matrix} of shape (n_samples, n_features).
predict_proba(X) :
Predict class probabilities for X using the fitted estimator.
X : {array-like, sparse matrix} of shape (n_samples, n_features).
accuracy(X, y, sample_weight=None) :
Return the mean accuracy on the given test data and labels.
cost(y, y_pred) :
Return the misclassification cost based on the cost matrix. y holds the true labels and y_pred the predicted labels.
estimator_ :
estimator instance. The fitted classifier used when predicting.
best_params_ :
Parameter setting that gave the best results on the hold-out data for multi-class problems.
best_threshold_ :
The new decision threshold for binary classification problems.
best_score_ :
The optimal score of the objective metric, evaluated at best_threshold_.
cv_results_ :
A dictionary containing the scores and thresholds computed during the cross-validation process. Only exists if store_cv_results=True. The keys are "thresholds" and "scores".
n_features_in_ :
Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.
feature_names_in_ :
Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.
coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function for logistic regression. coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
intercept_ : ndarray of shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function for logistic regression. If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
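The difference between accuracy and cost can be illustrated with a small, self-contained sketch. It weights each cell of the confusion matrix by the matching cost-matrix entry, assuming rows are true labels and columns are predicted labels. Whether BijanClassifier.cost returns a total or a mean, and which orientation it uses, is not stated here, so the helpers below are hypothetical stand-ins.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Count samples per (true label, predicted label) pair."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def misclassification_cost(y_true, y_pred, cost_matrix):
    """Total cost: confusion counts weighted element-wise by cost_matrix."""
    n = len(cost_matrix)
    counts = confusion_counts(y_true, y_pred, n_classes=n)
    return sum(counts[i][j] * cost_matrix[i][j] for i in range(n) for j in range(n))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 0, 1, 1, 1]
pred_a = [1, 0, 0, 1, 1, 1]  # one false positive
pred_b = [0, 0, 0, 0, 1, 1]  # one false negative
cost_matrix = [[0, 20], [80, 0]]

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, accuracy(y_true, pred), misclassification_cost(y_true, pred, cost_matrix))
```

Both prediction sets make exactly one error and therefore have the same accuracy, yet the cost of B is four times that of A (80 versus 20) because its single error is a false negative.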
- Sample Code:
import numpy as np
from bbml.ensemble import BijanClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Step 1: Generate a binary dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1, random_state=42)

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define the cost matrix and the base model
CM = np.array([[0, 20], [80, 0]])
BM = "DT"

# Step 4: Fit a BijanClassifier on the training set
clf = BijanClassifier(model_type=BM, cost_matrix=CM, max_depth=2, random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate the classifier on the testing set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = clf.accuracy(y_test, y_pred)

# Calculate the cost associated with the error types
cost = clf.cost(y_test, y_pred)

# Print accuracy, cost, and classification report
print(f"Accuracy: {accuracy}")
print(f"Cost: {cost}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
Download files
File details
Details for the file bbml-0.0.1.tar.gz.
File metadata
- Download URL: bbml-0.0.1.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e610ed7a506a099575f45c2c1b98fa631c2533a119880fc51144beab33a4e177 |
| MD5 | e24486c4a0da54fa990b7bf31f7030b2 |
| BLAKE2b-256 | af2cf6b5fc47096ba089d8a66f25fcdec5c061c5c33becb32505b7413421f5fb |
File details
Details for the file bbml-0.0.1-py3-none-any.whl.
File metadata
- Download URL: bbml-0.0.1-py3-none-any.whl
- Upload date:
- Size: 6.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ad976a84603bb2e661cd082bbad1b726b5b83c8cdacba847ce3b49e682cdbc6 |
| MD5 | efb303246bc094be6897bd5ed26ebeda |
| BLAKE2b-256 | ff350131ae32bbc0016bd42959717c231ab28fcb82eff43f917146587411ae61 |