Fit classification models with a customized cost matrix

These details have not been verified by PyPI

Project description

bbml.ensemble / BijanClassifier

Motivation

The bbml.ensemble library is a Python tool for classification problems in which accuracy is not the primary concern. Instead of maximizing accuracy, it fits models according to the weights or costs associated with different types of errors, and assigns labels according to a user-specified weight/cost matrix. This also makes it useful as a tool for decision-making scenarios. One significant application is in industrial settings such as manufacturing systems, where misclassifying a good item as bad, or a bad item as good, incurs a different cost depending on the error type. In this release, BijanClassifier supports four base classification models: decision tree, random forest, gradient boosting, and logistic regression.

For binary classification problems, standard models aim to maximize the accuracy of the fitted model on the training set, which treats Type I and Type II errors (false positives, FP, and false negatives, FN) with equal weight (symmetric cost). BijanClassifier instead assigns asymmetric cost values to each error type.
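To make the difference between accuracy and cost concrete, here is a minimal sketch in plain NumPy. It does not use bbml itself, and the helper `misclassification_cost` as well as the layout `cost_matrix[true_label, predicted_label]` are assumptions made for illustration; the layout bbml expects is not stated here.

```python
import numpy as np

def misclassification_cost(y_true, y_pred, cost_matrix):
    """Sum cost_matrix[true, predicted] over all samples (assumed layout)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cost_matrix = np.asarray(cost_matrix)
    return float(cost_matrix[y_true, y_pred].sum())

# Hypothetical costs: a false positive costs 20, a false negative costs 80.
cost_matrix = np.array([[0, 20],
                        [80, 0]])

y_true = np.array([0, 0, 1, 1, 1])
pred_a = np.array([0, 1, 1, 1, 1])  # one false positive
pred_b = np.array([0, 0, 0, 1, 1])  # one false negative

acc_a = float((pred_a == y_true).mean())
acc_b = float((pred_b == y_true).mean())
cost_a = misclassification_cost(y_true, pred_a, cost_matrix)
cost_b = misclassification_cost(y_true, pred_b, cost_matrix)

print(f"accuracy: {acc_a} vs {acc_b}")
print(f"cost:     {cost_a} vs {cost_b}")
```

Both sets of predictions make exactly one mistake and so have the same accuracy, yet under these costs the false negative is four times as expensive as the false positive.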

Instructions

Install:
```
pip install bbml
```

Import:

```
from bbml.ensemble import BijanClassifier
```

Call:

```
class bbml.ensemble.BijanClassifier(*, model_type='RF', cost_matrix=None, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)
```

Parameters:

model_type : {"RF", "DT", "GB", "LR"}, default="RF"

Specifies the base classification model: RF (Random Forest), DT (Decision Tree), GB (Gradient Boosting), or LR (Logistic Regression).

cost_matrix : array-like or None, default=None

Weights/costs associated with the error types, provided as an array. If None, only the standard classification model is fitted, without considering the error cost matrix.
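One common way to turn a cost matrix into labels is the minimum-expected-cost rule: given class probabilities, predict the class whose expected cost is lowest. The sketch below illustrates that general rule in NumPy; it is not taken from the bbml source, and both the function name `min_expected_cost_labels` and the layout `cost_matrix[true_label, predicted_label]` are assumptions.

```python
import numpy as np

def min_expected_cost_labels(proba, cost_matrix):
    """Pick, for each sample, the label with the lowest expected cost.

    proba[n, i]       : probability that sample n belongs to class i
    cost_matrix[i, j] : cost of predicting j when the true class is i (assumed)
    """
    expected = np.asarray(proba) @ np.asarray(cost_matrix)  # expected[n, j]
    return expected.argmin(axis=1)

cost_matrix = np.array([[0, 20],
                        [80, 0]])

# Hypothetical class probabilities for three samples.
proba = np.array([[0.9, 0.1],
                  [0.7, 0.3],
                  [0.4, 0.6]])

labels = min_expected_cost_labels(proba, cost_matrix)

# In the binary case the rule reduces to a threshold on P(class 1):
# predict 1 when P(class 1) > c_FP / (c_FP + c_FN).
threshold = cost_matrix[0, 1] / (cost_matrix[0, 1] + cost_matrix[1, 0])

print(labels, threshold)
```

With these costs the second sample is labeled 1 even though its probability of class 1 is only 0.3, because a missed positive is assumed to be four times as costly as a false alarm.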

thresholds : int or array-like, default=100

The number of decision thresholds to use when discretizing the output of the classifier. Pass an array-like to manually specify the thresholds to use.
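A threshold search of this kind can be sketched with scikit-learn and NumPy alone. The example below is an illustration under stated assumptions, not the bbml implementation: the candidate thresholds are spaced evenly over [0, 1], the cost matrix layout is `cost_matrix[true_label, predicted_label]`, and a plain hold-out split stands in for whatever validation scheme the library uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed layout: cost_matrix[true_label, predicted_label].
cost_matrix = np.array([[0, 20],
                        [80, 0]])

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
p_pos = model.predict_proba(X_val)[:, 1]  # probability of class 1

# 100 evenly spaced candidate thresholds, mirroring thresholds=100.
thresholds = np.linspace(0.0, 1.0, 100)

# Total validation cost obtained at each candidate threshold.
costs = np.array([
    cost_matrix[y_val, (p_pos >= t).astype(int)].sum()
    for t in thresholds
])

best_threshold = thresholds[costs.argmin()]
best_cost = costs.min()
print(f"best threshold: {best_threshold:.3f}, cost: {best_cost}")
```

Passing an array-like instead of an integer would correspond to replacing `np.linspace(...)` with a hand-picked list of candidate values.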

cv : int, float, cross-validation generator, iterable or “prefit”, default=None

Determines the cross-validation splitting strategy used to train the classifier.
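As a rough illustration of what an integer and a cross-validation generator mean for a `cv` argument, the sketch below uses scikit-learn's `cross_val_predict` rather than bbml itself. How BijanClassifier consumes `cv` internally is not shown here, so treat this only as an example of the two input styles.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# An int requests that many folds (stratified for classifiers in scikit-learn).
proba_int = cross_val_predict(model, X, y, cv=5, method="predict_proba")

# A cross-validation generator gives explicit control over shuffling and seeding.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba_gen = cross_val_predict(model, X, y, cv=splitter, method="predict_proba")

# Each row holds out-of-fold class probabilities for one sample.
print(proba_int.shape, proba_gen.shape)
```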

greater_is_better : bool, default=True

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
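The sign-flip behaviour can be seen with scikit-learn's `make_scorer`, which takes the same `greater_is_better` flag. The cost function `total_cost` and the cost matrix layout `cost_matrix[true_label, predicted_label]` below are assumptions for illustration, and a scikit-learn decision tree stands in for BijanClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier

# Assumed layout: cost_matrix[true_label, predicted_label].
cost_matrix = np.array([[0, 20],
                        [80, 0]])

def total_cost(y_true, y_pred):
    """A loss function: lower is better."""
    return float(cost_matrix[np.asarray(y_true), np.asarray(y_pred)].sum())

# greater_is_better=False tells scikit-learn this is a loss, so the scorer
# returns the negated value and "higher is better" still holds for the score.
cost_scorer = make_scorer(total_cost, greater_is_better=False)

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

raw_cost = total_cost(y, model.predict(X))
score = cost_scorer(model, X, y)
print(raw_cost, score)
```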

Note :

Depending on the selected model_type, other parameters can also be set. For tree-based methods, the following parameters can be tuned. All other parameters, such as n_estimators (for RF and GB), bootstrap, and n_jobs, can be set and tuned in the same manner as for the base models in the scikit-learn library.

criterion : {"gini", "entropy", "log_loss"}, default="gini"

The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity, and "log_loss" and "entropy", both for the Shannon information gain; see the Mathematical formulation section of the scikit-learn decision tree documentation.
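For reference, the two impurity measures can be computed by hand. This is a minimal NumPy sketch of the textbook definitions (Gini impurity 1 - sum(p_k^2) and Shannon entropy -sum(p_k log2 p_k)); the helper names are made up for the example.

```python
import numpy as np

def class_proportions(labels):
    """Fraction of samples in each class present in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def shannon_entropy(labels):
    p = class_proportions(labels)
    # Written as sum(p * log2(1/p)) so a pure node yields 0.0 rather than -0.0.
    return float(np.sum(p * np.log2(1.0 / p)))

mixed_node = [0, 0, 1, 1]  # evenly split between two classes
pure_node = [1, 1, 1, 1]   # a single class

print(gini_impurity(mixed_node), shannon_entropy(mixed_node))
print(gini_impurity(pure_node), shannon_entropy(pure_node))
```

Both measures are zero for a pure node and largest for an even split, so a split is considered better the more it reduces them.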

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

max_features : int, float or {"sqrt", "log2"}, default=None

The number of features to consider when looking for the best split.

random_state : int, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm selects max_features at random at each split before finding the best split among them. The best split found may still vary across runs even if max_features=n_features: this happens when the improvement of the criterion is identical for several splits and one of them has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer. See the scikit-learn Glossary for details.
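The effect of fixing random_state can be checked with a plain scikit-learn decision tree, used here as a stand-in for the DT base model. The sketch only demonstrates that a fixed integer seed reproduces the same tree; the helper `fit_tree` is made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fit_tree(seed):
    # max_features < n_features, so candidate features are sampled at each split.
    return DecisionTreeClassifier(
        max_features=3, max_depth=4, random_state=seed
    ).fit(X, y)

tree_a = fit_tree(42)
tree_b = fit_tree(42)

# With the same integer seed, both fits choose identical split features
# and thresholds at every node.
same_structure = (
    np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
    and np.allclose(tree_a.tree_.threshold, tree_b.tree_.threshold)
)
print(same_structure)
```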

Methods and attributes:

fit(X, y, **params) :

Fit the classifier. X : {array-like, sparse matrix} of shape (n_samples, n_features), the training data. y : array-like of shape (n_samples,), the target values.

predict(X) :

Predict the target of new samples. X : {array-like, sparse matrix} of shape (n_samples, n_features).

predict_proba(X) :

Predict class probabilities for X using the fitted estimator. X : {array-like, sparse matrix} of shape (n_samples, n_features).

accuracy(X, y, sample_weight=None) :

Return the mean accuracy on the given test data and labels.

cost(y, y_pred) :

Return the misclassification cost based on the cost matrix. y_pred contains the predicted values.

estimator_ :

estimator instance. The fitted classifier used when predicting.

best_params_ :

Parameter setting that gave the best results on the held-out data for multi-class problems.

best_threshold_ :

The new decision threshold for binary classification problems.

best_score_ :

The optimal score of the objective metric, evaluated at best_threshold_.

cv_results_ :

A dictionary containing the scores and thresholds computed during the cross-validation process. Only exists if store_cv_results=True. The keys are "thresholds" and "scores".

n_features_in_ :

Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

feature_names_in_ :

Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)

Coefficient of the features in the decision function for logistic regression. coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

intercept_ : ndarray of shape (1,) or (n_classes,)

Intercept (a.k.a. bias) added to the decision function for logistic regression. If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
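The role of coef_ and intercept_ in the decision function can be verified on a plain scikit-learn LogisticRegression, used here as a stand-in for the LR base model. This is a sketch for the binary case only; whether BijanClassifier exposes these attributes in exactly the same way is taken from the descriptions above, not checked against the bbml source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Binary problem: coef_ has shape (1, n_features), intercept_ has shape (1,).
scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# The logistic (sigmoid) function maps the linear score to P(class 1).
p_manual = 1.0 / (1.0 + np.exp(-scores))
p_sklearn = clf.predict_proba(X)[:, 1]

print(clf.coef_.shape, clf.intercept_.shape)
print(np.allclose(p_manual, p_sklearn))
```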

Sample Code:

```
import numpy as np
from bbml.ensemble import BijanClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Step 1: Generate a binary dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           random_state=42)

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: Define the cost matrix and the base model
CM = np.array([[0, 20], [80, 0]])
BM = "DT"

# Step 4: Fit a BijanClassifier on the training set
clf = BijanClassifier(model_type=BM, cost_matrix=CM, max_depth=2, random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate the classifier on the testing set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = clf.accuracy(y_test, y_pred)

# Calculate the cost associated with the error types
cost = clf.cost(y_test, y_pred)

# Print accuracy, cost, and classification report
print(f"Accuracy: {accuracy}")
print(f"Cost: {cost}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
```

Project details


License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

This version

0.0.1

Aug 2, 2024

Download files


Source Distribution

bbml-0.0.1.tar.gz (7.7 kB)

Uploaded Aug 2, 2024 Source

Built Distribution


bbml-0.0.1-py3-none-any.whl (6.9 kB)

Uploaded Aug 2, 2024 Python 3

File details

Details for the file bbml-0.0.1.tar.gz.

File metadata

Download URL: bbml-0.0.1.tar.gz
Upload date: Aug 2, 2024
Size: 7.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for bbml-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`e610ed7a506a099575f45c2c1b98fa631c2533a119880fc51144beab33a4e177`
MD5	`e24486c4a0da54fa990b7bf31f7030b2`
BLAKE2b-256	`af2cf6b5fc47096ba089d8a66f25fcdec5c061c5c33becb32505b7413421f5fb`


File details

Details for the file bbml-0.0.1-py3-none-any.whl.

File metadata

Download URL: bbml-0.0.1-py3-none-any.whl
Upload date: Aug 2, 2024
Size: 6.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for bbml-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0ad976a84603bb2e661cd082bbad1b726b5b83c8cdacba847ce3b49e682cdbc6`
MD5	`efb303246bc094be6897bd5ed26ebeda`
BLAKE2b-256	`ff350131ae32bbc0016bd42959717c231ab28fcb82eff43f917146587411ae61`

