
Fit classification models with a customized cost matrix

Project description

bbml.ensemble / BijanClassifier

Motivation

The bbml.ensemble library is a Python tool for classification problems where accuracy is not the primary concern. Instead of maximizing accuracy, it fits models according to the weights or costs associated with different types of errors. Besides assigning labels according to a specified weight/cost matrix, the library is intended as a tool for decision-making scenarios. One significant application is in industrial settings such as manufacturing systems, where misclassifying a good item as bad, or a bad item as good, incurs a different cost depending on the error type. The contribution of this model is therefore to fit classification models against a predefined weight/cost matrix. In this release, BijanClassifier supports four base classification models: decision tree, random forest, gradient boosting, and logistic regression.

For binary classification problems, standard models aim to maximize the accuracy of the fitted model on the training set, treating Type I and Type II errors (FP and FN, respectively) with equal weight (symmetric cost). BijanClassifier instead assigns an asymmetric cost to each error type.
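
To make the effect of asymmetric costs concrete, here is a minimal sketch of the standard expected-cost decision rule for a binary problem. It is not BijanClassifier's implementation. It assumes a 2x2 cost matrix whose rows index the true class and whose columns index the predicted class, so `cost_matrix[0][1]` is the false-positive cost and `cost_matrix[1][0]` is the false-negative cost; this layout is an assumption, since the description above does not state it. The function names are hypothetical.

```python
def expected_costs(p_positive, cost_matrix):
    """Expected cost of predicting class 0 and class 1 for one sample.

    Assumption: cost_matrix[true_class][predicted_class].
    """
    p0, p1 = 1.0 - p_positive, p_positive
    cost_predict_0 = p0 * cost_matrix[0][0] + p1 * cost_matrix[1][0]
    cost_predict_1 = p0 * cost_matrix[0][1] + p1 * cost_matrix[1][1]
    return cost_predict_0, cost_predict_1


def cost_optimal_threshold(cost_matrix):
    """Probability above which predicting class 1 has the lower expected cost."""
    fp_excess = cost_matrix[0][1] - cost_matrix[0][0]  # extra cost of a false positive
    fn_excess = cost_matrix[1][0] - cost_matrix[1][1]  # extra cost of a false negative
    return fp_excess / (fp_excess + fn_excess)


cost_matrix = [[0, 20], [80, 0]]
threshold = cost_optimal_threshold(cost_matrix)
print(threshold)  # 0.2, instead of the symmetric-cost default of 0.5

# At p = 0.3 a symmetric-cost model would predict class 0,
# but predicting class 1 has the lower expected cost here.
print(expected_costs(0.3, cost_matrix))
```

With a false negative four times as expensive as a false positive, the break-even probability drops from 0.5 to 0.2, so more borderline samples are labeled as class 1.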

Instructions

  1. Install:

    pip install bbml
    
  2. Import:

    from bbml.ensemble import BijanClassifier
    
  3. Call:

    class bbml.ensemble.BijanClassifier(*, model_type='RF', cost_matrix=None, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)


  4. Parameters:

    model_type : {"RF", "DT", "GB", "LR"}, default="RF"

    This parameter specifies the base classification model, including: RF (Random Forest), DT (Decision Tree), GB (Gradient Boosting), and LR (Logistic Regression).

    cost_matrix : array or None, default=None

    Weights/Costs associated with the error types are provided in the format of an array. If set to None, only the basic/standard classification model is fitted without considering the error cost matrix.

    thresholds : int or array-like, default=100

    The number of decision thresholds to use when discretizing the output of the classifier method. Pass an array-like to manually specify the thresholds to use.

    cv : int, float, cross-validation generator, iterable or "prefit", default=None

    Determines the cross-validation splitting strategy to train classifier.

    greater_is_better : bool, default=True

    Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
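
The interplay of cost_matrix, thresholds, and greater_is_better can be sketched as a plain threshold search. This is an illustrative sketch, not the library's code: `total_cost` and `search_threshold` are hypothetical helpers, the cost matrix is assumed to be indexed as `cost_matrix[true_class][predicted_class]`, and cross-validation is left out for brevity. Because a cost is a loss (lower is better), the sketch flips its sign so that the search can always maximize a score.

```python
def total_cost(y_true, y_pred, cost_matrix):
    """Sum cost_matrix[true][pred] over all samples (layout is an assumption)."""
    return sum(cost_matrix[t][p] for t, p in zip(y_true, y_pred))


def search_threshold(y_true, proba_positive, cost_matrix, thresholds=100):
    """Return (best_threshold, best_score) over a grid of decision thresholds.

    thresholds: an int (number of evenly spaced candidates between the
    smallest and largest probability) or an explicit list of candidates.
    """
    if isinstance(thresholds, int):
        lo, hi = min(proba_positive), max(proba_positive)
        steps = max(thresholds - 1, 1)
        candidates = [lo + (hi - lo) * i / steps for i in range(thresholds)]
    else:
        candidates = list(thresholds)

    best_threshold, best_score = None, float("-inf")
    for th in candidates:
        y_pred = [1 if p >= th else 0 for p in proba_positive]
        score = -total_cost(y_true, y_pred, cost_matrix)  # sign flip: cost is a loss
        if score > best_score:  # ties keep the first (lowest) threshold
            best_threshold, best_score = th, score
    return best_threshold, best_score


y_true = [0, 0, 0, 1, 0, 1, 1, 1]
proba = [0.05, 0.15, 0.30, 0.35, 0.55, 0.60, 0.80, 0.95]
cost_matrix = [[0, 20], [80, 0]]

# Explicit candidate list.
print(search_threshold(y_true, proba, cost_matrix, thresholds=[0.1, 0.2, 0.3, 0.4, 0.5]))
# (0.2, -40)

# Integer: 50 evenly spaced candidates.
print(search_threshold(y_true, proba, cost_matrix, thresholds=50))
```

On this toy data the default threshold of 0.5 costs 100 (one false negative and one false positive), while lower thresholds trade the expensive false negative for cheaper false positives.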

    Note :

    Depending on the selected model_type, further parameters can be set. For tree-based methods, the following parameters can be tuned. All other parameters, such as n_estimators (for RF and GB), bootstrap, and n_jobs, can be set and tuned in the same way as for the corresponding models in the scikit-learn library.

    criterion : {"gini", "entropy", "log_loss"}, default="gini"

    The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "log_loss" and "entropy" both for the Shannon information gain, see Mathematical formulation.

    max_depth : int, default=None

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

    min_samples_leaf : int or float, default=1

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    max_features : int, float or {"sqrt", "log2"}, default=None

    The number of features to consider when looking for the best split.

    random_state : int, RandomState instance or None, default=None

    Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.


  5. Methods and Attributes:

    fit(X, y, **params) :

    Fit the classifier.
    X : {array-like, sparse matrix} of shape (n_samples, n_features). Training data.
    y : array-like of shape (n_samples,). Target values.

    predict(X) :

    Predict the target of new samples.
    X : {array-like, sparse matrix} of shape (n_samples, n_features).

    predict_proba(X) :

    Predict class probabilities for X using the fitted estimator.
    X : {array-like, sparse matrix} of shape (n_samples, n_features).

    accuracy(X, y, sample_weight=None) :

    Return the mean accuracy on the given test data and labels.

    cost(y, y_pred) :

    Return the misclassification cost based on the cost matrix. y holds the true labels and y_pred the predicted labels.

    estimator_ :

    estimator instance. The fitted classifier used when predicting.

    best_params_ :

    Parameter setting that gave the best results on the hold-out data for multiclass problems.

    best_threshold_ :

    The new decision threshold for binary classification problems.

    best_score_ :

    The optimal score of the objective metric, evaluated at best_threshold_.

    cv_results_ :

    A dictionary containing the scores and thresholds computed during the cross-validation process. Only exists if store_cv_results=True. The keys are "thresholds" and "scores".

    n_features_in_ :

    Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

    feature_names_in_ :

    Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

    coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)

    Coefficient of the features in the decision function for logistic regression. coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

    intercept_ : ndarray of shape (1,) or (n_classes,)

    Intercept (a.k.a. bias) added to the decision function for logistic regression. If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).


  6. Sample Code:

    import numpy as np
    from bbml.ensemble import BijanClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    
    # Step 1: Generate a binary dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, n_clusters_per_class=1, random_state=42)
    
    # Step 2: Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Step 3: Define the cost matrix and the base model
    CM = np.array([[0, 20], [80, 0]])
    BM = "DT"
    
    # Step 4: Fit a BijanClassifier on the training set
    clf = BijanClassifier(model_type=BM, cost_matrix=CM, max_depth=2, random_state=42)
    clf.fit(X_train, y_train)
    
    # Step 5: Evaluate the classifier on the testing set
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = clf.accuracy(y_test, y_pred)
    
    # Calculate the associated cost with error types
    cost = clf.cost(y_test, y_pred)
    
    # Print accuracy, cost, and classification report
    print(f"Accuracy: {accuracy}")
    print(f"Cost: {cost}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
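
To sanity-check a cost figure by hand, the misclassification cost can be recomputed from the true and predicted labels. The sketch below is independent of bbml and uses a hypothetical helper, `cost_breakdown`. It assumes the cost matrix is indexed as `cost_matrix[true_class][predicted_class]` and that the cost is a plain sum over samples; whether `clf.cost` returns a sum or an average is not stated above, so treat any comparison with it accordingly.

```python
from collections import Counter


def cost_breakdown(y_true, y_pred, cost_matrix):
    """Map each (true, predicted) cell to (sample count, cost contributed).

    Assumption: cost_matrix[true_class][predicted_class].
    """
    counts = Counter(zip(y_true, y_pred))
    return {
        cell: (n, n * cost_matrix[cell[0]][cell[1]])
        for cell, n in sorted(counts.items())
    }


# Small hand-made labels; in the sample above these would be y_test and y_pred.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
cost_matrix = [[0, 20], [80, 0]]

breakdown = cost_breakdown(y_true, y_pred, cost_matrix)
for (true_label, pred_label), (n, cost) in breakdown.items():
    print(f"true={true_label} pred={pred_label}: n={n}, cost={cost}")

total = sum(cost for _, cost in breakdown.values())
print(f"Total cost: {total}")  # Total cost: 100
```

The per-cell view shows which error type drives the total: here one false positive contributes 20 and one false negative contributes 80.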
    
