A Comonotone-Independence Bayesian Classifier

CIBer

This package is an implementation of:
Paper: Comonotone-Independence Bayes classifier (CIBer)
Authors: Yongzhao CHEN, Ka Chun CHEUNG, Nok Sang FAN, Suresh SETHI, and Sheung Chi Phillip YAM

This is the user guide for the Comonotone-Independence Bayesian Classifier (CIBer). CIBer is a supervised learning model for multi-class classification tasks. Continuous feature variables are discretized, and categorical feature variables are encoded via the proposed Joint Encoding.

This document mainly explains the important and practical functions in CIBer.py and CIBer_Engineering.py. In addition, CIBer_Bankchurner.ipynb gives a simple but illuminating example of CIBer using the Bankchurner dataset by Thomas Konstantin. Please refer to original author Kaiser's repository for details.

Remarks

The MDLP discretization method has been disabled. To use it, you need to install the package manually, since it requires additional build tools.

Solution 1:

Step 1: install C/C++ build tools

Windows users

Install Visual Studio Community, and then install Microsoft C++ Build Tools for C/C++ related packages.

macOS users

Type the following line in the terminal to install the Command Line Tools package:

xcode-select --install

Step 2: type the following line in the terminal to install the package

pip install mdlp-discretization

Solution 2:

pip install git+https://github.com/hlin117/mdlp-discretization

Refer to author hlin117's repository for details.
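After either solution, a quick way to confirm that the installation succeeded is to check whether Python can locate the module. The sketch below assumes that mdlp-discretization installs a top-level module named `mdlp`; that name is an assumption and is not stated in this guide.

```python
import importlib.util


def mdlp_available(module_name="mdlp"):
    """Return True if the module can be located, without importing it.

    The default module name "mdlp" is an assumption about how
    mdlp-discretization is packaged.
    """
    return importlib.util.find_spec(module_name) is not None


has_mdlp = mdlp_available()
if has_mdlp:
    print("mdlp-discretization appears to be installed.")
else:
    print("mdlp-discretization was not found; install it as described above.")
```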

Data Requirements

CIBer deals with multi-class classification tasks with numerical or discrete (ordered) input variables. Before passing the data into the model, please preprocess it properly, e.g. remove outliers and missing observations, and encode all categorical feature variables with numerical values.
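The preprocessing described above could look like the following minimal sketch. The records, the column meanings, and the category ordering are all hypothetical and only illustrate dropping missing observations and encoding an ordered categorical variable with numerical values.

```python
import math

# Hypothetical raw records: (age, income, education level).
# None and NaN mark missing values.
raw_rows = [
    (34, 52000.0, "bachelor"),
    (51, None, "master"),
    (27, 31000.0, "high_school"),
    (45, float("nan"), "bachelor"),
    (39, 78000.0, "doctorate"),
]

# Assumed ordering of the categorical levels, from lowest to highest.
EDUCATION_ORDER = ["high_school", "bachelor", "master", "doctorate"]
education_code = {level: code for code, level in enumerate(EDUCATION_ORDER)}


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Drop rows with any missing entry, then replace each category by its ordinal code.
clean_rows = [
    [float(age), float(income), float(education_code[edu])]
    for age, income, edu in raw_rows
    if not any(is_missing(v) for v in (age, income, edu))
]

print(clean_rows)
```

The resulting rows are purely numerical, so they can be converted to a numpy array before being passed to the model.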

CIBer.py

CIBer

To use CIBer:

from CIBer import CIBer

__init__(self, cont_col=[], asso_method='modified', min_asso=0.95, alpha=1, disc_method="norm", joint_encode=True, **kwargs)

cont_col: a list, containing the indices of the continuous variables

asso_method: a string that selects one of four correlation measures: "pearson", "spearman", "kendall", or "modified". The default is "modified"

min_asso: a number in $[0,1]$ which specifies the correlation threshold used when determining the comonotonic relationship. The default value is 0.95

alpha: a positive number used in Laplacian smoothing. The default value is 1

joint_encode: a boolean, whether to use joint encoding. The default value is True

disc_method: a string indicating the discretization method adopted for each continuous feature variable. The default string is "norm" for normal distribution quantile method

**kwargs: additional keyword arguments passed to Discretization(). Two keyword arguments are accepted:

n_bins: a positive integer for the total number of bins for each discretization.

disc_backup: a string indicating the discretization method adopted if disc_method="mdlp" fails.
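To illustrate the role of alpha, the sketch below applies the standard additive (Laplacian) smoothing formula to the bin counts of one discretized feature within one class. It is an assumption that CIBer uses exactly this form; the function name and the data are hypothetical.

```python
from collections import Counter


def smoothed_probabilities(bin_values, n_bins, alpha=1.0):
    """Estimate P(bin | class) with additive smoothing.

    Each bin receives (count + alpha) / (total + alpha * n_bins), so bins
    that were never observed still get a non-zero probability.
    """
    counts = Counter(bin_values)
    total = len(bin_values)
    return [
        (counts.get(b, 0) + alpha) / (total + alpha * n_bins)
        for b in range(n_bins)
    ]


# Hypothetical bin indices of one feature, restricted to one class.
observed_bins = [0, 0, 1, 1, 1, 3]
probs = smoothed_probabilities(observed_bins, n_bins=4, alpha=1.0)
print(probs)  # bin 2 was never observed but keeps a positive probability
```

A larger alpha pulls the estimates toward the uniform distribution over bins, while an alpha close to zero approaches the raw relative frequencies.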

fit(self, x_train, y_train)

x_train: a numpy $n \times p$ array for the $p$ training (real-valued) feature variables with $n$ training observations

y_train: a numpy $n \times 1$ array for the $n$ training (real-valued) labels

predict(self, x_test)

x_test: a numpy $n \times p$ array for the $p$ test (real-valued) feature variables with $n$ test observations

return: a numpy $n \times 1$ array for the $n$ predicted class labels

predict_proba(self, x_test)

x_test: a numpy $n \times p$ array for the $p$ test (real-valued) feature variables with $n$ test observations

return: a numpy $n \times K$ array for the predicted probabilities of the $K$ classes with $n$ test observations
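Putting the constructor, fit, predict and predict_proba together, a workflow might look like the sketch below. The synthetic data is hypothetical. Passing the labels as a flat array, rather than the $n \times 1$ shape documented above, is an assumption about the installed version. The import is guarded so that the script still runs when numpy or CIBer is missing.

```python
# Constructor arguments taken from the documented signature.
params = {
    "cont_col": [0, 1, 2],    # all three synthetic features are continuous
    "asso_method": "modified",
    "min_asso": 0.95,
    "alpha": 1,
    "disc_method": "norm",
    "joint_encode": True,
    "n_bins": 10,             # forwarded to Discretization() via **kwargs
}

try:
    import numpy as np
    from CIBer import CIBer
except ImportError:
    CIBer = None
    print("numpy or CIBer is not installed; skipping the demo.")

if CIBer is not None:
    # Hypothetical data: feature 1 is almost a monotone function of feature 0.
    rng = np.random.default_rng(0)
    n = 200
    x0 = rng.normal(size=n)
    x1 = 2.0 * x0 + rng.normal(scale=0.05, size=n)
    x2 = rng.normal(size=n)
    x = np.column_stack([x0, x1, x2])
    y = (x0 + 0.5 * x2 > 0).astype(int)

    x_train, x_test = x[:150], x[150:]
    y_train = y[:150]  # assumption: a flat label array is accepted

    model = CIBer(**params)
    model.fit(x_train, y_train)
    print(model.predict(x_test)[:5])        # predicted class labels
    print(model.predict_proba(x_test)[:5])  # one column per class
```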

Retrieve comonotonic cluster results

self.cluster_book: a Python dictionary where

  1. keys: class label
  2. vals: lists of clusters, each of which contains the indices of feature variables within the same cluster, generated by the AGNES algorithm. If a given list contains only one integer, the corresponding feature variable is treated as independent of all other feature variables given the class label. Otherwise, the feature variables in the list are modelled by conditional comonotonicity given the class label.

self.distance_matrix_: a numpy $p \times p$ array, where the $(i,j)$ entry is the association value between feature $i$ and feature $j$, computed according to the chosen asso_method.
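The dictionary layout described above can be read as in the following sketch. The cluster_book shown here is a hypothetical example with the documented structure, not output from a fitted model, and summarize_clusters is an illustrative helper rather than part of the package.

```python
# Hypothetical cluster_book with the documented layout:
# class label -> list of clusters, each cluster a list of feature indices.
cluster_book = {
    0: [[0, 2], [1], [3]],
    1: [[0], [1, 2, 3]],
}


def summarize_clusters(cluster_book):
    """Split each class's clusters into comonotonic groups and independent features."""
    summary = {}
    for label, clusters in cluster_book.items():
        summary[label] = {
            # Clusters with several features are modelled as comonotonic.
            "comonotonic": [c for c in clusters if len(c) > 1],
            # Singleton clusters are treated as independent given the class.
            "independent": [c[0] for c in clusters if len(c) == 1],
        }
    return summary


summary = summarize_clusters(cluster_book)
for label, groups in summary.items():
    print(f"class {label}: {groups}")
```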

CIBer_Engineering.py

Discretization(cont_col, disc_method, disc_backup="pkid", n_bins=10)

cont_col: a list of indices to be discretized

disc_method: any string in DISC_BASE + SCIPY_DIST, (refer to CIBer.py)

list of distributions provided by scipy for the equal-quantile distribution method; the number of bins is determined by n_bins

SCIPY_DIST = ["uniform", "norm", "t", "chi2", "expon", "laplace", "skewnorm", "gamma"]

list of common discretization methods for the naive Bayes classifier

SIZE_BASE = ["equal_size", "pkid", "ndd", "wpkid"]

list of all discretization methods except SCIPY_DIST

DISC_BASE = ["equal_length", "auto"] + SIZE_BASE

list of fallback discretization methods used if mdlp fails (SCIPY_DIST is excluded)

MDLP_BACKUP = ["equal_length", "auto"] + SIZE_BASE

return: a class for the chosen discretization method
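As an illustration of the equal-quantile idea behind the SCIPY_DIST methods, the sketch below discretizes one feature with disc_method="norm" in mind. It assumes that the method fits a normal distribution to the training values and cuts at its equal-probability quantiles. It uses only the standard library instead of scipy, and the function names are hypothetical rather than taken from CIBer_Engineering.py.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def normal_quantile_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points at equal-probability quantiles."""
    dist = NormalDist(mean(values), stdev(values))
    return [dist.inv_cdf(k / n_bins) for k in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the index of the bin it falls into (0 .. len(edges))."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical training values of one continuous feature.
train = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.3]
edges = normal_quantile_edges(train, n_bins=4)
bins = discretize(train, edges)

print(edges)
print(bins)
```

Because the cut points are learned from the training values, the same edges would be reused to discretize test observations.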

Joint_Encoding

__init__(self, df, col_index)

df: an $n \times p$ dataframe for $p$ feature variables of $n$ observations

col_index: a list, containing the indices of categorical feature variables

fit(self, x_train)

x_train: a numpy $n \times p$ array for the $p$ training feature variables with $n$ training observations

transform(self, x_test)

x_test: a numpy $n \times p$ array for the $p$ test (real-valued) feature variables with $n$ test observations

return: a numpy $n \times p$ array with the encoded categorical feature variables
