Data diagnostic package with minimum setup to analyze miss predictions of ML models

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Rarity

Rarity is a diagnostic library for tabular data with minimal setup to enable deep dive into datasets identifying features that could have potentially influenced the model prediction performance. It is meant to be used at post model training phase to ease the understanding on miss predictions and carry out systematic analysis to identify the gap between actual values versus prediction values. The auto-generated gap analysis is presented as a dash web application with flexible parameters at several feature components.

The inputs needed to auto-generate gap analysis report with Rarity are solely depending on features, yTrue and yPred values. Rarity is therefore a model anogstic package and can be used to inspect miss predictions on tabular data generated by any model framework.

Supported Analysis Type

Rarity currently supports tasks related to

Regression
Binary Classifciation
Multiclass Classification

It can also be used to conduct bimodal analysis. As it is used to inspect miss predictions in details down to the granularity at each data index level, multiple modal analysis won't be ideal for repetition at individual data index for each model. Therefore, the package supports upto 2 model miss prediction gap analysis for side-by-side comparison benefiting more during the post model training and final phase of model fine-tuning stage.

Core Feature Components

There are five core feature components covered in the auto-generated gap analysis report by Rarity:

General Metrics : covers general metrics used to evaluate model performance.
Miss Predictions : presents miss predictions scatter plot by index number
Loss Clusters : covers clustering info on offset values (regression) and logloss values (classification)
xFeature Distribution : distribution plots ranked by kl-divergence score
Similarities : tabulated info listing top-n data points based on similarities in features with reference to data index specified by user

Counter-Factuals is also included under Similarities component tab for classification task to better compare data points with most similar features but show different prediction outcomes. For futher details on how the feature components are displayed in the web application, please checkout more examples under section Feature Introduction in the package documentation.

Installation

Rarity has been tested on Python 3.6+. It is advisable to create a Virtual Environment to use along with Rarity.

For details guide on how to create virtual environment, you can refer to this user guide. After creation of the virtual environment, activate it and proceed with one of the following steps.

From PyPI

pip install rarity

From Source

git clone https://github.com/aimakerspace/rarity.git
cd rarity
pip install -e .

Quick Start

Rarity is developed with minimum setup in mind in order for users to focus more on actual modelling works and decide appropriate fine-tuning strategy after quick overview on the state of miss-prediction at post model training phase. Users can compile the required inputs and prediction outcomes via one of the following mentioned methods and spin up the dash web application to see the final gap analysis report produced by Rarity.

Simplest mode with mininum code

After cloning Rarity repository, place xFeatures, yTrue and yPred files into configs/csv_data folder as illustrated in example below


    rarity
    ├── configs
    │   ├── csv_data
    │   │   ├── <example_user_xFeatures>.csv
    │   │   ├── <example_user_yTrue>.csv
    │   │   ├── <example_user_yPred_model_x>.csv
    │   │   └── <example_user_yPred_model_y>.csv
    │   │
    │   └── configs.py
    ...
    │ 
    ├── auto_gap_analysis.py

Then open configs.py file and update the first section to define the required meta data.

# *******************************************************************************************
# To be updated by user accordingly if to run this script using saved csv files
# Replace 'example_xxx.csv' with user's file name
xFeature_filename = 'example_user_xFeatures.csv'
yTrue_filename = 'example_user_yTrue.csv'
# yPred_file can be single file or max 2 files, wrap in a list
yPred_filename = ['example_user_yPred_model_x.csv', 'example_user_yPred_model_y.csv']

# model name must be listed according to the same sequence of the yPred_filename list above
# can be single model or max bi-modal, wrap in a list
MODEL_NAME_LIST = ['example_model_x', 'example_model_y']

# Supported analysis type: 'Regression', 'Binary Classification', 'Multiclass Classification'
ANALYSIS_TYPE = 'Regression'
ANALYSIS_TITLE = 'example_Customer Churn Prediction'

# Defaults to 8000, user can re-define to a new port number of choice
PORT = 8000
# *******************************************************************************************

After uploading files to configs/csv_data folder and updating configs/configs.py file, open terminal and make sure you are in the Rarity project root folder. A file named auto_gap_analysis.py is already in the root folder upon installation. Then run the following line of code in the terminal

python auto_gap_analysis.py

A window will be open in the web browser and you will see the gap analysis report is generated for you by Rarity. Below is an example generated for Bimodal analysis on Regression task.

Rarity

Using CSVDataLoader

After installation, open terminal and run the following codes :

from rarity import GapAnalyzer
from rarity.data_loader import CSVDataLoader

# define the file paths
xFeatures_file = 'example_xFeatures.csv'
yTrue_file = 'example_yTrue.csv'
yPred_file_list = ['example_yPreds_model_xx.csv', 'example_yPreds_rf.csv']
model_names_list = ['model_xx', 'model_yy']

# specify which port to use, if not provided, default port is set to 8000
preferred_port = 8866

# collate all files using dataloader to transform them into the input format
# that can be processed by various internal function calls
# example : '<analysis_type>' => 'Regression'
# example : '<analysis_title>' => 'Customer Churn Prediction'
data_loader = CSVDataLoader(xFeatures_file, yTrue_file, yPred_file_list,
                            model_names_list, '<analysis_type>')
analyzer = GapAnalyzer(data_loader, '<analysis_title>', preferred_port)
analyzer.run()

with additional adjustments as follows :

! replacement of example files to your own file names
! define `model_names_list`, `analysis_type` and `analysis_title` accordingly

Using DataframeLoader

To use DataframeLoader, it is assumed that you already have some inital dataframes tap-out in earlier runs in the terminal and would like to continue analysing the miss-predictions after model training. The DataframeLoader api call is meant for inline analysis if you prefer not to collate base info using csv files. You may collate all the xFeatures, yTrue and yPreds dataframes into the right input format using DataframeLoader as demonstrated below :

from rarity import GapAnalyzer
from rarity.data_loader import DataframeLoader

# define the file paths
xFeatures_df = <xfeatures_stored_in_pd.DataFrame>
yTrue_df = <yTrue_stored_in_pd.DataFrame/Series>
yPred_df_model_xx = <yPred_generated_by_model_xx_stored_in_pd.DataFrame>
yPred_df_model_yy = <yPred_generated_by_model_yy_stored_in_pd.DataFrame>
yPred_list = [yPred_df_model_xx, yPred_df_model_yy]
model_names_list = ['model_xx', 'model_yy']

# specify which port to use, if not provided, default port is set to 8000
preferred_port = 8866

# collate all files using dataloader to transform them into the input format
# that can be processed by various internal function calls
# example : '<analysis_type>' => 'Regression'
# example : '<analysis_title>' => 'Customer Churn Prediction'
data_loader = DataframeLoader(xFeatures_df, yTrue_df, yPred_list,
                              model_names_list, '<analysis_type>')
analyzer = GapAnalyzer(data_loader, '<analysis_title>', preferred_port)
analyzer.run()

Supports

Package documentation : Full version of the package documentation can be found HERE
Community : For interest on AI Singapore program, kindly head to our AISG community channels HERE

Acknowledgement

The developer team would like to express our gratitude to the program office for the support given to the development and release of this data dianogstic library ^#. Special thanks to AI Singapore making the development works of the package feasible.

Program Office

This project is supported by the National Research Foundation (NRF), Singapore under its AI Singapore Programme (AISG-RP-2019-050).

Developers

Yap Siew Lin (Core developer), Jeanne Choo (ex-AISG), Chong Wei Yih (ex-AISG)

^{# Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.}

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

1.0.2

Oct 3, 2021

1.0.2.dev1 pre-release

Oct 3, 2021

1.0.1

Oct 2, 2021

1.0.0

Oct 2, 2021

1.0.0.dev9 pre-release

Oct 2, 2021

0.0.1

Sep 15, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rarity-1.0.2.tar.gz (296.7 kB view hashes)

Uploaded Oct 3, 2021 Source

Built Distribution

rarity-1.0.2-py3-none-any.whl (358.4 kB view hashes)

Uploaded Oct 3, 2021 Python 3

Hashes for rarity-1.0.2.tar.gz

Hashes for rarity-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`8f9e2ef6e67949eed51f6d8f70f8d9b9dc25b6125d98621197a14a70bd8df3d5`
MD5	`e39756b918984513c990f0add8847a14`
BLAKE2b-256	`981ff929d946f86cd963bcaf90df7ea14b4d43653903eff7689f473c32d0bc70`

Hashes for rarity-1.0.2-py3-none-any.whl

Hashes for rarity-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2de820efe6bf631b82fc6947f0c581f59e61674cefbf2edcf70ae0b0e5034a03`
MD5	`3c712e6086cfb49506e30d3ede5c8375`
BLAKE2b-256	`6e42249f1ae166512da217df9b3d5ae5e244622fbf515ad08b9a01b7a0b8ea4d`