Dashboard to explore the data and to create a baseline Machine Learning model.

Project description

data-dashboard

Short description

Creates a simple static HTML dashboard from the provided X, y data to help users see what's going on in their data, make decisions regarding features and find the best "baseline" model to predict y.

Instructions

You can install the package via pip:

pip install data-dashboard

To make it work, you need to have the data loaded in the following format:

  • X: data on which predictions will happen
  • y: target feature
  • descriptions (optional): a dict-like collection, described in more detail below; can be left as None
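
For example, a minimal sketch of preparing such data with pandas (the CSV path, the column names and the exact shape of descriptions below are assumptions for illustration only):

import pandas as pd

# hypothetical CSV with a "target" column - replace with your own data source
df = pd.read_csv("my_data.csv")

X = df.drop(columns=["target"])  # features as a pandas DataFrame
y = df["target"]  # target as a pandas Series

# assumed format: a plain dict mapping feature names to human-readable descriptions
descriptions = {
    "feature_1": "description of the first feature",
    "feature_2": "description of the second feature",
}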

If you don't have any data handy, you can use predefined datasets in examples.py:

from data_dashboard.examples.examples import iris

# All examples that you can use are:
# iris - multiclass
# boston - regression
# diabetes - regression
# digits - multiclass
# wine - multiclass
# breast_cancer - classification

X, y, descriptions = iris()

With the data loaded into memory, you can proceed with the Dashboard:

# importing Dashboard
from data_dashboard.dashboard import Dashboard
import os

# define an output directory where the HTML files will be created
output_path = os.path.join(os.getcwd(), "output")

# create an instance of a Dashboard
dsh = Dashboard(output_path, X, y, descriptions)

# create HTML Dashboard with default arguments
dsh.create_dashboard()

The HTML Dashboard will be created in the defined output directory. Dashboards can be created for classification, regression and multiclass problems.

Dashboard is currently not able to deal with multi-label problems.

Dashboard object can be customized depending on the data that you have:

Dashboard(

    output_directory,  # directory where HTML dashboard will be placed

    X,  # data without target, preferably pandas DataFrame
    y,  # target data, preferably pandas Series
    feature_descriptions_dict=None,  # optional dict with descriptions of features

    random_state=None,  # integer representing random state for repeatable results

    classification_pos_label=None,  # one of the labels in target that will
    # be forced as a positive label
    force_classification_pos_label_multiclass=None,  # forcing label in target
    # in a multiclass problem, making it a binary classification problem

    already_transformed_columns=None  # list of columns that are already transformed

)
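
For instance, a minimal sketch of a customized Dashboard for a binary classification target, using the argument names from the signature above (the random_state value and the positive label are illustrative only):

dsh = Dashboard(
    output_directory=output_path,
    X=X,
    y=y,
    feature_descriptions_dict=descriptions,
    random_state=42,  # illustrative: fixed random state for repeatable results
    classification_pos_label=1,  # illustrative: force label 1 to be the positive label
)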

create_dashboard() first searches for the 'best' model across a predefined set of default models and creates the HTML dashboard only after finding it. The function can take a scoring argument (which should be a metric function from sklearn) that will be used to evaluate models. If scoring is None, then the default metric (for a particular problem) is used.

Depending on the provided arguments, the search can happen in 4 different ways:

  • if models are provided as a sequence of instantiated Models, then each Model is fitted on the train part of the data and its score is calculated on the test part of the data.

  • if models are provided as a dictionary of Model: param_grid pairs (see the sketch after the signature below), then:

    • if mode == quick then each Model is instantiated with default parameters (similar to the LazyPredict package), scores are evaluated and only a few of the best-scoring Models are then GridSearched (with HalvingGridSearch)

    • if mode == detailed then all Models are GridSearched with the provided param_grids.

  • if models is None, then the default models (for a particular problem) are used, again depending on the provided mode: either instantiated with default params and only some of them GridSearched in quick mode, or all of them GridSearched in detailed mode.

At the end, the best model (depending on the scoring) is chosen.

dsh.create_dashboard(

    models=None,  # can be sequence of instantiated Models, dict of 
    # Model: param_grid pairs or None

    scoring=None,  # should be a sklearn metric function
    mode='quick',  # either 'quick' or 'detailed'
    logging=True,  # turning logging (search results) on/off

    disable_pairplots=False,  # turning pairplots on/off as this is 
    # a potential bottleneck of the application
    force_pairplot=False  # forcing pairplots when Dashboard decided
    # to turn them off (when there are too many features in X).
)
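
As an illustration, a hedged sketch of a detailed search over a custom models dictionary (the estimators, parameter grids and metric below are only examples, not the package defaults, and the sketch assumes the dictionary keys are model classes):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# example Model: param_grid pairs - assumed to be model classes mapped to grids
models = {
    RandomForestClassifier: {"n_estimators": [10, 100], "max_depth": [3, None]},
    LogisticRegression: {"C": [0.1, 1.0, 10.0]},
}

# GridSearch every provided Model with its param_grid and pick the best by f1_score
dsh.create_dashboard(models=models, scoring=f1_score, mode="detailed")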

Known Issues/Drawbacks

  • Multi-label classification is not included

  • Pairplots are turned off when the number of features crosses a threshold of 15, to prevent MemoryErrors and to save time on visualizations that degrade in usefulness as the number of features increases

  • The Features HTML page might be laggy depending on the number of features

  • CSS might be wonky on some resolutions

