majora

Majora is a Python library that automates common tasks in your exploratory data analysis.

Introduction

Throughout my data science journey, I have learned that it is good practice to understand the data first and gather as many insights from it as possible. Exploratory Data Analysis (EDA) is all about making sense of the data at hand before getting dirty with machine learning and sophisticated algorithms.

While there are plenty of Python libraries that can help create beautiful and complex visualizations, I often find myself starting with the simplest analyses: count plot, histogram, scatter plot, boxplot, etc. This initial EDA workflow is very similar for each new data set. Unfortunately, it is also tedious: converting to the correct data types, selecting the right variable type for the right plot, iterating through all possible variable combinations, adjusting plot aesthetics and labels, etc. These are the tasks I would love to do... once. As someone who does not find great joy in completing repetitive tasks, I set out to build a tool that allows me to be as lazy as possible.

Description

Majora is a Python library that automates common tasks in your exploratory data analysis. This includes missing values visualization, missing values handling, variable types handling, predictive modeling, and a variety of univariate and bivariate graphs. The goal is to provide a fast and effective tool for discovering insights, so you can quickly move on to the machine learning model.

Features

Smart data type conversion
Automatic graph discovery
Simple missing values identification and handling
CART model with cross-validation and tree visualization

Table of contents
Installation
Dataset Overview
Missing Values
- Identify Missing Values
- Handle Missing Values
Variable Types Handling
- Identify data types
- Type conversions
Visualization
- Univariate plots
- Bivariate plots
Decision Tree Visualizer

Installation

Warning: Majora is only compatible with Python 3.

Warning: The decision tree visualizer requires graphviz.

Install Via GitHub

> pip install git+https://github.com/GrandPurpleOcelot/Auto-EDA

Usage

import numpy as np
import pandas as pd
from majora import *

Initiate a class instance with the input dataframe:

heart = pd.read_csv('datasets/heart.csv')
heart['target'] = np.where(heart['target'] == 1, 'has disease', 'no disease')

report = majora(heart, target_variable='target')

The available parameters are:

df: the input pandas dataframe.
target_variable: the target variable that Majora will focus on.

Dataset Overview

report.get_samples()

get_samples() returns a dataframe that concatenates the head, a random sample, and the tail of the dataset.
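For readers who want the same kind of preview without the library, here is a minimal pandas sketch. It is not Majora's implementation: the function name `sample_rows` and the parameters `n` and `seed` are hypothetical and chosen only for illustration.

```python
import pandas as pd

def sample_rows(df: pd.DataFrame, n: int = 5, seed: int = 0) -> pd.DataFrame:
    """Concatenate the first n rows, n random rows, and the last n rows."""
    # Cap the random sample size so that small frames do not raise an error.
    middle = df.sample(n=min(n, len(df)), random_state=seed)
    return pd.concat([df.head(n), middle, df.tail(n)])

# Toy data, used only to demonstrate the call.
demo = pd.DataFrame({"x": range(100)})
preview = sample_rows(demo, n=3)
```

The random rows may overlap with the head or tail; a stricter version could sample only from the rows in between.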

>>> report.get_overview()

Number of Variables: 303
Number of Observations: 14
Memory Usage: 0.052106 Mb

get_overview() returns the number of variables, the number of observations, and the memory usage.
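The same three figures can be read directly from a pandas dataframe. The sketch below is an assumption about how such an overview could be computed, not the library's code; the helper name `overview` is hypothetical, and it treats columns as variables and rows as observations.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> dict:
    """Return variable count, observation count, and memory usage in MiB."""
    n_rows, n_cols = df.shape
    # deep=True also counts the memory held by Python strings in object columns.
    memory_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    return {"variables": n_cols, "observations": n_rows, "memory_mb": memory_mb}

# Toy data, used only to demonstrate the call.
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": list("wxyz")})
info = overview(demo)
```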

Missing Values

Identify Missing Values

report.get_missings(missing_tag= '?')

The available parameters are:

missing_tag: missing values are sometimes denoted with a number or string (e.g. '?'). Pass that tag here to have those entries replaced with NAs.
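To make the idea of a missing tag concrete, the following pandas sketch replaces a placeholder with NaN and then counts missing entries per column. The helper name `missing_summary` is hypothetical and the code only approximates the behaviour described above.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame, missing_tag=None) -> pd.DataFrame:
    """Count missing entries per column after mapping missing_tag to NaN."""
    if missing_tag is not None:
        df = df.replace(missing_tag, np.nan)
    counts = df.isna().sum()
    return pd.DataFrame({"n_missing": counts, "fraction": counts / len(df)})

# Toy data: column "a" uses '?' as its placeholder for missing values.
demo = pd.DataFrame({
    "a": ["1.5", "?", "3.0", "?"],
    "b": ["x", "y", None, "z"],
})
summary = missing_summary(demo, missing_tag="?")
```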

Handle Missing Values

>>> report.handle_missings(strategy = 'deletion', drop_threshold = 0.5)

Dropped columns: ['NMHC(GT)']
Number of dropped rows: 2416 --> 25.8% of rows removed

The available parameters are:

strategy: select a strategy to handle missing data. Options: 'deletion', 'encode', 'mean_mode'

'deletion': drop variables whose fraction of missing values exceeds 70% (or another threshold set with the 'drop_threshold' argument), then remove observations that contain at least one missing value.

'encode' (encoding imputation): for numerical variables, encode missing entries as -999. For categorical variables, encode missing entries as the string "unknown".

'mean_mode' (mean/mode imputation): for numerical variables, impute missing entries with the mean. For categorical variables, impute missing entries with the mode.

drop_threshold: if the 'deletion' strategy is selected, any column whose fraction of missing values exceeds drop_threshold is dropped. drop_threshold = 1 keeps all columns. Default: drop_threshold = 0.7.
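The three strategies can be sketched in plain pandas as follows. This is a hypothetical re-implementation written from the parameter descriptions, not the library's source; the function name `handle_missing` is an assumption, and any column that is not numeric is treated as categorical.

```python
import pandas as pd

def handle_missing(df, strategy="deletion", drop_threshold=0.7):
    """Apply one of the strategies 'deletion', 'encode', or 'mean_mode'."""
    if strategy not in {"deletion", "encode", "mean_mode"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()
    if strategy == "deletion":
        # Keep columns whose missing fraction does not exceed the threshold,
        # then drop every row that still contains a missing value.
        keep = out.isna().mean() <= drop_threshold
        return out.loc[:, keep].dropna()
    for col in out.columns:
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if strategy == "encode":
            fill = -999 if numeric else "unknown"
        elif numeric:
            fill = out[col].mean()
        else:
            modes = out[col].mode()
            fill = modes.iloc[0] if not modes.empty else None
        if fill is not None:
            out[col] = out[col].fillna(fill)
    return out

# Toy data, used only to demonstrate the three strategies.
demo = pd.DataFrame({
    "num": [1.0, None, 3.0, 5.0],
    "cat": ["a", "a", None, "b"],
    "mostly_empty": [None, None, None, 1.0],
})
deleted = handle_missing(demo, "deletion")
encoded = handle_missing(demo, "encode")
imputed = handle_missing(demo, "mean_mode")
```

With the default threshold, `mostly_empty` is dropped because 75% of its entries are missing, and only the two fully observed rows survive the deletion strategy.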

Variable Types

Identify Data Types

report.check_data_type()

Type conversion suggestions:

String datetime -> datetime
Small integer (for example: boolean) -> categorical type
String float -> float
Maximum cardinality (number of unique values == number of observations) -> remove
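The rules above could be detected along the following lines. This is only a sketch under stated assumptions: the function name `suggest_conversions`, the `max_categories` cutoff for "small integer", and the fixed `date_format` are all hypothetical, and the cardinality rule is applied to string columns only.

```python
import pandas as pd
from pandas.api import types as ptypes

def suggest_conversions(df, max_categories=10, date_format="%Y-%m-%d"):
    """Map column names to a suggested action based on simple heuristics."""
    suggestions = {}
    for col in df.columns:
        values = df[col].dropna()
        if values.empty:
            continue
        if ptypes.is_integer_dtype(values):
            # Integers with few distinct values behave like categories.
            if values.nunique() <= max_categories:
                suggestions[col] = "categorical"
        elif ptypes.is_object_dtype(values) or ptypes.is_string_dtype(values):
            if pd.to_numeric(values, errors="coerce").notna().all():
                suggestions[col] = "float"
            elif pd.to_datetime(values, format=date_format,
                                errors="coerce").notna().all():
                suggestions[col] = "datetime"
            elif values.nunique() == len(df):
                # Every row has its own value, so the column acts like an ID.
                suggestions[col] = "remove"
    return suggestions

# Toy data, used only to demonstrate each rule.
demo = pd.DataFrame({
    "flag": [0, 1, 1, 0],
    "price": ["1.5", "2.0", "3.25", "4"],
    "when": ["2020-07-18", "2020-07-19", "2020-07-20", "2020-07-21"],
    "row_id": ["a1", "b2", "c3", "d4"],
    "city": ["x", "x", "y", "y"],
})
suggestions = suggest_conversions(demo)
```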

Handle Suggested Type Conversions:

>>> report.change_data_type()

Column Datetime converts to datetime

Visualization

Univariate Plots

Histogram

Exploratory type: numerical data

report.histogram()

The available parameters are:

kde: boolean (default = False).

Count Plots

Exploratory type: categorical data

report.count_plots()

Word Cloud

Exploratory type: text data

Development in progress...

Bivariate Plots

Correlation Plots

Exploratory type: numerical and numerical data

report.correlation()
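The numbers behind such a plot can be obtained in pandas by restricting the dataframe to numeric columns and computing pairwise Pearson correlations. This is a sketch of the underlying computation only, not the library's plotting code; the toy column names are made up.

```python
import pandas as pd

# Toy data: "label" is non-numeric and must be excluded from the matrix.
demo = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "a", "b"],
})

# Keep numeric columns, then compute the pairwise Pearson correlation matrix.
corr = demo.select_dtypes(include="number").corr()
```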

Principal Component Analysis

Exploratory type: dimensionality reduction

report.pca()

Box Plots

Exploratory type: numerical and categorical data

report.boxplots()

Relative Frequency Plots

Exploratory type: categorical and categorical data

report.cat_plots()
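A relative frequency table for two categorical variables can be built with `pd.crosstab`. The sketch below shows the table that such a plot would typically be drawn from; it is an illustration with made-up column names, not the library's implementation.

```python
import pandas as pd

# Toy data with two categorical columns.
demo = pd.DataFrame({
    "sex": ["m", "m", "f", "f"],
    "target": ["yes", "no", "yes", "yes"],
})

# normalize="index" turns counts into row-wise shares, so each row sums to 1.
rel_freq = pd.crosstab(demo["sex"], demo["target"], normalize="index")
```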

Correspondence Analysis

Exploratory type: categorical and categorical data

```python
report.correspondence_analysis()
```

Trend Plot

Exploratory type: timeseries data

report.timeseries_plots(grouper = 'M')

The available parameters are:

grouper: aggregates the time series over a time interval using the mean (default = 'W' for one week). This reduces the number of datetime points that have to be plotted.
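The effect of the grouper can be reproduced with pandas resampling. The example below assumes the aggregation is equivalent to `DataFrame.resample(...).mean()`, which may differ from what the library does internally; the data is synthetic.

```python
import pandas as pd

# Four weeks of daily observations, starting on a Monday.
index = pd.date_range("2020-01-06", periods=28, freq="D")
daily = pd.DataFrame({"value": range(28)}, index=index)

# 'W' groups the rows into calendar weeks ending on Sunday and averages them,
# so 28 daily points collapse into 4 weekly points.
weekly = daily.resample("W").mean()
```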

Statistical Modeling

Users can specify a target variable for a classification or regression task using a Classification And Regression Tree (CART).

Classification Report (train on 75% of data, test on 25%)

report.tree_model(max_depth = 4)

Classification Report on 25% of Testing Data:
              precision    recall  f1-score   support

 has disease       0.85      0.85      0.85        41
  no disease       0.83      0.83      0.83        35

    accuracy                           0.84        76
   macro avg       0.84      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76

Bar chart of relative feature importance

Decision tree visualization with Dtreeviz
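The workflow in this section (fit a depth-limited tree on 75% of the data, report on the remaining 25%, inspect feature importances) can be approximated with scikit-learn. Whether Majora uses these exact calls is an assumption, and the data below is synthetic rather than the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A CART classifier limited to depth 4.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Text report with precision, recall, and f1-score per class.
text_report = classification_report(y_test, tree.predict(X_test))

# Relative importance of each feature, suitable for a bar chart.
importances = tree.feature_importances_
```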

Release history

0.0.3 (this version): Jul 18, 2020
0.0.2: Jul 18, 2020
0.0.1: Jul 18, 2020

Download files

Source Distribution

majora-0.0.3.tar.gz (15.6 kB), uploaded Jul 18, 2020, Source

Built Distribution

majora-0.0.3-py3-none-any.whl (12.9 kB), uploaded Jul 18, 2020, Python 3

Hashes for majora-0.0.3.tar.gz

Algorithm	Hash digest
SHA256	`b6368f712eba9305c0e716f9ef3d0155bf1e0e77ab640ff6445538996dc5139e`
MD5	`27ad239f606d76a92ea2cdadf6850403`
BLAKE2b-256	`adda153baccd0ee0ec3e66adf2175e7646a944638dd65122099e10e3570aea7c`

Hashes for majora-0.0.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`1c56f55b824b95e453be4b76b9b9c21ac3848fed12e73248f30459abd5e4035c`
MD5	`77253fbb0c784c0ec64b65dc6ab10b22`
BLAKE2b-256	`ee13193642b9267a03ff7c0f20a5b0069e22806aed6b97f857b917ad8d91e6eb`
