Exploratory data analysis tools

Installation
Why Edamame?
- Exploratory data analysis functions
- TODO

Installation

To install the package, run:

pip install edamame

The edamame package works correctly inside a Jupyter notebook (.ipynb file), where it can be imported with:

import edamame as eda

Why Edamame?

Edamame is inspired by the pandas-profiling and pycaret packages. Its goal is to provide friendly and helpful functions for the exploratory data analysis (EDA) step on a dataset, and then for training and analysing a battery of models for regression or classification problems.

Exploratory data analysis functions

You can find an example of the EDA that uses the edamame package in the edamame-notebooks repository.

Dimensions

A prettier version of the .shape attribute.

eda.dimensions(data)

Parameters:

data: A pandas dataframe.

The function displays the number of rows and columns of the pandas dataframe passed.

Describe distribution

eda.describe_distribution(data)

Parameters:

data: A pandas dataframe.

Given a dataframe, the function displays the result of the .describe() method applied to it, split into quantitative/numerical and categorical/object columns.
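To make the split concrete, here is a minimal sketch in plain pandas of the two summaries described above. It is not edamame's implementation; the toy dataframe and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({
    "age": [23, 35, 41, 35],
    "income": [1200.0, 2500.0, 3100.0, 1800.0],
    "city": pd.Series(["Rome", "Milan", "Rome", "Turin"], dtype="object"),
})

# Summary of the numerical columns: count, mean, std, min, quartiles, max.
numerical_summary = data.describe(include=["number"])

# Summary of the "object" columns: count, unique, top, freq.
categorical_summary = data.describe(include=["object"])

print(numerical_summary)
print(categorical_summary)
```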

Identify columns types

eda.identify_types(data)

Parameters:

data: A pandas dataframe.

Given a dataframe, the function displays the result of the .dtypes attribute and returns two lists: one with the names of the quantitative/numerical columns and one with the names of the columns identified as "object" by pandas.
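A rough pandas equivalent of the two returned lists might look like the following. This is only a sketch; the dataframe and the names numerical_cols and object_cols are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
data = pd.DataFrame({
    "age": [23, 35, 41],
    "income": [1200.0, 2500.0, 3100.0],
    "city": pd.Series(["Rome", "Milan", "Rome"], dtype="object"),
})

# Column types as pandas sees them.
print(data.dtypes)

# Names of the numerical columns and of the "object" columns.
numerical_cols = data.select_dtypes(include=["number"]).columns.tolist()
object_cols = data.select_dtypes(include=["object"]).columns.tolist()
```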

Convert numerical columns to categorical

eda.num_to_categorical(data, col: list[str])

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns to convert.

Given a dataframe and a list of column names, the function returns a dataframe with those columns converted to the "object" type.

Missing data

eda.missing(data)

Parameters:

data: A pandas dataframe.

The function displays the following elements:

A table with the percentage of NA records for every column.
A table with the percentage of records equal to 0 for every column.
A table with the percentage of duplicate rows.
A list of lists containing the names of the numerical columns with NA, the names of the categorical columns with NA and the names of the columns with 0 as a record.
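The quantities listed above can be approximated in plain pandas as follows. This is a sketch under the assumption that percentages are computed over all rows; the data and variable names are made up for illustration and do not come from edamame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row duplicates the first one.
data = pd.DataFrame({
    "age": [23.0, np.nan, 41.0, 23.0],
    "visits": [0, 2, 0, 0],
    "city": pd.Series(["Rome", None, "Rome", "Rome"], dtype="object"),
})

# Percentage of NA records per column.
na_pct = data.isna().mean() * 100

# Percentage of records equal to 0 per column.
zero_pct = (data == 0).mean() * 100

# Percentage of rows that repeat an earlier row.
dup_pct = data.duplicated().mean() * 100

# Column names grouped as in the list of lists described above.
num_cols = data.select_dtypes(include=["number"]).columns
obj_cols = data.select_dtypes(include=["object"]).columns
num_na_cols = [c for c in num_cols if data[c].isna().any()]
cat_na_cols = [c for c in obj_cols if data[c].isna().any()]
zero_cols = zero_pct[zero_pct > 0].index.tolist()
```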

Handling Missing values

eda.handling_missing(data, col: list[str], missing_val = np.nan, method: list[str] = [])

Parameters:

data: A pandas dataframe.
col: A list of the names of the dataframe columns to handle.
missing_val: The value that represents NA in the columns passed. By default it is equal to np.nan.
method: A list of the names of the methods (mean, median, most_frequent, drop) applied to the columns passed. By default, if nothing is indicated, the function applies the most_frequent method to all the columns passed. If fewer methods than columns are indicated, the remaining columns are handled with the most_frequent method.

The function returns a pandas dataframe in which the selected columns have been modified to handle the missing values.
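The autocompletion rule for the method list is easier to see in code. The helper below, handle_missing_sketch, is a hypothetical stand-in written in plain pandas. It follows the description above but is not edamame's implementation.

```python
import numpy as np
import pandas as pd

def handle_missing_sketch(data, col, missing_val=np.nan, method=None):
    """Illustrative stand-in for the behaviour described above."""
    methods = list(method or [])
    # Pad with "most_frequent" when fewer methods than columns are given.
    methods += ["most_frequent"] * (len(col) - len(methods))
    out = data.copy()
    # Turn a custom missing marker into NaN so that one code path handles it.
    if not pd.isna(missing_val):
        out[col] = out[col].mask(out[col] == missing_val)
    for name, how in zip(col, methods):
        if how == "drop":
            out = out.dropna(subset=[name])
        elif how == "mean":
            out[name] = out[name].fillna(out[name].mean())
        elif how == "median":
            out[name] = out[name].fillna(out[name].median())
        elif how == "most_frequent":
            out[name] = out[name].fillna(out[name].mode().iloc[0])
        else:
            raise ValueError(f"unknown method: {how}")
    return out

# Hypothetical toy data.
data = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": pd.Series(["Rome", "Rome", None, "Milan"], dtype="object"),
})

# Only one method is given, so "city" falls back to most_frequent.
cleaned = handle_missing_sketch(data, col=["age", "city"], method=["mean"])
print(cleaned)
```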

Drop columns

eda.drop_columns(data, col: list[str])

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns to drop.

The function returns a pandas dataframe with the selected columns dropped.

Plot categorical variables

eda.plot_categorical(data, col: list[str])

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns to plot.

The function returns a sequence of tables and plots. For every variable, plot_categorical produces an info table that contains:

The number of non-NaN rows.
The number of unique values.
The value with the highest frequency.
The frequency of the most frequent value.

Beside the info table, the top cardinalities table shows the ten most frequent values in order of frequency. In addition, the function returns a barplot of the cardinality frequencies. If the variable has more than 1000 unique values, plot_categorical shows the message "too many unique values" instead of the plot; if it has more than 50 unique values, the x-axis ticks are removed.

In the plot_categorical function, it's not mandatory to use pandas "object" type variables, but it's strongly recommended.
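As an illustration, the info table and the top cardinalities table for one column can be reproduced with plain pandas. This sketch skips the barplot, and the data and variable names are assumptions. The 1000 and 50 limits are the ones quoted above.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({
    "city": pd.Series(
        ["Rome", "Milan", "Rome", None, "Turin", "Rome"], dtype="object"
    ),
})
column = data["city"]

# Frequencies of each value, NaN excluded, sorted from most to least frequent.
counts = column.value_counts()

# Info table: non-NaN rows, unique values, most frequent value and its count.
info = pd.Series({
    "count": column.count(),
    "unique": column.nunique(),
    "top": counts.index[0],
    "freq": counts.iloc[0],
})

# The ten most frequent values.
top_cardinalities = counts.head(10)

# Checks matching the limits quoted in the text.
too_many_values = column.nunique() > 1000
hide_x_ticks = column.nunique() > 50

print(info)
print(top_cardinalities)
```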

Plot numerical variables

eda.plot_numerical(data, col: list[str], bins: int = 50)

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns to plot.
bins: Number of bins to use in the histogram plot.

Like plot_categorical, the function returns a sequence of tables and plots. For every variable, the plot_numerical function produces an info table that contains:

Count of non-NaN rows
Mean
Std
Min
25%
50%
75%
Max
Number of unique values
Value of the skew

In addition, the function returns a histogram with an estimated density, plus a boxplot. In the plot_numerical function, it's mandatory to pass numerical variables so that the histogram can be plotted and the density of the distribution estimated.
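The info table for a numerical column can be sketched with pandas alone, leaving out the histogram and the boxplot. The data and names are hypothetical, and the skew is computed with pandas' default estimator, which may differ from the one edamame uses.

```python
import pandas as pd

# Hypothetical toy data with a long right tail.
data = pd.DataFrame({"income": [1.0, 2.0, 2.0, 3.0, 12.0]})
column = data["income"]

# count, mean, std, min, 25%, 50%, 75%, max.
info = column.describe()

# Extra rows matching the list above.
info["unique"] = column.nunique()
info["skew"] = column.skew()

print(info)
```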

View cardinalities of variables

eda.view_cardinality(data, col: list[str])

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns for which we want to show the number of unique values.

The function is especially helpful for studying the cardinalities of the categorical variables. If a variable has a high cardinality, we need to either reduce the number of distinct values or drop the variable.

In addition, a low cardinality in a numerical variable can be a clue that the variable should be converted into a categorical one with the num_to_categorical function.
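For reference, the number of unique values per column is what pandas reports through nunique. A minimal sketch, with made-up data:

```python
import pandas as pd

# Hypothetical toy data: "rating" is numerical but takes only a few values.
data = pd.DataFrame({
    "city": pd.Series(
        ["Rome", "Milan", "Rome", "Turin", "Milan", "Rome"], dtype="object"
    ),
    "rating": [1, 2, 1, 3, 2, 1],
})

# Number of distinct values in each selected column.
cardinality = data[["city", "rating"]].nunique()
print(cardinality)
```

Here the numerical column rating has only three distinct values, which is the kind of clue mentioned above.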

Modify the cardinalities of a variable

eda.modify_cardinality(data, col: list[str], threshold: list[int])

Parameters:

data: A pandas dataframe.
col: A list of strings containing the names of columns for which we want to modify the cardinalities.
threshold: A list of integers containing the threshold value for each variable.

All the values whose total count is lower than the threshold indicated in the function are grouped into a new single value called Other.

The function returns a pandas dataframe with the cardinalities of the selected columns modified.
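The grouping rule can be sketched as follows. The helper group_rare_values is hypothetical and written in plain pandas. It follows the description above, where counts strictly lower than the threshold are grouped, and is not edamame's code.

```python
import pandas as pd

def group_rare_values(data, col, threshold):
    """Replace values rarer than the per-column threshold with "Other"."""
    out = data.copy()
    for name, limit in zip(col, threshold):
        counts = out[name].value_counts()
        # Values whose total count is lower than the threshold.
        rare = counts[counts < limit].index
        out[name] = out[name].where(~out[name].isin(rare), "Other")
    return out

# Hypothetical toy data.
data = pd.DataFrame({
    "city": pd.Series(
        ["Rome", "Rome", "Rome", "Milan", "Milan", "Turin", "Naples"],
        dtype="object",
    ),
})

grouped = group_rare_values(data, col=["city"], threshold=[2])
print(grouped["city"].value_counts())
```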

Distribution study of a numerical variable

eda.num_variable_study(data, col: str, bins: int = 50, epsilon: float = 0.0001, theory: bool = False)

Parameters:

data: A pandas dataframe.
col: The name of the dataframe column to study.
bins: The number of bins used by the histograms. By default bins=50.
epsilon: A constant for handling variables that are not strictly positive. By default epsilon = 0.0001.
theory: A boolean value for displaying insight into the transformations applied.

The function displays the following transformations of the variable col passed:

$\log(x)$
$\sqrt{x}$
$x^2$
Box-cox
$1/x$

If a variable with zeros or negative values is passed, the function shows results based on the original data transformed to be strictly positive.

In case of zeros, data are transformed as: $x_i = \begin{cases} \epsilon, & \text{if } x_i = 0 \\ x_i, & \text{otherwise} \end{cases}$.
In case of negative values, data are transformed as: $x_i = x_i + |\min(x)|\cdot\epsilon$.
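A small numeric sketch of the zero-replacement rule and of four of the transformations listed above is given below. It covers only the case of zeros and omits Box-Cox, which would need an extra library (for example scipy.stats.boxcox). The data and variable names are assumptions.

```python
import numpy as np
import pandas as pd

epsilon = 0.0001

# Hypothetical non-negative variable containing a zero.
x = pd.Series([0.0, 1.0, 4.0, 9.0])

# Zeros are replaced by epsilon; all other values are kept as they are.
x_pos = x.where(x != 0, epsilon)

# Transformations of the strictly positive variable.
transformed = pd.DataFrame({
    "log": np.log(x_pos),
    "sqrt": np.sqrt(x_pos),
    "square": x_pos ** 2,
    "reciprocal": 1 / x_pos,
})
print(transformed)
```

Without the replacement step, the logarithm and the reciprocal would be infinite at zero.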

Pearson's correlation matrix

eda.correlation_pearson(data, threshold: float = 0.)

Parameters:

data: A pandas dataframe.
threshold: A float, set by default to 0. Only the correlation values higher than the threshold are shown in the matrix.
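The effect of the threshold can be illustrated with pandas. The sketch below assumes the comparison is made on the absolute value of the correlation, which the text above does not state. The data and names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "b" is a multiple of "a", "c" is uncorrelated with "a".
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 0, -1, 1],
})

threshold = 0.5

# Pearson correlation matrix.
corr = data.corr(method="pearson")

# Keep only the entries above the threshold (assumption: in absolute value);
# the others become NaN.
shown = corr.where(corr.abs() > threshold)
print(shown)
```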

Correlation matrix for categorical columns

eda.correlation_categorical(data)

Parameters:

data: A pandas dataframe.

The function performs the Chi-Square Test of Independence between categorical variables of the dataset.

Phik Correlation matrix

eda.correlation_phik(data, theory: bool = False)

Parameters:

data: A pandas dataframe.
theory: A boolean value for displaying insight into the theory of the $\phi_k$ correlation index. By default it is set to False.

Link to the paper

Interaction

eda.interaction(data)

Parameters:

data: A pandas dataframe.

The function displays an interactive scatterplot for analysing relationships between numerical columns.

Inspection

eda.inspection(data, threshold: int = 10, bins: int = 50, figsize: tuple[float, float] = (6., 4.))

Parameters:

data: A pandas dataframe.
threshold: A value for determining the maximum number of distinct cardinalities the target variable can have. By default it is set to 10.
bins: The number of bins used by the histograms. By default bins=50.
figsize: A tuple to determine the plot size.

The function displays an interactive plot for analysing the distribution of a variable based on the distinct cardinalities of the target variable.

Split and scaling

eda.split_and_scaling(data, target: str)

Parameters:

data: A pandas dataframe.
target: The response variable column name.

The function returns two pandas objects:

The regressor matrix $X$, which contains all the predictors for the model.
The series $y$, which contains the values of the response variable.

In addition, the function applies a step of standard scaling on the numerical columns of the $X$ matrix.
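The split and the scaling step can be sketched in plain pandas as follows. Standard scaling is written out by hand with the population standard deviation (ddof=0), which is what scikit-learn's StandardScaler uses. Whether edamame relies on that class is an assumption, and the data and names are made up.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({
    "size": [50.0, 70.0, 90.0],
    "city": pd.Series(["Rome", "Milan", "Rome"], dtype="object"),
    "price": [100.0, 150.0, 210.0],
})
target = "price"

# Response series and predictor matrix.
y = data[target]
X = data.drop(columns=[target])

# Standard scaling of the numerical predictors: zero mean, unit variance.
num_cols = X.select_dtypes(include=["number"]).columns
X[num_cols] = (X[num_cols] - X[num_cols].mean()) / X[num_cols].std(ddof=0)

print(X)
print(y)
```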

TODO

Finish the documentation.
Add the xgboost model, PCA regression and other methods for studying the goodness of fit of the other models.
Add the classification class to the package.
Add an ensemble regressor/classifier method.

Release history

0.59

May 26, 2023

0.58

May 25, 2023

0.57

May 25, 2023

0.56

May 25, 2023

0.55

May 24, 2023

0.54

May 24, 2023

0.53

May 22, 2023

0.52

May 13, 2023

0.51

May 6, 2023

0.50

May 6, 2023

0.49

Apr 21, 2023

0.48

Apr 16, 2023

0.47

Apr 15, 2023

0.46

Apr 8, 2023

0.45

Mar 25, 2023

0.44

Mar 24, 2023

0.43

Mar 23, 2023

0.42

Mar 22, 2023

This version

0.41

Mar 11, 2023

0.40

Feb 4, 2023

0.39

Jan 30, 2023

0.38

Jan 28, 2023

0.37

Jan 21, 2023

0.36

Jan 14, 2023

0.35

Jan 9, 2023

0.34

Jan 9, 2023

0.33

Jan 9, 2023

0.32

Jan 7, 2023

0.31

Jan 5, 2023

0.30

Jan 5, 2023

0.29

Jan 4, 2023

0.28

Jan 3, 2023

0.27

Jan 3, 2023

0.26

Jan 3, 2023

0.25

Jan 2, 2023

0.24

Jan 2, 2023

0.23

Dec 30, 2022

0.22

Dec 24, 2022

0.21

Dec 24, 2022

0.20

Dec 23, 2022

0.19

Dec 21, 2022

0.18

Dec 20, 2022

0.17

Dec 19, 2022

0.16

Dec 19, 2022

0.15

Dec 19, 2022

0.14

Dec 19, 2022

0.13

Dec 19, 2022

0.12

Dec 19, 2022

0.11

Dec 18, 2022

0.1

Dec 16, 2022

0.0.9

Dec 16, 2022

0.0.8

Dec 13, 2022

0.0.7

Dec 13, 2022

0.0.6

Dec 6, 2022

0.0.5

Dec 3, 2022

0.0.4

Dec 2, 2022

0.0.3

Dec 1, 2022

0.0.2

Dec 1, 2022

0.0.1

Nov 30, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

edamame-0.41.tar.gz (22.4 kB)

Uploaded Mar 11, 2023 Source

Built Distribution

edamame-0.41-py3-none-any.whl (22.9 kB)

Uploaded Mar 11, 2023 Python 3

File details

Details for the file edamame-0.41.tar.gz.

File metadata

Download URL: edamame-0.41.tar.gz
Upload date: Mar 11, 2023
Size: 22.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for edamame-0.41.tar.gz
Algorithm	Hash digest
SHA256	`a2898d2f3c2f893971956c87ec63594f15be7ba70d08885da1f38e231910cf9f`
MD5	`8493202c9dfb9ce263b6e610c07b2af1`
BLAKE2b-256	`1b802d68114fcf4d4d6bf1a923fefb212c4d48afa65799095c07e19c443c834b`

File details

Details for the file edamame-0.41-py3-none-any.whl.

File metadata

Download URL: edamame-0.41-py3-none-any.whl
Upload date: Mar 11, 2023
Size: 22.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for edamame-0.41-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e52aaf7991680a995d4acc73c7d68bfca861ce536b7070cea4ecffc99a98ae5`
MD5	`884a48592bba1c705648ac31e3a6b888`
BLAKE2b-256	`0d7de46426d503611bac93d786bcbcb3208cf75b54e36c8d18a9db3b76eb1f60`
