Exploratory data analysis tools
Installation
To install the package, run:
pip install edamame
The edamame package works correctly inside a .ipynb (Jupyter notebook) file. Import it with:
import edamame as eda
Why Edamame?
Edamame was inspired by the pandas-profiling and pycaret packages. Its goal is to provide friendly, helpful functions for the EDA (exploratory data analysis) step on a dataset and, after that, for training and analyzing a battery of models for regression or classification problems.
Exploratory data analysis functions
You can find an example of the EDA that uses the edamame package in the eda_example.ipynb notebook.
Dimensions
A prettier version of the .shape attribute.
eda.dimensions(data)
The function displays the number of rows and columns of the pandas dataframe passed.
Describe distribution
eda.describe_distribution(data)
Given a pandas dataframe, the function displays the result of the .describe() method, split into quantitative/numerical columns and categorical/object columns.
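As a rough illustration of that split, the sketch below reproduces it with plain pandas. It is not edamame's implementation; the sample dataframe and the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
data = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "income": [1800.0, 2500.0, 3100.0, 2200.0],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

# Summary of the quantitative/numerical columns (count, mean, std, quartiles...).
numerical_summary = data.select_dtypes(include="number").describe()

# Summary of the remaining (categorical/object) columns (count, unique, top, freq).
categorical_summary = data.select_dtypes(exclude="number").describe()

print(numerical_summary)
print(categorical_summary)
```

With edamame installed, the documented call for the same dataframe would be eda.describe_distribution(data).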
Identify columns types
eda.identify_types(data)
Given a pandas dataframe, the function displays the result of the .dtypes attribute and returns two lists: the names of the quantitative/numerical columns and the names of the columns identified as "object" by pandas.
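The two returned lists can be approximated with pandas alone, as in the sketch below. This is an assumption about the behaviour based on the description, not the package's code; the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the "city" column is forced to the "object" dtype.
data = pd.DataFrame({
    "age": [23, 35, 41],
    "score": [7.5, 8.0, 6.5],
    "city": pd.Series(["Rome", "Milan", "Turin"], dtype=object),
})

# Per-column types, as displayed by the .dtypes attribute.
print(data.dtypes)

# Names of the numerical columns and of the "object" columns.
numerical_cols = data.select_dtypes(include="number").columns.tolist()
object_cols = data.select_dtypes(include="object").columns.tolist()

print(numerical_cols)
print(object_cols)
```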
Convert numerical columns to categorical
eda.num_to_categorical(data, col: list[str])
Given a dataframe and a list of column names, the function converts the type of those columns to "object".
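A minimal pandas-only sketch of this conversion follows. It assumes the conversion is equivalent to a cast with astype, which the description suggests but does not confirm; the data and names are hypothetical.

```python
import pandas as pd

# Hypothetical data: zip codes are stored as integers but are really labels.
data = pd.DataFrame({
    "zip_code": [20121, 10100, 30170],
    "price": [1.5, 2.0, 3.25],
})
col = ["zip_code"]

# Cast only the listed columns to "object"; the other columns are untouched.
converted = data.astype({name: object for name in col})

print(converted.dtypes)
```

Casting numeric codes to "object" keeps them out of the numerical summaries and lets them be treated as categories.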
Missing data
eda.missing(data)
The function displays the following elements:
- a table with the percentage of NA records for every column
- a table with the percentage of records equal to 0 for every column
- a table with the percentage of duplicate rows
- a list of lists containing the names of the numerical columns with NA, the names of the categorical columns with NA, and the names of the columns with 0 as a record
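The quantities listed above can be computed with plain pandas, as sketched below. This is an illustration under assumptions (for example, that percentages are taken over the number of rows), not edamame's own code; the sample data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with NA values and zeros.
data = pd.DataFrame({
    "age": [23.0, np.nan, 41.0, 0.0],
    "income": [1800.0, 2500.0, 0.0, 0.0],
    "city": pd.Series(["Rome", None, "Rome", "Turin"], dtype=object),
})

# Percentage of NA records per column.
na_pct = data.isna().mean() * 100

# Percentage of records equal to 0 per column.
zero_pct = (data == 0).mean() * 100

# Percentage of rows that duplicate an earlier row.
dup_pct = data.duplicated().mean() * 100

# Column names grouped as in the returned list of lists.
numerical_cols = data.select_dtypes(include="number").columns
num_with_na = [c for c in numerical_cols if data[c].isna().any()]
cat_with_na = [
    c for c in data.columns
    if c not in numerical_cols and data[c].isna().any()
]
cols_with_zero = [c for c in data.columns if (data[c] == 0).any()]

print(na_pct, zero_pct, dup_pct, sep="\n")
print([num_with_na, cat_with_na, cols_with_zero])
```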
Handling Missing values
eda.handling_missing(data, col: list[str], missing_val = np.nan, method: list[str] = [])
Parameters:
- data: a pandas dataframe
- col: a list of the names of the dataframe columns to handle
- missing_val: the value that represents NA in the columns passed. Defaults to np.nan
- method: a list of method names (mean, median, most_frequent, drop) applied to the columns passed. By default, if no method is given, the function applies most_frequent to all the columns passed. If fewer methods than columns are given, the list is completed with most_frequent
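The sketch below re-creates the documented parameters with plain pandas to make the defaulting rule concrete. The helper fill_missing_sketch is hypothetical and is not the package's implementation; in particular, pairing methods with columns by position is an assumption.

```python
import numpy as np
import pandas as pd


def fill_missing_sketch(data, col, missing_val=np.nan, method=None):
    """Hypothetical pandas-only version of the documented behaviour."""
    method = list(method or [])
    # Complete the method list with "most_frequent" so every column has one.
    method += ["most_frequent"] * (len(col) - len(method))
    out = data.copy()
    for name, how in zip(col, method):
        series = out[name]
        # Turn a custom missing marker into NaN before handling it.
        if not pd.isna(missing_val):
            series = series.mask(series == missing_val)
        if how == "drop":
            # Remove the rows where this column is missing.
            out = out[series.notna()].copy()
            continue
        if how == "mean":
            fill = series.mean()
        elif how == "median":
            fill = series.median()
        elif how == "most_frequent":
            fill = series.mode().iloc[0]
        else:
            raise ValueError(f"unknown method: {how}")
        out[name] = series.fillna(fill)
    return out


# Hypothetical sample data: one missing value per column.
data = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [1000.0, 3000.0, np.nan],
    "city": pd.Series(["Rome", "Rome", None], dtype=object),
})

# Two methods for three columns: "city" falls back to most_frequent.
result = fill_missing_sketch(
    data, ["age", "income", "city"], method=["mean", "median"]
)
print(result)
```

With edamame installed, the corresponding documented call would be eda.handling_missing(data, col=["age", "income", "city"], method=["mean", "median"]).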
Drop columns
eda.drop_columns(data, col: list[str])
Parameters:
- data: a pandas dataframe
- col: a list of strings containing the names of columns to drop
Plot categorical variables
eda.plot_categorical(data, col: list[str])
Parameters:
- data: a pandas dataframe
- col: a list of strings containing the names of columns to plot
The function returns a sequence of tables and plots. For every variable, plot_categorical produces an info table that contains:
- the number of non-NaN rows
- the number of unique values
- the most frequent value
- the frequency of the most frequent value
Next to the info table, a top cardinalities table shows the ten most frequent values. In addition, the function returns a bar plot of the cardinality frequencies. If the variable has more than 1000 unique values, plot_categorical shows the message "too many unique values" instead of the plot; if it has more than 50 unique values, the x-axis ticks are removed.
The plot_categorical function does not require pandas "object" type variables, but using them is strongly recommended.
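The tables described above can be approximated with plain pandas, as in the following sketch. It leaves out the plotting, and the variable names and sample data are hypothetical; only the two thresholds (1000 and 50 unique values) come from the description.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
data = pd.DataFrame({
    "city": pd.Series(
        ["Rome", "Milan", "Rome", None, "Turin", "Rome"], dtype=object
    ),
})
series = data["city"]

# Frequencies sorted from most to least common; NaN is excluded.
counts = series.value_counts()

# Info table: non-NaN rows, unique values, top value and its frequency.
info = pd.Series({
    "count": series.count(),
    "unique": series.nunique(),
    "top": counts.index[0],
    "freq": counts.iloc[0],
})

# Top cardinalities table: the ten most frequent values.
top_cardinalities = counts.head(10)

# Thresholds taken from the description of plot_categorical.
n_unique = series.nunique()
too_many_to_plot = n_unique > 1000
hide_x_ticks = n_unique > 50

print(info)
print(top_cardinalities)
```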
Plot numerical variables
eda.plot_numerical(data, col: list[str], bins: int = 50)
Parameters:
- data: a pandas dataframe
- col: a list of strings containing the names of columns to plot
- bins: number of bins to use in the histogram plot
Like plot_categorical, the function returns a sequence of tables and plots. For every variable, the plot_numerical function produces an info table that contains:
- count of non-NaN rows
- mean
- std
- min
- 25%
- 50%
- 75%
- max
- number of unique values
- value of the skew
In addition, the function returns a histogram with an estimated density, plus a boxplot. The plot_numerical function requires numerical variables in order to plot the histogram and estimate the density of the distribution.
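The info table for a numerical variable can be approximated with pandas as sketched below. This is an illustration rather than the package's code: the plots are omitted, and the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with a long right tail.
data = pd.DataFrame({"income": [1000.0, 2000.0, 2000.0, 3000.0, 7000.0]})
series = data["income"]

# count, mean, std, min, 25%, 50%, 75%, max.
info = series.describe()

# Extra rows listed in the description: unique values and skewness.
info["unique"] = series.nunique()
info["skew"] = series.skew()

print(info)
```

A positive skew, as in this sample, indicates a longer tail on the right side of the distribution.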
TODO
- Finish the documentation
- Add the xgboost model, PCA regression, and other methods for studying the goodness of fit of the other models
- Add the classification part to the package