perform_eda
The perform_eda function is used to conduct Exploratory Data Analysis (EDA) on a given dataset. It provides various insights and visualizations to help understand the data.
Function Signature
def perform_eda(data):
Usage
perform_eda(data)
Functionality
The perform_eda function performs the following steps:
- Prints the dimensions of the dataset.
- Displays the data types of each column in the dataset.
- Provides summary statistics for the dataset.
- Checks for missing values and displays the count of null values for each column.
- Identifies duplicate rows in the dataset and prints the count of duplicate rows.
- Visualizes the distributions and relationships in the data:
  - For categorical variables, generates bar plots showing the value counts for each category.
  - For numeric variables, generates histograms, box plots, scatter plots (numeric vs. numeric), and kernel density plots.
  - Displays a correlation matrix heatmap.
  - Generates a pairwise scatter plot for numeric variables.
  - If there is more than one categorical variable, generates cross-tabulation bar plots to visualize the relationships between them.
- Displays a heatmap showing the locations of missing values in the dataset.
- If a target variable name is provided, calculates the correlation between each feature and the target variable and displays a bar plot of the feature correlations with the target variable.
- Detects outliers in numeric variables by calculating z-scores and identifying values that exceed a threshold of 3 standard deviations from the mean.
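The non-graphical steps in the list above can be approximated with plain pandas calls. The block below is a minimal sketch under stated assumptions, not the package's actual implementation: the helper name `summarize`, its `z_threshold` parameter, and the choice of the population standard deviation for the z-scores are all hypothetical.

```python
import numpy as np
import pandas as pd


def summarize(data: pd.DataFrame, z_threshold: float = 3.0) -> dict:
    """Hypothetical helper: gather the text-only EDA checks in one dict."""
    numeric = data.select_dtypes(include="number")

    # z-score = (value - column mean) / column standard deviation.
    # Constant columns have a zero deviation; map it to NaN so they
    # never register as outliers instead of dividing by zero.
    std = numeric.std(ddof=0).replace(0, np.nan)
    z_scores = (numeric - numeric.mean()) / std
    outlier_mask = z_scores.abs() > z_threshold

    return {
        "shape": data.shape,                              # (rows, columns)
        "dtypes": data.dtypes,                            # type of each column
        "describe": data.describe(include="all"),         # summary statistics
        "null_counts": data.isnull().sum(),               # missing values per column
        "duplicate_rows": int(data.duplicated().sum()),   # repeats of an earlier row
        "outlier_counts": outlier_mask.sum(),             # |z| > threshold per column
    }
```

Printing each entry of the returned dict would reproduce the kind of console output the list describes. Note that with a threshold of 3, a single extreme value can only be flagged once the column has more than ten rows, because the extreme value itself inflates the standard deviation.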
Note: Replace <target_variable_name> with the actual name of your target variable to enable the feature correlation analysis.
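The signature shown earlier takes only `data`, so how the target name reaches the function is not stated here. As an illustration of the feature correlation step only, the sketch below assumes a hypothetical standalone helper, `target_correlations`, with a hypothetical `target` parameter, and restricts itself to numeric columns using Pearson correlation.

```python
import pandas as pd


def target_correlations(data: pd.DataFrame, target: str) -> pd.Series:
    """Hypothetical helper: correlation of each numeric feature with the target."""
    numeric = data.select_dtypes(include="number")
    if target not in numeric.columns:
        raise ValueError(f"{target!r} is not a numeric column of the dataset")

    features = numeric.drop(columns=target)
    # corrwith computes the Pearson correlation of every column with the Series.
    return features.corrwith(numeric[target]).sort_values(ascending=False)
```

Calling `.plot(kind="bar")` on the returned Series is one way to get a bar plot of the correlations, assuming matplotlib is installed.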
Dependencies
The perform_eda function requires the following libraries:
- pandas
- numpy
- matplotlib.pyplot
- seaborn
- scipy (for scipy.stats.ttest_ind)
Make sure to have these libraries installed in your Python environment before using the function.
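One way to confirm the environment is ready is to check that each top-level package can be found before calling the function. This is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list uses the installable package names (`matplotlib` for matplotlib.pyplot, `scipy` for scipy.stats.ttest_ind).

```python
from importlib.util import find_spec

# Top-level packages behind the dependencies listed above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "scipy"]


def missing_packages(names=REQUIRED):
    """Hypothetical helper: return the names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Anything reported as missing can usually be installed with pip, for example `pip install pandas numpy matplotlib seaborn scipy`.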
Example
import pandas as pd
# Load your dataset
data = pd.read_csv('your_dataset.csv')
# Perform EDA
perform_eda(data)