A Python package for univariate and bivariate data analysis using PySpark
Project description
pyspark_eda
pyspark_eda is a Python library for performing exploratory data analysis (EDA) using PySpark. It offers functionality for univariate, bivariate, and multivariate analysis, handles missing values and outliers, and visualizes data distributions.
Features
- Univariate analysis: Analyze numerical and categorical columns individually. Displays a histogram and frequency distribution table if requested.
- Bivariate analysis: Includes correlation, Cramer's V, and ANOVA. Displays a scatter plot if requested.
- Multivariate analysis: Includes Variance Inflation Factor (VIF) and decision tree analysis.
- Automatic handling: Deals with missing values and outliers seamlessly.
- Visualization: Provides graphical representations of data distributions and relationships.
Installation
You can install pyspark_eda via pip:
pip install pyspark_eda
Function
Univariate Analysis
Parameters
- df (DataFrame): The input PySpark DataFrame.
- table_name (str): The base table name under which the results are saved.
- numerical_columns (list): The numerical columns to include in the analysis.
- categorical_columns (list): The categorical columns to include in the analysis.
- id_list (list, optional): Columns to exclude from the analysis.
- print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no); default is 0.
Description
Performs univariate analysis on the DataFrame and prints summary statistics and visualizations. It returns a table with the following columns: column, total_count, min, max, mean, mode, null_percentage, skewness, kurtosis, stddev (standard deviation), q1, q2, q3 (quartiles), mean_plus_3std, mean_minus_3std, outlier_percentage, and frequency_distribution. You can display the table to view the results.
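As an aside, the mean_plus_3std, mean_minus_3std, and outlier_percentage columns follow the common three-sigma convention. The plain-Python sketch below illustrates what those columns conventionally mean; the sample values are hypothetical, and pyspark_eda's exact computation (e.g. sample vs. population standard deviation) may differ.

```python
# Illustration (plain Python, not part of pyspark_eda): how the
# mean_plus_3std, mean_minus_3std, and outlier_percentage columns
# are conventionally derived from a numerical column.
from statistics import mean, pstdev

values = [10, 12, 11, 13, 12, 11, 95]  # hypothetical column values

mu = mean(values)
sigma = pstdev(values)       # population standard deviation

upper = mu + 3 * sigma       # mean_plus_3std
lower = mu - 3 * sigma       # mean_minus_3std

# Values outside the [lower, upper] band count as outliers.
outliers = [v for v in values if v < lower or v > upper]
outlier_percentage = 100 * len(outliers) / len(values)
```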
Example Usage
get_univariate_analysis
from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']
# Perform univariate analysis
get_univariate_analysis(df, table_name="your_table_name", numerical_columns=numerical_columns, categorical_columns=categorical_columns, id_list=['id_column'], print_graphs=1)
Function
Bivariate Analysis
Parameters
- df (DataFrame): The input PySpark DataFrame.
- table_name (str): The base table name under which the results are saved.
- numerical_columns (list): The numerical columns to include in the analysis.
- categorical_columns (list): The categorical columns to include in the analysis.
- id_columns (list, optional): Columns to exclude from the analysis.
- correlation_analysis (int, optional): Whether to perform correlation analysis (1 for yes, 0 for no); default is 1.
- cramer_analysis (int, optional): Whether to perform Cramer's V analysis (1 for yes, 0 for no); default is 1.
- anova_analysis (int, optional): Whether to perform ANOVA analysis (1 for yes, 0 for no); default is 1.
- print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no); default is 0.
Description
Performs bivariate analysis on the DataFrame, including Pearson's correlation, Cramer's V, and ANOVA. It returns a table with the following columns: Column_1, Column_2, Correlation_Coefficient, Cramers_V, Anova_F_Value, Anova_P_Value. You can display the table to view the results.
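For intuition about the Cramers_V column: Cramer's V measures association between two categorical columns on a 0-to-1 scale, derived from the chi-squared statistic of their contingency table. A minimal plain-Python sketch, with a hypothetical 2x2 contingency table (pyspark_eda's internal computation may differ):

```python
# Illustration (plain Python): Cramer's V from a contingency table
# of two categorical columns. The counts below are hypothetical.
from math import sqrt

table = [  # rows: levels of column_1, columns: levels of column_2
    [20, 10],
    [5, 25],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for i, row in enumerate(table)
    for j, obs in enumerate(row)
    for exp in [row_totals[i] * col_totals[j] / n]
)

k = min(len(table), len(table[0])) - 1
cramers_v = sqrt(chi2 / (n * k))  # 0 = no association, 1 = perfect
```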
Example Usage
get_bivariate_analysis
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical and categorical columns
numerical_columns = ['col1', 'col2', 'col3']
categorical_columns = ['col4', 'col5', 'col6']
# Perform bivariate analysis
get_bivariate_analysis(df, table_name="bivariate_analysis_results", numerical_columns=numerical_columns, categorical_columns=categorical_columns, id_columns=['id_column'], correlation_analysis=1, cramer_analysis=1, anova_analysis=1, print_graphs=1)
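Similarly, the Anova_F_Value reported for a numerical/categorical pair is the one-way ANOVA F-statistic: the ratio of between-group to within-group variance of the numerical column across the categorical levels. A plain-Python sketch with hypothetical groups (shown for intuition only; the library computes this for you):

```python
# Illustration (plain Python): one-way ANOVA F-value for a numerical
# column split by the levels of a categorical column (hypothetical data).
from statistics import mean

groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

all_values = [v for vs in groups.values() for v in vs]
grand_mean = mean(all_values)
k = len(groups)        # number of groups
n = len(all_values)    # total observations

# Between-group sum of squares: spread of group means around the grand mean.
ss_between = sum(len(vs) * (mean(vs) - grand_mean) ** 2 for vs in groups.values())
# Within-group sum of squares: spread of values around their own group mean.
ss_within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)

f_value = (ss_between / (k - 1)) / (ss_within / (n - k))
```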
Function
Multivariate Analysis
Parameters
- df (DataFrame): The input PySpark DataFrame.
- table_name (str): The base table name under which the results are saved.
- numerical_columns (list): The numerical columns to include in the analysis.
- id_columns (list, optional): Columns to exclude from the analysis.
- vif_analysis (int, optional): Whether to compute VIF values (1 for yes, 0 for no); default is 1.
- decision_tree_analysis (int, optional): Whether to fit a decision tree (1 for yes, 0 for no); default is 1.
- target_column (str, optional): The target column for the decision tree; default is None. It is required when decision tree analysis is enabled.
- depth (int, optional): The depth of the decision tree; default is 3.
Description
Performs multivariate analysis on the DataFrame, including Variance Inflation Factor (VIF) and a decision tree. It returns a table with the following columns: Feature, VIF. You can display the table to view the results. The decision tree is also saved as a PNG file.
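For context on the VIF column: the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on the remaining features; large values signal multicollinearity. In the two-predictor case R^2 reduces to the squared Pearson correlation, which the plain-Python sketch below uses (hypothetical data; pyspark_eda's general computation involves a full regression per feature):

```python
# Illustration (plain Python): VIF for one feature against a single
# other predictor, where R^2 = r^2 (Pearson correlation squared).
from math import sqrt
from statistics import mean

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]  # nearly collinear with x1

mx1, mx2 = mean(x1), mean(x2)
cov = sum((a - mx1) * (b - mx2) for a, b in zip(x1, x2))
var1 = sum((a - mx1) ** 2 for a in x1)
var2 = sum((b - mx2) ** 2 for b in x2)
r = cov / sqrt(var1 * var2)   # Pearson correlation

vif = 1 / (1 - r ** 2)        # large VIF => strong multicollinearity
```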
Example Usage
get_multivariate_analysis
from pyspark.sql import SparkSession
from pyspark_eda import get_multivariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Identify numerical columns
numerical_columns = ['col1', 'col2', 'col3']
# Perform multivariate analysis
get_multivariate_analysis(df, table_name="multivariate_analysis_results", numerical_columns=numerical_columns, id_columns=['id_column'], vif_analysis=1, decision_tree_analysis=1, target_column="target_column_name", depth=3)
Contact
- Author: Tanya Irani
- Email: tanyairani22@gmail.com
Hashes for pyspark_eda-1.3.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ab6b86af9849409b4e9f95bc8b407d71e43bb2b65136fa52f811611cdec754fc
MD5 | 914ec24ec78beeb740d552c9be83904f
BLAKE2b-256 | 8aabef566d72722704ee3779d157004bd129502bf98a573b55397b3e1a616274