A Python package for univariate and bivariate data analysis using PySpark
pyspark_eda
pyspark_eda is a Python library for exploratory data analysis (EDA) with PySpark. It supports univariate and bivariate analysis, handles missing values and outliers, and visualizes data distributions.
Features
- Univariate analysis: Analyze numerical and categorical columns individually.
- Bivariate analysis: Includes correlation, Cramer's V, and ANOVA.
- Automatic handling: Handles missing values and outliers as part of the analysis.
- Visualization: Provides graphical representation of data distributions and relationships.
Installation
You can install pyspark_eda via pip:
pip install pyspark_eda
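To confirm that the install succeeded, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of pyspark_eda.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pyspark_eda")
    print(found if found else "pyspark_eda is not installed in this environment")
```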
Example Usage
Univariate Analysis
from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Perform univariate analysis
get_univariate_analysis(df, table_name="your_table_name", print_graphs=1, id_list=['id_column'])
Bivariate Analysis
from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis
# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()
# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)
# Perform bivariate analysis
get_bivariate_analysis(df,table_name="bivariate_analysis_results", print_graphs=1, id_columns=['id_column'], correlation_analysis=1, cramer_analysis=1, anova_analysis=1)
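Because the three analysis switches are documented as 1/0 integers, it can be convenient to build them from booleans when only some analyses are wanted. The sketch below is an illustration only: `bivariate_flags` is a hypothetical helper, not part of pyspark_eda, and it relies solely on the parameter names shown in this README.

```python
def bivariate_flags(correlation=True, cramer=True, anova=True):
    """Translate booleans into the 1/0 integer flags documented in this README."""
    return {
        "correlation_analysis": int(correlation),
        "cramer_analysis": int(cramer),
        "anova_analysis": int(anova),
    }


# Request correlation only; Cramer's V and ANOVA are switched off.
flags = bivariate_flags(cramer=False, anova=False)

# With pyspark_eda installed and `df` loaded as in the example above,
# the call would then look like this:
# get_bivariate_analysis(df, table_name="bivariate_analysis_results", **flags)
```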
Functions
get_univariate_analysis
Parameters
- df (DataFrame): The input PySpark DataFrame.
- table_name (str): The base table name under which the results are saved.
- print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no). Default is 0.
- id_list (list, optional): List of columns to exclude from analysis.
Description
Performs univariate analysis on the DataFrame and prints summary statistics and visualizations.
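To give a feel for what a univariate summary of a numeric column involves, here is a plain-Python sketch on a small list. It is not the library's implementation: the function name `summarize_numeric` is hypothetical, and the 1.5 x IQR outlier rule is an assumption, since this README does not state which outlier method pyspark_eda uses.

```python
import statistics


def summarize_numeric(values):
    """Summarize one numeric column, treating None as a missing value.

    Needs at least two non-missing values, because quartiles are undefined otherwise.
    """
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "outliers": [v for v in present if v < low or v > high],
    }


summary = summarize_numeric([1, 2, 3, 4, 5, None, 100])
print(summary)
```

On a real PySpark DataFrame the same quantities would be computed with distributed aggregations rather than Python lists.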
get_bivariate_analysis
Parameters
- df (DataFrame): The input PySpark DataFrame.
- table_name (str): The base table name under which the results are saved.
- print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no). Default is 0.
- id_columns (list, optional): List of columns to exclude from analysis.
- correlation_analysis (int, optional): Whether to perform correlation analysis (1 for yes, 0 for no). Default is 1.
- cramer_analysis (int, optional): Whether to perform Cramer's V analysis (1 for yes, 0 for no). Default is 1.
- anova_analysis (int, optional): Whether to perform ANOVA analysis (1 for yes, 0 for no). Default is 1.
Description
Performs bivariate analysis on the DataFrame, including correlation, Cramer's V, and ANOVA.
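For readers unfamiliar with Cramer's V, the statistic measures association between two categorical columns on a scale from 0 (none) to 1 (perfect). The sketch below computes the standard, uncorrected form from two lists of labels. It is an illustration under that assumption, not pyspark_eda's code; whether the library applies a bias correction is not stated here, and `cramers_v` is a hypothetical name.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V for two equal-length sequences of category labels."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)

    # Chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0  # one variable is constant, so no association is measurable
    return math.sqrt(chi2 / (n * k))


perfect = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(perfect, independent)
```

The first pair of columns is perfectly associated and the second is independent, so the two values sit at opposite ends of the scale.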
Contact
- Author: Tanya Irani
- Email: tanyairani22@gmail.com