A Python package for univariate and bivariate data analysis using PySpark

pyspark_eda

pyspark_eda is a Python library for exploratory data analysis (EDA) with PySpark. It supports univariate and bivariate analysis, handles missing values and outliers, and visualizes data distributions.

Features

  • Univariate analysis: Analyze numerical and categorical columns individually.
  • Bivariate analysis: Includes correlation, Cramer's V, and ANOVA.
  • Automatic handling: Handles missing values and outliers as part of the analysis.
  • Visualization: Provides graphical representation of data distributions and relationships.

Installation

You can install pyspark_eda via pip:

pip install pyspark_eda

Example Usage

Univariate Analysis

from pyspark.sql import SparkSession
from pyspark_eda import get_univariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Perform univariate analysis
get_univariate_analysis(df, id_list=['id_column'], print_graphs=1)

Bivariate Analysis

from pyspark.sql import SparkSession
from pyspark_eda import get_bivariate_analysis

# Initialize Spark session
spark = SparkSession.builder.appName('DataAnalysis').getOrCreate()

# Load your data into a PySpark DataFrame
df = spark.read.csv('your_data.csv', header=True, inferSchema=True)

# Perform bivariate analysis
get_bivariate_analysis(
    df,
    print_graphs=1,
    id_columns=['id_column'],
    correlation_analysis=1,
    cramer_analysis=1,
    anova_analysis=1,
)

Functions

get_univariate_analysis

Parameters

  • df (DataFrame): The input PySpark DataFrame.
  • id_list (list, optional): List of columns to exclude from analysis.
  • print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no).

Description

Performs univariate analysis on the DataFrame and prints summary statistics and visualizations.
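
To make concrete what a univariate summary typically contains, here is a minimal sketch in plain Python. It is not pyspark_eda's implementation: the function names summarize_numeric and summarize_categorical are hypothetical, None stands in for a missing value, and the 1.5 × IQR outlier rule is an assumption, since this description does not say how the library detects outliers.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Summarize one numerical column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # Quartiles from the standard library (default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "outliers": [v for v in present if v < low or v > high],
    }


def summarize_categorical(values):
    """Summarize one categorical column as a missing count plus frequencies."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "frequencies": Counter(present),
    }


numeric_summary = summarize_numeric([10, 12, 11, 13, 12, 11, 10, 95, None])
categorical_summary = summarize_categorical(["a", "b", "a", None])
print(numeric_summary)
print(categorical_summary)
```

On a real PySpark DataFrame the same quantities would be computed with distributed aggregations rather than Python lists; the lists here only keep the sketch self-contained.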

get_bivariate_analysis

Parameters

  • df (DataFrame): The input PySpark DataFrame.
  • print_graphs (int, optional): Whether to print graphs (1 for yes, 0 for no).
  • id_columns (list, optional): List of columns to exclude from analysis.
  • correlation_analysis (int, optional): Whether to perform correlation analysis (1 for yes, 0 for no).
  • cramer_analysis (int, optional): Whether to perform Cramer's V analysis (1 for yes, 0 for no).
  • anova_analysis (int, optional): Whether to perform ANOVA (1 for yes, 0 for no).

Description

Performs bivariate analysis on the DataFrame, including correlation, Cramer's V, and ANOVA.
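
The three statistics named above can be sketched in plain Python to show what each one measures. This is an illustration under stated assumptions, not the library's code: the helper names pearson, cramers_v, and anova_f are hypothetical, the inputs are small in-memory lists rather than PySpark columns, and the pairing of each statistic with column types (numerical with numerical, categorical with categorical, categorical with numerical) follows the usual textbook usage rather than anything stated in this description.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation between two numerical sequences of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cramers_v(xs, ys):
    """Cramer's V between two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    rows, cols = Counter(xs), Counter(ys)
    # Chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def anova_f(groups):
    """One-way ANOVA F statistic; groups maps a category to its numbers."""
    all_values = [v for g in groups.values() for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    ss_between = sum(len(g) * (means[name] - grand_mean) ** 2
                     for name, g in groups.items())
    ss_within = sum((v - means[name]) ** 2
                    for name, g in groups.items() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


r = pearson([1, 2, 3], [2, 4, 6])
v_dependent = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
f_stat = anova_f({"a": [1, 2, 3], "b": [4, 5, 6]})
print(r, v_dependent, v_independent, f_stat)
```

Cramer's V ranges from 0 (no association) to 1 (one variable fully determines the other), while a larger F statistic indicates that group means differ more than the within-group spread would suggest.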
