A Python library for Exploratory Data Analysis and Profiling.

Pydata-visualizer

A powerful and intuitive Python library for exploratory data analysis and data profiling. Pydata-visualizer automatically analyzes your dataset, generates interactive visualizations, and provides detailed statistical insights with minimal code.

Features

  • Comprehensive Data Profiling: Analyze numerical, categorical, boolean, and string data types with detailed statistics
  • Automated Data Quality Checks: Detect missing values, outliers (IQR/Z-score methods), skewed distributions, duplicate rows, and more
  • Interactive Visualizations: Generate distribution plots, correlation heatmaps, word clouds, and statistical charts using Plotly or Seaborn
  • Dual Rendering Modes: Choose between interactive Plotly charts or static Seaborn/Matplotlib visualizations
  • Text Analysis: Automatic word frequency analysis and word cloud generation for text columns
  • Rich HTML Reports: Export analysis to visually appealing and shareable HTML reports with interactive or static charts
  • Performance Optimized: Fast analysis even on large datasets with minimal mode and modular settings
  • Correlation Analysis: Calculate Pearson, Spearman, and Cramér's V correlations between variables
  • Flexible Configuration: Customize analysis thresholds and options via the comprehensive Settings class
  • Modular Analysis: Toggle individual components (plots, correlations, alerts, sample data, overview) on/off

Installation

pip install pydata-visualizer

Quick Start

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Create a report with default settings
report = AnalysisReport(df)
report.to_html("report.html")

Advanced Usage

Customizing Analysis Settings

from data_visualizer.profiler import AnalysisReport, Settings

# Configure analysis settings
report_settings = Settings(
    minimal=False,                      # Set to True for faster, minimal analysis
    top_n_values=5,                     # Show top 5 values in categorical columns
    skewness_threshold=2.0,             # Alert when |skewness| exceeds this value (default: 1.0)
    outlier_method='iqr',               # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,              # IQR multiplier for outlier detection
    duplicate_threshold=5.0,            # Percentage threshold for duplicate alerts
    text_analysis=True,                 # Enable word frequency analysis for text columns
    use_plotly=True,                    # Use Plotly for interactive visualizations (default: False for Seaborn)
    include_plots=True,                 # Include visualizations/plots in the analysis
    include_correlations=True,          # Include correlation analysis
    include_correlations_plots=True,    # Include correlation heatmaps
    include_correlations_json=False,    # Include correlation data in JSON format
    include_alerts=True,                # Include data quality alerts
    include_sample_data=True,           # Include head/tail samples
    include_overview=True               # Include dataset overview statistics
)

# Create report with custom settings
report = AnalysisReport(df, settings=report_settings)

# Perform analysis and get results dictionary
results = report.analyse()

# Generate HTML report
report.to_html("custom_report.html")
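For large datasets, the settings above can be pared down into a fast-pass configuration. The sketch below combines only options documented in this README (`minimal` plus the `include_*` toggles); adjust to taste:

```python
import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

df = pd.read_csv("your_dataset.csv")

# Fast pass: skip type-specific analysis, plots, and correlations,
# keeping only the overview, alerts, and basic statistics.
fast_settings = Settings(
    minimal=True,
    include_plots=False,
    include_correlations=False,
    include_sample_data=False,
)

report = AnalysisReport(df, settings=fast_settings)
report.to_html("quick_report.html")
```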

Report Structure

The generated report includes:

  • Overview: Dataset dimensions, missing values, duplicate rows (count, percentage, indices, and samples of duplicate data)
  • Variable Analysis: Detailed per-column statistics and visualizations including:
    • Distribution plots for numeric data with outlier highlighting (outliers shown in red)
    • Bar charts for categorical data
    • Word clouds and bar charts for text data (when text_analysis is enabled)
    • Outlier detection using IQR or Z-score methods with outlier counts and percentages
    • Skewness and kurtosis for numeric columns
    • Cardinality assessment (High/Low) for categorical and text columns
  • Sample Data: Head and tail samples of the dataset (first and last 10 rows)
  • Correlations: Correlation matrices and heatmaps for:
    • Pearson correlation (linear relationships between numerical variables)
    • Spearman correlation (monotonic relationships between numerical variables)
    • Cramér's V (associations between categorical variables)
  • Data Quality Alerts: Automated detection of data quality issues:
    • High Missing Values (>20% threshold)
    • Skewness (configurable threshold, default 1.0)
    • Outliers (detected via IQR or Z-score methods)
    • High Duplicates (configurable percentage threshold, default 5.0%)
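The Z-score variant of the outlier check can be illustrated in plain Python. This is a generic sketch of the standard method, not the library's internal implementation:

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

data = [10, 11, 12, 11, 10, 12, 11, 100]
print(zscore_outliers(data, threshold=2.0))  # -> [100]
```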

API Reference

AnalysisReport Class

class AnalysisReport:
    def __init__(self, data, settings=None):
        """
        Initialize the analysis report object.
        
        Parameters:
        -----------
        data : pandas.DataFrame
            The dataset to analyze
        settings : Settings, optional
            Configuration settings for the analysis
        """
        
    def analyse(self):
        """
        Perform the data analysis.
        
        Returns:
        --------
        dict
            A dictionary containing all analysis results
        """
        
    def to_html(self, filename="report.html"):
        """
        Generate an HTML report from the analysis.
        
        Parameters:
        -----------
        filename : str, optional
            Path to save the HTML report (default: "report.html")
        """

Settings Class

class Settings(pydantic.BaseModel):
    """
    Settings for the analysis report.
    
    Attributes:
    -----------
    minimal : bool, default=False
        Whether to perform minimal analysis (skips type-specific analysis and visualizations)
    
    top_n_values : int, default=10
        Number of top values to show for categorical columns (must be >= 1)
    
    skewness_threshold : float, default=1.0
        Threshold for skewness alerts (must be >= 0.0)
    
    outlier_method : str, default='iqr'
        Outlier detection method: 'iqr' (Interquartile Range) or 'zscore'
    
    outlier_threshold : float, default=1.5
        IQR multiplier for outlier detection (must be >= 0.0)
        Standard: 1.5 for moderate outliers, 3.0 for extreme outliers
    
    duplicate_threshold : float, default=5.0
        Percentage of duplicate rows to trigger an alert (must be >= 0.0)
    
    text_analysis : bool, default=True
        Enable word frequency analysis and word cloud generation for text columns
    
    use_plotly : bool, default=False
        Use Plotly for interactive visualizations instead of Seaborn/Matplotlib static plots
    
    include_plots : bool, default=True
        Include visualizations/plots in the analysis
    
    include_correlations : bool, default=True
        Include correlation analysis
    
    include_correlations_plots : bool, default=True
        Include correlation heatmaps
    
    include_correlations_json : bool, default=False
        Include correlation data in JSON format
    
    include_alerts : bool, default=True
        Include data quality alerts (column and dataset-level)
    
    include_sample_data : bool, default=True
        Include head/tail data samples
    
    include_overview : bool, default=True
        Include dataset overview statistics
    """

Type Analyzers

The library automatically detects and applies the appropriate analysis for different data types:

  • Numeric (Integer/Float): Statistical measures (mean, std, min, max, quartiles), distribution plots with KDE, skewness, kurtosis, outlier detection (IQR/Z-score methods), outlier counts and percentages, outlier highlighting in visualizations
  • Categorical/Object: Value counts, cardinality analysis (High/Low based on 50 unique values threshold), frequency distributions, top N values (configurable), bar charts
  • String: Unique value counts, cardinality analysis (High/Low), top N values (configurable), word frequency analysis (when text_analysis is enabled), word cloud generation (Plotly scatter or WordCloud library), bar charts for value distribution
  • Boolean: Value counts, proportions, and frequency distribution visualizations
  • Generic: Basic analysis (unique value count) for unrecognized types
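The skewness figure reported for numeric columns is the standard Fisher-Pearson moment coefficient. Here is a plain-Python sketch of that formula (an illustration only; the library's internals likely delegate to pandas or SciPy):

```python
def skewness(values):
    """Fisher-Pearson moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))        # 0.0 -- symmetric sample
print(skewness([1, 1, 1, 2, 10]) > 1)   # True -- long right tail
```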

Correlation Analysis

Three correlation methods are calculated when applicable:

  • Pearson: Linear correlation between numerical variables (range: -1 to 1)
  • Spearman: Rank correlation capturing monotonic relationships (range: -1 to 1)
  • Cramér's V: Measure of association between categorical variables (range: 0 to 1)
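Pearson and Spearman are widely known; Cramér's V less so. It is derived from the chi-squared statistic of a contingency table: V = sqrt(chi2 / (n * (min(rows, cols) - 1))). A self-contained sketch of the bias-uncorrected form (the exact variant the library computes is not specified here):

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
    chi2 = sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Perfect association between two binary variables:
print(cramers_v([[20, 0], [0, 30]]))  # -> 1.0
```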

Data Quality Alerts

The library automatically detects potential issues in your data:

  • High Missing Values: Columns with more than 20% missing data
  • Skewness: Distributions exceeding the configured skewness threshold
  • Outliers: Data points detected using IQR or Z-score methods
  • High Duplicates: Duplicate rows exceeding the configured threshold percentage
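For reference, the IQR rule behind the outlier alert is Tukey's fences. A plain-Python sketch (a generic illustration; the library's quantile interpolation details may differ):

```python
def iqr_outliers(values, multiplier=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between the two closest ranks.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # -> [95]
```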

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

Created by Aditya Deshmukh (adideshmukh2005@gmail.com)

GitHub: https://github.com/Adi-Deshmukh/Pydata-visualizer
