pytics

An interactive data profiling library for Python that generates comprehensive HTML reports with rich visualizations and PDF export capabilities.

Features

  • 📊 Interactive Visualizations: Built with Plotly for dynamic, interactive charts
  • 📱 Responsive Design: Reports adapt to different screen sizes
  • 📄 PDF Export: Generate publication-ready PDF reports
  • 🎯 Target Analysis: Special insights for classification/regression tasks
  • 🔍 Comprehensive Profiling: Detailed statistics and distributions
  • Performance Optimized: Efficient handling of large datasets
  • 🛠️ Customizable: Configure sections and visualization options
  • ↔️ DataFrame Comparison: Compare two datasets for differences in schema, stats, and distributions

Example Reports

Full Profile Report

Targeted Analysis Report

Installation

pip install pytics

Quick Start

import pandas as pd
from pytics import profile, compare

# --- Basic Profiling ---
# Method 1: Profile a DataFrame object
df = pd.read_csv('your_data.csv')
profile(df, output_file='report.html')

# Method 2: Profile directly from a file path
# Supports CSV and Parquet files
profile('path/to/your_data.csv', output_file='report.html')
profile('path/to/your_data.parquet', output_file='report.html')

# --- Advanced Profiling ---
# Generate a PDF report
profile(df, output_format='pdf', output_file='report.pdf')

# Profile with a target variable for enhanced analysis
profile(
    df,
    target='target_column',  # Enables target-specific analysis
    output_file='targeted_report.html'
)

# Select specific sections to include/exclude
profile(
    df,
    include_sections=['overview', 'correlations'],
    exclude_sections=['target_analysis'],
    output_file='custom_report.html'
)

# --- DataFrame Comparison ---
# Method 1: Compare two DataFrame objects
df_train = pd.read_csv('train_data.csv')
df_test = pd.read_csv('test_data.csv')

compare(
    df_train, 
    df_test,
    name1='Train Set',    # Optional: Custom names for the datasets
    name2='Test Set',
    output_file='comparison.html'
)

# Method 2: Compare directly from file paths
compare(
    'path/to/train_data.csv',
    'path/to/test_data.csv',
    name1='Train Set',
    name2='Test Set',
    output_file='comparison.html'
)

Target Variable Analysis

When you specify a target variable using the target parameter, pytics enhances the analysis with:

  • Target distribution visualization
  • Feature importance analysis
  • Target-specific correlations
  • Conditional distributions of features
  • Statistical tests for feature-target relationships

Example:

# Profile with target variable analysis
profile(
    df,
    target='target_column',
    output_file='targeted_report.html'
)
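
To make "target-specific correlations" concrete, here is a minimal sketch that ranks numeric features by the absolute value of their Pearson correlation with a target column. It uses only the standard library and illustrates the general idea; it is not pytics's internal implementation, and the names `pearson` and `rank_by_target_correlation` as well as the sample data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_by_target_correlation(columns, target):
    """Return (feature, r) pairs sorted by |r| against the target column.

    `columns` maps column names to equal-length lists of numbers.
    Hypothetical helper for illustration; not part of pytics.
    """
    y = columns[target]
    scores = [
        (name, pearson(values, y))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Made-up data: "age" is perfectly linear in "income", "noise" is not.
data = {
    "age": [20, 30, 40, 50],
    "noise": [3, 1, 4, 1],
    "income": [25, 35, 45, 55],
}
ranking = rank_by_target_correlation(data, target="income")
```

In this toy data, `age` ranks first because it is an exact linear function of `income`. Constant columns would cause a division by zero here, so a real check would need to skip them.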

Configuration Options

Profile Configuration

profile(
    df,
    target='target_column',           # Target variable for supervised learning
    include_sections=['overview'],    # Sections to include
    exclude_sections=['correlations'],# Sections to exclude
    output_format='html',            # 'html' or 'pdf'
    output_file='report.html',       # Output file path
    theme='light',                   # Report theme ('light' or 'dark')
    title='Custom Report Title'      # Report title
)
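
Because `output_format` and `output_file` are passed separately, the file extension can drift away from the requested format. The sketch below shows one way to keep them in sync; `ensure_extension` is a hypothetical helper, not part of pytics, and it assumes only the two formats listed above.

```python
from pathlib import Path

# Formats named in the configuration example above.
SUPPORTED_FORMATS = ("html", "pdf")


def ensure_extension(output_file, output_format):
    """Return output_file with its suffix replaced to match output_format.

    Hypothetical helper for illustration; not part of pytics.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return str(Path(output_file).with_suffix("." + output_format))


fixed = ensure_extension("report.html", "pdf")
```

The returned path could then be passed as `output_file` alongside the same `output_format` value.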

Compare Configuration

compare(
    df1,
    df2,
    name1='First Dataset',           # Custom name for first dataset
    name2='Second Dataset',          # Custom name for second dataset
    output_file='comparison.html',   # Output file path
    theme='light',                   # Report theme ('light' or 'dark')
    title='Dataset Comparison'       # Report title
)
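
As a rough picture of what a schema difference between two datasets looks like, the sketch below compares two column-to-dtype mappings. It is an illustration under stated assumptions, not how `compare` works internally: `diff_schemas` is a hypothetical helper, and the dtype strings are made-up examples. With pandas, such a mapping could be built with `df.dtypes.astype(str).to_dict()`.

```python
def diff_schemas(schema1, schema2):
    """Compare two {column: dtype} mappings.

    Reports columns present in only one mapping and shared columns whose
    dtype differs. Hypothetical helper for illustration; not part of pytics.
    """
    cols1, cols2 = set(schema1), set(schema2)
    return {
        "only_in_first": sorted(cols1 - cols2),
        "only_in_second": sorted(cols2 - cols1),
        "dtype_changed": {
            col: (schema1[col], schema2[col])
            for col in sorted(cols1 & cols2)
            if schema1[col] != schema2[col]
        },
    }


# Made-up schemas for a train/test pair.
train_schema = {"id": "int64", "price": "float64", "label": "int64"}
test_schema = {"id": "int64", "price": "object"}
result = diff_schemas(train_schema, test_schema)
```

Here `label` exists only in the first mapping and `price` changes dtype, which are the kinds of differences worth checking before comparing statistics or distributions.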

Available Sections

  • overview: Dataset summary and memory usage
  • variables: Detailed variable analysis
  • correlations: Correlation analysis
  • target_analysis: Target-specific insights (requires target parameter)
  • interactions: Feature interaction analysis
  • missing_values: Missing value patterns
  • duplicates: Duplicate record analysis
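
A misspelled section name is easy to miss, so it can help to check the selection before generating a report. The sketch below validates names against the list above and resolves `include`/`exclude` into a final selection. `resolve_sections` is a hypothetical helper, not part of pytics. Dropping `target_analysis` when no target is given is an assumption of this sketch; how pytics itself handles that case is not stated here.

```python
# Section names as listed under "Available Sections".
SECTIONS = [
    "overview",
    "variables",
    "correlations",
    "target_analysis",
    "interactions",
    "missing_values",
    "duplicates",
]


def resolve_sections(include=None, exclude=None, target=None):
    """Return the sections to render, in the order they were requested.

    Hypothetical helper for illustration; not part of pytics.
    """
    include = list(SECTIONS) if include is None else list(include)
    exclude = set(exclude or ())
    unknown = sorted((set(include) | exclude) - set(SECTIONS))
    if unknown:
        raise ValueError(f"Unknown section name(s): {unknown}")
    if target is None:
        # Assumption: target_analysis is meaningless without a target.
        exclude.add("target_analysis")
    return [name for name in include if name not in exclude]


selected = resolve_sections(
    include=["overview", "correlations", "target_analysis"]
)
```

The resulting list could be passed as `include_sections`.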

Report Sections

  1. Overview

    • Dataset summary
    • Memory usage
    • Data types distribution
    • Missing values summary
  2. DataFrame Summary

    • Complete DataFrame info output
    • Numerical and categorical statistics
    • Data preview (head/tail)
    • Memory usage details
  3. Variable Analysis

    • Detailed statistics
    • Distribution plots
    • Missing value patterns
    • Unique values analysis
  4. Correlations

    • Correlation matrix
    • Feature relationships
    • Interactive heatmaps
  5. Target Analysis (when target specified)

    • Target distribution
    • Feature importance
    • Target correlations
  6. Missing Values

    • Missing value patterns
    • Distribution analysis
    • Correlation with other features
  7. Duplicates

    • Duplicate record analysis
    • Pattern identification
    • Impact assessment
  8. About

    • Project information
    • Feature overview
    • GitHub repository links

Edge Cases and Limitations

Data Size Limits

  • Recommended maximum rows: 1 million
  • Recommended maximum columns: 1000
  • Large datasets may require increased memory allocation
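
For a CSV file that exceeds these limits, one option is to sample it down before profiling. The sketch below uses reservoir sampling (Algorithm R) with the standard library, so the source file is streamed once and never loaded fully into memory. `sample_csv` is a hypothetical helper, not part of pytics, and it assumes the file has a header row.

```python
import csv
import random


def sample_csv(src, dst, max_rows, seed=0):
    """Write a uniform random sample of at most max_rows data rows to dst.

    The header row is copied unchanged. Row order in the output is not
    preserved. Returns the number of data rows written.
    Hypothetical helper for illustration; not part of pytics.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes a header row exists
        for i, row in enumerate(reader):
            if i < max_rows:
                # Fill the reservoir with the first max_rows rows.
                reservoir.append(row)
            else:
                # Replace a stored row with probability max_rows / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < max_rows:
                    reservoir[j] = row
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

The sampled file could then be profiled by path, as shown in the Quick Start for CSV files.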

Special Cases

  • Missing Values: Automatically handled and reported
  • Categorical Variables: Limited to 1000 unique values by default
  • Date/Time: Automatically detected and analyzed
  • Mixed Data Types: Handled with appropriate warnings
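
To see in advance which columns would run into the default limit of 1000 unique values, a quick cardinality check can help. The sketch below counts distinct values per column over an iterable of row dictionaries, such as those produced by `csv.DictReader`. `high_cardinality_columns` is a hypothetical helper, not part of pytics, and it counts every column rather than only those pytics would treat as categorical.

```python
def high_cardinality_columns(rows, limit=1000):
    """Return {column: unique_count} for columns with more than limit values.

    `rows` is an iterable of dicts mapping column names to values.
    Hypothetical helper for illustration; not part of pytics.
    """
    seen = {}
    for row in rows:
        for col, value in row.items():
            seen.setdefault(col, set()).add(value)
    return {
        col: len(values)
        for col, values in seen.items()
        if len(values) > limit
    }


# Made-up rows with a small limit so the effect is visible.
rows = [
    {"user_id": str(i), "country": "US" if i % 2 else "DE"}
    for i in range(5)
]
flagged = high_cardinality_columns(rows, limit=3)
```

In this toy example only `user_id` is flagged, since it has five distinct values while `country` has two.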

Error Handling

  • Custom exceptions for clear error reporting
  • Warning system for non-critical issues
  • Graceful degradation for memory constraints

Best Practices

  1. Memory Management

    • Sample large datasets if needed
    • Use section selection for focused analysis
    • Monitor memory usage for big datasets
  2. Performance Optimization

    • Limit categorical variables when possible
    • Use targeted section selection
    • Consider data sampling for initial exploration
  3. Report Generation

    • Choose appropriate output format
    • Use meaningful report titles
    • Save reports with descriptive filenames

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. See the CONTRIBUTING.md file for guidelines.

License

This project is licensed under the MIT License; see the LICENSE file for details.
