pytics
An interactive data profiling library for Python that generates comprehensive HTML reports with rich visualizations and PDF export capabilities.
Features
- 📊 Interactive Visualizations: Built with Plotly for dynamic, interactive charts
- 📱 Responsive Design: Reports adapt to different screen sizes
- 📄 PDF Export: Generate publication-ready PDF reports
- 🎯 Target Analysis: Special insights for classification/regression tasks
- 🔍 Comprehensive Profiling: Detailed statistics and distributions
- ⚡ Performance Optimized: Efficient handling of large datasets
- 🛠️ Customizable: Configure sections and visualization options
- ↔️ DataFrame Comparison: Compare two datasets for differences in schema, stats, and distributions
Example Reports
Full Profile Report
Targeted Analysis Report
Installation
pip install pytics
Quick Start
import pandas as pd
from pytics import profile, compare
# --- Basic Profiling ---
# Method 1: Profile a DataFrame object
df = pd.read_csv('your_data.csv')
profile(df, output_file='report.html')
# Method 2: Profile directly from a file path
# Supports CSV and Parquet files
profile('path/to/your_data.csv', output_file='report.html')
profile('path/to/your_data.parquet', output_file='report.html')
# --- Advanced Profiling ---
# Generate a PDF report
profile(df, output_format='pdf', output_file='report.pdf')
# Profile with a target variable for enhanced analysis
profile(
    df,
    target='target_column',  # Enables target-specific analysis
    output_file='targeted_report.html'
)
# Select specific sections to include/exclude
profile(
    df,
    include_sections=['overview', 'correlations'],
    exclude_sections=['target_analysis'],
    output_file='custom_report.html'
)
# --- DataFrame Comparison ---
# Method 1: Compare two DataFrame objects
df_train = pd.read_csv('train_data.csv')
df_test = pd.read_csv('test_data.csv')
compare(
    df_train,
    df_test,
    name1='Train Set',  # Optional: Custom names for the datasets
    name2='Test Set',
    output_file='comparison.html'
)
# Method 2: Compare directly from file paths
compare(
    'path/to/train_data.csv',
    'path/to/test_data.csv',
    name1='Train Set',
    name2='Test Set',
    output_file='comparison.html'
)
Target Variable Analysis
When you specify a target variable using the target parameter, pytics enhances the analysis with:
- Target distribution visualization
- Feature importance analysis
- Target-specific correlations
- Conditional distributions of features
- Statistical tests for feature-target relationships
Example:
# Profile with target variable analysis
profile(
    df,
    target='target_column',
    output_file='targeted_report.html'
)
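To make the idea of conditional feature distributions concrete, here is a minimal pandas sketch that summarizes one feature within each target class. It only illustrates the concept and is not how pytics computes its report. The `conditional_summary` helper and the column names `age` and `churned` are made up for this example.

```python
import pandas as pd


def conditional_summary(df, feature, target):
    """Summary statistics of `feature` within each value of `target`."""
    return df.groupby(target)[feature].describe()


# Toy data: the column names are illustrative only.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 29, 63],
    "churned": [0, 0, 1, 0, 1, 1],
})

summary = conditional_summary(df, feature="age", target="churned")
print(summary[["count", "mean"]])
```

A large gap between the per-class means (or quartiles) hints that the feature carries information about the target, which is the kind of relationship the targeted report is meant to surface.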
Configuration Options
Profile Configuration
profile(
    df,
    target='target_column',             # Target variable for supervised learning
    include_sections=['overview'],      # Sections to include
    exclude_sections=['correlations'],  # Sections to exclude
    output_format='html',               # 'html' or 'pdf'
    output_file='report.html',          # Output file path
    theme='light',                      # Report theme ('light' or 'dark')
    title='Custom Report Title'         # Report title
)
Compare Configuration
compare(
    df1,
    df2,
    name1='First Dataset',          # Custom name for first dataset
    name2='Second Dataset',         # Custom name for second dataset
    output_file='comparison.html',  # Output file path
    theme='light',                  # Report theme ('light' or 'dark')
    title='Dataset Comparison'      # Report title
)
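For a quick programmatic look at schema differences before generating a full comparison report, a small helper along these lines may be enough. It is a sketch of the idea, not the logic behind `compare`. The `schema_diff` function and the example schemas are hypothetical.

```python
def schema_diff(schema1, schema2):
    """Compare two {column name: dtype name} mappings."""
    cols1, cols2 = set(schema1), set(schema2)
    return {
        # Columns present in only one of the two datasets.
        "only_in_first": sorted(cols1 - cols2),
        "only_in_second": sorted(cols2 - cols1),
        # Shared columns whose dtype differs.
        "dtype_changes": {
            col: (schema1[col], schema2[col])
            for col in sorted(cols1 & cols2)
            if schema1[col] != schema2[col]
        },
    }


train_schema = {"id": "int64", "price": "float64", "city": "object"}
test_schema = {"id": "int64", "price": "object", "region": "object"}

diff = schema_diff(train_schema, test_schema)
print(diff)
```

With pandas, a mapping of this shape can be built from a DataFrame using `df.dtypes.astype(str).to_dict()`.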
Available Sections
- overview: Dataset summary and memory usage
- variables: Detailed variable analysis
- correlations: Correlation analysis
- target_analysis: Target-specific insights (requires target parameter)
- interactions: Feature interaction analysis
- missing_values: Missing value patterns
- duplicates: Duplicate record analysis
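Because section names are plain strings, a typo is easy to miss. The sketch below checks a selection against the names listed above before it is passed to `include_sections` or `exclude_sections`. The `check_sections` helper is hypothetical and not part of pytics; how pytics itself reacts to an unknown name is not documented here.

```python
AVAILABLE_SECTIONS = {
    "overview", "variables", "correlations", "target_analysis",
    "interactions", "missing_values", "duplicates",
}


def check_sections(sections, target=None):
    """Return the sections unchanged, or raise ValueError on a bad selection."""
    unknown = sorted(set(sections) - AVAILABLE_SECTIONS)
    if unknown:
        raise ValueError(f"Unknown section(s): {', '.join(unknown)}")
    if "target_analysis" in sections and target is None:
        raise ValueError("'target_analysis' requires a target column")
    return list(sections)


selected = check_sections(["overview", "correlations"])
print(selected)
```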
Report Sections
- Overview
  - Dataset summary
  - Memory usage
  - Data types distribution
  - Missing values summary
- DataFrame Summary
  - Complete DataFrame info output
  - Numerical and categorical statistics
  - Data preview (head/tail)
  - Memory usage details
- Variable Analysis
  - Detailed statistics
  - Distribution plots
  - Missing value patterns
  - Unique values analysis
- Correlations
  - Correlation matrix
  - Feature relationships
  - Interactive heatmaps
- Target Analysis (when target specified)
  - Target distribution
  - Feature importance
  - Target correlations
- Missing Values
  - Missing value patterns
  - Distribution analysis
  - Correlation with other features
- Duplicates
  - Duplicate record analysis
  - Pattern identification
  - Impact assessment
- About
  - Project information
  - Feature overview
  - GitHub repository links
Edge Cases and Limitations
Data Size Limits
- Recommended maximum rows: 1 million
- Recommended maximum columns: 1000
- Large datasets may require increased memory allocation
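One way to stay within the recommended row limit is to sample the DataFrame before profiling. The following is a minimal pandas sketch. The `sample_for_profiling` helper and its `seed` default are assumptions for illustration; pytics may or may not sample internally.

```python
import pandas as pd

MAX_ROWS = 1_000_000  # recommended maximum from the limits above


def sample_for_profiling(df, max_rows=MAX_ROWS, seed=42):
    """Return df unchanged if it is small enough, else a random sample."""
    if len(df) <= max_rows:
        return df
    # A fixed random_state keeps the sample reproducible between runs.
    return df.sample(n=max_rows, random_state=seed)


df = pd.DataFrame({"x": range(10)})
small = sample_for_profiling(df, max_rows=4)
print(len(small))  # 4

# With pytics installed, the sample could then be profiled:
# profile(sample_for_profiling(df), output_file='report.html')
```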
PDF Export Limitations
When exporting reports to PDF format:
- Plots are intentionally omitted due to a known issue with Kaleido version >= 0.2.1 that causes PDF export to hang indefinitely
- A message is displayed in place of each plot indicating it has been omitted
- All other report content (statistics, tables, etc.) remains fully functional
- For viewing plots, use the HTML export format which provides fully interactive visualizations
- If plots are required in the PDF, consider using pytics version 1.1.3, which supports them
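Since plots only appear in HTML output, the output format can be chosen from whether plots are needed. The sketch below builds matching `output_format` and `output_file` arguments. The `report_options` helper is hypothetical; only the two keyword names come from the configuration shown earlier.

```python
def report_options(stem, need_plots):
    """Build output_format/output_file keyword arguments for a report."""
    # HTML keeps the interactive plots; PDF omits them.
    fmt = "html" if need_plots else "pdf"
    return {"output_format": fmt, "output_file": f"{stem}.{fmt}"}


options = report_options("sales_profile", need_plots=True)
print(options)

# With pytics installed, the options could be passed on as:
# profile(df, **options)
```

Deriving the file extension from the format keeps the two arguments from drifting apart, for example a PDF report accidentally saved as `report.html`.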
Special Cases
- Missing Values: Automatically handled and reported
- Categorical Variables: Limited to 1000 unique values by default
- Date/Time: Automatically detected and analyzed
- Mixed Data Types: Handled with appropriate warnings
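To see ahead of time which columns would exceed the default limit of 1000 unique values, a check like the one below can be run on the DataFrame. It is a sketch under the assumption that every non-numeric, non-datetime column is treated as categorical; pytics's own detection rules may differ. The `high_cardinality_columns` helper is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def high_cardinality_columns(df, limit=1000):
    """Map each categorical-like column to its unique count if above limit."""
    result = {}
    for col in df.columns:
        series = df[col]
        # Skip numeric (including boolean) and datetime columns.
        if is_numeric_dtype(series) or is_datetime64_any_dtype(series):
            continue
        n_unique = int(series.nunique())
        if n_unique > limit:
            result[col] = n_unique
    return result


df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5"],
    "color": ["red", "blue", "red", "blue", "red"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# A small limit is used here only so the toy data triggers the check.
flagged = high_cardinality_columns(df, limit=3)
print(flagged)
```

Columns flagged this way, such as identifiers, are often candidates for dropping or grouping before profiling.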
Error Handling
- Custom exceptions for clear error reporting
- Warning system for non-critical issues
- Graceful degradation for memory constraints
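The names of the custom exception classes are not listed here, so the sketch below catches the generic `Exception` and assumes that non-critical issues are emitted through Python's standard `warnings` module. Both are assumptions, and `run_report` and `fake_profile` are hypothetical stand-ins; in real use the callable would be `profile` or `compare`.

```python
import warnings


def run_report(build_report, *args, **kwargs):
    """Call a report function; return (succeeded, list of warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        try:
            build_report(*args, **kwargs)
            succeeded = True
        except Exception as exc:  # narrow this once the exception types are known
            print(f"Report failed: {type(exc).__name__}: {exc}")
            succeeded = False
    return succeeded, [str(w.message) for w in caught]


def fake_profile(data, output_file):
    """Stand-in for a report function that emits a non-critical warning."""
    warnings.warn("column 'notes' has mixed types")


ok, messages = run_report(fake_profile, [1, 2, 3], output_file="report.html")
print(ok, messages)
```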
Best Practices
- Memory Management
  - Sample large datasets if needed
  - Use section selection for focused analysis
  - Monitor memory usage for big datasets
- Performance Optimization
  - Limit categorical variables when possible
  - Use targeted section selection
  - Consider data sampling for initial exploration
- Report Generation
  - Choose appropriate output format
  - Use meaningful report titles
  - Save reports with descriptive filenames
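As a rough way to monitor memory before profiling, the in-memory size of a DataFrame can be measured with pandas. This is a minimal sketch. The `memory_mb` helper is hypothetical, and what counts as too large depends on the machine.

```python
import pandas as pd


def memory_mb(df):
    """Approximate in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the contents of object (string) columns.
    return df.memory_usage(deep=True).sum() / 1024 ** 2


df = pd.DataFrame({"id": range(1000), "label": ["a"] * 1000})
print(f"{memory_mb(df):.2f} MB")
```

If the figure is high relative to available RAM, sampling the data or narrowing `include_sections` are the options suggested above.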
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. See the CONTRIBUTING.md file for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.