Automated data profiling and visualization tool
Project description
Data-Auto-Profiler
Data-Auto-Profiler is a powerful Python package designed to streamline the data analysis process by automatically generating comprehensive insights about your datasets. Whether you're a data scientist looking to quickly understand a new dataset or an analyst preparing a detailed data quality report, Data-Auto-Profiler provides the tools you need to uncover meaningful patterns and potential issues in your data.
Understanding Data-Auto-Profiler's Core Features
Data-Auto-Profiler excels at four key areas of data analysis:
Data Quality Assessment
At its heart, Data-Auto-Profiler helps you understand the reliability and completeness of your data. The package automatically evaluates your dataset for common quality issues by:
- Calculating an overall completeness score that tells you at a glance how much of your data is actually usable
- Identifying missing values and their patterns across different features
- Detecting numerical outliers that might skew your analysis
- Finding duplicate records that could affect your model's performance
- Determining the most appropriate data type for each column
Statistical Analysis
Data-Auto-Profiler performs a thorough statistical examination of your data, helping you understand the underlying distributions and characteristics of each feature. This includes:
- Computing essential descriptive statistics like mean, median, and standard deviation
- Analyzing the shape of your data distributions through skewness and kurtosis measurements
- Calculating variance to understand the spread of your numerical features
- Generating visualizations that make these statistics intuitive and actionable
Feature Relationships
Understanding how different features relate to each other is crucial for any data analysis project. Data-Auto-Profiler provides several methods to explore these relationships:
- Calculating Pearson correlations between numerical features to identify linear relationships
- Using Cramér's V analysis to understand associations between categorical variables
- Creating interactive pairplot visualizations that let you explore relationships visually
- Analyzing how each feature relates to your target variable
Predictive Power Assessment
For machine learning projects, Data-Auto-Profiler helps you understand which features might be most useful through:
- Information Value (IV) calculations that measure each feature's predictive strength
- Feature importance rankings that help you prioritize which variables to focus on
- Detailed analysis of how each feature relates to your target variable
Getting Started with Data-Auto-Profiler
Installation
First, install Data-Auto-Profiler using pip:
pip install data-auto-profiler
The package requires several common data science libraries:
pip install pandas numpy plotly scipy
Basic Usage
Here's a simple example to get you started:
import pandas as pd
from data_auto_profiler import AutoProfile
# Load your dataset
data = pd.read_csv('your_dataset.csv')
# Create an Data-Auto-Profiler instance
# The target_column parameter tells Data-Auto-Profiler which variable you're trying to predict
profiler = AutoProfile(data=data, target_column='target')
Detailed Analysis Methods
Let's explore each analysis method in detail:
Completeness Analysis
profiler.summary(completeness=True)
This generates an intuitive gauge chart showing your dataset's overall completeness on a 0-100% scale. The visualization uses color coding to quickly communicate data quality:
- Green (≥80%): Excellent completeness
- Orange (50-80%): Moderate completeness
- Red (≤50%): Poor completeness
Missing Value Analysis
profiler.summary(missing=True)
This creates a detailed bar chart showing the percentage of missing values in each column. This visualization helps you:
- Identify which features have the most missing data
- Understand patterns in data collection issues
- Make informed decisions about imputation strategies
Outlier Detection
profiler.summary(outliers=True)
Using the Interquartile Range (IQR) method, this analysis identifies statistical outliers in your numerical features. The resulting visualization shows:
- The number of outliers per feature
- Their distribution within the data
- Potential data quality issues that need investigation
Feature Importance Analysis with Information Value/gain
profiler.summary(iv_analysis=True)
This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:
- Very strong predictors (IV > 0.3)
- Strong predictors (0.1 ≤ IV < 0.3)
- Medium predictors (0.02 ≤ IV < 0.1)
- Weak predictors (IV < 0.02)
Cramér's V Association Analysis
profiler.summary(cramers_v_analysis=True)
This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:
- Very strong predictors (IV > 0.3)
- Strong predictors (0.1 ≤ IV < 0.3)
- Medium predictors (0.02 ≤ IV < 0.1)
- Weak predictors (IV < 0.02)
Correlation Analysis
profiler.summary(correlation=True)
This analysis calculates the correlation between all pairs of numerical features, providing insights into:
- Pearson correlation for numeric features
- Cramér's V for categorical features
Autodistribution plots for a given column
profiler.summary(distribution='column_name')
This analysis generates an autodistribution plot for the specified column, providing insights into its distribution and potential data skewedness. it accounts for binary columns, categorical columns and numeric columns.
Important Considerations
When using Data-Auto-Profiler, keep these points in mind:
-
Memory Usage: The pairplot analysis can be memory-intensive for large datasets with many features. Consider using it selectively on smaller feature sets.
-
Performance: Some analyses, particularly Cramér's V calculations for categorical variables, may take longer with large datasets or features with many unique values.
-
Target Variable Requirements: For certain analyses like Information Value calculations, your target variable must be numeric.
-
Missing Value Handling: While Data-Auto-Profiler handles missing values automatically, their presence may affect certain statistical calculations.
📋 Requirements
- Python 3.8+
- pandas
- numpy
- plotly
- scipy
Contributing to Data-Auto-Profiler
We welcome contributions from the community! If you'd like to improve Data-Auto-Profiler:
- Fork the repository
- Create a new branch for your feature
- Submit a pull request with your changes
For significant changes, please open an issue first to discuss your proposed modifications.
Contribution Guidelines
- Follow PEP 8 Style Guide
- Write Comprehensive Tests
- Document New Features
- Maintain Code Quality
License
Data-Auto-Profiler is available under the MIT License, allowing for both personal and commercial use with proper attribution.
📞 Support
- Open GitHub Issues
- Email: [maponyacl@gmail.com]
🌟 Acknowledgements
- Inspired by data science community
- Built with ❤️ for data explorers
🚀 Future Roadmap
- Machine Learning Model Integration
- Advanced Anomaly Detection
- Enhanced Visualization Themes
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file data_auto_profiler-0.1.1.tar.gz.
File metadata
- Download URL: data_auto_profiler-0.1.1.tar.gz
- Upload date:
- Size: 15.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a3558f06b8a867315f9af29385bc06e4d66a76feca101af73dd768f7e9cde9fd
|
|
| MD5 |
4b083b32204dc30279c9b7ad7fa42dcc
|
|
| BLAKE2b-256 |
1f53c9d04a47dd7f0fc3decae763162cb02ea5a51332513e5a36be763cfa8eec
|
File details
Details for the file data_auto_profiler-0.1.1-py3-none-any.whl.
File metadata
- Download URL: data_auto_profiler-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
92fd3af09d208ca1614357f0ef5cbfcb2c1ea54a0d7d82ccb15536650e295de7
|
|
| MD5 |
f92fe60fc57953f24986b9535cd599cc
|
|
| BLAKE2b-256 |
c39cdbf98162d53c95b1450f6a1c0fd8b68caf335c61bc6885696e3a596f2edc
|