Automated data profiling and visualization tool

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Data-Auto-Profiler

Data-Auto-Profiler is a powerful Python package designed to streamline the data analysis process by automatically generating comprehensive insights about your datasets. Whether you're a data scientist looking to quickly understand a new dataset or an analyst preparing a detailed data quality report, Data-Auto-Profiler provides the tools you need to uncover meaningful patterns and potential issues in your data.

Understanding Data-Auto-Profiler's Core Features

Data-Auto-Profiler covers four key areas of data analysis:

Data Quality Assessment

At its heart, Data-Auto-Profiler helps you understand the reliability and completeness of your data. The package automatically evaluates your dataset for common quality issues by:

Calculating an overall completeness score that tells you at a glance how much of your data is actually usable
Identifying missing values and their patterns across different features
Detecting numerical outliers that might skew your analysis
Finding duplicate records that could affect your model's performance
Determining the most appropriate data type for each column
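Two of the checks listed above, duplicate detection and data type inference, can be sketched with plain pandas. This is only an illustration under stated assumptions: the helper `infer_kind` is a hypothetical name, and Data-Auto-Profiler's own type inference may use different rules.

```python
import pandas as pd


def infer_kind(col: pd.Series) -> str:
    """Label a column 'numeric' if every non-missing value parses as a number."""
    present = col.dropna()
    if present.empty:
        return "empty"
    parsed = pd.to_numeric(present, errors="coerce")
    return "numeric" if parsed.notna().all() else "text"


df = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "score": ["0.5", "0.7", "0.7", "0.9"],
    "name": ["ann", "bob", "bob", "cy"],
})

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates = int(df.duplicated().sum())
kinds = {name: infer_kind(df[name]) for name in df.columns}

print(n_duplicates)
print(kinds)
```

Here the numeric columns are stored as strings, which is a common situation after loading a CSV with mixed content.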

Statistical Analysis

Data-Auto-Profiler performs a thorough statistical examination of your data, helping you understand the underlying distributions and characteristics of each feature. This includes:

Computing essential descriptive statistics like mean, median, and standard deviation
Analyzing the shape of your data distributions through skewness and kurtosis measurements
Calculating variance to understand the spread of your numerical features
Generating visualizations that make these statistics intuitive and actionable
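For reference, the statistics named above are all available as pandas methods. The sketch below collects them into one table per numeric column; the column names and data are made up, and the layout of Data-Auto-Profiler's own output may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],    # symmetric values
    "b": [1.0, 1.0, 2.0, 2.0, 10.0],   # one large value creates a right tail
})

# One row per column, one column per statistic
stats = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),
    "variance": df.var(),
    "skewness": df.skew(),
    "kurtosis": df.kurt(),
})
print(stats)
```

A skewness near zero (column `a`) suggests a symmetric distribution, while a clearly positive value (column `b`) points to a long right tail.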

Feature Relationships

Understanding how different features relate to each other is crucial for any data analysis project. Data-Auto-Profiler provides several methods to explore these relationships:

Calculating Pearson correlations between numerical features to identify linear relationships
Using Cramér's V analysis to understand associations between categorical variables
Creating interactive pairplot visualizations that let you explore relationships visually
Analyzing how each feature relates to your target variable

Predictive Power Assessment

For machine learning projects, Data-Auto-Profiler helps you understand which features might be most useful through:

Information Value (IV) calculations that measure each feature's predictive strength
Feature importance rankings that help you prioritize which variables to focus on
Detailed analysis of how each feature relates to your target variable

Getting Started with Data-Auto-Profiler

Installation

First, install Data-Auto-Profiler using pip:

pip install data-auto-profiler

The package requires several common data science libraries:

pip install pandas numpy plotly scipy

Basic Usage

Here's a simple example to get you started:

import pandas as pd
from data_auto_profiler import AutoProfile

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Create a Data-Auto-Profiler instance
# The target_column parameter tells Data-Auto-Profiler which variable you're trying to predict
profiler = AutoProfile(data=data, target_column='target')

Detailed Analysis Methods

Let's explore each analysis method in detail:

Completeness Analysis

profiler.summary(completeness=True)

This generates an intuitive gauge chart showing your dataset's overall completeness on a 0-100% scale. The visualization uses color coding to quickly communicate data quality:

Green (≥80%): Excellent completeness
Orange (above 50% and below 80%): Moderate completeness
Red (≤50%): Poor completeness
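A completeness score of this kind is commonly defined as the share of non-missing cells in the whole table. The sketch below computes that with pandas and maps the result to a color band. The function names `completeness_score` and `completeness_band` are hypothetical, and the package may define or bin the score differently.

```python
import numpy as np
import pandas as pd


def completeness_score(df: pd.DataFrame) -> float:
    """Return the percentage of non-missing cells in the DataFrame."""
    if df.size == 0:
        return 0.0
    return float(df.notna().to_numpy().mean() * 100)


def completeness_band(score: float) -> str:
    """Map a 0-100 score to a color band (assumed boundaries)."""
    if score >= 80:
        return "green"
    if score > 50:
        return "orange"
    return "red"


df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Paris", "Lima", None, "Oslo"],
})
score = completeness_score(df)
print(f"{score:.1f}% complete -> {completeness_band(score)}")
```

In this example two of the eight cells are missing, so the score is 75% and falls in the orange band.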

Missing Value Analysis

profiler.summary(missing=True)

This creates a detailed bar chart showing the percentage of missing values in each column. This visualization helps you:

Identify which features have the most missing data
Understand patterns in data collection issues
Make informed decisions about imputation strategies
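The numbers behind such a chart are straightforward to reproduce. As a minimal sketch with made-up data, the per-column missing percentage can be computed in pandas like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, np.nan, 61000],
    "segment": ["a", "b", None, "a"],
    "id": [1, 2, 3, 4],
})

# isna().mean() is the fraction of missing values per column
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that most need attention first, which is usually the order you want on a bar chart.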

Outlier Detection

profiler.summary(outliers=True)

Using the Interquartile Range (IQR) method, this analysis identifies statistical outliers in your numerical features. The resulting visualization shows:

The number of outliers per feature
Their distribution within the data
Potential data quality issues that need investigation
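The IQR rule flags a value as an outlier when it lies more than k times the interquartile range below the first quartile or above the third quartile. The sketch below counts such values per numeric column. The multiplier k = 1.5 is the conventional default and is an assumption here, as is the function name `iqr_outlier_counts`; the package may use a different multiplier.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own bounds
    mask = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return mask.sum()


df = pd.DataFrame({
    "price": [10, 11, 12, 13, 14, 100],
    "qty": [1, 2, 3, 4, 5, 6],
    "label": list("abcdef"),
})
counts = iqr_outlier_counts(df)
print(counts)
```

Non-numeric columns such as `label` are skipped, and missing values never count as outliers because comparisons with NaN evaluate to False.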

Feature Importance Analysis with Information Value/gain

profiler.summary(iv_analysis=True)

This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:

Very strong predictors (IV ≥ 0.3)
Strong predictors (0.1 ≤ IV < 0.3)
Medium predictors (0.02 ≤ IV < 0.1)
Weak predictors (IV < 0.02)
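For context, Information Value is usually defined for a binary target as the sum over bins of (share of non-events minus share of events) times the natural log of their ratio, the latter being the Weight of Evidence. The sketch below follows that textbook definition. The quantile binning, the 0.5 smoothing term, and the function name `information_value` are assumptions made for illustration and are not taken from the package.

```python
import numpy as np
import pandas as pd


def information_value(feature: pd.Series, target: pd.Series, bins: int = 5) -> float:
    """Information Value of one feature against a binary 0/1 target."""
    # Numeric features are cut into quantile bins; others use their own categories
    if pd.api.types.is_numeric_dtype(feature):
        groups = pd.qcut(feature, q=bins, duplicates="drop")
    else:
        groups = feature
    table = pd.crosstab(groups, target).reindex(columns=[0, 1], fill_value=0)
    n_groups = len(table)
    # Adding 0.5 to every cell avoids log(0) and division by zero
    non_events = (table[0] + 0.5) / (table[0].sum() + 0.5 * n_groups)
    events = (table[1] + 0.5) / (table[1].sum() + 0.5 * n_groups)
    woe = np.log(non_events / events)
    return float(((non_events - events) * woe).sum())


rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=500))
informative = pd.Series(2.0 * target + rng.normal(size=500))  # shifted by the target
noise = pd.Series(rng.normal(size=500))                       # unrelated to the target

iv_informative = information_value(informative, target)
iv_noise = information_value(noise, target)
print(f"informative: {iv_informative:.3f}, noise: {iv_noise:.3f}")
```

The feature that is shifted by the target should score far above the noise feature, which is the behavior the thresholds above are meant to capture.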

Cramér's V Association Analysis

profiler.summary(cramers_v_analysis=True)

This analysis calculates Cramér's V between categorical features, measuring how strongly they are associated. Values range from 0 (no association) to 1 (perfect association).
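Cramér's V itself is derived from the chi-squared statistic of a contingency table: V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below implements that formula with `scipy.stats.chi2_contingency`. The function name `cramers_v` is hypothetical, and the package may apply a bias correction that this version omits.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y)
    if min(table.shape) < 2:
        return 0.0  # a constant column carries no association
    chi2 = chi2_contingency(table.to_numpy(), correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))


color = pd.Series(["red", "red", "blue", "blue", "green", "green"])
code = pd.Series(["r", "r", "b", "b", "g", "g"])   # determined by color
side = pd.Series(["l", "r", "l", "r", "l", "r"])   # independent of color

v_dependent = cramers_v(color, code)
v_independent = cramers_v(color, side)
print(v_dependent, v_independent)
```

A value near 1 means one variable almost determines the other, while a value near 0 means the two look independent in the sample.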

Correlation Analysis

profiler.summary(correlation=True)

This analysis measures the pairwise association between features, using:

Pearson correlation for numeric features
Cramér's V for categorical features
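The numeric half of this analysis corresponds to what pandas exposes as `DataFrame.corr`. As a minimal sketch with made-up data, the following computes the Pearson matrix and lists each pair once, strongest first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
    "group": list("aabba"),
})

# Pearson correlation over numeric columns only
corr = df.select_dtypes(include="number").corr(method="pearson")

# Keep each pair once (upper triangle) and rank by absolute strength
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Ranking by absolute value keeps strong negative relationships, such as `x` and `z` here, near the top of the list alongside strong positive ones.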

Autodistribution plots for a given column

profiler.summary(distribution='column_name')

This analysis generates an autodistribution plot for the specified column, showing the shape of its distribution and any skewness. It handles binary, categorical, and numeric columns.
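One plausible way to pick a plot per column type is to classify the column first and then compute skewness only for numeric data. The sketch below shows that idea; the function `column_kind` and its rules are hypothetical and may not match how Data-Auto-Profiler decides.

```python
import pandas as pd


def column_kind(col: pd.Series) -> str:
    """Classify a column as 'binary', 'numeric', or 'categorical'."""
    values = col.dropna()
    if values.nunique() == 2:
        return "binary"
    if pd.api.types.is_numeric_dtype(values):
        return "numeric"
    return "categorical"


df = pd.DataFrame({
    "churned": [0, 1, 1, 0, 0, 1],
    "plan": ["basic", "pro", "basic", "team", "pro", "basic"],
    "income": [20.0, 22.0, 25.0, 27.0, 30.0, 200.0],
})

kinds = {name: column_kind(df[name]) for name in df.columns}
income_skew = df["income"].skew()  # positive values indicate a right tail
print(kinds)
print(f"income skewness: {income_skew:.2f}")
```

A binary or categorical column would typically get a bar chart of value counts, while a numeric column would get a histogram, with the skewness value indicating whether a transformation might help.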

Important Considerations

When using Data-Auto-Profiler, keep these points in mind:

Memory Usage: The pairplot analysis can be memory-intensive for large datasets with many features. Consider using it selectively on smaller feature sets.
Performance: Some analyses, particularly Cramér's V calculations for categorical variables, may take longer with large datasets or features with many unique values.
Target Variable Requirements: For certain analyses like Information Value calculations, your target variable must be numeric.
Missing Value Handling: While Data-Auto-Profiler handles missing values automatically, their presence may affect certain statistical calculations.

📋 Requirements

Python 3.8+
pandas
numpy
plotly
scipy

Contributing to Data-Auto-Profiler

We welcome contributions from the community! If you'd like to improve Data-Auto-Profiler:

Fork the repository
Create a new branch for your feature
Submit a pull request with your changes

For significant changes, please open an issue first to discuss your proposed modifications.

Contribution Guidelines

Follow PEP 8 Style Guide
Write Comprehensive Tests
Document New Features
Maintain Code Quality

License

Data-Auto-Profiler is available under the MIT License, allowing for both personal and commercial use with proper attribution.

📞 Support

Open GitHub Issues
Email: maponyacl@gmail.com

🌟 Acknowledgements

Inspired by the data science community
Built with ❤️ for data explorers

🚀 Future Roadmap

Machine Learning Model Integration
Advanced Anomaly Detection
Enhanced Visualization Themes

Release history

This version

0.1.1

Jan 2, 2025

Download files

Source Distribution

data_auto_profiler-0.1.1.tar.gz (15.6 kB view details)

Uploaded Jan 2, 2025 Source

Built Distribution

data_auto_profiler-0.1.1-py3-none-any.whl (12.9 kB view details)

Uploaded Jan 2, 2025 Python 3

File details

Details for the file data_auto_profiler-0.1.1.tar.gz.

File metadata

Download URL: data_auto_profiler-0.1.1.tar.gz
Upload date: Jan 2, 2025
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.4

File hashes

Hashes for data_auto_profiler-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a3558f06b8a867315f9af29385bc06e4d66a76feca101af73dd768f7e9cde9fd`
MD5	`4b083b32204dc30279c9b7ad7fa42dcc`
BLAKE2b-256	`1f53c9d04a47dd7f0fc3decae763162cb02ea5a51332513e5a36be763cfa8eec`

File details

Details for the file data_auto_profiler-0.1.1-py3-none-any.whl.

File metadata

Download URL: data_auto_profiler-0.1.1-py3-none-any.whl
Upload date: Jan 2, 2025
Size: 12.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.4

File hashes

Hashes for data_auto_profiler-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`92fd3af09d208ca1614357f0ef5cbfcb2c1ea54a0d7d82ccb15536650e295de7`
MD5	`f92fe60fc57953f24986b9535cd599cc`
BLAKE2b-256	`c39cdbf98162d53c95b1450f6a1c0fd8b68caf335c61bc6885696e3a596f2edc`
