Skip to main content

Automated data profiling and visualization tool

Project description

Data-Auto-Profiler

Data-Auto-Profiler is a powerful Python package designed to streamline the data analysis process by automatically generating comprehensive insights about your datasets. Whether you're a data scientist looking to quickly understand a new dataset or an analyst preparing a detailed data quality report, Data-Auto-Profiler provides the tools you need to uncover meaningful patterns and potential issues in your data.

Understanding Data-Auto-Profiler's Core Features

Data-Auto-Profiler excels at four key areas of data analysis:

Data Quality Assessment

At its heart, Data-Auto-Profiler helps you understand the reliability and completeness of your data. The package automatically evaluates your dataset for common quality issues by:

  • Calculating an overall completeness score that tells you at a glance how much of your data is actually usable
  • Identifying missing values and their patterns across different features
  • Detecting numerical outliers that might skew your analysis
  • Finding duplicate records that could affect your model's performance
  • Determining the most appropriate data type for each column

Statistical Analysis

Data-Auto-Profiler performs a thorough statistical examination of your data, helping you understand the underlying distributions and characteristics of each feature. This includes:

  • Computing essential descriptive statistics like mean, median, and standard deviation
  • Analyzing the shape of your data distributions through skewness and kurtosis measurements
  • Calculating variance to understand the spread of your numerical features
  • Generating visualizations that make these statistics intuitive and actionable

Feature Relationships

Understanding how different features relate to each other is crucial for any data analysis project. Data-Auto-Profiler provides several methods to explore these relationships:

  • Calculating Pearson correlations between numerical features to identify linear relationships
  • Using Cramér's V analysis to understand associations between categorical variables
  • Creating interactive pairplot visualizations that let you explore relationships visually
  • Analyzing how each feature relates to your target variable

Predictive Power Assessment

For machine learning projects, Data-Auto-Profiler helps you understand which features might be most useful through:

  • Information Value (IV) calculations that measure each feature's predictive strength
  • Feature importance rankings that help you prioritize which variables to focus on
  • Detailed analysis of how each feature relates to your target variable

Getting Started with Data-Auto-Profiler

Installation

First, install Data-Auto-Profiler using pip:

pip install data-auto-profiler

The package requires several common data science libraries:

pip install pandas numpy plotly scipy

Basic Usage

Here's a simple example to get you started:

import pandas as pd
from data_auto_profiler import AutoProfile

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Create an Data-Auto-Profiler instance
# The target_column parameter tells Data-Auto-Profiler which variable you're trying to predict
profiler = AutoProfile(data=data, target_column='target')

Detailed Analysis Methods

Let's explore each analysis method in detail:

Completeness Analysis

profiler.summary(completeness=True)

This generates an intuitive gauge chart showing your dataset's overall completeness on a 0-100% scale. The visualization uses color coding to quickly communicate data quality:

  • Green (≥80%): Excellent completeness
  • Orange (50-80%): Moderate completeness
  • Red (≤50%): Poor completeness

Missing Value Analysis

profiler.summary(missing=True)

This creates a detailed bar chart showing the percentage of missing values in each column. This visualization helps you:

  • Identify which features have the most missing data
  • Understand patterns in data collection issues
  • Make informed decisions about imputation strategies

Outlier Detection

profiler.summary(outliers=True)

Using the Interquartile Range (IQR) method, this analysis identifies statistical outliers in your numerical features. The resulting visualization shows:

  • The number of outliers per feature
  • Their distribution within the data
  • Potential data quality issues that need investigation

Feature Importance Analysis with Information Value/gain

profiler.summary(iv_analysis=True)

This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:

  • Very strong predictors (IV > 0.3)
  • Strong predictors (0.1 ≤ IV < 0.3)
  • Medium predictors (0.02 ≤ IV < 0.1)
  • Weak predictors (IV < 0.02)

Cramér's V Association Analysis

profiler.summary(cramers_v_analysis=True)

This analysis calculates the Information Value (IV) for each feature, helping you understand their predictive power:

  • Very strong predictors (IV > 0.3)
  • Strong predictors (0.1 ≤ IV < 0.3)
  • Medium predictors (0.02 ≤ IV < 0.1)
  • Weak predictors (IV < 0.02)

Correlation Analysis

profiler.summary(correlation=True)

This analysis calculates the correlation between all pairs of numerical features, providing insights into:

  • Pearson correlation for numeric features
  • Cramér's V for categorical features

Autodistribution plots for a given column

profiler.summary(distribution='column_name')

This analysis generates an autodistribution plot for the specified column, providing insights into its distribution and potential data skewedness. it accounts for binary columns, categorical columns and numeric columns.

Important Considerations

When using Data-Auto-Profiler, keep these points in mind:

  1. Memory Usage: The pairplot analysis can be memory-intensive for large datasets with many features. Consider using it selectively on smaller feature sets.

  2. Performance: Some analyses, particularly Cramér's V calculations for categorical variables, may take longer with large datasets or features with many unique values.

  3. Target Variable Requirements: For certain analyses like Information Value calculations, your target variable must be numeric.

  4. Missing Value Handling: While Data-Auto-Profiler handles missing values automatically, their presence may affect certain statistical calculations.

📋 Requirements

  • Python 3.8+
  • pandas
  • numpy
  • plotly
  • scipy

Contributing to Data-Auto-Profiler

We welcome contributions from the community! If you'd like to improve Data-Auto-Profiler:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Submit a pull request with your changes

For significant changes, please open an issue first to discuss your proposed modifications.

Contribution Guidelines

  • Follow PEP 8 Style Guide
  • Write Comprehensive Tests
  • Document New Features
  • Maintain Code Quality

License

Data-Auto-Profiler is available under the MIT License, allowing for both personal and commercial use with proper attribution.

📞 Support

🌟 Acknowledgements

  • Inspired by data science community
  • Built with ❤️ for data explorers

🚀 Future Roadmap

  • Machine Learning Model Integration
  • Advanced Anomaly Detection
  • Enhanced Visualization Themes

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_auto_profiler-0.1.1.tar.gz (15.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data_auto_profiler-0.1.1-py3-none-any.whl (12.9 kB view details)

Uploaded Python 3

File details

Details for the file data_auto_profiler-0.1.1.tar.gz.

File metadata

  • Download URL: data_auto_profiler-0.1.1.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.4

File hashes

Hashes for data_auto_profiler-0.1.1.tar.gz
Algorithm Hash digest
SHA256 a3558f06b8a867315f9af29385bc06e4d66a76feca101af73dd768f7e9cde9fd
MD5 4b083b32204dc30279c9b7ad7fa42dcc
BLAKE2b-256 1f53c9d04a47dd7f0fc3decae763162cb02ea5a51332513e5a36be763cfa8eec

See more details on using hashes here.

File details

Details for the file data_auto_profiler-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for data_auto_profiler-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 92fd3af09d208ca1614357f0ef5cbfcb2c1ea54a0d7d82ccb15536650e295de7
MD5 f92fe60fc57953f24986b9535cd599cc
BLAKE2b-256 c39cdbf98162d53c95b1450f6a1c0fd8b68caf335c61bc6885696e3a596f2edc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page