
megaprofiler is a highly customizable and extensible data profiling library designed to help data scientists and engineers understand their datasets before performing analysis or building models.


MegaProfiler is a customizable Python library for profiling and analyzing datasets. It reports on a dataset's structure, distributions, missing values, and anomalies, and it includes built-in support for data validation, anomaly detection, and data drift tracking. It is aimed at data scientists and engineers who want to automate exploratory data analysis (EDA) and quality checks on large datasets.

Other profiling libraries exist, such as pandas-profiling (since renamed ydata-profiling). MegaProfiler aims to stand out through its extensibility, scalability, and integrated data validation and anomaly detection, which make it a fit for data preprocessing and ETL pipelines.


Key Features

  • Automatic Data Summaries:

    • Automatically generate statistical summaries, distributions, unique values, missing values, and data types for each column.
  • Anomaly Detection:

    • Flag unusual distributions, outliers, or inconsistent data using z-score, IQR, or machine learning techniques (e.g., Isolation Forest).
  • Data Validation:

    • Set custom validation rules (e.g., no missing values in specific columns, data type constraints) and receive alerts for rule violations.
  • Custom Reports:

    • Generate configurable reports in various formats (e.g., HTML, PDF), with customizable thresholds for anomalies.
  • Data Drift Detection:

    • Track changes in data distributions over time to detect shifts in data quality or content, useful for continuous monitoring of data pipelines.
  • Multicollinearity and Correlation Analysis:

    • Perform advanced correlation analysis and detect multicollinearity with Variance Inflation Factor (VIF).
  • Time Series Analysis:

    • Decompose and analyze time series data to identify trends, seasonality, and residuals.
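
The z-score and IQR techniques named under Anomaly Detection can be sketched in a few lines of standard-library Python. This is an illustration of the two rules only, not MegaProfiler's API (which this page does not show); the function names `zscore_outliers` and `iqr_outliers`, their default thresholds, and the sample data are assumptions made for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # constant column: no value deviates from the mean
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one obviously bad value.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 11.0, 10.0, 9.0, 11.0, 95.0]

print(iqr_outliers(readings))                   # [95.0]
print(zscore_outliers(readings, threshold=2.5)) # [95.0]
print(zscore_outliers(readings))                # [] with the default of 3.0
```

The last call shows why the threshold matters: with only ten values, a single extreme point inflates the standard deviation enough that its own z-score stays just under 3, while the IQR rule still flags it.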

Benefits

MegaProfiler is useful for:

  • Data Scientists and Engineers: It automates exploratory data analysis, saving valuable time and reducing manual inspection of large datasets.
  • ETL Pipelines: Easily detect issues such as missing data, outliers, or data drift, and ensure the quality of data moving through your pipeline.
  • Data Quality Assurance: Validate the integrity of your data before model training or analysis, minimizing the risk of poor model performance due to flawed data.
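
To make the pipeline checks listed above concrete, here is a minimal sketch of a validation rule check and a drift measure in standard-library Python. It does not use MegaProfiler's API, which this page does not document; `find_violations`, `ks_statistic`, and the sample batches are hypothetical names and data chosen for illustration. Drift is measured here with the two-sample Kolmogorov-Smirnov statistic, which is one common choice and not necessarily what the library uses.

```python
from bisect import bisect_right


def find_violations(rows, required=(), types=None):
    """Check a list of dict rows against simple rules and describe each violation."""
    types = types or {}
    violations = []
    for i, row in enumerate(rows):
        # Rule 1: required columns must not be missing (None or absent).
        for col in required:
            if row.get(col) is None:
                violations.append(f"row {i}: '{col}' is missing")
        # Rule 2: present values must have the expected Python type.
        for col, expected in types.items():
            value = row.get(col)
            if value is not None and not isinstance(value, expected):
                violations.append(f"row {i}: '{col}' should be {expected.__name__}")
    return violations


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gaps = (
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in set(ref) | set(cur)
    )
    return max(gaps)


batch = [
    {"id": 1, "amount": 9.5},
    {"id": None, "amount": "12"},  # missing id, amount stored as text
]
print(find_violations(batch, required=("id",), types={"amount": float}))

last_week = [1, 2, 3, 4]
today = [3, 4, 5, 6]
print(ks_statistic(last_week, today))  # 0.5
```

In a pipeline, a step like this would typically fail the run or raise an alert when the violation list is non-empty or the drift statistic crosses a chosen threshold.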

Installation

You can install MegaProfiler using pip:

pip install megaprofiler
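
After installing, one way to confirm that the package is visible to your interpreter is to query its metadata with the standard library. The helper name `installed_version` is an assumption for this sketch; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the install did not succeed
# in the environment this interpreter is using.
print(installed_version("megaprofiler"))
```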


Download files


Source Distribution

megaprofiler-1.0.0.tar.gz (9.4 kB view details)

Uploaded Source

Built Distribution


megaprofiler-1.0.0-py3-none-any.whl (11.3 kB view details)

Uploaded Python 3

File details

Details for the file megaprofiler-1.0.0.tar.gz.

File metadata

  • Download URL: megaprofiler-1.0.0.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for megaprofiler-1.0.0.tar.gz
Algorithm Hash digest
SHA256 4d2a573d7f4ae9ce569de4a0daf33e832bc625ca84b7578a23a0f2d2c1814f1e
MD5 ec30c61df06cc9e24beb29507890d597
BLAKE2b-256 fd2c325b67ef7c5395b0bc2bc7bea8ccc3aee5663e809809ffd1d39a4b3728f4


File details

Details for the file megaprofiler-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: megaprofiler-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for megaprofiler-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 52a7ec40e2ca27b584fff7bec340e264f8ab66c121238a29eb427d727638ab6d
MD5 ebd1ad705a1f583de16a7aadf0b49e0d
BLAKE2b-256 e18d84efde276b1a5dabe3d8e648ebfb56cfb39612b81862bbe6fac809edc566

