Generate profile report for pandas DataFrame

pandas-profiling

Documentation | Slack | Stack Overflow | Latest changelog

Do you like this project? Show us your love and give feedback!

pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding.

For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:

  • Type inference: detect the types of columns in a DataFrame
  • Essentials: type, unique values, indication of missing values
  • Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent and extreme values
  • Histograms: categorical and numerical
  • Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
  • Missing values: through counts, matrix and heatmap
  • Duplicate rows: list of the most common duplicated rows
  • Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
  • File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
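
To make the quantile and descriptive statistics listed above concrete, here is a minimal sketch that computes several of them for a single numeric column using only the Python standard library. It is an illustration, not the way pandas-profiling computes them: the function name column_summary is hypothetical, and quartile conventions differ between libraries, so Q1 and Q3 may not match the report exactly.

```python
import statistics


def column_summary(values):
    """Summarize one numeric column given as a list with at least two values."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,  # interquartile range
        "mean": mean,
        "std": std,
        # Coefficient of variation is undefined when the mean is zero.
        "cv": std / mean if mean else float("nan"),
        "sum": sum(values),
    }


summary = column_summary([1, 2, 3, 4, 5, 6, 7])
for name, value in summary.items():
    print(f"{name}: {value}")
```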

The report contains three additional sections:

  • Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
  • Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
  • Reproduction: technical details about the analysis (time, version and configuration)
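
As a rough idea of what checks such as "missing values", "constant values" and "zeros" look for, the sketch below flags them on a plain dictionary of columns. This is not the library's alert logic: the function simple_alerts and both thresholds are hypothetical and chosen only for illustration.

```python
def simple_alerts(columns, missing_threshold=0.5, zeros_threshold=0.25):
    """Return (column, alert) pairs for a dict mapping column names to lists.

    None marks a missing value. Both thresholds are illustrative assumptions.
    """
    alerts = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not values:
            continue
        # Share of missing entries in the column.
        if 1 - len(present) / len(values) > missing_threshold:
            alerts.append((name, "missing"))
        # A single distinct non-missing value means the column is constant.
        if present and len(set(present)) == 1:
            alerts.append((name, "constant"))
        # Share of zeros among the non-missing entries.
        if present and sum(1 for v in present if v == 0) / len(present) > zeros_threshold:
            alerts.append((name, "zeros"))
    return alerts


example = {
    "a": [1, 1, 1],
    "b": [None, None, None, 2, 3],
    "c": [0, 0, 1, 2],
}
print(simple_alerts(example))
```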

⚡ Looking for a Spark backend to profile large datasets? It's work in progress.

⌛ Interested in uncovering temporal patterns? Check out popmon.

▶️ Quickstart

Start by loading your pandas DataFrame as you normally would, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the standard profiling report, merely run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Using inside Jupyter Notebooks

There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.

To display the report as a set of widgets, run the following in a Jupyter Notebook:

profile.to_widgets()

The HTML report can be directly embedded in a cell in a similar fashion:

profile.to_notebook_iframe()

Exporting the report to a file

To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")
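
Once the file is written, it can be opened in a browser. The sketch below uses only the standard library. The helper names report_uri and open_report are hypothetical, and the file name your_report.html is simply the one from the example above.

```python
import webbrowser
from pathlib import Path


def report_uri(path):
    """Return a file:// URI for a report path, resolved to an absolute path."""
    return Path(path).resolve().as_uri()


def open_report(path):
    """Open the report in the default browser; return False if the file is missing."""
    if not Path(path).is_file():
        return False
    return webbrowser.open(report_uri(path))


print(report_uri("your_report.html"))
```

Calling open_report("your_report.html") after profile.to_file(...) would then launch the default browser on the generated file.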

Alternatively, the report's data can be obtained as a JSON string or written to a JSON file:

# As a JSON string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
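
The JSON string can be inspected with the standard json module. The sketch below lists the top-level keys of a report string without assuming what they are. The helper top_level_sections is hypothetical, and the sample string is a stand-in with made-up key names; in practice you would pass the value returned by profile.to_json().

```python
import json


def top_level_sections(json_data):
    """Return the sorted top-level keys of a JSON report string."""
    parsed = json.loads(json_data)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(parsed)


# Stand-in for profile.to_json(); the key names here are made up.
sample = '{"b_section": {}, "a_section": []}'
print(top_level_sections(sample))
```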

Using in the command line

For standard formatted CSV files (which can be read directly by pandas without additional settings), the pandas_profiling executable can be used in the command line. The example below processes a data.csv dataset and writes a report titled Example Profiling Report to report.html, using a configuration file called default.yaml.

pandas_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html

Additional details on the CLI are available in the documentation.
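
When the command needs to be launched from a script, for example once per CSV file, it can be assembled and run with subprocess. This is a minimal sketch that uses only the two options shown above (--title and --config_file). The helper names build_command and run_profiling are hypothetical, and running the command assumes the pandas_profiling executable is on the PATH.

```python
import shlex
import subprocess


def build_command(data_csv, report_html, title=None, config_file=None):
    """Assemble the argument list for the pandas_profiling executable."""
    cmd = ["pandas_profiling"]
    if title is not None:
        cmd += ["--title", title]
    if config_file is not None:
        cmd += ["--config_file", config_file]
    # Positional arguments: input dataset first, output report second.
    cmd += [data_csv, report_html]
    return cmd


def run_profiling(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


cmd = build_command(
    "data.csv",
    "report.html",
    title="Example Profiling Report",
    config_file="default.yaml",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces.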

👀 Examples

The following example reports showcase the capabilities of the package across a wide range of datasets and data types:

🛠️ Installation

Additional details, including information about widget support, are available in the documentation.

Using pip

You can install using the pip package manager by running:

pip install -U pandas-profiling

Extras

The package declares "extras", sets of additional dependencies.

  • [notebook]: support for rendering the report in Jupyter notebook widgets.
  • [unicode]: support for more detailed Unicode analysis, at the expense of additional disk space.

Install these with e.g.

pip install -U "pandas-profiling[notebook,unicode]"

Using conda

You can install using the conda package manager by running:

conda install -c conda-forge pandas-profiling

From source (development)

Download the source code by cloning the repository, or click Download ZIP to get the latest stable version.

Install it by navigating to the proper directory and running:

pip install -e .

The profiling report is written in HTML and CSS, which means a modern browser is required.

You need Python 3 to run the package. Other dependencies can be found in the requirements files:

Filename | Requirements
requirements.txt | Package requirements
requirements-dev.txt | Requirements for development
requirements-test.txt | Requirements for testing
setup.py | Requirements for widgets etc.
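
To check an environment before or after installing, the sketch below reports the running Python version and, if present, the installed pandas-profiling version. It relies only on the standard library (importlib.metadata, available from Python 3.8). The helper names installed_version and environment_report are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Collect the Python version and the pandas-profiling version, if installed."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python3": sys.version_info.major == 3,
        "pandas-profiling": installed_version("pandas-profiling"),
    }


print(environment_report())
```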

📝 Use cases

The documentation includes guides, tips and tricks for tackling common use cases:

Use case | Description
Profiling large datasets | Tips on how to prepare data and configure pandas-profiling for working with large datasets
Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset
Comparing datasets | Comparing multiple versions of the same dataset
Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries
Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations

🔗 Integrations

To maximize its usefulness in real world contexts, pandas-profiling has a set of implicit and explicit integrations with a variety of other actors in the Data Science ecosystem:

Integration type | Description
Other DataFrame libraries | How to compute the profiling of data stored in libraries other than pandas
Great Expectations | Generating Great Expectations expectations suites directly from a profiling report
Interactive applications | Embedding profiling reports in Streamlit, Dash or Panel applications
Pipelines | Integration with DAG workflow execution tools like Airflow or Kedro
Cloud services | Using pandas-profiling in hosted computation services like Lambda, Google Cloud or Kaggle
IDEs | Using pandas-profiling directly from integrated development environments such as PyCharm

🙋 Support

Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:

  • Stack Overflow: ideal for asking questions on how to use the package
  • GitHub Issues: bugs, proposals for changes, feature requests
  • Slack: general chat, questions, collaborations
  • Email: project collaborations or sponsoring

❗ Before reporting an issue on GitHub, check out Common Issues.

🤝🏽 Contributing

Learn how to get involved in the Contribution Guide.

A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Slack.
