
pandera-report: row-based reporting using the power of pandera.

Project description

Pandera Extension for row-based reporting


🚀 Description

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects, making data processing pipelines more readable and robust.

If you have to report potential quality issues resulting from dataframe validation via pandera, then pandera-report is your friend. Based on the validation issues that pandera detects, your original dataframe is extended with these issues on a row-level basis.

With pandera-report, you can:

  • Seamlessly integrate with the pandera library to gain enhanced data validation reporting without interfering with pandera's functionality.
  • Enrich your data with information about why specific rows failed validation.

⚡ Setup

Using pip:

pip install pandera-report

Using poetry:

poetry add pandera-report
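
After installation, you can confirm the package is available; a quick sanity check using only the standard library:

from importlib.metadata import version

# print the installed version of pandera-report
print(version("pandera-report"))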

Quick start

The following example is taken from the pandera documentation and shows the definition of a DataFrameSchema that validates successfully for the provided dataframe.

import pandas as pd
import pandera as pa


# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1

To use the pandera-report functionality for the same schema and dataframe, you can do this:

from pandera_report import DataFrameValidator

validator = DataFrameValidator()  # defaults: quality_report=True, lazy=True
print(validator.validate(schema, df))

#     column1  column2  column3 quality_issues quality_status
#  0        1     -1.3  value_1           None          Valid
#  1        4     -1.4  value_2           None          Valid
#  2        0     -2.9  value_3           None          Valid
#  3       10    -10.1  value_2           None          Valid
#  4        9    -20.4  value_1           None          Valid

You see? Same result, but extended with two columns reporting that the validation of the dataframe was completely valid. The quality report can also be deactivated for the case where everything is 100% valid.
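
If you do not want the extra columns, the report can be switched off; a minimal sketch, assuming the quality_report keyword shown in the constructor comment above controls this:

from pandera_report import DataFrameValidator

# quality_report=False is assumed to skip adding the report columns
validator = DataFrameValidator(quality_report=False)
print(validator.validate(schema, df))  # plain validated dataframe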

But what if the dataframe contains data quality issues? pandera will raise SchemaErrors or SchemaError (depending on whether validation is lazy). Let's see what pandera-report does if we change the dataframe so that it violates the schema definition:

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"]
})

validator = DataFrameValidator()
print(validator.validate(schema, df))

#     column1  column2  column3                              quality_issues quality_status
#  0        1     -1.3  value_1                                        None          Valid
#  1        4     -1.4  value_2                                        None          Valid
#  2        0     -2.9  value_3                                        None          Valid
#  3       10    -10.1  value_2                                        None          Valid
#  4        9    -20.4   value1  Column <column3>: str_startswith('value_')        Invalid
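
For comparison, plain pandera would raise here rather than return an annotated dataframe; a minimal sketch using pandera's own API:

import pandera as pa

try:
    # lazy=True collects all failures before raising SchemaErrors
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # dataframe listing each failed check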

Why is this useful? Quite simply, it becomes particularly interesting when you are not the one preparing the file and need to report back exactly which rows prevented it from being processed into a valid DataFrame.
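
Since the result is still a regular pandas dataframe, downstream code can filter or report the flagged rows directly; a small sketch, assuming the column names shown in the output above:

reported_df = validator.validate(schema, df)

# keep only the rows that failed validation, together with the reason
invalid_rows = reported_df[reported_df["quality_status"] == "Invalid"]
print(invalid_rows[["column3", "quality_issues"]])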

