Validation library for pandas DataFrames

pandas-validity

What is it?

pandas-validity is a Python library for the validation of pandas DataFrames. It provides a DataFrameValidator class that serves as a context manager. Within this context, you can perform multiple validations and checks. Any encountered errors are collected and raised at the end of the process. The DataFrameValidator raises a ValidationErrorsGroup exception to summarize the errors.
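The collect-then-raise behaviour described above can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: `ErrorCollector`, `CollectedErrors`, and `check` are hypothetical names chosen for illustration, and a plain exception carrying a list stands in for ValidationErrorsGroup.

```python
class CollectedErrors(Exception):
    """Hypothetical summary exception that carries every recorded error."""

    def __init__(self, errors):
        self.errors = list(errors)
        super().__init__(f"Validation errors found: {len(self.errors)}.")


class ErrorCollector:
    """Hypothetical context manager that defers raising until the block exits."""

    def __init__(self):
        self.errors = []

    def __enter__(self):
        return self

    def check(self, condition, message):
        # Record the failure instead of raising, so later checks still run.
        if not condition:
            self.errors.append(ValueError(message))

    def __exit__(self, exc_type, exc, tb):
        # Summarise only if the block itself finished without an exception.
        if exc_type is None and self.errors:
            raise CollectedErrors(self.errors)
        return False  # never swallow an exception raised inside the block


columns = ["A", "B", "C", "D"]

try:
    with ErrorCollector() as collector:
        collector.check("A" in columns, "missing column: 'A'")  # passes
        collector.check("E" in columns, "missing column: 'E'")
        collector.check("D" not in columns, "redundant column: 'D'")
except CollectedErrors as group:
    summary = str(group)
    messages = [str(error) for error in group.errors]
    print(summary)
    for message in messages:
        print(" -", message)
```

Because each failing check is only recorded, the second and third checks run even after an earlier one fails, and the caller sees all problems in a single exception.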

Installation

You can easily install the latest released version using binary installers from the Python Package Index (PyPI):

pip install pandas-validity

Development Installation

Prerequisites: poetry for environment management

The source code is currently hosted on GitHub at https://github.com/ohmycoffe/pandas-validity. To get the development version:

git clone git@github.com:ohmycoffe/pandas-validity.git

To install the project and development dependencies:

make install 

To run tests:

make test 

To view all possible commands, use:

make help

Usage

import pandas as pd
import datetime
from pandas_validity import DataFrameValidator

# Create a sample DataFrame
df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", None, "c"],
        "C": [2.3, 4.5, 9.2],
        "D": [
            datetime.datetime(2023, 1, 1, 1),
            datetime.datetime(2023, 1, 1, 2),
            datetime.datetime(2023, 1, 1, 3),
        ],
    }
)

# Define your expectations and data type mappings
expected_columns = ["A", "B", "C", "E"]
data_types_mapping = {
    "A": "float",
    "D": "datetime",
}

# Use DataFrameValidator for validation
with DataFrameValidator(df) as validator:
    validator.is_empty()
    validator.has_required_columns(expected_columns)
    validator.has_no_redundant_columns(expected_columns)
    validator.has_valid_data_types(data_types_mapping)
    validator.has_no_missing_data()

Output:

Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has missing columns: ['E']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has redundant columns: ['D']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Column 'A' has an invalid data type: 'int64'
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
  + Exception Group Traceback (most recent call last):
...
  | pandas_validity.exceptions.ValidationErrorsGroup: Validation errors found: 4. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has missing columns: ['E']
    +---------------- 2 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has redundant columns: ['D']
    +---------------- 3 ----------------
    | pandas_validity.exceptions.ValidationError: Column 'A' has an invalid data type: 'int64'
    +---------------- 4 ----------------
    | pandas_validity.exceptions.ValidationError: Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
    +------------------------------------

The library supports the following data types for validation:

  • predefined: "str", "int", "float", "datetime", "bool"
  • or any callable that accepts a dtype object and returns a boolean indicating whether the type is valid, for example pd.api.types.is_string_dtype
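A custom callable of the kind described above might look like the following sketch. The predicate name `is_integer_kind` is hypothetical, and it relies on the one-letter `kind` attribute that NumPy-backed dtypes expose ("i" for signed integers, "u" for unsigned, "f" for floats). Stand-in objects are used here so the sketch runs without pandas installed; the final comment shows the assumed way to pass the callable in a data type mapping.

```python
from types import SimpleNamespace


def is_integer_kind(dtype) -> bool:
    """Hypothetical predicate: True for signed or unsigned integer dtypes."""
    # Objects without a `kind` attribute are treated as not matching.
    return getattr(dtype, "kind", None) in {"i", "u"}


# Stand-ins for dtype objects (assumption: a real int64 dtype reports
# kind "i" and a real float64 dtype reports kind "f").
int64_like = SimpleNamespace(kind="i")
float64_like = SimpleNamespace(kind="f")

results = {
    "int64_like": is_integer_kind(int64_like),
    "float64_like": is_integer_kind(float64_like),
    "no_kind": is_integer_kind(object()),
}
print(results)

# Assumed usage, mirroring the mapping in the Usage section:
# data_types_mapping = {"A": is_integer_kind, "D": "datetime"}
```

A predicate like this is useful when none of the predefined names fits, for example to accept any integer width while rejecting floats.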

License

This project is licensed under the terms of the MIT license.
