Validation library for pandas DataFrames

pandas-validity

What is it?

pandas-validity is a Python library for validating pandas DataFrames. It provides a DataFrameValidator class that serves as a context manager. Within this context, you can run multiple validations and checks. Errors are collected rather than raised immediately; at the end of the process, the DataFrameValidator raises a single ValidationErrorsGroup exception that summarizes them.

Installation

Install the latest released version from the Python Package Index (PyPI):

pip install pandas-validity

Development Installation

Prerequisites: poetry for environment management

The source code is hosted on GitHub at https://github.com/ohmycoffe/pandas-validity. To get the development version:

git clone git@github.com:ohmycoffe/pandas-validity.git

To install the project and development dependencies:

make install 

To run tests:

make test 

To view all possible commands, use:

make help

Usage

import datetime

import pandas as pd
from pandas_validity import DataFrameValidator

# Create a sample DataFrame
df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", None, "c"],
        "C": [2.3, 4.5, 9.2],
        "D": [
            datetime.datetime(2023, 1, 1, 1),
            datetime.datetime(2023, 1, 1, 2),
            datetime.datetime(2023, 1, 1, 3),
        ],
    }
)

# Define your expectations and data type mappings
expected_columns = ["A", "B", "C", "E"]
data_types_mapping = {
    "A": "float",
    "D": "datetime",
}

# Use DataFrameValidator for validation
with DataFrameValidator(df) as validator:
    validator.is_empty()
    validator.has_required_columns(expected_columns)
    validator.has_no_redundant_columns(expected_columns)
    validator.has_valid_data_types(data_types_mapping)
    validator.has_no_missing_data()

Output:

Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has missing columns: ['E']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has redundant columns: ['D']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Column 'A' has an invalid data type: 'int64'
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
  + Exception Group Traceback (most recent call last):
...
  | pandas_validity.exceptions.ValidationErrorsGroup: Validation errors found: 4. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has missing columns: ['E']
    +---------------- 2 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has redundant columns: ['D']
    +---------------- 3 ----------------
    | pandas_validity.exceptions.ValidationError: Column 'A' has an invalid data type: 'int64'
    +---------------- 4 ----------------
    | pandas_validity.exceptions.ValidationError: Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
    +------------------------------------

The library supports the following data types for validation:

  • predefined: "str", "int", "float", "datetime", "bool"
  • or any Callable that accepts a dtype object and returns a boolean indicating whether the data type is valid, for example pd.api.types.is_string_dtype
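As an illustration of the second option, here is a minimal sketch of a custom dtype checker built from functions in pd.api.types. The name is_int_or_float is hypothetical, and the snippet only applies the checker to a DataFrame's dtypes directly; it does not call pandas-validity.

```python
import pandas as pd


def is_int_or_float(dtype) -> bool:
    """Hypothetical checker: accept integer or floating-point dtypes only."""
    return pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_float_dtype(dtype)


df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"], "C": [2.3, 4.5, 9.2]})

# Apply the checker to each column's dtype, as a validator presumably would.
results = {column: is_int_or_float(dtype) for column, dtype in df.dtypes.items()}
```

A function like this could then serve as a value in the mapping passed to has_valid_data_types, for example {"A": is_int_or_float}, assuming the library calls it with the column's dtype as the list above describes.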

License

This project is licensed under the terms of the MIT license.
