Validated Pandas DataFrames

FrameGuard is a wrapper class around the Pandas DataFrame that stores and manages a schema to ensure the integrity of the underlying data. The FrameGuard API lets you append instances (rows) to the underlying DataFrame, but only after they have been validated successfully against the schema. FrameGuard checks for:

  • data type equality,
  • boundary conditions (minimum/maximum) on numerical features,
  • set membership for categorical features,
  • regex pattern matching,
  • and more!

FrameGuard is currently in alpha; more features and tests are under active development. Please send bug reports and feature requests to the author or post them as issues.

Quick Start

Installation

Install FrameGuard, e.g., from PyPI:

$ python -m pip install frameguard

FrameGuard depends on numpy, pandas and pyyaml.

Usage

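In this example, we'll use the iris flower dataset, loaded here through scikit-learn (which must be installed separately, as it is not a FrameGuard dependency):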

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
# Cast to int32 so the detected dtype matches the output below on every platform
target = pd.DataFrame(iris["target"].astype("int32"), columns=["species"])
df = pd.concat([df, target], axis=1)

We begin by importing and instantiating the FrameGuard class:

from frameguard.frameguard import FrameGuard
fg = FrameGuard(df, auto_detect=True, categories=["species"])
Building schema...
=============================================================================
Schema for feature 'sepal length (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'sepal width (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'petal length (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'petal width (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'species':
{'data_type': 'int32', 'levels': array([0, 1, 2]), 'allow_null': False}
=============================================================================
Done! Created constraints for 5 features.

Here, auto_detect=True tells FrameGuard to generate the schema automatically, and categories=["species"] indicates that the "species" column represents a categorical variable.
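To make the idea of automatic detection concrete, here is a minimal sketch in plain pandas of how a schema like the one printed above could be inferred. It is not FrameGuard's implementation; the function name infer_schema and its exact rules are assumptions made for illustration:

```python
import pandas as pd


def infer_schema(frame, categories=()):
    """Build a per-column constraint dict from the data itself.

    Hypothetical helper: mirrors the shape of the schema shown above,
    not FrameGuard's actual detection logic.
    """
    schema = {}
    for column in frame.columns:
        series = frame[column]
        constraint = {"data_type": str(series.dtype)}
        if column in categories:
            # Treat every distinct non-null value as an allowed level.
            constraint["levels"] = sorted(series.dropna().unique().tolist())
        # Nulls are allowed only if the data already contains some.
        constraint["allow_null"] = bool(series.isna().any())
        schema[column] = constraint
    return schema


frame = pd.DataFrame({
    "petal width (cm)": [0.2, 0.2, 0.3],
    "species": pd.Series([0, 0, 1], dtype="int32"),
})
schema = infer_schema(frame, categories=["species"])
print(schema)
```

A detector this simple records only what the sample happens to contain, which is one reason inferred schemas usually need manual refinement.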

So far, so good. Let's see what happens when we try to append bad data:

batch = pd.DataFrame({
    "sepal length (cm)": [4.8, 5.2, 4.7],
    "sepal width (cm)": [3.3, 3.4, 3.0],
    "petal length (cm)": [1.4, 1.2, 1.3],
    "petal width (cm)": [0.2, 0.2, 0.3],
    "species": [0, 0, 3]  # Bad target label: 3 is not an allowed level
})
fg.append(batch)
---------------------------------------------------------------------------
[...]
ValidationError: Incorrect type for 'species' in batch.

During handling of the above exception, another exception occurred:
[...]
FrameGuardError: Batch does not satisfy schema. Operation cancelled...

The batch is rejected and the operation is cancelled, so the integrity of the underlying DataFrame is preserved.
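The validate-then-append pattern can be approximated in plain pandas. The following is a minimal sketch of the idea rather than FrameGuard's own code; append_if_valid and its levels argument are hypothetical names introduced for illustration:

```python
import pandas as pd


def append_if_valid(frame, batch, levels):
    """Return frame with batch appended, or raise ValueError.

    `levels` maps a column name to the values allowed in that column.
    Hypothetical helper, not part of FrameGuard.
    """
    for column, allowed in levels.items():
        bad = ~batch[column].isin(allowed)
        if bad.any():
            raise ValueError(
                f"{int(bad.sum())} value(s) in '{column}' are outside the allowed levels"
            )
    # Only reached when every check passed, so the original frame is
    # never partially updated.
    return pd.concat([frame, batch], ignore_index=True)


frame = pd.DataFrame({"species": [0, 1, 2]})
good = pd.DataFrame({"species": [1, 1]})
bad = pd.DataFrame({"species": [0, 3]})  # 3 is not an allowed level

frame = append_if_valid(frame, good, {"species": [0, 1, 2]})
try:
    frame = append_if_valid(frame, bad, {"species": [0, 1, 2]})
except ValueError as err:
    print(f"Rejected: {err}")

print(len(frame))  # the rejected batch added no rows
```

Validating the whole batch before concatenating is what keeps the operation all-or-nothing: one bad row rejects the entire batch.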

At present, automatic schema detection is likely too simple for most real-world use cases. FrameGuard therefore lets you add and update constraints manually:

fg = FrameGuard(df)
fg.add_constraint(
    features=[
      "sepal length (cm)",
      "sepal width (cm)",
      "petal length (cm)",
      "petal width (cm)"
    ],
    data_type="float64",
    allow_null=False
)
fg.add_constraint(
    features=["species"],
    data_type="int32",
    levels=[0, 1, 2],
    allow_null=False
)

A constraint that does not match the data is not accepted; FrameGuard issues a warning and skips it:

fg.add_constraint(
    features=["species"],
    data_type="str",
    levels=["setosa", "versicolor", "virginica"],
    allow_null=False
)
SchemaWarning: Type mismatch for 'species'. Skipping...

When we're satisfied, we can export the schema in JSON or YAML format. By default, the schema is exported to the current working directory as YAML:

fg.export_schema()
Schema exported successfully to schema-2020-11-21-162209.yml.

This is what the output looks like:

features:
  petal length (cm):
    allow_null: false
    data_type: float64
  petal width (cm):
    allow_null: false
    data_type: float64
  sepal length (cm):
    allow_null: false
    data_type: float64
  sepal width (cm):
    allow_null: false
    data_type: float64
  species:
    allow_null: false
    data_type: int32
    levels:
    - 0
    - 1
    - 2

Likewise, we can import a schema after initialization. Once the schema has loaded successfully, the DataFrame is validated against it automatically:

fg = FrameGuard(df)
fg.import_schema("schema-2020-11-21-162209.yml")
Schema loaded successfully!
Validating DataFrame...

Checking feature 'sepal length (cm)'...
	Done checking feature 'sepal length (cm)'.
	Found 0 integrity violation(s).

Checking feature 'sepal width (cm)'...
	Done checking feature 'sepal width (cm)'.
	Found 0 integrity violation(s).

Checking feature 'petal length (cm)'...
	Done checking feature 'petal length (cm)'.
	Found 0 integrity violation(s).

Checking feature 'petal width (cm)'...
	Done checking feature 'petal width (cm)'.
	Found 0 integrity violation(s).

Checking feature 'species'...
	Done checking feature 'species'.
	Found 0 integrity violation(s).

Done validating DataFrame. Found 0 integrity violation(s).

Alternatively, if you already have a schema in memory as a mapping, a YAML object or a JSON object, you can load it using the load_schema() method.
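As an illustration of what such an in-memory schema might look like, the sketch below builds a mapping that mirrors part of the YAML export shown earlier and round-trips it through JSON using only the standard library. The top-level "features" key is taken from the YAML output; that the JSON form uses the same layout is an assumption, and the call to load_schema() is left out because its exact signature is not documented here:

```python
import json

# A mapping with the same shape as the exported YAML schema (assumed layout).
schema = {
    "features": {
        "sepal length (cm)": {"data_type": "float64", "allow_null": False},
        "species": {
            "data_type": "int32",
            "levels": [0, 1, 2],
            "allow_null": False,
        },
    }
}

# Serialize to JSON text and parse it back into a mapping.
text = json.dumps(schema, indent=2, sort_keys=True)
restored = json.loads(text)

print(text)
print(restored == schema)
```

Because the schema holds only strings, booleans, integers and lists, it survives the round trip unchanged, which is what makes JSON and YAML interchangeable for storing it.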

Constraints

Presently, the following constraints are supported:

  • "data_type" – the data type (NumPy types only);
  • "min" – the minimum value for numerical features;
  • "max" – the maximum value for numerical features;
  • "levels" – the allowed levels for categorical features;
  • "pattern" – a regular expression that values must match;
  • "all_unique" – whether all values must be unique (that is, duplicates are not permitted); and
  • "allow_null" – whether null values are allowed.
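To show what each of these constraints means in practice, here is a minimal sketch that counts violations for a single column in plain pandas. It is an illustration under stated assumptions, not FrameGuard's validation code: count_violations is a hypothetical helper, and details such as requiring a full regex match are choices made for this example.

```python
import pandas as pd


def count_violations(series, constraint):
    """Count values in `series` that break `constraint` (a dict of rules).

    Hypothetical helper; the rule names follow the list above.
    """
    if "data_type" in constraint and str(series.dtype) != constraint["data_type"]:
        # A wrong dtype affects the whole column, so count every row.
        return len(series)
    violations = 0
    if not constraint.get("allow_null", True):
        violations += int(series.isna().sum())
    present = series.dropna()  # remaining rules apply to non-null values only
    if "min" in constraint:
        violations += int((present < constraint["min"]).sum())
    if "max" in constraint:
        violations += int((present > constraint["max"]).sum())
    if "levels" in constraint:
        violations += int((~present.isin(constraint["levels"])).sum())
    if "pattern" in constraint:
        matches = present.astype(str).str.fullmatch(constraint["pattern"])
        violations += int((~matches).sum())
    if constraint.get("all_unique", False):
        violations += int(present.duplicated().sum())
    return violations


widths = pd.Series([0.2, 0.3, 9.9, None])
width_rules = {"data_type": "float64", "min": 0.0, "max": 3.0, "allow_null": False}
width_count = count_violations(widths, width_rules)  # one null, one value above max

codes = pd.Series(["AB-1", "AB-2", "AB-2", "oops"])
code_rules = {"pattern": r"[A-Z]{2}-\d", "all_unique": True}
code_count = count_violations(codes, code_rules)  # one bad pattern, one duplicate

print(width_count, code_count)
```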

Planned Updates

  • Write more tests and complete documentation
  • Improve automatic detection of schema
  • Add support for datetime detection and formatting
  • Add support for conformity of numeric features to statistical distributions

Authors

FrameGuard is written and maintained by Hannah White.

Acknowledgements

The FrameGuard logo is set in Google's Roboto Bold 700 Italic.
