Validated Pandas DataFrames
FrameGuard is a wrapper class around the Pandas DataFrame that stores and manages a schema to ensure the integrity of the underlying data. The FrameGuard API lets you append instances to the underlying DataFrame, but only after they have been validated successfully against the schema. FrameGuard checks for:
- data type equality,
- boundary conditions (minimum/maximum) on numerical features,
- set membership for categorical features,
- regex pattern matching,
- and more!
FrameGuard is presently in the alpha stage, and more features and tests are under active development. Please send bug reports and feature requests to the author or post them as issues.
Quick Start
Installation
Install FrameGuard, e.g., from PyPI:
$ python -m pip install frameguard
FrameGuard depends on numpy, pandas and pyyaml.
Usage
In this example, we'll use the iris flower dataset:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
target = pd.DataFrame(iris["target"], columns=["species"])
df = pd.concat([df, target], axis=1)
We begin by importing and instantiating the FrameGuard class:
from frameguard.frameguard import FrameGuard
fg = FrameGuard(df, auto_detect=True, categories=["species"])
Building schema...
=============================================================================
Schema for feature 'sepal length (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'sepal width (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'petal length (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'petal width (cm)':
{'data_type': 'float64', 'allow_null': False}
=============================================================================
Schema for feature 'species':
{'data_type': 'int32', 'levels': array([0, 1, 2]), 'allow_null': False}
=============================================================================
Done! Created constraints for 5 features.
We instructed FrameGuard to generate the schema automatically, indicating that the "species" column represents a categorical variable.
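To make the idea of automatic detection concrete, here is a minimal sketch in plain pandas of how such a schema could be inferred. It is not FrameGuard's implementation: the function `infer_schema` is hypothetical, and the rule that nulls are allowed only if the data already contains some is an assumption made for illustration.

```python
import pandas as pd


def infer_schema(df, categories=()):
    """Build a per-column constraint mapping from the data itself.

    Hypothetical helper for illustration; not part of FrameGuard.
    """
    schema = {}
    for name in df.columns:
        column = df[name]
        constraint = {
            "data_type": str(column.dtype),
            # Assumption: allow nulls only if the data already contains some.
            "allow_null": bool(column.isna().any()),
        }
        if name in categories:
            # Categorical columns also record the distinct values observed.
            constraint["levels"] = sorted(column.dropna().unique().tolist())
        schema[name] = constraint
    return schema


df = pd.DataFrame({
    "petal width (cm)": [0.2, 0.2, 0.3],
    "species": pd.Series([0, 1, 2], dtype="int32"),
})
schema = infer_schema(df, categories=["species"])
```

The resulting mapping has one entry per column, with the same keys that appear in the printed schema above ("data_type", "allow_null" and, for categorical columns, "levels").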
So far, so good. Let's see what happens when we try to append bad data:
batch = pd.DataFrame({
    "sepal length (cm)": [4.8, 5.2, 4.7],
    "sepal width (cm)": [3.3, 3.4, 3.0],
    "petal length (cm)": [1.4, 1.2, 1.3],
    "petal width (cm)": [0.2, 0.2, 0.3],
    "species": [0, 0, 3]  # Bad target label
})
fg.append(batch)
---------------------------------------------------------------------------
[...]
ValidationError: Incorrect type for 'species' in batch.
During handling of the above exception, another exception occurred:
[...]
FrameGuardError: Batch does not satisfy schema. Operation cancelled...
Thus, the integrity of the underlying DataFrame is assured.
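The all-or-nothing behaviour shown above (one bad row cancels the whole append) can be sketched in plain pandas as follows. This is only an illustration of the pattern, not FrameGuard's code: the function `append_if_valid` and its `levels` argument are hypothetical, and it checks set membership only.

```python
import pandas as pd


def append_if_valid(df, batch, levels):
    """Return df with batch appended, or raise without changing anything.

    Hypothetical helper for illustration; not part of FrameGuard.
    """
    for name, allowed in levels.items():
        bad = ~batch[name].isin(allowed)
        if bad.any():
            raise ValueError(
                f"{int(bad.sum())} value(s) in '{name}' are outside {list(allowed)}"
            )
    # Every column passed for the whole batch, so it is safe to concatenate.
    return pd.concat([df, batch], ignore_index=True)


allowed_levels = {"species": [0, 1, 2]}
df = pd.DataFrame({"species": [0, 1, 2]})
good = pd.DataFrame({"species": [1, 1]})
bad = pd.DataFrame({"species": [0, 0, 3]})  # 3 is not an allowed level

df = append_if_valid(df, good, allowed_levels)

rejected = None
try:
    df = append_if_valid(df, bad, allowed_levels)
except ValueError as err:
    # The bad batch is rejected as a whole; df still holds only valid rows.
    rejected = str(err)
```

Because validation runs before the concatenation, a rejected batch leaves the existing frame untouched.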
Presently, automatic schema detection is perhaps too simple for most real-world use cases. FrameGuard allows you to add and update constraints manually:
fg = FrameGuard(df)
fg.add_constraint(
    features=[
        "sepal length (cm)",
        "sepal width (cm)",
        "petal length (cm)",
        "petal width (cm)"
    ],
    data_type="float64",
    allow_null=False
)
fg.add_constraint(
    features=["species"],
    data_type="int32",
    levels=[0, 1, 2],
    allow_null=False
)
Modifications to the schema are rejected if they do not match the data:
fg.add_constraint(
    features=["species"],
    data_type="str",
    levels=["setosa", "versicolor", "virginica"],
    allow_null=False
)
SchemaWarning: Type mismatch for 'species'. Skipping...
When we're satisfied, we can export our schema in JSON or YAML form. By default, the schema is exported to the current working directory in YAML format:
fg.export_schema()
Schema exported successfully to schema-2020-11-21-162209.yml.
This is what the output looks like:
features:
  petal length (cm):
    allow_null: false
    data_type: float64
  petal width (cm):
    allow_null: false
    data_type: float64
  sepal length (cm):
    allow_null: false
    data_type: float64
  sepal width (cm):
    allow_null: false
    data_type: float64
  species:
    allow_null: false
    data_type: int32
    levels:
    - 0
    - 1
    - 2
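For the JSON form, the same structure can be written and read back with the standard library alone. The sketch below is an assumption-laden illustration, not FrameGuard's exporter: the helper `schema_filename` is hypothetical, and the timestamp pattern is inferred from the single filename shown above.

```python
import json
from datetime import datetime

# A reduced copy of the exported structure shown above.
schema = {
    "features": {
        "petal length (cm)": {"allow_null": False, "data_type": "float64"},
        "species": {"allow_null": False, "data_type": "int32", "levels": [0, 1, 2]},
    }
}


def schema_filename(now, extension="json"):
    """Timestamped name in the style of the exported file shown above.

    Hypothetical helper; the naming pattern is an assumption.
    """
    return now.strftime("schema-%Y-%m-%d-%H%M%S") + "." + extension


# Serialize with sorted keys, mirroring the alphabetical order of the YAML.
text = json.dumps(schema, indent=2, sort_keys=True)

# Reading the text back yields an equal mapping.
restored = json.loads(text)

name = schema_filename(datetime(2020, 11, 21, 16, 22, 9), "yml")
```

Note that levels detected as a NumPy array would need converting to a plain list (for example with `.tolist()`) before `json.dumps` can serialize them.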
We can also import a schema after initialization. Provided the schema loads correctly, the DataFrame is checked against it automatically:
fg = FrameGuard(df)
fg.import_schema("schema-2020-11-21-162209.yml")
Schema loaded successfully!
Validating DataFrame...
Checking feature 'sepal length (cm)'...
Done checking feature 'sepal length (cm)'.
Found 0 integrity violation(s).
Checking feature 'sepal width (cm)'...
Done checking feature 'sepal width (cm)'.
Found 0 integrity violation(s).
Checking feature 'petal length (cm)'...
Done checking feature 'petal length (cm)'.
Found 0 integrity violation(s).
Checking feature 'petal width (cm)'...
Done checking feature 'petal width (cm)'.
Found 0 integrity violation(s).
Checking feature 'species'...
Done checking feature 'species'.
Found 0 integrity violation(s).
Done validating DataFrame. Found 0 integrity violation(s).
Alternatively, if you have a schema in memory in the form of a mapping, YAML or JSON object, you could load it using the load_schema() method.
Constraints
Presently, the following constraints are supported:
- "data_type" – the data type (NumPy types only);
- "min" – the minimum value for numerical features;
- "max" – the maximum value for numerical features;
- "levels" – the allowed levels for categorical features;
- "pattern" – a pattern for matching regular expressions;
- "all_unique" – whether duplicated values are permitted; and
- "allow_null" – whether null values are allowed.
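As a rough illustration of what each constraint means, the following sketch checks a single pandas Series against a constraint mapping that uses the keys listed above. It is not FrameGuard's validator: the function `check_series` is hypothetical, and it assumes that "pattern" must match the whole value, that "all_unique" set to True forbids duplicates, and that nulls are ignored by every check except "allow_null".

```python
import re

import pandas as pd


def check_series(series, constraint):
    """Return the names of the constraints that the series violates.

    Hypothetical helper for illustration; not part of FrameGuard.
    """
    violated = []
    present = series.dropna()  # nulls are judged only by "allow_null"
    if "data_type" in constraint and str(series.dtype) != constraint["data_type"]:
        violated.append("data_type")
    if "min" in constraint and (present < constraint["min"]).any():
        violated.append("min")
    if "max" in constraint and (present > constraint["max"]).any():
        violated.append("max")
    if "levels" in constraint and not present.isin(constraint["levels"]).all():
        violated.append("levels")
    if "pattern" in constraint:
        regex = re.compile(constraint["pattern"])
        # Assumption: the pattern must match the entire value.
        if not present.map(lambda v: bool(regex.fullmatch(str(v)))).all():
            violated.append("pattern")
    if constraint.get("all_unique") and present.duplicated().any():
        violated.append("all_unique")
    if not constraint.get("allow_null", True) and series.isna().any():
        violated.append("allow_null")
    return violated


widths = pd.Series([0.2, 0.2, 0.3])
codes = pd.Series(["A-01", "A-02", "B7", None])

width_result = check_series(
    widths, {"data_type": "float64", "min": 0.0, "max": 0.25}
)
code_result = check_series(
    codes, {"pattern": r"[A-Z]-\d{2}", "allow_null": False}
)
```

Here `width_result` is `["max"]`, because 0.3 exceeds the maximum, and `code_result` is `["pattern", "allow_null"]`, because "B7" does not match the pattern and the series contains a null.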
Planned Updates
- Write more tests and complete documentation
- Improve automatic detection of schema
- Add support for datetime detection and formatting
- Add support for conformity of numeric features to statistical distributions
Authors
FrameGuard is written and maintained by Hannah White.
Acknowledgements
The FrameGuard logo is set in Google's Roboto Bold 700 Italic.