
A package to validate the schema of a pandas dataframe

Opulent-Pandas

Opulent-Pandas is a schema validation package aimed specifically at validating the schema of pandas DataFrames. It takes heavy inspiration from voluptuous and tries to stay as close as possible to the API that package defines. Opulent-Pandas differs from voluptuous in that it relies heavily on pandas to perform the validation, which makes it considerably faster than voluptuous on larger datasets. It does, however, mean that the input must be a pandas DataFrame rather than a dict (as is the case for voluptuous). A performance comparison of voluptuous and Opulent-Pandas will be added to this readme soon!
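
As a rough illustration of why column-wise checks tend to scale better than record-by-record checks, the sketch below runs the same range check both ways using plain pandas. It is not taken from Opulent-Pandas' internals; the helper names `rows_out_of_range` and `column_out_of_range` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"per_page": [10, 20, 900, 5]})


def rows_out_of_range(frame, column, maximum):
    # Record-by-record check: every row is turned into a dict and
    # inspected in a Python loop, similar to dict-oriented validation.
    records = frame.to_dict("records")
    return [i for i, row in enumerate(records) if row[column] > maximum]


def column_out_of_range(frame, column, maximum):
    # Column-wise check: one vectorized comparison over the whole Series.
    mask = frame[column] > maximum
    return frame.index[mask.values].tolist()


row_result = rows_out_of_range(df, "per_page", 20)
col_result = column_out_of_range(df, "per_page", 20)
```

Both helpers flag the same rows; the second one leaves the loop to pandas, which is where the speed-up on large frames would be expected to come from.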

Example

Defining a schema in Opulent-Pandas is very similar to how you would in voluptuous. To make the similarities and differences clear, let's walk through the same example as is done in the voluptuous readme.

Twitter's user search API accepts query URLs like:

$ curl 'https://api.twitter.com/1.1/users/search.json?q=python&per_page=20&page=1'

To validate this we might use a schema like:

>>> from opulent_pandas import Schema, TypeValidator, Required
>>> schema = Schema({
...   Required('q'): [TypeValidator(str)],
...   Required('per_page'): [TypeValidator(int)],
...   Required('page'): [TypeValidator(int)],
... })

Comparing with voluptuous, you'll notice that the validators for each field are always specified as a list. Other than that, it's very similar to how you would define the schema with voluptuous.

If we look at the more complex schema, as defined in the readme of voluptuous, we see very similar schemas:

>>> from opulent_pandas.validator import Required, RangeValidator, TypeValidator, ValueLengthValidator 
>>> schema = Schema({
...   Required('q'): [TypeValidator(str), ValueLengthValidator(min_length=1)],
...   Required('per_page'): [TypeValidator(int), RangeValidator(min=1, max=20)],
...   Required('page'): [TypeValidator(int), RangeValidator(min=0)],
... })

One difference between Opulent-Pandas and voluptuous is that Opulent-Pandas has a validate function that is used to validate a given data structure, rather than voluptuous' approach of passing the data directly to your schema as a parameter.
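
To make the two ideas above concrete (a list of validators per column, and an explicit validate call), here is a minimal, self-contained sketch in plain pandas. Everything in it (`SketchSchema`, `SketchValidationError`, `min_value`, `max_value`) is hypothetical and only mimics the shape of the API; it is not how Opulent-Pandas is implemented.

```python
import pandas as pd


class SketchValidationError(Exception):
    """Hypothetical error type for this sketch (not an Opulent-Pandas class)."""


def min_value(minimum):
    # Build a check that fails if any value in the column is below `minimum`.
    def check(column_name, series):
        if (series < minimum).any():
            raise SketchValidationError(
                "{}: value below {}".format(column_name, minimum))
    return check


def max_value(maximum):
    # Build a check that fails if any value in the column is above `maximum`.
    def check(column_name, series):
        if (series > maximum).any():
            raise SketchValidationError(
                "{}: value above {}".format(column_name, maximum))
    return check


class SketchSchema:
    """Hypothetical stand-in mapping column names to lists of checks."""

    def __init__(self, checks_by_column):
        self.checks_by_column = checks_by_column

    def validate(self, df):
        # The data is handed to an explicit method instead of calling
        # the schema object itself.
        for column_name, checks in self.checks_by_column.items():
            if column_name not in df.columns:
                raise SketchValidationError("missing column: " + column_name)
            for check in checks:
                check(column_name, df[column_name])


sketch = SketchSchema({"per_page": [min_value(1), max_value(20)]})

# Valid data passes without raising.
valid_result = sketch.validate(pd.DataFrame({"per_page": [10, 20]}))

# Invalid data raises on the first failing check.
try:
    sketch.validate(pd.DataFrame({"per_page": [900]}))
    failure = None
except SketchValidationError as error:
    failure = str(error)
```

In this sketch the checks in each list run in order and the first failure stops validation; whether Opulent-Pandas orders or reports failures the same way is not something this example establishes.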

If you pass data in that does not satisfy the requirements specified in your Opulent-Pandas schema, you'll get a corresponding error message. Walking through the examples provided in the voluptuous readme:

There are 3 required fields (TODO: this example should also tell you which columns are missing; this seems to be a bug):

>>> from opulent_pandas import MissingColumnError
>>> try:
...   schema.validate({})
...   raise AssertionError('MissingColumnError not raised')
... except MissingColumnError as e:
...   exc = e
>>> str(exc) == "Columns missing"
True

q must be a string:

>>> import pandas as pd
>>> from opulent_pandas import InvalidTypeError
>>> try:
...   schema.validate(pd.DataFrame({'q': [123], 'per_page': [10], 'page': [1]}))
...   raise AssertionError('InvalidTypeError not raised')
... except InvalidTypeError as e:
...   exc = e
>>> str(exc) == "Invalid data type found for column: q. Required: <class 'str'>"
True

...and must be at least one character in length:

>>> from opulent_pandas import ValueLengthError
>>> try:
...   schema.validate(pd.DataFrame({'q': [''], 'per_page': [5], 'page': [12]}))
...   raise AssertionError('ValueLengthError not raised')
... except ValueLengthError as e:
...   exc = e
>>> str(exc) == "Value found with length smaller than enforced minimum length for column: q. Minimum Length: 1"
True

"per_page" is a positive integer no greater than 20:

>>> from opulent_pandas import RangeError
>>> try:
...    schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [900], 'page': [12]}))
...    raise AssertionError('RangeError not raised')
... except RangeError as e:
...    exc = e
>>> str(exc) == "Value found larger than enforced maximum for column: per_page. Required maximum: 20"
True

>>> try:
...    schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [-10], 'page': [12]}))
...    raise AssertionError('RangeError not raised')
... except RangeError as e:
...    exc = e
>>> str(exc) == "Value found larger than enforced minimum for column: per_page. Required minimum: 1"
True

"page" is an integer >= 0:

>>> try:
...   schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [10], 'page': ['one']}))
...   raise AssertionError('InvalidTypeError not raised')
... except InvalidTypeError as e:
...   exc = e
>>> str(exc) == "Invalid data type found for column: page. Required type: <class 'int'>"
True
