Validate OCA data sets in Python workflows

OCA Data Set Validator

This is a Python package for validating Overlays Capture Architecture (OCA) data sets. It includes three classes: OCADataSet, OCADataSetErr, and OCABundle. For more information about OCA, please check OCA Specification v1.0.0.

  • OCADataSet represents an OCA data set to be validated, and can be loaded from a pandas DataFrame, an OCA Excel Data Entry File, or a CSV file.

  • OCADataSetErr represents the result set of an OCA data set validation. An instance is generated by the data set validation, contains all the error information, and provides three methods for a quick overview: overview(), first_err_col(), and get_err_col(attr_name).

  • OCABundle represents schema overlays from a loaded .json OCA bundle used to validate the data set.

Dependencies

  • pandas
  • pathlib (part of the Python standard library)

Usage

Installation

Install the package by running pip install oca_ds_validator in the console. You can then import the classes in any Python script.

Validation steps

  1. Import the OCA Bundle using OCABundle(path).
  2. Import the OCA Data Set using OCADataSet(pandas_dataframe) or OCADataSet.from_path(path).
  3. Generate the validation result by calling the validate() method of the OCABundle object.

from oca_ds_validator import OCADataSet, OCADataSetErr, OCABundle

test_bundle = OCABundle("/path/to/oca/bundle.json")

test_data = OCADataSet(data_set_dataframe)
# test_data = OCADataSet.from_path("/path/to/oca/data_entry_file.xlsx")
# test_data = OCADataSet.from_path("/path/to/oca/data_set_file.csv")

test_rslt = test_bundle.validate(test_data)
#########################################################################################
# Example of a possible test_rslt:
#   attr_err:
#     [('missing_attribute',
#       'Missing attribute (attribute not found in the data set).'),
#      ('unmatched_attribute',
#       'Unmatched attribute (attribute not found in the OCA Bundle).')]
#   format_err:
#     {'attribute_with_format_error_on_row_0': {0: 'Format mismatch.'},
#      'array_attribute_without_array_data_on_row_0': {0: 'Valid array required.'},
#      'attribute_with_format_error_on_row_42': {42: 'Format mismatch.'},
#      'attribute_with_errors_on_row_0_and_1': {0: 'Format mismatch.',
#                                               1: 'Valid array required.'},
#      'mandatory_attribute_with_missing_data': {0: 'Missing mandatory attribute.'},
#      'attribute_without_error': {}}
#   ecode_err:  # Not matching any of the entry codes
#     {'attribute_with_entry_codes': {0: 'One of the entry codes required.'}}
#########################################################################################

Optional Messages

Three optional boolean arguments control which messages are printed.

| Argument | Default value | Usage |
| --- | --- | --- |
| show_data_preview | False | If enabled, prints a pandas preview of the data set before validation. |
| enable_flagged_alarm | True | If enabled, prints a warning message if flagged attributes exist. |
| enable_version_alarm | True | If enabled, prints a warning message for each overlay that contains an OCA version number different from the development version of this script (1.0). |

Result Observation

The errors found in the data set are stored in the generated OCADataSetErr object.

# Prints a brief summary of errors.
test_rslt.overview()
#########################################################################################
# Attribute error.
# {'missing_attribute'} found in the OCA Bundle but not in the data set;
# {'unmatched_attribute'} found in the data set but not in the OCA Bundle.
# Found 3 problematic row(s) in the following attribute(s):
# {'attribute_with_format_error_on_row_0',
#  'array_attribute_without_array_data_on_row_0',
#  'attribute_with_format_error_on_row_42',
#  'attribute_with_errors_on_row_0_and_1',
#  'mandatory_attribute_with_missing_data',
#  'attribute_with_entry_codes'}
#########################################################################################

# Prints the information of the first problematic column.
test_rslt.first_err_col()
#########################################################################################
# The first problematic column is: attribute_with_format_error_on_row_0
# Format error(s) would occur in the following rows:
# row 0 : Format mismatch.
# No entry code error found in the column.
#########################################################################################

# Prints the information for a specific column.
test_rslt.get_err_col("attribute_with_format_error_on_row_42")
#########################################################################################
# Format error(s) would occur in the following rows of column
# attribute_with_format_error_on_row_42:
# row 42 : Format mismatch.
# No entry code error found in the column.
#########################################################################################

Further Processing

# Get objects of full error details.
# You may find it useful for data visualization or further analysis.
test_rslt.get_attr_err()
test_rslt.get_format_err()
test_rslt.get_ecode_err()
test_rslt.get_char_encode_err()
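
As a rough sketch of such further processing, the snippet below flattens nested error dictionaries into flat records and lists the affected rows. It assumes, without confirmation from the package itself, that get_format_err() and get_ecode_err() return dictionaries shaped like the format_err and ecode_err examples under Validation steps (attribute name, then row index, then message). The helper names flatten_errors and problematic_rows are hypothetical, and the sample data is copied from that example rather than produced by a real validation run.

```python
# Sample data copied from the example result under "Validation steps".
# In practice these would come from test_rslt.get_format_err() and
# test_rslt.get_ecode_err() (ASSUMED to return dicts of this shape).
format_err = {
    "attribute_with_format_error_on_row_0": {0: "Format mismatch."},
    "array_attribute_without_array_data_on_row_0": {0: "Valid array required."},
    "attribute_with_format_error_on_row_42": {42: "Format mismatch."},
    "attribute_with_errors_on_row_0_and_1": {0: "Format mismatch.",
                                             1: "Valid array required."},
    "mandatory_attribute_with_missing_data": {0: "Missing mandatory attribute."},
    "attribute_without_error": {},
}
ecode_err = {"attribute_with_entry_codes": {0: "One of the entry codes required."}}


def flatten_errors(errors, kind):
    """Hypothetical helper: turn {attr: {row: msg}} into a list of flat records."""
    return [
        {"attribute": attr, "row": row, "kind": kind, "message": msg}
        for attr, rows in errors.items()
        for row, msg in rows.items()
    ]


def problematic_rows(*error_dicts):
    """Hypothetical helper: sorted row indices that carry at least one error."""
    return sorted(
        {row for errors in error_dicts for rows in errors.values() for row in rows}
    )


records = flatten_errors(format_err, "format") + flatten_errors(ecode_err, "entry_code")
rows = problematic_rows(format_err, ecode_err)
print(len(records), rows)  # prints: 7 [0, 1, 42]
```

A flat list of records like this can be passed to pandas.DataFrame(records) for grouping or plotting. For the sample data, the three distinct rows agree with the row count reported in the overview() output above.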

Development Status

This script was created with support from Agri-food Data Canada, funded by CFREF through the Food from Thought grant held at the University of Guelph. Currently, we do not provide any warranty of any kind regarding the accuracy, security, completeness or reliability of this script or any of its parts.

At the moment, this script is developed for the validation of the following OCA attribute types:

  • Text (with regular expressions)
  • DateTime (with ISO 8601 formats)
  • Array[Type]; for any Types not mentioned above, only the validity of the array is checked.
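
To make these three checks concrete, here is a minimal standard-library sketch of what each kind of validation could look like. It is not the package's implementation: the function names, the use of re.fullmatch, the use of datetime.fromisoformat (which accepts only a subset of ISO 8601), and the assumption that array values arrive as JSON-style strings are all illustrative choices.

```python
import json
import re
from datetime import datetime


def matches_text_format(value, pattern):
    """Text: the whole value must match the regular expression."""
    return re.fullmatch(pattern, value) is not None


def is_iso_datetime(value):
    """DateTime: accept values that datetime.fromisoformat can parse."""
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_valid_array(value):
    """Array[Type]: only check that the value parses as a JSON list."""
    try:
        return isinstance(json.loads(value), list)
    except (TypeError, ValueError):
        return False


print(matches_text_format("AB-123", r"[A-Z]{2}-\d{3}"))  # True
print(is_iso_datetime("2023-05-17T08:30:00"))            # True
print(is_iso_datetime("17/05/2023"))                     # False
print(is_valid_array('["a", "b"]'))                      # True
print(is_valid_array("a, b"))                            # False
```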

Also, besides the format overlay, the data set will be validated with the following overlays:

JSON data types are NOT validated, because values are subject to type coercion when Excel or CSV files are imported. We also recommend importing the data set as a pandas DataFrame to prevent unexpected DateTime formatting by software such as Microsoft Excel.
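
One way to follow this recommendation is to read the file yourself with every column kept as text, then pass the resulting DataFrame to OCADataSet. The sketch below uses an in-memory CSV with made-up column names and values. dtype=str and keep_default_na=False are standard pandas.read_csv options; the final OCADataSet call is left as a comment because it requires the package to be installed.

```python
import io

import pandas as pd

# Made-up sample data; a real data set would be read from a file path.
csv_text = "sample_id,collected_on,count\n001,2023-05-17,7\n002,,12\n"

# dtype=str keeps every column as text (no numeric or date inference);
# keep_default_na=False keeps empty cells as "" instead of NaN.
data_set_dataframe = pd.read_csv(
    io.StringIO(csv_text), dtype=str, keep_default_na=False
)

print(data_set_dataframe["sample_id"].tolist())     # leading zeros preserved
print(data_set_dataframe["collected_on"].tolist())  # dates left as written

# test_data = OCADataSet(data_set_dataframe)
```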

Validation errors other than those listed above are NOT guaranteed to be caught by this script. Please feel free to contact us with any suggestions for future development.

A well-developed OCA Validator by The Human Colossus Lab is also available (Rust required).

License

EUPL (European Union Public License), version 1.2

We have distilled the most crucial license specifics to make your adoption seamless: see here for details.

