Validate OCA data sets in Python workflows
Project description
OCA Data Set Validator
This is a Python package for validating Overlays Capture Architecture (OCA) data sets. It includes three classes: OCADataSet, OCADataSetErr, and OCABundle. For more information about OCA, please check OCA Specification v1.0.0.
- OCADataSet represents an OCA data set to be validated. It can be loaded from a pandas DataFrame, an OCA Excel Data Entry File, or a CSV file.
- OCADataSetErr represents the result set of an OCA data set validation. It is generated by the validation, contains all the error information, and provides three methods for a quick overview: overview(), first_err_col(), and get_err_col(attr_name).
- OCABundle represents the schema overlays from a loaded .json OCA bundle, which are used to validate the data set.
Dependencies
- pandas
- pathlib
Usage
Installation
Install the package by running pip install oca_ds_validator in the console. You can then import the classes in any Python script.
- The package can be found here: oca_ds_validator
Validation steps
- Import the OCA Bundle using OCABundle(path).
- Import the OCA Data Set using OCADataSet(pandas_dataframe) or OCADataSet.from_path(path).
- Generate the validation result using the validate() method of the OCABundle class.
from oca_ds_validator import OCADataSet, OCADataSetErr, OCABundle
test_bundle = OCABundle("/path/to/oca/bundle.json")
test_data = OCADataSet(data_set_dataframe)
# test_data = OCADataSet.from_path("/path/to/oca/data_entry_file.xlsx")
# test_data = OCADataSet.from_path("/path/to/oca/data_set_file.csv")
test_rslt = test_bundle.validate(test_data)
#########################################################################################
# Example of a possible test_rslt:
# attr_err:
# [('missing_attribute',
# 'Missing attribute (attribute not found in the data set).'),
# ('unmatched_attribute',
# 'Unmatched attribute (attribute not found in the OCA Bundle).')]
# format_err:
# {'attribute_with_format_error_on_row_0': {0: 'Format mismatch.'},
# 'array_attribute_without_array_data_on_row_0': {0: 'Valid array required.'},
# 'attribute_with_format_error_on_row_42': {42: 'Format mismatch.'},
# 'attribute_with_errors_on_row_0_and_1': {0: 'Format mismatch.',
# 1: 'Valid array required.'},
# 'mandatory_attribute_with_missing_data': {0: 'Missing mandatory attribute.'},
# 'attribute_without_error': {}}
# ecode_err: # Not matching any of the entry codes
# {'attribute_with_entry_codes': {0: 'One of the entry codes required.'}}
#########################################################################################
Optional Messages
There are three optional boolean arguments to control the messages printed; a usage example follows the table.

argument | default value | usage |
---|---|---|
show_data_preview | False | If enabled, prints a pandas preview of the data set before validation. |
enable_flagged_alarm | True | If enabled, prints a warning message when flagged attributes are present. |
enable_version_alarm | True | If enabled, prints a warning message for each overlay that contains an OCA version number different from the development version of this script (1.0). |
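For example, a minimal sketch (assuming these flags are keyword arguments of the validate() call shown above):

# Sketch only: assumes show_data_preview, enable_flagged_alarm and
# enable_version_alarm are keyword arguments of OCABundle.validate().
test_rslt = test_bundle.validate(test_data,
                                 show_data_preview=True,      # preview the data set before validating
                                 enable_flagged_alarm=True,   # keep warnings about flagged attributes
                                 enable_version_alarm=False)  # silence OCA version warnings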
Result Observation
The errors of the data set are stored in the generated OCADataSetErr object.
# Prints a brief summary of errors.
test_rslt.overview()
#########################################################################################
# Attribute error.
# {'missing_attribute'} found in the OCA Bundle but not in the data set;
# {'unmatched_attribute'} found in the data set but not in the OCA Bundle.
# Found 3 problematic row(s) in the following attribute(s):
# {'attribute_with_format_error_on_row_0',
# 'array_attribute_without_array_data_on_row_0',
# 'attribute_with_format_error_on_row_42',
# 'attribute_with_errors_on_row_0_and_1',
# 'mandatory_attribute_with_missing_data',
# 'attribute_with_entry_codes'}
#########################################################################################
# Prints the information of the first problematic column.
test_rslt.first_err_col()
#########################################################################################
# The first problematic column is: attribute_with_format_error_on_row_0
# Format error(s) would occur in the following rows:
# row 0 : Format mismatch.
# No entry code error found in the column.
#########################################################################################
# Prints the information of a specific column.
test_rslt.get_err_col("attribute_with_format_error_on_row_42")
#########################################################################################
# Format error(s) would occur in the following rows of column
# attribute_with_format_error_on_row_42:
# row 42 : Format mismatch.
# No entry code error found in the column.
#########################################################################################
Further Processing
# Get objects with the full error details.
# You may find them useful for data visualization or further analysis (see the sketch below).
test_rslt.get_attr_err()
test_rslt.get_format_err()
test_rslt.get_ecode_err()
test_rslt.get_char_encode_err()
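For instance, assuming get_format_err() returns a mapping of the form {attribute: {row: message}} (as in the example result above), a minimal sketch that flattens the format errors into a pandas DataFrame could look like this:

import pandas as pd

# Sketch only: flatten {attribute: {row: message}} into a tidy table
# with one row per (attribute, row, error message).
format_err = test_rslt.get_format_err()
err_rows = [(attr, row, msg)
            for attr, row_errs in format_err.items()
            for row, msg in row_errs.items()]
err_df = pd.DataFrame(err_rows, columns=["attribute", "row", "error"])
print(err_df)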
Development Status
This script was created with support from Agri-food Data Canada, funded by CFREF through the Food from Thought grant held at the University of Guelph. Currently, we do not provide any warranty of any kind regarding the accuracy, security, completeness, or reliability of this script or any of its parts.
At the moment, this script supports validation of the following OCA attribute types (an illustrative sketch of these checks follows the list):
- Text (with regular expressions)
- DateTime (with ISO 8601 formats)
- Array[Type]; for any types not mentioned above, only the validity of the array will be checked.
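For illustration only (this is not the package's implementation), the format checks are of roughly this kind:

import re
from datetime import datetime

# Illustration only: a Text attribute is matched against a regular expression,
# and a DateTime attribute against an ISO 8601 format string.
text_ok = re.fullmatch(r"[A-Z]{2}\d{4}", "AB1234") is not None
try:
    datetime.strptime("2023-05-01", "%Y-%m-%d")
    datetime_ok = True
except ValueError:
    datetime_ok = False
print(text_ok, datetime_ok)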
In addition to the Format Overlay, the data set is also validated against the following overlays:
- Conformance Overlay, for any missing mandatory data
- Entry Code Overlay, for any data that does not match the entry codes
- Character Encoding Overlay, for any data that does not match the specified character encoding
JSON data types are NOT validated because of the type coercion that occurs when importing Excel or CSV files. We also recommend importing the data set as a pandas DataFrame to prevent unexpected DateTime formatting by software such as Microsoft Excel, as sketched below.
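For example, a minimal sketch of loading a CSV with every column kept as text before handing it to OCADataSet (the read_csv options are one possible way to avoid coercion, not a requirement of the package):

import pandas as pd
from oca_ds_validator import OCADataSet

# Read every column as a string so pandas does not coerce dates or numbers.
raw_df = pd.read_csv("/path/to/oca/data_set_file.csv", dtype=str, keep_default_na=False)
test_data = OCADataSet(raw_df)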
Any validation errors other than the above are NOT guaranteed to be detected by this script. Please feel free to contact us with any suggestions for future development.
You can also find a well-developed OCA Validator by the Human Colossus Lab (Rust required).
License
EUPL (European Union Public License), version 1.2
We have distilled the most crucial license specifics to make your adoption seamless: see here for details.
File details
Details for the file oca_ds_validator-0.0.2.tar.gz.
File metadata
- Download URL: oca_ds_validator-0.0.2.tar.gz
- Upload date:
- Size: 44.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | d9d26b5d6b02a0108d8c14217be2ff4b58ded6d5e3ac68d75bc44d12b231814d |
MD5 | 427c92ee53f3a61bc176f0745b6d0fe7 |
BLAKE2b-256 | c9817a9268556ed66309f8b9aaf939ca7c881f367f27372c35af2b3df6a20753 |
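A minimal sketch for checking a downloaded copy of this file against the published SHA256 digest (assuming it has been saved in the current directory):

import hashlib

# Compare the local file's SHA256 digest with the one published above.
expected = "d9d26b5d6b02a0108d8c14217be2ff4b58ded6d5e3ac68d75bc44d12b231814d"
with open("oca_ds_validator-0.0.2.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "Hash mismatch!")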
File details
Details for the file oca_ds_validator-0.0.2-py3-none-any.whl.
File metadata
- Download URL: oca_ds_validator-0.0.2-py3-none-any.whl
- Upload date:
- Size: 14.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.12
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | ea40149a488de906075d888a15b5afefbcc377c13fb02f49f870a99b6d35e042 |
MD5 | c05650268e141c88a4b3ea0e1f10b328 |
BLAKE2b-256 | 6031b2ed103200399938220a502c75059985ba8616b49d2fd637688c96f23f4b |