Skip to main content

automated metadata validation for ONS metadata templates

Reason this release was yanked:

deprecated: use latest version

Project description

Automated metadata validation

This project is for automatically validating metadata templates that accompany IDS data deliveries. The fields in a filled metadata template are each checked against a set of defined conditions. For example, many fields are mandatory, some fields may not contain spaces or special characters, and so on. Note that some metadata requirements cannot be automatically validated. Some human inspection will always be necessary, for example to sense-check free text fields.

Project structure

Below is the folder structure

automated_metadata_validation/
|- io
    |- cell_template.py
    |- input_functions.py
    |- output_functions.py
|- processing
    |- dev_utils.py
    |- processing_utils.py
|- reference
    |- enums.py
    |- lookups.py
    |- v1_temp.py
    |- v2_temp.py
|- utils
    |- logger.py
|- validation
    |- _validation_checks.py
    |- _validation_utils.py
    |- back_office_validations.py
    |- codes_and_values_validations.py
    |- dataset_file_validations.py
    |- dataset_resource_validations.py
    |- dataset_series_validations.py
    |- variables_validations.py

io

Houses input and output functions and the MetadataCell and MetadataValues objects that extract the data from the openpyxl workbook into the format the rest of the project uses.

processing

Processing functions that apply the validation checks to each of the fields.

reference

Data structures that hold the cell information for the templates, enums for each of the dropdowns used in the template and lookups for conversion of column names between templates. E.g. in one version a cell is called "File format" and another is "File Format"

utils

basic logger

validation

This is the core module of the project.

_validation_checks.py

This has atomised validation checks. The format is that the check takes a single item as an argument and returns True if the item passes the check.

An example check function:

def must_start_with_capital(item: str) -> bool:
    if not isinstance(item, str):
        raise TypeError(f"expected type str but got {type(item)}")
    return item[0].isupper()

The name of the function is the error that the user will see if the check fails.

E.g.

must_start_with_capital("example")
# False

This means the output will list the cell location, the value "example" and the name of the function that failed to clearly inform the user of the required fix: "example": "must_start_with_capital. ". The warning ends with a full stop and space to allow the chaining of warnings for multiple fails: "example ": "must_start_with_capital. must_not_end_with_whitespace. ".

_validation_utils.py

Contains the utils that are used by multiple of the validation checks.

Per tab validation checks

The rest of the files are the validation checks for each of the tabs. The structure of a function is as follows:

def validate_Variables_personally_identifiable_information(
    values: Sequence,
) -> Tuple:
    hard_checks = [
        # checks from the validation checks are put in this list
        # they must not be called though
        vc.must_meet_condition,
        vc.must_meet_another_condition,
    ] + STRING_HYGIENE_CHECKS
    soft_checks = []
    return check_fails(values, hard_checks), check_fails(values, soft_checks)

The function naming standard is validate_{TabName}_{variable_name}(). The tab name should be in CamelCase, and the variable name should be in snake_case, this is to differentiate between the two.

The function must only take a Sequence of values and return check_fails() for each of the hard and soft checks.

Hard checks are conditions that can be conclusively measured automatically. Failing a hard check means that something is definitely wrong and needs changing. This also means that hard check fails will usually also cause an ingest failure if untreated, since the ingest process also has fixed expectations about machine-readable content and formats.

Soft checks are checks that require inspection, but not necessarily action, if they fail. Either they cover preferences that aren't strict requirements, or they involve checking something that can't be perfectly measured automatically. For example, we may expect a certain style of response most of the time, but there may be corner cases where unusual answers are still acceptable and correct.

The output returns hard check and soft check fails as dictionaries.

Note that STRING_HYGIENE_CHECKS are common to most string-type variables, and includes checking for leading and trailing spaces, as well as double spaces.

MetadataCell and MetadataValues

MetadataCell

These objects help translate the data from the excel file into manipulatable entities. MetadataCells store the following attributes about a variable from the template:

@attr.define
class MetadataCell:
    tab: str                # the tab the variable is from
    name: str               # the name of the variable
    ref_cell: str           # the cell that the name is stored in the template
    value_col: str          # the column the values are stored: e.g. "F"
    column: bool            # True if the variable is an entire column, False if it's a single cell
    row_start: int          # the index that the values start at
    mandatory: bool         # if the values are mandatory
    enum: list              # the dropdown that the values must be from (if applicable)
    datatype: Callable      # the expected datatype (Python native): str, int, float
    func: Callable          # the validation function 

MetadataValues

These consist of a MetadataCell object and values. they have a validate() method which performs the majority of the work. Populating the hard_fails and soft_fails attributes.

@attr.define
class MetadataValues:
    cell: MetadataCell      # MetadataCell object that stores the variable details
    values: list            # values extracted from the excel template
    hard_fails: dict        # the fails from the hard validation checks
    soft_fails: dict        # the fails from the soft validation checks

There is a simplified version of the validate() method below to show the flow:

    def validate(self) -> None:
        none_fails, enum_fails, hard_fails = {}, {}, {}

        # convert all nones to str representation this is important for the
        # fail dicts
        none_set = [None, np.nan]
        self.convert_set_to_string(none_set)

        # convert to a set to remove validating repeat values
        unchecked_values = set(self.values)

        # don't check if not a mandatory cell at the moment
        if self.cell.mandatory:

            # remove nones and empty strings
            none_fails = remove_nones_etc()
            unchecked_values = self._remove_checked_values()

            # convert to required type before validating ignoring the string nones
            unchecked_values, uncastable_values = self.convert_to_cell_datatype()

            # if the cell has an enum check values are in the enum
            if self.cell.enum:
                enum_fails: Dict = remove_values_not_in_enums()
                unchecked_values = self._remove_checked_values()

            # remaining values are checked against the validation function
            hard_fails, self.soft_fails = self.cell.func(unchecked_values)


            self.hard_fails = {
                **uncastable_values,
                **none_fails,
                **enum_fails,
                **hard_fails,
            }

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ons_metadata_validation-0.1.0.tar.gz (55.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ons_metadata_validation-0.1.0-py3-none-any.whl (66.6 kB view details)

Uploaded Python 3

File details

Details for the file ons_metadata_validation-0.1.0.tar.gz.

File metadata

  • Download URL: ons_metadata_validation-0.1.0.tar.gz
  • Upload date:
  • Size: 55.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for ons_metadata_validation-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1fea77a43b557818a7eea126803001acea86a0d31ab9ecd0a5621aa703b77342
MD5 b2e85fa0c1f06343e55134fe66d33345
BLAKE2b-256 6c9933b6a4b69a7423096439f8f9487f56b423cd5b458b41c3dfa1c56a5a6db2

See more details on using hashes here.

File details

Details for the file ons_metadata_validation-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ons_metadata_validation-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f003d1860ae07ff01fb5d0a13bcabfa465478e9b2cf8b6059d19cf0007d14ac9
MD5 a813723f4bd832ee197d910eb7cb5e76
BLAKE2b-256 b424d60435b0728bc5b810a3905a15ccceb62cc7e6f94851674d797004827352

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page