automated metadata validation for ONS metadata templates

These details have not been verified by PyPI

Project links

homepage

Reason this release was yanked:

deprecated: use latest version

Project description

Automated metadata validation

This project is for automatically validating metadata templates that accompany IDS data deliveries. The fields in a filled metadata template are each checked against a set of defined conditions. For example, many fields are mandatory, some fields may not contain spaces or special characters, and so on. Note that some metadata requirements cannot be automatically validated. Some human inspection will always be necessary, for example to sense-check free text fields.

Project structure

Below is the folder structure

automated_metadata_validation/
|- io
    |- cell_template.py
    |- input_functions.py
    |- output_functions.py
|- processing
    |- dev_utils.py
    |- processing_utils.py
|- reference
    |- enums.py
    |- lookups.py
    |- v1_temp.py
    |- v2_temp.py
|- utils
    |- logger.py
|- validation
    |- _validation_checks.py
    |- _validation_utils.py
    |- back_office_validations.py
    |- codes_and_values_validations.py
    |- dataset_file_validations.py
    |- dataset_resource_validations.py
    |- dataset_series_validations.py
    |- variables_validations.py

io

Houses input and output functions and the MetadataCell and MetadataValues objects that extract the data from the openpyxl workbook into the format the rest of the project uses.

processing

Processing functions that apply the validation checks to each of the fields.

reference

Data structures that hold the cell information for the templates, enums for each of the dropdowns used in the template and lookups for conversion of column names between templates. E.g. in one version a cell is called "File format" and another is "File Format"

utils

basic logger

validation

This is the core module of the project.

_validation_checks.py

This has atomised validation checks. The format is that the check takes a single item as an argument and returns True if the item passes the check.

An example check function:

def must_start_with_capital(item: str) -> bool:
    if not isinstance(item, str):
        raise TypeError(f"expected type str but got {type(item)}")
    return item[0].isupper()

The name of the function is the error that the user will see if the check fails.

E.g.

must_start_with_capital("example")
# False

This means the output will list the cell location, the value "example" and the name of the function that failed to clearly inform the user of the required fix: "example": "must_start_with_capital. ". The warning ends with a full stop and space to allow the chaining of warnings for multiple fails: "example ": "must_start_with_capital. must_not_end_with_whitespace. ".

_validation_utils.py

Contains the utils that are used by multiple of the validation checks.

Per tab validation checks

The rest of the files are the validation checks for each of the tabs. The structure of a function is as follows:

def validate_Variables_personally_identifiable_information(
    values: Sequence,
) -> Tuple:
    hard_checks = [
        # checks from the validation checks are put in this list
        # they must not be called though
        vc.must_meet_condition,
        vc.must_meet_another_condition,
    ] + STRING_HYGIENE_CHECKS
    soft_checks = []
    return check_fails(values, hard_checks), check_fails(values, soft_checks)

The function naming standard is validate_{TabName}_{variable_name}(). The tab name should be in CamelCase, and the variable name should be in snake_case, this is to differentiate between the two.

The function must only take a Sequence of values and return check_fails() for each of the hard and soft checks.

Hard checks are conditions that can be conclusively measured automatically. Failing a hard check means that something is definitely wrong and needs changing. This also means that hard check fails will usually also cause an ingest failure if untreated, since the ingest process also has fixed expectations about machine-readable content and formats.

Soft checks are checks that require inspection, but not necessarily action, if they fail. Either they cover preferences that aren't strict requirements, or they involve checking something that can't be perfectly measured automatically. For example, we may expect a certain style of response most of the time, but there may be corner cases where unusual answers are still acceptable and correct.

The output returns hard check and soft check fails as dictionaries.

Note that STRING_HYGIENE_CHECKS are common to most string-type variables, and includes checking for leading and trailing spaces, as well as double spaces.

MetadataCell and MetadataValues

MetadataCell

These objects help translate the data from the excel file into manipulatable entities. MetadataCells store the following attributes about a variable from the template:

@attr.define
class MetadataCell:
    tab: str                # the tab the variable is from
    name: str               # the name of the variable
    ref_cell: str           # the cell that the name is stored in the template
    value_col: str          # the column the values are stored: e.g. "F"
    column: bool            # True if the variable is an entire column, False if it's a single cell
    row_start: int          # the index that the values start at
    mandatory: bool         # if the values are mandatory
    enum: list              # the dropdown that the values must be from (if applicable)
    datatype: Callable      # the expected datatype (Python native): str, int, float
    func: Callable          # the validation function

MetadataValues

These consist of a MetadataCell object and values. they have a validate() method which performs the majority of the work. Populating the hard_fails and soft_fails attributes.

@attr.define
class MetadataValues:
    cell: MetadataCell      # MetadataCell object that stores the variable details
    values: list            # values extracted from the excel template
    hard_fails: dict        # the fails from the hard validation checks
    soft_fails: dict        # the fails from the soft validation checks

There is a simplified version of the validate() method below to show the flow:

    def validate(self) -> None:
        none_fails, enum_fails, hard_fails = {}, {}, {}

        # convert all nones to str representation this is important for the
        # fail dicts
        none_set = [None, np.nan]
        self.convert_set_to_string(none_set)

        # convert to a set to remove validating repeat values
        unchecked_values = set(self.values)

        # don't check if not a mandatory cell at the moment
        if self.cell.mandatory:

            # remove nones and empty strings
            none_fails = remove_nones_etc()
            unchecked_values = self._remove_checked_values()

            # convert to required type before validating ignoring the string nones
            unchecked_values, uncastable_values = self.convert_to_cell_datatype()

            # if the cell has an enum check values are in the enum
            if self.cell.enum:
                enum_fails: Dict = remove_values_not_in_enums()
                unchecked_values = self._remove_checked_values()

            # remaining values are checked against the validation function
            hard_fails, self.soft_fails = self.cell.func(unchecked_values)


            self.hard_fails = {
                **uncastable_values,
                **none_fails,
                **enum_fails,
                **hard_fails,
            }

Project details

These details have not been verified by PyPI

Project links

homepage

Release history Release notifications | RSS feed

1.2.10

Mar 23, 2026

1.2.9

Feb 16, 2026

1.2.8

Oct 9, 2025

1.2.7

Jul 18, 2025

1.2.6

Jun 6, 2025

1.2.5

May 22, 2025

1.2.4

May 9, 2025

1.2.3

May 6, 2025

1.2.2

Feb 25, 2025

1.2.1

Jan 27, 2025

1.2.0

Jan 23, 2025

1.1.1

Jan 21, 2025

1.1.0

Jan 13, 2025

1.0.1

Dec 19, 2024

1.0.0

Dec 19, 2024

0.1.7 yanked

Oct 29, 2024

0.1.6 yanked

Oct 28, 2024

Reason this release was yanked:

deprecated: use latest version

0.1.5

Sep 27, 2024

0.1.4

Sep 6, 2024

0.1.3 yanked

Sep 6, 2024

Reason this release was yanked:

broken

0.1.2

Jul 16, 2024

0.1.1

Jul 16, 2024

This version

0.1.0 yanked

Jul 12, 2024

Reason this release was yanked:

deprecated: use latest version

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ons_metadata_validation-0.1.0.tar.gz (55.8 kB view details)

Uploaded Jul 12, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ons_metadata_validation-0.1.0-py3-none-any.whl (66.6 kB view details)

Uploaded Jul 12, 2024 Python 3

File details

Details for the file ons_metadata_validation-0.1.0.tar.gz.

File metadata

Download URL: ons_metadata_validation-0.1.0.tar.gz
Upload date: Jul 12, 2024
Size: 55.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for ons_metadata_validation-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1fea77a43b557818a7eea126803001acea86a0d31ab9ecd0a5621aa703b77342`
MD5	`b2e85fa0c1f06343e55134fe66d33345`
BLAKE2b-256	`6c9933b6a4b69a7423096439f8f9487f56b423cd5b458b41c3dfa1c56a5a6db2`

See more details on using hashes here.

File details

Details for the file ons_metadata_validation-0.1.0-py3-none-any.whl.

File metadata

Download URL: ons_metadata_validation-0.1.0-py3-none-any.whl
Upload date: Jul 12, 2024
Size: 66.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for ons_metadata_validation-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f003d1860ae07ff01fb5d0a13bcabfa465478e9b2cf8b6059d19cf0007d14ac9`
MD5	`a813723f4bd832ee197d910eb7cb5e76`
BLAKE2b-256	`b424d60435b0728bc5b810a3905a15ccceb62cc7e6f94851674d797004827352`

See more details on using hashes here.

ons-metadata-validation 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Automated metadata validation

Project structure

io

processing

reference

utils

validation

_validation_checks.py

_validation_utils.py

Per tab validation checks

MetadataCell and MetadataValues

MetadataCell

MetadataValues

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes