`nhs data validation engine` is a framework used to validate data

Data Validation Engine

The Data Validation Engine (DVE) is a configuration-driven data validation library built and used by NHS England. The package has been reverted from its v1.0.0 release to a 0.x version because we do not yet consider it mature enough for a 1.0.0 release. Please bear this in mind if you come across commits or references to a v1+ release while on v0.x.

As mentioned above, the DVE is "configuration driven", which means that most of your development as a user will be building a JSON document that describes how the data will be validated. This JSON document is known as a dischema file, and example files can be accessed here. If you'd like to learn more about the dischema document and how to build one from scratch, please read the documentation here.

Once a dischema file has been defined, you are ready to use the DVE. The DVE is typically orchestrated around four key "services":

| Service | Purpose |
| --- | --- |
| 1. File Transformation | Takes submitted files and turns them into stringified parquet file(s), so that a consistent data structure can be passed through the other services. |
| 2. Data Contract | Validates a stringified parquet file and performs type casting against it using pydantic models. |
| 3. Business Rules | Performs more complex validations, such as comparisons between fields and tables, aggregations and filters, to generate new entities. |
| 4. Error Reports | Takes all the errors raised in the previous services and surfaces them in a readable format for downstream users or services. Currently this is implemented as an Excel spreadsheet, but it could be reconfigured to meet other requirements or use cases. |
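To make the flow between the four services more concrete, here is a minimal, standard-library-only sketch of the same idea. It is not the DVE API: every function name, the `CONTRACT` mapping and the sample data are hypothetical, plain casting functions stand in for pydantic models, an in-memory list of string dictionaries stands in for a stringified parquet file, and a CSV-style string stands in for the Excel report.

```python
import csv
import io
from datetime import date

# Hypothetical sample submission; every value starts life as a string.
SAMPLE = (
    "patient_id,admitted,discharged\n"
    "1,2024-01-02,2024-01-05\n"
    "abc,2024-01-02,2024-01-03\n"
    "3,2024-02-10,2024-02-01\n"
)

# Stand-in for a data contract: field name -> casting function.
CONTRACT = {
    "patient_id": int,
    "admitted": date.fromisoformat,
    "discharged": date.fromisoformat,
}


def file_transformation(raw_csv: str) -> list[dict[str, str]]:
    """Stage 1: parse the submitted file, keeping every value as a string."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def data_contract(rows):
    """Stage 2: cast each field, collecting errors instead of raising."""
    typed, errors = [], []
    for index, row in enumerate(rows):
        record = {"row": index}
        row_ok = True
        for field, cast in CONTRACT.items():
            try:
                record[field] = cast(row[field])
            except (KeyError, TypeError, ValueError):
                errors.append({"row": index, "field": field, "error": "failed type cast"})
                row_ok = False
        if row_ok:
            typed.append(record)
    return typed, errors


def business_rules(records):
    """Stage 3: a cross-field rule applied to rows that passed the contract."""
    errors = []
    for record in records:
        if record["discharged"] < record["admitted"]:
            errors.append(
                {"row": record["row"], "field": "discharged", "error": "discharged before admitted"}
            )
    return errors


def error_report(errors) -> str:
    """Stage 4: surface all collected errors in a readable, tabular form."""
    lines = ["row,field,error"]
    lines += [f'{e["row"]},{e["field"]},{e["error"]}' for e in errors]
    return "\n".join(lines)


def run_pipeline(raw_csv: str) -> str:
    """Chain the four stages in the order described in the table above."""
    rows = file_transformation(raw_csv)
    typed, contract_errors = data_contract(rows)
    rule_errors = business_rules(typed)
    return error_report(contract_errors + rule_errors)


if __name__ == "__main__":
    print(run_pipeline(SAMPLE))
```

In this sketch, rows that fail type casting are reported and excluded before the business rules run, so later stages only ever see correctly typed records. Whether the DVE itself handles failed rows this way is not something this example asserts.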

If you'd like more detailed documentation about these services, then please read the extended documentation here.

The DVE has been designed to be modular, so it can support users who only want to use specific "services" (for example, just the file transformation and data contract). The DVE is also designed to support different backend implementations. The base installation includes backend support for Spark and DuckDB. Given our organisation's requirements, it is unlikely that we will add any more backend implementations to the base package beyond these two. If you need a different backend, for example MySQL, you can implement it yourself. If you are unable to do so, we recommend reading the guidance on requesting new features and raising bug reports here.
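As a rough illustration of what "supporting different backend implementations" can look like in general, the sketch below defines a small abstract interface and one toy implementation. This is an assumption-laden example, not the DVE's actual base classes: `Backend`, `InMemoryBackend`, `count_invalid` and their methods are all hypothetical names chosen for this illustration.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface; the real DVE base classes may differ."""

    @abstractmethod
    def read(self, rows):
        """Load rows into whatever entity type the backend works with."""

    @abstractmethod
    def filter(self, entity, predicate):
        """Return a new entity containing only rows matching the predicate."""

    @abstractmethod
    def count(self, entity) -> int:
        """Return the number of rows in the entity."""


class InMemoryBackend(Backend):
    """Toy implementation backed by plain Python lists."""

    def read(self, rows):
        return list(rows)

    def filter(self, entity, predicate):
        return [row for row in entity if predicate(row)]

    def count(self, entity) -> int:
        return len(entity)


def count_invalid(backend: Backend, rows, is_valid) -> int:
    """Validation logic written once against the interface, not a specific engine."""
    entity = backend.read(rows)
    invalid = backend.filter(entity, lambda row: not is_valid(row))
    return backend.count(invalid)


if __name__ == "__main__":
    sample = [{"age": "5"}, {"age": "x"}]
    print(count_invalid(InMemoryBackend(), sample, lambda row: row["age"].isdigit()))
```

The point of the design is that `count_invalid` never mentions a concrete engine, so a Spark, DuckDB or other implementation could be swapped in by providing the same three methods.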

If you'd like to contribute a new backend implementation to the base DVE package, then please look at the Contributing section.

Installation and usage

The DVE is a Python package and can be installed using package managers such as pip. As of the latest release, we support Python 3.10 and 3.11, with Spark v3.4 and DuckDB v1.1. In future, we plan to upgrade the DVE to work with newer versions of Python, DuckDB and Spark.

If you're planning to use the Spark backend implementation, you will also need OpenJDK 11 installed.
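If you want to check an environment against these requirements before installing, a small script along the following lines may help. It is only a sketch: the helper names are hypothetical, the supported versions are copied from the text above, and the Java check only confirms that a `java` executable is on the `PATH`, not that it is OpenJDK 11.

```python
import shutil
import sys

# Python versions listed as supported in this README.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def python_supported(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version is one the README lists."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def java_on_path() -> bool:
    """Return True if a `java` executable can be found (version not checked)."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Java found (needed for the Spark backend only):", java_on_path())
```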

Python dependencies are listed in pyproject.toml.

To install the DVE package, use a package manager such as pip:

pip install data-validation-engine

Note: only versions >=0.6.2 are available on PyPI. For older versions, please install directly from the git repo or build from source.
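To confirm which version ended up installed, and whether a given version falls in the range published on PyPI, you could use the standard library's `importlib.metadata`. The helper names below are hypothetical, and the version parser is deliberately simple: it assumes plain `X.Y.Z` strings and does not handle pre-release or local version suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

# Lowest version the README says is available on PyPI.
MIN_PYPI_VERSION = (0, 6, 2)


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def available_on_pypi(text: str) -> bool:
    """Return True if the version is at or above the PyPI minimum."""
    return parse_version(text) >= MIN_PYPI_VERSION


def installed_dve_version():
    """Return the installed distribution version, or None if not installed."""
    try:
        return version("data-validation-engine")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_dve_version()
    if found is None:
        print("data-validation-engine is not installed")
    else:
        print(f"Installed version: {found}")
```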

Once you have installed the DVE you are ready to use it. For guidance on how to create your dischema JSON document (configuration), please read the documentation.

Version 0.0.1 does work with Python 3.7. However, we will not address any issues with that version of the DVE if you choose to use it. Use at your own risk.

Requesting new features and raising bug reports

Before creating a new issue, please check whether the same bug or feature has already been raised. Where a duplicate is created, the ticket will be closed with a reference to the existing issue.

If you have spotted a bug with the DVE then please raise an issue here using the "bug template".

If you have a feature request, then please follow the same process using the "Feature request template".

Upcoming features

Below is a list of features that we would like to implement or that have been requested.

| Feature | Release Version | Released? |
| --- | --- | --- |
| Open source release | 0.1.0 | Yes |
| Uplift to Python 3.11 | 0.2.0 | Yes |
| Uplift Pyspark to 3.5 | TBA | No |
| Allow DVE to run on Python 3.12+ | TBA | No |
| Upgrade to Pydantic 2.0 | TBA | No |
| Uplift Pyspark to 4.0+ | TBA | No |
| Create a more user friendly interface for building and modifying dischema files | Not yet confirmed | No |

Beyond the Python and Pydantic upgrades, we cannot confirm that the other features will be made available any time soon. If you have the interest and desire to make these features available, then please read the Contributing section and get involved.

Contributing

Please see guidance here.

Legal

This codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.

Any HTML or Markdown documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
