A lightweight data checker for pandas, polars or pyspark built using pandera
onsdatachecker
A data checker that uses pandera to perform the bulk of the checks. This wrapper checks the contents of a dataframe against a predefined schema, which can be defined directly within your Python script or saved as a supported configuration file.
Getting started
To start using this project, first make sure your system meets its requirements.
It's suggested that you install this package and its requirements within a virtual environment.
Requirements
- Python 3.10+ installed
Contributors have some additional requirements - please see our contributing guidance.
Installing the package
In a terminal, you can install the package and its Python dependencies directly from GitHub using:
pip install git+https://github.com/ONSdigital/onsdatachecker.git
or from PyPI:
pip install onsdatachecker
Installing specific requirements via GitHub is not recommended. Instead, use the following methods to install them from PyPI.
Polars support
To install the required packages for polars, run the following command:
# Polars support from PyPI
pip install onsdatachecker[polars]
Pyspark support
To install the required packages for pyspark, run the following command:
pip install onsdatachecker[pyspark]
N.B. this does not install the pyspark package itself, because ONS DAP has a different install for pyspark.
You may need to install pyspark separately, although this should not be needed for ONS colleagues working in DAP.
For networked laptops, install pyspark as usual (pip install pyspark[connect]).
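After installing, the environment can be sanity-checked from Python. The snippet below is a minimal sketch using only the standard library: it compares the running interpreter with the Python 3.10+ requirement, looks up the installed onsdatachecker version, and reports whether the optional polars and pyspark packages can be imported. The variable names are illustrative and not part of the package.

```python
import sys
from importlib import metadata, util

MIN_PYTHON = (3, 10)  # minimum version listed under Requirements

# True when the running interpreter satisfies the stated requirement.
python_ok = sys.version_info >= MIN_PYTHON

# Look up the installed distribution; None means it is not installed here.
try:
    installed_version = metadata.version("onsdatachecker")
except metadata.PackageNotFoundError:
    installed_version = None

# Optional backends: present only if installed separately or via an extra.
optional_backends = {
    name: util.find_spec(name) is not None for name in ("polars", "pyspark")
}

print(f"Python >= 3.10: {python_ok}")
print(f"onsdatachecker version: {installed_version or 'not installed'}")
print(f"Optional backends available: {optional_backends}")
```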
Pre-Defined Checks
These checks can be included in the lists for individual columns in your schema, depending on the data type.
| Data Type | Check Name | Parameter | Check Definition |
|---|---|---|---|
| integer / double | Minimum value | min_val | Checks that all values are above or equal to the minimum value |
| integer / double | Maximum value | max_val | Checks that all values are below or equal to the maximum value |
| character | Minimum length | min_length | Checks that all strings have a length above or equal to the minimum length |
| character | Maximum length | max_length | Checks that all strings have length below or equal to the maximum length |
| any | Allowed values | allowed_values | Validates that entries match a set of permitted values; a list or a regex can be used. (Optional; forbidden values can be used instead.) |
| any | Forbidden values | forbidden_values | Validates that entries do not contain any of a set of forbidden values; a list can be used. (Optional; allowed values can be used instead. Regex is not supported: to use regex we recommend using allowed_characters. A TypeError message will be provided with further details.) |
| any | Missing values check | allow_na | Checks for missing or NA values in the column. |
| double | Minimum decimal places | min_decimal | Checks that all values have at least the minimum number of decimal places |
| double | Maximum decimal places | max_decimal | Checks that all values have at most the maximum number of decimal places |
| date / datetime | Minimum Date | min_date | Checks that all dates are after the minimum date using the format “YYYY-MM-DD” |
| date / datetime | Maximum Date | max_date | Checks that all dates are before the maximum date using the format “YYYY-MM-DD” |
| date / datetime | Minimum Datetime | min_datetime | Checks that all dates are after the minimum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS |
| date / datetime | Maximum Datetime | max_datetime | Checks that all dates are before the maximum datetime. Accepted formats: Y, YM, YMD, YMDH, YMDHM and YMDHMS |
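To make the table concrete, here is a plain-Python sketch of what a few of these checks mean for a single column of values. It is not how onsdatachecker or pandera implements them; the helper names are hypothetical, and two details are assumptions: a regex passed as allowed values must match the whole entry, and decimal places are counted from the value's string form.

```python
import re
from decimal import Decimal


def within_range(values, min_val=None, max_val=None):
    """min_val / max_val: every value is >= min_val and <= max_val."""
    return all(
        (min_val is None or v >= min_val) and (max_val is None or v <= max_val)
        for v in values
    )


def length_ok(strings, min_length=None, max_length=None):
    """min_length / max_length: every string length is within the bounds."""
    return within_range([len(s) for s in strings], min_length, max_length)


def allowed(values, allowed_values):
    """allowed_values: a list of permitted values, or a regex string."""
    if isinstance(allowed_values, str):
        # Assumption: the regex has to match the whole entry.
        pattern = re.compile(allowed_values)
        return all(pattern.fullmatch(str(v)) is not None for v in values)
    return all(v in allowed_values for v in values)


def decimal_places(value):
    """Count the digits after the decimal point in the value's string form."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return max(0, -exponent)


prices = [1.5, 2.25, 10.0]
codes = ["AB1", "CD22"]

range_result = within_range(prices, min_val=1, max_val=10)
length_result = length_ok(codes, min_length=3, max_length=4)
regex_result = allowed(codes, r"[A-Z]{2}\d+")
list_result = allowed(codes, ["AB1"])
max_decimal_result = all(decimal_places(p) <= 2 for p in prices)
```

In this sketch every result is True except list_result, because "CD22" is not in the permitted list.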
Custom Checks
The ability to add custom checks is supported through pandera using lambda functions.
Custom checks cannot be defined in the main schema and must instead be defined in their own dictionary in your Python script.
Then, when creating your DataValidator object, pass this dictionary as an additional argument and your custom checks will be applied across the entire dataframe.
Note: You will get a log entry per column for this check, even for columns that are not contained in your custom check.
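# Each entry maps a check name to a lambda that receives the whole dataframe.
my_custom_checks = {
    "my_custom_check_name": lambda df: (df["column_1"] < 100) & (df["column_2"].isna())
}

# schema and df are assumed to have been defined earlier in your script.
new_validator = DataValidator(
    schema=schema,
    data=df,
    file="output_report.yaml",
    format="yaml",
    custom_checks=my_custom_checks,
)
new_validator.validate()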
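To see what such a lambda evaluates to before passing it to DataValidator, it can be called directly on a small pandas dataframe. The sketch below uses made-up sample data and only pandas; it shows that the expression produces one boolean per row, which is the usual shape for a pandera check result (False marks a failing row). How onsdatachecker reports those failures is not shown here.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame(
    {
        "column_1": [10, 250, 50],
        "column_2": [None, None, 3.0],
    }
)

# Same expression as the custom check above.
my_custom_check = lambda df: (df["column_1"] < 100) & (df["column_2"].isna())

# One boolean per row: True where both conditions hold.
result = my_custom_check(df)

# Rows that would not satisfy the check.
failing_rows = df[~result]
print(result.tolist())
print(failing_rows)
```

Here only the first row satisfies both conditions: the second row has column_1 at 250, and the third has a non-missing column_2.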
Install for contributors/developers
To install the contributing requirements, use:
python -m pip install -U pip setuptools
pip install -e .[dev]
pre-commit install
This installs an editable version of the package. This means that when you update the package code you do not have to reinstall it for the changes to take effect. This saves a lot of time when you test your code.
Remember to update the setup and requirement files in line with any changes to your package.
Creating the documentation locally
First pip install the documentation requirements, then run mkdocs serve to host the documentation in your local environment:
pip install -e .[docs]
mkdocs serve
Project structure layout
The cookiecutter template generated for each project will follow this folder structure:
.
├── onsdatachecker/
│   └── onsdatachecker/
│       ├── checks_loaders_and_exporters/
│       │   ├── __init__.py
│       │   ├── checks.py
│       │   ├── schema_loader.py
│       │   ├── validator_exporter.py
│       │   └── validator_template.html
│       ├── __init__.py
│       ├── main.yml
└── ...
Bumping project version
bump-my-version is used to streamline the process for creating new versions and releases.
To view the possible version bumps use bump-my-version show-bump and the three possible options will be presented.
Then to bump the version use bump-my-version bump <increment> where <increment> is replaced with either major, minor or patch.
A git tag is also created and can be pushed using the following command (this example pushes the v1.0.0 release tag):
git push origin v1.0.0
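As a rough illustration of the three increments, the sketch below computes the next version for each option using conventional MAJOR.MINOR.PATCH rules. The bump function is hypothetical and is not part of bump-my-version; the real result depends on the project's bump-my-version configuration.

```python
def bump(version: str, increment: str) -> str:
    """Return the next version for a major, minor or patch increment."""
    major, minor, patch = (int(part) for part in version.split("."))
    if increment == "major":
        return f"{major + 1}.0.0"  # minor and patch reset to zero
    if increment == "minor":
        return f"{major}.{minor + 1}.0"  # patch resets to zero
    if increment == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown increment: {increment!r}")


current = "1.0.0"
options = {inc: bump(current, inc) for inc in ("major", "minor", "patch")}

# Tag name following the "v" prefix used in the push example above.
next_minor_tag = f"v{options['minor']}"
print(options)
print(next_minor_tag)
```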
Licence
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
Contributing
If you want to help us build and improve onsdatachecker, please take a look at our contributing guidelines.
Acknowledgements
This project structure is based on the govcookiecutter template project.