Data Linter
A Python package that allows automatic validation of data as part of a Data Engineering pipeline. It is designed to automate the process of moving data from Land to Raw-History as described in the ETL pipeline guide.
This package implements different validators using different packages:
- frictionless: Uses Frictionless data to validate the data against our metadata schemas. More information can be found here.
- great-expectations: Uses Great Expectations to validate the data against our metadata schemas. More information can be found here.
Installation
pip install data_linter # frictionless only
pip install data_linter[ge] # To include packages required for the great-expectations validator
Usage
This package takes a YAML-based config file written by the user (see example below) and validates data in the specified Land bucket against the specified metadata. If the data conforms to the metadata, it is moved to the specified Raw bucket for the next step in the pipeline. Any data that fails its checks is moved to a separate bucket for testing. The package also generates logs so that you can explore issues in more detail.
To run the validation, at its simplest you can use the following:
from data_linter import run_validation
config_path = "config.yaml"
run_validation(config_path)
Example config file
land-base-path: s3://land-bucket/my-folder/  # Where to get the data from
fail-base-path: s3://fail-bucket/my-folder/  # Where to write the data if failed
pass-base-path: s3://pass-bucket/my-folder/  # Where to write the data if passed
log-base-path: s3://log-bucket/my-folder/  # Where to write logs
compress-data: true  # Compress data when moving elsewhere
remove-tables-on-pass: true  # Delete the tables in land if validation passes
all-must-pass: true  # Only move data if all tables have passed
fail-unknown-files:
    exceptions:
        - additional_file.txt
        - another_additional_file.txt
validator-engine: frictionless  # will default to this if unspecified

# Tables to validate
tables:
    table1:
        required: true  # Does the table have to exist
        pattern: null  # Assumes file is called table1
        metadata: meta_data/table1.json

    table2:
        required: true
        pattern: ^table2
        metadata: meta_data/table2.json
        row-limit: 10000  # for big tables - only take the first x rows
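To illustrate what a row-limit setting implies, here is a minimal sketch of reading only the first N data rows of a CSV file. It is not the package's implementation: the helper name read_limited_rows and the in-memory sample are assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def read_limited_rows(lines, row_limit=None):
    """Return up to row_limit data rows (header excluded) as dicts.

    Hypothetical helper: mirrors the idea of `row-limit`, not data_linter's code.
    """
    reader = csv.DictReader(lines)
    # islice(reader, None) yields every row, so None means "no limit"
    return list(islice(reader, row_limit))


# Small in-memory stand-in for a large table
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
rows = read_limited_rows(sample, row_limit=2)
print(rows)
```

Because islice stops pulling from the reader once the limit is reached, the rest of the file is never parsed, which is the point of limiting rows on big tables.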
You can also run the validator as part of a python script, where you might want to dynamically generate your config:
from data_linter.validation import run_validation

# Settings shared by every table; "tables" is filled in below
base_config = {
    "land-base-path": "s3://my-bucket/land/",
    "fail-base-path": "s3://my-bucket/fail/",
    "pass-base-path": "s3://my-bucket/pass/",
    "log-base-path": "s3://my-bucket/log/",
    "compress-data": False,
    "remove-tables-on-pass": False,
    "all-must-pass": False,
    "validator-engine": "great-expectations",
    "validator-engine-params": {"default_result_fmt": "BASIC", "ignore_missing_cols": True},
    "tables": {}
}

def get_table_config(table_name):
    d = {
        "required": False,
        "expect-header": True,
        "metadata": f"metadata/{table_name}.json",
        "pattern": r"^{}\.jsonl$".format(table_name),
        "headers-ignore-case": True,
        "only-test-cols-in-metadata": True  # Only currently supported by great-expectations validator
    }
    return d

for table_name in ["table1", "table2"]:
    base_config["tables"][table_name] = get_table_config(table_name)

run_validation(base_config)  # Then watch that log go...
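Before running a dynamically generated config, it can help to check which file names a generated pattern would actually select. The sketch below rebuilds the same pattern string as the example above and tests it against a few made-up file names. Two things are assumptions here: the use of re.escape (not in the original example) and the idea that the pattern is applied to the bare file name with re.search.

```python
import re


def build_pattern(table_name):
    # re.escape is an addition for safety with names containing regex metacharacters
    return r"^{}\.jsonl$".format(re.escape(table_name))


pattern = build_pattern("table1")

# Hypothetical file names, for illustration only
candidates = ["table1.jsonl", "table1.jsonl.gz", "table10.jsonl", "table1_jsonl"]
matched = [name for name in candidates if re.search(pattern, name)]
print(matched)
```

The anchors (^ and $) and the escaped dot mean only an exact "table1.jsonl" is selected, so a similarly named table such as "table10" is not picked up by mistake.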
Validators
Frictionless
Known errors / gotchas:
- Frictionless will drop columns in a jsonl file if their keys are not present in the first row (we would recommend using the great-expectations validator for jsonl, as it uses pandas to read in the data). Link to raised issue.
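The gotcha above can be pictured with a small, dependency-free sketch. It does not use Frictionless and is not how Frictionless is implemented. It only contrasts taking column names from the first record with collecting them from every record, and the sample data is made up.

```python
import json

# Two jsonl records; the second has a key the first lacks
jsonl_text = '{"id": 1}\n{"id": 2, "name": "b"}\n'
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Columns taken from the first record only: "name" is never seen
first_row_cols = list(records[0].keys())

# Columns collected across every record, preserving first-seen order
all_cols = list(dict.fromkeys(key for rec in records for key in rec))

# Columns that a first-row-only reader would silently drop
dropped = [col for col in all_cols if col not in first_row_cols]
print(dropped)
```

A check like this could be run on a sample of a jsonl file to see whether it is at risk before choosing a validator.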
Great Expectations
Known errors / gotchas:
- When "default_result_fmt" is set to "COMPLETE" (the current default behaviour of the codebase), you may get errors because the result returned from Great Expectations tries to serialise a pd.NA (as a sample value from a row that failed an expectation) when writing the result as a JSON blob. This can be avoided by setting "default_result_fmt" to "BASIC", as seen in the Python example above. Link to raised issue.
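The serialisation failure described above can be reproduced in miniature with the standard library. The sketch uses a stand-in class instead of pandas.NA so that it runs without pandas installed, and the result dictionary and its key are hypothetical. The default=str workaround shown at the end is only an illustration of the general technique, not what data_linter does; the recommended fix remains setting "default_result_fmt" to "BASIC".

```python
import json


class FakeNA:
    """Stand-in for pandas.NA (assumption: used only so the sketch needs no pandas)."""

    def __repr__(self):
        return "<NA>"


# Hypothetical validation result holding a sample value that json cannot encode
result = {"sample_values": [1, FakeNA()]}

try:
    json.dumps(result)
    serialised_ok = True
except TypeError:
    # json.dumps raises TypeError for objects it has no encoding for
    serialised_ok = False

# Generic workaround: fall back to str() for unknown objects
safe_blob = json.dumps(result, default=str)
print(serialised_ok, safe_blob)
```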
Additional Parameters
- default_result_fmt: This is passed to the GE validator. If unset, the value defaults to "COMPLETE". This value sets how much information is returned in the result from each "expectation". For more information see here. Note that the safest option is to set it to "BASIC", for the reasons discussed in the gotcha section above.
- ignore_missing_cols: Validation will not fail if columns exist in the metadata but not in the data (the missing columns are ignored).
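The effect of ignore_missing_cols can be sketched as a simple column comparison. The function check_columns below is hypothetical and is not part of data_linter or Great Expectations. It only illustrates the behaviour described in the list above, under the assumption that columns are compared by exact name.

```python
def check_columns(metadata_cols, data_cols, ignore_missing_cols=False):
    """Return (passed, missing), where missing lists metadata columns absent from the data.

    Hypothetical helper for illustration only.
    """
    missing = [col for col in metadata_cols if col not in data_cols]
    if ignore_missing_cols:
        # Missing columns are reported but do not cause a failure
        return True, missing
    return len(missing) == 0, missing


metadata_cols = ["id", "name", "created_at"]
data_cols = ["id", "name"]

strict = check_columns(metadata_cols, data_cols)
lenient = check_columns(metadata_cols, data_cols, ignore_missing_cols=True)
print(strict, lenient)
```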
Process Diagram
How logic works
How to update
We have tests that run on the current state of the poetry.lock file (i.e. the current dependencies). We also run tests based on the most up-to-date dependencies allowed in pyproject.toml. This allows us to see whether there will be any issues when updating dependencies. These can be run locally in the tests folder.
When updating this package, make sure to change the version number in pyproject.toml and describe the change in CHANGELOG.md.
If you have changed any dependencies in pyproject.toml, run poetry update to update poetry.lock.
Once you have created a release in GitHub, to publish the latest version to PyPI, run:
poetry build
poetry publish -u <username>
Here, you should substitute <username> with your PyPI username. In order to publish to PyPI, you must be an owner of the project.