
Data Linter

A Python package to allow automatic validation of data as part of a Data Engineering pipeline. It is designed to automate the process of moving data from Land to Raw-History as described in the ETL pipeline guide.

The validation is based on the goodtables package from the fine folk at Frictionless Data. More information can be found on their website.

Installation

pip install data_linter

Usage

This package takes a yaml based config file written by the user (see example below), and validates data in the specified Land bucket against specified metadata. If the data conforms to the metadata, it is moved to the specified Raw bucket for the next step in the pipeline. Any data that fails validation is moved to a separate fail bucket for further inspection. The package also generates logs to allow you to explore issues in more detail.
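The pass/fail routing described above (see also the all-must-pass option in the example config) can be sketched as follows. This is a hypothetical illustration of the rule, not data_linter's actual implementation; route_tables is an invented helper name:

```python
# Hypothetical sketch of the routing rule described above -- not data_linter's
# actual implementation. Maps each table's validation result to a destination.
def route_tables(results, all_must_pass):
    """results: dict mapping table name -> bool (True if validation passed)."""
    if all_must_pass and not all(results.values()):
        # At least one table failed: nothing is promoted to the pass bucket;
        # failed tables go to the fail bucket, passing tables stay in land.
        return {name: ("fail" if not ok else "land") for name, ok in results.items()}
    # Otherwise each table is routed on its own result.
    return {name: ("pass" if ok else "fail") for name, ok in results.items()}

print(route_tables({"table1": True, "table2": False}, all_must_pass=True))
```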

To run the validation, at its simplest you can use the following:

from data_linter import run_validation

config_path = "config.yaml"

run_validation(config_path)

Example config file

land-base-path: s3://land-bucket/my-folder/  # Where to get the data from
fail-base-path: s3://fail-bucket/my-folder/  # Where to write the data if failed
pass-base-path: s3://pass-bucket/my-folder/  # Where to write the data if passed
log-base-path: s3://log-bucket/my-folder/  # Where to write logs
compress-data: true  # Compress data when moving elsewhere
remove-tables-on-pass: true  # Delete the tables in land if validation passes
all-must-pass: true  # Only move data if all tables have passed
fail-unknown-files:
    exceptions:
        - additional_file.txt
        - another_additional_file.txt

# Tables to validate
tables:
    table1:
        required: true  # Does the table have to exist
        pattern: null  # Assumes file is called table1
        metadata: meta_data/table1.json
        linter: goodtables

    table2:
        required: true
        pattern: ^table2
        metadata: meta_data/table2.json
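The pattern field appears to be a regular expression matched against file names in the land bucket (when null, the file is assumed to share the table's name). Below is a hypothetical illustration of how such matching might work using Python's re module; match_table is an invented helper, not part of data_linter's API:

```python
import re

# Table specs mirroring the example config above (pattern semantics assumed).
config_tables = {
    "table1": {"pattern": None},        # null: file assumed to be called table1
    "table2": {"pattern": r"^table2"},  # regex matched against the file name
}

def match_table(filename, tables):
    """Return the name of the table whose pattern matches the file (hypothetical)."""
    stem = filename.split(".")[0]
    for name, spec in tables.items():
        pattern = spec.get("pattern")
        if pattern is None:
            # No pattern: the file stem must equal the table name exactly.
            if stem == name:
                return name
        elif re.match(pattern, filename):
            return name
    return None

print(match_table("table1.csv", config_tables))       # matches table1
print(match_table("table2_2020.csv", config_tables))  # matches table2 via regex
```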

How to update

We have tests that run on the current state of the poetry.lock file (i.e. the current dependencies). We also run tests based on the most up to date dependencies allowed in pyproject.toml. This allows us to see if there will be any issues when updating dependencies. These can be run locally from the tests folder.

When updating this package, make sure to change the version number in pyproject.toml and describe the change in CHANGELOG.md.

If you have changed any dependencies in pyproject.toml, run poetry update to update poetry.lock.

Once you have created a release in GitHub, to publish the latest version to PyPI, run:

poetry build
poetry publish -u <username>

Here, substitute <username> with your PyPI username. In order to publish to PyPI, you must be an owner of the project.

Process Diagram

(Diagram showing how the validation logic works.)

Download files

Download the file for your platform.

Source Distribution

data_linter-1.1.4.tar.gz (12.0 kB)

Uploaded Source

Built Distribution

data_linter-1.1.4-py3-none-any.whl (12.5 kB)

Uploaded Python 3

File details

Details for the file data_linter-1.1.4.tar.gz.

File metadata

  • Download URL: data_linter-1.1.4.tar.gz
  • Upload date:
  • Size: 12.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.10 CPython/3.8.5 Darwin/19.6.0

File hashes

Hashes for data_linter-1.1.4.tar.gz

  • SHA256: 6062ce6a50c6617ee563cf6473512a6a8112072cca116ca19e62e02730e28098
  • MD5: 8ce7f5b280d2e1e144ae605433606b9d
  • BLAKE2b-256: 36f2520d731d2ebd29d9c6a06d47e8ab7460caa190e5f377c11e3805424ccadd
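To verify a downloaded file against a published digest, you can compute the hash locally. A minimal sketch using Python's standard hashlib module:

```python
import hashlib

def sha256_hex(path):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the published digest, e.g.:
# sha256_hex("data_linter-1.1.4.tar.gz") ==
#     "6062ce6a50c6617ee563cf6473512a6a8112072cca116ca19e62e02730e28098"
```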


File details

Details for the file data_linter-1.1.4-py3-none-any.whl.

File metadata

  • Download URL: data_linter-1.1.4-py3-none-any.whl
  • Upload date:
  • Size: 12.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.10 CPython/3.8.5 Darwin/19.6.0

File hashes

Hashes for data_linter-1.1.4-py3-none-any.whl

  • SHA256: 8cd8b0147bad6d11397aeb96ab718f6c3785db49f4f8b3586adb460d28f0dcca
  • MD5: 725a8c3cc9158efa613c262f3a5ebbef
  • BLAKE2b-256: 1001a239e06eaf09bff26d3059acd54f0a088c5ef5d4a3b0d8f513e6b672fe0a

