data_linter

A Python package that validates datasets against a metadata schema, which is defined here.

It performs the following checks:

  • Are the columns of the correct data types? For untyped data formats such as CSV, can they be converted without error using pd.Series.astype?
  • Column names:
    • Are the columns named correctly?
    • Are they in the same order as specified in the metadata?
    • Are there any missing columns?
  • Where a regex pattern is provided in the metadata, does the actual data always fit the pattern?
  • Where an enum is provided in the metadata, does the actual data contain only values in the enum?
  • Where nullable is set to false in the metadata, are there really no nulls in the data?
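
To make the checks listed above concrete, here is a minimal sketch of how similar checks could be written in plain pandas. It is not data_linter's implementation: the metadata layout (a "columns" list with "name", "nullable", "enum" and "pattern" keys) and the helper check_columns are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical metadata layout, assumed for illustration only; the real
# schema is whatever the package's metadata specification defines.
meta = {
    "columns": [
        {"name": "id", "nullable": False},
        {"name": "status", "enum": ["open", "closed"]},
        {"name": "postcode", "pattern": r"^[A-Z]{2}\d$"},
    ]
}

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "status": ["open", "closed", "pending"],
        "postcode": ["AB1", "CD2", "bad"],
    }
)


def check_columns(df, meta):
    """Return a dict mapping a check name to True (passed) or False (failed)."""
    expected = [col["name"] for col in meta["columns"]]
    actual = list(df.columns)
    results = {
        "no_missing_columns": set(expected) <= set(actual),
        "names_and_order_match": expected == actual,
    }
    for col in meta["columns"]:
        name = col["name"]
        if name not in df.columns:
            continue  # already reported by no_missing_columns
        values = df[name]
        if col.get("nullable") is False:
            results[f"{name}_no_nulls"] = bool(values.notna().all())
        if "enum" in col:
            # Nulls are ignored here; the nullable check covers them.
            results[f"{name}_enum"] = bool(values.dropna().isin(col["enum"]).all())
        if "pattern" in col:
            matches = values.dropna().astype(str).str.match(col["pattern"])
            results[f"{name}_pattern"] = bool(matches.all())
    return results


results = check_columns(df, meta)
```

In this sample data the "status" column contains a value outside its enum and the "postcode" column contains a value that does not fit its pattern, so those two checks fail while the others pass.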

The package also provides impose_metadata_types_on_pd_df, which allows the user to safely convert a pandas dataframe to the data types specified in the metadata. This is useful when you have an untyped data file such as a CSV and want to ensure that it conforms to the metadata.
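
The idea of a safe conversion can be sketched with pd.Series.astype alone. The helper below, impose_types_sketch, and the target_types mapping are hypothetical names used for illustration; they are not the signature or behaviour of impose_metadata_types_on_pd_df, which may work differently.

```python
import io

import pandas as pd

# Simulate an untyped CSV: every column is read as text.
csv_text = "id,price\n1,2.5\n2,3.0\n"
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Hypothetical mapping of column name -> pandas dtype, assumed for illustration.
target_types = {"id": "int64", "price": "float64"}


def impose_types_sketch(df, target_types):
    """Return a converted copy of df, or raise ValueError naming the failing column."""
    converted = df.copy()  # leave the caller's dataframe untouched
    for name, dtype in target_types.items():
        try:
            converted[name] = converted[name].astype(dtype)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                f"column {name!r} cannot be converted to {dtype}"
            ) from exc
    return converted


typed = impose_types_sketch(raw, target_types)
```

Working on a copy means a failed conversion never leaves the original dataframe half converted, and the error message points at the column that does not conform.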

Usage

For detailed information about how to use the package, please see the demo repo. This includes an interactive tutorial that you can run in your web browser.

Here's a very basic example:

import json

import pandas as pd

from data_linter.lint import Linter


def read_json_from_path(path):
    # Load the metadata schema from a JSON file.
    with open(path) as f:
        return json.load(f)


# Read the metadata and the dataset, then lint the dataset against the metadata.
meta = read_json_from_path("tests/meta/test_meta_cols_valid.json")
df = pd.read_parquet("tests/data/test_parquet_data_valid.parquet")
linter = Linter(df, meta)
linter.check_all()
linter.markdown_report()

Download files

Source Distribution

data_linter-0.1.0.tar.gz (10.7 kB)

Built Distribution

data_linter-0.1.0-py3-none-any.whl (13.1 kB, Python 3)
