
Datavalid

Data validation library

This library lets you declare validation tasks to run against CSV files, ensuring data correctness in ETL pipelines that update frequently.

Installation

pip install datavalid

Usage

Create a datavalid.yml file in your data folder:

files:
  fuse/complaint.csv:
    schema:
      uid:
        description: >
          accused officer's unique identifier. This references the `uid` column in personnel.csv
      tracking_number:
        description: >
          complaint tracking number from the agency the data originate from
      complaint_uid:
        description: >
          complaint unique identifier
        unique: true
        no_na: true
    validation_tasks:
      - name: "`complaint_uid`, `allegation` and `uid` should be unique together"
        unique:
          - complaint_uid
          - uid
          - allegation
      - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
        empty:
          and:
            - column: allegation_finding
              op: equal
              value: sustained
            - column: disposition
              op: not_equal
              value: sustained
  fuse/event.csv:
    schema:
      event_uid:
        description: >
          unique identifier for each event
        unique: true
        no_na: true
      kind:
        options:
          - officer_level_1_cert
          - officer_pc_12_qualification
          - officer_rank
    validation_tasks:
      - name: no officer with more than 1 left date in a calendar month
        where:
          column: kind
          op: equal
          value: officer_left
        group_by: uid
        no_more_than_once_per_30_days:
          date_from:
            year_column: year
            month_column: month
            day_column: day
save_bad_rows_to: invalid_rows.csv

Then run the datavalid module in that folder:

python -m datavalid

You can also specify a data folder that isn't the current working directory:

python -m datavalid --dir my_data_folder

Config specification

A config file is a file named datavalid.yml placed in your root data folder, which is the folder that contains all of your data files. The config file contains a config object in YAML format.

Config object

  • files: required, a mapping between file names and file configurations. Each file path is evaluated relative to the root data folder, and each file must be in CSV format. Refer to the file object section to learn more about file configuration.
  • save_bad_rows_to: optional, the file to save offending rows to. If not defined, bad rows are printed to the terminal.
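
For example, a minimal config combining both fields (the file and column names here are illustrative) could look like:

files:
  data/people.csv:
    schema:
      person_id:
        unique: true
        no_na: true
save_bad_rows_to: bad_rows.csv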

File object

  • schema: optional, describes each column in this file as a mapping from column name to a column schema object.
  • validation_tasks: optional, a list of additional validation tasks to perform on this file. Refer to the task object section to learn more.

Column schema object

  • description: optional, textual description of this column.
  • unique: optional, if set to true then this column cannot contain duplicates.
  • no_na: optional, if set to true then this column cannot contain empty values.
  • integer: optional, if set to true then this column can only contain integers.
  • float: optional, if set to true then this column can only contain floats.
  • options: optional, list of valid values for this column.
  • range: optional, list of 2 numbers: the lower and upper bounds of valid values. Setting this implies float: true.
  • title_case: optional, if set to true then every word in this column must begin with an upper-case letter.
  • match_regex: optional, regular expression pattern that every value must match.
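
As a sketch, a single column entry can combine several of these fields (the column name and values are illustrative):

age:
  description: >
    officer age at the time of the event
  range:
    - 18
    - 80
  no_na: true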

Task object

Common fields:

  • name: required, name of validation task.
  • where: optional, how to filter the data. This field accepts a condition object.
  • group_by: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.
  • warn_only: optional, if set to true then failing this validation only generates a warning rather than failing the whole run.

Checker fields (define exactly one of these fields):

  • unique: optional, column name or list of column names that must be unique.
  • empty: optional, accepts a condition object and ensures that no row fulfills the condition.
  • no_more_than_once_per_30_days: optional, ensures that no two rows occur fewer than 30 days apart. Accepts the following field:
    • date_from: required, how to parse a date from the given data. Accepts a date parser object.
  • no_consecutive_date: optional, ensures that no rows occur on consecutive days. Accepts the following field:
    • date_from: required, how to parse a date from the given data. Accepts a date parser object.
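
Putting the common and checker fields together, a task that filters rows before checking uniqueness might look like this (the column names and values are illustrative):

- name: each officer holds at most one rank record per agency
  where:
    column: kind
    op: equal
    value: officer_rank
  group_by: agency
  unique: uid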

Condition object

There are 3 ways to define a condition. The first is to provide column, op and value:

  • column: optional, column name to compare
  • op: optional, the comparison operation to use. Possible values are:
    • equal
    • not_equal
    • greater_than
    • less_than
    • greater_equal
    • less_equal
  • value: optional, the value to compare with.

The second way is to provide an and field:

  • and: optional, list of conditions to combine into one condition. The combined condition is fulfilled only when all sub-conditions are fulfilled. Each sub-condition may contain any field that is valid for a condition object.

Finally, the last way is to provide an or field:

  • or: optional, same as and except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
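
The and semantics can be illustrated with a small pandas filter. This mirrors the first empty task in the example config above; it is only an illustration of the behavior, not datavalid's internal implementation, and the data is made up:

```python
import pandas as pd

# Sample complaint rows (illustrative data).
df = pd.DataFrame({
    "allegation_finding": ["sustained", "sustained", "exonerated"],
    "disposition": ["sustained", "pending", "exonerated"],
})

# Condition: allegation_finding == "sustained" AND disposition != "sustained".
# An `empty` task passes only when no row matches the condition.
mask = (df["allegation_finding"] == "sustained") & (df["disposition"] != "sustained")
bad_rows = df[mask]
print(len(bad_rows))  # 1 offending row: sustained finding but pending disposition
```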

Date parser

Combines multiple columns to create dates.

  • year_column: required, year column name.
  • month_column: required, month column name.
  • day_column: required, day column name.
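
Conceptually, a date parser combines the three columns the way pandas does. The following is a sketch of that behavior with made-up data, not datavalid's actual code:

```python
import pandas as pd

# Event rows with the date split across three columns, matching the
# year_column/month_column/day_column fields of a date parser object.
df = pd.DataFrame({
    "uid": ["a", "a", "b"],
    "year": [2021, 2021, 2020],
    "month": [3, 4, 12],
    "day": [15, 20, 1],
})

# pd.to_datetime assembles a single datetime column from a frame
# whose columns are named "year", "month" and "day".
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df["date"].dt.strftime("%Y-%m-%d").tolist())
# → ['2021-03-15', '2021-04-20', '2020-12-01']
```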
