Data validation library

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Datavalid

This library lets you declare validation tasks that check CSV files. This helps ensure data correctness for ETL pipelines that update frequently.

Installation

pip install datavalid

Usage

Create a datavalid.yml file in your data folder:

files:
  fuse/complaint.csv:
    - name: "`complaint_uid` should be unique per `allegation` x `uid`"
      unique:
        - complaint_uid
        - uid
        - allegation
    - name: if `allegation_finding` is "sustained" then `disposition` should also be "sustained"
      empty:
        and:
          - column: allegation_finding
            op: equal
            value: sustained
          - column: disposition
            op: not_equal
            value: sustained
  fuse/event.csv:
    - name: no officer with more than 1 left date in a calendar month
      where:
        column: kind
        op: equal
        value: officer_left
      group_by: uid
      no_more_than_once_per_30_days:
        date_from:
          year_column: year
          month_column: month
          day_column: day
save_bad_rows_to: invalid_rows.csv

Then run the datavalid command in that folder:

python -m datavalid

You can also specify a data folder that isn't the current working directory:

python -m datavalid --dir my_data_folder
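If the check is triggered from a Python script (for example at the end of an ETL step), the two invocations above could be wrapped as in the sketch below. The helper names `datavalid_command` and `run_datavalid` are hypothetical, and `sys.executable` stands in for `python`. What exit code datavalid returns when a task fails is not documented here, so the sketch only passes the code through without interpreting it.

```python
import subprocess
import sys
from typing import List, Optional


def datavalid_command(data_dir: Optional[str] = None) -> List[str]:
    """Build the command line shown above; --dir is added only when a folder is given."""
    cmd = [sys.executable, "-m", "datavalid"]
    if data_dir is not None:
        cmd += ["--dir", data_dir]
    return cmd


def run_datavalid(data_dir: Optional[str] = None) -> int:
    """Run datavalid as a subprocess and return its exit code.

    Assumption: the meaning of the exit code is not specified in this README,
    so callers should not rely on it without checking the library's behaviour.
    """
    return subprocess.run(datavalid_command(data_dir)).returncode
```

Calling `run_datavalid("my_data_folder")` would then be equivalent to running the second command above by hand.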

Config specification

The config file must be named datavalid.yml and placed in your root data folder, which is the folder that contains all of your data files. The config file contains a config object in YAML format.

Config object

files: required, a mapping from file paths to the validation tasks for each file. Each file path is evaluated relative to the root data folder, and each file must be in CSV format. Refer to the task object section to learn more about validation tasks.
save_bad_rows_to: optional, the file to save offending rows to. If not defined, bad rows are only printed to the terminal.
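To make the shape of the config object concrete, here is a minimal sketch that checks a config after it has been parsed into plain Python dicts and lists (which is what a YAML parser would typically produce). The function `check_config_shape` is hypothetical and is not part of datavalid. It only mirrors the two fields described above and assumes, following the usage example, that each file maps to a list of tasks.

```python
from typing import Any, Dict, List


def check_config_shape(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed config object (empty if none)."""
    problems = []
    files = config.get("files")
    if not isinstance(files, dict):
        problems.append("files: required mapping of CSV path -> list of tasks")
    else:
        for path, tasks in files.items():
            if not isinstance(tasks, list):
                problems.append(f"{path}: expected a list of task objects")
    save_to = config.get("save_bad_rows_to")
    if save_to is not None and not isinstance(save_to, str):
        problems.append("save_bad_rows_to: expected a file path string")
    return problems


# The same structure as the first file entry of the usage example, as Python data.
example = {
    "files": {
        "fuse/complaint.csv": [
            {"name": "complaint_uid is unique", "unique": ["complaint_uid", "uid", "allegation"]},
        ],
    },
    "save_bad_rows_to": "invalid_rows.csv",
}
print(check_config_shape(example))  # []
```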

Task object

Common fields:

name: required, name of validation task.
where: optional, how to filter the data. This field accepts a condition object.
group_by: optional, how to divide the data before validation. This could be a single column name or a list of column names to group the data with.

Checker fields (define exactly one of these fields):

unique: optional, a column name or list of column names to ensure uniqueness.
empty: optional, accepts a condition object and ensures that no row fulfills this condition.
no_more_than_once_per_30_days: optional, ensures that no 2 rows occur closer than 30 days apart. Accepts the following fields:
- date_from: required, how to parse dates from the given data. Accepts a date parser object.
no_consecutive_date: optional, ensures that no rows occur on consecutive days. Accepts the following fields:
- date_from: required, how to parse dates from the given data. Accepts a date parser object.
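The sketch below illustrates what two of these checkers test for, using only the standard library. It is one reading of the descriptions above, not datavalid's own implementation: `duplicate_rows` and `closer_than_30_days` are hypothetical helpers, and "closer than 30 days apart" is assumed to mean a gap of fewer than 30 days between neighbouring dates. When group_by is set, a check like this would presumably be applied to each group separately.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Sequence


def duplicate_rows(rows: List[Dict[str, str]], columns: Sequence[str]) -> List[Dict[str, str]]:
    """Rows whose combined values in `columns` appear more than once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] > 1]


def closer_than_30_days(dates: List[date]) -> List[date]:
    """Dates that fall fewer than 30 days after the previous date in sorted order."""
    ordered = sorted(dates)
    return [cur for prev, cur in zip(ordered, ordered[1:]) if (cur - prev).days < 30]


rows = [
    {"complaint_uid": "c1", "uid": "u1"},
    {"complaint_uid": "c1", "uid": "u1"},
    {"complaint_uid": "c2", "uid": "u1"},
]
print(len(duplicate_rows(rows, ["complaint_uid", "uid"])))  # 2

left_dates = [date(2021, 1, 1), date(2021, 1, 20), date(2021, 3, 1)]
print(closer_than_30_days(left_dates))  # [datetime.date(2021, 1, 20)]
```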

Condition object

There are 3 ways to define a condition. The first way is to provide column, op and value:

column: optional, column name to compare
op: optional, comparison operation to use. Possible values are:
- equal
- not_equal
- greater_than
- less_than
- greater_equal
- less_equal
value: optional, the value to compare with.

The second way is to provide an `and` field:

and: optional, a list of conditions to combine into one condition. The condition is fulfilled when all of the sub-conditions are fulfilled. Each sub-condition can have any field that is valid for a condition object.

The third way is to provide an `or` field:

or: optional, same as `and`, except that the sub-conditions are or-ed together, which means the condition is fulfilled if any of the sub-conditions is fulfilled.
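The three forms of a condition object can be read as a small recursive evaluator. The sketch below is an illustration of those semantics rather than datavalid's code: `matches` and `OPS` are hypothetical names, conditions are given as parsed Python dicts, and values are compared as-is (how the library converts CSV strings to numbers before comparing is not specified here).

```python
import operator
from typing import Any, Dict

# Map each documented op name to the matching Python comparison.
OPS = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "greater_equal": operator.ge,
    "less_equal": operator.le,
}


def matches(row: Dict[str, Any], cond: Dict[str, Any]) -> bool:
    """Evaluate a condition object against one row."""
    if "and" in cond:
        return all(matches(row, sub) for sub in cond["and"])
    if "or" in cond:
        return any(matches(row, sub) for sub in cond["or"])
    return OPS[cond["op"]](row[cond["column"]], cond["value"])


# The condition from the usage example, as Python data.
cond = {
    "and": [
        {"column": "allegation_finding", "op": "equal", "value": "sustained"},
        {"column": "disposition", "op": "not_equal", "value": "sustained"},
    ]
}
rows = [
    {"allegation_finding": "sustained", "disposition": "sustained"},
    {"allegation_finding": "sustained", "disposition": "exonerated"},
]
# With the `empty` checker, rows that match the condition are the offending ones.
bad_rows = [row for row in rows if matches(row, cond)]
print(bad_rows)
```

Here only the second row matches, so an `empty` task with this condition would report it as a bad row.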

Date parser

Combines multiple columns to create dates.

year_column: required, year column name.
month_column: required, month column name.
day_column: required, day column name.
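As a rough illustration, a date parser object can be thought of as a recipe for turning three columns of a row into one date. The helper `parse_date` below is hypothetical and assumes every row has integer-like year, month and day values. How datavalid treats missing or malformed values is not described here, so the sketch simply lets `int` or `date` raise an error in that case.

```python
from datetime import date
from typing import Dict


def parse_date(row: Dict[str, str], date_from: Dict[str, str]) -> date:
    """Combine the year, month and day columns named in `date_from` into a date."""
    return date(
        int(row[date_from["year_column"]]),
        int(row[date_from["month_column"]]),
        int(row[date_from["day_column"]]),
    )


# Column names taken from the usage example.
date_from = {"year_column": "year", "month_column": "month", "day_column": "day"}
row = {"uid": "u1", "kind": "officer_left", "year": "2021", "month": "5", "day": "18"}
print(parse_date(row, date_from))  # 2021-05-18
```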

Release history

0.3.6

Nov 24, 2021

0.3.5

Nov 24, 2021

0.3.4

Nov 12, 2021

0.3.3

Nov 12, 2021

0.3.2

Nov 12, 2021

0.3.1

Nov 12, 2021

0.3.0

Nov 9, 2021

0.2.3

Aug 17, 2021

0.2.2

Aug 17, 2021

0.2.1

Jun 8, 2021

0.2.0

Jun 8, 2021

0.1.0

May 31, 2021

This version

0.0.2

May 18, 2021

0.0.1

May 18, 2021

Download files

Source Distribution

datavalid-0.0.2.tar.gz (14.8 kB)

Uploaded May 18, 2021 Source

Built Distribution

datavalid-0.0.2-py3-none-any.whl (19.4 kB)

Uploaded May 18, 2021 Python 3

File details

Details for the file datavalid-0.0.2.tar.gz.

File metadata

Download URL: datavalid-0.0.2.tar.gz
Upload date: May 18, 2021
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.2

File hashes

Hashes for datavalid-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`72b8b0bf6df90aa2da787c6191cfe982176b0bf964cad10ab2b417bfae39f7bf`
MD5	`6327e765b9751198e4a30acbd5749c0a`
BLAKE2b-256	`53cc16975fe5a3644714fe590b2f20bc62cefd79d709203e54028c8c144318e9`

File details

Details for the file datavalid-0.0.2-py3-none-any.whl.

File metadata

Download URL: datavalid-0.0.2-py3-none-any.whl
Upload date: May 18, 2021
Size: 19.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.2

File hashes

Hashes for datavalid-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b328cd96e8090fd2cc3e2fafed40f349c2cf46f511b46e78aa2a7c156730ba58`
MD5	`2f48beb6eaedd12a4eb260682916d0f7`
BLAKE2b-256	`59ebea27ed4c221b839bcde06c82c0f2e55252fd1707d090ca26579d404fbbdd`
