A package for data validation with pandas dataframes

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

dfval

dfval can be used to compare two dataframes and identify differences between the dataframes. It also has methods to validate attributes or update dataframes.

Below is example creates two small dataframes and does a comparison using dfval as well as more detail on how to use dfval.

Example of `compare()`

1. Import dfval and pandas

import dfval as dv
import pandas as pd

2. Create two dataframes (x and y)

dx = [[5000, '2019-12-15', 100, 2], [5000, '2019-12-22', 200, 3], [5000, '2019-12-29', 500, 4], [5000, '2019-12-08', 75, 3]]
dfx = pd.DataFrame(dx, columns = ['loc_id', 'greg_d', 'qty_sum', 'qty_ct'])

dy = [[5000, '2019-12-15', 100.000001, 2], [5000, '2019-12-22', 190, 3], [5000, '2019-12-29', 500, 4], [5000, '2019-12-01', 75, 3]]
dfy = pd.DataFrame(dy, columns = ['loc_id', 'greg_d', 'qty_sum', 'qty_ct'])

3. Define the key and quantity fields for comparison

k = ['loc_id', 'greg_d']

qty_f = ['qty_sum', 'qty_ct']

4. Call compare()

# call compare passing the x dataframe, y dataframe, key, the quantity field name, and optionally the keep_comparison (True or False), decimal rounding, and threshold for quantity comparison
c = dv.compare(dfx, dfy, k, qty_f, keep_comparison = True, decimal_round = 4, threshold = 0.01)

5. Contents of c.results

rec_chk_pass	rec_count_x	rec_count_y	rec_count_z	rec_count_same	rec_count_diff	rec_count_notinx	rec_count_notiny	rec_count_ex
True	4	4	5	2	1	1	1	3

6. Contents of c.exceptions

loc_id	greg_d	qty_sum_x	qty_ct_x	qty_sum_y	qty_ct_y	diff_qty_sum	diff_qty_ct	exception_type
5000	2019-12-22	200.0	3.0	190.0	3.0	10.0	0.0	diff_qty
5000	2019-12-08	75.0	3.0	NaN	NaN	NaN	NaN	not_in_y
5000	2019-12-01	NaN	NaN	75.0	3.0	NaN	NaN	not_in_x

7. Contents of c.comparison

loc_id	greg_d	qty_sum_x	qty_ct_x	qty_sum_y	qty_ct_y	diff_qty_sum	diff_qty_ct	exception_type
5000	2019-12-15	100.0	2.0	100.0	2.0	0.0	0.0	match
5000	2019-12-22	200.0	3.0	190.0	3.0	10.0	0.0	diff_qty
5000	2019-12-29	500.0	4.0	500.0	4.0	0.0	0.0	match
5000	2019-12-08	75.0	3.0	NaN	NaN	NaN	NaN	not_in_y
5000	2019-12-01	NaN	NaN	75.0	3.0		NaN	NaN

`compare()` detail

compare() can be used to compare dimensions in two dataframes or dimensions and quantity values. To compare dimensions only, leave out the qty_n, decimal_round, and threshold parameters listed below.

Input

compare() has three required parameters and four optional parameters:

Required parameters
- x: one of two dataframes used the comparison (order doesn't matter)
- y: second of two dataframes used in comparison (order doesn't matter)
- k: a list of the common key in the dataframes used for comparison
Optional parameters
- qty_n: a list of the quantity field(s) to compare
  - Note: if the qty_n is not passed, compare() will do a comparison of the dimensions (k) only
- keep_comparison: retains all the records of the comparison in results comparison dataframe
- decimal_round: number of digits to round the quantity fields in the comparison (fields listed in qty_n)
- threshold: a threshold for what should be considered an exception in the quantity comparison

Output

compare() returns an object with two dataframes describing the results of the comparison and the exceptions.

exceptions: one record for each difference
- exception types (exception_type):
  - not_in_x: key is present in y dataframe but not x
  - not_in_y: key is present in x dataframe but not y
  - diff_qty: key is present in both x and y dataframe but difference in one or more of the quantity columns is greater than threshold
result: one record showing counts for comparison
- rec_chk_pass:
  - Does the record count in the outer join of x and y equal the sum of the record counts of: records that match, records with quantity difference, records in x by not y, and records in y but not x
  - As long as there are not duplicates in the key in the x or y dataset, this should always be True
comparison: all records from the result of the comparison (only retained if keep_comparison = True
- exception types are the same as the exceptions dataframe, plus this has the records that match

Notes:

compare() first checks the x and y DataFrames to make sure both have all columns defined in the passed in key and quantity field(s). If one or more columns is missing in either DataFrame, a message prints ("column name check failed") and the comparison is not run. exceptions and results will be None.
NaN values in the quantity fields in the x and y datasets are converted to zero. If this happens, a message is printed
compare() checks for duplicate keys and prints a message if duplicates are found. The comparison continues to run and creates results. However, the results may be be what is expected. It is better to handle duplicates in the key prior to using compare().
The result of the record count check prints if the result is FALSE:
- "Record counts for same, diff, and in one but not other DO NOT match"

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.1.0

Sep 11, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dfval-0.1.0.tar.gz (6.3 kB view hashes)

Uploaded Sep 11, 2020 Source

Built Distribution

dfval-0.1.0-py3-none-any.whl (12.3 kB view hashes)

Uploaded Sep 11, 2020 Python 3

Hashes for dfval-0.1.0.tar.gz

Hashes for dfval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`07e07dcd06b2b9fcdcfc45ce967c24fb6b3a6aed9af6e7d7d8671b09744de618`
MD5	`e6ca358f604236832884bb83973b3237`
BLAKE2b-256	`5643638169618b4fe25bd71ab258386cec0046f265133324d955831d062daa1e`

Hashes for dfval-0.1.0-py3-none-any.whl

Hashes for dfval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`56e6460a5228f59f832b15d1ba5d427067e994710115eb1acbc30d30a63062ef`
MD5	`9a210a7000cfa1041e6f9d8e0e4b2326`
BLAKE2b-256	`3eb66cf1ff64d9a504552e69aebc4c66a72494173a96144dc126e7c6c538dc7b`

dfval 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Project description

dfval

Example of `compare()`

`compare()` detail

Input

Output

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

dfval 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Project description

dfval

Example of compare()

compare() detail

Input

Output

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

Example of `compare()`

`compare()` detail