Rules for validating and correcting datasets

DataRules

Goal and motivation

This project defines rules to validate and correct datasets. Wherever possible, rules are applied in a vectorized way, which keeps the library fast.

Reasons to make this:

  • Implement an alternative to https://github.com/data-cleaning/ based on Python and pandas.
  • Implement both validation and correction. Most existing packages provide validation only.
  • Support a rule-based way of processing data. The rules can be maintained in a separate file (Python or YAML) if required.
  • Apply vectorization to make processing fast.
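
To illustrate the last point, here is a minimal sketch in plain pandas (it does not use datarules, and the small DataFrame is made up for the example). It evaluates the same condition once row by row and once as a single column expression; both give the same result, but the vectorized form avoids a Python-level function call per row.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"width": [3, 3, 3], "height": [7, 5, 8]})

# Row by row: the lambda is called once per row.
row_wise = df.apply(lambda row: abs(row["width"] - row["height"]) <= 4, axis=1)

# Vectorized: one expression evaluated over whole columns.
vectorized = (df["width"] - df["height"]).abs() <= 4

print(row_wise.tolist() == vectorized.tolist())
```

The checks in the Usage section below are written in the second style: they receive whole columns and return a boolean column.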

Usage

This package provides two operations on data:

  • checks (whether the data is correct), also known as validations
  • corrections (how to fix incorrect data)

Checks

In checks.py:

from datarules import check


@check(tags=["P1"])
def check_almost_square(width, height):
    return (width - height).abs() <= 4


@check(tags=["P3", "completeness"])
def check_not_too_deep(depth):
    return depth <= 2

In your main code:

import pandas as pd
from datarules import CheckList

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

checks = CheckList.from_file('checks.py')
report = checks.run(df)
print(report)

Output:

                  name                           condition  items  passes  fails  NAs error  warnings
0  check_almost_square  check_almost_square(width, height)      5       3      2    0  None         0
1   check_not_too_deep           check_not_too_deep(depth)      5       1      4    0  None         0
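
The counts in this report can be reproduced by hand. The following is a sketch in plain pandas that evaluates the two conditions directly; it does not use datarules and is not meant to describe how the library computes its report. Note that rows without a depth hold NaN, and NaN <= 2 evaluates to False, which is why those rows are counted as fails rather than NAs.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# The same conditions as in checks.py, applied to whole columns.
almost_square = (df["width"] - df["height"]).abs() <= 4
not_too_deep = df["depth"] <= 2  # a comparison with NaN yields False

# Hand-built summary; the column names mirror the report above.
summary = pd.DataFrame(
    {
        "items": [len(df), len(df)],
        "passes": [int(almost_square.sum()), int(not_too_deep.sum())],
        "fails": [int((~almost_square).sum()), int((~not_too_deep).sum())],
    },
    index=["check_almost_square", "check_not_too_deep"],
)
print(summary)
```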

Corrections

In corrections.py:

from datarules import correction
from checks import check_almost_square


@correction(condition=check_almost_square.fails)
def make_square(width, height):
    return {"height": height + (width - height) / 2}

In your main code:

from datarules import CorrectionList

corrections = CorrectionList.from_file('corrections.py')
report = corrections.run(df)
print(report)

Output:

          name                                 condition                      action  applied error  warnings
0  make_square  check_almost_square.fails(width, height)  make_square(width, height)        2  None         0
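
To see what this correction amounts to, here is a sketch of the same logic in plain pandas. It assumes the action is applied only to rows where the condition holds (the rows failing check_almost_square), which matches the applied count of 2 in the report; it does not use datarules and says nothing about how the library applies the correction internally.

```python
import pandas as pd

df = pd.DataFrame({
    "width": [3, 3, 3, 3, 3],
    "height": [7, 5, 8, 3, -2],
})

# Rows where check_almost_square fails.
fails = ~((df["width"] - df["height"]).abs() <= 4)

# Move height halfway towards width, but only on the failing rows.
new_height = df["height"] + (df["width"] - df["height"]) / 2
corrected = df.assign(height=df["height"].mask(fails, new_height))

print(int(fails.sum()))  # number of rows the correction applies to
print(corrected)
```

After this step, the two affected heights (8 and -2) become 5.5 and 0.5, so every row satisfies the original check.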

Similar work (python)

These work on pandas, but only do validation:

  • Pandera - Like us, their checks are also vectorized.
  • Pandantic - Combination of validation and parsing based on pydantic.

The following offer validation only, but none of them seem to be vectorized or support pandas directly.

Similar work (R)

This project is inspired by https://github.com/data-cleaning/. Similar functionality can be found in the following R packages:

  • validate - Checking data (implemented)
  • dcmodify - Correcting data (implemented)
  • errorlocate - Identifying and removing errors (not yet implemented)
  • deductive - Deductive correction based on checks (not yet implemented)

Features found in the packages above but not yet implemented here might eventually make it into this package too.
