
Rules for validating and correcting datasets


DataRules

Goal and motivation

This project defines rules to validate and correct datasets. Wherever possible, rules are evaluated in a vectorized way, which keeps the library fast.

Reasons to make this:

  • Implement an alternative to https://github.com/data-cleaning/ based on Python and pandas.
  • Implement both validation and correction. Most existing packages provide validation only.
  • Support a rule-based way of data processing. The rules can be maintained in a separate file (Python or YAML) if required.
  • Apply vectorization to make processing fast.
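
To illustrate what vectorization means here, the sketch below compares a row-by-row check with the equivalent column-wise expression. It uses plain pandas only, not the DataRules API, and the sample data and threshold are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"width": [3, 3, 10], "height": [7, 8, 10]})

# Row-wise: one Python-level evaluation per row.
row_wise = [abs(row.width - row.height) <= 4 for row in df.itertuples()]

# Vectorized: a single expression evaluated over whole columns.
vectorized = (df["width"] - df["height"]).abs() <= 4

# Both approaches give the same answer; the vectorized one avoids the Python loop.
assert row_wise == vectorized.tolist()
```

On large frames the column-wise form is usually much faster, because the loop runs inside pandas/NumPy rather than in the Python interpreter.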

Usage

This package provides two operations on data:

  1. Checks (if data is correct). Also known as validations.
  2. Corrections (how to fix incorrect data).

Example

Create some data

import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 8},
])
  1. Check the data
from datarules import CheckList, Check
from uneval import var

checks = CheckList([
    Check(name="almost_square",
          tags=["low-priority"],
          test=(var.width - var.height).abs() <= 4),
])
check_report = checks.run(df)
print(check_report)

Output:

CheckReport
-----------
          name                         test  items  passes  fails  NAs error  warnings
 almost_square  (width - height).abs() <= 4      2       1      1    0  None         0
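
The pass and fail counts in this report can be cross-checked with plain pandas. The following is only a sketch of an equivalent computation, not a description of how DataRules works internally; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 8},
])

# Same condition as the "almost_square" check, evaluated on whole columns.
result = (df["width"] - df["height"]).abs() <= 4

passes = int(result.sum())     # rows where the condition holds
fails = int((~result).sum())   # rows where it does not
print(passes, fails)
```

The first row has a difference of exactly 4 and passes; the second has a difference of 5 and fails, which matches the report.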
  2. Correct the data
from datarules import CorrectionList, Correction

corrections = CorrectionList([
    Correction(name="correct_square",
               trigger=checks[0].fails,
               action={"height": var.height / 2 + var.width / 2}),
])
correction_report = corrections.run(df)
print(correction_report)
print(f"Modified data:\n{df}")

Output:

CorrectionReport
----------------
           name                             trigger                           action  applied error  warnings
 correct_square  almost_square.fails(height, width)  height = height / 2 + width / 2        1  None         0

Modified data:
   width  height
0      3     7.0
1      3     5.5
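
For comparison, the same correction can be written directly in pandas as a masked assignment. This is a hedged sketch of the idea (the trigger selects rows, the action assigns new values), not the library's actual implementation.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 8},
])

# Trigger: rows that fail the "almost_square" condition.
fails = ~((df["width"] - df["height"]).abs() <= 4)

# Action: replace height by the mean of height and width, only where triggered.
# Cast to float first so the fractional result fits the column.
df["height"] = df["height"].astype(float)
df.loc[fails, "height"] = df["height"] / 2 + df["width"] / 2

print(df)
```

Only the second row is modified, since the first row already satisfies the check.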

See more examples on DataRules examples.

Similar work (Python)

These work on pandas, but only do validation:

  • Pandera - Like us, their checks are also vectorized.
  • Pandantic - Combination of validation and parsing based on pydantic.

The following offer validation only, but none of them seem to be vectorized or support pandas directly.

Similar work (R)

This project is inspired by https://github.com/data-cleaning/. Similar functionality can be found in the following R packages:

  • validate - Checking data (implemented)
  • dcmodify - Correcting data (implemented)
  • errorlocate - Identifying and removing errors (A start has been made here)
  • deductive - Deductive correction based on checks (not yet implemented)

