# pymodify

Rules for validating and correcting datasets.

## Goal and motivation
The idea of this project is to define rules that validate and correct datasets. Whenever possible, rules are applied in a vectorized way, which makes this library fast.
Reasons to make this:
- Implement the whole data pipeline in a single language (python). No need to call subprocess or http to send your data to R and back.
- Directly use pandas and all other python packages you are already familiar with. No need to relearn how everything is done in R.
- Validation can be fast if vectorized.
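
To illustrate the last point with plain pandas (not this library's API): a vectorized check evaluates one expression over whole columns, instead of looping over rows in Python.

```python
import pandas as pd

df = pd.DataFrame({"width": [3, 3, 3, 3, 3], "height": [7, 5, 8, 3, -2]})

# Vectorized: a single expression over whole columns (fast).
vectorized = (df["width"] - df["height"]).abs() < 5

# Row-wise: a Python-level loop via apply (slow on large frames).
row_wise = df.apply(lambda row: abs(row["width"] - row["height"]) < 5, axis=1)

# Both give the same verdicts; only the speed differs.
print(vectorized.tolist())  # [True, True, False, True, False]
```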
## Usage

This package provides two operations on data:

- checks (whether data is correct), also known as validations.
- corrections (how to fix incorrect data).
### Checks

In `checks.py`:

```python
from datarules import check


@check(tags=["P1"])
def check_almost_square(width, height):
    return (width - height).abs() < 5


@check(tags=["P3", "completeness"])
def check_not_too_deep(depth):
    return depth < 3
```
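
Judging from the `.abs()` call, each check parameter is bound to a whole pandas column (a Series), so a check returns one boolean verdict per row. The same expression can be tried standalone, outside the library:

```python
import pandas as pd

# Stand-ins for the width and height columns of a DataFrame.
width = pd.Series([3, 3, 3, 3, 3])
height = pd.Series([7, 5, 8, 3, -2])

# Same expression as check_almost_square, evaluated directly on Series.
result = (width - height).abs() < 5
print(result.tolist())  # [True, True, False, True, False]
```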
In your main code:

```python
import pandas as pd
from datarules import load_checks, Runner

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

checks = load_checks('checks.py')
report = Runner().check(df, checks)
print(report)
```
Output:

```
                  name                           condition  items  passes  fails  NAs error  warnings
0  check_almost_square  check_almost_square(width, height)      5       3      2    0  None         0
1   check_not_too_deep           check_not_too_deep(depth)      5       1      4    0  None         0
```
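
To see which rows the first check rejects, its condition can be rebuilt with plain pandas (an illustration only, not a reporting API of this package):

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# Boolean mask with the same condition as check_almost_square.
mask = (df["width"] - df["height"]).abs() < 5

# The 2 failing rows counted in the report above.
print(df[~mask].index.tolist())  # [2, 4]
```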
### Corrections

In `corrections.py`:

```python
from datarules import correction
from checks import check_almost_square


@correction(condition=check_almost_square.fails)
def make_square(width, height):
    return {"height": height + (width - height) / 2}
```
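
The returned dict updates the `height` column. Arithmetically, `height + (width - height) / 2` equals the midpoint `(width + height) / 2`, so the correction moves height halfway toward width. A plain-Python check of that arithmetic (the helper name is illustrative, not part of the library):

```python
def make_square_height(width, height):
    # Same formula as the correction above, on plain numbers.
    return height + (width - height) / 2

# Midpoints of the two failing rows from the example data.
assert make_square_height(3, 8) == 5.5    # midpoint of 3 and 8
assert make_square_height(3, -2) == 0.5   # midpoint of 3 and -2
```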
In your main code:

```python
from datarules import load_corrections

corrections = load_corrections('corrections.py')
report = Runner().correct(df, corrections)
print(report)
```
Output:

```
          name                                 condition                      action  applied error  warnings
0  make_square  check_almost_square.fails(width, height)  make_square(width, height)        2  None         0
```
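
The `applied` count of 2 matches the two rows that failed `check_almost_square`. The effect of the correction can be reproduced with plain pandas (a sketch of the outcome, not this package's internals):

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# Widen to float so the correction's fractional values fit.
df["height"] = df["height"].astype("float64")

# Rows where check_almost_square fails.
fails = ~((df["width"] - df["height"]).abs() < 5)

# Apply the correction formula only to the failing rows.
df.loc[fails, "height"] = df["height"] + (df["width"] - df["height"]) / 2
print(df["height"].tolist())  # [7.0, 5.0, 5.5, 3.0, 0.5]
```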
## Similar work (python)

These work on pandas:

- Pandera - A good alternative for validation only. Like this project, its checks are vectorized.
- Pandantic - A combination of validation and parsing based on pydantic.

The following offer validation only, but none of them seem to be vectorized or to support pandas directly:

- Great Expectations - An overengineered library for validation with confusing documentation.
- contessa - Meant to be used against databases.
- validator
- python-valid8
- pyruler - Dead project that is rule-based.
- pyrules - Dead project for corrections.
## Similar work (R)

This project is inspired by https://github.com/data-cleaning/. Similar functionality can be found in the R packages hosted there.

Features found in one of those packages but not implemented here might eventually make it into this package too.