Rules for validating and correcting datasets
DataRules
Goal and motivation
This project lets you define rules that validate and correct datasets. Wherever possible, rules are evaluated in a vectorized way, which keeps the library fast.
Reasons to make this:
- Implement an alternative to https://github.com/data-cleaning/ based on Python and pandas.
- Implement both validation and correction. Most existing packages provide validation only.
- Support a rule-based approach to data processing. The rules can be maintained in a separate file (Python or YAML) if required.
- Apply vectorization to make processing fast.
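To illustrate what vectorization means here, the following minimal sketch uses plain pandas only and does not show how DataRules is implemented internally. The sample data and the variable names `rowwise` and `vectorized` are made up for illustration. Both approaches give the same result, but the vectorized form evaluates one expression over whole columns instead of calling a Python function once per row.

```python
import pandas as pd

# Illustrative data, not taken from the DataRules documentation.
df = pd.DataFrame({"width": [3, 3, 10], "height": [7, 8, 2]})

# Row-by-row: one Python-level function call per row.
rowwise = df.apply(lambda row: abs(row["width"] - row["height"]) <= 4, axis=1)

# Vectorized: a single expression evaluated over entire columns.
vectorized = (df["width"] - df["height"]).abs() <= 4

print(rowwise.tolist())
print(vectorized.tolist())
```

On large frames the row-by-row version is typically much slower, which is why a rule engine benefits from expressing rules as column expressions.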
Usage
This package provides two operations on data:
- Checks (whether the data is correct). Also known as validations.
- Corrections (how to fix incorrect data).
Example
- Create some data
import pandas as pd
df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 8},
])
- Check the data
from datarules import CheckList, Check
from uneval import var
checks = CheckList([
    Check(name="almost_square",
          tags=["low-priority"],
          test=(var.width - var.height).abs() <= 4),
])
check_report = checks.run(df)
print(check_report)
Output:
CheckReport
-----------
name           test                         items  passes  fails  NAs  error  warnings
almost_square  (width - height).abs() <= 4      2       1      1    0   None         0
- Correct the data
from datarules import CorrectionList, Correction
corrections = CorrectionList([
    Correction(name="correct_square",
               trigger=checks[0].fails,
               action={"height": var.height / 2 + var.width / 2}),
])
correction_report = corrections.run(df)
print(correction_report)
print(f"Modified data:\n{df}")
Output:
CorrectionReport
----------------
name            trigger                             action                           applied  error  warnings
correct_square  almost_square.fails(height, width)  height = height / 2 + width / 2        1   None         0
Modified data:
   width  height
0      3     7.0
1      3     5.5
More examples can be found in the DataRules examples.
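To make the numbers in the two reports easier to follow, here is a rough plain-pandas equivalent of the check and the correction above. It is only a sketch of the arithmetic, not the DataRules API or its implementation; the names `passed`, `failed`, `n_passes` and `n_fails` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 8},
])

# The check: a boolean Series that is True where a row passes.
passed = (df["width"] - df["height"]).abs() <= 4
n_passes = int(passed.sum())
n_fails = int((~passed).sum())

# The correction: recompute height only on the rows that failed the check.
failed = ~passed
# Cast first, because the corrected values are floats.
df["height"] = df["height"].astype(float)
df.loc[failed, "height"] = df.loc[failed, "height"] / 2 + df.loc[failed, "width"] / 2

print(n_passes, n_fails)
print(df)
```

The first row passes because |3 - 7| = 4, and the second fails because |3 - 8| = 5, so only the second row is corrected, to 8 / 2 + 3 / 2 = 5.5.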
Similar work (Python)
These work on pandas, but only do validation:
- Pandera - Like us, their checks are also vectorized.
- Pandantic - Combination of validation and parsing based on pydantic.
The following mostly offer validation only, and none of them seem to be vectorized or to support pandas directly:
- Great Expectations - An overengineered library for validation that has confusing documentation.
- contessa - Meant to be used against databases.
- validator
- python-valid8
- pyruler - Dead project that is rule-based.
- pyrules - Dead project that supports rule-based corrections (but no validation).
Similar work (R)
This project is inspired by https://github.com/data-cleaning/. Similar functionality can be found in the following R packages:
- validate - Checking data (implemented)
- dcmodify - Correcting data (implemented)
- errorlocate - Identifying and removing errors (a start has been made)
- deductive - Deductive correction based on checks (not yet implemented)