A generic dataframe validation library
Project description
checkedframe:
What is it?
checkedframe is a lightweight and flexible library for DataFrame validation built on top of narwhals. This means it has first-class support for both narhwals itself and all the engines that narwhals supports (primarily Pandas, Polars, cuDF, Modin, and PyArrow).
Why use checkedframe?
The key advantages of checkedframe are DataFrame agnosticism (validate Pandas, Polars, Modin, etc. with a single unified API), separation from the pydantic ecosystem (which is fantastic but not suited for columnar data and relies too heavily on brittle type annotation magic), and a flexible, intuitive API for user-defined functions. Below is a (subjective) comparison of checkedframe and several other popular DataFrame validation libraries. If there are any errors / you want your library to be added, please send a PR!
| checkedframe | pandera | patito | dataframely | great-expectations | pointblank | |
|---|---|---|---|---|---|---|
| DataFrame agnostic | ✅ | 🟡 (1.) | ❌ (polars-only) | ❌ (polars-only) | ❌ (pandas-only) | ✅ |
| Lightweight | ✅ | ❌ (pydantic) | ❌ (pydantic) | ✅ | ❌ | 🟡 |
| Custom checks | ✅ | 🟡 (2.) | ❌ | 🟡 (3.) | ❌ | 🟡 |
| Static typing | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Nested types | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Safe casting | ✅ | ✅ | ❌ | 🟡 (4.) | ❌ | ❌ |
| Filtering | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Supports older Python versions | ✅ (3.9+) | ✅ | ✅ | ❌ (3.11+) | ✅ | 🟡 (3.10+) |
| Battle-tested | ❌ (You can help!) | ✅ | 🟡 | 🟡 | ✅ | 🟡 |
- ✅ = Fully supported
- 🟡 = Partial/limited support
- ❌ = Not supported
- While pandera does support multiple libraries, it requires code changes to switch between them. Feature completeness also varies across different engines.
- This is quite subjective, but I find writing non-trivial checks (e.g. those requiring multiple columns, group-by, etc.) non-intuitive and difficult
- Checks must return an expr, which hampers checks such as a t-test between two columns
- Either all columns are cast or none are
Usage:
Installing
The easiest way is to install checkedframe is from PyPI using pip:
pip install checkedframe
Examples
import checkedframe as cf
import polars as pl
from checkedframe.polars import DataFrame
class AASchema(cf.Schema):
reason_code = cf.String()
reason_code_description = cf.String(nullable=True)
features = cf.List(cf.String)
shap = cf.Float64(cast=True)
rank = cf.UInt8(cast=True)
@cf.Check(columns="reason_code")
def check_reason_code_length(s: pl.Series) -> pl.Series:
"""Reason codes must be exactly 3 chars"""
return s.str.len_bytes() == 3
@cf.Check(columns="shap")
def check_shap_is_reasonable() -> pl.Expr:
"""Shap values must be reasonable"""
return pl.col("shap").lt(5).and_(pl.col("shap").gt(0.01))
@cf.Check
def check_row_height(df: pl.DataFrame) -> bool:
"""DataFrame must have 2 rows"""
return df.height == 2
_id_check = cf.Check.is_id("reason_code")
df = pl.DataFrame(
{
"reason_code": ["R23", "R23", "R9"],
"reason_code_description": ["Credit score too low", "Income too low", None],
"shap": [1, 2, 3],
"rank": [-1, 2, 1],
}
)
df: DataFrame[AASchema] = AASchema.validate(df)
checkedframe.exceptions.SchemaError: Found 5 error(s)
features: 1 error(s)
- Column marked as required but not found
rank: 1 error(s)
- Cannot safely cast Int64 to UInt8; invalid range [-1, 2], expected range [0, 255]
reason_code: 1 error(s)
- check_reason_code_length failed for 1 / 3 rows (33.33%): Reason codes must be exactly 3 chars
* is_id failed: reason_code must uniquely identify the DataFrame
* check_row_height failed: DataFrame must have 2 rows
Let's walk through the code step by step. We declare a schema (note that we inherit from cf.Schema) that represents a dataframe with 5 columns called reason_code, reason_code_description, features, shap, and rank. We declare the data type of each column, e.g. String, Float64, and so on. In addition, we declare certain properties about the columns. For example, we are OK with nulls in reason_code_description (by default, columns are not assumed to be nullable), so we set nullable=True. For shap and rank, we expect the specified data type but don't error if the column is not exactly that data type. Instead, since cast=True, we try to (safely) cast the column to the specified data type if possible.
Next, we use checks to assert different properties about our data. For example, we expect that all reason codes are exactly 3 characters long. Note the flexibility in how we perform checks. In the first example, we operate on the series. In the second example, we use expressions. In the third, we operate on the dataframe. In the fourth, we also operate on the dataframe but use a built-in check for convenience. All of these constructs are perfectly valid, with no need to switch between different decorators or remember complex arguments. In this example, the inputs and outputs of the checks are automatically determined from the type hints, but they can also be specified manually in case this fails.
@cf.Check(columns="reason_code", input_type="Series", return_type="Series")
def check_reason_code_length(s):
"""Reason codes must be exactly 3 chars"""
return s.str.len_bytes() == 3
Finally, when calling AASchema.validate on our bad data, we get a nice error message, including clear descriptions of why casting failed, why checks failed (and for what number of rows, if applicable), and so on.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file checkedframe-0.0.7.tar.gz.
File metadata
- Download URL: checkedframe-0.0.7.tar.gz
- Upload date:
- Size: 21.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e16d92a5fe76b0629f3082139f1508694fd7b697d215ad9f158ff577841b5aac
|
|
| MD5 |
d88e59392ea5d472d930e485bcd3c58e
|
|
| BLAKE2b-256 |
da8964887b615f53f3016736bfd7254c1a83207849c2f2b89d38a930606b8a58
|
File details
Details for the file checkedframe-0.0.7-py3-none-any.whl.
File metadata
- Download URL: checkedframe-0.0.7-py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e77c0ac2673e63118aca2b55f5d685fa27217ec2c0db386a56db1273bb89e5f0
|
|
| MD5 |
a97e81dc9d4da31b62a4f247277bbd06
|
|
| BLAKE2b-256 |
d71f6c7b5905c02e9beaf6efd82fccf83110f718201f33226160cca63866afe7
|