
A generic dataframe validation library


checkedframe:

License: MIT

Documentation

What is it?

checkedframe is a lightweight and flexible library for DataFrame validation built on top of narwhals. This means it has first-class support for both narwhals itself and all the engines that narwhals supports (primarily Pandas, Polars, cuDF, Modin, and PyArrow).

Why use checkedframe?

The key advantages of checkedframe are DataFrame agnosticism (validate Pandas, Polars, Modin, etc. with a single unified API), independence from the pydantic ecosystem (which is fantastic but not suited to columnar data and relies too heavily on brittle type annotation magic), and a flexible, intuitive API for user-defined functions. Below is a (subjective) comparison of checkedframe with several other popular DataFrame validation libraries. If you spot an error or want your library added, please send a PR!

| | checkedframe | pandera | patito | dataframely | great-expectations | pointblank |
|---|---|---|---|---|---|---|
| DataFrame agnostic | ✅ | 🟡 (1.) | ❌ (polars-only) | ❌ (polars-only) | ❌ (pandas < 2.2-only) | ✅ |
| Lightweight | ✅ | ❌ (pydantic) | ❌ (pydantic) | ✅ | ❌ | 🟡 |
| Custom checks | ✅ | 🟡 (2.) | ❌ | 🟡 (3.) | ❌ | 🟡 |
| Static typing | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| Nested types | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Safe casting | ✅ | ✅ | ❌ | 🟡 (4.) | ❌ | ❌ |
| Filtering | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Schema generation | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Union types | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ |
| Python version support | ✅ (3.9+) | 🟡 (<= 3.12) | ✅ | ❌ (3.11+) | ✅ | 🟡 (3.10+) |
| Battle-tested | ❌ (You can help!) | ✅ | 🟡 | 🟡 | ✅ | 🟡 |

  • ✅ = Fully supported
  • 🟡 = Partial/limited support
  • ❌ = Not supported
  1. While pandera does support multiple libraries, it requires code changes to switch between them. Feature completeness also varies across different engines.
  2. This is quite subjective, but I find writing non-trivial checks (e.g. those requiring multiple columns, group-by, etc.) unintuitive and difficult.
  3. Checks must return an expr, which hampers boolean checks, such as a t-test between two columns.
  4. Either all columns are cast or none are.

Usage:

Installing

The easiest way to install checkedframe is from PyPI using pip:

pip install checkedframe
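
To confirm the install from Python, one option is to look up the installed version with the standard library's importlib.metadata. This is only a minimal sketch; the helper name installed_version is made up for illustration and is not part of checkedframe.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("checkedframe")
    print(found if found is not None else "checkedframe is not installed")
```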

Examples

import checkedframe as cf
import polars as pl
from checkedframe.polars import DataFrame

class AASchema(cf.Schema):
    reason_code = cf.String()
    reason_code_description = cf.String(nullable=True)
    features = cf.List(cf.String)
    shap = cf.Float64(cast=True)
    rank = cf.UInt8(cast=True)

    @cf.Check(columns="reason_code")
    def check_reason_code_length(s: pl.Series) -> pl.Series:
        """Reason codes must be exactly 3 chars"""
        return s.str.len_bytes() == 3

    @cf.Check(columns="shap")
    def check_shap_is_reasonable() -> pl.Expr:
        """Shap values must be reasonable"""
        return pl.col("shap").lt(5).and_(pl.col("shap").gt(0.01))

    @cf.Check
    def check_row_height(df: pl.DataFrame) -> bool:
        """DataFrame must have 2 rows"""
        return df.height == 2

    _id_check = cf.Check.is_id("reason_code")


df = pl.DataFrame(
    {
        "reason_code": ["R23", "R23", "R9"],
        "reason_code_description": ["Credit score too low", "Income too low", None],
        "shap": [1, 2, 3],
        "rank": [-1, 2, 1],
    }
)

df: DataFrame[AASchema] = AASchema.validate(df)

checkedframe.exceptions.SchemaError: Found 5 error(s)
  reason_code: 1 error(s)
    - check_reason_code_length failed for 1 / 3 (33.33%) rows: Reason codes must be exactly 3 chars
  features: 1 error(s)
    - Column marked as required but not found
  rank: 1 error(s)
    - Cannot safely cast Int64 to UInt8; 1 / 3 (33.33%) rows outside of expected range [0, 255]
  * check_row_height failed for 3 / 3 (100.00%) rows: DataFrame must have 2 rows
  * is_id failed for 3 / 3 (100.00%) rows: reason_code must uniquely identify the DataFrame

Let's walk through the code step by step. We declare a schema (note that we inherit from cf.Schema) that represents a dataframe with 5 columns called reason_code, reason_code_description, features, shap, and rank. We declare the data type of each column, e.g. String, Float64, and so on. In addition, we declare certain properties about the columns. For example, we are OK with nulls in reason_code_description (by default, columns are not assumed to be nullable), so we set nullable=True. For shap and rank, we expect the specified data type but don't error if the column is not exactly that data type. Instead, since cast=True, we try to (safely) cast the column to the specified data type if possible.
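
To make the idea of a "safe" cast concrete, here is a rough, library-free sketch of the range check behind the rank error in the output above. It is not checkedframe's implementation: the function name safe_cast_report and the UINT8_RANGE constant are hypothetical, and the only thing taken from the source is the wording of the error message.

```python
from typing import Iterable, Tuple

# Representable range of an unsigned 8-bit integer.
UINT8_RANGE: Tuple[int, int] = (0, 255)


def safe_cast_report(values: Iterable[int], bounds: Tuple[int, int] = UINT8_RANGE) -> str:
    """Return "ok" if every value fits in `bounds`, else a summary of the offenders."""
    lo, hi = bounds
    items = list(values)
    bad = sum(1 for v in items if not lo <= v <= hi)
    if bad == 0:
        return "ok"
    return (
        f"{bad} / {len(items)} ({bad / len(items):.2%}) rows "
        f"outside of expected range [{lo}, {hi}]"
    )


if __name__ == "__main__":
    # Same values as the "rank" column in the example: -1 cannot be a UInt8.
    print(safe_cast_report([-1, 2, 1]))
```

The point of the sketch is that a safe cast refuses to silently wrap or truncate: the column is converted only when every value is representable in the target type.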

Next, we use checks to assert different properties about our data. For example, we expect that all reason codes are exactly 3 characters long. Note the flexibility in how we perform checks. In the first example, we operate on the series. In the second example, we use expressions. In the third, we operate on the dataframe. In the fourth, we also operate on the dataframe but use a built-in check for convenience. All of these constructs are perfectly valid, with no need to switch between different decorators or remember complex arguments. In this example, the inputs and outputs of the checks are automatically determined from the type hints, but they can also be specified manually in case this fails.

  @cf.Check(columns="reason_code", input_type="Series", return_type="Series")
  def check_reason_code_length(s):
      """Reason codes must be exactly 3 chars"""
      return s.str.len_bytes() == 3
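
For intuition, inferring a check's input from its type hints could look roughly like the sketch below. This is an assumption about the general technique, not checkedframe's actual code: infer_input_type and the stand-in Series, DataFrame, and Expr classes are hypothetical and exist only so the example runs without a dataframe library.

```python
import inspect
from typing import Callable, Optional


# Stand-in classes so the sketch runs without any dataframe library installed.
class Series: ...
class DataFrame: ...
class Expr: ...


def infer_input_type(func: Callable) -> Optional[str]:
    """Guess what a check takes as input from its first parameter's type hint.

    Returns "Series", "DataFrame", "no input" for zero-argument checks,
    or None when the hint is missing or unrecognized.
    """
    params = list(inspect.signature(func).parameters.values())
    if not params:
        return "no input"
    annotation = params[0].annotation
    if annotation is inspect.Parameter.empty:
        return None
    # Handle both real classes and string annotations such as "pl.Series".
    name = getattr(annotation, "__name__", str(annotation)).split(".")[-1]
    return name if name in {"Series", "DataFrame"} else None


def by_series(s: Series) -> Series: ...
def by_frame(df: DataFrame) -> bool: ...
def by_expr() -> Expr: ...
def untyped(s): ...
```

A function like untyped gives the inference nothing to work with, which is the situation where spelling out input_type and return_type by hand is useful.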

Finally, when calling AASchema.validate on our bad data, we get a nice error message, including clear descriptions of why casting failed, why checks failed (and for what number of rows, if applicable), and so on.

For more advanced usage, please see the documentation.

Mypy Plugin

The example code as-is will actually produce some type errors, because type checkers complain that the user-defined checks do not take a "self" parameter. There is currently no way to mark a function as a static method other than the staticmethod decorator, so you can simply add this decorator to make the errors go away. If that's annoying, checkedframe also provides a mypy plugin that marks all methods decorated with cf.Check as staticmethods. Just add

[tool.mypy]
plugins = ["checkedframe.mypy"]

to your pyproject.toml. Unfortunately, no other type checker provides plugin capabilities.

Typing

checkedframe is also meant to integrate with static typing. When validation is successful, the returned dataframe can be parametrized by the schema. For example,

import checkedframe as cf
import polars as pl
from checkedframe.polars import DataFrame


class MySchema(cf.Schema):
    x = cf.String()


df = pl.DataFrame({
    "x": ["a", "b", "c"]
})

def func_that_requires_cleaned_data(df: DataFrame[MySchema]): ...

func_that_requires_cleaned_data(df)  # type error

validated_df: DataFrame[MySchema] = MySchema.validate(df)
func_that_requires_cleaned_data(validated_df)  # passes!
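
The underlying pattern can be illustrated without any dataframe library: a generic wrapper type that can only be obtained by going through validate. The sketch below is a toy under stated assumptions, not how checkedframe is implemented; Validated, ToySchema, and needs_clean_rows are hypothetical names, and plain lists of dicts stand in for dataframes.

```python
from typing import Dict, Generic, List, TypeVar

SchemaT = TypeVar("SchemaT")
Rows = List[Dict[str, object]]


class Validated(Generic[SchemaT]):
    """Hypothetical wrapper: rows that have passed validation against SchemaT."""

    def __init__(self, rows: Rows) -> None:
        self.rows = rows


class ToySchema:
    """Hypothetical stand-in schema: every row needs a string column "x"."""

    @staticmethod
    def validate(rows: Rows) -> "Validated[ToySchema]":
        for i, row in enumerate(rows):
            if not isinstance(row.get("x"), str):
                raise ValueError(f"row {i}: column 'x' must be a string")
        return Validated(rows)


def needs_clean_rows(data: Validated[ToySchema]) -> int:
    """Accepts only data that a type checker can see went through ToySchema.validate."""
    return len(data.rows)


if __name__ == "__main__":
    raw: Rows = [{"x": "a"}, {"x": "b"}, {"x": "c"}]
    # needs_clean_rows(raw) would be flagged by a type checker.
    print(needs_clean_rows(ToySchema.validate(raw)))
```

The benefit is that "this data has been validated" becomes part of the function signature rather than a comment or convention.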

Roadmap:

  1. Better static typing. MySchema.validate should automatically return a DataFrame of your input type parametrized by MySchema.
