polars_validate: Polars DataFrame validation using type hints

Simple DataFrame validation, based on type hints.

from typing import Annotated

import polars as pl

from polars_validate import (
    THIS,
    ContainsPattern,
    FrameValidators,
    IntegerType,
    IsBetween,
    IsIn,
    IsNotNull,
    UniqueTogether,
    validate,
)

tips = pl.DataFrame(
    {
        "restaurant": [1, 1, 1, 2],
        "table": [1, 2, 3, 1],
        "bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.5, None],
        "sex": ["Female", "Male", None, "Male"],
        "smoker": [False, True, True, False],
        "time": ["45 min", "30 mins", "60 min", "50 min"],
    }
)


class TipsSchema:
    restaurant: Annotated[IntegerType, IsNotNull()]
    table: Annotated[IntegerType, IsNotNull()]
    bill: Annotated[pl.Float64, IsNotNull(), IsBetween(0.0, 50.0, closed="right")]
    tip: pl.Float64
    sex: Annotated[pl.String, IsNotNull(), IsIn(("Female", "Male"))]
    smoker: Annotated[pl.Boolean, IsNotNull()]
    time: Annotated[
        pl.String,
        ContainsPattern("^\\d+ min$"),
        THIS.str.strip_suffix(" min").cast(pl.Int64, strict=False) < 120,
    ]

    dataframe: FrameValidators = (
        UniqueTogether(("restaurant", "table")),
        pl.col("bill") > pl.col("tip"),
    )


validate(TipsSchema, tips, eager=False)
#> polars_validate.base.ValidationError: Validation failed with 2 errors (16 passed):
#> ❌ 'sex': 'not null' check failed at offsets: [2]
#> ❌ 'time': 'pattern ^\d+ min$' check failed at offsets: [1]

Installation

pip install git+https://github.com/chris-mcdo/polars-validate

Features

Validate using built-in validation types, polars expressions, or arbitrary functions.

Built-in validators:

  • IsNotNull: check for missing values
  • IsIn: check for set membership
  • IsBetween: check values lie within an interval
  • ContainsPattern: check a string contains / matches a regex pattern
  • TypeValidator: check type
  • UniqueTogether: check some columns uniquely identify rows

For inspiration, a few examples of how polars expressions can be used for validation:

# THIS represents the current Series / column.
import polars as pl

from polars_validate import THIS

# series-based validation
is_even = (THIS % 2) == 0
is_in_title_case = THIS.str.to_titlecase() == THIS
starts_with_foo = THIS.str.starts_with("foo")
is_close_to_mean = (THIS - THIS.mean()).abs() < 5.0
is_unique = THIS.is_unique()
is_short_string = THIS.str.len_chars() < 10

# dataframe-based validation
bounded = pl.col("col_a").is_between("col_b", "col_c")
at_least_one = pl.any_horizontal("a", "b", "c")

Arbitrary custom validation is also supported:

def is_valid_index(s: pl.Series) -> bool:
    return s.is_sorted() and s[0] == 1


def smokers_tip_more(d: pl.DataFrame):
    ...  # arbitrary logic


class TipsSchema:
    restaurant: Annotated[IntegerType, IsNotNull(), SeriesCallableValidator(is_valid_index, "valid index")]
    # ...

    dataframe: FrameValidators = (
        # ...
        CallableValidator(smokers_tip_more, "smokers tip more"),
    )

User Guide

Define validation for individual Series (or DataFrame columns) using type annotations as shown above.

For example, for a Series:

# simple schema - just validate type
SimpleSeriesSchema = pl.Float32

# add more complex validation using type metadata
StrictSeriesSchema = Annotated[pl.Float32, IsNotNull(), THIS.sqrt().round().mod(7).eq(0), ...]

validate_series(StrictSeriesSchema, my_series)
#> ...

To validate DataFrames, combine Series type annotations in a class as shown above. To add validation which applies to the whole dataframe, add fields with the FrameValidators annotation.

Internally, type annotations and metadata are translated into a sequence of Validator objects. You can also use these objects directly.
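As a rough illustration of how metadata attached with Annotated can be read back from a class, the sketch below uses only the standard library. It is not the polars_validate implementation: `NotNull`, `Schema`, and `collect_validators` are hypothetical names invented for this example.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class NotNull:
    """Hypothetical marker standing in for a validator such as IsNotNull."""

    def __repr__(self) -> str:
        return "NotNull()"


class Schema:
    id: Annotated[int, NotNull()]
    name: str


def collect_validators(schema: type) -> dict[str, tuple[type, list]]:
    """Map each field to its base type and the metadata attached via Annotated."""
    out = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it
    for field, hint in get_type_hints(schema, include_extras=True).items():
        if get_origin(hint) is Annotated:
            base, *meta = get_args(hint)
        else:
            base, meta = hint, []
        out[field] = (base, meta)
    return out


fields = collect_validators(Schema)
print(fields)
```

A library built this way would then turn each base type into a type check and each metadata item into a further check, in the order the annotations were written.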

License

Licensed under the MIT License.
