Validate Polars DataFrames using type hints

polars_validate: validate Polars DataFrames using type hints

A simple but powerful DataFrame validation library, based on type hints and Polars expressions.

from typing import Annotated

import polars as pl

from polars_validate import (
    THIS,
    ContainsPattern,
    FrameValidators,
    IntegerType,
    IsBetween,
    IsIn,
    IsNotNull,
    UniqueTogether,
    validate,
)

tips = pl.DataFrame(
    {
        "restaurant": [1, 1, 1, 2],
        "table": [1, 2, 3, 1],
        "bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.5, None],
        "sex": ["Female", "Male", None, "Male"],
        "smoker": [False, True, True, False],
        "time": ["45 min", "30 mins", "60 min", "50 min"],
    }
)


class TipsSchema:
    restaurant: Annotated[IntegerType, IsNotNull()]
    table: Annotated[IntegerType, IsNotNull()]
    bill: Annotated[pl.Float64, IsNotNull(), IsBetween(0.0, 50.0, closed="right")]
    tip: pl.Float64
    sex: Annotated[pl.String, IsNotNull(), IsIn(("Female", "Male"))]
    smoker: Annotated[pl.Boolean, IsNotNull()]
    time: Annotated[
        pl.String,
        ContainsPattern("^\\d+ min$"),
        THIS.str.strip_suffix(" min").cast(pl.Int64, strict=False) < 120,
    ]

    dataframe: FrameValidators = (
        UniqueTogether(("restaurant", "table")),
        pl.col("bill") > pl.col("tip"),
    )


validate(TipsSchema, tips, eager=False)
#> polars_validate.base.ValidationError: Validation failed with 2 errors (16 passed):
#> ❌ 'sex': 'not null' check failed at offsets: [2]
#> ❌ 'time': 'pattern ^\d+ min$' check failed at offsets: [1]

Installation

pip install polars-validate

Features

Validate using built-in validation types, polars expressions, or arbitrary functions.

Built-in validators:

  • IsNotNull: check for missing values
  • IsIn: check for set membership
  • IsBetween: check values lie within an interval
  • ContainsPattern: check a string contains / matches a regex pattern
  • TypeValidator: check type
  • UniqueTogether: check some columns uniquely identify rows

For inspiration, a few examples of how polars expressions can be used for validation:

import polars as pl

from polars_validate import THIS
# THIS represents the current Series / column.

# series-based validation
is_even = (THIS % 2) == 0
is_in_title_case = THIS.str.to_titlecase() == THIS
starts_with_foo = THIS.str.starts_with("foo")
is_close_to_mean = (THIS - THIS.mean()).abs() < 5.0
is_unique = THIS.is_unique()
is_short_string = THIS.str.len_chars() < 10

# dataframe-based validation
bounded = pl.col("col_a").is_between("col_b", "col_c")
at_least_one = pl.any_horizontal("a", "b", "c")

Arbitrary custom validation is also supported:

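# Assumed to be top-level exports, like the names imported in the first example.
from polars_validate import CallableValidator, SeriesCallableValidator


def is_valid_index(s: pl.Series) -> bool:
    # Valid when the values are sorted and the first value is 1.
    return s.is_sorted() and s[0] == 1


def smokers_tip_more(d: pl.DataFrame):
    # arbitrary logic ...
    ...


class TipsSchema:
    restaurant: Annotated[
        IntegerType,
        IsNotNull(),
        SeriesCallableValidator(is_valid_index, "valid index"),
    ]
    # ...

    dataframe: FrameValidators = (
        # ...
        CallableValidator(smokers_tip_more, "smokers tip more"),
    )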

User Guide

Define validation for individual Series (or DataFrame columns) using type annotations as shown above.

For example, for a Series:

# simple schema - just validate type
SimpleSeriesSchema = pl.Float32

# add more complex validation using type metadata
StrictSeriesSchema = Annotated[pl.Float32, IsNotNull(), THIS.sqrt().round().mod(7).eq(0), ...]

validate_series(StrictSeriesSchema, my_series)
#> ...

To validate DataFrames, combine Series type annotations in a class as shown above. To add validation that applies to the whole DataFrame, add fields with the FrameValidators annotation.

Internally, type annotations and their metadata are translated into a sequence of Validator objects. You can also use these objects directly.
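
For readers curious how annotation metadata can be turned into a list of checks, the standard library already exposes everything needed. The following is a minimal sketch using only `typing`; `NotNull`, `Schema`, and `collect_metadata` are hypothetical names for illustration and say nothing about how polars_validate is actually implemented.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class NotNull:
    """Hypothetical stand-in for a validator object."""


class Schema:
    a: Annotated[int, NotNull()]
    b: float


def collect_metadata(schema: type) -> dict[str, tuple]:
    """Map each field name to (base type, tuple of metadata objects)."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(schema, include_extras=True)
    collected = {}
    for name, hint in hints.items():
        if get_origin(hint) is Annotated:
            base, *metadata = get_args(hint)
            collected[name] = (base, tuple(metadata))
        else:
            # A bare type carries no extra metadata.
            collected[name] = (hint, ())
    return collected


fields = collect_metadata(Schema)
print(fields["b"])  # (<class 'float'>, ())
```

A library built this way could then map the base type to a type check and each metadata object to one further validator.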

License

Licensed under the MIT License.

