A dataframe modelling library built on top of polars and pydantic.

Patito

Patito combines pydantic and polars in order to write modern, type-annotated data frame logic.

Patito offers a simple way to declare pydantic data models which double as schemas for your polars data frames. These schemas can be used for:

👮 Simple and performant data frame validation.
🧪 Easy generation of valid mock data frames for tests.
🐍 Retrieve and represent singular rows in an object-oriented manner.
🧠 Provide a single source of truth for the core data models in your code base.
🦆 Integration with DuckDB for running flexible SQL queries.

Patito has first-class support for polars, a "blazingly fast DataFrames library written in Rust".

Installation

pip install patito

DuckDB Integration

Patito can also integrate with DuckDB. In order to enable this integration you must explicitly specify it during installation:

pip install 'patito[duckdb]'

Documentation

The full documentation of Patito can be found here.

👮 Data validation

Patito allows you to specify the type of each column in your dataframe by creating a type-annotated subclass of patito.Model:

# models.py
from typing import Literal, Optional

import patito as pt


class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    temperature_zone: Literal["dry", "cold", "frozen"]
    is_for_sale: bool

The class Product represents the schema of the data frame, while instances of Product represent single rows of the dataframe. Patito can efficiently validate the content of arbitrary data frames and provide human-readable error messages:

import polars as pl

df = pl.DataFrame(
    {
        "product_id": [1, 1, 3],
        "temperature_zone": ["dry", "dry", "oven"],
    }
)
try:
    Product.validate(df)
except pt.ValidationError as exc:
    print(exc)
# 3 validation errors for Product
# is_for_sale
#   Missing column (type=type_error.missingcolumns)
# product_id
#   2 rows with duplicated values. (type=value_error.rowvalue)
# temperature_zone
#   Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
Summary of dataframe-compatible type annotations:
  • Regular python data types such as int, float, bool, str, date, which are validated against compatible polars data types.
  • Wrapping your type with typing.Optional indicates that the given column accepts missing values.
  • Model fields annotated with typing.Literal[...] are checked to only contain values from the given restricted set, stored either as the native dtype (e.g. pl.Utf8) or as pl.Categorical.

Additionally, you can assign patito.Field to your class variables in order to specify additional checks:

  • Field(dtype=...) ensures that a specific dtype is used in those cases where several data types are compliant with the annotated python type, for example product_id: int = Field(dtype=pl.UInt32).
  • Field(unique=True) checks if every row has a unique value.
  • Field(gt=..., ge=..., le=..., lt=...) allows you to specify bound checks for any combination of > (gt), >= (ge), <= (le), and < (lt).
  • Field(multiple_of=divisor) checks that a given column only contains multiples of the given value.
  • Field(default=default_value) indicates that the given column is required and must take the given default value.
  • String fields annotated with Field(pattern=r"<regex-pattern>"), Field(max_length=bound), and/or Field(min_length=bound) will be validated with polars' efficient string processing capabilities.
  • Custom constraints can be specified with Field(constraints=...), either as a single polars expression or a list of expressions. All the rows of the dataframe must satisfy the given constraint(s) in order to be considered valid. Example: even_field: int = pt.Field(constraints=pl.col("even_field") % 2 == 0).

Although Patito supports pandas, it is highly recommended to use it in combination with polars. For a much more feature-complete, pandas-first library, take a look at pandera.

🧪 Synthesize valid test data

Patito encourages you to strictly validate dataframe inputs, thus ensuring correctness at runtime. But with forced correctness comes friction, especially during testing. Take the following function as an example:

import polars as pl

def num_products_for_sale(products: pl.DataFrame) -> int:
    Product.validate(products)
    return products.filter(pl.col("is_for_sale")).height

The following test would fail with a patito.ValidationError:

def test_num_products_for_sale():
    products = pl.DataFrame({"is_for_sale": [True, True, False]})
    assert num_products_for_sale(products) == 2

In order to make the test pass we would have to add valid dummy data for the temperature_zone and product_id columns. This will quickly introduce a lot of boilerplate to all tests involving data frames, obscuring what is actually being tested in each test. For this reason Patito provides the examples constructor for generating test data that is fully compliant with the given model schema.

Product.examples({"is_for_sale": [True, True, False]})
# shape: (3, 3)
# ┌─────────────┬──────────────────┬────────────┐
# │ is_for_sale ┆ temperature_zone ┆ product_id │
# │ ---         ┆ ---              ┆ ---        │
# │ bool        ┆ str              ┆ i64        │
# ╞═════════════╪══════════════════╪════════════╡
# │ true        ┆ dry              ┆ 0          │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ true        ┆ dry              ┆ 1          │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ false       ┆ dry              ┆ 2          │
# └─────────────┴──────────────────┴────────────┘

The examples() method accepts the same arguments as a regular data frame constructor, the main difference being that it fills in valid dummy data for any unspecified columns. The test can therefore be rewritten as:

def test_num_products_for_sale():
    products = Product.examples({"is_for_sale": [True, True, False]})
    assert num_products_for_sale(products) == 2

๐Ÿ–ผ๏ธ A model-aware data frame class

Patito offers patito.DataFrame, a class that extends polars.DataFrame in order to provide utility methods related to patito.Model. The schema of a data frame can be specified at runtime by invoking patito.DataFrame.set_model(model), after which a set of contextualized methods become available:

  • DataFrame.validate() - Validate the given data frame and return itself.
  • DataFrame.drop() - Drop all superfluous columns not specified as fields in the model.
  • DataFrame.cast() - Cast any columns which are not compatible with the given type annotations. When Field(dtype=...) is specified, the given dtype will always be forced, even in compatible cases.
  • DataFrame.get(predicate) - Retrieve a single row from the data frame as an instance of the model. An exception is raised if not exactly one row is yielded from the filter predicate.
  • DataFrame.fill_null(strategy="defaults") - Fill in missing values according to the default values set on the model schema.
  • DataFrame.derive() - A model field annotated with Field(derived_from=...) indicates that a column should be defined by some arbitrary polars expression. If derived_from is specified as a string, then the given value will be interpreted as a column name with polars.col(). These columns are created and populated with data according to the derived_from expressions when you invoke DataFrame.derive().

These methods are best illustrated with an example:

from typing import Literal

import patito as pt
import polars as pl


class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    # Specify a specific dtype to be used
    popularity_rank: int = pt.Field(dtype=pl.UInt16)
    # Field with default value "for-sale"
    status: Literal["draft", "for-sale", "discontinued"] = "for-sale"
    # The eurocent cost is extracted from the Euro cost string "€X.Y EUR"
    eurocent_cost: int = pt.Field(
        derived_from=100 * pl.col("cost").str.extract(r"€(\d+\.+\d+)").cast(float).round(2)
    )


products = pt.DataFrame(
    {
        "product_id": [1, 2],
        "popularity_rank": [2, 1],
        "status": [None, "discontinued"],
        "cost": ["€2.30 EUR", "€1.19 EUR"],
    }
)
product = (
    products
    # Specify the schema of the given data frame
    .set_model(Product)
    # Derive the `eurocent_cost` int column from the `cost` string column using a regular expression
    .derive()
    # Drop the `cost` column as it is not part of the model
    .drop()
    # Cast the popularity rank column to an unsigned 16-bit integer and cents to an integer
    .cast()
    # Fill missing values with the default values specified in the schema
    .fill_null(strategy="defaults")
    # Assert that the data frame now complies with the schema
    .validate()
    # Retrieve a single row and cast it to the model class
    .get(pl.col("product_id") == 1)
)
print(repr(product))
# Product(product_id=1, popularity_rank=2, status='for-sale', eurocent_cost=230)

Every Patito model automatically gets a .DataFrame attribute, a custom data frame subclass where .set_model() is invoked at instantiation. In other words, pt.DataFrame(...).set_model(Product) is equivalent to Product.DataFrame(...).

๐Ÿ Representing rows as classes

Data frames are tailor-made for performing vectorized operations over a set of objects. But when the time comes to retrieve a single row and operate on it, the data frame construct naturally falls short. Patito allows you to embed row-level logic in methods defined on the model.

# models.py
import patito as pt

class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str

    @property
    def url(self) -> str:
        return (
            "https://example.com/no/products/"
            f"{self.product_id}-"
            f"{self.name.lower().replace(' ', '-')}"
        )

The class can be instantiated from a single row of a data frame by using the from_row() method:

import polars as pl

products = pl.DataFrame(
    {
        "product_id": [1, 2],
        "name": ["Skimmed milk", "Eggs"],
    }
)
milk_row = products.filter(pl.col("product_id") == 1)
milk = Product.from_row(milk_row)
print(milk.url)
# https://example.com/no/products/1-skimmed-milk

If you "connect" the Product model with the DataFrame by the use of patito.DataFrame.set_model(), or alternatively by using Product.DataFrame directly, you can use the .get() method in order to filter the data frame down to a single row and cast it to the respective model class:

products = Product.DataFrame(
    {
        "product_id": [1, 2],
        "name": ["Skimmed milk", "Eggs"],
    }
)
milk = products.get(pl.col("product_id") == 1)
print(milk.url)
# https://example.com/no/products/1-skimmed-milk
