Typed Wrappers over Pandas and Polars DataFrames with schema validation

typedframe

Typed wrappers over pandas DataFrames with schema validation.

TypedDataFrame is a lightweight wrapper over pandas DataFrame that provides runtime schema validation and can be used to establish strong data contracts between interfaces in your Python code.

The goal of the library is to reveal and make explicit all unclear or forgotten assumptions about your DataFrame.

Check the Official Documentation.

Quickstart

Install the typedframe library:

pip install typedframe

Consider a deliberately simplified preprocessing function like this:

import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    df['col1'] = 0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return df

To add typedframe schema support to this transformation, we define two schema classes, one for the input and one for the output:

import numpy as np
from typedframe import TypedDataFrame, DATE_TIME_DTYPE

class MyRawData(TypedDataFrame):
    schema = {
        'col1': np.float64,
        'date': DATE_TIME_DTYPE,
        'comment': str,
    }


class PreprocessedData(MyRawData):
    schema = {
        'month': np.int8
    }

Then let's modify the preprocess function to take a typed wrapper MyRawData as input and return PreprocessedData:

def preprocess(data: MyRawData) -> PreprocessedData:
    df = data.df.copy()
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    df['col1'] = 0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return PreprocessedData.convert(df)

As you can see, the underlying DataFrame can be accessed via the .df attribute of the TypedDataFrame.

Now clients of the preprocess function can see what its inputs and outputs are without looking at its internals. And if the data changes in some unforeseen way, an exception is raised before the function itself is invoked.

Let's check:

import pandas as pd

df = pd.DataFrame({
  'col1': [0.1, 0.2],
  'date': ['2021-01-01', '2022-01-01'],
  'comment': ['foo', 'bar']
})
df.date = pd.to_datetime(df.date)

bad_df = pd.DataFrame({
  'col1': [1, 2],
  'comment': ['foo', 'bar']
})

df2 = preprocess(MyRawData(df))
df3 = preprocess(MyRawData(bad_df))

The first call succeeds. The second call, which passes a DataFrame that does not match the schema, fails with the following error:

AssertionError: Dataframe doesn't match schema
Actual: {'col1': dtype('int64'), 'comment': dtype('O')}
Expected: {'col1': <class 'numpy.float64'>, 'date': dtype('<M8[ns]'), 'comment': <class 'object'>}
Difference: {('col1', <class 'numpy.float64'>), ('date', dtype('<M8[ns]'))}
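
To make the error message above easier to read, here is a minimal sketch of how such a schema difference could be computed with plain pandas and numpy. It is not typedframe's implementation: the helper name schema_difference is hypothetical, and the dtype=object on the comment column is an assumption made so that the result does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd


# Hypothetical helper, NOT part of typedframe: return the expected schema
# entries that are missing from the DataFrame or present with another dtype.
def schema_difference(df: pd.DataFrame, expected: dict) -> dict:
    actual = {col: df[col].dtype for col in df.columns}
    return {
        col: dtype
        for col, dtype in expected.items()
        if col not in actual or actual[col] != np.dtype(dtype)
    }


# Expected schema, mirroring the "Expected" line of the error message.
expected = {
    'col1': np.float64,
    'date': np.dtype('datetime64[ns]'),
    'comment': object,
}

# Same shape as bad_df above: integer col1 and no date column.
bad_df = pd.DataFrame({
    'col1': pd.Series([1, 2], dtype='int64'),
    'comment': pd.Series(['foo', 'bar'], dtype=object),
})

difference = schema_difference(bad_df, expected)
print(sorted(difference))  # ['col1', 'date']
```

The two reported columns correspond to the two entries in the Difference line: col1 is present but holds int64 instead of float64, and date is missing altogether.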

Supported versions

Tested on the following versions:

Python: 3.9

numpy: 1.20, 1.21, 1.22

pandas: 1.2, 1.3, 1.4

Manually test in your environment

git clone git@github.com:areshytko/typedframe.git
cd typedframe
pip install -r requirements.txt
pytest

Releases

v0.7.0

New Functionality

  • NaNs in categoricals are not allowed and cause an assertion. Motivation: Explicit use of pd.Categorical(df.col, categories=[MyTypedFrame.schema['col']]) conversion can introduce such NaNs and bypass the type check. See the pd.Categorical documentation.
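
The motivation for the v0.7.0 check can be reproduced with pandas alone. The sketch below uses made-up values and category names and does not call typedframe. It only shows that converting with an explicit categories list turns every value outside that list into a missing entry (code -1). Some pandas versions may emit a warning for this conversion.

```python
import pandas as pd

# Illustrative data: 'c' is not one of the allowed categories.
values = pd.Series(['a', 'b', 'c'])
allowed = ['a', 'b']

# Values outside `allowed` are replaced by NaN rather than raising.
cat = pd.Categorical(values, categories=allowed)

# Missing entries are stored with the code -1.
n_missing = int((cat.codes == -1).sum())
print(n_missing)  # 1
```

Because the resulting column still has a categorical dtype with the expected categories, a dtype-only comparison would accept it, which is why NaNs are checked separately.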

v0.6.1

New Functionality

  • updated docstrings

Breaking changes
