Typed Wrappers over Pandas and Polars DataFrames with schema validation
typedframe
Typed wrappers over pandas DataFrames with schema validation.
TypedDataFrame is a lightweight wrapper over pandas DataFrame that provides runtime schema validation and can be used to establish strong data contracts between interfaces in your Python code.
The goal of the library is to reveal and make explicit all unclear or forgotten assumptions about your DataFrame.
Check the Official Documentation.
Quickstart
Install the typedframe library:
pip install typedframe
Consider an overly simplified preprocessing function like this:
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Min-max normalize col1; fall back to 0.0 when all values are equal
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    df['col1'] = 0.0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return df
To add typedframe schema support for this transformation, we define two schema classes, one for the input and one for the output:
import numpy as np
from typedframe import TypedDataFrame, DATE_TIME_DTYPE

class MyRawData(TypedDataFrame):
    schema = {
        'col1': np.float64,
        'date': DATE_TIME_DTYPE,
        'comment': str,
    }

class PreprocessedData(MyRawData):
    schema = {
        'month': np.int8
    }
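Because PreprocessedData subclasses MyRawData and declares only the new month column, the child schema appears to extend the parent's schema rather than replace it. The following is a minimal pure-Python sketch of how such merging could work. SchemaBase and full_schema are hypothetical names used for illustration only; they are not part of typedframe's API, and the library's real mechanism may differ.

```python
class SchemaBase:
    """Hypothetical base class; NOT typedframe's implementation."""
    schema = {}

    @classmethod
    def full_schema(cls) -> dict:
        # Walk the class hierarchy from the most generic base to the most
        # specific subclass, so that subclasses add to (or override)
        # the columns declared by their parents.
        merged = {}
        for klass in reversed(cls.__mro__):
            merged.update(vars(klass).get('schema', {}))
        return merged


class Raw(SchemaBase):
    schema = {'col1': float, 'comment': str}


class Preprocessed(Raw):
    schema = {'month': int}


raw_columns = Raw.full_schema()
preprocessed_columns = Preprocessed.full_schema()
```

In this sketch, Preprocessed.full_schema() contains the two columns declared on Raw plus month, while Raw.full_schema() is unaffected by the subclass.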
Then let's modify the preprocess function to take a typed wrapper MyRawData as input and return PreprocessedData:
def preprocess(data: MyRawData) -> PreprocessedData:
    df = data.df.copy()
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    # Use 0.0 (not 0) so col1 stays float64, as the schema requires
    df['col1'] = 0.0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return PreprocessedData.convert(df)
As you can see, the actual DataFrame can be accessed via the .df attribute of the TypedDataFrame.
Now clients of the preprocess function can easily see what its inputs and outputs are without looking at its internals.
And if there are unforeseen changes in the data, an exception is raised before the function itself is invoked.
Let's check:
import pandas as pd

df = pd.DataFrame({
    'col1': [0.1, 0.2],
    'date': ['2021-01-01', '2022-01-01'],
    'comment': ['foo', 'bar']
})
df.date = pd.to_datetime(df.date)

# 'col1' holds integers and the 'date' column is missing
bad_df = pd.DataFrame({
    'col1': [1, 2],
    'comment': ['foo', 'bar']
})

df2 = preprocess(MyRawData(df))
df3 = preprocess(MyRawData(bad_df))  # raises AssertionError
The first call succeeds. But when we try to pass the wrong DataFrame as input, we get the following error:
AssertionError: Dataframe doesn't match schema
Actual: {'col1': dtype('int64'), 'comment': dtype('O')}
Expected: {'col1': <class 'numpy.float64'>, 'date': dtype('<M8[ns]'), 'comment': <class 'object'>}
Difference: {('col1', <class 'numpy.float64'>), ('date', dtype('<M8[ns]'))}
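The Difference line above lists the (column, dtype) pairs that the schema expects but the DataFrame does not provide. A rough sketch of such a comparison, using only pandas and NumPy, is shown here. schema_diff is a hypothetical helper, and mapping str to the object dtype is an assumption based on the Expected line above; this is not typedframe's actual implementation.

```python
import numpy as np
import pandas as pd


def schema_diff(df: pd.DataFrame, schema: dict) -> set:
    """Return the (column, dtype) pairs required by `schema` but absent from `df`."""
    # Assumption: `str` in a schema stands for the object dtype.
    expected = {col: np.dtype(object if t is str else t) for col, t in schema.items()}
    actual = dict(df.dtypes.items())
    # A pair is missing if the column is absent or has a different dtype.
    return set(expected.items()) - set(actual.items())


schema = {'col1': np.float64, 'month': np.int8}

good_df = pd.DataFrame({
    'col1': [0.1, 0.2],
    'month': np.array([1, 1], dtype=np.int8),
})
bad_df = pd.DataFrame({'col1': [1, 2]})  # int64 'col1', no 'month' column

missing_good = schema_diff(good_df, schema)  # empty: every pair is present
missing_bad = schema_diff(bad_df, schema)    # both expected pairs are reported
```

Note that this sketch only reports what is missing; extra columns in the DataFrame that the schema does not mention are ignored.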
Supported versions
Tested on the following versions:
Python: 3.9
numpy: 1.20, 1.21, 1.22
pandas: 1.2, 1.3, 1.4
Manually test in your environment
git clone git@github.com:areshytko/typedframe.git
cd typedframe
pip install -r requirements.txt
pytest
Releases
v0.7.0
New Functionality
- NaNs in categoricals are not allowed and now cause an assertion error. Motivation: an explicit conversion via pd.Categorical(df.col, categories=[MyTypedFrame.schema['col']]) can introduce such NaNs and bypass the type check. See the pd.Categorical documentation.
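To illustrate the motivation: when pd.Categorical is given an explicit categories list, values outside that list are turned into NaN rather than rejected. A minimal pandas-only sketch with made-up values (behavior as in the pandas versions listed above; newer releases may additionally emit a warning):

```python
import pandas as pd

values = ['foo', 'bar', 'baz']

# 'baz' is not among the allowed categories, so it becomes NaN.
cat = pd.Categorical(values, categories=['foo', 'bar'])

nan_mask = pd.isna(cat)          # boolean array marking the NaN entries
nan_count = int(nan_mask.sum())  # number of values lost in the conversion
```

The resulting column still has a valid categorical dtype, which is why a dtype-only check would not notice the lost value.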
v0.6.1
New Functionality
- updated docstrings
Breaking changes