Enforce column names & data types of pandas DataFrames

Overview

dataenforce is a Python package used to enforce column names & types of pandas DataFrames using Python 3 type hinting.

A common problem in data analysis is passing DataFrames into functions without a clear idea of which columns they contain. As columns are added to or removed from the input data, code can break in unexpected ways. With dataenforce, you can give your functions a clear interface and ensure that input DataFrames have the expected format when your code is used.

How to install

Install with pip:

pip install dataenforce

You can also install it from the sources with pip, or import the dataenforce folder directly.

How to use

dataenforce has two parts: type hinting and validation. Use the provided Dataset class in type hints to indicate what shape the input DataFrames should have, and add the validation decorator to also check that format on every function call.

Type-hinting: Dataset

The Dataset type indicates that we expect a pandas.DataFrame.

Column name checking

from dataenforce import Dataset

def process_data(data: Dataset["id", "name", "location"]):
  pass

The code above specifies that data must be a DataFrame with exactly the 3 mentioned columns. To require only a subset of columns and allow any others, add an ellipsis:

def process_data(data: Dataset["id", "name", "location", ...]):
  pass
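The difference between the two forms can be pictured as a set comparison. The sketch below is plain Python, not dataenforce's actual implementation: `columns_match` and its parameters are hypothetical names, and it assumes that column order does not matter.

```python
def columns_match(actual, required, allow_extra=False):
    """Return True if the `actual` column names satisfy `required`.

    Hypothetical helper, not part of dataenforce.
    allow_extra=False mimics Dataset["id", "name"] (exactly these columns);
    allow_extra=True mimics Dataset["id", "name", ...] (at least these columns).
    """
    actual, required = set(actual), set(required)
    if allow_extra:
        return required <= actual  # every required column is present
    return required == actual      # no missing and no extra columns


cols = ["id", "name", "location", "timestamp"]
exact_ok = columns_match(cols, ["id", "name", "location"])
subset_ok = columns_match(cols, ["id", "name", "location"], allow_extra=True)
print(exact_ok, subset_ok)  # False True
```

The extra timestamp column fails the exact form but passes the ellipsis form.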

dtype checking

def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
  pass

The code above specifies the column names that must be present, along with their types. You can mix columns with and without types: Dataset["id": int, "name"].
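The "name": dtype form is ordinary Python subscript syntax: inside square brackets, "id": int is parsed as the slice object slice("id", int, None). The sketch below shows how a class could read such a subscript. `Spec` is a hypothetical stand-in written for illustration, not dataenforce's own code.

```python
class Spec:
    """Hypothetical stand-in for a Dataset-like type (not dataenforce code)."""

    def __class_getitem__(cls, item):
        # A single key arrives as-is; several keys arrive as a tuple.
        items = item if isinstance(item, tuple) else (item,)
        columns, dtypes, only_specified = [], {}, True
        for entry in items:
            if entry is Ellipsis:
                only_specified = False  # "..." allows extra columns
            elif isinstance(entry, slice):
                # "id": int reaches us as slice("id", int, None)
                columns.append(entry.start)
                dtypes[entry.start] = entry.stop
            else:
                columns.append(entry)   # bare column name, no dtype
        return columns, dtypes, only_specified


columns, dtypes, only_specified = Spec["id": int, "name", ...]
print(columns)         # ['id', 'name']
print(dtypes)          # {'id': <class 'int'>}
print(only_specified)  # False
```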

Reusing dataframe formats

As you're likely to use the same column subsets several times in your code, you can define them once, then reuse and combine them later:

DName = Dataset["id", "name"]
DLocation = Dataset["id", "latitude", "longitude"]

# Expects columns id, name
def process1(data: DName):
  pass

# Expects columns id, name, latitude, longitude, timestamp
def process2(data: Dataset[DName, DLocation, "timestamp"]):
  pass
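Combining formats amounts to taking the union of their columns, so a column shared by two formats (id here) is expected only once. A minimal sketch of that idea in plain Python follows. `combine` is a hypothetical helper, and tuples of names stand in for the Dataset formats.

```python
def combine(*parts):
    """Merge column names and groups of column names, keeping first-seen order.

    Hypothetical helper for illustration, not part of dataenforce.
    """
    merged = []
    for part in parts:
        names = [part] if isinstance(part, str) else part
        for name in names:
            if name not in merged:  # a shared column is listed only once
                merged.append(name)
    return merged


d_name = ("id", "name")
d_location = ("id", "latitude", "longitude")
combined = combine(d_name, d_location, "timestamp")
print(combined)  # ['id', 'name', 'latitude', 'longitude', 'timestamp']
```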

Enforcing: @validate

The @validate decorator ensures that input Datasets have the right format when the function is called; otherwise, it raises a TypeError.

from dataenforce import Dataset, validate
import pandas as pd

@validate
def process_data(data: Dataset["id", "name"]):
  pass

process_data(pd.DataFrame(dict(id=[1,2], name=["Alice", "Bob"]))) # Works
process_data(pd.DataFrame(dict(id=[1,2]))) # Raises a TypeError, column name missing
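To give an idea of how a decorator like this can work, here is a simplified sketch using only the standard library. It is not dataenforce's implementation: `Frame`, `Columns`, and `check_columns` are hypothetical names, `Frame` stands in for a DataFrame, and only missing columns are checked.

```python
import functools
import inspect


class Frame:
    """Minimal stand-in for a DataFrame: it only exposes column names."""

    def __init__(self, **columns):
        self.columns = list(columns)


class Columns:
    """Hypothetical annotation object listing required column names."""

    def __init__(self, *names):
        self.names = set(names)


def check_columns(func):
    """Sketch of a validate-like decorator (not dataenforce's code)."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map the call's arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if isinstance(expected, Columns):
                missing = expected.names - set(value.columns)
                if missing:
                    raise TypeError(f"{name} is missing columns: {sorted(missing)}")
        return func(*args, **kwargs)

    return wrapper


@check_columns
def process_data(data: Columns("id", "name")):
    return len(data.columns)


print(process_data(Frame(id=[1, 2], name=["Alice", "Bob"])))  # 2
try:
    process_data(Frame(id=[1, 2]))
except TypeError as error:
    print(error)  # data is missing columns: ['name']
```

Because the check runs inside the wrapper, it is repeated on every call, which is what makes a badly shaped input fail at the function boundary rather than somewhere deeper in the code.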

How to test

dataenforce uses pytest for testing. With pytest installed, run PYTHONPATH="." pytest on the command line from the root folder.

Notes

  • You can use dataenforce to type-hint the return value of a function, but the return value cannot currently be validated (it is not included in the checks).
  • You can't use @validate on a function whose type hints refer to non-base classes as strings (like def f() -> "MyClass"). This issue is related to PEP 563.
  • This work is in an experimental state and is not production-ready. Please raise issues and send pull requests if you find or fix bugs.
  • dataenforce is released under the Apache License 2.0, meaning you can freely use and redistribute the library, provided the copyright notice is kept.
  • Dependencies: Pandas & Numpy
  • Tested with Python 3.6, 3.7, 3.8
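The note about string type hints can be illustrated with the standard library alone. A hint written as a string is stored as a string, and resolving it to a class requires that name to be available where the function was defined. Whether this is exactly what trips up @validate is an assumption; the source only says the issue is related to PEP 563. `build` and `MyClass` are hypothetical names.

```python
import typing


def build() -> "MyClass":  # the hint is stored as a plain string
    ...


# The raw annotation is the string itself, not a class object.
raw = build.__annotations__["return"]
print(repr(raw))  # 'MyClass'

# Resolving the string requires the name to exist in the function's module.
try:
    typing.get_type_hints(build)
    resolved = True
except NameError:
    resolved = False  # MyClass is not defined anywhere in this module
print(resolved)  # False
```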
