Enforce column names & data types of pandas DataFrames
# Overview
`dataenforce` is a Python package used to enforce column names & types of pandas DataFrames using Python 3 type hinting.
A common issue in data analysis is passing dataframes into functions without a clear idea of which columns they contain; as columns are added to or removed from the input data, code can break in unexpected ways. With `dataenforce`, you can give your functions a clear interface and ensure that input dataframes have the right format whenever your code is used.
# How to install
Install with pip:
```
pip install dataenforce
```
You can also `pip install` it from source, or simply copy the `dataenforce` folder into your project.
# How to use
There are two parts to `dataenforce`: type hinting and validation. Use the provided `Dataset` type hint to indicate what shape input dataframes should have, and the `@validate` decorator to additionally ensure that the format is respected on every function call.
## Type-hinting: `Dataset`
The `Dataset` type indicates that we expect a `pandas.DataFrame`.
### Column name checking
```py
from dataenforce import Dataset
def process_data(data: Dataset["id", "name", "location"])
pass
```
The code above specifies that `data` must be a DataFrame with exactly the three listed columns. If you only want to require a subset of columns, use an ellipsis:
```py
def process_data(data: Dataset["id", "name", "location", ...])
pass
```
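To make the difference concrete, here is a minimal sketch contrasting the two forms when a DataFrame carries an extra column. It uses the `@validate` decorator described below, and the function names `exact` and `subset` are just for illustration; the behaviour shown assumes the exact form rejects extra columns while the ellipsis form only requires the listed subset, as described above.
```py
from dataenforce import Dataset, validate
import pandas as pd

@validate
def exact(data: Dataset["id", "name", "location"]):
    pass

@validate
def subset(data: Dataset["id", "name", "location", ...]):
    pass

# One extra column ("age") beyond the three listed ones
df = pd.DataFrame(dict(id=[1], name=["Alice"], location=["Paris"], age=[30]))

subset(df)  # Works: only the listed columns are required
exact(df)   # Raises a TypeError: the exact form allows no extra columns
```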
### dtype checking
```py
def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float])
pass
```
The code above specifies the columns which must be present, together with their dtypes. You can mix bare column names with typed ones: `Dataset["id": int, "name"]`.
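As a hedged sketch of how dtype checking combines with `@validate` (introduced below), assuming a dtype mismatch is reported the same way as a missing column; the function name `locate` is just for illustration:
```py
from dataenforce import Dataset, validate
import pandas as pd

@validate
def locate(data: Dataset["id": int, "latitude": float, "longitude": float]):
    pass

locate(pd.DataFrame(dict(id=[1], latitude=[48.85], longitude=[2.35])))      # Works
locate(pd.DataFrame(dict(id=[1], latitude=["48.85"], longitude=["2.35"])))  # Raises a TypeError: wrong dtypes
```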
### Reusing dataframe formats
Since you're likely to use the same column subsets several times in your code, you can define them once, then reuse and combine them later:
```py
DName = Dataset["id", "name"]
DLocation = Dataset["id", "latitude", "longitude"]
# Expects columns id, name
def process1(data: DName):
    pass

# Expects columns id, name, latitude, longitude, timestamp
def process2(data: Dataset[DName, DLocation, "timestamp"]):
    pass
```
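For reference, a sketch of a DataFrame that satisfies the combined format expected by `process2`: it contains the union of the columns from `DName`, `DLocation` and `"timestamp"`. Note that the call is only actually checked once `process2` is decorated with `@validate`, introduced in the next section.
```py
import pandas as pd

df = pd.DataFrame(dict(
    id=[1],
    name=["Alice"],
    latitude=[48.85],
    longitude=[2.35],
    timestamp=[1700000000],
))
process2(df)  # Satisfies the combined format; enforced only under @validate
```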
## Enforcing: `@validate`
The `@validate` decorator ensures that input `Dataset`s have the right format when the function is called, and raises a `TypeError` otherwise.
```py
from dataenforce import Dataset, validate
import pandas as pd
@validate
def process_data(data: Dataset["id", "name"]):
    pass

process_data(pd.DataFrame(dict(id=[1, 2], name=["Alice", "Bob"])))  # Works
process_data(pd.DataFrame(dict(id=[1, 2])))  # Raises a TypeError: the "name" column is missing
```
# How to test
`dataenforce` uses `pytest` as its testing library. If you have `pytest` installed, just run `pytest` from the root folder of the repository.
# Notes
* You can use `dataenforce` to type-hint the return value of a function, but it is not currently possible to `validate` it (return values are not included in the checks); see the sketch after this list
* `dataenforce` is released under the Apache License 2.0, meaning you can freely use the library and redistribute it, provided Copyright is kept
* Dependencies: Pandas & Numpy
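As noted above, a return value can still carry a `Dataset` hint for documentation purposes even though `@validate` does not check it; a minimal sketch (the `load_users` name is just for illustration):
```py
from dataenforce import Dataset
import pandas as pd

def load_users() -> Dataset["id", "name"]:
    # The return annotation documents the expected columns,
    # but @validate does not check return values.
    return pd.DataFrame(dict(id=[1, 2], name=["Alice", "Bob"]))
```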