A light-weight and flexible validation package for pandas data structures.
Pandera
A flexible and expressive pandas validation library.
pandas data structures hide a lot of information, and explicitly
validating them at runtime in production-critical or reproducible research
settings is a good idea. pandera enables users to:
- Check the types and properties of columns in a DataFrame or values in a Series.
- Perform more complex statistical validation like hypothesis testing.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.
Documentation
The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io
Install
Using pip:
pip install pandera
Using conda:
conda install -c cosmicbboy pandera
Example Usage
DataFrameSchema
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check
# validate columns
schema = DataFrameSchema({
    # the check function expects a series argument and should output a boolean
    # or a boolean Series.
    "column1": Column(pa.Int, Check(lambda s: s <= 10)),
    "column2": Column(pa.Float, Check(lambda s: s < -1.2)),
    # you can provide a list of validators
    "column3": Column(pa.String, [
        Check(lambda s: s.str.startswith("value_")),
        Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})
# alternatively, you can pass strings representing the legal pandas datatypes:
# http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes
schema = DataFrameSchema({
    "column1": Column("int64", Check(lambda s: s <= 10)),
    # ...
})
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})
validated_df = schema.validate(df)
print(validated_df)
#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
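The comment in the schema above says a check function may return either a boolean or a boolean Series. To make that distinction concrete, the sketch below applies two of the check functions directly to the example columns using plain pandas, without pandera. The function names within_limit and has_two_parts are hypothetical and only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})


def within_limit(s):
    # Element-wise check: returns a boolean Series, one entry per row.
    return s <= 10


def has_two_parts(s):
    # Column-level check: returns a single boolean for the whole Series.
    return s.str.split("_", expand=True).shape[1] == 2


elementwise_result = within_limit(df["column1"])
columnwise_result = has_two_parts(df["column3"])

print(elementwise_result.tolist())
print(columnwise_result)
```

The first form reports a verdict for each row, while the second form can only say whether the column as a whole passes.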
Development Installation
git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements.txt
pip install -e .
Tests
pip install pytest
pytest tests
Contributing to pandera
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Issues
Submit feature requests or bug reports on the project's GitHub issue tracker.
Other Data Validation Libraries
Here are a few other alternatives for validating Python data structures.
Generic Python object data validation
pandas-specific data validation
Why pandera?
- pandas-centric data types, column nullability, and uniqueness are first-class concepts.
- check_input and check_output decorators enable seamless integration with existing code.
- Checks provide flexibility and performance by providing access to the pandas API by design.
- The Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
- Checks and Hypothesis objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
Citation Information
@misc{niels_bantilan_2019_3385266,
  author = {Niels Bantilan and
            Nigel Markey and
            Riccardo Albertazzi and
            chr1st1ank},
  title  = {pandera-dev/pandera: 0.2.0 pre-release 1},
  month  = sep,
  year   = 2019,
  doi    = {10.5281/zenodo.3385266},
  url    = {https://doi.org/10.5281/zenodo.3385266}
}