A lightweight and flexible validation package for pandas data structures.



A data validation library for scientists, engineers, and analysts seeking correctness.


Project Status: Active. The project has reached a stable, usable state and is being actively developed.

pandas data structures contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings. With pandera, you can:

  1. Check the types and properties of columns in a DataFrame or values in a Series.
  2. Perform more complex statistical validation like hypothesis testing.
  3. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.

Documentation

The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io

Install

Using pip:

pip install pandera

Using conda:

conda install -c conda-forge pandera

Quick Start

import pandas as pd
import pandera as pa


# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(pa.Int, checks=pa.Check.less_than_or_equal_to(10)),
    "column2": pa.Column(pa.Float, checks=pa.Check.less_than(-1.2)),
    "column3": pa.Column(pa.String, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema.validate(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1

Development Installation

git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements-dev.txt
pip install -e .

Tests

pip install pytest
pytest tests

Contributing to pandera

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview of how to contribute can be found in the contributing guide on GitHub.

Issues

Submit feature requests and bug reports on the project's GitHub issue tracker.

Other Data Validation Libraries

Here are a few other alternatives for validating Python data structures.

Generic Python object data validation

pandas-specific data validation

Other tools that include data validation

Why pandera?

  • pandas-centric data types, column nullability, and uniqueness are first-class concepts.
  • check_input and check_output decorators enable seamless integration with existing code.
  • Checks are flexible and performant because, by design, they give direct access to the pandas API.
  • Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
  • Checks and Hypothesis objects support both tidy and wide data validation.
  • Comprehensive documentation on key functionality.

Citation Information

@misc{niels_bantilan_2019_3385266,
  author       = {Niels Bantilan and
                  Nigel Markey and
                  Riccardo Albertazzi and
                  chr1st1ank},
  title        = {pandera-dev/pandera: 0.2.0 pre-release 1},
  month        = sep,
  year         = 2019,
  doi          = {10.5281/zenodo.3385266},
  url          = {https://doi.org/10.5281/zenodo.3385266}
}

This version: 0.4.2
