lightweight pandas.DataFrame schema

DFS (aka Dataframe_Schema)

DFS is a lightweight validator for pandas.DataFrame. You can think of it as JSON Schema for dataframes.

Key features:

  1. Lightweight: depends only on pandas and pydantic (which depends only on typing_extensions).
  2. Explicit: inspired by JSON Schema; all schemas are stored as JSON (or YAML) files and can be generated or changed on the fly.
  3. Simple: easy to use, with no need to change your workflow or dive into implementation details.
  4. Comprehensive: collects all errors into a single summary exception, checks distributions, and works on subsets of the dataframe.
  5. Rapid: base schemas can be generated from a given dataframe or SQL query (using pd.read_sql).
  6. Handy: supports a command-line interface (with the [cli] extra).
  7. Extendable: the core idea is to validate dataframes of any type. Only pandas is supported for now, but we plan to add abstractions that run the same checks on other dataframe types (CuDF, Dask, SparkDF, etc.).

QuickStart

1. Validate DataFrame

Via wrapper

import pandas as pd
import dfschema as dfs


df = pd.DataFrame({
  "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

schema_pass = {
  "shape": {"min_rows": 10}
}

schema_raise = {
  "shape": {"min_rows": 20}
}


dfs.validate(df, schema_pass)   # passes: no exception is raised
dfs.validate(df, schema_raise)  # raises DataFrameSchemaError

Alternatively (v2 optional), you can use the root class, DfSchema:

dfs.DfSchema.from_dict(schema_pass).validate(df)   # passes: no exception is raised
dfs.DfSchema.from_dict(schema_raise).validate(df)  # raises DataFrameSchemaError
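
To make the `shape` rule above concrete, here is a rough plain-pandas equivalent of what a `min_rows` check asserts. This is only an illustrative sketch: `check_min_rows` is a hypothetical helper, not part of dfschema, and it raises a plain `ValueError` rather than `DataFrameSchemaError`.

```python
import pandas as pd


def check_min_rows(df: pd.DataFrame, min_rows: int) -> None:
    """Raise ValueError if df has fewer than min_rows rows (hypothetical helper)."""
    n_rows = df.shape[0]
    if n_rows < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {n_rows}")


# Same 10-row frame as in the example above.
df = pd.DataFrame({"a": range(1, 11), "b": range(1, 11)})

check_min_rows(df, 10)  # passes silently, like schema_pass

try:
    check_min_rows(df, 20)  # fails, like schema_raise
except ValueError as err:
    message = str(err)
    print(message)
```

The bound is inclusive here, matching the example above, where a 10-row dataframe passes `min_rows: 10`.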

2. Generate Schema

dfs.DfSchema.from_df(df)

3. Read and Write Schemas

schema = dfs.DfSchema.from_file('schema.json')
schema.to_file("schema.yml")
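
A schema file can also be written by hand. The sketch below uses only the standard library and assumes (this is not confirmed here) that a JSON schema file holds the same dictionary that `dfs.validate` accepts; the temporary directory and the `schema.json` file name are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Same dictionary as schema_pass in the validation example.
schema = {"shape": {"min_rows": 10}}

# Write the schema to a JSON file in a scratch directory.
path = Path(tempfile.mkdtemp()) / "schema.json"
path.write_text(json.dumps(schema, indent=2))

# Read it back to confirm the file holds valid JSON with the same content.
loaded = json.loads(path.read_text())
print(loaded)

# Assumed follow-up, using the calls shown above:
#   dfs.DfSchema.from_file(str(path)).validate(df)
```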

4. Using CLI

Note: requires the [cli] extra, which relies on Typer and click.

Validate via CLI

dfschema validate --read_kwargs_json '{"delimiter": "|"}' FILEPATH SCHEMA_FILEPATH

Supports

  • csv
  • xlsx
  • parquet
  • feather
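
Hand-quoting the value for `--read_kwargs_json` is easy to get wrong. A minimal sketch, assuming the option expects a valid JSON object (as its name suggests), is to build the string with `json.dumps` and let `shlex` handle the shell quoting. The paths `data.csv` and `schema.json` are placeholders.

```python
import json
import shlex

# Reader options to forward; "delimiter" is taken from the example above.
read_kwargs = {"delimiter": "|"}
arg = json.dumps(read_kwargs)

# Placeholder paths stand in for FILEPATH and SCHEMA_FILEPATH.
command = ["dfschema", "validate", "--read_kwargs_json", arg, "data.csv", "schema.json"]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The same `command` list can be passed directly to `subprocess.run`, which avoids shell quoting altogether.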

Generate via CLI

dfschema generate --format 'yaml' DATA_PATH > schema.yaml

Installation

WIP

Alternatives

Changes

  • [[changelog]]

Roadmap

  • Add tutorial Notebook
  • Support tableschema
  • Support Modin models
  • Support SQLAlchemy ORM models
  • Built-in Airflow Operator?
  • Interactive CLI/jupyter for schema generation

Source Distribution

dfschema-0.0.12.tar.gz (17.7 kB)

Built Distribution

dfschema-0.0.12-py3-none-any.whl (22.2 kB)
