lightweight pandas.DataFrame schema
DFS (aka Dataframe_Schema)
DFS is a lightweight validator for pandas.DataFrame. You can think of it as a jsonschema for dataframes.
Key features:
- Lightweight: depends only on pandas and pydantic (which depends only on typing_extensions).
- Explicit: inspired by JsonSchema, all schemas are stored as JSON (or YAML) files and can be generated or changed on the fly.
- Simple: easy to use, with no need to change your workflow or dive into implementation details.
- Comprehensive: summarizes all errors in a single summary exception, checks distributions, and works on subsets of the dataframe.
- Rapid: base schemas can be generated from a given dataframe or SQL query (using pd.read_sql).
- Handy: supports a command line interface (with the [cli] extra).
- Extendable: the core idea is to validate dataframes of any type. Only pandas is supported for now, but we'll add abstractions to run the same checks on other types of dataframes (CuDF, Dask, SparkDF, etc.).
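To illustrate the "single summary exception" idea from the list above, here is a minimal sketch in plain pandas. It is not the DFS implementation: `SummaryError`, `check_frame`, and the `required_columns` argument are hypothetical names chosen for this example. The point is only that every check runs before anything is raised, so one exception reports all failures.

```python
import pandas as pd


class SummaryError(Exception):
    """Hypothetical exception that carries every failed check at once."""

    def __init__(self, errors):
        self.errors = list(errors)
        message = f"{len(self.errors)} check(s) failed: " + "; ".join(self.errors)
        super().__init__(message)


def check_frame(df, min_rows=None, required_columns=()):
    """Run every check, collect the failures, and raise once at the end."""
    errors = []
    if min_rows is not None and len(df) < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {len(df)}")
    for name in required_columns:
        if name not in df.columns:
            errors.append(f"missing column {name!r}")
    if errors:
        raise SummaryError(errors)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

try:
    # Both checks fail, and both are reported by the same exception.
    check_frame(df, min_rows=5, required_columns=["a", "c"])
except SummaryError as exc:
    failures = exc.errors
```

Collecting failures in a list and raising once means a caller sees the full set of problems in one run instead of fixing them one at a time.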
QuickStart
1. Validate DataFrame
Via wrapper
import pandas as pd
import dfschema as dfs
df = pd.DataFrame({
"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
schema_pass = {
"shape": {"min_rows": 10}
}
schema_raise = {
"shape": {"min_rows": 20}
}
dfs.validate(df, schema_pass)  # passes: df has 10 rows, which satisfies min_rows=10
dfs.validate(df, schema_raise)  # raises DataFrameSchemaError: df has fewer than 20 rows
Alternatively (v2 optional), you can use the root class, DfSchema:
dfs.DfSchema.from_dict(schema_pass).validate(df)  # passes
dfs.DfSchema.from_dict(schema_raise).validate(df)  # raises DataFrameSchemaError
2. Generate Schema
dfs.DfSchema.from_df(df)
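As a rough picture of what generating a base schema from a dataframe could involve, the sketch below derives a plain dict from a frame's row count and column dtypes. This is an assumption-laden illustration, not the output of `DfSchema.from_df`: the `infer_base_schema` helper and the `"columns"` layout are hypothetical, and only the `"shape": {"min_rows": ...}` key mirrors the schemas shown earlier.

```python
import pandas as pd


def infer_base_schema(df):
    """Build a plain dict describing the frame (hypothetical layout)."""
    return {
        # Same key as the hand-written schemas above.
        "shape": {"min_rows": len(df)},
        # Hypothetical: map each column name to its dtype name.
        "columns": {name: str(dtype) for name, dtype in df.dtypes.items()},
    }


df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype="int64"),
        "b": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
    }
)
schema = infer_base_schema(df)
```

A generated schema like this is a starting point. The constraints would normally be loosened or tightened by hand before being used for validation.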
3. Read and Write Schemas
schema = dfs.DfSchema.from_file('schema.json')
schema.to_file("schema.yml")
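Because schemas are plain JSON (or YAML) documents, they can also be inspected or edited with ordinary tools. The sketch below round-trips a schema dict through a JSON file using only the standard library. It does not use the DFS API, and the temporary file location is just for the example.

```python
import json
import tempfile
from pathlib import Path

schema = {"shape": {"min_rows": 10}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "schema.json"

    # Write the schema as human-readable JSON.
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

    # Read it back and change a constraint on the fly.
    loaded = json.loads(path.read_text(encoding="utf-8"))
    stricter = {**loaded, "shape": {**loaded["shape"], "min_rows": 20}}
```

The same approach would work for YAML with a YAML library installed; that dependency is left out here to keep the sketch self-contained.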
4. Using CLI
Note: requires the [cli] extra, as it relies on Typer and click.
Validate via CLI
dfschema validate --read_kwargs_json '{"delimiter": "|"}' FILEPATH SCHEMA_FILEPATH
Supported file formats:
- csv
- xlsx
- parquet
- feather
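One way a tool could handle the formats listed above is to pick a pandas reader from the file extension and forward read options supplied as a JSON string. The sketch below shows that idea under stated assumptions: `READERS` and `read_any` are hypothetical names, and this is not how the DFS CLI is necessarily implemented. Reading xlsx, parquet, or feather files additionally requires the optional engines pandas uses for those formats.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}


def read_any(path, read_kwargs_json="{}"):
    """Load a file with the reader matching its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    # Reader options arrive as a JSON object, e.g. '{"delimiter": "|"}'.
    kwargs = json.loads(read_kwargs_json)
    return READERS[suffix](path, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("a|b\n1|2\n3|4\n", encoding="utf-8")
    df = read_any(csv_path, '{"delimiter": "|"}')
```

Note that the options string must be valid JSON, with double-quoted keys and a colon between key and value.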
Generate via CLI
dfs generate --format 'yaml' DATA_PATH > schema.yaml
Installation
WIP
Alternatives
- TableSchema
- GreatExpectations: a large and complex package with HTML reports, an Airflow operator, connectors, etc. Can work on out-of-memory data, SQL databases, parquet, etc.
- Pandera: an awesome package, great and suitable for type hinting, compatible with hypothesis
- Tensorflow validate
- DTF expectations
Changes
- [[changelog]]
Roadmap
- Add tutorial Notebook
- Support tableschema
- Support Modin models
- Support SQLAlchemy ORM models
- Built-in Airflow Operator?
- Interactive CLI/jupyter for schema generation