Skip to main content

Read and validate Frictionless Data Tabular Data Packages with pandas.

Project description

goodtables-pandas-py

tests coverage pypi

Warning: Not an official frictionlessdata package

This package reads and validates a Frictionless Data Tabular Data Package using pandas. It is about ~10x faster than the official frictionlessdata/frictionless-py, at the expense of higher memory usage.

Usage

pip install goodtables-pandas-py
import goodtables_pandas as goodtables

report = goodtables.validate(source='datapackage.json')

Implementation notes

Limitations

  • Only fields of type string, number, integer, boolean, date, datetime, year, and geopoint are currently supported. Other types can easily be supported with additional parse_* functions in parse.py.
  • Memory use could be greatly minimized by reading, parsing, and checking tables in chunks (using pandas.read_csv(chunksize=)), and storing only field values for unique and foreign key checks.

Uniqueness of null

Pandas chooses to treat missing values (null) as regular values, meaning that they are equal to themselves. How uniqueness is defined as a result is illustrated in the following examples.

unique not unique
(1), (null) (1), (null), (null)
(1, 1), (1, null) (1, 1), (1, null), (1, null)

As the following script demonstrates, pandas considers the repeated rows (1, null) to be duplicates, and thus not unique.

import pandas
import numpy as np

pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan])).duplicated()

0 False 1 False 2 True dtype: bool

Although this behavior matches some SQL implementations (namely Microsoft SQL Server), others (namely PostgreSQL and SQLite) choose to treat null as unique. See this dbfiddle.

Key constraints

primaryKey

Fields in primaryKey cannot contain missing values (equivalent to required: true).

See https://github.com/frictionlessdata/specs/issues/593.

uniqueKey

The uniqueKeys property provides support for one or more row uniqueness constraints which, unlike primaryKey, do support null values. Uniqueness is determined as described above.

{
  "uniqueKeys": [
    ["x", "y"],
    ["y", "z"]
  ]
}

See https://github.com/frictionlessdata/specs/issues/593.

foreignKey

The reference key of a foreignKey must meet the requirements of uniqueKey: it must be unique but can contain null. The local key must be present in the reference key, unless one of the fields is null.

reference local: valid local: invalid
(1) (1), (null) (2)
(1), (null) (1), (null) (2)
(1, 1) (1, 1), (1, null), (2, null) (1, 2)

De-duplication of key constraints

To avoid duplicate key checks, the various key constraints are expanded as follows:

  • Reference foreign keys (foreignKey.reference.fields) are added (if not already present) to the unique keys (uniqueKeys) of the reference resource. The foreignKey check only considers whether the local key is in the reference key.
  • The primary key (primaryKey) is moved (if not already present) to the unique keys (uniqueKeys) and the fields in the key become required (field.constraints.required: true) if not already.
  • Single-field unique keys (uniqueKeys) are dropped and the fields become unique (field.constraints.unique: true) if not already.

The following example illustrates the transformation in terms of Table Schema descriptor.

Original

{
  "fields": [
    {
      "name": "x",
      "required": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "primaryKey": ["x", "y"],
  "uniqueKeys": [
    ["x", "y"],
    ["x"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}

Checked

{
  "fields": [
    {
      "name": "x",
      "required": true,
      "unique": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "uniqueKeys": [
    ["x", "y"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

goodtables-pandas-py-0.2.0.tar.gz (16.6 kB view hashes)

Uploaded Source

Built Distribution

goodtables_pandas_py-0.2.0-py3-none-any.whl (16.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page