Skip to main content

Read and validate Frictionless Data Tabular Data Packages with pandas.

Project description

goodtables-pandas-py

tests coverage pypi

Warning: Not an official frictionlessdata package

This package reads and validates a Frictionless Data Tabular Data Package using pandas. It is about ~10x faster than the official frictionlessdata/frictionless-py, at the expense of higher memory usage.

Usage

pip install goodtables-pandas-py
import goodtables_pandas as goodtables

report = goodtables.validate(source='datapackage.json')

Implementation notes

Limitations

  • Only fields of type string, number, integer, boolean, date, datetime, year, and geopoint are currently supported. Other types can easily be supported with additional parse_* functions in parse.py.
  • Memory use could be greatly minimized by reading, parsing, and checking tables in chunks (using pandas.read_csv(chunksize=)), and storing only field values for unique and foreign key checks.

Uniqueness of null

Pandas chooses to treat missing values (null) as regular values, meaning that they are equal to themselves. How uniqueness is defined as a result is illustrated in the following examples.

unique not unique
(1), (null) (1), (null), (null)
(1, 1), (1, null) (1, 1), (1, null), (1, null)

As the following script demonstrates, pandas considers the repeated rows (1, null) to be duplicates, and thus not unique.

import pandas
import numpy as np

pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan])).duplicated()

0 False 1 False 2 True dtype: bool

Although this behavior matches some SQL implementations (namely Microsoft SQL Server), others (namely PostgreSQL and SQLite) choose to treat null as unique. See this dbfiddle.

Key constraints

primaryKey

Fields in primaryKey cannot contain missing values (equivalent to required: true).

See https://github.com/frictionlessdata/specs/issues/593.

uniqueKey

The uniqueKeys property provides support for one or more row uniqueness constraints which, unlike primaryKey, do support null values. Uniqueness is determined as described above.

{
  "uniqueKeys": [
    ["x", "y"],
    ["y", "z"]
  ]
}

See https://github.com/frictionlessdata/specs/issues/593.

foreignKey

The reference key of a foreignKey must meet the requirements of uniqueKey: it must be unique but can contain null. The local key must be present in the reference key, unless one of the fields is null.

reference local: valid local: invalid
(1) (1), (null) (2)
(1), (null) (1), (null) (2)
(1, 1) (1, 1), (1, null), (2, null) (1, 2)

De-duplication of key constraints

To avoid duplicate key checks, the various key constraints are expanded as follows:

  • Reference foreign keys (foreignKey.reference.fields) are added (if not already present) to the unique keys (uniqueKeys) of the reference resource. The foreignKey check only considers whether the local key is in the reference key.
  • The primary key (primaryKey) is moved (if not already present) to the unique keys (uniqueKeys) and the fields in the key become required (field.constraints.required: true) if not already.
  • Single-field unique keys (uniqueKeys) are dropped and the fields become unique (field.constraints.unique: true) if not already.

The following example illustrates the transformation in terms of Table Schema descriptor.

Original

{
  "fields": [
    {
      "name": "x",
      "required": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "primaryKey": ["x", "y"],
  "uniqueKeys": [
    ["x", "y"],
    ["x"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}

Checked

{
  "fields": [
    {
      "name": "x",
      "required": true,
      "unique": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "uniqueKeys": [
    ["x", "y"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

goodtables-pandas-py-0.2.0.tar.gz (16.6 kB view details)

Uploaded Source

Built Distribution

goodtables_pandas_py-0.2.0-py3-none-any.whl (16.7 kB view details)

Uploaded Python 3

File details

Details for the file goodtables-pandas-py-0.2.0.tar.gz.

File metadata

  • Download URL: goodtables-pandas-py-0.2.0.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.3 CPython/3.8.5 Darwin/18.7.0

File hashes

Hashes for goodtables-pandas-py-0.2.0.tar.gz
Algorithm Hash digest
SHA256 194171af746bf96bb2198866b31aa76cd5b4547a77f6dd55d821ad97ef268406
MD5 534919c222be4d57aeff8ffc80485af9
BLAKE2b-256 f1983ce8b66fe03b6e4c57c8d63768588f30dde3615eefb6bcf231e8e777b6a4

See more details on using hashes here.

File details

Details for the file goodtables_pandas_py-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for goodtables_pandas_py-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 85a7b9c0574f83928dff8b2a79dac12e395f17a783872dc8883792c7d634c060
MD5 6fe780058f4ae951bf84b69c6d820c06
BLAKE2b-256 f16cece7f50e18cd73ebe2701cf2462218de8b532253620eda254c9a53f6f60f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page