DBnomics data model

Define, validate and transform DBnomics data.

For a quick schematic look at the data model, please read the cheat_sheet.md file. If you are a developer working on fetchers, you can print it!

Entities and relationships

provider -> dataset -> time series -> observations

Each provider contains datasets
Each dataset contains time series
Each time series contains observations
Each observation is a tuple like (period, value, attribute1, attribute2, ..., attributeN), where attributes are optional
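The hierarchy above can be pictured with a few plain Python containers. This is only an illustrative sketch: the class names `Provider`, `Dataset` and `Series`, as well as the sample codes, are hypothetical and are not claimed to match the classes shipped by the package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# An observation is (period, value, attribute1, ..., attributeN);
# the attributes are optional, so the tuple has at least two items.
Observation = Tuple[str, ...]


@dataclass
class Series:
    code: str
    observations: List[Observation] = field(default_factory=list)


@dataclass
class Dataset:
    code: str
    series: Dict[str, Series] = field(default_factory=dict)


@dataclass
class Provider:
    code: str
    datasets: Dict[str, Dataset] = field(default_factory=dict)


# Hypothetical sample data: provider -> dataset -> time series -> observations
provider = Provider(code="provider1")
dataset = Dataset(code="dataset1")
provider.datasets[dataset.code] = dataset
series = Series(code="A.FR", observations=[("2017", "1.5"), ("2018", "NA", "E")])
dataset.series[series.code] = series
```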

Note: the singular and plural forms of "time series" are identical (cf. Wiktionary).

Storage

DBnomics data is stored in regular directories of the file-system.

Each storage directory contains the data of one provider, as converted by a fetcher.

✓ The directory name MUST be {provider_code}-json-data.

Revisions

Each storage directory is versioned using Git in order to track revisions.

General constraints

Minimal data

Data MUST NOT be stored if it adds no value or if it can be computed from any other data.

As a consequence:

series names MUST NOT be generated when they are not provided by source data, because DBnomics can generate a name from the dimensions values codes.

Data stability

Any commit in the storage directory of a provider MUST reflect a change from the side of the provider.

Data conversions MUST be stable: running a conversion script on the same source-data MUST NOT change converted data.

As a consequence:

when series codes are generated from a dimensions dict, always use the same order;
properties of JSON objects MUST be sorted alphabetically;
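One way a fetcher could satisfy the sorted-properties rule is to let the standard `json` module sort keys when writing. The helper name `dump_stable` is hypothetical. Note that `sort_keys=True` orders keys by code point, which matches alphabetical order only for plain ASCII keys of the same case.

```python
import json


def dump_stable(obj):
    """Serialize to JSON with sorted keys so that repeated runs give the same text."""
    return json.dumps(obj, ensure_ascii=False, indent=2, sort_keys=True)


# Two dicts built in a different insertion order serialize identically.
a = dump_stable({"name": "Provider 1", "code": "provider1"})
b = dump_stable({"code": "provider1", "name": "Provider 1"})
```

Because the output no longer depends on the insertion order of the dict, re-running a conversion on unchanged source data should not produce a spurious Git diff.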

`/provider.json`

This JSON file contains meta-data about the provider.

See its JSON schema.

`/category_tree.json`

This JSON file contains a tree of categories whose leaves are datasets and nodes are categories.

This file is optional:

if categories are provided by source data, it SHOULD exist;
if it's missing, DBnomics will generate the tree as a list of datasets ordered lexicographically;
it MUST NOT be written if it is identical to the generated list mentioned above (due to the general constraint about minimal data)

See its JSON schema.

`/{dataset_code}/`

This directory contains data about a dataset of the provider.

The directory name MUST be equal to the dataset code.

`/{dataset_code}/dataset.json`

This JSON file contains meta-data about a dataset of the provider.

See its JSON schema.

The series property is optional: see storing time series section.

`/{dataset_code}/series.jsonl`

This JSON-lines file contains meta-data about time series of a dataset of a provider.

Each line is a JSON object validated against this JSON schema.

This file is optional: see storing time series section.

`/{dataset_code}/{series_code}.tsv`

This TSV file contains observations of a time series of a dataset of a provider.

These files are optional: see storing time series section.

Constraints on time series

With providers using series codes composed of dimensions values codes:
- The separator MUST be '.' to be compatible with series codes masks. It is allowed to change the separator used originally by the provider. Example: this commit on BIS.
- The parts of the series code MUST follow the order defined by dimensions_codes_order. Example: if dimensions_codes_order = ["FREQ", "COUNTRY"], the series code MUST be A.FR and not FR.A.
- When dimensions codes order is not defined by the provider, the lexicographic order of the dimensions codes SHOULD be used, and the dimensions_codes_order key MUST NOT be written. Example: if dimensions are FREQ and COUNTRY, the series code is FR.A because dimensions codes are sorted alphabetically: ["COUNTRY", "FREQ"].
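The ordering rules above can be sketched as a small function. The name `build_series_code` is hypothetical and is not part of the package; the sample dimensions are the ones used in the examples above.

```python
def build_series_code(dimensions, dimensions_codes_order=None):
    """Join dimension value codes with '.' in a deterministic order.

    dimensions maps a dimension code to a value code, e.g. {"FREQ": "A"}.
    When no order is defined by the provider, fall back to the
    lexicographic order of the dimension codes.
    """
    if dimensions_codes_order is None:
        order = sorted(dimensions)
    else:
        order = dimensions_codes_order
    return ".".join(dimensions[code] for code in order)


dimensions = {"FREQ": "A", "COUNTRY": "FR"}
with_order = build_series_code(dimensions, ["FREQ", "COUNTRY"])
without_order = build_series_code(dimensions)
```

With the explicit order the code is `A.FR`; without it the dimension codes are sorted as `["COUNTRY", "FREQ"]` and the code is `FR.A`.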

Constraints on TSV files

Note: The ✓ symbol means that a constraint is validated by the validation script.

TSV files MUST be encoded in UTF-8.
✓ The first two columns of the header MUST be named PERIOD and VALUE.
✓ Each row MUST have the same number of columns as the header.
The values of the PERIOD column:
- ✓ MUST respect a specific format:
  - YYYY for years
  - YYYY-MM for months (MUST be padded for MM)
  - YYYY-MM-DD for days (MUST be padded for MM and DD)
  - YYYY-Q[1-4] for year quarters
    - example: 2018-Q1 represents jan to mar 2018, and 2018-Q4 represents oct to dec 2018
  - YYYY-S[1-2] for year semesters (aka bi-annual, semi-annual)
    - example: 2018-S1 represents jan to jun 2018, and 2018-S2 represents jul to dec 2018
  - YYYY-B[1-6] for pairs of months (aka bi-monthly)
    - example: 2018-B1 represents jan + feb 2018, and 2018-B6 represents nov + dec 2018
  - YYYY-W[01-53] for year weeks (MUST be padded)
- ✓ MUST all have the same format
- ✓ MUST NOT include average values, like M13 or Q5 periods (some providers do this)
- MUST be consistent with the frequency (i.e. use YYYY-Q[1-4] for quarterly observations, not YYYY-MM-DD, even if those daily periods have 3 months between them)
✓ The PERIOD column MUST be sorted in an ascending order.
✓ The values of the VALUE column MUST either:
- follow the lexical format of decimal in XML Schema: a non-empty finite-length sequence of decimal digits, optionally separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'.
- OR be NA meaning "not available".
TSV files MAY have supplementary columns in order to tag some observation values.
- The values of these columns are free-form; an empty string "" means no tag
- Reuse values defined by the provider if possible; otherwise define values with the DBnomics team
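The PERIOD and VALUE rules above can be approximated with regular expressions. This is a sketch written for illustration, not the code of the validation script; the names `PERIOD_FORMATS`, `period_format`, `check_periods` and `is_valid_value` are hypothetical. It does not check calendar validity (for example, `2018-02-31` is accepted) or consistency with the frequency.

```python
import re

# One pattern per period format listed above.
PERIOD_FORMATS = {
    "year": re.compile(r"\d{4}"),
    "month": re.compile(r"\d{4}-(0[1-9]|1[0-2])"),
    "day": re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"),
    "quarter": re.compile(r"\d{4}-Q[1-4]"),
    "semester": re.compile(r"\d{4}-S[12]"),
    "bimonth": re.compile(r"\d{4}-B[1-6]"),
    "week": re.compile(r"\d{4}-W(0[1-9]|[1-4]\d|5[0-3])"),
}

# Lexical form of an XML Schema decimal: optional sign, digits, optional fraction.
VALUE_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def period_format(period):
    """Return the name of the matching format, or None if nothing matches."""
    for name, pattern in PERIOD_FORMATS.items():
        if pattern.fullmatch(period):
            return name
    return None


def check_periods(periods):
    """True if all periods share one valid format and are in ascending order."""
    formats = {period_format(p) for p in periods}
    if None in formats or len(formats) > 1:
        return False
    # Thanks to zero padding, string order equals chronological order
    # within a single format.
    return list(periods) == sorted(periods)


def is_valid_value(value):
    """True for 'NA' or a decimal number in XML Schema lexical form."""
    return value == "NA" or VALUE_RE.fullmatch(value) is not None
```

Periods such as `2018-Q5` or `2018-M13` match none of the patterns, so they are rejected along with unpadded months like `2018-1`.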

Storing time series

Meta-data

Time series meta-data can be stored either:

in {dataset_code}/dataset.json under the series property as a JSON array of objects
in {dataset_code}/series.jsonl, a JSON-lines file, each line being a (non-indented) JSON object

When a dataset contains a huge number of time series, the dataset.json file grows drastically. In this case, the series.jsonl format is recommended because parsing a JSON-lines file line-by-line consumes less memory than loading a whole JSON file. A maximum limit of 1000 time series in dataset.json is recommended. When series.jsonl is used, the series key of the dataset.json file should be: {"path": "series.jsonl"}.

Whatever format you choose, the JSON objects are validated against this JSON schema.

Constraints additional to the schema:

✓ The code properties of the series list MUST be unique
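As a rough illustration of why JSON-lines helps with memory, the sketch below decodes one series object at a time and checks the uniqueness constraint on codes as it goes. The function names and the sample lines are hypothetical.

```python
import json


def iter_series_jsonl(lines):
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def find_duplicate_codes(series_iter):
    """Return the sorted list of series codes that appear more than once."""
    seen, duplicates = set(), set()
    for series in series_iter:
        code = series["code"]
        if code in seen:
            duplicates.add(code)
        seen.add(code)
    return sorted(duplicates)


# Any iterable of lines works, including an open file object:
#     with open("dataset1/series.jsonl", encoding="utf-8") as f:
#         duplicates = find_duplicate_codes(iter_series_jsonl(f))
sample_lines = ['{"code": "A.FR"}\n', '{"code": "A.DE"}\n', '{"code": "A.FR"}\n']
codes = [s["code"] for s in iter_series_jsonl(sample_lines)]
duplicates = find_duplicate_codes(iter_series_jsonl(sample_lines))
```

Only the set of codes is kept in memory, not the series objects themselves.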

Examples:

this dataset stores time series meta-data in dataset.json under the series property
this dataset stores time series meta-data in series.jsonl

Dimensions values order

Sometimes the dimensions values order is different from the lexicographic one.

Example: for the dimension "country", we have "All countries [ALL]", "Afghanistan [AF]", "France [FR]", "Germany [DE]", "Other countries [OTHER]". In this case it seems more natural to display "All countries" first, and "Other countries" last. We don't want "Afghanistan" to come before "All countries" just because of lexicographic order.

It is possible to encode this order in dataset.json like this:

{
  "dimensions_values_labels": {
    "country": [
      ["ALL", "All countries"],
      ["AF", "Afghanistan"],
      ["FR", "France"],
      ["DE", "Germany"],
      ["OTHER", "Other countries"]
    ]
  }
}

Another case is when the dimensions values talk about units, and we want to order units from the smallest to the largest. For example, "millimeter", "centimeter", "meter", "kilometer".
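A consumer of dataset.json could read this structure as follows. The function name `ordered_values` is hypothetical, and the handling of a plain JSON object (falling back to lexicographic order of the codes) is an assumption made for this sketch rather than documented behavior.

```python
dataset_json = {
    "dimensions_values_labels": {
        "country": [
            ["ALL", "All countries"],
            ["AF", "Afghanistan"],
            ["FR", "France"],
            ["DE", "Germany"],
            ["OTHER", "Other countries"],
        ]
    }
}


def ordered_values(dataset_json, dimension_code):
    """Return (code, label) pairs in display order for one dimension."""
    values = dataset_json["dimensions_values_labels"][dimension_code]
    if isinstance(values, dict):
        # Assumption: an object carries no explicit order, so sort by code.
        return sorted(values.items())
    # A list of [code, label] pairs keeps the order chosen by the fetcher.
    return [(code, label) for code, label in values]


country_codes = [code for code, _ in ordered_values(dataset_json, "country")]
```

A JSON array is used for the ordered form because arrays are ordered sequences, whereas the members of a JSON object are not guaranteed to keep any particular order.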

Series attributes

In addition to dimensions, series can have attributes. Like dimensions, attributes have codes and labels.

Example: (from provider1-json-data/dataset2/dataset.json)

in dataset.json:

  "attributes_labels": {
      "UNIT_MULT": "Unit of multiplier"
  },
  "attributes_values_labels": {
      "UNIT_MULT": {
          "9": "× 10^9"
      }
  },

then, for each series (in dataset.json or series.jsonl files)

  "attributes": {
      "UNIT_MULT": "9"
  },
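To show how the two fragments fit together, the sketch below resolves the attribute codes of a series to human-readable labels. The function name `describe_attributes` and the series code `A.FR` are hypothetical; falling back to the raw code when a label is missing is a choice made for this example.

```python
dataset_json = {
    "attributes_labels": {"UNIT_MULT": "Unit of multiplier"},
    "attributes_values_labels": {"UNIT_MULT": {"9": "× 10^9"}},
}
series_json = {"code": "A.FR", "attributes": {"UNIT_MULT": "9"}}


def describe_attributes(dataset_json, series_json):
    """Map attribute labels to value labels for one series."""
    labels = dataset_json.get("attributes_labels", {})
    values_labels = dataset_json.get("attributes_values_labels", {})
    result = {}
    for code, value_code in series_json.get("attributes", {}).items():
        # Fall back to the raw codes when no label is defined.
        label = labels.get(code, code)
        value_label = values_labels.get(code, {}).get(value_code, value_code)
        result[label] = value_label
    return result


described = describe_attributes(dataset_json, series_json)
```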

Observations

Time-series observations can be stored either:

in {dataset_code}/{series_code}.tsv TSV files
in {dataset_code}/series.jsonl, a JSON-lines file, each line being a (non-indented) JSON object, under the observations property of each object.

When a dataset contains a huge number of time series, the number of TSV files grows drastically. In this case, the series.jsonl format is recommended because a single file consumes less disk space than thousands of files (each file taking some kilobytes in the file-system table of contents), and because Git is slower when the number of committed files increases. A maximum limit of 1000 TSV files is recommended.
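For the TSV variant, reading a file back into observation tuples might look like the sketch below. It uses only the standard `csv` module; the function name `read_observations` and the `FLAG` column are hypothetical, and the two checks mirror the header and column-count constraints listed for TSV files.

```python
import csv
import io


def read_observations(tsv_text):
    """Parse TSV text into (header, list of observation tuples)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header[:2] != ["PERIOD", "VALUE"]:
        raise ValueError("the first two columns must be PERIOD and VALUE")
    observations = []
    for row in reader:
        if len(row) != len(header):
            raise ValueError("row has a different number of columns than the header")
        observations.append(tuple(row))
    return header, observations


# Hypothetical content of a {series_code}.tsv file with one extra tag column.
sample = "PERIOD\tVALUE\tFLAG\n2017\t1.5\t\n2018\tNA\tE\n"
header, observations = read_observations(sample)
```

Values are kept as strings here, so `NA` and the empty tag survive unchanged; converting them to numbers is left to the caller.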

Whatever format you choose, the JSON objects are validated against this JSON schema.

Examples:

this dataset stores observations in TSV files
this dataset stores observations in series.jsonl

Adding documentation to data (description and notes fields)

Datasets and series can be documented using description and notes fields.

description explains the meaning of the data
notes contains remarks about the data. Example: "Before March 2002, exposures were netted across the banking and trading books. This has necessitated a break in the series."

=> see this example

Data validation

dbnomics-data-model comes with a validation script. Validate a JSON data directory:

dbnomics-validate <storage_dir>

# for example:
dbnomics-validate wto-json-data

Note that some of the constraints expressed above are not yet checked by the validation script.

Some errors are warnings and are not displayed by default. Use the --developer-mode option to display all errors.

Testing

Run unit tests:

python setup.py test

Code quality:

pylint --rcfile ../code-style/pylintrc *.py dbnomics_data_model

Run validation script against dummy providers:

dbnomics-validate tests/fixtures/provider1-json-data
dbnomics-validate tests/fixtures/provider2-json-data

Changelog

See CHANGELOG.md. It contains an upgrade guide explaining how to modify the source code of your fetcher, if the data model changes in unexpected ways.

Publish a new version

For package maintainers:

git tag x.y.z
git push
git push --tags

GitLab CI will publish the package to https://pypi.org/project/dbnomics-data-model/ (see .gitlab-ci.yml).
