# datapackage-pipelines-fiscal

[![Travis](https://travis-ci.org/openspending/datapackage-pipelines-fiscal.svg?branch=master)](https://travis-ci.org/openspending/datapackage-pipelines-fiscal)

Extension for datapackage-pipelines used for loading Fiscal Data Packages into:
- S3 (or compatible) storage, in denormalized form
- a database, in normalized form

In addition:
- OpenSpending-compatible metadata is stored in an elasticsearch instance (if available)
- A `babbage` model is generated for querying the database using its API

This extension works with a custom source spec and a set of processors. The generator converts the source spec into a set of interdependent pipelines which, when run in dependency order, process the data and load it into the endpoints selected through environment variables.

## fiscal.source-spec.yaml

Each source-spec contains information regarding a single Fiscal Data Package.

Top level properties are:

#### `title`
Title, or Display name, of the data package

#### `dataset-name` [OPTIONAL]
A slug to be used as the data package's name.

If not provided, a slugified version of the title will be used.

#### `resource-name` [OPTIONAL]

A slug to be used as the main resource's name in the final data package.

If not provided, the dataset name will be used.
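
The two fallbacks above (title to dataset name, dataset name to resource name) can be sketched as follows. This is a minimal illustration, not the extension's own code: `slugify`, `dataset_name` and `resource_name` are hypothetical helpers, and the exact slug rules (lowercase ASCII, hyphen separator) are an assumption.

```python
import re
import unicodedata


def slugify(title):
    """Hypothetical slug helper: lowercase ASCII words joined by hyphens."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def dataset_name(spec):
    """Use `dataset-name` when given, otherwise a slug of the title."""
    return spec.get("dataset-name") or slugify(spec["title"])


def resource_name(spec):
    """Use `resource-name` when given, otherwise the dataset name."""
    return spec.get("resource-name") or dataset_name(spec)
```

With these helpers, a spec that only sets `title` gets the same slug for both names, while an explicit `dataset-name` also becomes the default resource name.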

#### `owner-id`

The id of the owner of this datapackage.

This identifier is used to generate various paths and storage names.

#### `sources`

Contains a non-empty list of data sources for the fiscal data package.

Each data source has these properties:
- `url`: The location of the data
- `name`: [OPTIONAL] A name for this source (will later be used as an intermediate resource name)

Other `tabulator` parameters can also be added as properties here, e.g. `sheet`, `encoding`, `compression` etc.

#### `fields`

Contains a non-empty list of fields for the fiscal data package.

Each field definition has these properties:
- `header`: The `name` of the field in the resulting resource
- `title` [OPTIONAL]: The display name of the field in the resulting resource
- `columnType`: The _ColumnType_ of the field
- `options`: Extra options to be added to the field, e.g. json-table-schema properties such as `decimalChar` etc.
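
To make the shape of these properties concrete, here is a small sketch that represents a source spec as a Python dict and checks the rules stated so far. Everything in it is illustrative: `check_source_spec` is a hypothetical helper, the URL and owner id are placeholders, the column types are example values only, and treating `header` and `columnType` as required is a reading of the list above rather than a documented rule.

```python
REQUIRED_TOP_LEVEL = ("title", "owner-id", "sources", "fields")

# Illustrative spec; the same structure would be written as YAML in practice.
EXAMPLE_SPEC = {
    "title": "Example Budget",
    "owner-id": "example-owner",
    "sources": [{"url": "https://example.com/budget.csv", "encoding": "utf-8"}],
    "fields": [
        {"header": "Year", "columnType": "date:fiscal-year"},
        {"header": "Amount", "columnType": "value",
         "options": {"decimalChar": ","}},
    ],
}


def _non_empty_list(spec, key, problems):
    """Return spec[key] if it is a non-empty list, else record a problem."""
    value = spec.get(key)
    if isinstance(value, list) and value:
        return value
    if key in spec:
        problems.append(f"{key} must be a non-empty list")
    return []


def check_source_spec(spec):
    """Return a list of problems found in the spec (empty when none)."""
    problems = [f"missing required property: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in spec]
    for source in _non_empty_list(spec, "sources", problems):
        if "url" not in source:
            problems.append("every source needs a url")
    for field in _non_empty_list(spec, "fields", problems):
        for key in ("header", "columnType"):
            if key not in field:
                problems.append(f"field is missing {key}")
    return problems
```

Calling `check_source_spec(EXAMPLE_SPEC)` returns an empty list, while a spec with an empty `sources` list or a missing `owner-id` yields one message per problem.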

#### `measures` [OPTIONAL]

Extra information for measure normalization processing.
(Measure normalization is the process of reducing the number of measures to one, while multiplying the number of rows and adding extra columns whose values identify the original measure.)

Contains the following sub-properties:
- `currency`: The currency code of the output measure column
- `title` [OPTIONAL]: The title for the output measure column
- `mapping`: Unpivoting map.

The unpivoting map is a map from a measure's name to its unpivoting data.

"Unpivoting data" is a map from an extra column's name to the value that column takes in rows generated from that measure.

Example:
```yaml
measures:
  currency: GTQ
  mapping:
    APPROVED:
      PHASE_ID: "0"
      PHASE: Inicial
    RELEASED:
      PHASE_ID: "1"
      PHASE: Vigente
    COMMITTED:
      PHASE_ID: "2"
      PHASE: Comprometido
```

- `currencies` [OPTIONAL]: List of currency codes to convert to ('USD' by default).
See next section for details
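
The unpivoting described in this section can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the extension's processor: `unpivot_measures` is a hypothetical function, and the name of the single output measure column (`value` here) is an assumption.

```python
# Mapping taken from the example above.
MAPPING = {
    "APPROVED": {"PHASE_ID": "0", "PHASE": "Inicial"},
    "RELEASED": {"PHASE_ID": "1", "PHASE": "Vigente"},
    "COMMITTED": {"PHASE_ID": "2", "PHASE": "Comprometido"},
}


def unpivot_measures(row, mapping, value_column="value"):
    """Turn one row with several measure columns into one row per measure."""
    # Columns that are not measures are copied unchanged to every output row.
    shared = {k: v for k, v in row.items() if k not in mapping}
    out = []
    for measure, extra_columns in mapping.items():
        if measure not in row:
            continue  # this row has no value for that measure
        new_row = dict(shared)
        new_row.update(extra_columns)         # e.g. PHASE_ID and PHASE
        new_row[value_column] = row[measure]  # the single remaining measure
        out.append(new_row)
    return out
```

A row such as `{"ENTITY": "Salud", "APPROVED": 100, "RELEASED": 80, "COMMITTED": 60}` becomes three rows, each carrying `ENTITY`, the two `PHASE` columns and one `value`.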


#### `currency-conversion` [OPTIONAL]

Instructions for adding an extra column or columns with measure values in another currency.

- `date_measure` [OPTIONAL]: Column name from which a date can be extracted.
If not provided, a guess will be made according to the _ColumnType_.

- `title` [OPTIONAL]: Title for the currency-converted measure columns.

#### `datapackage-url` [OPTIONAL]

Contains the URL of the source datapackage this data came from.
If provided, metadata for this datapackage will be loaded from this URL.

#### `deduplicate` [OPTIONAL]

If `true`, then the source data will be processed to remove duplicate rows (i.e. rows which have the same values in the primary key). Measure values for these rows will be summed in order to generate a single output row.
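
A minimal sketch of this behaviour, assuming numeric measure values and that non-key, non-measure columns keep the value from the first row seen (the text above does not say how those are resolved). `deduplicate` is a hypothetical function, not the extension's processor.

```python
def deduplicate(rows, primary_key, measures):
    """Merge rows sharing the same primary key, summing their measures."""
    merged = {}
    for row in rows:
        key = tuple(row[column] for column in primary_key)
        if key not in merged:
            # First occurrence: copy so the input rows are left untouched.
            merged[key] = dict(row)
        else:
            for measure in measures:
                merged[key][measure] += row[measure]
    # Dicts preserve insertion order, so output follows first appearance.
    return list(merged.values())
```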

#### `postprocessing` [OPTIONAL]

A list of extra processors (and their parameters) that will be applied to the data.
The format is the same as in any `pipeline-spec.yaml`.

## Generated Pipelines

#### ./denormalized_flow

- Loads external metadata
- Collects all data from all sources
- Combines different sources into one unified stream
- Does measure normalization
- Does currency conversion
- Does row deduplication
- Does extra processing steps

Outputs:
- Denormalized data (local file)
- List of fiscal years in a separate resource (local file)
- Updates os package registry (if configured)

#### ./finalize_datapackage_flow_splitter
_(depends on `./denormalized_flow`)_

- Loads denormalized package
- Writes separate per-year filtered copies of the data
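
The per-year split can be pictured as a simple grouping step. The sketch below is an assumption-laden illustration: `split_by_year` is a hypothetical helper and the fiscal-year column name is passed in explicitly, whereas the pipeline presumably derives it from the field definitions.

```python
from collections import defaultdict


def split_by_year(rows, year_column):
    """Group rows into one list per fiscal year, keyed by year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row[year_column]].append(row)
    # Return a plain dict with years in ascending order.
    return {year: by_year[year] for year in sorted(by_year)}
```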

#### ./finalize_datapackage_flow
_(depends on `./finalize_datapackage_flow_splitter`)_

- Loads all resources from the `splitter` pipeline as well as the full denormalized dataset

Outputs:
- Stores results in S3 bucket
- Zip file with the datapackage (in case an S3 bucket is not configured)
- Updates os package registry (if configured)

#### ./dimension_flow_{hierarchy}
_(depends on `./denormalized_flow`)_

- Loads denormalized data
- Picks only _hierarchy_ columns
- Adds auto-incrementing id column
- Removes duplicates

Outputs:
- Normalized hierarchy data (local file)
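
As a rough sketch of what such a dimension table looks like, the function below keeps only the hierarchy columns, drops repeated combinations and numbers the remaining ones. It is not the pipeline's code: `build_dimension` is a hypothetical name, ids starting at 1 and the `id` column name are assumptions, and the sketch assigns an id the first time a combination is seen rather than numbering first and deduplicating afterwards.

```python
def build_dimension(rows, hierarchy_columns, id_column="id"):
    """Build a deduplicated dimension table with an auto-incrementing id."""
    seen = {}
    for row in rows:
        # Pick only the hierarchy columns of this row.
        key = tuple(row[column] for column in hierarchy_columns)
        if key not in seen:
            seen[key] = len(seen) + 1  # next id, starting at 1
    return [
        dict(zip(hierarchy_columns, key), **{id_column: row_id})
        for key, row_id in seen.items()
    ]
```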

#### ./normalized_flow
_(depends on `./denormalized_flow` and all `./dimension_flow_{hierarchy}`)_

- Loads denormalized data as fact table
- Loads all normalized hierarchy data
- Creates babbage model
- Replaces all hierarchy columns in fact table with corresponding ids from normalized hierarchy tables

Outputs:
- Normalized fact table (local file)
- Updates os package registry (if configured)
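
The replacement of hierarchy columns by ids amounts to a lookup against the dimension table. The following sketch shows the idea for a single hierarchy; `normalize_fact_rows`, the `id` column and the foreign-key column name are all hypothetical, and the dimension rows are assumed to contain the hierarchy columns plus an id.

```python
def normalize_fact_rows(rows, dimension, hierarchy_columns, fk_column,
                        id_column="id"):
    """Replace the hierarchy columns of each fact row with a dimension id."""
    # Map each combination of hierarchy values to its id in the dimension.
    lookup = {
        tuple(dim_row[column] for column in hierarchy_columns): dim_row[id_column]
        for dim_row in dimension
    }
    out = []
    for row in rows:
        key = tuple(row[column] for column in hierarchy_columns)
        # Keep every non-hierarchy column, then add the foreign key.
        fact = {k: v for k, v in row.items() if k not in hierarchy_columns}
        fact[fk_column] = lookup[key]
        out.append(fact)
    return out
```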

#### ./dumper_flow_{hierarchy}
_(depends on corresponding `./dimension_flow_{hierarchy}`)_

- Loads normalized hierarchy data
- Fixes nulls in primary key (replacing them with empty strings)

Outputs:
- Saves data as a single table in an SQL database

#### ./dumper_flow
_(depends on `./normalized_flow`)_

- Loads normalized fact table data
- Fixes nulls in primary key (replacing them with empty strings)

Outputs:
- Saves data as a single table in an SQL database
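
The null-fixing step used by both dumper pipelines can be sketched as below. `fix_null_keys` is a hypothetical helper; the motivation given in the comment (databases generally rejecting NULL in primary key columns) is an assumption about why the step exists, not something stated above.

```python
def fix_null_keys(rows, primary_key):
    """Replace None with an empty string in primary key columns only."""
    # Assumed motivation: most SQL databases reject NULL in a primary key.
    return [
        {
            column: ("" if column in primary_key and value is None else value)
            for column, value in row.items()
        }
        for row in rows
    ]
```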

#### ./dumper_flow_update_status
_(depends on `./dumper_flow`)_

Outputs:
- Updates os package registry (if configured) to record that the package was loaded successfully
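
The dependencies listed in this section form a graph, and one valid execution order can be derived from it with a topological sort. The sketch below uses the standard library's `graphlib` (Python 3.9+); `pipeline_order` is a hypothetical helper and the hierarchy names passed to it are placeholders. In practice the pipeline runner is expected to resolve the order itself.

```python
from graphlib import TopologicalSorter


def pipeline_order(hierarchies):
    """Return one valid run order for the generated pipelines."""
    # Each key maps a pipeline to the set of pipelines it depends on.
    deps = {
        "./denormalized_flow": set(),
        "./finalize_datapackage_flow_splitter": {"./denormalized_flow"},
        "./finalize_datapackage_flow": {"./finalize_datapackage_flow_splitter"},
        "./normalized_flow": {"./denormalized_flow"},
        "./dumper_flow": {"./normalized_flow"},
        "./dumper_flow_update_status": {"./dumper_flow"},
    }
    for hierarchy in hierarchies:
        dimension = f"./dimension_flow_{hierarchy}"
        deps[dimension] = {"./denormalized_flow"}
        deps["./normalized_flow"].add(dimension)
        deps[f"./dumper_flow_{hierarchy}"] = {dimension}
    # static_order() yields every pipeline after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())
```

For example, `pipeline_order(["administrative"])` always starts with `./denormalized_flow` and places `./normalized_flow` after `./dimension_flow_administrative`.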

## Environment variables

`DPP_DB_ENGINE` - connection string for an SQL database to dump data into

`ELASTICSEARCH_ADDRESS` [OPTIONAL] - connection string for an elasticsearch instance (used for package registry updating)

`S3_BUCKET_NAME` [OPTIONAL] - S3 bucket for uploading data. If not provided, local ZIP files will be created instead.

`AWS_ACCESS_KEY_ID` - S3 credentials (required if S3 bucket was specified)

`AWS_SECRET_ACCESS_KEY` - S3 credentials (required if S3 bucket was specified)
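
The way these variables select output targets can be summarized in a short sketch. `output_targets` and the keys of the returned dict are hypothetical and only restate the rules above; the extension's own handling may differ.

```python
import os


def output_targets(env=None):
    """Summarize which endpoints the given environment selects."""
    env = os.environ if env is None else env
    bucket = env.get("S3_BUCKET_NAME")
    has_credentials = bool(
        env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
    )
    if bucket and not has_credentials:
        raise ValueError("S3_BUCKET_NAME is set but AWS credentials are missing")
    return {
        "database": env.get("DPP_DB_ENGINE"),          # SQL connection string
        "registry": env.get("ELASTICSEARCH_ADDRESS"),  # None: no registry updates
        "storage": "s3" if bucket else "zip",          # local ZIP without a bucket
    }
```

Passing a plain dict instead of relying on `os.environ` makes the rules easy to check without touching the real environment.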

## Dependencies

To run the full fiscal datapackage flow, you need to have `os-types` installed, using npm:

`$ npm install -g os-types`

This external node.js utility is used to perform fiscal modelling for the processed datapackage.

## Contributing

Please read the contribution guidelines:

[How to Contribute](CONTRIBUTING.md)
