DCAT to Intake Catalog translation layer

intake-dcat

This is an intake data source for DCAT catalogs.

DCAT (Data Catalog Vocabulary) catalogs are a standardized format for describing metadata and access information for public datasets. Many Socrata and ESRI data portals publish data.json files in this format describing their catalogs. Two examples of these can be found at:

https://data.lacity.org/data.json

http://geohub.lacity.org/data.json

This project provides an opinionated way for users to load datasets from these catalogs into the scientific Python ecosystem. At the moment it loads CSVs into Pandas dataframes, and GeoJSON files and ESRI Shapefiles into GeoDataFrames. Future formats could include plain JSON and Parquet.
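To make the format handling concrete, here is a minimal sketch that walks a data.json document and picks, for each dataset, the first distribution in a supported format. It uses only the standard library and a tiny made-up document. The `dataset`, `distribution`, `mediaType`, `downloadURL`, and `accessURL` keys follow the DCAT-US data.json schema; the `PREFERRED` table, the media type strings in it, and the `pick_distribution` helper are assumptions for illustration and are not taken from intake-dcat itself.

```python
import json

# Hypothetical mapping from a distribution's media type to the container
# the README says that format is loaded into. The media type strings a
# given portal publishes may differ; these are assumptions.
PREFERRED = {
    "text/csv": "pandas.DataFrame",
    "application/vnd.geo+json": "geopandas.GeoDataFrame",
    "application/zip": "geopandas.GeoDataFrame",  # zipped ESRI Shapefile
}

# A tiny stand-in for a portal's data.json (not real data).
SAMPLE = json.loads("""
{
  "dataset": [
    {
      "identifier": "example-1",
      "title": "Example tabular dataset",
      "distribution": [
        {"mediaType": "text/html", "accessURL": "https://example.org/1"},
        {"mediaType": "text/csv", "downloadURL": "https://example.org/1.csv"}
      ]
    },
    {
      "identifier": "example-2",
      "title": "Example spatial dataset",
      "distribution": [
        {"mediaType": "application/vnd.geo+json",
         "downloadURL": "https://example.org/2.geojson"}
      ]
    },
    {"identifier": "example-3", "title": "Nothing loadable", "distribution": []}
  ]
}
""")


def pick_distribution(dataset):
    """Return (url, container) for the first supported distribution, else None."""
    for dist in dataset.get("distribution", []):
        container = PREFERRED.get(dist.get("mediaType"))
        url = dist.get("downloadURL") or dist.get("accessURL")
        if container and url:
            return url, container
    return None


choices = {d["identifier"]: pick_distribution(d) for d in SAMPLE["dataset"]}
for identifier, choice in choices.items():
    print(identifier, choice)
```

Datasets with no supported distribution map to `None` here; how the real package treats such datasets is not covered by this sketch.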

Requirements

intake >= 0.4.4
intake_geopandas >= 0.2.2
geopandas >= 0.5.0

Installation

intake-dcat is published on PyPI. You can install it by running the following in your terminal:

pip install intake-dcat

You can test the functionality by opening the example notebooks in the examples/ directory.

Usage

The package can be imported using

from intake_dcat import DCATCatalog

Loading a catalog

You can load data from a DCAT catalog by providing the URL to the data.json file:

catalog = DCATCatalog('http://geohub.lacity.org/data.json', name='geohub')
len(list(catalog))

You can display the items in the catalog:

for entry_id, entry in catalog.items():
    display(entry)

If the catalog has too many entries to comfortably print all at once, you can narrow it by searching for a term (e.g. 'district'):

for entry_id, entry in catalog.search('district').items():
    display(entry)
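As a rough mental model of what narrowing by a term does, the sketch below filters a plain dictionary of entries by a case-insensitive substring match on the name or description. The `search_entries` helper and the sample entries are hypothetical; the real `search` method may match on different fields or with different rules.

```python
def search_entries(entries, term):
    """Keep entries whose name or description contains `term` (case-insensitive)."""
    needle = term.lower()
    return {
        name: description
        for name, description in entries.items()
        if needle in name.lower() or needle in description.lower()
    }


# Made-up entries standing in for a catalog's name -> description pairs.
entries = {
    "council_districts": "Boundaries of city council districts",
    "bikeways": "Bike lanes and paths",
    "bike_racks": "Locations of bike racks",
}

matches = search_entries(entries, "district")
print(sorted(matches))
```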

Loading a dataset

Once you have identified a dataset, you can load it into a dataframe using read():

df = entry.read()

This will automatically load that dataset into a Pandas dataframe, or a GeoDataFrame, depending on the source format.

Specifying catalogs

You can read a DCATCatalog directly in Python using a URL, as done above, but it is also possible to write a catalog file that itself contains DCATCatalog entries. This allows you to more easily specify DCAT catalogs for use in distribution and version control.

For instance, this YAML file creates entries for two open data catalogs:

metadata:
  version: 1
sources:
  # Here we have two data sources for this catalog, which are themselves
  # DCAT catalogs, one for LA open data, and the other for LA GeoHub
  la_open_data:
    # We identify them as being loaded with the DCAT driver
    driver: dcat
    # Here we specify the args used to load the catalog
    args:
      # The URL to the catalog
      url: https://data.lacity.org/data.json
      # An optional name for the catalog.
      name: la-open-data
  la_geohub:
    driver: dcat
    args:
      url: http://geohub.lacity.org/data.json
      name: la_geohub
      # We can also specify a subset of the datasets in the catalog using an "items"
      # dictionary. If these are specified, only these datasets will be available in
      # the resulting catalog. They will be available under the more human-readable
      # name specified as the key.
      items:
        # So, this dataset will be available as "bikeways"
        bikeways: http://geohub.lacity.org/datasets/2602345a7a8549518e8e3c873368c1d9_0
        city_boundary: http://geohub.lacity.org/datasets/09f503229d37414a8e67a7b6ceb9ec43_7
        bike_racks: http://geohub.lacity.org/datasets/3b022cced9704108af157d3d5eedb268_2
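If you maintain several such catalog files, it can be convenient to generate them. The following sketch renders a file with the same layout as the example above from a Python dictionary, using only the standard library. The `dcat_catalog_yaml` helper is hypothetical, and it assumes that names and URLs are simple enough not to need YAML quoting.

```python
def dcat_catalog_yaml(sources):
    """Render a catalog file with one `dcat` source per entry in `sources`.

    `sources` maps a source name to a dict with a "url" and an optional
    "items" mapping of readable name -> dataset identifier.
    """
    lines = ["metadata:", "  version: 1", "sources:"]
    for name, spec in sources.items():
        lines += [
            f"  {name}:",
            "    driver: dcat",
            "    args:",
            f"      url: {spec['url']}",
            f"      name: {name}",
        ]
        items = spec.get("items")
        if items:
            lines.append("      items:")
            lines += [f"        {key}: {value}" for key, value in items.items()]
    return "\n".join(lines) + "\n"


text = dcat_catalog_yaml({
    "la_open_data": {"url": "https://data.lacity.org/data.json"},
    "la_geohub": {
        "url": "http://geohub.lacity.org/data.json",
        "items": {
            "bikeways": "http://geohub.lacity.org/datasets/2602345a7a8549518e8e3c873368c1d9_0",
        },
    },
})
print(text)
```

A file written this way would typically be opened with intake's `open_catalog` function, after which each DCAT source appears as a nested catalog.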

Command Line Interface

intake-dcat provides a small command line interface for some common operations. These are invoked as intake-dcat <subcommand> <options>.

The `mirror` command

This command loads a manifest file that lists a set of DCAT entries, uploads them to a specified S3 bucket, and outputs a new catalog with identical entries pointing to the bucket.

An example manifest is given by

# Name of the LA open data portal
la-open-data:
  # URL to the open data portal catalog
  url: https://data.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    lapd_metrics: https://data.lacity.org/api/views/t6kt-2yic
# Name of the LA GeoHub data portal
la-geohub:
  # URL to the open data portal catalog
  url: http://geohub.lacity.org/data.json
  # The s3 bucket to upload the data to
  bucket_uri: s3://my-bucket
  # A list of data resources to mirror
  items:
    bikeways: http://geohub.lacity.org/datasets/2602345a7a8549518e8e3c873368c1d9_0
    city_boundary: http://geohub.lacity.org/datasets/09f503229d37414a8e67a7b6ceb9ec43_7

This can be mirrored using the command

intake-dcat mirror manifest.yml > new-catalog.yml

This command uses the boto3 library and assumes it can find AWS credentials. For more information, see the boto3 documentation on configuring credentials.
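Because mirroring uploads data, it can be worth sanity-checking a manifest before running the command. The sketch below checks an already-parsed manifest (a plain dictionary, since parsing YAML needs a third-party library) for the three keys used in the example above. The `check_manifest` helper and its rules are assumptions for illustration; they do not describe the validation intake-dcat itself performs.

```python
def check_manifest(manifest):
    """Return a list of human-readable problems found in a mirror manifest."""
    problems = []
    for portal, spec in manifest.items():
        # Each portal entry in the example has these three keys.
        for key in ("url", "bucket_uri", "items"):
            if key not in spec:
                problems.append(f"{portal}: missing '{key}'")
        bucket = spec.get("bucket_uri", "")
        if bucket and not bucket.startswith("s3://"):
            problems.append(f"{portal}: bucket_uri should start with 's3://'")
        if "items" in spec and not spec["items"]:
            problems.append(f"{portal}: 'items' is empty")
    return problems


good = {
    "la-open-data": {
        "url": "https://data.lacity.org/data.json",
        "bucket_uri": "s3://my-bucket",
        "items": {"lapd_metrics": "https://data.lacity.org/api/views/t6kt-2yic"},
    }
}
bad = {
    "la-geohub": {
        "url": "http://geohub.lacity.org/data.json",
        "bucket_uri": "my-bucket",  # missing the s3:// scheme, and no items
    }
}

print(check_manifest(good))
print(check_manifest(bad))
```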

The `create` command

This command creates a new intake catalog from a DCAT catalog and writes it to standard output. An example command is given by

intake-dcat create data.lacity.org/data.json > catalog.yml

Release history

0.4.0: Dec 10, 2019 (this version)
0.3.1: Oct 11, 2019
0.3.0: Sep 4, 2019
0.2.3: Jul 17, 2019
0.2.2: Jul 16, 2019
0.2.1: Jun 19, 2019
0.2.0: Apr 30, 2019
0.1.0: Apr 25, 2019

Download files

Source Distribution: intake-dcat-0.4.0.tar.gz (13.4 kB), uploaded Dec 10, 2019
Built Distribution: intake_dcat-0.4.0-py3-none-any.whl (14.3 kB), uploaded Dec 10, 2019, Python 3

Both files were uploaded via twine/2.0.0 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.1, without Trusted Publishing.

Hashes for intake-dcat-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`a8b7b447c5f6460ffce9bd498f74ad5b314bcdb084f72547b2e3e77a65afdc7c`
MD5	`8904a00911de7d4abe07ca2eca442bdc`
BLAKE2b-256	`62e25f895b963d693bfa51e96a46701c676b4ca7b784eb4848462483b985cfe8`

Hashes for intake_dcat-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8da19e677ce18282b69331aa93d733a4969dc8bbaf640e58039feef3e61b113a`
MD5	`22d5f993404d0e80ab4c5ae6e3f0f224`
BLAKE2b-256	`08ce2da44862d372a9b1b272d2e658cd36a62cdda31d25bea70c148863ca3665`
