Library for accessing the Reference Data Repository

License: MIT

About

The Reference Data Repository provides access to reference datasets (e.g., controlled vocabularies and gazetteers) that are accessible on the Web and useful for data cleaning and data profiling tools such as openclean and Auctus.

Data Hosting

Individual datasets are hosted by data maintainers on different platforms. The only requirement is that the datasets (or individual dataset versions) are accessible via HTTP GET requests. Information about the datasets is maintained in a central index (a JSON file) that is hosted on the Web (see, for example, the openclean reference data collection).
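As a rough illustration of how a client might consume such an index, the sketch below downloads a JSON document with an HTTP GET request and maps dataset identifiers to their descriptors. The function names `parse_index` and `fetch_index` are hypothetical and not part of the refdata package, and the top-level layout `{"datasets": [...]}` is an assumption; the authoritative layout is whatever schema.yaml defines.

```python
import json
from typing import Dict
from urllib.request import urlopen


def parse_index(text: str) -> Dict[str, dict]:
    """Map dataset identifiers to their descriptors.

    Assumption: the index document has the form
    {"datasets": [descriptor, ...]}, where each descriptor has an "id".
    """
    doc = json.loads(text)
    return {entry["id"]: entry for entry in doc["datasets"]}


def fetch_index(url: str, timeout: float = 30.0) -> Dict[str, dict]:
    """Download the index with an HTTP GET request and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_index(response.read().decode("utf-8"))
```

Keeping the parsing step separate from the download makes it possible to check an index file on disk without network access.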

Datasets and Data Formats

Each dataset has a unique identifier. Different file formats are supported for the datasets, e.g., CSV files, JSON files, and SQLite database files. Format information for each dataset is stored as part of its entry in the global index.

Datasets are treated as tabular data (i.e., sets of columns). Users may access a single column from a dataset (e.g., country_name), multiple columns (e.g., country_name, capital_city), or the full dataset.

Below is an example dataset descriptor.

{
    "id": "encyclopaedia_britannica:us_cities",
    "name": "Cities in the U.S.",
    "description": "Names of cities in the U.S. from the Encyclopaedia Britannica.",
    "url": "https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/us_cities.tsv",
    "checksum": "d361873f13b867805628d7db63987392835114f13da9ead0e11ccff2946631d2",
    "webpage": "https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068",
    "schema": [
        {"id": "city", "name": "City", "description": "City Name", "dtype": "text"},
        {"id": "state", "name": "State", "description": "U.S. State Name", "dtype": "text"}
    ],
    "format": {
        "type": "csv",
        "parameters": {
            "delim": "\t"
        }
    }
}

The full schema for the data repository index content is defined in schema.yaml.
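To make the column-based access concrete, here is a minimal sketch that uses a descriptor's "schema" and "format" entries to read selected columns from delimited text. The helper `read_columns` and its `has_header` parameter are hypothetical, not the package's API. The sketch also assumes that columns appear in the file in schema order and that a missing "delim" parameter means a comma.

```python
import csv
import io
from typing import Iterator, List, Optional, Tuple


def read_columns(
    text: str,
    descriptor: dict,
    columns: Optional[List[str]] = None,
    has_header: bool = False,
) -> Iterator[Tuple[str, ...]]:
    """Yield rows restricted to the requested column identifiers.

    Assumptions: the file columns follow the order of the descriptor's
    schema, and the caller knows whether the file has a header row.
    """
    schema_ids = [col["id"] for col in descriptor["schema"]]
    # No selection means the full dataset.
    selected = schema_ids if columns is None else columns
    # Raises ValueError if a requested column is not in the schema.
    positions = [schema_ids.index(col_id) for col_id in selected]
    params = descriptor["format"].get("parameters", {})
    reader = csv.reader(io.StringIO(text), delimiter=params.get("delim", ","))
    if has_header:
        next(reader, None)
    for row in reader:
        yield tuple(row[i] for i in positions)
```

With the example descriptor above, `read_columns(text, descriptor, ["city"])` would yield one-element tuples holding only the city names, while omitting the column list would yield full rows.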

Local Data Repository

Users maintain copies of the datasets for local access. By default, datasets are stored in a subfolder in the user’s cache directory.
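The following sketch shows one way a per-user cache subfolder could be resolved using only the standard library. It is an assumption about typical platform conventions, not the package's actual logic. The function name `default_store_dir` and the subfolder name "refdata" are hypothetical.

```python
import os
import sys
from pathlib import Path


def default_store_dir(app_name: str = "refdata") -> Path:
    """Return a subfolder of the user's cache directory (assumed layout)."""
    if sys.platform == "win32":
        base = Path(os.environ.get("LOCALAPPDATA") or Path.home() / "AppData" / "Local")
    elif sys.platform == "darwin":
        base = Path.home() / "Library" / "Caches"
    else:
        # Follow the XDG convention on Linux and other Unix-like systems.
        base = Path(os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache")
    return base / app_name
```

A caller would typically create the folder on first use, for example with `default_store_dir().mkdir(parents=True, exist_ok=True)`.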

Getting Started

Install the package using pip from the GitHub repository:

pip install git+https://github.com/VIDA-NYU/reference-data-repository.git

This repository contains an example notebook that demonstrates the basic features of the package.

The package also includes a simple command line interface refdata that can be used to list contents of the repository index and to interact with the local data store.

Usage: refdata [OPTIONS] COMMAND [ARGS]...

  Command line interface for the Reference Data Repository.

Options:
  --help  Show this message and exit.

Commands:
  checksum  Print file checksum.
  index     Data Repository Index.
      list      List repository index content.
      show      Show dataset descriptor from repository index.
      validate  Validate repository index file.
  store     Local Data Store.
      download  List local store content.
      list      List local store content.
      remove    Remove dataset from local store.
      show      Show descriptor for downloaded dataset.
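The descriptor's "checksum" field and the checksum command suggest that downloaded files can be verified against the index. The sketch below assumes the checksum is the SHA-256 hex digest of the raw file bytes. The 64-character value in the example descriptor is consistent with that, but the source does not state the algorithm. The helper names `file_checksum` and `matches_descriptor` are hypothetical.

```python
import hashlib


def file_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_descriptor(path: str, descriptor: dict) -> bool:
    """Compare a local file against the descriptor's "checksum" value."""
    return file_checksum(path) == descriptor["checksum"]
```

Reading in chunks keeps memory use constant for large dataset files.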
