Library for accessing the Reference Data Repository
About
The Reference Data Repository provides access to reference datasets (e.g., controlled vocabularies and gazetteers) that are accessible on the Web and that are useful for data cleaning and data profiling tools such as openclean and Auctus.
Data Hosting
Individual datasets are hosted by data maintainers on different platforms. The only requirement is that the datasets (or individual dataset versions) are accessible via HTTP GET requests. Information about the datasets is maintained in a central index (a JSON file) that is hosted on the Web (see, for example, the openclean reference data collection).
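Because the only hosting requirement is HTTP GET, a client can read the index with a standard HTTP library. The following is a minimal sketch, not the package's own implementation. It assumes the index is a JSON object with a top-level `datasets` list whose entries each carry an `id` field; the function names `parse_index` and `fetch_index` are hypothetical.

```python
import json
from urllib.request import urlopen


def parse_index(text):
    """Map dataset identifiers to their descriptors.

    Assumption: the index is a JSON object with a top-level "datasets"
    list. The authoritative layout is defined by the repository schema.
    """
    doc = json.loads(text)
    return {entry["id"]: entry for entry in doc["datasets"]}


def fetch_index(url, timeout=30):
    """Download the index document with a plain HTTP GET and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_index(response.read().decode("utf-8"))
```

Keeping the parsing step separate from the download makes it possible to check an index document offline before publishing it.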
Datasets and Data Formats
Each dataset has a unique identifier. Datasets may be stored in different file formats, e.g., CSV files, JSON, or SQLite database files. Format information for each dataset is stored as part of its entry in the global index.
Datasets are considered tabular (i.e., sets of columns). Users may access a single column from a dataset (e.g., country_name), multiple columns (e.g., country_name, capital_city), or the full dataset.
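Column-level access can be illustrated with the standard library alone. This is a sketch under stated assumptions rather than the package's API: the helper `select_columns` is hypothetical, the data file is assumed to be delimited text without a header row, and the two sample rows are made up for illustration.

```python
import csv
import io


def select_columns(text, header, columns, delim="\t"):
    """Yield each row restricted to the requested columns, in request order.

    `header` lists the column identifiers in file order. Assumption: the
    file itself contains no header row.
    """
    positions = [header.index(name) for name in columns]
    for row in csv.reader(io.StringIO(text), delimiter=delim):
        yield [row[i] for i in positions]


# Made-up sample rows, tab-delimited.
sample = "France\tParis\nJapan\tTokyo\n"
header = ["country_name", "capital_city"]

# Single column.
names = list(select_columns(sample, header, ["country_name"]))

# Multiple columns; the output follows the requested order.
pairs = list(select_columns(sample, header, ["capital_city", "country_name"]))
```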
Below is an example dataset descriptor.
{
    "id": "encyclopaedia_britannica:us_cities",
    "name": "Cities in the U.S.",
    "description": "Names of cities in the U.S. from the Encyclopaedia Britannica.",
    "url": "https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/us_cities.tsv",
    "checksum": "d361873f13b867805628d7db63987392835114f13da9ead0e11ccff2946631d2",
    "webpage": "https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068",
    "schema": [
        {"id": "city", "name": "City", "description": "City Name", "dtype": "text"},
        {"id": "state", "name": "State", "description": "U.S. State Name", "dtype": "text"}
    ],
    "format": {
        "type": "csv",
        "parameters": {
            "delim": "\t"
        }
    }
}
The full schema for the data repository index content is defined in schema.yaml.
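A descriptor like the one above carries enough information to validate and parse a downloaded file. The sketch below is illustrative only. It assumes the `checksum` field is a SHA-256 hex digest (inferred from its 64-character length), that the file is UTF-8 encoded, and that a missing `delim` parameter means a comma; the helper names are hypothetical.

```python
import csv
import hashlib
import io


def verify_checksum(data, descriptor):
    """Compare the SHA-256 hex digest of the raw bytes with the descriptor.

    Assumption: the "checksum" field holds a SHA-256 hex digest.
    """
    return hashlib.sha256(data).hexdigest() == descriptor["checksum"]


def read_rows(data, descriptor):
    """Parse downloaded bytes using the descriptor's format section."""
    fmt = descriptor["format"]
    if fmt["type"] != "csv":
        raise ValueError("this sketch only handles the 'csv' format type")
    # Assumption: a comma is used when no delimiter is given.
    delim = fmt.get("parameters", {}).get("delim", ",")
    text = data.decode("utf-8")
    return list(csv.reader(io.StringIO(text), delimiter=delim))
```

Verifying the checksum before parsing guards against truncated downloads and against files that changed at the source after the index entry was written.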
Local Data Repository
Users maintain copies of the datasets for local access. By default, datasets are stored in a subfolder in the user’s cache directory.
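One way to resolve such a per-user cache folder is sketched below. This is an assumption-laden illustration, not the package's actual lookup: it follows the XDG convention used on Linux, and both the helper name `default_store_dir` and the subfolder name `refdata` are hypothetical. Other platforms keep caches in different locations.

```python
import os
from pathlib import Path


def default_store_dir(app_name="refdata"):
    """Return a per-user cache subfolder for downloaded datasets.

    Uses $XDG_CACHE_HOME when it is set and falls back to ~/.cache.
    The subfolder name is an assumption made for this example.
    """
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(Path.home(), ".cache")
    return Path(base) / app_name
```

The function only computes the path; creating the folder (for example with `Path.mkdir(parents=True, exist_ok=True)`) is left to the caller.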
Getting Started
Install the package using pip from the GitHub repository:
pip install git+https://github.com/VIDA-NYU/reference-data-repository.git
This repository contains an example notebook that demonstrates the basic features of the package.
The package also includes a simple command line interface, refdata, that can be used to list the contents of the repository index and to interact with the local data store.
Usage: refdata [OPTIONS] COMMAND [ARGS]...

  Command line interface for the Reference Data Repository.

Options:
  --help  Show this message and exit.

Commands:
  checksum  Print file checksum.
  index     Data Repository Index.
    list      List repository index content.
    show      Show dataset descriptor from repository index.
    validate  Validate repository index file.
  store     Local Data Store.
    download  List local store content.
    list      List local store content.
    remove    Remove dataset from local store.
    show      Show descriptor for downloaded dataset.