Downloads and caches files for knowledge graph ETL

KG-Hub Downloader

Overview

This is a configuration-based, file-caching downloader with initial support for HTTP requests and queries against Elasticsearch.

Installation

KG-Hub Downloader can be installed via pip:

pip install kghub-downloader

Configure

The downloader requires a YAML file that contains a list of target URLs to download and the local names under which to save those downloads.
For an example, see example/download.yaml.

Available options are:

  • *url: The URL to download from. Currently supported:
    • http(s)
    • Google Cloud Storage (gs://)
    • Google Drive (gdrive:// or https://drive.google.com/...). The file must be publicly accessible.
    • Amazon AWS S3 bucket (s3://)
  • local_name: The name to save the file as locally
  • tag: A tag to use to filter downloads
  • api: The API to use to download the file. Currently supported: elasticsearch
  • Elasticsearch options
    • query_file: The file containing the query to run against the index
    • index: The Elasticsearch index to query
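
As a rough illustration of the URL schemes listed above, the following sketch maps a URL to the kind of source it points at. It is not the library's own code: `SUPPORTED_SCHEMES` and `classify_url` are hypothetical names introduced here, and the sketch only assumes the schemes named in the list.

```python
from urllib.parse import urlparse

# Hypothetical lookup table built from the schemes listed above.
SUPPORTED_SCHEMES = {
    "http": "http(s)",
    "https": "http(s)",
    "gs": "Google Cloud Storage",
    "gdrive": "Google Drive",
    "s3": "Amazon AWS S3",
}


def classify_url(url: str) -> str:
    """Return a label for the source type of `url` (illustrative helper)."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    # Google Drive links may also be given as ordinary https URLs.
    if scheme == "https" and parsed.netloc.lower() == "drive.google.com":
        return "Google Drive"
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    return SUPPORTED_SCHEMES[scheme]
```

A check like this can be run over every `url` entry in a download.yaml before starting a long download, so that an unsupported scheme is caught early.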

* Note:
Google Cloud Storage URLs require that you have set up your credentials as described here. You must:

Mirroring local files to an Amazon AWS S3 bucket requires the following:

You can also include secrets, such as API keys, that you have set as environment variables by writing {VARIABLE_NAME} in the URL. For example:

---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json

Note: YOUR_SECRET MUST be set as an environment variable, and be sure to include the {curly braces} in the URL string.
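
The placeholder substitution described above can be pictured with the sketch below. It is an assumption about the general mechanism, not the downloader's actual implementation; `expand_secrets` is a hypothetical helper that uses Python's standard `str.format_map`.

```python
import os


def expand_secrets(url: str, env=None) -> str:
    """Replace {VARIABLE_NAME} placeholders in `url` with values from `env`.

    Hypothetical helper: `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    try:
        return url.format_map(env)
    except KeyError as missing:
        # The placeholder names a variable that has not been set.
        raise ValueError(f"Environment variable {missing} is not set") from None
```

Calling `expand_secrets("https://example.com/myfancyfile.json?key={YOUR_SECRET}")` with `YOUR_SECRET` exported in the shell would return the URL with the key filled in; without it, the sketch raises an error rather than sending a literal `{YOUR_SECRET}` to the server.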

Usage

The downloader can be used directly in Python or from the command line.

In Python

from kghub_downloader.download_utils import download_from_yaml

download_from_yaml(yaml_file="download.yaml", output_dir="data")

Command Line

$ downloader [OPTIONS] ARGS

╰ Download files listed in a download.yaml file

OPTIONS
yaml_file A string pointing to the download.yaml file, to be parsed for things to download.
Defaults to ./download.yaml
ignore_cache Ignore cache and download files even if they exist [false]
snippet_only Downloads only the first 5 kB of each uncompressed source, for testing and file checks
tags Limit to only downloads with this tag
mirror Remote storage URL to mirror download to. Supported buckets: Google Cloud Storage
ARGUMENTS
output_dir A string pointing to where to write out downloaded files.
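
To make the effect of `tags` and `ignore_cache` concrete, here is a minimal sketch of the selection logic those options imply. It is not taken from the library: `select_downloads` is a hypothetical function, and it assumes every entry carries a `local_name` and that a file already present in the output directory counts as cached.

```python
from pathlib import Path


def select_downloads(items, output_dir, tags=None, ignore_cache=False):
    """Return the entries that would still be fetched (illustrative only)."""
    selected = []
    for item in items:
        # With tags given, keep only entries whose tag is among them.
        if tags and item.get("tag") not in tags:
            continue
        target = Path(output_dir) / item["local_name"]
        # Skip files that already exist unless the cache is ignored.
        if target.exists() and not ignore_cache:
            continue
        selected.append(item)
    return selected
```

Under these assumptions, a second run against the same output directory selects nothing unless `ignore_cache` is set.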

Examples:

$ downloader --output_dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output_dir example_output --mirror gs://your-bucket/desired/directory

# Note that if your YAML file is named `download.yaml`, 
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output

Development

Install

git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install

Run tests

poetry run pytest

NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.

Download files

Source Distribution

kghub_downloader-0.3.5.tar.gz (6.2 kB)

Uploaded Source

Built Distribution

kghub_downloader-0.3.5-py3-none-any.whl (7.3 kB)

Uploaded Python 3

File details

Details for the file kghub_downloader-0.3.5.tar.gz.

File metadata

  • Download URL: kghub_downloader-0.3.5.tar.gz
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for kghub_downloader-0.3.5.tar.gz
Algorithm Hash digest
SHA256 0fa0bad8446dc22b907eb19b4fe5fabd381d1528c74971f8c830449af36be152
MD5 990b6f96514fd080e89111a14aa04d22
BLAKE2b-256 7e51be826ac10a88ba99919af806fb1143153ee369af5d102af91e23d44f3e9f

File details

Details for the file kghub_downloader-0.3.5-py3-none-any.whl.

File hashes

Hashes for kghub_downloader-0.3.5-py3-none-any.whl
Algorithm Hash digest
SHA256 f1712117f1d88cf54fd6cb18dd4cf3c5f90e6d92fad27e9fa2e2e0f717254a35
MD5 a8f176385fb28eb119a745beb0fc085e
BLAKE2b-256 8ef86d332ea758d120090e0b3169904bd8ba14573732bd509e67f0f51aeb085f
