Downloads and caches files for knowledge graph ETL

KG-Hub Downloader

Overview

KG-Hub Downloader is a configuration-based, file-caching downloader with initial support for HTTP requests and queries against Elasticsearch.

Installation

KG-Hub Downloader can be installed via pip:

pip install kghub-downloader

Configure

The downloader requires a YAML file that lists the target URLs to download and the local names under which to save them.
For an example, see example/download.yaml.

Available options are:

- url (*): The URL to download from. Currently supported:
  - http(s)
  - ftp
    - with the glob: option to download files with specific extensions (currently FTP only; searches recursively)
  - Google Cloud Storage (gs://)
  - Google Drive (gdrive:// or https://drive.google.com/...). The file must be publicly accessible.
  - Amazon AWS S3 bucket (s3://)
- local_name: The name to save the file as locally
- tag: A tag to use to filter downloads
- api: The API to use to download the file. Currently supported: elasticsearch
- Elasticsearch options:
  - query_file: The file containing the query to run against the index
  - index: The Elasticsearch index to query
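
To make the option list more concrete, here is a minimal sketch of how entries using these keys might look once the YAML has been parsed, and how a tag filter could select among them. The URLs, tags, and the helper `filter_by_tags` are hypothetical and only illustrate the idea; this is not the library's own implementation.

```python
from typing import Dict, Iterable, List, Optional

# Entries as they might look after parsing a download.yaml file.
# All URLs, file names, and tags here are made up for illustration.
entries: List[Dict[str, str]] = [
    {"url": "https://example.com/genes.tsv", "local_name": "genes.tsv", "tag": "genes"},
    {"url": "gs://example-bucket/phenotypes.tsv", "local_name": "phenotypes.tsv", "tag": "phenotypes"},
    {"url": "s3://example-bucket/notes.txt", "local_name": "notes.txt"},  # no tag
]


def filter_by_tags(
    items: Iterable[Dict[str, str]], tags: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Hypothetical helper: keep only the items whose tag is in `tags`.

    When no tags are given, every item is kept.
    """
    if not tags:
        return list(items)
    wanted = set(tags)
    return [item for item in items if item.get("tag") in wanted]


selected = filter_by_tags(entries, ["genes"])
print([item["local_name"] for item in selected])
```

Untagged entries are dropped as soon as any tag is requested in this sketch; whether the downloader itself behaves the same way is an assumption.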

* Note:
Google Cloud Storage URLs require that you have set up your Google Cloud credentials. You must:

- create a service account,
- add the service account to the relevant bucket, and
- download a JSON key for that service account.

Then, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that file.
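
Before starting a run that uses gs:// URLs, it can help to confirm that the variable is set and points to a real file. The following is a small sketch of such a check; the helper name `gcs_credentials_problem` is hypothetical and not part of the downloader.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def gcs_credentials_problem(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: describe what is wrong with the credentials setup.

    Returns None when GOOGLE_APPLICATION_CREDENTIALS names an existing file.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not Path(key_path).is_file():
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {key_path}"
    return None


problem = gcs_credentials_problem()
print(problem or "Credentials file found")
```

This only checks that the key file exists; it does not verify that the service account has access to the bucket.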

Mirroring local files to an Amazon AWS S3 bucket requires the following:

- Create an AWS account.
- Create an IAM user in AWS. This provides the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY needed for authentication; store both as environment variables on your system.
- Create an S3 bucket. This will be the destination for the pushed local files.
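
A quick way to confirm that both AWS variables are available before mirroring is sketched below. The helper `missing_aws_vars` is hypothetical; it only inspects the environment and does not contact AWS.

```python
import os
from typing import List, Mapping, Optional

# The two variable names mentioned above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Hypothetical helper: list required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
if missing:
    print("Set these before mirroring to S3:", ", ".join(missing))
else:
    print("AWS credentials are present in the environment")
```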

You can also include any secrets like API keys you have set as environment variables using {VARIABLE_NAME}, for example:

---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json

Note: YOUR_SECRET MUST be set as an environment variable, and be sure to include the {curly braces} in the URL string.
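
The placeholder substitution described above can be sketched as follows. This is an illustration under assumptions, not the downloader's actual code: the helper `expand_env_placeholders` and the sample value `abc123` are hypothetical.

```python
import os
import re
from typing import Mapping, Optional

# Matches {NAME}, where NAME looks like an environment variable name.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_placeholders(url: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: replace each {NAME} in `url` with env[NAME].

    Raises KeyError when a referenced variable is not set.
    """
    env = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, url)


# A stand-in environment so the example does not depend on your shell.
example_env = {"YOUR_SECRET": "abc123"}
print(expand_env_placeholders("https://example.com/myfancyfile.json?key={YOUR_SECRET}", example_env))
```

Failing loudly on an unset variable, as this sketch does, avoids sending a request with a literal `{YOUR_SECRET}` in the URL.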

Usage

The downloader can be used directly in Python or from the command line.

In Python

from kghub_downloader.download_utils import download_from_yaml

download_from_yaml(yaml_file="download.yaml", output_dir="data")
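
For a slightly fuller picture, the sketch below writes a one-entry configuration file and wraps the call shown above in a function. The URL, file names, tag, and the helpers `write_config` and `run_download` are placeholders chosen for illustration; only the `download_from_yaml(yaml_file=..., output_dir=...)` call comes from the example above.

```python
from pathlib import Path

# Hypothetical single-entry configuration; the URL is a placeholder.
CONFIG_TEXT = """\
---
- url: "https://example.com/myfancyfile.json"
  local_name: myfancyfile.json
  tag: example
"""


def write_config(path: str) -> Path:
    """Write the example configuration to `path` and return it as a Path."""
    config_path = Path(path)
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    return config_path


def run_download(config_path: Path, output_dir: str) -> None:
    """Run the downloader with the two arguments used in the example above."""
    # Imported here so the rest of the sketch works without the package installed.
    from kghub_downloader.download_utils import download_from_yaml

    download_from_yaml(yaml_file=str(config_path), output_dir=output_dir)


# Usage (needs kghub-downloader installed and network access):
# run_download(write_config("download.yaml"), "data")
```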

Command Line

To download files listed in a download.yaml file:

$ downloader [OPTIONS] ARGS

OPTIONS
output_dir	Where to save downloaded files.
ignore_cache	Ignore cache and download files even if they exist (Default `False`)
snippet_only	Downloads only the first 5 kB of each uncompressed source, for testing and file checks
tags	Limit to only downloads with this tag
mirror	Remote storage URL to upload downloaded files to. Supported buckets: Google Cloud Storage

ARGUMENTS
yaml_file	Path to the download.yaml file, to be parsed for things to download. Defaults to `./download.yaml`

Examples:

$ downloader --output_dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output_dir example_output --mirror gs://your-bucket/desired/directory

# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output
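
The effect of the ignore_cache and snippet_only options can be pictured with the following sketch. It is an assumption-laden illustration, not the downloader's code: the helpers `needs_download` and `take_snippet` are hypothetical, and "5 kB" is read here as 5 * 1024 bytes.

```python
from pathlib import Path

# Assumption: "5 kB" is interpreted as 5 * 1024 bytes in this sketch.
SNIPPET_BYTES = 5 * 1024


def needs_download(target: str, ignore_cache: bool = False) -> bool:
    """Hypothetical helper: fetch when the cache is ignored or the file is absent."""
    return ignore_cache or not Path(target).exists()


def take_snippet(data: bytes, limit: int = SNIPPET_BYTES) -> bytes:
    """Hypothetical helper: keep only the first `limit` bytes of a payload."""
    return data[:limit]


# A file that already exists is skipped unless the cache is ignored.
print(needs_download("."))                     # "." always exists
print(needs_download(".", ignore_cache=True))
print(len(take_snippet(b"x" * 10_000)))
```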

Development

Install

git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install

Run tests

poetry run pytest

NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.

Release history

Version	Release date
0.3.8	Apr 24, 2024 (this version)
0.3.7	Mar 22, 2024
0.3.6	Mar 20, 2024
0.3.5	Feb 9, 2024
0.3.4	Oct 10, 2023
0.3.3	Nov 30, 2022
0.3.2	Nov 3, 2022
0.3.1	Oct 7, 2022
0.3.0	Oct 3, 2022
0.1.14	Mar 29, 2022
0.1.13	Mar 7, 2022
0.1.12	Feb 24, 2022
0.1.11	Feb 24, 2022
0.1.10	Feb 24, 2022
0.1.9	Feb 24, 2022
0.1.8	Feb 24, 2022
0.1.7	Feb 24, 2022
0.1.6	Feb 15, 2022
0.1.5	Jan 26, 2022
0.1.4	Jan 20, 2022
0.1.3	Dec 20, 2021
0.1.2	Dec 20, 2021
0.1.1	Dec 8, 2021
0.1.0	Dec 8, 2021

Download files

Source Distribution

kghub_downloader-0.3.8.tar.gz (7.1 kB), uploaded Apr 24, 2024

Built Distribution

kghub_downloader-0.3.8-py3-none-any.whl (8.1 kB, Python 3), uploaded Apr 24, 2024

Hashes for kghub_downloader-0.3.8.tar.gz
Algorithm	Hash digest
SHA256	`a77dd62b1a00ca0d5d6ea81dbe983c4293bb71e6a9c73c6174cb309a771fc9ce`
MD5	`8914f5d967d0dbefe66dbcd39219785c`
BLAKE2b-256	`92a5a23732acfd7da91de1f4a4288c73977930189c8e3cd6b326bef42de218ac`

Hashes for kghub_downloader-0.3.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8a2b6f9502d84caf700e8736f06cb82f0dfd0e06ab7ee1ee63f9e37acf90ed6`
MD5	`4874438e8d9648ecaf842c45bd8e8a67`
BLAKE2b-256	`1335ea5a1fcd687c835e4d6cf7b0191c809adf329cbe2b874d323e4f307cddaa`