Downloads and caches files for knowledge graph ETL
KG-Hub Downloader
Overview
This is a configuration-based, file-caching downloader with initial support for HTTP requests and queries against Elasticsearch.
Installation
KG-Hub Downloader can be installed with pip:
pip install kghub-downloader
Configure
The downloader requires a YAML file that lists the target URLs to download and the local names under which to save them.
For an example, see example/download.yaml
Available options are:
- *url: The URL to download from. Currently supported:
  - http(s)
  - ftp
    - with the glob: option, downloads only files with specific extensions (FTP only as of now; searches recursively)
  - Google Cloud Storage (gs://)
  - Google Drive (gdrive:// or https://drive.google.com/...). The file must be publicly accessible.
  - Amazon AWS S3 bucket (s3://)
  - GitHub Release Assets (git://RepositoryOwner/RepositoryName)
- local_name: The name to save the file as locally
- tag: A tag to use to filter downloads
- api: The API to use to download the file. Currently supported:
  - elasticsearch
- Elasticsearch options:
  - query_file: The file containing the query to run against the index
  - index: The Elasticsearch index to query
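As a rough illustration of how these options fit together, the sketch below checks entries as they might look once the YAML file has been parsed into Python dictionaries. The helper `check_entry`, the `SUPPORTED_SCHEMES` set, and the rule that an elasticsearch entry needs both `query_file` and `index` are assumptions made for this example; they are not part of kghub-downloader.

```python
from urllib.parse import urlparse

# URL schemes taken from the option list above. The set and the helper
# below are illustrative only, not part of kghub-downloader.
SUPPORTED_SCHEMES = {"http", "https", "ftp", "gs", "gdrive", "s3", "git"}


def check_entry(entry):
    """Return a list of human-readable problems for one download entry."""
    problems = []
    url = entry.get("url")
    if not url:
        problems.append("missing 'url'")
    else:
        scheme = urlparse(url).scheme
        if scheme not in SUPPORTED_SCHEMES:
            problems.append(f"unsupported scheme: {scheme!r}")
    # Assumption: an elasticsearch entry needs both of its options.
    if entry.get("api") == "elasticsearch":
        for key in ("query_file", "index"):
            if key not in entry:
                problems.append(f"missing {key!r} for elasticsearch")
    return problems


# Entries as they might look after the YAML file is parsed into Python.
entries = [
    {"url": "https://example.com/data.tsv", "local_name": "data.tsv", "tag": "demo"},
    {"url": "gs://some-bucket/data.tsv", "local_name": "gcs_data.tsv"},
    {"url": "rsync://example.com/data.tsv", "local_name": "bad.tsv"},
]

for entry in entries:
    print(entry["local_name"], check_entry(entry))
```

A check like this can catch a mistyped scheme before any network request is made.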
* Note:
Google Cloud Storage URLs require that you have set up your Google Cloud credentials. You must:
- create a service account,
- add the service account to the relevant bucket, and
- download a JSON key for that service account.
Then, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that file.
Mirroring local files to an Amazon AWS S3 bucket requires the following:
- Create an AWS account.
- Create an IAM user in AWS: This provides the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY needed for authentication. Both should be stored as environment variables on the user's system.
- Create an S3 bucket: This will be the destination for pushing local files.
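Before running a download or mirror job, it can help to confirm that the environment variables named above are present. The following is a minimal sketch; the helper `missing_credentials` and the mapping from URL scheme to variable names are assumptions for illustration and are not provided by kghub-downloader.

```python
import os

# Variable names come from the notes above; the grouping by URL scheme
# and the helper itself are assumptions made for this sketch.
REQUIRED_ENV = {
    "gs": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "s3": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
}


def missing_credentials(scheme, env=None):
    """Return the required variables for a scheme that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV.get(scheme, []) if not env.get(name)]


for scheme in ("gs", "s3"):
    missing = missing_credentials(scheme)
    if missing:
        print(f"{scheme}://: please set {', '.join(missing)}")
    else:
        print(f"{scheme}://: credentials variables are set")
```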
You can also include secrets, such as API keys, that you have set as environment variables by writing {VARIABLE_NAME}, for example:
---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  local_name: myfancyfile.json
Note: YOUR_SECRET MUST be set as an environment variable, and be sure to include the {curly braces} in the URL string.
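The substitution described above can be pictured with a short sketch. This is not the library's implementation: the helper `fill_placeholders` and its regular expression are hypothetical, and the `env` parameter exists only so the example runs without touching the real environment.

```python
import os
import re

# Matches {NAME} where NAME looks like an environment variable name.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_placeholders(url, env=None):
    """Replace each {NAME} in url with the value of NAME from env."""
    env = os.environ if env is None else env

    def lookup(match):
        name = match.group(1)
        if name not in env:
            # Fail loudly instead of sending a literal "{NAME}" to the server.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return PLACEHOLDER.sub(lookup, url)


url = "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
print(fill_placeholders(url, env={"YOUR_SECRET": "abc123"}))
```

Raising an error for an unset variable is a design choice of this sketch; it makes a missing secret visible before a request is sent.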
Usage
The downloader can be used directly in Python or from the command line.
In Python
from kghub_downloader.download_utils import download_from_yaml
download_from_yaml(yaml_file="download.yaml", output_dir="data")
Command Line
To download files listed in a download.yaml file:
$ downloader [OPTIONS] ARGS
OPTIONS | |
---|---|
yaml_file | Path to the download.yaml file, to be parsed for things to download. Defaults to ./download.yaml |
ignore_cache | Ignore cache and download files even if they exist (Default False ) |
snippet_only | Downloads only the first 5 kB of each uncompressed source, for testing and file checks |
tags | Limit to only downloads with this tag |
mirror | Remote storage URL to upload downloaded files to. Supported buckets: Google Cloud Storage |
ARGUMENTS | |
---|---|
output_dir | Where to save downloaded files. |
Examples:
$ downloader --output_dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output_dir example_output --mirror gs://your-bucket/desired/directory
# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output
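The effect of the tags and ignore_cache options can be sketched as a selection step that runs before anything is fetched. The function `select_downloads` below is hypothetical and only models the behavior described in the option table (skip files that already exist unless the cache is ignored, and keep only entries with a requested tag); the actual downloader may decide differently.

```python
import os
import tempfile


def select_downloads(entries, output_dir, tags=None, ignore_cache=False):
    """Return the entries that would actually be fetched."""
    selected = []
    for entry in entries:
        # Tag filter: keep only entries whose tag was requested.
        if tags and entry.get("tag") not in tags:
            continue
        # Cache check: skip files already present unless told otherwise.
        target = os.path.join(output_dir, entry["local_name"])
        if os.path.exists(target) and not ignore_cache:
            continue
        selected.append(entry)
    return selected


entries = [
    {"url": "https://example.com/a.tsv", "local_name": "a.tsv",
     "tag": "zfin_gene_to_phenotype"},
    {"url": "https://example.com/b.tsv", "local_name": "b.tsv", "tag": "other"},
]

with tempfile.TemporaryDirectory() as output_dir:
    # Pretend a.tsv was downloaded on an earlier run.
    open(os.path.join(output_dir, "a.tsv"), "w").close()
    for kwargs in ({}, {"ignore_cache": True}, {"tags": ["other"]}):
        names = [e["local_name"] for e in select_downloads(entries, output_dir, **kwargs)]
        print(kwargs, names)
```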
Development
Install
git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install
Run tests
poetry run pytest
NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch GitHub Actions service account.
Hashes for kghub_downloader-0.3.9-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | d8edfc0e7d753e38b866bf9c71ed57c96eb1682d1f67d02b7a1932fe1de18841 |
MD5 | 9595e1e12ceae6ef326666cbfaf5fa42 |
BLAKE2b-256 | 5ec3b3fae8233039f6dd5b0b511ec313f5f530b9e66d558fe6fe818d446f4792 |