Open Targets python genetics utility CLI tools
gentroutils
A set of command-line interface tools for processing Open Targets Genetics GWAS data.
Installation
pip install gentroutils
Available commands
To see all available commands after installation, run:
gentroutils --help
Usage
To run a single step, use one of the following:
uv run gentroutils -s gwas_catalog_release # After cloning the repository
gentroutils -s gwas_catalog_release -c otter_config.yaml # When installed by pip
The gentroutils repository uses the otter framework to build the set of tasks to run. The current implementation of the tasks can be found in the config.yaml file in the root of the repository. To run gentroutils installed via pip, you need to provide your own otter config modeled on that config.yaml file.
Example config
For the top level fields refer to the otter documentation
[!NOTE] All `destination_template` values must point to Google Cloud Storage (GCS) bucket objects. All `source_template` values must point to FTP server paths. If this is not enforced, the user may experience silent failures.
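Because a wrong scheme may fail silently, it can help to check the templates before a run. The sketch below is not part of gentroutils or otter: `find_template_errors` and the `bad task` entry are hypothetical, and it assumes the tasks of a step have already been loaded into a list of dicts.

```python
from urllib.parse import urlparse

# Expected URI scheme per template field, as stated in the note above.
EXPECTED_SCHEMES = {"destination_template": "gs", "source_template": "ftp"}


def find_template_errors(tasks):
    """Return a list of human-readable problems found in the task dicts."""
    errors = []
    for task in tasks:
        for field, scheme in EXPECTED_SCHEMES.items():
            value = task.get(field)
            if value is None:
                continue  # not every task defines every template field
            if urlparse(value).scheme != scheme:
                name = task.get("name", "<unnamed>")
                errors.append(f"{name}: {field} must start with {scheme}://")
    return errors


tasks = [
    {
        "name": "fetch studies",
        "source_template": "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt",
        "destination_template": "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv",
    },
    # Hypothetical misconfigured task: a local path instead of a GCS object.
    {"name": "bad task", "destination_template": "/tmp/{release_date}/stats.json"},
]

for problem in find_template_errors(tasks):
    print(problem)
```

A check like this only looks at the URI scheme; it does not verify that the bucket or FTP path actually exists.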
---
work_path: ./work
log_level: DEBUG
scratchpad:
steps:
  gwas_catalog_release:
    - name: crawl release metadata
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
      promote: "true"
    - name: fetch associations
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
      promote: true
    - name: fetch studies
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
      promote: true
    - name: fetch ancestries
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
      promote: true
    - name: curation study
      requires:
        - fetch studies
      previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
      studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
      destination_template: gs://gwas_catalog_inputs/gentroutils/curation/{release_date}/GWAS_Catalog_study_curation.tsv
      summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
      promote: true
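The config above defines the tasks of the gwas_catalog_release step. The otter framework runs them in parallel, except where a task declares `requires`; such a task waits until the tasks it lists have completed.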
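One way to picture the resulting execution order is to group the tasks into "waves" that could run concurrently. The sketch below uses Python's standard `graphlib` module on the `requires` entries from the example config; it only illustrates the dependency structure and is not how otter schedules tasks internally. The `execution_waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# Task names and `requires` entries copied from the example config.
requires = {
    "crawl release metadata": [],
    "fetch associations": [],
    "fetch studies": [],
    "fetch ancestries": [],
    "curation study": ["fetch studies"],
}


def execution_waves(requires):
    """Group tasks into successive waves whose members have no pending dependencies."""
    sorter = TopologicalSorter(requires)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(requires), start=1):
    print(number, wave)
```

With this config the four crawl and fetch tasks land in the first wave and `curation study` in the second.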
Available tasks
The tasks (defined in the config.yaml file) that can be run are:
Crawl release metadata
- name: crawl release metadata
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
  promote: "true"
This task fetches the latest GWAS Catalog release metadata from the https://www.ebi.ac.uk/gwas/api/search/stats endpoint and saves it to the specified destination.
[!NOTE] Task parameters
- The `stats_uri` is used to fetch the latest release date and other metadata.
- The `destination_template` is where the metadata will be saved; it uses the `{release_date}` placeholder to fill in the release date dynamically. By default, the release date is read directly from the JSON output of `stats_uri`.
- The `promote` field is set to `true`, which means the output will be promoted to the latest release: the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/stats.json` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
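The relationship between the dated path and the promoted path can be sketched with plain string formatting. This is an illustration under assumptions, not the gentroutils implementation: the helper names are hypothetical, and the release date value `2025-01-01` is made up (the actual date format comes from the `stats_uri` response).

```python
def render_destination(template, release_date):
    """Fill the {release_date} placeholder with a concrete release date."""
    return template.format(release_date=release_date)


def promoted_destination(template):
    """Path of the promoted copy: the same template with 'latest' as the release."""
    return template.format(release_date="latest")


template = "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"

# '2025-01-01' is a made-up release date used only for illustration.
print(render_destination(template, "2025-01-01"))
print(promoted_destination(template))
```

The second path matches the promoted location named in the note, `gs://gwas_catalog_inputs/gentroutils/latest/stats.json`.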
Fetch associations
- name: fetch associations
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
  promote: true
This task fetches the GWAS Catalog associations file from the specified FTP server and saves it to the specified destination.
[!NOTE] Task parameters
- The `stats_uri` is used to fetch the latest release date and other metadata.
- The `source_template` is the URL of the GWAS Catalog associations file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
- The `destination_template` is where the associations file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
- The `promote` field is set to `true`, which means the output will be promoted to the latest release: the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_associations_ontology_annotated.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
Fetch studies
- name: fetch studies
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
  promote: true
This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.
[!NOTE] Task parameters
- The `stats_uri` is used to fetch the latest release date and other metadata.
- The `source_template` is the URL of the GWAS Catalog studies file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
- The `destination_template` is where the studies file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
- The `promote` field is set to `true`, which means the output will be promoted to the latest release: the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
Fetch ancestries
- name: fetch ancestries
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
  promote: true
This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.
[!NOTE] Task parameters
- The `stats_uri` is used to fetch the latest release date and other metadata.
- The `source_template` is the URL of the GWAS Catalog ancestries file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
- The `destination_template` is where the ancestries file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
- The `promote` field is set to `true`, which means the output will be promoted to the latest release: the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_ancestries.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
Curation
- name: curation study
  requires:
    - fetch studies
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
  summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
This task builds the GWAS Catalog curation file that is later used as a template for manual curation. It requires the fetch studies task to be completed before it can run, because the curation file is built from the list of studies in the download studies file.
[!NOTE] Task parameters
- The `requires` field specifies that this task depends on the `fetch studies` task, meaning it will only run after the studies have been fetched.
- The `previous_curation` field specifies the path to the previous curation file. The new curation file is built on top of the previous one.
- The `studies` field is the path to the studies file that was fetched in the `fetch studies` task. This file is used to build the curation file.
- The `destination_template` is where the curation file will be saved, and it uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
- The `summary_statistics_glob` field specifies the glob pattern used to list all synced summary statistics files in GCS. This is used to identify which studies have summary statistics available.
- The `promote` field is set to `true`, which means the output will be promoted to the latest release: the file will be saved under `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
Curation process
The basis of the curation process for GWAS Catalog data is described in docs/gwas_catalog_curation.md. The original solution used an R script to prepare the data, which was then curated manually. The curation task automates the data preparation and provides a template for manual curation. Manual curation itself is still required; only the data preparation is automated.
The automated process includes:
- Reading the `download studies` file with the list of studies currently coming from the latest GWAS Catalog release.
- Reading the `previous curation` file that contains the list of curated studies from the previous release.
- Listing all synced summary statistics files matching the `summary_statistics_glob` parameter to identify which studies have summary statistics available. Note that this list can be longer than the list of studies in the `download studies` file, as syncing also covers unpublished studies.
- Comparing the three datasets with the following logic:
  - If the study is present in both the `previous curation` and `download studies`, it is marked as `curated`.
  - If the study is present in `download studies` but not in the `previous curation`, it is marked as `to_curate` or `has_no_sumstats`, depending on the presence of summary statistics files.
  - If the study is present in the `previous curation` but not in `download studies`, it is marked as `removed`.
- The output of the curation process is a file that contains the list of studies with their status (`curated`, `to_curate`, `has_no_sumstats`, `removed`) and the fields that are required for manual curation. The output file is saved to the `destination_template` path specified in the task configuration, that is, under `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv`.
- The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
- Manual curation is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. This step is not automated and requires manual intervention. The output of manual curation should then be saved to both `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv`. The curated file is then used by the Open Targets Staging Dags.
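The comparison logic described above can be sketched with plain sets of study identifiers. This is a minimal illustration under assumptions, not the gentroutils implementation: the `curation_status` helper is hypothetical, the study IDs are made up, and the real task works on full tables rather than bare ID sets.

```python
def curation_status(study_id, previous_curation, download_studies, sumstats):
    """Return the status of one study, following the comparison rules above.

    Each of the last three arguments is a set of study identifiers.
    """
    in_previous = study_id in previous_curation
    in_download = study_id in download_studies
    if in_previous and in_download:
        return "curated"
    if in_download:
        # New in this release: status depends on summary statistics presence.
        return "to_curate" if study_id in sumstats else "has_no_sumstats"
    if in_previous:
        return "removed"
    # Seen only among synced summary statistics (e.g. unpublished study);
    # the rules above do not assign a status, so none is returned here.
    return None


# Made-up identifiers, for illustration only.
previous_curation = {"GCST000001", "GCST000002"}
download_studies = {"GCST000001", "GCST000003", "GCST000004"}
sumstats = {"GCST000001", "GCST000003", "GCST000099"}

for study_id in sorted(previous_curation | download_studies):
    print(study_id, curation_status(study_id, previous_curation, download_studies, sumstats))
```

In this example `GCST000099` has synced summary statistics but appears in neither input file, so it never enters the output.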
Contribute
To be able to contribute to the project you need to set it up. This project runs on:
- Python 3.13
- uv (dependency manager)
To set up the project run
make dev
The command installs the above dependencies if they are not present (the only initial requirements are curl and bash) and
then installs all Python dependencies listed in pyproject.toml. Finally, it installs the pre-commit hooks
that must run before a commit is created.
The project has additional dev dependencies, which include the packages used for testing.
All dev dependencies are installed automatically by uv.
To see all available dev commands
Run the following command:
make help
Manual testing of CLI module
To check CLI execution manually you need to run
uv run gentroutils
This software was developed as part of the Open Targets project. For more information please see: http://www.opentargets.org