Open Targets python genetics utility CLI tools

gentroutils

A set of command-line interface (CLI) tools for processing Open Targets Genetics GWAS data.

Installation

pip install gentroutils

Available commands

To see all available commands after installation, run:

gentroutils --help

Usage

To run a single step, run:

uv run gentroutils -s gwas_catalog_release  # After cloning the repository
gentroutils -s gwas_catalog_release -c otter_config.yaml # When installed by pip

The gentroutils repository uses the otter framework to build the set of tasks to run. The current task definitions are in the config.yaml file in the root of the repository. To run gentroutils installed via pip, you need to write your own otter config modelled on that config.yaml file and pass it with the -c option.

Example config

For the top-level fields, refer to the otter documentation.

[!NOTE] Every destination_template must point to a Google Cloud Storage (GCS) bucket object, and every source_template must point to a path on the FTP server. If a template points elsewhere, the task may fail silently.

---
work_path: ./work
log_level: DEBUG
scratchpad:
steps:
  gwas_catalog_release:
    - name: crawl release metadata
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
      promote: "true"
    - name: fetch associations
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
      promote: true
    - name: fetch studies
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
      promote: true
    - name: fetch ancestries
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
      promote: true
    - name: curation study
      requires:
        - fetch studies
      previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
      studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
      destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
      summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
      promote: true

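The config above defines the tasks of the gwas_catalog_release step. The otter framework runs them in parallel, except that a task with a requires field (here, curation study) waits for the tasks it lists.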
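
Because a misplaced template may fail silently, it can help to check the scheme rule from the note above before running a step. The following is a minimal sketch, not part of gentroutils: the check_templates function is hypothetical, and the config is written as a plain Python dict that mirrors the YAML structure.

```python
def check_templates(config: dict) -> list[str]:
    """Return a list of human-readable problems found in the task templates."""
    # Expected URI scheme per template field, as stated in the note above.
    expected = {"destination_template": "gs://", "source_template": "ftp://"}
    problems = []
    for step_name, tasks in config.get("steps", {}).items():
        for task in tasks:
            for field, prefix in expected.items():
                value = task.get(field)
                # Fields that a task does not define are simply skipped.
                if value is not None and not value.startswith(prefix):
                    problems.append(
                        f"{step_name} / {task.get('name')}: "
                        f"{field} should start with {prefix}"
                    )
    return problems


# Hypothetical config with a destination that is wrong on purpose.
config = {
    "steps": {
        "gwas_catalog_release": [
            {
                "name": "fetch studies",
                "source_template": "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt",
                "destination_template": "/tmp/{release_date}/studies.tsv",
            }
        ]
    }
}

for problem in check_templates(config):
    print(problem)
```

An empty list means every template that is present uses the expected scheme.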

Available tasks

The following tasks (defined in the config.yaml file) can be run:

Crawl release metadata

    - name: crawl release metadata
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
      promote: "true"

This task fetches the latest GWAS Catalog release metadata from the https://www.ebi.ac.uk/gwas/api/search/stats endpoint and saves it to the specified destination.

[!NOTE] Task parameters

  • The stats_uri is used to fetch the latest release date and other metadata.
  • The destination_template is where the metadata will be saved. Its {release_date} placeholder is filled in dynamically with the release date read from the JSON returned by stats_uri.
  • The promote field is set to true, which means the output is promoted to the latest release: after the task completes, the file is also saved under gs://gwas_catalog_inputs/gentroutils/latest/stats.json. If promote is set to false, the file is not promoted and is saved only under the specified path with the release date.
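
The interaction between the {release_date} placeholder and promote can be illustrated with a short sketch. This is not the gentroutils implementation: render and output_paths are hypothetical helpers, the release date value is made up, and the assumption that promotion substitutes latest for the release date is inferred from the paths quoted above.

```python
def render(template: str, release_date: str) -> str:
    """Fill the {release_date} placeholder in a template."""
    return template.replace("{release_date}", release_date)


def output_paths(template: str, release_date: str, promote: bool) -> list[str]:
    """Return the dated path, plus the 'latest' path when promote is enabled."""
    paths = [render(template, release_date)]
    if promote:
        # Assumption: promotion writes the same file with 'latest'
        # in place of the release date.
        paths.append(render(template, "latest"))
    return paths


template = "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"

# "2025-01-31" is a made-up value; the real format comes from stats_uri.
for path in output_paths(template, "2025-01-31", promote=True):
    print(path)
```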

Fetch associations

    - name: fetch associations
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
      promote: true

This task fetches the GWAS Catalog associations file from the specified FTP server and saves it to the specified destination.

[!NOTE] Task parameters

  • The stats_uri is used to fetch the latest release date and other metadata.
  • The source_template is the URL of the GWAS Catalog associations file, which uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
  • The destination_template is where the associations file will be saved, and it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
  • The promote field is set to true, which means the output is promoted to the latest release: after the task completes, the file is also saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_associations_ontology_annotated.tsv. If promote is set to false, the file is not promoted and is saved only under the specified path with the release date.
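
To see what a rendered source_template looks like to an FTP client, the sketch below splits it into host, directory and file name using only the standard library. The split_ftp_url helper is hypothetical, and the release date value is made up for illustration; gentroutils may resolve the URL differently.

```python
from urllib.parse import urlsplit


def split_ftp_url(url: str) -> tuple[str, str, str]:
    """Split an ftp:// URL into (host, directory, file name)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError(f"expected an ftp:// URL, got {url!r}")
    # Everything before the last slash is the directory, the rest is the file.
    directory, _, filename = parts.path.rpartition("/")
    return parts.hostname, directory, filename


source_template = (
    "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/"
    "gwas-catalog-associations_ontology-annotated.tsv"
)

# "2025/01/31" is a made-up release date used only for illustration.
url = source_template.replace("{release_date}", "2025/01/31")
host, directory, filename = split_ftp_url(url)
print(host)
print(directory)
print(filename)
```

The three parts map onto the usual ftplib workflow: connect to the host, change to the directory, then retrieve the file by name.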

Fetch studies

    - name: fetch studies
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
      promote: true

This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.

[!NOTE] Task parameters

  • The stats_uri is used to fetch the latest release date and other metadata.
  • The source_template is the URL of the GWAS Catalog studies file, which uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
  • The destination_template is where the studies file will be saved, and it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
  • The promote field is set to true, which means the output is promoted to the latest release: after the task completes, the file is also saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv. If promote is set to false, the file is not promoted and is saved only under the specified path with the release date.

Fetch ancestries

    - name: fetch ancestries
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
      promote: true

This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.

[!NOTE] Task parameters

  • The stats_uri is used to fetch the latest release date and other metadata.
  • The source_template is the URL of the GWAS Catalog ancestries file, which uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
  • The destination_template is where the ancestries file will be saved, and it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
  • The promote field is set to true, which means the output is promoted to the latest release: after the task completes, the file is also saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_ancestries.tsv. If promote is set to false, the file is not promoted and is saved only under the specified path with the release date.

Curation

    - name: curation study
      requires:
        - fetch studies
      previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
      studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
      destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
      summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
      promote: true

This task builds the GWAS Catalog curation file that is later used as a template for manual curation. It requires the fetch studies task to complete first, because the curation file is built from the list of studies in the download studies file.
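
The effect of the requires field on execution order can be sketched with a topological sort from the Python standard library. This only illustrates the dependency idea and is not how the otter framework is implemented; the execution_batches function is hypothetical.

```python
from graphlib import TopologicalSorter


def execution_batches(tasks: list[dict]) -> list[list[str]]:
    """Group task names into batches whose requirements are already met."""
    # Map each task name to the set of task names it requires.
    graph = {task["name"]: set(task.get("requires", [])) for task in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular requirements
    batches = []
    while sorter.is_active():
        # Everything returned here could run in parallel.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


tasks = [
    {"name": "crawl release metadata"},
    {"name": "fetch associations"},
    {"name": "fetch studies"},
    {"name": "fetch ancestries"},
    {"name": "curation study", "requires": ["fetch studies"]},
]

for batch in execution_batches(tasks):
    print(batch)
```

With the tasks from the example config, the four fetch and crawl tasks land in the first batch and curation study in the second.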

[!NOTE] Task parameters

  • The requires field specifies that this task depends on the fetch studies task, meaning it will only run after the studies have been fetched.
  • The previous_curation field is used to specify the path to the previous curation file. This is used to build the new curation file based on the previous one.
  • The studies field is the path to the studies file that was fetched in the fetch studies task. This file is used to build the curation file.
  • The destination_template is where the curation file will be saved, and it uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
  • The promote field is set to true, which means the output is promoted to the latest release: after the task completes, the file is also saved under gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv. If promote is set to false, the file is not promoted and is saved only under the specified path with the release date.
  • The summary_statistics_glob field specifies the glob pattern used to list all synced summary statistics files in GCS. The listing is used to identify which studies have summary statistics available.
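
As a rough illustration of how a glob listing can be turned into a set of study identifiers, the sketch below filters a list of paths with the glob pattern and extracts GWAS Catalog accessions (GCST followed by digits). It rests on an assumption that is not stated in this document: that each summary statistics file name contains the study accession. The studies_with_sumstats helper and the example paths are hypothetical.

```python
import re
from fnmatch import fnmatchcase

# GWAS Catalog study accessions look like GCST followed by digits.
ACCESSION = re.compile(r"GCST\d+")


def studies_with_sumstats(paths: list[str], pattern: str) -> set[str]:
    """Return accessions found in the file names of paths matching the glob."""
    study_ids = set()
    for path in paths:
        if not fnmatchcase(path, pattern):
            continue
        # Assumption: the accession is part of the file name itself.
        filename = path.rsplit("/", 1)[-1]
        match = ACCESSION.search(filename)
        if match:
            study_ids.add(match.group())
    return study_ids


pattern = "gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz"

# Made-up listing; a real one would come from listing the GCS bucket.
listing = [
    "gs://gwas_catalog_inputs/raw_summary_statistics/GCST000001.h.tsv.gz",
    "gs://gwas_catalog_inputs/raw_summary_statistics/GCST000002.tsv.gz",
    "gs://gwas_catalog_inputs/raw_summary_statistics/README.txt",
]

print(studies_with_sumstats(listing, pattern))
```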

Curation process

The base of the curation process for GWAS Catalog data is defined in docs/gwas_catalog_curation.md. The original solution used an R script to prepare the data, which was then curated manually. The curation task automates the data preparation and provides a template for manual curation; the manual curation itself is still required.

The automated process includes:

  1. Reading the download studies file, which lists the studies coming from the latest GWAS Catalog release.
  2. Reading the previous curation file, which lists the studies curated in the previous release.
  3. Listing all synced summary statistics files matched by the summary_statistics_glob parameter to identify which studies have summary statistics available. Note that this list can be larger than the list of studies in the download studies file, because syncing also covers unpublished studies.
  4. Comparing the three datasets with the following logic:
    • If the study is present in both the previous curation and the download studies, it is marked as curated
    • If the study is present in the download studies but not in the previous curation, it is marked as to_curate or has_no_sumstats, depending on the presence of summary statistics files
    • If the study is present in the previous curation but not in the download studies, it is marked as removed
  5. The output of the curation process is a file that lists the studies with their status (curated, to_curate or has_no_sumstats, removed) and the fields required for manual curation. The file is saved to the destination_template path specified in the task configuration, that is, gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv.
  6. The output file is then promoted to the latest release path gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv so that it can be used for manual curation.
  7. Manual curation is then performed on the gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv file. This step is not automated. Its output should be saved to both gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv and gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv. The curated file is then used by the Open Targets Staging DAGs.
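
The comparison logic in step 4 can be sketched with plain sets of study identifiers. This is a simplified illustration rather than the gentroutils code. It makes two assumptions: a new study is marked to_curate when summary statistics exist and has_no_sumstats otherwise, and studies that appear only in the summary statistics listing are ignored. The classify_studies function and the identifiers are hypothetical.

```python
def classify_studies(
    previous_curation: set[str],
    download_studies: set[str],
    with_sumstats: set[str],
) -> dict[str, str]:
    """Assign a curation status to every study seen in either input file."""
    status = {}
    for study_id in previous_curation | download_studies:
        if study_id in previous_curation and study_id in download_studies:
            status[study_id] = "curated"
        elif study_id in download_studies:
            # New study. Assumption: the status depends on whether
            # summary statistics files were found for it.
            if study_id in with_sumstats:
                status[study_id] = "to_curate"
            else:
                status[study_id] = "has_no_sumstats"
        else:
            # Only present in the previous curation file.
            status[study_id] = "removed"
    return status


# Made-up identifiers for illustration.
previous = {"GCST1", "GCST2"}
download = {"GCST2", "GCST3", "GCST4"}
sumstats = {"GCST3", "GCST9"}  # GCST9 stands for an unpublished study

for study_id, state in sorted(classify_studies(previous, download, sumstats).items()):
    print(study_id, state)
```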

Contribute

To contribute to the project, you first need to set it up. The project runs on:

  • python 3.13
  • uv (dependency manager)

To set up the project, run:

make dev

The command installs the above dependencies if they are not present (the only initial requirements are curl and bash) and then installs all Python dependencies listed in pyproject.toml. Finally, it installs the pre-commit hooks that must run before a commit is created.

The project has additional dev dependencies, which include the packages used for testing. All of the dev dependencies are installed automatically by uv.

To see all available dev commands

Run the following command:

make help

Manual testing of CLI module

To check CLI execution manually, run:

uv run gentroutils

This software was developed as part of the Open Targets project. For more information please see: http://www.opentargets.org
