
Pipeline for processing GEO data and uploading it to PEPhub


geopephub

Automatic uploader of GEO metadata projects to PEPhub.

This repository contains the geopephub CLI, which uploads GEO projects to PEPhub automatically, either by date or on a schedule using GitHub Actions. The CLI also includes a download command that retrieves projects from a specified namespace directly from the PEPhub database. This is particularly helpful for downloading all GEO projects at once.

Installation

To install geopephub, use this command:

pip install git+https://github.com/pepkit/geopephub.git

Overview:

geopephub consists of four main components:

  1. Queuer: Scans GEO for new projects, creates a new cycle for the current run, and logs details for each GEO project. It sets each project's status to queued and adds it to the database.
  2. Uploader: Checks the cycle_status table for queued cycles. It retrieves the list of queued projects, runs geofetch to download them, and uploads the results to the PEPhub database using pepdbagent. geopephub updates the project upload status at each step, so a failed upload can be traced later to the step where it failed.
  3. Checker: Examines previous cycles, verifies their status, and determines whether they were executed. If a cycle was not executed or was unsuccessful, it triggers a rerun. If only individual projects in the cycle failed, it attempts to upload just those projects again. If the cycle does not exist, it creates one using the queuer and uploads files using the uploader.
  4. Downloader: Retrieves projects from the specified namespace, filters them by upload or update date, and optionally sorts them by name or date. It also allows setting a limit on the number of downloaded projects. Projects can be downloaded locally or to a specified S3 bucket. For more information, use the geopephub --help command.
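The filter, sort, and limit behavior described for the Downloader can be illustrated with a small in-memory sketch. This is not geopephub's implementation: `ProjectRecord`, `select_projects`, and all parameter names are hypothetical and exist only for illustration. For the real options, consult `geopephub --help`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class ProjectRecord:
    # Hypothetical record for illustration; not the PEPhub database schema.
    name: str
    last_update: datetime


def select_projects(
    projects: Iterable[ProjectRecord],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    sort_by: Optional[str] = None,  # "name", "date", or None (keep input order)
    limit: Optional[int] = None,
) -> List[ProjectRecord]:
    """Filter by date window, optionally sort, then cap the result count."""
    selected = [
        p
        for p in projects
        if (start is None or p.last_update >= start)
        and (end is None or p.last_update <= end)
    ]
    if sort_by == "name":
        selected.sort(key=lambda p: p.name)
    elif sort_by == "date":
        selected.sort(key=lambda p: p.last_update)
    elif sort_by is not None:
        raise ValueError(f"unsupported sort key: {sort_by!r}")
    return selected if limit is None else selected[:limit]
```

Applying the limit after sorting means the cap always keeps the first projects in the requested order rather than an arbitrary subset.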

The flowcharts below describe these processes in more detail.

Queuer Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s1: Create a new cycle
    s2: Find GEO updated projects with geofetch Finder
    s3: Add projects to the queue in sample status table
    s4: Change cycle status to queued
    s5: Exit
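
The queuer steps in the chart above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's code: the `Cycle` class, the `"initial"` status value, and the `find_updated` callable (a stand-in for the geofetch Finder) are all hypothetical, and the real pipeline records statuses in database tables rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cycle:
    # Hypothetical in-memory stand-in for one row of the cycle status table.
    namespace: str
    status: str = "initial"  # assumed starting value, not from the source
    projects: Dict[str, str] = field(default_factory=dict)  # accession -> status


def run_queuer(namespace: str, find_updated: Callable[[], List[str]]) -> Cycle:
    """Mirror the chart: create cycle, find projects, queue them, mark cycle."""
    cycle = Cycle(namespace=namespace)   # s1: create a new cycle
    accessions = find_updated()          # s2: stand-in for the geofetch Finder
    for gse in accessions:               # s3: add each project to the queue
        cycle.projects[gse] = "queued"
    cycle.status = "queued"              # s4: change cycle status to queued
    return cycle                         # s5: exit
```

Passing the finder as a callable keeps the sketch testable without network access, for example `run_queuer("geo", lambda: ["GSE1", "GSE2"])`.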

Uploader Flowchart:

%%{init: {'theme':'forest'}}%%
stateDiagram-v2
    s1 --> s2 
    s2 --> s3
    s3 --> s4
    s4 --> s5
    s5 --> s6
    s6 --> s7
    s7 --> s8

    s7 --> s2
    s6 --> s3

    s1: Get queued cycles by specifying namespace
    s2: Get each queued cycle from the list and change its status
    s3: Get each project (GSE) from one cycle
    s4: Change status of the project in project_status_table
    s5: Get specified project by running Geofetcher
    s6: Add project to the DB using pepdbagent and change its status in project_status_table
    s7: Change status of cycle in cycle_status_table
    s8: Exit
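
The nested loop in the uploader chart (cycles on the outside, projects on the inside, with a status update at each step) might look roughly like the sketch below. It is an assumption-laden illustration: cycles are plain dicts, the status strings `"processing"`, `"success"`, and `"failure"` are invented for the example, and `fetch` and `upload` are hypothetical callables standing in for the Geofetcher and pepdbagent steps, whose real APIs are not shown here.

```python
from typing import Callable, Dict, List


def run_uploader(
    cycles: List[Dict],
    fetch: Callable[[str], dict],
    upload: Callable[[str, dict], None],
) -> None:
    """Process every queued cycle, recording a status for each project."""
    for cycle in cycles:
        if cycle["status"] != "queued":
            continue
        cycle["status"] = "processing"
        any_failed = False
        # Iterate over a copy of the keys because statuses are updated in place.
        for gse in list(cycle["projects"]):
            if cycle["projects"][gse] != "queued":
                continue
            cycle["projects"][gse] = "processing"
            try:
                project = fetch(gse)   # stand-in for the download step
                upload(gse, project)   # stand-in for the database upload
            except Exception:
                # Keep going so one bad project does not block the rest.
                cycle["projects"][gse] = "failure"
                any_failed = True
            else:
                cycle["projects"][gse] = "success"
        cycle["status"] = "failure" if any_failed else "success"
```

Recording a per-project status before and after each step is what makes it possible to tell later which projects failed and need to be retried.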

Checker Flowchart:

graph TD
    A[Choose cycle to check] --> B{Did it run?}
    B -->|Yes| C{Was it successful?}
    B -->|No| D[Run Queuer for the cycle]
    C -->|Yes| E{Did all samples succeed?}
    C -->|No| D

    D --> D1[Run Uploader for the cycle]
    D1 --> K

    E --> |Yes| K[Exit]
    E --> |No| G[Retrieve failed samples]

    G --> H[Run Queuer for samples]
    H --> F[Run Uploader for queued samples]
    
    F --> I[Change samples status in the table]

    I --> J[Change cycle status in the table]

    J --> K[Exit]
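
The decision logic of the checker chart can be condensed into one function that returns the action to take. This is only a sketch: `plan_check`, the dict layout, the status strings, and the action names are hypothetical and chosen for illustration. The items tracked per cycle are called samples in the chart.

```python
from typing import Dict, List, Optional, Tuple


def plan_check(cycle: Optional[Dict]) -> Tuple[str, List[str]]:
    """Return (action, items_to_retry) for one cycle, following the chart."""
    # "Did it run?" A missing cycle or one that never finished counts as not run.
    ran = cycle is not None and cycle["status"] in ("success", "failure")
    if not ran:
        return ("rerun_cycle", [])
    # "Was it successful?" An unsuccessful cycle is queued and uploaded again.
    if cycle["status"] != "success":
        return ("rerun_cycle", [])
    # "Did all samples succeed?" Retry only the ones that did not.
    failed = [gse for gse, status in cycle["projects"].items() if status != "success"]
    if failed:
        return ("retry_projects", failed)
    return ("exit", [])
```

A caller would then run the queuer and uploader for the whole cycle on `"rerun_cycle"`, or only for the returned items on `"retry_projects"`.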

Download an entire namespace

How to run it on Rivanna:

# install geopephub from dev branch
pip install git+https://github.com/pepkit/geopephub.git@dev

# set all env vars 

# run:
geopephub auto-download --destination /project/shefflab/brickyard/datasets_downloaded/pephub/geo
