
AACR Project GENIE ETL

AACR Project GENIE


Introduction

This repository documents code used to gather, QC, standardize, and analyze data uploaded by institutes participating in AACR's Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange).

Dependencies

This repository contains R, Python, and command-line tools. You will need the following tools and packages to reproduce the results:

  • Python >3.7 and <3.10
    • pip install -r requirements.txt
  • bedtools
  • R 4.2.2
    • renv::install()
    • Follow instructions here to install synapser
  • Java > 8
    • For Mac users, it seems to work better to run brew install java
  • wget
    • For Mac users, you may have to run brew install wget
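A quick sanity check of the dependencies above might look like the following. This is only a sketch: it verifies that each tool is on the PATH, not that the installed versions satisfy the constraints listed.

```shell
# Report which of the required tools are available on PATH.
# Tool names come from the dependency list above; versions are not checked.
missing=0
for tool in python3 pip bedtools R java wget; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```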

File Validator

One of the features of the aacrgenie package is that it provides a local validation tool that GENIE data contributors can install and use to validate their files locally prior to uploading them to Synapse.

pip install aacrgenie
genie -v

This will install all the necessary components for you to run the validator locally on all of your files, including the Synapse client. Please view the help to see how to run the validator.

genie validate -h
genie validate data_clinical_supp_SAGE.txt SAGE
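To validate several files for one center in a single pass, a small wrapper loop can help. This is only a sketch: the file names below are placeholders, and SAGE stands in for your own center code.

```shell
# Validate a set of flat files for one center before uploading to Synapse.
# File names and the SAGE center code are placeholders; substitute your own.
if command -v genie >/dev/null 2>&1; then
  genie_available=yes
  for f in data_clinical_supp_SAGE.txt data_mutations_extended_SAGE.txt; do
    genie validate "$f" SAGE || echo "validation failed: $f"
  done
else
  genie_available=no
  echo "aacrgenie is not installed; run: pip install aacrgenie"
fi
```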

Contributing

Please view the contributing guide to learn how to contribute to the GENIE package.

Sage Bionetworks Only

Developing locally

These are instructions on how you would develop and test the pipeline locally.

  1. Make sure you have read through the GENIE Onboarding Docs and have access to all of the required repositories, resources, and Synapse projects for Main GENIE.

  2. Be sure you are invited to the Synapse GENIE Admin team.

  3. Make sure you are a Synapse certified user: Certified User - Synapse User Account Types

  4. Clone this repo and install the package locally.

    pip install -e .
    pip install -r requirements.txt
    pip install -r requirements-dev.txt
    
  5. Run the different pipelines on the test project. The --project_id syn7208886 argument points to the test project.

    1. Validate all the files.

      python bin/input_to_database.py main --project_id syn7208886 --onlyValidate
      
    2. Process all the files aside from the mutation (maf, vcf) files. Mutation processing is split out because it takes at least 2 days to process all of the production mutation data. Ideally, there would be a parameter to include or exclude file types to process/validate, but that is not implemented.

      python bin/input_to_database.py main --project_id syn7208886 --deleteOld
      
    3. Process the mutation data. Be sure to clone this repo: https://github.com/Sage-Bionetworks/annotation-tools. This repo houses the code that re-annotates the mutation data with Genome Nexus. The --createNewMafDatabase flag will create a new mutation table in the test project. This flag is necessary for production data for two main reasons:

      • During processing of the mutation data, new records are appended to the existing table, so without creating an empty table, duplicate data would be uploaded.
      • By design, Synapse Tables are meant to be appended to. When a Synapse Table is updated, it takes time to index the table and return results, which can cause problems when the pipeline queries the mutation table. When dealing with millions of rows, it is actually faster to create an entirely new table than to update or delete all existing rows and append new ones.
      python bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase
      
    4. Create a consortium release. Be sure to add the --test parameter and to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal

      python bin/database_to_staging.py Jan-2017 ../cbioportal TEST --test
      
    5. Create a public release. Be sure to add the --test parameter and to clone the cbioportal repo: https://github.com/cBioPortal/cbioportal

      python bin/consortium_to_public.py Jan-2017 ../cbioportal TEST --test
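Taken together, steps 1-5 above can be captured in a small driver script. This is a sketch only, assuming the genie repo root as the working directory, annotation-tools and cbioportal cloned alongside it, and an authenticated Synapse session:

```shell
# Run the full local test pipeline against the test project (syn7208886).
# Assumes ../annotation-tools and ../cbioportal are cloned, and that you
# are logged in to Synapse. Commands are chained so a failure stops the run.
run_test_pipeline() {
  project=syn7208886
  python bin/input_to_database.py main --project_id "$project" --onlyValidate &&
  python bin/input_to_database.py main --project_id "$project" --deleteOld &&
  python bin/input_to_database.py mutation --project_id "$project" --deleteOld \
    --genie_annotation_pkg ../annotation-tools --createNewMafDatabase &&
  python bin/database_to_staging.py Jan-2017 ../cbioportal TEST --test &&
  python bin/consortium_to_public.py Jan-2017 ../cbioportal TEST --test
}

# Only run when invoked from the genie repo root.
if [ -f bin/input_to_database.py ]; then
  run_test_pipeline
fi
```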
      

Production

The production pipeline is run on Nextflow Tower, and the Nextflow workflow is captured in nf-genie. It is wise to create an EC2 instance via the Sage Bionetworks Service Catalog to work with the production data, because there is limited PHI in GENIE.

Download files

Download the file for your platform.

Source Distribution

aacrgenie-15.3.0.tar.gz (155.0 kB)

Built Distribution

aacrgenie-15.3.0-py3-none-any.whl (152.7 kB)

File details

Details for the file aacrgenie-15.3.0.tar.gz.

File metadata

  • Download URL: aacrgenie-15.3.0.tar.gz
  • Size: 155.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Algorithm    Hash digest
SHA256       6e7d46b6b4fafc0c01a9f56ee9a294a25a4bcfb55b12ecb9aaa7e2591dbe256b
MD5          5385f1708acfc35e21a14d4c34bbe5f0
BLAKE2b-256  947553d08ceb794770bd42abda5fe2ed81734289204acc8cceaed5a31de7a1ed

File details

Details for the file aacrgenie-15.3.0-py3-none-any.whl.

File metadata

  • Download URL: aacrgenie-15.3.0-py3-none-any.whl
  • Size: 152.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Algorithm    Hash digest
SHA256       6c27c7e521a2bb9718bd9ead8110ebe7c224655b88c52d6203608d83c6587256
MD5          a5ede4ced055fa6b8dc1d284745ad074
BLAKE2b-256  f66e55f554042470bd1bc4788126676db4d71e768a4745d172b779387ebe2546
