AACR Project GENIE ETL

Introduction

This repository documents code used to gather, QC, standardize, and analyze data uploaded by institutes participating in AACR's Project GENIE (Genomics, Evidence, Neoplasia, Information, Exchange).

Documentation

For more information about the AACR GENIE repository, visit the GitHub Pages site.

Dependencies

This package contains R, Python, and CLI tools. You will need the following tools and packages to reproduce the results:

  • Python >=3.10, <3.12
    • pip install -r requirements.txt
  • bedtools
  • R 4.3.3
    • renv::install()
    • Follow the instructions here to install synapser
  • Java 21
    • For Mac users, brew install java tends to work better
  • wget
    • For Mac users, run brew install wget
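As a quick sanity check, the dependency list above can be verified with a small shell sketch (a hypothetical helper, not part of the repo; it assumes the standard command names python3, bedtools, Rscript, java, and wget, and only reports what is on PATH):

```shell
#!/usr/bin/env bash
# Sketch: report which GENIE prerequisites are on PATH.
check_dep() {
    # Print found/MISSING for a command name; never aborts the scan.
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "MISSING: $1"
    fi
}

for tool in python3 bedtools Rscript java wget; do
    check_dep "$tool"
done

# Spot-check the pinned versions (informational only):
python3 --version 2>/dev/null   # expect 3.10 or 3.11
java -version 2>&1 | head -n 1  # expect 21
```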

File Validator

Please see the local file validation tutorial for more information on the validator and how to use it.

Contributing

Please view the contributing guide to learn how to contribute to the GENIE package.

Sage Bionetworks Only

Running locally

These instructions describe how to set up your environment and run the pipeline locally.

  1. Make sure you have read through the GENIE Onboarding Docs and have access to all of the required repositories, resources, and Synapse projects for Main GENIE.
  2. Be sure you are invited to the Synapse GENIE Admin team.
  3. Make sure you are a Synapse certified user: Certified User - Synapse User Account Types
  4. (OPTIONAL if developing with docker) Clone the cbioportal repo: https://github.com/cBioPortal/cbioportal and git checkout the version of the repo pinned to the Dockerfile.
  5. (OPTIONAL if developing with docker) Clone the annotation-tools repo: https://github.com/Sage-Bionetworks/annotation-tools and git checkout the version of the repo pinned to the Dockerfile.
  6. (HIGHLY RECOMMENDED) Develop in an EC2 instance, as the environment setup (specifically building the Dockerfile) is unstable on local Mac/Windows machines. Follow the Service-Catalog-Provisioning instructions to create an EC2 instance in Service Catalog.

Using conda

Follow instructions to install conda on your computer:

Install conda-forge and mamba

conda install -n base -c conda-forge mamba

Install Python and R versions via mamba

mamba create -n genie_dev -c conda-forge python=3.10 r-base=4.3  

Using pipenv

Installing via pipenv

  1. Specify a python version that is supported by this repo:

    pipenv --python <python_version>
    
  2. Install dependencies from the requirements file:

    pipenv install -r requirements.txt
    

  3. Activate your pipenv:

    pipenv shell
    

Using docker (HIGHLY Recommended)

This is the most reproducible method, though the most tedious to develop with. See the CONTRIBUTING docs for how to develop locally with docker. The following steps set up the docker image in your environment.

  1. Pull a pre-existing docker image or build one from the Dockerfile. You can find the list of images here.

    Pull a pre-existing docker image:

    docker pull <some_docker_image_name>
    

    Or build from the Dockerfile:

    docker build -f Dockerfile -t <some_docker_image_name> .
    
  2. Run docker image:

    docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
    

Setting up

  1. Clone this repo and install the package locally.

    Install Python packages. This is the more traditional way of installing dependencies. Follow instructions here to learn how to install pip.

    pip install -e .
    pip install -r requirements.txt
    pip install -r requirements-dev.txt
    

    Install R packages. Note that the R package setup is the most unpredictable step; you will likely have to manually install specific packages before the rest will install.

    Rscript R/install_packages.R
    
  2. Configure the Synapse client to authenticate to Synapse.

    1. Create a Synapse Personal Access token (PAT).
    2. Add a ~/.synapseConfig file
      [authentication]
      authtoken = <PAT here>
      
    3. OR set an environmental variable
      export SYNAPSE_AUTH_TOKEN=<PAT here>
      
    4. Confirm you can log in from your terminal.
      synapse login
      
  3. Run the different steps of the pipeline on the test project. --project_id syn7208886 points to the test project. Always use the test project when developing, testing, and running locally.

    1. Validate all the files excluding vcf files:

      python3 bin/input_to_database.py main --project_id syn7208886 --onlyValidate
      
    2. Validate all the files:

      python3 bin/input_to_database.py mutation --project_id syn7208886 --onlyValidate --genie_annotation_pkg ../annotation-tools
      
    3. Process all the files aside from the mutation (maf, vcf) files. Mutation processing is split out because it takes at least 2 days to process all the production mutation data. Ideally there would be a parameter to include or exclude file types to process/validate, but that is not implemented.

      python3 bin/input_to_database.py main --project_id syn7208886 --deleteOld
      
    4. Process the mutation data. This command uses the annotation-tools repo that you cloned previously, which houses the code that standardizes/merges the mutation (both maf and vcf) files and re-annotates the mutation data with Genome Nexus. The --createNewMafDatabase flag creates a new mutation table in the test project. This flag is necessary for production data for two main reasons:

      • During processing, mutation data is appended to the existing table, so without creating an empty table first, duplicated data would be uploaded.
      • By design, Synapse Tables are meant to be appended to. When a Synapse Table is updated, it takes time to index the table and return results, which can cause problems when the pipeline queries the mutation table. With millions of rows, it is actually faster to create an entirely new table than to update or delete all rows and append new ones.
      • If you run this more than once on the same day, you will hit an error overwriting the narrow maf table because it already exists. Rename the current narrow maf database under Tables in the test Synapse project and try again.
      python3 bin/input_to_database.py mutation --project_id syn7208886 --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase
      
    5. Create a consortium release. Be sure to add the --test parameter. For consistency, the processingDate specified here should match the one used for the TEST key in the consortium_map in nf-genie.

      python3 bin/database_to_staging.py <processingDate> ../cbioportal TEST --test
      
    6. Create a public release. Be sure to add the --test parameter. For consistency, the processingDate specified here should match the one used for the TEST key in the public_map in nf-genie.

      python3 bin/consortium_to_public.py <processingDate> ../cbioportal TEST --test
      
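The steps above can be collected into a wrapper script. This is a sketch rather than anything shipped with the repo: PROCESSING_DATE is a hypothetical placeholder that must match the TEST key in nf-genie, and the script defaults to a dry run that only prints each command.

```shell
#!/usr/bin/env bash
# Sketch: run the local pipeline steps against the test project.
set -euo pipefail

PROJECT_ID="syn7208886"          # test project; never point this at production
PROCESSING_DATE="Jul-2024"       # hypothetical; match the TEST key in nf-genie
DRY_RUN="${DRY_RUN:-1}"          # set DRY_RUN=0 to actually execute

run_step() {
    # Echo each command before (optionally) running it.
    echo ">>> $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

run_step python3 bin/input_to_database.py main --project_id "$PROJECT_ID" --onlyValidate
run_step python3 bin/input_to_database.py mutation --project_id "$PROJECT_ID" \
    --onlyValidate --genie_annotation_pkg ../annotation-tools
run_step python3 bin/input_to_database.py main --project_id "$PROJECT_ID" --deleteOld
run_step python3 bin/input_to_database.py mutation --project_id "$PROJECT_ID" \
    --deleteOld --genie_annotation_pkg ../annotation-tools --createNewMafDatabase
run_step python3 bin/database_to_staging.py "$PROCESSING_DATE" ../cbioportal TEST --test
run_step python3 bin/consortium_to_public.py "$PROCESSING_DATE" ../cbioportal TEST --test
```

Leave the DRY_RUN default of 1 to preview the commands first; set DRY_RUN=0 to execute them.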

Developing

  1. Navigate to your cloned repository on your computer/server.

  2. Make sure your develop branch is up to date with the Sage-Bionetworks/Genie develop branch.

    cd Genie
    git checkout develop
    git pull
    
  3. Create a feature branch off the develop branch. If there is a GitHub/JIRA issue that you are addressing, name the branch after the issue with some more detail (like {GH|GEN}-123-add-some-new-feature).

    git checkout -b GEN-123-new-feature
    
  4. At this point, you have only created the branch locally; you need to push it to GitHub.

    git push -u origin GEN-123-new-feature
    
  5. Add your code changes, commit them with a useful commit message, and push:

    git add changed_file.txt
    git commit -m "Remove X parameter because it was unused"
    git push
    
  6. Once you have completed the steps above, create a pull request (PR) in GitHub from your feature branch to the develop branch of Sage-Bionetworks/Genie.

Developing with Docker

See using docker for setting up the initial docker environment.

A docker build will be created for your feature branch every time you have an open PR on GitHub with the run_integration_tests label added to it.

It is recommended to develop with docker. You can either write the code changes locally, push them to your remote, and wait for docker to rebuild, OR do the following:

  1. Make any code changes. These cannot be dependency changes; those require a docker rebuild.

  2. Create a detached container from the image that you pulled or built earlier (docker run -d prints the container ID):

    docker run -d <docker_image_name> /bin/bash -c "while true; do sleep 1; done"
    
  3. Copy your code changes into the container:

    docker cp <folder or name of file> <container_id>:/root/Genie/<folder or name of files>
    
  4. Open an interactive shell in the running container:

    docker exec -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <container_id> /bin/bash
    
  5. Run any commands or tests you need inside the container.
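The copy-into-container loop above can be sketched as a small helper. container_cp_cmd is hypothetical (not part of the repo), and /root/Genie is assumed to be the repo path inside the image, per step 3:

```shell
#!/usr/bin/env bash
# Sketch: build the docker cp command for a changed path.
container_cp_cmd() {
    # Print (but do not run) the copy command for one file or folder.
    local src="$1" container="$2"
    echo "docker cp $src $container:/root/Genie/$src"
}

# Example session (commented out; requires docker and a built image):
# cid=$(docker run -d <docker_image_name> /bin/bash -c "while true; do sleep 1; done")
# eval "$(container_cp_cmd <changed_file> "$cid")"
# docker exec -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN "$cid" /bin/bash
```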

Modifying Docker

Follow this section when modifying the Dockerfile:

  1. Have your Synapse authentication token handy
  2. docker build -f Dockerfile -t <some_docker_image_name> .
  3. docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
  4. Run test code relevant to the Dockerfile changes to make sure they are present and working
  5. Once the changes are tested, follow the GENIE contributing guidelines to add them to the repo
  6. Once deployed to main, make sure the CI/CD build completed successfully (our docker image is automatically deployed via GitHub Actions CI/CD) here
  7. Check that your docker image got successfully deployed here

Testing

Currently our GitHub Actions run unit tests from our test suite in /tests, as well as integration tests (each of the pipeline steps here) on the test pipeline.

These are all triggered by adding the GitHub label run_integration_tests to your open PR.

To trigger run_integration_tests:

  • Add the run_integration_tests label when you first open your PR
  • Remove the run_integration_tests label and re-add it
  • Push any commit while the PR is still open

If you are developing with docker, docker images for your feature branch are also built via the run_integration_tests trigger, so check that your docker image got successfully deployed here.

Running unit tests

Unit tests in Python are run automatically by GitHub Actions on any PR and are required to pass before merging.

Otherwise, if you want to add tests or run them outside of CI/CD, see how to run tests and general test development.

Running integration tests

See running pipeline steps here if you want to run the integration tests locally.

You can also run them in Nextflow via nf-genie.

Production

The production pipeline is run on Nextflow Tower and the Nextflow workflow is captured in nf-genie. It is wise to create an EC2 instance via the Sage Bionetworks Service Catalog to work with the production data, because there is limited PHI in GENIE.

Github Workflows

For technical details about our CI/CD, please see the GitHub workflows README.
