ETL pipeline for single-cell RNA-seq data

scp-ingest-pipeline

File Ingest Pipeline for Single Cell Portal

The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.

Prerequisites

  • Python 3.7+
  • Google Cloud Platform project
  • A suitable service account (SA) and a MongoDB VM in GCP. The SA needs the roles "Editor", "Genomics Pipelines Runner", and "Storage Object Admin". Broad Institute engineers: see instructions here.
  • SAMtools, if using ingest/make_toy_data.py
  • Tabix, if using ingest/genomes/genomes_pipeline.py
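
The local prerequisites above can be checked from Python before installing. The following is a minimal sketch and is not part of scp-ingest-pipeline: the helper name `missing_prerequisites` is hypothetical, and it assumes the optional tools are on your PATH under the executable names `samtools` and `tabix`. It does not check the GCP project, the service account roles, or the MongoDB VM.

```python
import shutil
import sys

# Hypothetical helper, not part of scp-ingest-pipeline.
# Assumption: the optional tools are on PATH as "samtools" and "tabix".
OPTIONAL_TOOLS = {
    "samtools": "ingest/make_toy_data.py",
    "tabix": "ingest/genomes/genomes_pipeline.py",
}


def missing_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; empty if none are found."""
    version_info = sys.version_info if version_info is None else version_info
    problems = []
    # Compare only (major, minor) against the documented minimum, 3.7.
    if tuple(version_info[:2]) < (3, 7):
        problems.append("Python 3.7+ is required")
    for tool, script in OPTIONAL_TOOLS.items():
        # shutil.which returns None when no executable is found on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found (only needed for {script})")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

A missing SAMtools or Tabix only matters if you plan to run the script named next to it.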

Install

Fetch the code, create and activate a virtual environment, and install dependencies:

git clone git@github.com:broadinstitute/scp-ingest-pipeline.git
cd scp-ingest-pipeline
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt

To use ingest/make_toy_data.py, install SAMtools (shown here with Homebrew):

brew install samtools

To use ingest/genomes/genomes_pipeline.py, install Tabix (shown here with Homebrew):

brew install tabix

Now get secrets from Vault to set environment variables needed to write to the database:

export BROAD_USER="<username in your email address>"

export DATABASE_NAME="single_cell_portal_development"

vault login -method=github token=`~/bin/git-vault-token`

# Get username and password
vault read secret/kdux/scp/development/$BROAD_USER/mongo/user

export MONGODB_USERNAME="<username from Vault>"
export MONGODB_PASSWORD="<password from Vault>"

# Get external IP address for host
vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname

export DATABASE_HOST="<ip from Vault (omit brackets)>"
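
Before running anything that writes to the database, it can help to confirm that the exports above took effect. The following is a minimal sketch, not code from this repository: the helper name `unset_database_vars` is hypothetical. Treating a value that still looks like `<...>` as unset is a heuristic, added here because the instructions say to omit the brackets.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "DATABASE_NAME",
    "MONGODB_USERNAME",
    "MONGODB_PASSWORD",
    "DATABASE_HOST",
)


def unset_database_vars(env=None):
    """Return required variable names that are missing, empty, or placeholders."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        # Heuristic: a value still wrapped in <...> was probably pasted
        # from the instructions without being replaced.
        looks_like_placeholder = value.startswith("<") and value.endswith(">")
        if not value or looks_like_placeholder:
            problems.append(name)
    return problems


if __name__ == "__main__":
    missing = unset_database_vars()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("All database variables are set.")
```

The check only inspects the environment; it does not try to connect to MongoDB or validate the credentials.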

If you are developing updates for Sentry logging, then set the DSN:

vault read secret/kdux/scp/production/scp_config.json | grep SENTRY

export SENTRY_DSN="<Sentry DSN value from Vault>"

Be sure to unset SENTRY_DSN when your updates are done, so that ordinary development logs are not sent to Sentry.

Git hooks

After installing Ingest Pipeline, add Git hooks to help ensure code quality:

pre-commit install && pre-commit install -t pre-push

The hooks expect git-secrets to be set up. If you are a Broad Institute employee who has not done this yet, see broadinstitute/single_cell_portal_configs for specific guidance.

Bypass hooks

In rare cases, you might need to skip Git hooks, like so:

  • Skip commit hooks: git commit ... --no-verify
  • Skip pre-push hooks: git push ... --no-verify

Test

After installing:

source env/bin/activate
cd tests

# Run all tests
pytest

Some common pytest usage examples (run from the tests directory):

# Run all tests and see print() output
pytest -s

# Run only tests in test_ingest.py
pytest test_ingest.py

# Run all tests, show code coverage metrics
pytest --cov=../ingest/

For more, see https://docs.pytest.org/en/stable/usage.html.
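
For readers new to pytest: by default, the commands above collect files named test_*.py and run the functions in them whose names start with test_. The following is a minimal sketch of such a file. The file name test_example.py and the function `count_nonzero_values` are hypothetical and do not exist in this repository.

```python
# test_example.py (hypothetical file, not part of scp-ingest-pipeline)


def count_nonzero_values(values):
    """Toy function under test: count entries that are not zero."""
    return sum(1 for value in values if value != 0)


def test_count_nonzero_values():
    expression_row = [0, 3.5, 0, 1.2]
    # pytest captures this output for passing tests by default;
    # running `pytest -s` disables capturing so it is printed.
    print("checking row:", expression_row)
    assert count_nonzero_values(expression_row) == 2


def test_count_nonzero_values_empty_row():
    assert count_nonzero_values([]) == 0
```

Saved in the tests directory, this file could be run on its own with `pytest test_example.py`, in the same way as test_ingest.py above.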

Use

Run this every time you start a new terminal to work on this project:

source env/bin/activate
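
If you are unsure whether the environment is active in the current terminal, you can ask the interpreter. The following is a minimal sketch, assuming the environment was created with `python3 -m venv` as in the Install section. The helper name `in_virtual_environment` is hypothetical.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-style environment."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment is active:", sys.prefix)
    else:
        print("No virtual environment detected; run: source env/bin/activate")
```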

See ingest_pipeline.py for usage examples.

