ETL pipeline for single-cell RNA-seq data

scp-ingest-pipeline

File Ingest Pipeline for Single Cell Portal

The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.

Prerequisites

  • Python 3.7+
  • Google Cloud Platform project
  • Suitable service account (SA) and MongoDB VM in GCP. The SA needs the roles "Editor", "Genomics Pipelines Runner", and "Storage Object Admin". Broad Institute engineers: see instructions here.
  • SAMtools, if using ingest/make_toy_data.py
  • Tabix, if using ingest/genomes/genomes_pipeline.py
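
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the pipeline: the function name check_prerequisites and the OPTIONAL_TOOLS mapping are hypothetical, and it only checks the Python version and whether the samtools and tabix executables are on PATH.

```python
import shutil
import sys

# Tools named in the prerequisites list; each is only needed for one script.
# This mapping is an illustration, not something the pipeline defines.
OPTIONAL_TOOLS = {
    "samtools": "ingest/make_toy_data.py",
    "tabix": "ingest/genomes/genomes_pipeline.py",
}


def check_prerequisites(min_python=(3, 7)):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            "Python %d.%d+ is required, found %d.%d"
            % (min_python + tuple(sys.version_info[:2]))
        )
    for tool, script in OPTIONAL_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH (needed for {script})")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

A missing samtools or tabix is only a problem if you plan to run the corresponding script, so the sketch reports warnings rather than exiting.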

Install

Fetch the code, create and activate a virtual environment, and install dependencies:

git clone git@github.com:broadinstitute/scp-ingest-pipeline.git
cd scp-ingest-pipeline
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt

To use ingest/make_toy_data.py:

brew install samtools

To use ingest/genomes/genomes_pipeline.py:

brew install tabix

Now get secrets from Vault to set environment variables needed to write to the database:

export BROAD_USER="<username in your email address>"

export DATABASE_NAME="single_cell_portal_development"

vault login -method=github token=$(~/bin/git-vault-token)

# Get username and password
vault read secret/kdux/scp/development/$BROAD_USER/mongo/user

export MONGODB_USERNAME="<username from Vault>"
export MONGODB_PASSWORD="<password from Vault>"

# Get external IP address for host
vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname

export DATABASE_HOST="<ip from Vault (omit brackets)>"
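
To confirm that all four database variables are set before running the pipeline, a small check can help. The sketch below is an illustration only: build_mongo_uri is a hypothetical helper, the port 27017 is MongoDB's default and an assumption about this deployment, and the pipeline itself may assemble its connection differently.

```python
import os
from urllib.parse import quote_plus

# Environment variables exported in the steps above.
REQUIRED_VARS = (
    "MONGODB_USERNAME",
    "MONGODB_PASSWORD",
    "DATABASE_HOST",
    "DATABASE_NAME",
)


def build_mongo_uri(env=None, port=27017):
    """Build a MongoDB connection URI from the environment.

    Assumption: the database listens on the default MongoDB port.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # Percent-encode credentials so characters such as "@" or "/" are safe.
    user = quote_plus(env["MONGODB_USERNAME"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    host = env["DATABASE_HOST"]
    name = env["DATABASE_NAME"]
    return f"mongodb://{user}:{password}@{host}:{port}/{name}"
```

Calling build_mongo_uri() with no arguments reads from the current shell environment and raises KeyError naming any variable that is unset or empty.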

If you are developing updates for Sentry logging, then set the DSN:

vault read secret/kdux/scp/production/scp_config.json | grep SENTRY

export SENTRY_DSN="<Sentry DSN value from Vault>"

Be sure to unset SENTRY_DSN when your updates are done, so development logs are not always sent to Sentry.
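
The reason unsetting the variable matters can be sketched as follows. This is an assumption about how such a toggle typically works, not the pipeline's actual code: sentry_dsn is a hypothetical helper that treats an unset or empty SENTRY_DSN as "do not report".

```python
import os


def sentry_dsn(env=None):
    """Return the DSN if SENTRY_DSN is set and non-empty, else None."""
    env = os.environ if env is None else env
    dsn = env.get("SENTRY_DSN", "").strip()
    return dsn or None


if __name__ == "__main__":
    dsn = sentry_dsn()
    if dsn is None:
        print("SENTRY_DSN is not set; logs stay local")
    else:
        # A real integration would pass the DSN on here, for example
        # via sentry_sdk.init(dsn=dsn).
        print("SENTRY_DSN is set; logs would be sent to Sentry")
```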

Git hooks

After installing Ingest Pipeline, add Git hooks to help ensure code quality:

pre-commit install && pre-commit install -t pre-push

The hooks expect git-secrets to be set up. If you are a Broad Institute employee who has not done this yet, see broadinstitute/single_cell_portal_configs for specific guidance.

Bypass hooks

In rare cases, you might need to skip Git hooks, like so:

  • Skip commit hooks: git commit ... --no-verify
  • Skip pre-push hooks: git push ... --no-verify

Test

After installing:

source env/bin/activate
cd tests

# Run all tests
pytest

Some common pytest usage examples (run from the tests directory):

# Run all tests and see print() output
pytest -s

# Run only tests in test_ingest.py
pytest test_ingest.py

# Run all tests, show code coverage metrics
pytest --cov=../ingest/

For more, see https://docs.pytest.org/en/stable/usage.html.
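
For readers new to pytest, here is a minimal sketch of what a test file looks like. It is not taken from the repository: normalize_gene_name is a hypothetical helper defined inline so the example is self-contained. pytest collects functions whose names start with test_, and the print() output appears only when running with -s.

```python
def normalize_gene_name(name):
    """Hypothetical helper: trim whitespace and upper-case a gene symbol."""
    return name.strip().upper()


def test_normalize_gene_name():
    # This message is visible with `pytest -s`, hidden otherwise.
    print("checking normalize_gene_name")
    assert normalize_gene_name("  brca1 ") == "BRCA1"
```

Saved as, say, test_example.py in the tests directory, it could be run on its own with pytest test_example.py.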

Use

Run this every time you start a new terminal to work on this project:

source env/bin/activate

See ingest_pipeline.py for usage examples.
