ETL pipeline for single-cell RNA-seq data
scp-ingest-pipeline
File Ingest Pipeline for Single Cell Portal
The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.
Prerequisites
- Python 3.7+
- Google Cloud Platform project
- Suitable service account (SA) and MongoDB VM in GCP. SA needs roles "Editor", "Genomics Pipelines Runner", and "Storage Object Admin". Broad Institute engineers: see instructions here.
- SAMtools, if using ingest/make_toy_data.py
- Tabix, if using ingest/genomes/genomes_pipeline.py
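The local prerequisites listed above can be checked before installing. The following is a minimal sketch, not part of the pipeline: the function name `check_prerequisites` and the `OPTIONAL_TOOLS` mapping are illustrative assumptions, and the sketch assumes the tools are installed under the executable names `samtools` and `tabix`.

```python
import shutil
import sys

# Optional command-line tools mapped to the scripts that need them.
# The mapping name and layout are hypothetical, for illustration only.
OPTIONAL_TOOLS = {
    "samtools": "ingest/make_toy_data.py",
    "tabix": "ingest/genomes/genomes_pipeline.py",
}


def check_prerequisites(min_python=(3, 7)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    for tool, script in OPTIONAL_TOOLS.items():
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH (only needed for {script})")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

A missing optional tool is reported as a warning rather than an error, since each tool is only needed for one script.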
Install
Fetch the code, boot your virtualenv, install dependencies:
git clone git@github.com:broadinstitute/scp-ingest-pipeline.git
cd scp-ingest-pipeline
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt
To use ingest/make_toy_data.py:
brew install samtools
To use ingest/genomes/genomes_pipeline.py:
brew install tabix
Now get secrets from Vault to set environment variables needed to write to the database:
export BROAD_USER="<username in your email address>"
export DATABASE_NAME="single_cell_portal_development"
vault login -method=github token=`~/bin/git-vault-token`
# Get username and password
vault read secret/kdux/scp/development/$BROAD_USER/mongo/user
export MONGODB_USERNAME="<username from Vault>"
export MONGODB_PASSWORD="<password from Vault>"
# Get external IP address for host
vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname
export DATABASE_HOST="<ip from Vault (omit brackets)>"
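Before running anything that writes to the database, it can help to confirm that all four variables are set. The following is a minimal sketch under stated assumptions: `mongo_uri_from_env` is a hypothetical helper, not the pipeline's own code, and port 27017 is simply MongoDB's default port, which the instructions above do not confirm for this VM.

```python
import os
from urllib.parse import quote_plus

# Variables exported in the steps above
REQUIRED_VARS = (
    "MONGODB_USERNAME",
    "MONGODB_PASSWORD",
    "DATABASE_HOST",
    "DATABASE_NAME",
)


def mongo_uri_from_env(env=None, port=27017):
    """Build a MongoDB connection URI from the environment (hypothetical helper).

    The default port is an assumption: 27017 is MongoDB's standard port.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Percent-encode credentials so characters like '@' or ':' do not break the URI
    user = quote_plus(env["MONGODB_USERNAME"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    return f"mongodb://{user}:{password}@{env['DATABASE_HOST']}:{port}/{env['DATABASE_NAME']}"
```

Calling `mongo_uri_from_env()` with no arguments reads the current shell environment and raises a `RuntimeError` naming any variable that is unset or empty.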
If you are developing updates for Sentry logging, then set the DSN:
vault read secret/kdux/scp/production/scp_config.json | grep SENTRY
export SENTRY_DSN="<Sentry DSN value from Vault>"
Be sure to unset SENTRY_DSN when your updates are done, so development logs are not always sent to Sentry.
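To see whether the variable is still set in the current environment, a small check like the following can be used. This is only a sketch: `sentry_dsn` is a hypothetical helper, and how the pipeline itself reads SENTRY_DSN is not described here.

```python
import os


def sentry_dsn(env=None):
    """Return the configured Sentry DSN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    dsn = env.get("SENTRY_DSN", "").strip()
    return dsn or None


if __name__ == "__main__":
    if sentry_dsn() is None:
        print("SENTRY_DSN is not set")
    else:
        print("SENTRY_DSN is set; unset it when Sentry development is done")
```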
Git hooks
After installing Ingest Pipeline, add Git hooks to help ensure code quality:
pre-commit install && pre-commit install -t pre-push
The hooks expect git-secrets to be set up. If you are a Broad Institute employee who has not done this yet, see broadinstitute/single_cell_portal_configs for specific guidance.
Bypass hooks
In rare cases, you might need to skip Git hooks, like so:
- Skip commit hooks:
git commit ... --no-verify
- Skip pre-push hooks:
git push ... --no-verify
Test
After installing:
source env/bin/activate
cd tests
# Run all tests
pytest
Some common pytest usage examples (run in /tests):
# Run all tests and see print() output
pytest -s
# Run only tests in test_ingest.py
pytest test_ingest.py
# Run all tests, show code coverage metrics
pytest --cov=../ingest/
For more, see https://docs.pytest.org/en/stable/usage.html.
Use
Run this every time you start a new terminal to work on this project:
source env/bin/activate
See ingest_pipeline.py for usage examples.
Hashes for scp-ingest-pipeline-1.7.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 5d2a1a82a6a448f4e5e0c3ffa09a6ffd32539029f34e5eae7d919b7f7939c3b0
MD5 | 1a54cc72ecb9bb63b2144438f9c584e0
BLAKE2b-256 | b1951e76d712303813cddae3852b8d5a9ff430e1de9c48c56ce5cdc68abbbb71
Hashes for scp_ingest_pipeline-1.7.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7a2b914c981a6c17c2b6323bc42709bf0ed37f37fcf6afb756060d07fdb12abc
MD5 | 950068d6a4a78eeb6b15deec28cc53e2
BLAKE2b-256 | 2d57af46e8d6a1ab8587a8453eca7b50e07e35583ece66f26009ff00cd66a16f
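A downloaded file can be compared against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("scp-ingest-pipeline-1.7.0.tar.gz", "5d2a1a82a6a448f4e5e0c3ffa09a6ffd32539029f34e5eae7d919b7f7939c3b0")` should return True for an intact copy of the source distribution.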