ETL pipeline for single-cell RNA-seq data
scp-ingest-pipeline
File Ingest Pipeline for Single Cell Portal
The SCP Ingest Pipeline is an ETL pipeline for single-cell RNA-seq data.
Prerequisites
- Python 3.7+
- Google Cloud Platform project
- A suitable service account (SA) and a MongoDB VM in GCP. The SA needs the roles "Editor", "Genomics Pipelines Runner", and "Storage Object Admin". Broad Institute engineers: see instructions here.
- SAMtools, if using ingest/make_toy_data.py
- Tabix, if using ingest/genomes/genomes_pipeline.py
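The prerequisites above can be checked locally before installing. The following is a minimal sketch, not part of the pipeline itself: the function name `check_prerequisites` is hypothetical, and it only checks the Python version and whether the optional `samtools` and `tabix` executables are on your `PATH`. It does not verify the GCP project or service account.

```python
import shutil
import sys


def check_prerequisites():
    """Report which local prerequisites are available (hypothetical helper)."""
    return {
        # The pipeline requires Python 3.7 or newer
        "python>=3.7": sys.version_info >= (3, 7),
        # Optional: only needed for ingest/make_toy_data.py
        "samtools": shutil.which("samtools") is not None,
        # Optional: only needed for ingest/genomes/genomes_pipeline.py
        "tabix": shutil.which("tabix") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

A "missing" result for `samtools` or `tabix` only matters if you plan to run the corresponding script.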
Install
Fetch the code, boot your virtualenv, install dependencies:
git clone git@github.com:broadinstitute/scp-ingest-pipeline.git
cd scp-ingest-pipeline
python3 -m venv env --copies
source env/bin/activate
pip install -r requirements.txt
To use ingest/make_toy_data.py:
brew install samtools
To use ingest/genomes/genomes_pipeline.py:
brew install tabix
Now get secrets from Vault to set environment variables needed to write to the database:
export BROAD_USER="<username in your email address>"
export DATABASE_NAME="single_cell_portal_development"
vault login -method=github token=`~/bin/git-vault-token`
# Get username and password
vault read secret/kdux/scp/development/$BROAD_USER/mongo/user
export MONGODB_USERNAME="<username from Vault>"
export MONGODB_PASSWORD="<password from Vault>"
# Get external IP address for host
vault read secret/kdux/scp/development/$BROAD_USER/mongo/hostname
export DATABASE_HOST="<ip from Vault (omit brackets)>"
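Before running anything that writes to the database, it can help to confirm that the variables exported above are actually set. The sketch below is an illustration only: `missing_env_vars` and `build_mongo_uri` are hypothetical helpers, and how the pipeline itself builds its MongoDB connection is an assumption here. The port 27017 is MongoDB's default, and your deployment may also need options such as `authSource`.

```python
import os
from urllib.parse import quote_plus

# Variables exported in the setup steps above
REQUIRED_VARS = (
    "DATABASE_NAME",
    "MONGODB_USERNAME",
    "MONGODB_PASSWORD",
    "DATABASE_HOST",
)


def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def build_mongo_uri(env=os.environ, port=27017):
    """Build a standard MongoDB connection URI (assumed format, default port)."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so characters like '@' or '/' do not break the URI
    user = quote_plus(env["MONGODB_USERNAME"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    host = env["DATABASE_HOST"]
    return f"mongodb://{user}:{password}@{host}:{port}/{env['DATABASE_NAME']}"


if __name__ == "__main__":
    print("Missing variables:", missing_env_vars() or "none")
```

Avoid printing the full URI in shared logs, since it contains the password.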
If you are developing updates for Sentry logging, then set the DSN:
vault read secret/kdux/scp/production/scp_config.json | grep SENTRY
export SENTRY_DSN="<Sentry DSN value from Vault>"
Be sure to unset SENTRY_DSN when your updates are done, so development logs are not always sent to Sentry.
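The reason unsetting the variable matters can be illustrated with a small sketch of DSN-gated initialization. This is an assumption about the general pattern, not a copy of the pipeline's logging code: `configure_sentry` is a hypothetical name, and `sentry_sdk.init(dsn=...)` is the standard entry point of the Sentry Python SDK, which must be installed for the enabled branch to run.

```python
import os


def configure_sentry(env=os.environ):
    """Enable Sentry only when SENTRY_DSN is set; return whether it was enabled."""
    dsn = env.get("SENTRY_DSN")
    if not dsn:
        # Variable unset or empty: nothing is sent to Sentry
        return False
    # Imported lazily so this sketch runs without sentry_sdk when the DSN is unset
    import sentry_sdk

    sentry_sdk.init(dsn=dsn)
    return True
```

With this pattern, running `unset SENTRY_DSN` in your shell is enough to stop development events from being reported.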
Git hooks
After installing Ingest Pipeline, add Git hooks to help ensure code quality:
pre-commit install && pre-commit install -t pre-push
The hooks expect git-secrets to be set up. If you are a Broad Institute employee who has not done this yet, see broadinstitute/single_cell_portal_configs for specific guidance.
Bypass hooks
In rare cases, you might need to skip Git hooks, like so:
- Skip commit hooks:
git commit ... --no-verify
- Skip pre-push hooks:
git push ... --no-verify
Test
After installing:
source env/bin/activate
cd tests
# Run all tests
pytest
Some common pytest usage examples (run in /tests):
# Run all tests and see print() output
pytest -s
# Run only tests in test_ingest.py
pytest test_ingest.py
# Run all tests, show code coverage metrics
pytest --cov=../ingest/
For more, see https://docs.pytest.org/en/stable/usage.html.
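To make the `-s` option concrete, here is a minimal sketch of a test that pytest would discover. The file name `test_example.py` and the helper `normalize_gene_name` are hypothetical and are not part of this repository. By default pytest captures `print()` output and shows it only for failing tests; with `pytest -s` it is shown for passing tests too.

```python
# test_example.py (hypothetical file, for illustration only)


def normalize_gene_name(name):
    """Hypothetical helper: trim whitespace and upper-case a gene symbol."""
    return name.strip().upper()


def test_normalize_gene_name():
    result = normalize_gene_name("  tp53 ")
    # Captured by default; visible for a passing test only with `pytest -s`
    print(f"normalized: {result}")
    assert result == "TP53"
```

Running `pytest test_example.py -s` would execute only this file and display the printed line.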
Use
Run this every time you start a new terminal to work on this project:
source env/bin/activate
See ingest_pipeline.py for usage examples.
File details
Details for the file scp-ingest-pipeline-1.12.2.tar.gz.
File metadata
- Download URL: scp-ingest-pipeline-1.12.2.tar.gz
- Upload date:
- Size: 73.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | f13496772c94742f1a4674e8fc071fe17c56a622074cd7e5cd35eeab094e5a27
MD5 | 0d179b3032c60fc11bc93507398f9925
BLAKE2b-256 | b2bf37cba5426d94b4e505d56ba65fdabdc8b5d950aa0a150b5fdecb927ec71d
File details
Details for the file scp_ingest_pipeline-1.12.2-py3-none-any.whl.
File metadata
- Download URL: scp_ingest_pipeline-1.12.2-py3-none-any.whl
- Upload date:
- Size: 85.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | bc6aa5b4e344f78840238266337d7498699ed98de8542d0c17b3ae6da3eb1a2a
MD5 | 167c129f25d1c7618493c79f25c17e3d
BLAKE2b-256 | 7c3cbf481eb48fac7670e4cc759df8a6f1df3206d2457737197e58c8206f0528