A Python PiPeLine framework
PyPPL - A Python PiPeLine framework
Features
- Process caching.
- Process reusability.
- Process error handling.
- Runner customization.
- Running profile switching.
- Plugin system.
- Pipeline flowchart (using plugin pyppl-flowchart).
- Pipeline report (using plugin pyppl-report).
Installation
pip install PyPPL
Writing pipelines with predefined processes
Suppose we are implementing the TCGA DNA-Seq Re-alignment Workflow (the leftmost part of the following figure). For demonstration, we skip the QC and co-clean steps here.
demo.py:
from pyppl import PyPPL, Channel
# import predefined processes
from TCGAprocs import pBamToFastq, pAlignment, pBamSort, pBamMerge, pMarkDups
# Load the bam files
pBamToFastq.input = Channel.fromPattern('/path/to/*.bam')
# Align the reads to reference genome
pAlignment.depends = pBamToFastq
# Sort bam files
pBamSort.depends = pAlignment
# Merge bam files
pBamMerge.depends = pBamSort
# Mark duplicates
pMarkDups.depends = pBamMerge
# Export the results
pMarkDups.exdir = '/path/to/realigned_Bams'
# Specify the start process and run the pipeline
PyPPL().start(pBamToFastq).run()
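To make the role of `depends` and `start` more concrete, here is a minimal, standard-library-only sketch of how a run order can be derived from such declarations. It is not PyPPL's implementation: `MiniProc` and `execution_order` are hypothetical names introduced for illustration, and the sketch only assumes that each process lists its upstream processes and that the pipeline runs everything reachable from the start process.

```python
from collections import deque


class MiniProc:
    """Hypothetical stand-in for a pipeline process (NOT pyppl.Proc)."""

    def __init__(self, name):
        self.name = name
        self.depends = []  # upstream processes this one waits for


def execution_order(start, procs):
    """Return names of processes reachable from `start`, upstream first."""
    # Invert the `depends` links: upstream name -> downstream processes.
    downstream = {p.name: [] for p in procs}
    for p in procs:
        for dep in p.depends:
            downstream[dep.name].append(p)

    # Collect every process reachable from the start process.
    reachable = {start.name: start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in downstream[cur.name]:
            if nxt.name not in reachable:
                reachable[nxt.name] = nxt
                queue.append(nxt)

    # Kahn's algorithm: a process is ready once all of its
    # reachable upstream processes have been scheduled.
    pending = {
        name: sum(d.name in reachable for d in p.depends)
        for name, p in reachable.items()
    }
    ready = deque(p for name, p in reachable.items() if pending[name] == 0)
    order = []
    while ready:
        cur = ready.popleft()
        order.append(cur.name)
        for nxt in downstream[cur.name]:
            pending[nxt.name] -= 1
            if pending[nxt.name] == 0:
                ready.append(nxt)
    return order


# Rebuild the chain from demo.py with the hypothetical MiniProc class.
names = ["pBamToFastq", "pAlignment", "pBamSort", "pBamMerge", "pMarkDups"]
procs = {name: MiniProc(name) for name in names}
for upstream, current in zip(names, names[1:]):
    procs[current].depends = [procs[upstream]]

# Pass the processes in reverse to show the result does not rely on list order.
order = execution_order(procs["pBamToFastq"], list(reversed(list(procs.values()))))
print(" -> ".join(order))
```

Because the order is computed from the dependency links rather than from the order of the statements, the `depends` assignments in `demo.py` could be written in any order.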
Implementing individual processes
TCGAprocs.py:
from pyppl import Proc
pBamToFastq = Proc(desc = 'Convert bam files to fastq files.')
pBamToFastq.input = 'infile:file'
pBamToFastq.output = [
'fq1:file:{{i.infile | stem}}_1.fq.gz',
'fq2:file:{{i.infile | stem}}_2.fq.gz']
pBamToFastq.script = '''
bamtofastq collate=1 exclude=QCFAIL,SECONDARY,SUPPLEMENTARY \
filename={{i.infile}} gz=1 inputformat=bam level=5 \
outputdir={{job.outdir}} outputperreadgroup=1 tryoq=1 \
outputperreadgroupsuffixF=_1.fq.gz \
outputperreadgroupsuffixF2=_2.fq.gz \
outputperreadgroupsuffixO=_o1.fq.gz \
outputperreadgroupsuffixO2=_o2.fq.gz \
outputperreadgroupsuffixS=_s.fq.gz
'''
pAlignment = Proc(desc = 'Align reads to reference genome.')
pAlignment.input = 'fq1:file, fq2:file'
# name_1.fq.gz => name.bam
pAlignment.output = 'bam:file:{{i.fq1 | stem | stem | [:-2]}}.bam'
pAlignment.script = '''
bwa mem -t 8 -T 0 -R <read_group> <reference> {{i.fq1}} {{i.fq2}} | \
samtools view -Shb -o {{o.bam}} -
'''
pBamSort = Proc(desc = 'Sort bam files.')
pBamSort.input = 'inbam:file'
pBamSort.output = 'outbam:file:{{i.inbam | basename}}'
pBamSort.script = '''
java -jar picard.jar SortSam CREATE_INDEX=true INPUT={{i.inbam}} \
OUTPUT={{o.outbam}} SORT_ORDER=coordinate VALIDATION_STRINGENCY=STRICT
'''
pBamMerge = Proc(desc = 'Merge bam files.')
pBamMerge.input = 'inbam:file'
pBamMerge.output = 'outbam:file:{{i.inbam | basename}}'
pBamMerge.script = '''
java -jar picard.jar MergeSamFiles ASSUME_SORTED=false CREATE_INDEX=true \
INPUT={{i.inbam}} MERGE_SEQUENCE_DICTIONARIES=false OUTPUT={{o.outbam}} \
SORT_ORDER=coordinate USE_THREADING=true VALIDATION_STRINGENCY=STRICT
'''
pMarkDups = Proc(desc = 'Mark duplicates.')
pMarkDups.input = 'inbam:file'
pMarkDups.output = 'outbam:file:{{i.inbam | basename}}'
pMarkDups.script = '''
java -jar picard.jar MarkDuplicates CREATE_INDEX=true INPUT={{i.inbam}} \
OUTPUT={{o.outbam}} VALIDATION_STRINGENCY=STRICT
'''
Each process is independent, so you can also reuse the processes in other pipelines.
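The output templates above derive file names from the input through chained filters, for example `{{i.fq1 | stem | stem | [:-2]}}.bam`. The following plain-Python sketch mimics that naming logic so the `name_1.fq.gz => name.bam` comment is easier to follow. It assumes that the `stem` filter returns the base name without its last extension; the helper functions `stem`, `fastq_names` and `aligned_bam_name` are hypothetical and are not part of PyPPL.

```python
import os


def stem(path):
    """Base name without its last extension (assumed behaviour of `stem`)."""
    return os.path.splitext(os.path.basename(path))[0]


def fastq_names(bam):
    """Mirror '{{i.infile | stem}}_1.fq.gz' and '{{i.infile | stem}}_2.fq.gz'."""
    base = stem(bam)
    return base + "_1.fq.gz", base + "_2.fq.gz"


def aligned_bam_name(fq1):
    """Mirror '{{i.fq1 | stem | stem | [:-2]}}.bam'."""
    # The first stem drops '.gz', the second drops '.fq',
    # and [:-2] drops the trailing '_1'.
    return stem(stem(fq1))[:-2] + ".bam"


fq1, fq2 = fastq_names("/path/to/sample.bam")
print(fq1, fq2)
print(aligned_bam_name(fq1))
```

Applying `stem` twice is needed because the FASTQ files carry a double extension (`.fq.gz`).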
Pipeline flowchart
# When running your pipeline, instead of:
# PyPPL().start(pBamToFastq).run()
# do:
PyPPL().start(pBamToFastq).flowchart().run()
An SVG file whose name ends with .pyppl.svg will then be generated in the current directory.
Note that this feature requires Graphviz and the graphviz package for Python.
See plugin details.
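For a rough idea of what such a flowchart encodes, the sketch below writes the demo pipeline's dependency chain as Graphviz DOT text using only the standard library. This is an illustration under assumptions, not the plugin's code: the `to_dot` helper is hypothetical, and the plugin's real output, node styling and file naming may differ.

```python
def to_dot(edges, name="pipeline"):
    """Build DOT source for a directed graph from (upstream, downstream) pairs."""
    lines = ["digraph %s {" % name]
    for upstream, downstream in edges:
        lines.append('    "%s" -> "%s";' % (upstream, downstream))
    lines.append("}")
    return "\n".join(lines)


# The linear chain declared in demo.py.
chain = ["pBamToFastq", "pAlignment", "pBamSort", "pBamMerge", "pMarkDups"]
edges = list(zip(chain, chain[1:]))

dot = to_dot(edges)
print(dot)
```

If Graphviz is installed, saving this text to a file and running `dot -Tsvg` on it produces an SVG rendering of the chain.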
Pipeline report
See plugin details
pPyClone.report = """
## {{title}}
PyClone[1] is a tool that uses a probabilistic model to infer clonal population structure from deep NGS data.
![Similarity matrix]({{path.join(job.o.outdir, "plots/loci/similarity_matrix.svg")}})
```table
caption: Clusters
file: "{{path.join(job.o.outdir, "tables/cluster.tsv")}}"
rows: 10
```
[1]: Roth, Andrew, et al. "PyClone: statistical inference of clonal population structure in cancer." Nature methods 11.4 (2014): 396.
"""
# or use a template file
pPyClone.report = "file:/path/to/template.md"
PyPPL().start(pPyClone).run().report('/path/to/report', title = 'Clonality analysis using PyClone')
Full documentation