PyPPL - A Python PiPeLine framework

Documentation | API | Change log

Features

  • Process caching.
  • Process reusability.
  • Process error handling.
  • Runner customization.
  • Running profile switching.
  • Plugin system.
  • Pipeline flowchart (using plugin pyppl-flowchart).
  • Pipeline report (using plugin pyppl-report).

Installation

pip install PyPPL

Writing pipelines with predefined processes

Suppose we are implementing the TCGA DNA-Seq Re-alignment Workflow (the leftmost part of the following figure). For demonstration purposes, we skip the QC and co-clean parts here.

[Figure: DNA-Seq variant calling pipeline]

demo.py:

from pyppl import PyPPL, Channel
# import predefined processes
from TCGAprocs import pBamToFastq, pAlignment, pBamSort, pBamMerge, pMarkDups

# Load the bam files
pBamToFastq.input = Channel.fromPattern('/path/to/*.bam')
# Align the reads to reference genome
pAlignment.depends = pBamToFastq
# Sort bam files
pBamSort.depends = pAlignment
# Merge bam files
pBamMerge.depends = pBamSort
# Mark duplicates
pMarkDups.depends = pBamMerge
# Export the results
pMarkDups.exdir = '/path/to/realigned_Bams'
# Specify the start process and run the pipeline
PyPPL().start(pBamToFastq).run()
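
The `depends` assignments above form a dependency chain, and the start process determines where execution begins. The following is a minimal sketch in plain Python of how such a chain can be turned into a run order. It is an illustration only, not PyPPL's implementation; the function `execution_order` and the `DEPENDS` mapping are hypothetical names introduced here.

```python
from collections import deque

# Hypothetical mirror of the `.depends` assignments in demo.py:
# each process name maps to the processes it depends on.
DEPENDS = {
    "pAlignment": ["pBamToFastq"],
    "pBamSort": ["pAlignment"],
    "pBamMerge": ["pBamSort"],
    "pMarkDups": ["pBamMerge"],
}


def execution_order(depends, start):
    """Return the processes reachable from `start` in a valid run order."""
    # Invert the mapping: which processes wait on a given process?
    dependents = {}
    for proc, upstreams in depends.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, []).append(proc)

    # Collect everything downstream of the start process.
    reachable, stack = set(), [start]
    while stack:
        proc = stack.pop()
        if proc not in reachable:
            reachable.add(proc)
            stack.extend(dependents.get(proc, []))

    # Kahn's algorithm: a process runs once all its upstream processes ran.
    pending = {
        proc: sum(1 for up in depends.get(proc, []) if up in reachable)
        for proc in reachable
    }
    ready = deque(sorted(proc for proc in reachable if pending[proc] == 0))
    order = []
    while ready:
        proc = ready.popleft()
        order.append(proc)
        for downstream in dependents.get(proc, []):
            pending[downstream] -= 1
            if pending[downstream] == 0:
                ready.append(downstream)
    return order


if __name__ == "__main__":
    print(" -> ".join(execution_order(DEPENDS, "pBamToFastq")))
```

For the linear chain in demo.py, the order is simply the order in which the processes were linked; the same approach also handles branching pipelines.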

Implementing individual processes

TCGAprocs.py:

from pyppl import Proc
pBamToFastq = Proc(desc = 'Convert bam files to fastq files.')
pBamToFastq.input = 'infile:file'
pBamToFastq.output = [
    'fq1:file:{{i.infile | stem}}_1.fq.gz',
    'fq2:file:{{i.infile | stem}}_2.fq.gz']
pBamToFastq.script = '''
bamtofastq collate=1 exclude=QCFAIL,SECONDARY,SUPPLEMENTARY \
    filename={{i.infile}} gz=1 inputformat=bam level=5 \
    outputdir={{job.outdir}} outputperreadgroup=1 tryoq=1 \
    outputperreadgroupsuffixF=_1.fq.gz \
    outputperreadgroupsuffixF2=_2.fq.gz \
    outputperreadgroupsuffixO=_o1.fq.gz \
    outputperreadgroupsuffixO2=_o2.fq.gz \
    outputperreadgroupsuffixS=_s.fq.gz
'''

pAlignment = Proc(desc = 'Align reads to reference genome.')
pAlignment.input = 'fq1:file, fq2:file'
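# Derive the output name from the first fastq file: name_1.fq.gz => name.bam
pAlignment.output = 'bam:file:{{i.fq1 | stem | stem | [:-2]}}.bam'
# <read_group> and <reference> are placeholders; replace them with real values
pAlignment.script = '''
bwa mem -t 8 -T 0 -R <read_group> <reference> {{i.fq1}} {{i.fq2}} | \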
    samtools view -Shb -o {{o.bam}} -
'''

pBamSort = Proc(desc = 'Sort bam files.')
pBamSort.input = 'inbam:file'
pBamSort.output = 'outbam:file:{{i.inbam | basename}}'
pBamSort.script = '''
java -jar picard.jar SortSam CREATE_INDEX=true INPUT={{i.inbam}} \
    OUTPUT={{o.outbam}} SORT_ORDER=coordinate VALIDATION_STRINGENCY=STRICT
'''

pBamMerge = Proc(desc = 'Merge bam files.')
pBamMerge.input = 'inbam:file'
pBamMerge.output = 'outbam:file:{{i.inbam | basename}}'
pBamMerge.script = '''
java -jar picard.jar MergeSamFiles ASSUME_SORTED=false CREATE_INDEX=true \
    INPUT={{i.inbam}} MERGE_SEQUENCE_DICTIONARIES=false OUTPUT={{o.outbam}} \
    SORT_ORDER=coordinate USE_THREADING=true VALIDATION_STRINGENCY=STRICT
'''

pMarkDups = Proc(desc = 'Mark duplicates.')
pMarkDups.input = 'inbam:file'
pMarkDups.output = 'outbam:file:{{i.inbam | basename}}'
pMarkDups.script = '''
java -jar picard.jar MarkDuplicates CREATE_INDEX=true INPUT={{i.inbam}} \
    OUTPUT={{o.outbam}} VALIDATION_STRINGENCY=STRICT
'''

Each process is independent, so you can also reuse it in other pipelines.
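
The output templates above derive file names from the inputs through chained filters such as `stem` and `basename`. As a rough sketch of what those chains compute, the plain-Python functions below mimic them with `pathlib`. This assumes that `stem` drops the directory and the last extension and that `basename` drops only the directory, which matches the `name_1.fq.gz => name.bam` comment above; PyPPL's own filters may handle additional cases. The helper names and sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def stem(path):
    """Drop the directory and the last extension."""
    return PurePosixPath(path).stem


def basename(path):
    """Drop the directory, keep the full file name."""
    return PurePosixPath(path).name


def fastq_outputs(infile):
    """Mimic '{{i.infile | stem}}_1.fq.gz' and '{{i.infile | stem}}_2.fq.gz'."""
    name = stem(infile)
    return name + "_1.fq.gz", name + "_2.fq.gz"


def alignment_output(fq1):
    """Mimic '{{i.fq1 | stem | stem | [:-2]}}.bam'."""
    # stem twice removes '.gz' then '.fq'; [:-2] removes the trailing '_1'.
    return stem(stem(fq1))[:-2] + ".bam"


if __name__ == "__main__":
    print(fastq_outputs("/path/to/sample.bam"))
    print(alignment_output("/work/sample_1.fq.gz"))
    print(basename("/work/sample.bam"))
```

Because each output name is computed from the input name alone, a process can be wired into another pipeline without changing its templates.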

Pipeline flowchart

# When running your pipeline, instead of:
#   PyPPL().start(pBamToFastq).run()
# do:
PyPPL().start(pBamToFastq).flowchart().run()

An SVG file whose name ends with .pyppl.svg will then be generated in the current directory. Note that this feature requires Graphviz and the graphviz package for Python.

See plugin details.
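
Since the flowchart needs both the Graphviz tools and the Python graphviz package, it can help to check for them before running a pipeline. The snippet below is one possible check, using only the standard library. It assumes that Graphviz is detected through its `dot` executable on the `PATH` and that the Python package is importable as `graphviz`; the function name `flowchart_prerequisites` is hypothetical and not part of PyPPL.

```python
import importlib.util
import shutil


def flowchart_prerequisites():
    """Report whether the two flowchart requirements appear to be installed."""
    return {
        # Graphviz itself: look for the `dot` executable on the PATH.
        "dot_executable": shutil.which("dot") is not None,
        # Python bindings: check that a module named `graphviz` can be found.
        "graphviz_python_package": importlib.util.find_spec("graphviz") is not None,
    }


if __name__ == "__main__":
    for name, found in flowchart_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```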

Pipeline report

See plugin details

pPyClone.report = """
## {{title}}

PyClone[1] is a tool that uses a probabilistic model to infer clonal population structure from deep NGS sequencing data.

![Similarity matrix]({{path.join(job.o.outdir, "plots/loci/similarity_matrix.svg")}})

```table
caption: Clusters
file: "{{path.join(job.o.outdir, "tables/cluster.tsv")}}"
rows: 10
```

[1]: Roth, Andrew, et al. "PyClone: statistical inference of clonal population structure in cancer." Nature methods 11.4 (2014): 396.
"""

# or use a template file

pPyClone.report = "file:/path/to/template.md"
PyPPL().start(pPyClone).run().report('/path/to/report', title = 'Clonality analysis using PyClone')

Full documentation

ReadTheDocs
