A Python DSL for bioinformatics pipelines

DryPipe

Getting Started

1 Install dry-pipe in your virtualenv

python3 -m venv your_venv
source your_venv/bin/activate
pip install dry-pipe

2 Write your pipeline

from dry_pipe import DryPipe

@DryPipe.python_call()
def my_python_task_func(a, v):
    print(f"got {v}, and it's equal to 4321, and {a} is 456")
    return {
        "z": v * 2 + a
    }

def my_pipeline_task_generator(dsl):
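    # task1 runs a bash script: it writes x to the file f.txt and exports y
    task1 = dsl.task(key="task1") \
        .consumes(x=dsl.val(123)) \
        .produces(
            result=dsl.file("f.txt"),
            y=dsl.var(int)
        ) \
        .calls("""
            #!/usr/bin/env bash
            echo $x > $result
            export y=4321
        """)

    yield task1

    # task2 calls the Python function with a=456 and v set to task1's output y
    yield dsl.task(key="task2") \
        .consumes(a=dsl.val(456), v=task1.out.y) \
        .produces(z=dsl.var(int)) \
        .calls(my_python_task_func)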

def my_pipeline():
    return DryPipe.create_pipeline(my_pipeline_task_generator)

3 Run it

(This assumes the code above is saved as my_module.py, and that the directory containing my_module.py is on PYTHONPATH.)

drypipe run --pipeline='my_module:my_pipeline'
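
As a sanity check on what the example pipeline should compute, the data flow between the two tasks can be traced in plain Python. This is only a sketch: it does not import dry_pipe, the names `task1_outputs` and `trace_task2` are hypothetical stand-ins, and it assumes the `y` exported by the bash script reaches task2 as the integer 4321.

```python
# Plain-Python trace of the example's data flow (no dry_pipe involved).

# Hypothetical stand-in for what task1's bash script exports: y=4321.
task1_outputs = {"y": 4321}


def trace_task2(a, v):
    # Same arithmetic as my_python_task_func in the pipeline above.
    return {"z": v * 2 + a}


# task2 consumes a=456 and v=task1.out.y
task2_outputs = trace_task2(a=456, v=task1_outputs["y"])
print(task2_outputs)  # {'z': 9098}
```

With these inputs, z works out to 4321 * 2 + 456 = 9098.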

What is a pipeline?

A pipeline could be described as "a bunch of programs working together to analyze datasets".

Programs within a pipeline tend to:

  1. run for a long time
  2. need large amounts of resources (CPU, memory, disk space, etc.), sometimes on clusters (Slurm, Torque, etc.)
  3. have different command-line interfaces, file formats, etc.

The Task

A task represents the execution of a program or a Python function.

DAGs (Directed Acyclic Graphs) of Tasks

A bioinformatics pipeline could be described as "a bunch of programs working together to analyze datasets".

DAGs (directed acyclic graphs) are a convenient mathematical abstraction for representing things such as pipelines.

    flowchart LR
    A([A])
    B([B])
    C([C])
    D([D])
    E([E])
    A-->B
    A-->C
    B-->D
    C-->D
    D-->E

The following DAG represents the execution of a pipeline. Each node represents the execution of a program, and each arrow represents a producer/consumer relationship between two programs, labeled with the file passed from one to the other.

    flowchart LR
    A([prepare_datasets])
    B([blast])
    C([blast])
    D([report])
    A-->|f1.fasta|B
    A-->|f2.fasta|B
    A-->|f3.fasta|C
    B-->|blast-result.tsv|D
    C-->|blast-result.tsv|D
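
The producer/consumer relationships in this diagram can also be expressed as data. The following sketch lists each labeled arrow as a (producer, file, consumer) triple and groups the files by consuming task. The labels `blast_1` and `blast_2` are hypothetical names introduced here to tell the two blast nodes apart; nothing in this block is DryPipe API.

```python
from collections import defaultdict

# (producer, file, consumer) triples copied from the diagram above.
# blast_1 and blast_2 are hypothetical labels for the two blast nodes.
edges = [
    ("prepare_datasets", "f1.fasta", "blast_1"),
    ("prepare_datasets", "f2.fasta", "blast_1"),
    ("prepare_datasets", "f3.fasta", "blast_2"),
    ("blast_1", "blast-result.tsv", "report"),
    ("blast_2", "blast-result.tsv", "report"),
]

# Group incoming files by the task that consumes them.
inputs = defaultdict(list)
for producer, file_name, consumer in edges:
    inputs[consumer].append((producer, file_name))

for task, consumed in inputs.items():
    print(task, "consumes", consumed)
```

A task with no entry in `inputs` (here, prepare_datasets) has no upstream producer and can start immediately.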

A DryPipe pipeline definition consists of a Python generator function that yields the tasks of a DAG:

from dry_pipe import DryPipe

def conservation_pipeline_generator(dsl):
    yield dsl.task(key="blast1") \
        .consumes(a=dsl.file("chimp")) \
        .produces(result=dsl.file("f.txt")) \
        .calls("""
            #!/usr/bin/env bash
            blastp $a $b
        """)

def conservation_pipeline():
    return DryPipe.create_pipeline(conservation_pipeline_generator)

Pipeline vs Pipeline Instance
