Extends the scons build tool for reproducible research in bioinformatics.

Project description

This package extends the scons build tool for the construction of reproducible workflows in bioinformatics.

Documentation is available on GitHub: http://nhoffman.github.io/bioscons/

Background

Why does SCons make sense for reproducible bioinformatics pipelines?

  • SCons has a sophisticated mechanism for determining dependencies, so re-running SCons re-executes only the steps that need updating.

  • The definition of sources and targets (see example below) defines a dependency graph that supports parallelization of tasks.

  • Most of the work of pipelines is done by external programs that are easy to execute with a shell-like syntax for commands.

  • On the other hand, SCons also allows arbitrary Python code in the build script, so a pipeline can draw on the Python standard library, Biopython, NumPy, etc.

  • Subsequent steps in an SCons build are expressed in terms of file objects rather than a mess of filenames.

  • Steps in the pipeline are implemented as Commands, each of which wraps a shell command or a Python function in a way that consistently channels inputs into outputs.

  • SCons provides multiple mechanisms for cleanly executing isolated steps of the workflow (for example, previewing the commands to be executed using scons -n, then pasting a single command directly into the shell).

  • SCons validates files and can fail incrementally.
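To make the first point above concrete, here is a toy sketch of how a build tool can decide whether a step needs to run again: it compares a signature of each source file's contents with the signature recorded at the last build. This illustrates the idea only and is not SCons's actual implementation; the names signature, needs_rebuild, and recorded are hypothetical.

```python
import hashlib
import os


def signature(path):
    """Return a hex digest of a file's contents."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_rebuild(target, sources, recorded):
    """Decide whether `target` must be rebuilt from `sources`.

    `recorded` maps each source path to the signature it had when the
    target was last built (a hypothetical stand-in for a build database).
    """
    if not os.path.exists(target):
        return True
    # Rebuild if any source is new or its contents have changed.
    return any(recorded.get(src) != signature(src) for src in sources)
```

Under this scheme, touching a file without changing its contents does not trigger a rebuild, because only the contents feed into the signature.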

scons example

Here’s a simple but complete example of a build script demonstrating the execution of two commands:

vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')

env = Environment(variables=vars)

alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET'
    )

tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET'
)

Here we have defined some variables for the build environment (i.e., the input and output directories) and constructed an object (env) defining the execution environment. The output of the first command is used as the input to the second, which explicitly defines a dependency between the two commands. For the most part, the names and paths of the inputs and outputs are abstracted away with shell-like variable substitution rules. It’s easy to build pipelines involving complex dependencies that nonetheless remain extremely easy to read.
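The Background section notes that a Command can also wrap a Python function. As a minimal sketch, SCons calls such a function with the arguments (target, source, env), where target and source are lists of file objects whose string form is the file path. The function count_seqs and the output file name seqs.count.txt are hypothetical and are not part of the original example.

```python
def count_seqs(target, source, env):
    """Write the number of FASTA records in source[0] to target[0]."""
    # str() turns an SCons file object (or a plain string) into a path.
    with open(str(source[0])) as infile:
        n = sum(1 for line in infile if line.startswith('>'))
    with open(str(target[0]), 'w') as outfile:
        outfile.write('{}\n'.format(n))
    return None  # returning None (or 0) signals success to SCons


# In an SConstruct, with env and alignment defined as in the example above:
#
# counts = env.Command(
#     target='$out/seqs.count.txt',
#     source=alignment,
#     action=count_seqs
# )
```

Because the function receives file objects rather than hard-coded names, it can be attached to any step whose source is a FASTA file.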

So, what does bioscons provide?

Mainly, integration with the Slurm job scheduler via a subclass of the SCons Environment object. For example, here’s how to modify the above example so that each job is dispatched to a Slurm queue with a specified number of cores requested per job:

from bioscons.slurm import SlurmEnvironment

vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')

env = SlurmEnvironment(variables=vars, use_cluster=True)

alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET',
    ncores=4
    )

tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET',
    ncores=10
)

bioscons also provides additional utilities for creating an inventory of targets, timing actions, and modifying file paths. See http://nhoffman.github.io/bioscons/ for complete package documentation.

Installation

dependencies

  • Python 3.5+

  • scons 2.4+

installation scenarios

Many installation scenarios are possible, involving different combinations of system package installers, pip, and virtualenv vs system installs. We describe only the recommended configuration here. Note that although bioscons should work with scons 2.4+, scons itself is compatible with Python 3 only in versions > 3.0.0.

Install both scons and bioscons to a virtualenv

We strongly recommend installing both this package and scons to a virtualenv rather than to your system, both because of idiosyncrasies in the scons installation script and because package managers are likely to install an older version of scons. This option requires Python 3.5+.

Start by creating a virtualenv:

python3 -m venv bioscons-env

Due to some quirks in the scons installation process, you must ensure that pip is the most recent version, and wheel is installed:

source bioscons-env/bin/activate
pip install -U pip wheel
pip install bioscons

Take care that pip corresponds to the intended version of the python interpreter; a safer option may be to use pip3.

installation from source (for development)

git clone https://github.com/nhoffman/bioscons.git
cd bioscons
python3 -m venv bioscons-env
source bioscons-env/bin/activate
pip install -U pip wheel
pip install -e .
pip install -r requirements.txt  # to run tests, build docs

Defining the execution environment for reproducible pipelines

If you intend to run the version of scons installed to the virtualenv, it is a good idea to include the following check in your SConstruct:

import os
import sys

venv = os.environ.get('VIRTUAL_ENV')
if not venv:
    sys.exit('--> an active virtualenv is required')

It is also best to explicitly define the PATH used to locate the executables run by your pipeline.
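One possible way to do this is sketched below: build a PATH string that places the virtualenv's bin directory first, then pass it to the environment through the ENV argument, which is the dictionary of environment variables SCons uses when it runs commands. The helper pipeline_path and its default list of system directories are assumptions made for illustration; adjust them to wherever your tools are installed.

```python
import os


def pipeline_path(venv, extra_dirs=('/usr/local/bin', '/usr/bin', '/bin')):
    """Return a PATH string with the virtualenv's bin directory first."""
    return os.pathsep.join([os.path.join(venv, 'bin')] + list(extra_dirs))


# In an SConstruct (Environment is provided by SCons), after checking
# that VIRTUAL_ENV is set:
#
# env = Environment(
#     variables=vars,
#     ENV={'PATH': pipeline_path(os.environ['VIRTUAL_ENV'])}
# )
```

Spelling the directories out keeps the build from silently picking up a different version of a program that happens to be on the PATH of the calling shell.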

Monitoring Slurm tasks

A useful way to monitor a Slurm queue on a Linux system is to use watch:

watch squeue

For more information on managing Slurm tasks and installing Slurm on your system, see https://slurm.schedmd.com/documentation.html
