Extends the scons build tool for reproducible research in bioinformatics.

Project description

This package extends the scons build tool for the construction of reproducible workflows in bioinformatics.

Documentation is available on GitHub: http://nhoffman.github.io/bioscons/

Background

Why does SCons make sense for reproducible bioinformatics pipelines?

  • SCons has a sophisticated mechanism for determining dependencies, so re-running SCons re-executes only the steps that need updating.

  • Declaring sources and targets (see the example below) defines a dependency graph that supports parallelization of tasks.

  • Most of the work in a pipeline is done by external programs, which are easy to execute using a shell-like syntax for commands.

  • SCons also allows arbitrary Python code in the build script, so you can draw on the Python standard library, Biopython, NumPy, and so on.

  • Rather than dealing with a mess of filenames, subsequent steps in an SCons build are expressed in terms of file objects.

  • Steps in the pipeline are implemented as Commands, each wrapping a shell command or a Python function in a way that consistently channels inputs into outputs.

  • SCons provides multiple mechanisms for cleanly executing isolated steps of the workflow (for example, previewing the commands to be executed using scons -n and pasting a single command directly into the shell).

  • SCons validates files and can fail incrementally.
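The dependency-graph idea in the list above can be illustrated outside of SCons. The following is a minimal sketch using only the Python standard library (graphlib, Python 3.9+); it is not how SCons is implemented. The helper name build_batches and the extra seqs.stats.txt target are hypothetical. It groups files into batches whose members do not depend on one another and so could, in principle, be built in parallel.

```python
from graphlib import TopologicalSorter

# Each target maps to the set of files it is built from.
# The stats file is a hypothetical extra step added for illustration.
deps = {
    'output/seqs.aln.fasta': {'data/seqs.fasta'},
    'output/seqs.tre': {'output/seqs.aln.fasta'},
    'output/seqs.stats.txt': {'output/seqs.aln.fasta'},
}


def build_batches(graph):
    """Return a list of batches; every file in a batch depends only on
    files in earlier batches, so a batch could be processed in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(build_batches(deps), start=1):
    print(number, batch)
```

Here the tree and the stats file land in the same batch because both depend only on the alignment.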

scons example

Here’s a simple but complete example of a build script demonstrating the execution of two commands:

vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')

env = Environment(variables=vars)

alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET'
    )

tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET'
)

Here we have defined some variables for the build environment (i.e., the input and output directories) and constructed an object (env) defining the execution environment. The output of the first command is used as the input to the second, which explicitly defines a dependency between the two commands. For the most part, the names and paths of the inputs and outputs are abstracted away with shell-like variable substitution rules. It’s easy to build pipelines with complex dependencies that nonetheless remain easy to read.
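Both commands above use shell actions, but as noted in the background section, an action can also be a Python function. SCons calls such a function with the arguments target, source, and env, where target and source are lists of file objects. The sketch below shows one possible function action; the name count_seqs, the demo helper, and the seqs.count.txt output file are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def count_seqs(target, source, env):
    """SCons-style function action: write the number of FASTA records in
    source[0] to target[0]. Paths are obtained with str() because SCons
    passes file objects rather than plain strings."""
    with open(str(source[0])) as infile:
        n = sum(1 for line in infile if line.startswith('>'))
    with open(str(target[0]), 'w') as outfile:
        outfile.write('{}\n'.format(n))
    return None  # None (or 0) signals success to SCons


# In an SConstruct this might be wired up as follows (not executed here):
#   counts = env.Command(
#       target='$out/seqs.count.txt',
#       source='$data/seqs.fasta',
#       action=count_seqs)


def demo():
    """Exercise the action outside of SCons using plain path strings."""
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, 'seqs.fasta')
        out = os.path.join(tmp, 'seqs.count.txt')
        with open(fasta, 'w') as f:
            f.write('>a\nACGT\n>b\nGGCC\n')
        count_seqs([out], [fasta], env=None)
        with open(out) as f:
            return f.read().strip()
```

Because the function only relies on str() of its arguments, it can be tested without SCons, as demo() does.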

So, what does bioscons provide?

Mainly, integration with the Slurm job scheduler via a subclass of the SCons Environment object. For example, here’s how to modify the above example so that each job is dispatched to a Slurm queue with a specified number of cores requested for each job:

from bioscons.slurm import SlurmEnvironment

vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')

env = SlurmEnvironment(variables=vars, use_cluster=True)

alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET',
    ncores=4
    )

tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET',
    ncores=10
)

bioscons also provides additional utilities for creating an inventory of targets, timing actions, and modifying file paths. See http://nhoffman.github.io/bioscons/ for complete package documentation.
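To give a rough sense of what dispatching a command to Slurm with a core count can involve, here is a minimal sketch that prefixes a shell action with srun. This is an assumption made for illustration, not the bioscons implementation: the helper name srun_command is hypothetical, and SlurmEnvironment may construct its commands differently. The only Slurm-specific detail used is the standard srun option --cpus-per-task.

```python
def srun_command(action, ncores=1, use_cluster=True):
    """Return a shell command that runs `action` under srun, requesting
    `ncores` CPUs. Hypothetical helper for illustration only.

    With use_cluster=False the action is returned unchanged, so the same
    pipeline definition can also run locally."""
    if not use_cluster:
        return action
    return 'srun --cpus-per-task={} {}'.format(int(ncores), action)


print(srun_command('muscle -in $SOURCE -out $TARGET', ncores=4))
print(srun_command('muscle -in $SOURCE -out $TARGET', use_cluster=False))
```

The $SOURCE and $TARGET placeholders are left untouched here; in a real build, SCons substitutes them before the command is run.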

Installation

dependencies

  • Python 3.5+

  • scons 2.4+

installation scenarios

Various installation scenarios are possible, involving different combinations of system package installers, pip, and virtualenv versus system installs. We describe only the recommended configuration here. Note that although bioscons should work with scons 2.4+, scons itself is compatible with Python 3 only in versions > 3.0.0.

Install both scons and bioscons to a virtualenv

We strongly recommend installing both this package and scons to a virtualenv rather than to your system, due to idiosyncrasies in the scons installation script and the fact that package managers are likely to install an older version of scons. This option is available with Python 3.5+.

Start by creating a virtualenv:

python3 -m venv bioscons-env

Due to some quirks in the scons installation process, you must ensure that pip is the most recent version, and wheel is installed:

source bioscons-env/bin/activate
pip install -U pip wheel
pip install bioscons

Take care that pip corresponds to the intended version of the Python interpreter; a safer option may be to use pip3.

installation from source (for development)

git clone https://github.com/nhoffman/bioscons.git
cd bioscons
python3 -m venv bioscons-env
source bioscons-env/bin/activate
pip install -U pip wheel
pip install -e .
pip install -r requirements.txt  # to run tests, build docs

Defining the execution environment for reproducible pipelines

When intending to run the version of scons installed to the virtualenv, it is a good idea to include the following directive in your SConstruct:

import os
import sys

venv = os.environ.get('VIRTUAL_ENV')
if not venv:
    sys.exit('--> an active virtualenv is required')

It is best to explicitly define the $PATH used to locate the executables that your pipeline runs.
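One way to do this is sketched below, on the assumption that the pipeline's executables live in the virtualenv's bin directory plus a few fixed system directories. The helper name pipeline_path and the default directory list are hypothetical. SCons runs commands with the environment variables held in the ENV construction variable, which does not inherit the caller's PATH by default, so the resulting string would be passed in there.

```python
import os


def pipeline_path(venv, system_dirs=('/usr/local/bin', '/usr/bin', '/bin')):
    """Build an explicit PATH string: the virtualenv's bin directory first,
    followed by a fixed list of system directories (hypothetical defaults)."""
    return os.pathsep.join([os.path.join(venv, 'bin')] + list(system_dirs))


# In an SConstruct the result could be passed to the environment used for
# running commands (not executed here):
#   env = Environment(
#       variables=vars,
#       ENV=dict(os.environ, PATH=pipeline_path(venv)))

print(pipeline_path('/path/to/bioscons-env'))
```

Putting the virtualenv first means that tools installed there take precedence over system-wide copies of the same name.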

Monitoring Slurm tasks

A useful way to monitor a Slurm queue on a Linux system is to use watch:

watch squeue

For more information on managing Slurm tasks and installing Slurm on your system, see https://slurm.schedmd.com/documentation.html
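For a more compact summary than the full squeue listing, job states can also be tallied from a script. The following is a minimal sketch, assuming Python 3.7+ and that squeue is on the PATH; the function names count_states and current_states are hypothetical. It relies on the squeue options -h (suppress the header) and -o '%T' (print only the job state).

```python
import subprocess
from collections import Counter


def count_states(squeue_output):
    """Count jobs per state, given text with one state name per line."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip())


def current_states():
    """Query squeue for the state of each job (requires Slurm)."""
    result = subprocess.run(
        ['squeue', '-h', '-o', '%T'],
        check=True, capture_output=True, text=True)
    return count_states(result.stdout)


# Example with canned output, so that no Slurm installation is needed:
print(count_states('RUNNING\nPENDING\nRUNNING\n'))
```

Separating the parsing from the squeue call keeps the counting logic testable on machines without Slurm.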
