Extends the scons build tool for reproducible research in bioinformatics.
Project description
This package extends the scons build tool for the construction of reproducible workflows in bioinformatics.
Documentation is available on GitHub: http://nhoffman.github.io/bioscons/
Background
Why does SCons make sense for reproducible bioinformatics pipelines?
SCons has a sophisticated mechanism for determining dependencies, meaning that re-running SCons will only re-execute steps needing updating.
The definition of sources and targets (see example below) defines a dependency graph that supports parallelization of tasks.
Most of the work of pipelines is done by external programs that are easy to execute with a shell-like syntax for commands.
On the other hand, SCons also allows the execution of arbitrary Python code in your build script, so you can leverage the power of the Python standard library, Biopython, NumPy, etc.
Rather than dealing with a mess of filenames, subsequent steps in an SCons build are expressed in terms of file objects.
Steps in the pipeline are implemented as Commands, each of which wraps a shell command or a Python function in a way that consistently channels inputs into outputs.
SCons provides multiple mechanisms for cleanly executing isolated steps of the workflow (for example, by previewing the commands to be executed using scons -n and pasting a single command directly into the shell).
SCons validates files and can fail incrementally: when a step fails, the targets already built remain valid and are not rebuilt on the next run.
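To make the first point concrete, here is a minimal sketch of the idea behind content-based dependency checking: a step is re-run only when its target is missing or the hash of its source differs from the hash recorded after the last run. This is not SCons code. The names signature, needs_rebuild and record, and the in-memory dictionary standing in for a signature database, are hypothetical, and SCons's own bookkeeping is considerably more involved.

```python
import hashlib
import os
import tempfile


def signature(path):
    """Return a hex digest of the file's contents."""
    with open(path, 'rb') as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def needs_rebuild(source, target, recorded):
    """A step is out of date if its target is missing or its source changed."""
    if not os.path.exists(target):
        return True
    return recorded.get(source) != signature(source)


def record(source, recorded):
    """Remember the source signature after a successful run."""
    recorded[source] = signature(source)


# Small demonstration in a scratch directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'seqs.fasta')
tgt = os.path.join(workdir, 'seqs.aln.fasta')
db = {}  # stands in for a persistent signature database

with open(src, 'w') as handle:
    handle.write('>seq1\nACGT\n')

first_run = needs_rebuild(src, tgt, db)   # target does not exist yet

with open(tgt, 'w') as handle:            # pretend the alignment step ran
    handle.write('>seq1\nACGT\n')
record(src, db)

unchanged = needs_rebuild(src, tgt, db)   # nothing changed since the last run

with open(src, 'a') as handle:            # edit the source file
    handle.write('>seq2\nTTGA\n')
after_edit = needs_rebuild(src, tgt, db)  # source contents now differ
```

Because the decision depends on file contents rather than on timestamps, touching a file without changing it does not trigger a rebuild in this sketch.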
scons example
Here’s a simple but complete example of a build script demonstrating the execution of two commands:
vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')
env = Environment(variables=vars)
alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET'
)
tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET'
)
Here we have defined some variables for the build environment (i.e., the input and output directories) and constructed an object (env) defining the execution environment. The output of the first command is used as the input to the second, thereby explicitly defining a dependency between the two commands, and for the most part the names and paths of the inputs and outputs are abstracted away with shell-like variable substitution rules. It’s easy to build pipelines involving complex dependencies that nonetheless remain easy to read.
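The substitution step can be imitated in plain Python to show roughly what the first command expands to. This is only an analogy built on string.Template from the standard library, not the SCons substitution engine; the expand helper is hypothetical, and SCons may normalize paths differently.

```python
from string import Template

# Values that the Variables object would supply (defaults from the example).
variables = {'data': './data', 'out': './output'}


def expand(text, **extra):
    """Substitute $name placeholders using the build variables plus extras."""
    return Template(text).substitute(dict(variables, **extra))


source = expand('$data/seqs.fasta')
target = expand('$out/seqs.aln.fasta')

# $SOURCE and $TARGET are filled in per command from its source and target.
command = expand('muscle -in $SOURCE -out $TARGET',
                 SOURCE=source, TARGET=target)
print(command)
```

Overriding a variable (for example, setting out to another directory) changes every path derived from it without touching the command definitions.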
So, what does bioscons provide?
Mainly, integration with the Slurm job scheduler via a subclass of the SCons Environment object. For example, here’s how to modify the above example so that each job is dispatched to a Slurm queue with a specified number of cores requested per job:
from bioscons.slurm import SlurmEnvironment
vars = Variables()
vars.Add('data', help='source data', default='./data')
vars.Add('out', help='output directory', default='./output')
env = SlurmEnvironment(variables=vars, use_cluster=True)
alignment = env.Command(
    target='$out/seqs.aln.fasta',
    source='$data/seqs.fasta',
    action='muscle -in $SOURCE -out $TARGET',
    ncores=4
)
tree = env.Command(
    target='$out/seqs.tre',
    source=alignment,
    action='FastTreeMP -nt -gtr $SOURCE > $TARGET',
    ncores=10
)
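As a rough illustration of what dispatching a command to Slurm amounts to, the sketch below prefixes a shell action with an srun call that requests a number of cores. It is an assumption about the general approach, not the bioscons implementation: the helper slurm_action and its parameters are hypothetical, and only the standard srun options --cpus-per-task and --partition are used.

```python
import shlex


def slurm_action(action, ncores=1, queue=None, use_cluster=True):
    """Return the action prefixed with an srun call (hypothetical helper)."""
    if not use_cluster:
        return action  # run locally, unchanged
    prefix = ['srun', '--cpus-per-task={}'.format(ncores)]
    if queue:
        prefix.append('--partition={}'.format(shlex.quote(queue)))
    return ' '.join(prefix) + ' ' + action


aligned = slurm_action('muscle -in $SOURCE -out $TARGET', ncores=4)
local = slurm_action('muscle -in $SOURCE -out $TARGET', use_cluster=False)
print(aligned)
```

Keeping the cluster switch in one place means the same build script can be run on a workstation without Slurm by turning use_cluster off.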
bioscons also provides some additional utilities for creating an inventory of targets, timing actions, and modifying file paths. See http://nhoffman.github.io/bioscons/ for complete package documentation.
Installation
dependencies
Python 3.5+
scons 2.4+
installation scenarios
Various installation scenarios are possible, involving different combinations of system package installers, pip, and virtualenv versus system installs. Only the recommended configuration is described here. Note that although bioscons should work with scons 2.4+, scons itself is compatible with Python 3 only in versions > 3.0.0.
Install both scons and bioscons to a virtualenv
We strongly recommend installing both this package and scons to a virtualenv rather than to your system, both because of idiosyncrasies in the scons installation script and because package managers are likely to install an older version of scons. This option is available using Python 3.5+.
Start by creating a virtualenv:
python3 -m venv bioscons-env
Due to some quirks in the scons installation process, you must ensure that pip is the most recent version, and wheel is installed:
source bioscons-env/bin/activate
pip install -U pip wheel
pip install bioscons
Take care that pip corresponds to the intended version of the python interpreter; a safer option may be to use pip3.
installation from source (for development)
git clone https://github.com/nhoffman/bioscons.git
cd bioscons
python3 -m venv bioscons-env
source bioscons-env/bin/activate
pip install -U pip wheel
pip install -e .
pip install -r requirements.txt  # to run tests, build docs
Defining the execution environment for reproducible pipelines
When intending to run the version of scons installed to the virtualenv, it is a good idea to include the following directive in your SConstruct:
import os
import sys
# refuse to run unless a virtualenv is active
venv = os.environ.get('VIRTUAL_ENV')
if not venv:
    sys.exit('--> an active virtualenv is required')
It is also best to define explicitly the $PATH used to locate the executables called within your pipeline, so that the same programs are found wherever the build is run.
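One way to do this is to assemble the search path yourself, with the virtualenv's bin directory first. The following is a minimal sketch under stated assumptions: the helper pipeline_path, the example virtualenv location and the list of system directories are all hypothetical and should be adapted to your system. ENV is the SCons construction variable that holds the environment passed to executed commands.

```python
import os


def pipeline_path(venv, system_dirs=('/usr/local/bin', '/usr/bin', '/bin')):
    """Build an explicit PATH with the virtualenv's bin directory first."""
    return os.pathsep.join([os.path.join(venv, 'bin')] + list(system_dirs))


# '/path/to/bioscons-env' is a placeholder for the active virtualenv.
path = pipeline_path('/path/to/bioscons-env')
print(path)

# In an SConstruct, the result could be handed to the environment, e.g.:
#   env = Environment(ENV={'PATH': pipeline_path(venv)}, variables=vars)
```

Listing the directories explicitly makes it obvious which copy of each program the pipeline will run.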
Monitoring Slurm tasks
A useful way to monitor a Slurm queue on a Linux system is to use watch:
watch squeue
For more information on managing Slurm tasks and installing Slurm on your system, see https://slurm.schedmd.com/documentation.html