PBS-builder
A simplified workflow management system for scalable data analysis.
Change log
- Version 0.2.0: moved sample_sheet from the header to the job section; added group_sheet support.
- Version 0.1.0: first functional version.
Usage
Installation
pbs-builder is heavily inspired by Snakemake, but uses native SLURM/Torque job dependencies to build analysis pipelines.
pbs-builder runs on Python 3.7+ and requires only the tomli package; no other dependencies are needed.
Install pbs-builder using the following command:
pip install pbs-builder
Tutorial: Setting up the TOML pipeline configuration
A quick example
The main input is a configuration in TOML format. Here's a simple example:
# Parameters
pipeline = "kallisto_lncipedia"
version = "2023.5.2"
work_dir = "/histor/zhao/zhangjy/test_data/kallisto_lncipedia"
pbs_scheduler = "torque"
pbs_server = 'mu02'
pbs_queue = 'batch'
pbs_walltime = 1200
# Reference
index = "/histor/public/database/LNCipedia/merged_lnc52hc_gencodev29.annotated.fa.idx"
# Executable
kallisto = "/histor/public/software/kallisto/kallisto"
# Rules
[job.kallisto]
sample_sheet = "/histor/zhao/zhangjy/test_data/sample_sheet.csv"
input = [
{r1 = "{sample.read1}"},
{r2 = "{sample.read2}"},
]
out_dir = "{work_dir}/out_kallisto"
output = "{out_dir}/{sample.name}/abundance.h5"
threads = 32
log = "{work_dir}/logs/{sample.name}.kallisto.logs"
shell = """
mkdir -p {out_dir}
{kallisto} quant -t {threads} -b 100 -i {index} -o {out_dir}/{sample.name} {input.r1} {input.r2}
"""
See example/Hi-C.toml and example/samples.csv for another example of a TOML and sample sheet configuration.
The header section of the TOML file
In the header section, the pipeline, version, and work_dir fields are required to specify the pipeline information:
field | description |
---|---|
pipeline | name of the pipeline |
version | version of the pipeline |
work_dir | path to the main output directory of the pipeline, will be created automatically if it does not exist |
Supported variables for submitting PBS jobs are:
field | description |
---|---|
pbs_scheduler | name of the scheduler, should be either 'torque' or 'slurm' |
pbs_server | name of the submitting host |
pbs_queue | queue for submitting jobs |
pbs_walltime | maximum number of hours each job is allowed to run |
threads | default number of threads for PBS jobs |
Other variables defined in the header section (e.g. kallisto and index) can also be accessed globally.
NOTE: only str and int variables are allowed in the header section.
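For instance, a minimal header sketch (all paths and values are hypothetical) of what is and is not accepted:
# str and int values are accepted in the header
pipeline = "example"
version = "2023.1.0"
work_dir = "/path/to/work_dir"
threads = 16
# other value types are not allowed here, per the rule above
# fraction = 0.5
# verbose = true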
Job configuration
The name of each job should be strictly in [job.job_name] format (e.g. [job.kallisto], [job.sleuth]).
The only mandatory job field is shell, but some variables can also be defined for each job separately:
Variable | Type | Description | Default |
---|---|---|---|
depend | str / list | name(s) of job dependencies | no dependencies |
sample_sheet | str | path to the sample_sheet file | optional |
group_sheet | str | path to the group_sheet file | optional |
input | str / list of dicts | path(s) to input file(s) | no specific input files |
output | str / list of dicts | path(s) to output file(s), used to check whether the job has already finished | no specific output files |
threads | int | number of CPU cores | same as threads in the header section |
gpus | int | number of GPUs | 0 |
queue | str | name of the queue | same as pbs_queue |
walltime | int | maximum number of hours to run | same as pbs_walltime |
dir | str | path to the working directory | same as work_dir |
log | str | path to the log file | {work_dir}/logs/{job}.logs |
shell | str | shell script to execute | required |
All variables defined in each job can be accessed locally using {var_name}, e.g. {threads} and {dir}.
The priority of variables: sample > group > job > global
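For example, a minimal sketch (values are hypothetical) of a job-level threads taking priority over the global default:
threads = 8                 # global default, defined in the header

[job.sort]
threads = 32                # job-level value takes priority over the global one
shell = """
sort --parallel={threads} {work_dir}/input.txt > {work_dir}/sorted.txt   # {threads} resolves to 32, not 8
"""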
The sample sheet
The sample_sheet is a comma-separated file, where the first row gives the name of each column and the remaining rows represent the samples to be analysed.
The only required column is name, which defines the name of each sample. Other columns can be named as you wish.
name,read1,read2
26-ADJ,/data/26-ADJ_1.fq.gz,/data/26-ADJ_2.fq.gz
Any column defined in sample_sheet.csv can be accessed in the TOML file using {sample.name} or {sample.col_name}.
If sample_sheet is defined, a copy of the job will be submitted for each sample.
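As an illustration, given the sample sheet above, the following hypothetical job (the fastqc path is an assumption) would be submitted once for sample 26-ADJ, with {sample.read1} resolving to /data/26-ADJ_1.fq.gz:
[job.fastqc]
sample_sheet = "/data/sample_sheet.csv"
shell = """
mkdir -p {work_dir}/qc/{sample.name}
/path/to/fastqc -o {work_dir}/qc/{sample.name} {sample.read1} {sample.read2}
"""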
The group sheet
The group_sheet should be in the same format as the sample sheet file, but its columns are accessed using {group.name} and {group.col_name}.
If both sample_sheet and group_sheet are defined, a copy of the job will be submitted for each sample and group combination.
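A sketch of combined usage (the group sheet contents, paths, and featureCounts invocation are hypothetical): with the single sample above and the two groups below, two copies of the job would be submitted.
# Hypothetical group_sheet.csv:
#   name,annotation
#   lncipedia,/ref/lncipedia.gtf
#   gencode,/ref/gencode.gtf

[job.count]
sample_sheet = "/data/sample_sheet.csv"
group_sheet = "/data/group_sheet.csv"
shell = """
featureCounts -T {threads} -a {group.annotation} -o {work_dir}/{group.name}/{sample.name}.counts {work_dir}/bam/{sample.name}.bam
"""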
Job dependencies
The depend field is the name / list of names of job dependencies. For example,
# Single dependency, job will start when [job.hisat2] for the same sample has finished successfully.
depend = "hisat2"
# Multiple dependencies, job will only start after both [job.hisat2] and [job.stringtie] have finished successfully.
depend = ["hisat2", "stringtie"]
Input / output files
The input and output fields can be given as a str or as a list of dicts. For example,
# Single input, defined value can be accessed with {input} or {output}
input = "path/to/{sample.name}.fastq.gz"
output = "path/to/{sample.name}.sorted.bam"
# Multiple input, defined values can be accessed with {input.read1} / {input.read2} / {output.bam} / {output.unmapped}.
input = [
{read1 = "path/to/{sample.name}_r1.fastq.gz"},
{read2 = "path/to/{sample.name}_r2.fastq.gz"},
]
output = [
{bam = "path/to/{sample.name}.sorted.bam"},
{unmapped = "path/to/{sample.name}.unmapped.fastq"}
]
The output field is used to determine whether a job has already finished. If all listed files/directories exist, the job will not be submitted.
Resource Limits
This section describes the variables that control the resources requested from the SLURM/Torque system.
- threads controls the number of requested CPU cores for each job.
- gpus controls the number of requested GPU cards for each job.
- queue controls the queue each job is submitted to.
- walltime controls the maximum requested run time (hours) for each job.
- dir controls the working directory of each job.
- log controls the stdout/stderr of each job. The stdout and stderr are combined into a single file.
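For instance, a GPU job might override the global defaults like this (a sketch; the tool name and values are hypothetical):
[job.basecall]
threads = 16                           # request 16 CPU cores instead of the header default
gpus = 2                               # request 2 GPU cards
queue = "gpu"                          # submit to a different queue than pbs_queue
walltime = 48                          # allow up to 48 hours of run time
log = "{work_dir}/logs/basecall.logs"  # combined stdout/stderr goes here
shell = """
some_basecaller --threads {threads} {work_dir}/raw
"""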
Command Lines
The shell field is the command line for each job. This is a mandatory field, and should be provided as a multi-line string delimited by """ """.
For example:
shell = """
mkdir -p {out_dir}
source path/to/some/scripts
{kallisto} quant -t {threads} -b 100 -i {index} -o {out_dir}/{sample.name} {input.r1} {input.r2}
"""
Usage
Submitting a pipeline
For each sample, qbatch will submit all jobs in the defined order, and dependencies will be added if the depend field is defined.
qbatch your_pipeline.toml
After a successful submission, the following directory structure will be created:
├── checkpoint # IDs of successfully completed jobs
├── info # JSON-formatted pipeline information
├── pipeline # a copy of the submitted TOML configuration
├── jobs # a tab-separated record of job submissions with 3 columns: name/job_id/pbs_command
├── logs # directory for log files
│ ├── sample1.job1.logs
│ ├── sample1.job2.logs
│ ├── sample2.job1.logs
│ └── sample2.job2.logs
└── sc # directory for PBS scripts
├── job1.sample1.sh
├── job2.sample1.sh
├── job1.sample2.sh
└── job2.sample2.sh
Cancelling a pipeline
If the dependencies of a job fail to complete, the job is cancelled in the Torque scheduler. Under SLURM, however, such jobs get stuck and are marked as having dependencies that were never satisfied. A simple qcancel command is therefore provided to cancel all unfinished jobs in a pipeline:
qcancel work_dir/jobs
Continuing a pipeline
You can also continue unfinished jobs using the recorded pipeline file. Only jobs recorded in jobs and not present in checkpoint will be resubmitted.
qbatch -c work_dir/pipeline
Testing a pipeline
You can use the -d option to perform a dry run, where jobs are parsed but not submitted to the PBS server:
qbatch -d pipeline.toml
NOTE
- All paths specified in the TOML and sample_sheet / group_sheet files should be absolute paths.
- Values written as {var} are parsed line by line, so each variable must be defined on a preceding line before it is referenced (see the sketch below).
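For example, a minimal sketch of correct and incorrect ordering:
# Works: out_dir is defined before it is referenced
out_dir = "{work_dir}/out"
output = "{out_dir}/result.txt"

# Fails: {out_dir} is referenced before it is defined
# output = "{out_dir}/result.txt"
# out_dir = "{work_dir}/out"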