Genomics pipelines
Project description
- Integrated with LabxDB: all required annotations (labels, strand, paired, etc.) are retrieved from LabxDB. This integration is optional.
- Based on existing robust technologies. No new language to learn.
- LabxPipe pipelines are defined in JSON text files.
- LabxPipe is written in Python. Using conventions, such as standardized input and output filenames, ensures compatibility between tasks.
- Supports both simple and complex pipelines.
- By default, pipelines are linear (one step after the other). Branching is easily achieved by defining a previous step (using the `step_input` parameter), allowing users to create any dependency between tasks (see the sketch after this list).
- Parallelized using robust asynchronous threads from the Python standard library.
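As a minimal sketch of a pipeline file (names, steps and values are illustrative; real pipelines carry more parameters, see the configuration tables below), a linear pipeline simply lists its steps in order in `analysis`, and a branching step points back to an earlier one with `step_input`:

```
{
    "name": "sketch",
    "run_refs": ["AGR000850"],
    "analysis": [
        {"step_name": "preparing", "step_function": "readknead"},
        {"step_name": "aligning", "step_function": "star"},
        {"step_name": "plotting", "step_function": "readknead", "step_input": "preparing"}
    ]
}
```

The assumption here is that `step_input` names the earlier step whose output the branching step should read; `config/pipelines/mrna_seq_with_plotting.json` in this repository demonstrates `step_input` in a complete pipeline.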
Examples
See the JSON files in the `config/pipelines` directory of this repository.
| Pipeline JSON file | Description |
|---|---|
| `mrna_seq.json` | mRNA-seq. |
| `mrna_seq_profiling_bam.json` | mRNA-seq. Genomic coverage profiles using GeneAbacus. BAM and SAM outputs. |
| `mrna_seq_no_db.json` | mRNA-seq. No LabxDB. |
| `mrna_seq_with_plotting.json` | mRNA-seq. Plotting non-mapped reads. Demonstrates `step_input`. |
| `mrna_seq_cufflinks.json` | mRNA-seq. Replaces GeneAbacus with Cufflinks. |
| `chip_seq.json` | ChIP-seq. Bowtie2 and Samtools to uniquify reads. |
| `chip_seq_user_function.json` | ChIP-seq. Bowtie2 and Samtools to uniquify reads. Genomic coverage profiles using GeneAbacus. Peak-calling using MACS3 via a user-defined step/function. |
The following demonstrates how to apply the `mrna_seq.json` pipeline. It requires:
- LabxDB
- FASTQ files for the samples named `AGR000850` and `AGR000912`:

  ```
  /plus/data/seq/by_run/AGR000850
  ├── 23_009_R1.fastq.zst
  └── 23_009_R2.fastq.zst
  /plus/data/seq/by_run/AGR000912
  ├── 65_009_R1.fastq.zst
  └── 65_009_R2.fastq.zst
  ```
Note: `mrna_seq_no_db.json` demonstrates how to use LabxPipe without LabxDB: it only requires FASTQ files (in the `path_seq_run` directory, see above).
Requirements:
- LabxDB. Alternatively, `mrna_seq_no_db.json` doesn't require LabxDB.
- ReadKnead to trim reads.
- STAR and a genome index in the directory defined by `path_star_index`.
- GeneAbacus to count reads and generate genomic profiles for tracks.
- Start the pipeline:

  ```
  lxpipe run --pipeline mrna_seq.json \
             --worker 2 \
             --processor 16
  ```

  Output is written in the `path_output` directory.
- Create a report:

  ```
  lxpipe report --pipeline mrna_seq.json
  ```

  The report file `mrna_seq.xlsx` should be created in the same directory as `mrna_seq.json`.
- Extract output file(s) to use them directly, for instance to load them in IGV. For example:
  - To extract BAM files and rename them using the sample label:

    ```
    lxpipe extract --pipeline mrna_seq.json \
                   --files aligning,accepted_hits.sam.zst \
                   --label
    ```

  - To extract BigWig profile files and rename them using the sample label and reference, in addition to the original filename used as a suffix:

    ```
    lxpipe extract --pipeline mrna_seq.json \
                   --files profiling,genome_plus.bw \
                   --label \
                   --reference \
                   --suffix
    ```

  Use `-d`/`--dry_run` to test the extract command before applying it.
- Merge gene/mRNA counts generated by GeneAbacus in the `counting` directory:

  ```
  lxpipe merge-count --pipeline mrna_seq.json \
                     --step counting
  ```
- Create a trackhub. Requirements:
  - ChromosomeMappings file (to map chromosome names from Ensembl/NCBI to UCSC)
  - Tabulated file (with chromosome name and length)

  Execute in a separate directory:

  ```
  lxpipe trackhub --runs AGR000850,AGR000912 \
                  --species_ucsc danRer11 \
                  --path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
                  --path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
                  --input_sam \
                  --bam_names accepted_hits.sam.zst \
                  --make_config \
                  --make_trackhub \
                  --make_bigwig \
                  --processor 16
  ```

  The directory is then ready to be shared by a web server for display in the UCSC genome browser.
Configuration
Parameters can be defined globally. See the `config` directory of this repository for examples.
Writing pipelines
Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest definition takes precedence: `path_seq_run` defined in `/etc/hts/labxpipe.json` is used by default, but if `path_seq_run` is defined in the pipeline file, it will be used instead.
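For instance, a sketch of this precedence (paths are illustrative): a global `/etc/hts/labxpipe.json` could define a default sequencing-run directory,

```
{
    "path_seq_run": "/plus/data/seq/by_run"
}
```

and a pipeline file that sets the same parameter overrides it for that pipeline:

```
{
    "name": "mrna_seq",
    "path_seq_run": "/plus/scratch/seq/by_run"
}
```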
Main parameters
Parameter | Type |
---|---|
name | string |
path_output | string |
path_seq_run | string |
path_local_steps | string |
path_annots | string |
path_bowtie2_index | string |
path_bwa-mem2_index | string |
path_minimap2_index | string |
path_star_index | string |
fastq_exts | []strings |
adaptors | {} |
logging_level | string |
run_refs | []strings |
replicate_refs | []strings |
ref_info_source | []strings |
ref_infos | {} |
analysis | [{}, {}, ...] |
Parameters for all steps
Parameter | Type |
---|---|
step_name | string |
step_function | string |
step_desc | string |
force | boolean |
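Combining the two tables above, a pipeline file could look like the sketch below (only a few parameters are shown, and all values are illustrative):

```
{
    "name": "mrna_seq",
    "path_output": "/plus/scratch/sai/mrna_seq",
    "path_star_index": "/plus/scratch/sai/annots/star_index",
    "run_refs": ["AGR000850", "AGR000912"],
    "analysis": [
        {"step_name": "preparing", "step_function": "readknead", "step_desc": "Trim reads"},
        {"step_name": "aligning", "step_function": "star", "force": false}
    ]
}
```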
Step-specific parameters
| Step | Synonym | Parameter | Type |
|---|---|---|---|
| readknead | preparing | options | []strings |
| | | ops_r1 | [{}, {}, ...] |
| | | ops_r2 | [{}, {}, ...] |
| | | plot_fastq_in | boolean |
| | | plot_fastq | boolean |
| | | fastq_out | boolean |
| | | zip_fastq_out | string |
| bowtie2 | genomic_aligning | options | []strings |
| | | index | string |
| | | output | string |
| | | output_unfiltered | string |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| bwa-mem2 | | options | []strings |
| | | index | string |
| | | output | string |
| | | compress_output | boolean |
| | | compress_output_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| minimap2 | | options | []strings |
| | | index | string |
| | | output | string |
| | | compress_output | boolean |
| | | compress_output_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| star | aligning | options | []strings |
| | | index | string |
| | | output_type | []strings |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | compress_unmapped | boolean |
| | | compress_unmapped_cmd | string |
| cufflinks | | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | features | [{}, {}, ...] |
| geneabacus | counting | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | path_annots | string |
| | | features | [{}, {}, ...] |
| samtools_sort | | options | []strings |
| | | sort_by_name_bam | boolean |
| samtools_uniquify | | options | []strings |
| | | sort_by_name_bam | boolean |
| | | index_bam | boolean |
| cleaning | | steps | [{}, {}, ...] |
◆ indicates exclusive options. For example, either `create_bam` or `index_bam` can be used, but not both.
Sample-specific parameters
These parameters are automatically populated if using LabxDB, or sourced from `ref_infos`. They can be changed manually in any step; for example, setting `paired` to `false` will ignore second reads in that step (see the sketch after the table below).
Parameter | Type |
---|---|
label_short | string |
paired | boolean |
directional | boolean |
r1_strand | string |
quality_scores | string |
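For example, a sketch of overriding one of these parameters in a single step (other parameters are omitted and the step shown is illustrative): setting `paired` to `false` inside a step's entry makes that step ignore second reads, even if the run itself is paired-end.

```
{
    "analysis": [
        {
            "step_name": "counting",
            "step_function": "geneabacus",
            "paired": false
        }
    ]
}
```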
User-defined step
In addition to the provided steps/functions, e.g. `bowtie2`, `star` or `geneabacus`, users can define their own steps, usable in LabxPipe pipelines. LabxPipe will import user-defined steps:
- Written in Python
- One step per file, with the `.py` extension, located in the directory defined by `path_local_steps`
- Each step, defined in an individual file, requires:
  - A `functions` variable listing the step name(s)
  - A function named `run` with the 3 parameters `path_in`, `path_out` and `params`

  For example:

  ```
  # Name(s) under which this step can be referenced in pipelines
  functions = ['macs3']

  # Called by LabxPipe with the step's input and output directories and its parameters
  def run(path_in, path_out, params):
      ...
  ```
An example of a user-defined function providing peak-calling using MACS3 is available in `config/user_steps/macs3.py` in this repository. An example of a pipeline using the MACS3 step is available in `config/pipelines/chip_seq_user_function.json` in this repository.
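As a sketch of how a user-defined step could be referenced from a pipeline (assuming the `step_function` value matches a name exported in the step file's `functions` variable; the step name, description and path below are illustrative, not copied from `chip_seq_user_function.json`):

```
{
    "path_local_steps": "/plus/scratch/sai/user_steps",
    "analysis": [
        {
            "step_name": "peak_calling",
            "step_function": "macs3",
            "step_desc": "Peak-calling using MACS3"
        }
    ]
}
```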
Demultiplexing sequencing reads: lxpipe demultiplex
- Demultiplex reads based on barcode sequences from the `Second barcode` field in LabxDB.
- Demultiplexing is done using ReadKnead. The most important element for demultiplexing is the ReadKnead pipeline. Pipelines are identified using the `Adapter 3'` field in LabxDB.
- Example of simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the `Adapter 3'` field is set to `sRNA 1.5` in LabxDB for these samples) with the following pipeline:

  ```
  {
      "sRNA 1.5": {
          "R1": [
              {
                  "name": "demultiplex",
                  "end": 5,
                  "max_mismatch": 1
              }
          ],
          "R2": null
      }
  }
  ```

  The barcode sequences are added by LabxPipe using the `Second barcode` field in LabxDB.
- Example of iCLIP demultiplexing. In Vejnar et al., iCLIP is demultiplexed (the `Adapter 3'` field is set to `TruSeq-DMS+A Index` in LabxDB for these samples) using the following pipeline:

  ```
  {
      "TruSeq-DMS+A Index": {
          "R1": [
              {
                  "name": "clip",
                  "end": 5,
                  "length": 4,
                  "add_clipped": true
              },
              {
                  "name": "trim",
                  "end": 3,
                  "algo": "bktrim",
                  "min_sequence": 5,
                  "keep": ["trim_exact", "trim_align"]
              },
              {
                  "name": "length",
                  "min_length": 6
              },
              {
                  "name": "demultiplex",
                  "end": 3,
                  "max_mismatch": 1,
                  "length_ligand": 2
              },
              {
                  "name": "length",
                  "min_length": 15
              }
          ],
          "R2": null
      }
  }
  ```

  The pipeline is stored in `demux_truseq_dms_a.json`. The barcode sequences are added by LabxPipe using the `Second barcode` field in LabxDB. (NB: published demultiplexed data were generated using `"algo": "align"` with a minimum score of 80 instead of `"algo": "bktrim"`.)

  The pipeline was then tested by running:

  ```
  lxpipe demultiplex --bulk HHYLKADXX \
                     --path_demux_ops demux_truseq_dms_a.json \
                     --path_seq_prepared prepared \
                     --demux_nozip \
                     --processor 1 \
                     --demux_verbose_level 20 \
                     --no_readonly
  ```

  This output is very verbose: for every read, output from every step of the demultiplexing pipeline is reported. To get consistent output, `--processor` must be set to `1`. Output is written in the local directory `prepared`.

  And finally, once the pipeline is validated (data is written in the `path_seq_prepared` directory, see here):

  ```
  lxpipe demultiplex --bulk HHYLKADXX \
                     --path_demux_ops demux_truseq_dms_a.json \
                     --processor 10
  ```
License
LabxPipe is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright © 2013-2023 Charles E. Vejnar