Genomics pipelines
- Integrated with LabxDB: all required annotations (labels, strand, paired, etc.) are retrieved from LabxDB. This integration is optional.
- Based on existing robust technologies. No new language.
- LabxPipe pipelines are defined in JSON text files.
- LabxPipe is written in Python. Shared conventions, such as input and output filenames, ensure compatibility between tasks.
- Simple and complex pipelines.
- By default, pipelines are linear (one step after the other).
- Branching is easily achieved by defining a previous step (using the `step_input` parameter), allowing users to create any dependency between tasks.
- Parallelized using robust asynchronous threads from the Python standard library.
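
The linear-by-default rule and `step_input` branching described in the list above can be pictured with a small sketch. This is not LabxPipe's own code: the helper `resolve_inputs` is hypothetical, and the step dictionaries only use the `step_name` and `step_input` keys named in this document.

```python
def resolve_inputs(steps):
    """Map each step name to the name of the step whose output it reads.

    Hypothetical helper, not part of LabxPipe. By default a step reads from
    the step listed just before it; an explicit "step_input" overrides that.
    """
    inputs = {}
    previous = None
    for step in steps:
        name = step["step_name"]
        source = step.get("step_input", previous)
        # A step can only depend on a step that was declared earlier.
        if source is not None and source not in inputs:
            raise ValueError(f"{name}: unknown step_input {source!r}")
        inputs[name] = source
        previous = name
    return inputs


# Illustrative pipeline: "plotting" branches off "aligning" instead of
# following "counting", the step listed just before it.
steps = [
    {"step_name": "preparing"},
    {"step_name": "aligning"},
    {"step_name": "counting"},
    {"step_name": "plotting", "step_input": "aligning"},
]
dependencies = resolve_inputs(steps)
print(dependencies)
```

In this sketch both `counting` and `plotting` read from `aligning`, which is the kind of dependency a linear pipeline alone cannot express.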
Examples
See the JSON files in `config/pipelines` of this repository.
Pipeline JSON file | Description |
---|---|
mrna_seq.json | mRNA-seq |
mrna_seq_no_db.json | mRNA-seq. No LabxDB |
mrna_seq_with_plotting.json | mRNA-seq. Plotting non-mapped reads. Demonstrates step_input |
mrna_seq_cufflinks.json | mRNA-seq. Replaces GeneAbacus by Cufflinks |
chip_seq.json | ChIP-seq using Bowtie2 and Samtools to uniquify reads. |
The following demonstrates how to apply the `mrna_seq.json` pipeline. It requires:
- LabxDB
- FASTQ files for the samples named `AGR000850` and `AGR000912`:

      /plus/data/seq/by_run/AGR000850
      ├── 23_009_R1.fastq.zst
      └── 23_009_R2.fastq.zst
      /plus/data/seq/by_run/AGR000912
      ├── 65_009_R1.fastq.zst
      └── 65_009_R2.fastq.zst

Note: `mrna_seq_no_db.json` demonstrates how to use LabxPipe without LabxDB: it only requires FASTQ files (in the `path_seq_run` directory, see above).
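
As a rough illustration of the directory layout above, the sketch below groups the FASTQ files of one run into read pairs. It is an assumption-laden example, not LabxPipe's file discovery: the helper `find_fastq_pairs` is hypothetical, and the `_R1`/`_R2` naming and `.fastq.zst` extension are simply taken from the listing above.

```python
import tempfile
from pathlib import Path


def find_fastq_pairs(run_dir, ext=".fastq.zst"):
    """Group the FASTQ files of one run into (R1, R2) pairs by shared prefix.

    Hypothetical helper, not LabxPipe's implementation. R2 is None when only
    the first read is present.
    """
    run_dir = Path(run_dir)
    pairs = []
    for r1 in sorted(run_dir.glob(f"*_R1{ext}")):
        prefix = r1.name[: -len(f"_R1{ext}")]
        r2 = run_dir / f"{prefix}_R2{ext}"
        pairs.append((r1.name, r2.name if r2.exists() else None))
    return pairs


# Recreate one of the run directories shown above with empty files.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "AGR000850"
    run.mkdir()
    for name in ("23_009_R1.fastq.zst", "23_009_R2.fastq.zst"):
        (run / name).touch()
    pairs = find_fastq_pairs(run)
print(pairs)
```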
Requirements:
- LabxDB. Alternatively, `mrna_seq_no_db.json` doesn't require LabxDB.
- ReadKnead to trim reads.
- STAR and a genome index in the directory defined by `path_star_index`.
- GeneAbacus to count reads and generate genomic profiles for tracks.

- Start the pipeline:

      lxpipe run --pipeline mrna_seq.json \
                 --worker 2 \
                 --processor 16

  Output is written to the `path_output` directory.
- Create a report:

      lxpipe report --pipeline mrna_seq.json

  The report file `mrna_seq.xlsx` should be created in the same directory as `mrna_seq.json`.
- Merge the gene/mRNA counts generated by GeneAbacus in the `counting` directory:

      lxpipe merge-count --pipeline mrna_seq.json \
                         --step counting

- Trackhub. Requirements:
  - ChromosomeMappings file (to map chromosome names from Ensembl/NCBI to UCSC)
  - Tabulated file (with chromosome names and lengths)

  Execute in a separate directory:

      lxpipe trackhub --runs AGR000850,AGR000912 \
                      --species_ucsc danRer11 \
                      --path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
                      --path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
                      --input_sam \
                      --bam_names accepted_hits.sam.zst \
                      --make_config \
                      --make_trackhub \
                      --make_bigwig \
                      --processor 16

  The directory is then ready to be shared by a web server for display in the UCSC Genome Browser.
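
To make the role of the two trackhub input files more concrete, here is a minimal sketch of renaming chromosomes from Ensembl-style to UCSC-style names. It assumes both files are tab-separated with two columns (source name and UCSC name; chromosome name and length); the helpers `load_mapping` and `rename_chromosomes` and the miniature inputs are hypothetical, and this is not how `lxpipe trackhub` is implemented.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column, tab-separated mapping (source name -> UCSC name)."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and entries without a UCSC equivalent.
        if len(row) >= 2 and row[1]:
            mapping[row[0]] = row[1]
    return mapping


def rename_chromosomes(handle, mapping):
    """Return (UCSC name, length) pairs, dropping unmapped chromosomes."""
    renamed = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0] in mapping:
            renamed.append((mapping[row[0]], int(row[1])))
    return renamed


# Made-up miniature inputs; names and lengths are placeholders only.
mapping_text = "1\tchr1\nMT\tchrM\nKN149696.2\t\n"
lengths_text = "1\t1000\nMT\t200\nKN149696.2\t50\n"

mapping = load_mapping(io.StringIO(mapping_text))
ucsc_lengths = rename_chromosomes(io.StringIO(lengths_text), mapping)
print(ucsc_lengths)
```

Dropping sequences without a UCSC name is a design choice of this sketch; a real workflow might prefer to report them instead.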
Configuration
Parameters can be defined globally. See the `config` directory of this repository for examples.
Writing pipelines
Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest definition takes precedence: for example, `path_seq_run` defined in `/etc/hts/labxpipe.json` is used by default, but if `path_seq_run` is also defined in the pipeline file, the pipeline value is used instead.
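
The precedence rule can be sketched with the standard library's `ChainMap`. This only illustrates the lookup order described above and is not LabxPipe's implementation: the function `resolve_parameters` is hypothetical, and the paths and values are placeholders.

```python
from collections import ChainMap


def resolve_parameters(global_cfg, pipeline_cfg, run_cfg, step_cfg):
    """Combine the four levels; more specific levels shadow more general ones."""
    # ChainMap searches its maps left to right, so the step level comes first.
    return ChainMap(step_cfg, run_cfg, pipeline_cfg, global_cfg)


# Placeholder values for each level (global, pipeline, run, step).
global_cfg = {"path_seq_run": "/plus/data/seq/by_run", "logging_level": "info"}
pipeline_cfg = {"path_seq_run": "/data/project/by_run"}
run_cfg = {"paired": True}
step_cfg = {"paired": False}

params = resolve_parameters(global_cfg, pipeline_cfg, run_cfg, step_cfg)
print(params["path_seq_run"])   # pipeline value shadows the global one
print(params["paired"])         # step value shadows the run value
print(params["logging_level"])  # only defined globally
```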
Main parameters
Parameter | Type |
---|---|
name | string |
path_output | string |
path_seq_run | string |
path_annots | string |
path_bowtie2_index | string |
path_star_index | string |
fastq_exts | []strings |
adaptors | {} |
logging_level | string |
run_refs | []strings |
replicate_refs | []strings |
ref_info_source | []strings |
ref_infos | {} |
analysis | [{}, {}, ...] |
Parameters for all functions
Parameter | Type |
---|---|
step_name | string |
step_function | string |
step_desc | string |
force | boolean |
Function-specific parameters
Function | Synonym | Parameter | Type |
---|---|---|---|
readknead | preparing | options | []strings |
readknead | preparing | ops_r1 | [{}, {}, ...] |
readknead | preparing | ops_r2 | [{}, {}, ...] |
readknead | preparing | plot_fastq_in | boolean |
readknead | preparing | plot_fastq | boolean |
readknead | preparing | fastq_out | boolean |
readknead | preparing | zip_fastq_out | string |
bowtie2 | genomic_aligning | options | []strings |
bowtie2 | genomic_aligning | index | string |
bowtie2 | genomic_aligning | output | string |
bowtie2 | genomic_aligning | output_unfiltered | string |
bowtie2 | genomic_aligning | compress_sam | boolean |
bowtie2 | genomic_aligning | compress_sam_cmd | string |
bowtie2 | genomic_aligning | create_bam | boolean |
bowtie2 | genomic_aligning | index_bam | boolean |
star | aligning | options | []strings |
star | aligning | index | string |
star | aligning | output_type | []strings |
star | aligning | compress_sam | boolean |
star | aligning | compress_sam_cmd | string |
star | aligning | compress_unmapped | boolean |
star | aligning | compress_unmapped_cmd | string |
cufflinks | | options | []strings |
cufflinks | | inputs | [{}, {}, ...] |
cufflinks | | features | [{}, {}, ...] |
geneabacus | counting | options | []strings |
geneabacus | counting | inputs | [{}, {}, ...] |
geneabacus | counting | features | [{}, {}, ...] |
uniquify | | options | []strings |
uniquify | | sort_by_name_bam | boolean |
uniquify | | index_bam | boolean |
cleaning | | steps | [{}, {}, ...] |
Sample-specific parameters. These are automatically populated if using LabxDB, or sourced from `ref_infos`. They can be changed manually in any function (for example, setting `paired` to `False` ignores second reads in that step).
Parameter | Type |
---|---|
label_short | string |
paired | boolean |
directional | boolean |
r1_strand | string |
quality_scores | string |
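
When these parameters are written by hand rather than retrieved from LabxDB, a quick type check against the table above can catch mistakes early. The sketch below is a hypothetical helper, not part of LabxPipe; it only checks the types listed in the table, and the sample values are placeholders because the accepted strings are not specified here.

```python
# Expected Python types, transcribed from the table above.
EXPECTED_TYPES = {
    "label_short": str,
    "paired": bool,
    "directional": bool,
    "r1_strand": str,
    "quality_scores": str,
}


def check_sample_parameters(sample):
    """Return a list of problems found in a sample-parameter dictionary."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in sample:
            problems.append(f"missing {key}")
        elif not isinstance(sample[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


# Placeholder sample: "paired" has the wrong type, "quality_scores" is absent.
sample = {
    "label_short": "sample-1",
    "paired": "yes",
    "directional": True,
    "r1_strand": "+",
}
problems = check_sample_parameters(sample)
print(problems)
```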
Demultiplexing sequencing reads: lxpipe demultiplex
- Demultiplex reads based on barcode sequences from the `Second barcode` field in LabxDB.
- Demultiplexing is done with ReadKnead. The most important component for demultiplexing is the ReadKnead pipeline. Pipelines are identified using the `Adapter 3'` field in LabxDB.
- Example for simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the `Adapter 3'` field is set to `sRNA 1.5` in LabxDB for these samples) with the following pipeline:

      {
          "sRNA 1.5": {
              "R1": [{"name": "demultiplex", "end": 5, "max_mismatch": 1}],
              "R2": null
          }
      }

  The barcode sequences are added by LabxPipe using the `Second barcode` field in LabxDB.
- Example for iCLIP demultiplexing. In Vejnar et al., iCLIP is demultiplexed (the `Adapter 3'` field is set to `TruSeq-DMS+A Index` in LabxDB for these samples) using the following pipeline:

      {
          "TruSeq-DMS+A Index": {
              "R1": [{"name": "clip", "end": 5, "length": 4, "add_clipped": true},
                     {"name": "trim", "end": 3, "algo": "bktrim", "min_sequence": 5, "keep": ["trim_exact", "trim_align"]},
                     {"name": "length", "min_length": 6},
                     {"name": "demultiplex", "end": 3, "max_mismatch": 1, "length_ligand": 2},
                     {"name": "length", "min_length": 15}],
              "R2": null
          }
      }

  The pipeline is stored in `demux_truseq_dms_a.json`. The barcode sequences are added by LabxPipe using the `Second barcode` field in LabxDB. (NB: published demultiplexed data were generated using `"algo": "align"` with a minimum score of 80 instead of `"algo": "bktrim"`.)

  The pipeline was then tested by running:

      lxpipe demultiplex --bulk HHYLKADXX \
                         --path_demux_ops demux_truseq_dms_a.json \
                         --path_seq_prepared prepared \
                         --demux_nozip \
                         --processor 1 \
                         --demux_verbose_level 20 \
                         --no_readonly

  This output is very verbose: for every read, the output of every step of the demultiplexing pipeline is reported. To get consistent output, `--processor` must be set to `1`. Output is written to the local directory `prepared`.

  Finally, once the pipeline is validated (data is written to the `path_seq_prepared` directory):

      lxpipe demultiplex --bulk HHYLKADXX \
                         --path_demux_ops demux_truseq_dms_a.json \
                         --processor 10

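
The simple demultiplexing example above (barcode at the 5' end of read 1, `max_mismatch` of 1) boils down to comparing the start of each read with the known barcodes. The sketch below illustrates that idea only. It is not ReadKnead's algorithm: the helper names, the barcodes, the removal of the barcode from the read and the handling of ambiguous matches are all assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatch=1):
    """Return (sample, trimmed read), or (None, read) without a unique match.

    `barcodes` maps a sample name to the barcode expected at the 5' end.
    Illustrative only; ReadKnead's matching rules may differ.
    """
    hits = []
    for sample, barcode in barcodes.items():
        prefix = read[: len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatch:
            hits.append((sample, len(barcode)))
    # Reads matching no barcode, or more than one, stay unassigned here.
    if len(hits) != 1:
        return None, read
    sample, length = hits[0]
    return sample, read[length:]


# Made-up barcodes and reads.
barcodes = {"sample_a": "ACGT", "sample_b": "TTAG"}
exact = assign_barcode("ACGTGGGCCC", barcodes)
one_off = assign_barcode("TTAAGGGCCC", barcodes)
unknown = assign_barcode("GGGGGGGCCC", barcodes)
print(exact, one_off, unknown)
```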
License
LabxPipe is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).
Copyright (C) 2013-2022 Charles E. Vejnar