labxpipe

Genomics pipelines

Project description

Integrated with LabxDB: all required annotations (labels, strand, paired etc) are retrieved from LabxDB. This is optional.
Based on existing robust technologies. No new language.
- LabxPipe pipelines are defined in JSON text files.
- LabxPipe is written in Python. Using norms, such as input and output filenames, insures compatibility between tasks.
Simple and complex pipelines.
- By default, pipelines are linear (one step after the other).
- Branching is easily achieved be defining a previous step (using step_input parameter) allowing users to create any dependency between tasks.
Parallelized using robust asynchronous threads from the Python standard library.

Examples

See JSON files in config/pipelines of this repository.

Pipeline JSON file
`mrna_seq.json`	mRNA-seq.
`mrna_seq_profiling_bam.json`	mRNA-seq. Genomic coverage profiles using GeneAbacus. BAM and SAM outputs.
`mrna_seq_no_db.json`	mRNA-seq. No LabxDB.
`mrna_seq_with_plotting.json`	mRNA-seq. Plotting non-mapped reads. Demonstrate `step_input`.
`mrna_seq_cufflinks.json`	mRNA-seq. Replaces GeneAbacus by Cufflinks.
`chip_seq.json`	ChIP-seq. Bowtie2 and Samtools to uniquify reads.
`chip_seq_user_function.json`	ChIP-seq. Bowtie2 and Samtools to uniquify reads. Genomic coverage profiles using GeneAbacus. Peak-calling using MACS3 employing a user-defined step/function.

Following demonstrates how to apply mrna_seq.json pipeline. It requires:

LabxDB

FASTQ files for sample named AGR000850 and AGR000912

/plus/data/seq/by_run/AGR000850
├── 23_009_R1.fastq.zst
└── 23_009_R2.fastq.zst
/plus/data/seq/by_run/AGR000912
├── 65_009_R1.fastq.zst
└── 65_009_R2.fastq.zst

Note: mrna_seq_no_db.json demonstrates how to use LabxPipe without LabxDB: it only requires FASTQ files (in path_seq_run directory, see above).

Requirements:

LabxDB. Alternatively, mrna_seq_no_db.json doesn't require LabxDB.
ReadKnead to trim reads.
STAR and genome index in directory defined path_star_index.
GeneAbacus to count reads and generate genomic profile for tracks.

Start pipeline:

lxpipe run --pipeline mrna_seq.json \
           --worker 2 \
           --processor 16

Output is written in path_output directory.

Create report:
```
lxpipe report --pipeline mrna_seq.json
```
Report file mrna_seq.xlsx should be created in same directory as mrna_seq.json.

Extract output file(s) to use them directly, for instance to load them in IGV. For example:

To extract BAM files and rename them using the sample label:

lxpipe extract --pipeline mrna_seq.json \
               --files aligning,accepted_hits.sam.zst \
               --label

To extract BigWig profile files and rename them using the sample label and reference in addition to the original filename used as filename suffix:

lxpipe extract --pipeline mrna_seq.json \
               --files profiling,genome_plus.bw \
               --label \
               --reference \
               --suffix

Use -d/--dry_run to test the extract command before applying it.

Merge gene/mRNA counts generated by GeneAbacus in counting directory:

lxpipe merge-count --pipeline mrna_seq.json \
                   --step counting

Create a trackhub. Requirements:

ChromosomeMappings file (to map chromosome names from Ensembl/NCBI to UCSC)
Tabulated file (with chromosome name and length)

Execute in a separate directory:

lxpipe trackhub --runs AGR000850,AGR000912 \
                --species_ucsc danRer11 \
                --path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
                --path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
                --input_sam \
                --bam_names accepted_hits.sam.zst \
                --make_config \
                --make_trackhub \
                --make_bigwig \
                --processor 16

Directory is ready to be shared by a web server for display in the UCSC genome browser.

Configuration

Parameters can be defined globally. See in config directory of this repository for examples.

Writing pipelines

Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest definition takes precedence: path_seq_run defined in /etc/hts/labxpipe.json is used by default, but if path_seq_run is defined in the pipeline file, it will be used instead.

Main parameters

Parameter	Type
name	string
path_output	string
path_seq_run	string
path_local_steps	string
path_annots	string
path_bowtie2_index	string
path_bwa-mem2_index	string
path_minimap2_index	string
path_star_index	string
fastq_exts	[]strings
adaptors	{}
logging_level	string
run_refs	[]strings
replicate_refs	[]strings
ref_info_source	[]strings
ref_infos	{}
analysis	[{}, {}, ...]

Parameters for all steps

Parameter	Type
step_name	string
step_function	string
step_desc	string
force	boolean

Step-specific parameters

Step	Synonym	Parameter	Type
readknead	preparing	options	[]strings
		ops_r1	[{}, {}, ...]
		ops_r2	[{}, {}, ...]
		plot_fastq_in	boolean
		plot_fastq	boolean
		fastq_out	boolean
		zip_fastq_out	string
bowtie2	genomic_aligning	options	[]strings
		index	string
		output	string
		output_unfiltered	string
		compress_sam	boolean
		compress_sam_cmd	string
		create_bam◆	boolean
		index_bam◆	boolean
bwa-mem2		options	[]strings
		index	string
		output	string
		compress_output	boolean
		compress_output_cmd	string
		create_bam◆	boolean
		index_bam◆	boolean
minimap2		options	[]strings
		index	string
		output	string
		compress_output	boolean
		compress_output_cmd	string
		create_bam◆	boolean
		index_bam◆	boolean
star	aligning	options	[]strings
		index	string
		output_type	[]strings
		compress_sam	boolean
		compress_sam_cmd	string
		compress_unmapped	boolean
		compress_unmapped_cmd	string
cufflinks		options	[]strings
		inputs	[{}, {}, ...]
		features	[{}, {}, ...]
geneabacus	counting	options	[]strings
		inputs	[{}, {}, ...]
		path_annots	string
		features	[{}, {}, ...]
samtools_sort		options	[]strings
		sort_by_name_bam	boolean
samtools_uniquify		options	[]strings
		sort_by_name_bam	boolean
		index_bam	boolean
cleaning		steps	[{}, {}, ...]

◆ indicates exclusive options. For example, either create_bam or index_bam can be used, but not both.

Sample-specific parameters. Automatically populated if using LabxDB or sourced from ref_infos. These parameters can be changed manually in any step (for example setting paired to false will ignore second reads in that step).

Parameter	Type
label_short	string
paired	boolean
directional	boolean
r1_strand	string
quality_scores	string

User-defined step

In addition to the provided steps/functions, i.e. bowtie2, star or geneabacus, users can defined their own step, usable in the LabxPipe pipelines. LabxPipe will import user-defined steps:

Written in Python
One step per file with the .py extension located in the directory defined by path_local_steps
Each step defined in individual file requires:
1. A functions variable listing the step name(s)
2. A function named run with the 3 parameters path_in, path_out and params
For example:
```
functions = ['macs3']
def run(path_in, path_out, params):
    ...
```

Example of a user-defined function providing peak-calling using MACS3 is available in config/user_steps/macs3.py in this repository.

Example of a pipeline using the MACS3 step is available in config/pipelines/chip_seq_user_function.json in this repository.

Demultiplexing sequencing reads: `lxpipe demultiplex`

Demultiplex reads based on barcode sequences from the Second barcode field in LabxDB
Demultiplexing using ReadKnead. The most important for demultiplexing is the ReadKnead pipeline. Pipelines are identified using the Adapter 3' field in LabxDB.

Example for simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the Adapter 3' field is set to sRNA 1.5 in LabxDB for these samples) with the following pipeline:

{
    "sRNA 1.5": {
        "R1": [
            {
                "name": "demultiplex",
                "end": 5,
                "max_mismatch": 1
            }
        ],
        "R2": null
    }
}

The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB.

Example for iCLIP demultiplexing. In Vejnar et al., iCLIP is demultiplexed (the Adapter 3' field is set to TruSeq-DMS+A Index in LabxDB for these samples) using the following pipeline:

{
    "TruSeq-DMS+A Index": {
        "R1": [
            {
                "name": "clip",
                "end": 5,
                "length": 4,
                "add_clipped": true
            },
            {
                "name": "trim",
                "end": 3,
                "algo": "bktrim",
                "min_sequence": 5,
                "keep": ["trim_exact", "trim_align"]
            },
            {
                "name": "length",
                "min_length": 6
            },
            {
                "name": "demultiplex",
                "end": 3,
                "max_mismatch": 1,
                "length_ligand": 2
            },
            {
                "name": "length",
                "min_length": 15
            }
        ],
        "R2": null
    }
}

Pipeline is stored in demux_truseq_dms_a.json. The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB. (NB: published demultiplexed data were generated using "algo": "align" with a minimum score of 80 instead of "algo": "bktrim")

Then pipeline was tested running:

lxpipe demultiplex --bulk HHYLKADXX \
                   --path_demux_ops demux_truseq_dms_a.json \
                   --path_seq_prepared prepared \
                   --demux_nozip \
                   --processor 1 \
                   --demux_verbose_level 20 \
                   --no_readonly

This output is very verbose: for every read, output from every step of the demultiplexing pipeline is reported. To get consistent output, --processor must be set to 1. Output is written in local directory prepared.

And finally, once pipeline is validated (data is written in path_seq_prepared directory, see here):

lxpipe demultiplex --bulk HHYLKADXX \
                   --path_demux_ops demux_truseq_dms_a.json \
                   --processor 10

License

LabxPipe is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).

Project details

Release history Release notifications | RSS feed

This version

0.7.1

Aug 1, 2023

0.7.0

May 17, 2023

0.6.0

Apr 19, 2023

0.5.0

Apr 5, 2023

0.4.0

Mar 14, 2023

0.3.0

Nov 23, 2022

0.1.2

Sep 29, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

labxpipe-0.7.1.tar.gz (58.2 kB view details)

Uploaded Aug 1, 2023 Source

File details

Details for the file labxpipe-0.7.1.tar.gz.

File metadata

Download URL: labxpipe-0.7.1.tar.gz
Upload date: Aug 1, 2023
Size: 58.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for labxpipe-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`74348cd7a0e8c6d349d4423734cc4162798b7734e925fa3e9b1977e2705bab5c`
MD5	`2581f5aa748effcdb60870e21fc8d621`
BLAKE2b-256	`0c66b80e5f1650286cdc8b1f3fe15cf773742cd1df84756a9bb15d41a0045ca9`

See more details on using hashes here.

labxpipe 0.7.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Examples

Configuration

Writing pipelines

User-defined step

Demultiplexing sequencing reads: `lxpipe demultiplex`

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

labxpipe 0.7.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Examples

Configuration

Writing pipelines

User-defined step

Demultiplexing sequencing reads: lxpipe demultiplex

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

Demultiplexing sequencing reads: `lxpipe demultiplex`