Skip to main content

Genomics pipelines

Project description

LabxPipe

MPLv2

  • Integrated with LabxDB: all required annotations (labels, strand, paired etc) are retrieved from LabxDB. This is optional.
  • Based on existing robust technologies. No new language.
    • LabxPipe pipelines are defined in JSON text files.
    • LabxPipe is written in Python. Using norms, such as input and output filenames, insures compatibility between tasks.
  • Simple and complex pipelines.
    • By default, pipelines are linear (one step after the other).
    • Branching is easily achieved be defining a previous step (using step_input parameter) allowing users to create any dependency between tasks.
  • Parallelized using robust asynchronous threads from the Python standard library.

Examples

See JSON files in config/pipelines of this repository.

Pipeline JSON file
mrna_seq.json mRNA-seq
mrna_seq_no_db.json mRNA-seq. No LabxDB
mrna_seq_with_plotting.json mRNA-seq. Plotting non-mapped reads. Demonstrate step_input
mrna_seq_cufflinks.json mRNA-seq. Replaces GeneAbacus by Cufflinks
chip_seq.json ChIP-seq using Bowtie2 and Samtools to uniquify reads.

Following demonstrates how to apply mrna_seq.json pipeline. It requires:

  • LabxDB
  • FASTQ files for sample named AGR000850 and AGR000912
    /plus/data/seq/by_run/AGR000850
    ├── 23_009_R1.fastq.zst
    └── 23_009_R2.fastq.zst
    /plus/data/seq/by_run/AGR000912
    ├── 65_009_R1.fastq.zst
    └── 65_009_R2.fastq.zst
    

Note: mrna_seq_no_db.json demonstrates how to use LabxPipe without LabxDB: it only requires FASTQ files (in path_seq_run directory, see above).

Requirements:

  • LabxDB. Alternatively, mrna_seq_no_db.json doesn't require LabxDB.
  • ReadKnead to trim reads.
  • STAR and genome index in directory defined path_star_index.
  • GeneAbacus to count reads and generate genomic profile for tracks.
  1. Start pipeline:

    lxpipe run --pipeline mrna_seq.json \
               --worker 2 \
               --processor 16
    

    Output is written in path_output directory.

  2. Create report:

    lxpipe report --pipeline mrna_seq.json
    

    Report file mrna_seq.xlsx should be created in same directory as mrna_seq.json.

  3. Merge gene/mRNA counts generated by GeneAbacus in counting directory:

    lxpipe merge-count --pipeline mrna_seq.json \
                       --step counting
    
  4. Trackhub. Requirements:

    • ChromosomeMappings file (to map chromosome names from Ensembl/NCBI to UCSC)
    • Tabulated file (with chromosome name and length)

    Execute in a separate directory:

    lxpipe trackhub --runs AGR000850,AGR000912 \
                    --species_ucsc danRer11 \
                    --path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
                    --path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
                    --input_sam \
                    --bam_names accepted_hits.sam.zst \
                    --make_config \
                    --make_trackhub \
                    --make_bigwig \
                    --processor 16
    

    Directory is ready to be shared by a web server for display in the UCSC genome browser.

Configuration

Parameters can be defined globally. See in config directory of this repository for examples.

Writing pipelines

Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest definition takes precedence: path_seq_run defined in /etc/hts/labxpipe.json is used by default, but if path_seq_run is defined in the pipeline file, it will be used instead.

Main parameters

Parameter Type
name string
path_output string
path_seq_run string
path_annots string
path_bowtie2_index string
path_star_index string
fastq_exts []strings
adaptors {}
logging_level string
run_refs []strings
replicate_refs []strings
ref_info_source []strings
ref_infos {}
analysis [{}, {}, ...]

Parameters for all functions

Parameter Type
step_name string
step_function string
step_desc string
force boolean

Function-specific parameters

Function Synonym Parameter Type
readknead preparing options []strings
ops_r1 [{}, {}, ...]
ops_r2 [{}, {}, ...]
plot_fastq_in boolean
plot_fastq boolean
fastq_out boolean
zip_fastq_out string
bowtie2 genomic_aligning options []strings
index string
output string
output_unfiltered string
compress_sam boolean
compress_sam_cmd string
create_bam boolean
index_bam boolean
star aligning options []strings
index string
output_type []strings
compress_sam boolean
compress_sam_cmd string
compress_unmapped boolean
compress_unmapped_cmd string
cufflinks options []strings
inputs [{}, {}, ...]
features [{}, {}, ...]
geneabacus counting options []strings
inputs [{}, {}, ...]
features [{}, {}, ...]
uniquify options []strings
sort_by_name_bam boolean
index_bam boolean
cleaning steps [{}, {}, ...]

Sample-specific parameters. Automatically populated if using LabxDB or sourced from ref_infos. These parameters can be changed manually in any function (for example setting paired to False will ignore second reads in that step).

Parameter Type
label_short string
paired boolean
directional boolean
r1_strand string
quality_scores string

Demultiplexing sequencing reads: lxpipe demultiplex

  • Demultiplex reads based on barcode sequences from the Second barcode field in LabxDB

  • Demultiplexing using ReadKnead. The most important for demultiplexing is the ReadKnead pipeline. Pipelines are identified using the Adapter 3' field in LabxDB.

  • Example for simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the Adapter 3' field is set to sRNA 1.5 in LabxDB for these samples) with the following pipeline:

    {
        "sRNA 1.5": {
            "R1": [{"name": "demultiplex",
                    "end": 5,
                    "max_mismatch": 1}],
            "R2": null
        }
    }
    

    The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB.

  • Example for iCLIP demultiplexing. In Vejnar et al., iCLIP is demultiplexed (the Adapter 3' field is set to TruSeq-DMS+A Index in LabxDB for these samples) using the following pipeline:

    {
        "TruSeq-DMS+A Index": {
            "R1": [{"name": "clip",
                    "end": 5,
                    "length": 4,
                    "add_clipped": true},
                {"name": "trim",
                 "end": 3,
                 "algo": "bktrim",
                 "min_sequence": 5,
                 "keep": ["trim_exact", "trim_align"]},
                {"name": "length",
                 "min_length": 6},
                {"name": "demultiplex",
                 "end": 3,
                 "max_mismatch": 1,
                 "length_ligand": 2},
                {"name": "length",
                 "min_length": 15}],
            "R2": null
        }
    }
    

    Pipeline is stored in demux_truseq_dms_a.json. The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB. (NB: published demultiplexed data were generated using "algo": "align" with a minimum score of 80 instead of "algo": "bktrim")

    Then pipeline was tested running:

    lxpipe demultiplex --bulk HHYLKADXX \
                       --path_demux_ops demux_truseq_dms_a.json \
                       --path_seq_prepared prepared \
                       --demux_nozip \
                       --processor 1 \
                       --demux_verbose_level 20 \
                       --no_readonly
    

    This output is very verbose: for every read, output from every step of the demultiplexing pipeline is reported. To get consistent output, --processor must be set to 1. Output is written in local directory prepared.

    And finally, once pipeline is validated (data is written in path_seq_prepared directory, see here):

    lxpipe demultiplex --bulk HHYLKADXX \
                       --path_demux_ops demux_truseq_dms_a.json \
                       --processor 10
    

License

LabxPipe is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).

Copyright © 2013-2023 Charles E. Vejnar

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

labxpipe-0.5.0.tar.gz (53.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page