Skip to main content

ST Pipeline: An automated pipeline for spatial mapping of unique transcripts

Project description

The ST Pipeline contains the tools and scripts needed to process and analyze the raw files generated with the Spatial Transcriptomics or Visium in FASTQ format to generate datasets for down-stream analysis. The ST pipeline can also be used to process single cell RNA-seq data as long as a file with barcodes identifying each cell is provided. The ST Pipeline can also process RNA-Seq datasets generated with or without UMIs.

The ST Pipeline has been optimized for speed, robustness and it is very easy to use with many parameters to adjust all the settings. The ST Pipeline is fully parallel and has constant memory use. The ST Pipeline allows to skip any of the steps and to use the genome or the transcriptome as reference.

The following files/parameters are required:

  • FASTQ files (Read 1 containing the spatial information and the UMI and read 2 containing the genomic sequence)

  • A genome index generated with STAR

  • An annotation file in GTF or GFF format (optional)

  • The file containing the barcodes and array coordinates

    (look at the folder “ids” and chose the correct one). Basically this file contains 3 columns (BARCODE, X and Y), so if you provide this file with barcodes identinfying cells (for example), the ST pipeline can be used for single cell data. This file is optional too.

  • A name for the dataset

The ST pipeline has multiple parameters mostly related to trimming, mapping and annotation but generally the default values are good enough. You can see a full description of the parameters typing “st_pipeline_run.py –help” after you have installed the ST pipeline.

The input FASTQ files can be given in gzip/bzip format as well.

Basically what the ST pipeline does is:

  • Quality trimming (read 1 and read 2):
    • Remove low quality bases

    • Sanity check (reads same length, reads order, etc..)

    • Check quality UMI (if provided)

    • Remove artifacts (PolyT, PolyA, PolyG, PolyN and PolyC) of user defined length

    • Check for AT and GC content

    • Discard reads with a minimum number of bases of that failed any of the checks above

  • Contamimant filter e.x. rRNA genome (Optional)

  • Mapping with STAR (only read 2)

  • Demultiplexing with [Taggd](https://github.com/SpatialTranscriptomicsResearch/taggd) (only read 1)

  • Keep reads (read 2) that contain a valid barcode and are correctly mapped

  • Annotate the reads with htseq-count (optional)

  • Group annotated reads by barcode(spot position) and gene to get a read count

  • In the grouping/counting only unique molecules (UMIs) are kept.

You can see a graphical more detailed description of the workflow in the documents workflow.pdf and workflow_extended.pdf

The output will be a matrix of counts (genes as columns, spots as rows), a BED file containing the transcripts (Read name, coordinate, gene, etc..), and a JSON file with useful stats. The ST pipeline will also output a log file with useful information.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stpipeline-1.7.8.tar.gz (29.5 MB view hashes)

Uploaded Source

Built Distributions

stpipeline-1.7.8-py3.7-macosx-10.9-x86_64.egg (233.2 kB view hashes)

Uploaded Source

stpipeline-1.7.8-py3.7-macosx-10.7-x86_64.egg (234.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page