General Small RNA-seq Analysis Tool
gsat - General Small RNA-seq Analysis Tool
Adapter trimming, quality control, and simple analyses of small RNA-seq data. Independent of alignments to a reference.
Michael J. Axtell, The Pennsylvania State University University Park, PA 16802 USA firstname.lastname@example.org
- R , with libraries:
R and the above libraries are required for interactive visualization of the results, but not for the main analyses.
Install from the Python Package Index using
pip. If you are unfamiliar with installing python packages, read this first. Most importantly, consider whether you want to install in a virtual environment, install to the User Site (using the
--user switch), or perform a general installation.
pip install "gsat"
gsat [-h] --directory DIRECTORY --metadata METADATA [--prefixSeq PREFIXSEQ] [--matureFa MATUREFA] [--noTrim] [--adapterSeq ADAPTERSEQ] [--processors PROCESSORS] [--maxEdit MAXEDIT]
-h, --help show this help message and exit --directory DIRECTORY REQUIRED path to directory containing sRNA-seq fastq files (gz compressed is OK) --metadata METADATA REQUIRED .csv file of metadata. First column must be called 'sample', additional columns are factors. Sample names must match sRNA-seq file names, stripped of .fastq.gz suffices. --prefixSeq PREFIXSEQ Sequence to use as a prefix key to search for adapter. Defaults to AAGCTCAGGAGGGATAGCGCC (ath-miR390a). Should be DNA, upper-case --matureFa MATUREFA FASTA file of mature sRNA query(ies). Can be DNA or RNA, upper or lower-case. Must be all ATGUCatguc characters --noTrim Don't trim - reads already trimmed --adapterSeq ADAPTERSEQ Adapter sequence to use for trimming. Applied to ALL libraries. Should be DNA, upper-case. If not specified, will be determined empirically --processors PROCESSORS Number of processor cores to use for fastq file processing. Defaults to 1. Use more, depending on your machine, in order to speed the analysis --maxEdit MAXEDIT Maximum edit distance for variant finding. Set to 0 to drastically increase speed (but miss all variants!) Defaults to 2.
If the assumptions below aren't met, you CAN'T use gsat with untrimmed .fastq reads!
- Untrimmed .fastq data are assumed to be single-end, + stranded reads from a standard small RNA-seq library. This means that the first base of each read corresponds to the first RNA nucleotide in the original small RNA.
- It is also assumed that each untrimmed read is longer than the original small RNA, such that, somewhere in the read, the adapter used in sequencing is part of the read.
Arrangement of inputs
A set of .fastq files is placed into a directory. These can be gzip compressed (file name must end in .gz), or not. In addition, a .csv file of metadata must also be prepared. This metadata file must have 'sample' as the header for the first column, and then one or more factors in subequent columns. Samples are specified by their 'base name' .. which is the .fastq file name with the .fastq (and anything after the .fastq eg.
Optionally, a file of one or more small RNA queries in FASTA format can also be prepared.
Identification of the 3'-adapter sequence
For each .fastq file, the 3'-adapter sequence is empirically determined. To do this, all of the reads are scanned to find those that begin with a known microRNA sequence, provided by option
--prefixSeq. By default, this is set to
AAGCTCAGGAGGGATAGCGCC, which is miR390. miR390 is deeply conserved among land plants, but won't be suitable for other species, so adjust
--prefixSeq according to your specific situation.
Once scanned, the most common 3' suffix is kept. If the suffix is longer than 20 bases, it is trimmed back to the first 20 bases.
If you already know the 3' adapter sequences for the libraries in your experiment and it is the same for all libraries, you can bybass this search by providing the adapter sequence with option
Trimming of 3'-adapter sequence
Once the first 20 bases of the 3'-adapter are inferred, the reads trimmed based on finding an exact match to those 20 bases. Thus, the longest possible trimmed read is the read length - 20. For each read, there are four possible outcomes:
- No adapter found. This will include sequencing errors, and RNAs that were too long to have the adapter present.
- No insert found. These are cases where the adapter itself is the first thing sequenced. These can be common and probably derived from PCR artifacts during library construction.
- Too short. Reads that are 1-5 nucleotides after trimming are not retained.
- Output. These are properly trimmed reads 6 nucleotides or longer that are output
Adapter trimming can be skipped using the switch
The lengths and 5'-nucleotides of all successfully trimmed sRNAs are summarized. The data are computed both in terms of "abundance" and "uniques". The "abundance" metric counts all reads, while the "unique" metric counts each unique sequence only once.
Optionally, one or more small RNAs of interest can be analyzed by providing them as a FASTA file to option
--matureFa. These sequences can be provided in DNA or RNA form (T or U) and upper- or lower-case. Each queried sequence is searched and tabulated, and reported.
By default, the query analysis will also report any variants up to a Levenshtein distance of 2. This is because such variants are often quite numerous for microRNAs and other types of sRNAs. The edit distance allowed can be changed with option
--maxEdit to 0 will speed up the search considerably, but won't return any close variants.
- for each fastq file, there will be a corresponding fastq.trimmed file, showing the adapter trimmed data in fastq format
- for each fastq.trimmed file, there will be a corresponding .counts.csv file, which shows each unique sRNA sequence, and the number of reads. The order of the sequence is arbitrary.
- a triminfo.csv file, that summarizes the results of adapter trimming
- a libinfo.csv file, that summarizes sRNA lengths and 5'-nucleotides from the successfully trimmed sRNAs
- a gsat.app.R file, which is an R-script that, when run, allows interactive analysis of the results. When executed, copy and paste the given url to a web browser. This R script requires the following libraries be available to R:
- a .gsat.sqlite file. This is an sqlite database, intended primarily for the R script's use. But it can also be directly queried using any valid sqlite interface/protocol.
- Use the option
--processorsto increase the processor cores. Using a value for
--processorsthat exceeds the number of input .fastq files will not increase performance.
- If analyzing queries using option
--matureFa, setting the option
--maxEditto 0 will speed up the analysis (but will not allow recovery of any close variants to the query sequences).
Testing / Example
Download (this file)[https://psu.box.com/v/gsatTestData], unpack it, and cd into thre resulting directory:
cd gsatData``` This folder has example `--metadata` and `--matureFa` files that will be used for an example analysis. Next we'll use `fasterq-dump` from the (NCBI SRA Toolkit)[https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/] suite, so first install the sra toolkit and in particular `fasterq-dump`. Retrieve three example fastq files: ```fasterq-dump SRR6074038 fasterq-dump SRR6074040 fasterq-dump SRR6074031``` These datasets are plant small RNA-seq, untrimmed reads. Run the following for a full analysis. we'll use 3 processors because we have 3 fastq files. But obviously reduce the setting for `--processors` if your system has fewer available. ```gsat --directory . --metadata gsat_TEST_metadata.csv --matureFa gsat_TEST_queries.fasta --processors 3``` Once completed, if you have installed `R` and the required libraries (DT, Shiny, DBI, ggplot2), use `R --vanilla < gsat_TEST_queries.fasta.gsat.app.R` to initiate a visulaization. Copy the url to your web browser to view, and stop by `^c` on the terminal window.
Release history Release notifications | RSS feed
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size gsat-0.1-py3-none-any.whl (18.9 kB)||File type Wheel||Python version py3||Upload date||Hashes View|
|Filename, size gsat-0.1.tar.gz (19.1 kB)||File type Source||Python version None||Upload date||Hashes View|