General Small RNA-seq Analysis Tool
Project description
gsat - General Small RNA-seq Analysis Tool
Synopsis
Adapter trimming, quality control, and simple analyses of small RNA-seq data. Independent of alignments to a reference.
Author
Michael J. Axtell, The Pennsylvania State University University Park, PA 16802 USA mja18@psu.edu
Dependencies
- python3
- R , with libraries:
- shiny
- DT
- DBI
- ggplot2
R and the above libraries are required for interactive visualization of the results, but not for the main analyses.
Install
Install from the Python Package Index using pip
. If you are unfamiliar with installing python packages, read this first. Most importantly, consider whether you want to install in a virtual environment, install to the User Site (using the --user
switch), or perform a general installation.
pip install "gsat"
Usage
gsat [-h] --directory DIRECTORY --metadata METADATA
[--prefixSeq PREFIXSEQ] [--matureFa MATUREFA] [--noTrim]
[--adapterSeq ADAPTERSEQ] [--processors PROCESSORS]
[--maxEdit MAXEDIT]
Options
-h, --help show this help message and exit
--directory DIRECTORY
REQUIRED path to directory containing sRNA-seq fastq
files (gz compressed is OK)
--metadata METADATA REQUIRED .csv file of metadata. First column must be
called 'sample', additional columns are factors.
Sample names must match sRNA-seq file names, stripped
of .fastq.gz suffices.
--prefixSeq PREFIXSEQ
Sequence to use as a prefix key to search for adapter.
Defaults to AAGCTCAGGAGGGATAGCGCC (ath-miR390a).
Should be DNA, upper-case
--matureFa MATUREFA FASTA file of mature sRNA query(ies). Can be DNA or
RNA, upper or lower-case. Must be all ATGUCatguc
characters
--noTrim Don't trim - reads already trimmed
--adapterSeq ADAPTERSEQ
Adapter sequence to use for trimming. Applied to ALL
libraries. Should be DNA, upper-case. If not
specified, will be determined empirically
--processors PROCESSORS
Number of processor cores to use for fastq file
processing. Defaults to 1. Use more, depending on your
machine, in order to speed the analysis
--maxEdit MAXEDIT Maximum edit distance for variant finding. Set to 0 to
drastically increase speed (but miss all variants!)
Defaults to 2.
Important assumptions
If the assumptions below aren't met, you CAN'T use gsat with untrimmed .fastq reads!
- Untrimmed .fastq data are assumed to be single-end, + stranded reads from a standard small RNA-seq library. This means that the first base of each read corresponds to the first RNA nucleotide in the original small RNA.
- It is also assumed that each untrimmed read is longer than the original small RNA, such that, somewhere in the read, the adapter used in sequencing is part of the read.
Workflow
Arrangement of inputs
A set of .fastq files is placed into a directory. These can be gzip compressed (file name must end in .gz), or not. In addition, a .csv file of metadata must also be prepared. This metadata file must have 'sample' as the header for the first column, and then one or more factors in subequent columns. Samples are specified by their 'base name' .. which is the .fastq file name with the .fastq (and anything after the .fastq eg. /.fastq.*$/
removed.
Optionally, a file of one or more small RNA queries in FASTA format can also be prepared.
Identification of the 3'-adapter sequence
For each .fastq file, the 3'-adapter sequence is empirically determined. To do this, all of the reads are scanned to find those that begin with a known microRNA sequence, provided by option --prefixSeq
. By default, this is set to AAGCTCAGGAGGGATAGCGCC
, which is miR390. miR390 is deeply conserved among land plants, but won't be suitable for other species, so adjust --prefixSeq
according to your specific situation.
Once scanned, the most common 3' suffix is kept. If the suffix is longer than 20 bases, it is trimmed back to the first 20 bases.
If you already know the 3' adapter sequences for the libraries in your experiment and it is the same for all libraries, you can bybass this search by providing the adapter sequence with option --adapterSeq
.
Trimming of 3'-adapter sequence
Once the first 20 bases of the 3'-adapter are inferred, the reads trimmed based on finding an exact match to those 20 bases. Thus, the longest possible trimmed read is the read length - 20. For each read, there are four possible outcomes:
- No adapter found. This will include sequencing errors, and RNAs that were too long to have the adapter present.
- No insert found. These are cases where the adapter itself is the first thing sequenced. These can be common and probably derived from PCR artifacts during library construction.
- Too short. Reads that are 1-5 nucleotides after trimming are not retained.
- Output. These are properly trimmed reads 6 nucleotides or longer that are output
Adapter trimming can be skipped using the switch --noTrim
.
Summarizing libraries
The lengths and 5'-nucleotides of all successfully trimmed sRNAs are summarized. The data are computed both in terms of "abundance" and "uniques". The "abundance" metric counts all reads, while the "unique" metric counts each unique sequence only once.
Queries
Optionally, one or more small RNAs of interest can be analyzed by providing them as a FASTA file to option --matureFa
. These sequences can be provided in DNA or RNA form (T or U) and upper- or lower-case. Each queried sequence is searched and tabulated, and reported.
Variants
By default, the query analysis will also report any variants up to a Levenshtein distance of 2. This is because such variants are often quite numerous for microRNAs and other types of sRNAs. The edit distance allowed can be changed with option --maxEdit
. Setting --maxEdit
to 0 will speed up the search considerably, but won't return any close variants.
Outputs
Flat files
- for each fastq file, there will be a corresponding fastq.trimmed file, showing the adapter trimmed data in fastq format
- for each fastq.trimmed file, there will be a corresponding .counts.csv file, which shows each unique sRNA sequence, and the number of reads. The order of the sequence is arbitrary.
- a triminfo.csv file, that summarizes the results of adapter trimming
- a libinfo.csv file, that summarizes sRNA lengths and 5'-nucleotides from the successfully trimmed sRNAs
Other files
- a gsat.app.R file, which is an R-script that, when run, allows interactive analysis of the results. When executed, copy and paste the given url to a web browser. This R script requires the following libraries be available to R:
- shiny
- DT
- DBI
- ggplot2
- a .gsat.sqlite file. This is an sqlite database, intended primarily for the R script's use. But it can also be directly queried using any valid sqlite interface/protocol.
Performance
- Use the option
--processors
to increase the processor cores. Using a value for--processors
that exceeds the number of input .fastq files will not increase performance. - If analyzing queries using option
--matureFa
, setting the option--maxEdit
to 0 will speed up the analysis (but will not allow recovery of any close variants to the query sequences).
Testing / Example
Download (this file)[https://psu.box.com/v/gsatTestData], unpack it, and cd into thre resulting directory:
cd gsatData```
This folder has example `--metadata` and `--matureFa` files that will be used for an example analysis.
Next we'll use `fasterq-dump` from the (NCBI SRA Toolkit)[https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/] suite, so first install the sra toolkit and in particular `fasterq-dump`.
Retrieve three example fastq files:
```fasterq-dump SRR6074038
fasterq-dump SRR6074040
fasterq-dump SRR6074031```
These datasets are plant small RNA-seq, untrimmed reads. Run the following for a full analysis. we'll use 3 processors because we have 3 fastq files. But obviously reduce the setting for `--processors` if your system has fewer available.
```gsat --directory . --metadata gsat_TEST_metadata.csv --matureFa gsat_TEST_queries.fasta --processors 3```
Once completed, if you have installed `R` and the required libraries (DT, Shiny, DBI, ggplot2), use `R --vanilla < gsat_TEST_queries.fasta.gsat.app.R` to initiate a visulaization. Copy the url to your web browser to view, and stop by `^c` on the terminal window.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file gsat-0.1.tar.gz
.
File metadata
- Download URL: gsat-0.1.tar.gz
- Upload date:
- Size: 19.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d463f00512f20c56405beff62d18a9e9ec08617c58dac7d74d4bcc1e94f31535 |
|
MD5 | 1a4759960b3b87224594df9c296a3383 |
|
BLAKE2b-256 | f51873cac86eedcc57a23f75c66394b71ec104cb748b5b37f13d80e2333263c2 |
File details
Details for the file gsat-0.1-py3-none-any.whl
.
File metadata
- Download URL: gsat-0.1-py3-none-any.whl
- Upload date:
- Size: 18.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.38.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e5dc667d9240f5db4da94cfb154fc42b36f2ac6bb583ddf8344f07316cefa8f4 |
|
MD5 | d79242669ee4022d611c4562ce6fe12a |
|
BLAKE2b-256 | 901a452b3c9c2fd512e712ca17756bc3a014d54273d5bc602e1a6e0cb60df1df |