sencha
What is sencha?
Sencha is a Python package for directly translating RNA-seq reads into coding protein sequence.
- Free software: MIT license
- Documentation: https://czbiohub.github.io/sencha
The name is inspired by the naming pattern of sourmash combined with @olgabot's love of tea. (Sencha is a Japanese green tea.)
Installation
The package can be installed from PyPI using pip:
pip install sencha
Development install
To install this code and work with it locally, clone this GitHub repository and use pip to install:
git clone https://github.com/czbiohub/sencha.git
cd sencha
# The "." means "install *this*, the folder where I am now"
pip install .
Usage
Extract likely protein-coding reads from sequencing data
A reference proteome must be supplied as the first argument.
sencha translate reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
Save the "coding scores" to a csv or parquet file
The "coding score" of each read is calculated by translating the read in all
six reading frames, then calculating the Jaccard index between each translated
frame and the peptide database. The final coding score is the maximum Jaccard
index across all reading frames. If you'd like to see the coding scores for
all reads, use the --csv flag or the --parquet flag.
csv:
sencha translate --csv coding_scores.csv reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
parquet:
sencha translate --parquet coding_scores.parquet reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
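The coding score described above can be illustrated with a minimal Python sketch. This is not sencha's implementation: the function names, the k-mer size of 7, and the use of a plain Python set for the peptide database are all assumptions made for illustration, and stop codons are simply kept as `*` characters rather than handled specially.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def kmers(peptide, k):
    """Set of all overlapping k-mers in a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coding_score(read, reference_kmers, k=7):
    """Maximum Jaccard index over the six translated frames (hypothetical helper)."""
    return max(
        jaccard(kmers(frame, k), reference_kmers)
        for frame in six_frame_translations(read)
    )
```

In this sketch, `coding_score(read, kmers(reference_peptide, 7))` returns a value between 0 and 1 for a single read; a real peptide database would hold the k-mers of a whole proteome.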
Save the coding nucleotides to a fasta
By default, only the coding peptides are output. If you'd like to also output
the underlying nucleotide sequence, use the flag --coding-nucleotide-fasta.
sencha translate --coding-nucleotide-fasta coding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
Save the non-coding nucleotides to a fasta
To see the sequence of reads which were deemed non-coding, use the flag
--noncoding-nucleotide-fasta.
sencha translate --noncoding-nucleotide-fasta noncoding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
Save the low complexity nucleotides to a fasta
To see the sequence of reads found to have too low complexity of nucleotide
sequence to evaluate, use the flag --low-complexity-nucleotide-fasta. Low
complexity is determined by the same method as the read trimmer fastp, in
which we calculate what percentage of the sequence has consecutive runs of the
same base, or mathematically, how often seq[i] = seq[i+1]. The default
threshold is 0.3. As an example, the sequence
CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACACCCCCAACACCC
would be considered low complexity. While this sequence has many nucleotide
k-mers, it is likely a result of a sequencing error and we ignore it.
sencha translate --low-complexity-nucleotide-fasta low_complexity_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
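A minimal Python sketch of a fastp-style complexity check follows. fastp's documentation defines complexity as the percentage of bases that differ from the next base, so this sketch counts positions where seq[i] != seq[i+1] and flags a read when that fraction falls below the threshold. The function names, the choice to divide by the number of adjacent pairs, and the reading of 0.3 as a lower bound on complexity are assumptions, not sencha's confirmed code.

```python
def nucleotide_complexity(seq):
    """Fraction of adjacent base pairs that differ (hypothetical helper)."""
    if len(seq) < 2:
        return 0.0
    n_different = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return n_different / (len(seq) - 1)


def is_low_complexity_nucleotide(seq, threshold=0.3):
    """Flag reads whose complexity falls below the assumed threshold."""
    return nucleotide_complexity(seq) < threshold


example = (
    "CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCC"
    "ACACACCCCCAACACCC"
)
print(is_low_complexity_nucleotide(example))
```

Under these assumptions, the example read from the text is flagged because most of its adjacent bases are identical, even though it contains many distinct nucleotide k-mers.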
Save the low complexity peptides to a fasta
Even if the nucleotide sequence passes the complexity filter, the peptide
sequence may still be low complexity. As an example, all translated frames of
the sequence
CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG
would be considered low complexity, as it translates to one of:
- QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ (5'3' Frame 1)
- SSSSSSSSSSSSSSSSSSSSSSSSSSSSS (5'3' Frame 2)
- AAAAAAAAAAAAAAAAAAAAAAAAAAAAA (5'3' Frame 3 and 3'5' Frame 3)
- LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL (3'5' Frame 1)
- CCCCCCCCCCCCCCCCCCCCCCCCCCCCC (3'5' Frame 2)
As these sequences have few k-mers and are difficult to assess for how
"coding" they are, we ignore them. Unlike for nucleotides, where we look at
runs of consecutive bases, we require the translated peptide to contain
greater than (L - k + 1)/2 k-mers, where L is the length of the sequence and
k is the k-mer size. To save the sequence of low-complexity peptides to a
fasta, use the flag --low-complexity-peptides-fasta.
sencha translate --low-complexity-peptides-fasta low_complexity_peptides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
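The peptide rule above can be sketched in a few lines of Python. This assumes the rule counts distinct k-mers, uses a hypothetical default of k = 7, and treats peptides shorter than k as low complexity; none of these details are confirmed by the text, and the function name is illustrative.

```python
def is_low_complexity_peptide(peptide, k=7):
    """True unless the peptide has more than (L - k + 1) / 2 distinct k-mers."""
    n_possible = len(peptide) - k + 1  # number of k-mer windows, L - k + 1
    if n_possible <= 0:
        # Assumption: a peptide too short to hold one k-mer cannot be assessed.
        return True
    distinct = {peptide[i:i + k] for i in range(n_possible)}
    return len(distinct) <= n_possible / 2


print(is_low_complexity_peptide("Q" * 30))                # a poly-Q frame
print(is_low_complexity_peptide("ACDEFGHIKLMNPQRSTVWY"))  # 20 distinct residues
```

A poly-Q peptide of length 30 has 24 k-mer windows but only one distinct 7-mer, so it fails the requirement, while a peptide made of 20 different residues has 14 distinct 7-mers out of 14 windows and passes.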
History
0.1.0 (2019-04-10)
- First release on PyPI.
1.0.0 (2020-04-28)
- Sencha release on PyPI.