Sencha is a Python package for directly translating RNA-seq reads into coding protein sequence.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English

sencha


What is sencha?

Sencha is a Python package for directly translating RNA-seq reads into coding protein sequence.

Free software: MIT license
Documentation: https://czbiohub.github.io/sencha

The name is inspired by the naming pattern of sourmash combined with @olgabot's love of tea. (Sencha is a Japanese green tea.)

Installation

Install the package from PyPI using pip:

pip install sencha

Development install

To install the code and experiment with it locally, clone the GitHub repository and install it with pip:

git clone https://github.com/czbiohub/sencha.git
cd sencha

# The "." means "install *this*, the folder where I am now"
pip install .

Usage

Extract likely protein-coding reads from sequencing data

A reference proteome must be supplied as the first argument.

sencha translate reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the "coding scores" to a csv or parquet file

The "coding score" of each read is calculated by translating the read in all six frames, then calculating the Jaccard index between each translated frame and the peptide database. The final coding score is the maximum Jaccard index across the six reading frames. To see the coding scores for all reads, use the --csv flag or the --parquet flag.

csv:

sencha translate --csv coding_scores.csv reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

parquet:

sencha translate --parquet coding_scores.parquet reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
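
The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not sencha's actual implementation: the function names (six_frame_translations, kmers, jaccard, coding_score) are hypothetical, the reference is a plain Python set of peptide k-mers rather than whatever data structure the tool uses internally, and stop codons are simply left in the translation as "*".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Three forward frames followed by three reverse-complement frames."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (read, reverse)
        for offset in range(3)
    ]


def kmers(seq, k):
    """Set of all length-k substrings (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coding_score(read, reference_kmers, k):
    """Maximum Jaccard index across the six translated frames (hypothetical helper)."""
    return max(
        jaccard(kmers(frame, k), reference_kmers)
        for frame in six_frame_translations(read)
    )


# Toy example: the read's first forward frame translates to MAKLT.
reference = kmers("MAKLT", 3)
print(coding_score("ATGGCCAAGCTGACC", reference, 3))  # 1.0
```

With a real proteome the reference k-mer set is far larger than any single read's k-mer set, so the raw scores would be much smaller than in this toy example.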

Save the coding nucleotides to a fasta

By default, only the coding peptides are output. To also output the underlying nucleotide sequence, use the flag --coding-nucleotide-fasta.

sencha translate --coding-nucleotide-fasta coding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the non-coding nucleotides to a fasta

To see the sequence of reads which were deemed non-coding, use the flag --noncoding-nucleotide-fasta.

sencha translate --noncoding-nucleotide-fasta noncoding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the low complexity nucleotides to a fasta

To see the sequence of reads whose nucleotide sequence is too low in complexity to evaluate, use the flag --low-complexity-nucleotide-fasta. Low complexity is determined by the same method as the read trimmer fastp, which scores a read by how often a base differs from the one that follows it (seq[i] != seq[i+1]); long runs of the same base, where seq[i] = seq[i+1], lower the score. The default threshold is 0.3. As an example, the sequence CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACACCCCCAACACCC would be considered low complexity. While this sequence has many nucleotide k-mers, it is likely the result of a sequencing error, so we ignore it.

sencha translate --low-complexity-nucleotide-fasta low_complexity_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
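
As a rough illustration of the nucleotide filter, the sketch below follows fastp's documented definition of complexity, the fraction of bases that differ from the next base. It is an assumption that sencha flags reads scoring below the threshold, and the function names and the choice of denominator (number of adjacent pairs) are hypothetical rather than taken from the sencha source.

```python
def nucleotide_complexity(seq):
    """Fraction of adjacent base pairs where seq[i] != seq[i+1]."""
    if len(seq) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return changes / (len(seq) - 1)


def is_low_complexity_nucleotide(seq, threshold=0.3):
    """Assumed rule: a read is low complexity when its score is below the threshold."""
    return nucleotide_complexity(seq) < threshold


example = (
    "CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCC"
    "ACACACCCCCAACACCC"
)
print(nucleotide_complexity(example))
print(is_low_complexity_nucleotide(example))
```

A homopolymer such as "AAAA" scores 0.0 under this definition, while a sequence with no repeated adjacent bases such as "ACGT" scores 1.0.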

Save the low complexity peptides to a fasta

Even if the nucleotide sequence passes the complexity filter, the peptide sequence may still be low complexity. As an example, all translated frames of the sequence CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG would be considered low complexity, as it translates to either QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ (5'3' Frame 1), SSSSSSSSSSSSSSSSSSSSSSSSSSSSS (5'3' Frame 2), AAAAAAAAAAAAAAAAAAAAAAAAAAAAA (5'3' Frame 3 and 3'5' Frame 3), LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL (3'5' Frame 1), or CCCCCCCCCCCCCCCCCCCCCCCCCCCCC (3'5' Frame 2). Because these sequences have few k-mers, it is difficult to assess how "coding" they are, so we ignore them.

Unlike for nucleotides, where we look at runs of consecutive bases, we require the translated peptide to contain more than (L - k + 1)/2 distinct k-mers, where L is the length of the sequence and k is the k-mer size. To save the sequence of low-complexity peptides to a fasta, use the flag --low-complexity-peptides-fasta.

sencha translate --low-complexity-peptides-fasta low_complexity_peptides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
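
The peptide rule above can be sketched as follows. This is a minimal illustration, not sencha's code: the function name is hypothetical, the k-mer count is taken to mean distinct k-mers, treating peptides shorter than k as low complexity is an assumption, and k = 7 in the example is only an illustrative choice.

```python
def is_low_complexity_peptide(peptide, k):
    """True unless the peptide has more than (L - k + 1) / 2 distinct k-mers."""
    n_possible = len(peptide) - k + 1  # number of k-mer positions, L - k + 1
    if n_possible <= 0:
        # Assumption: a peptide too short to hold one k-mer cannot be assessed.
        return True
    n_distinct = len({peptide[i:i + k] for i in range(n_possible)})
    return n_distinct <= n_possible / 2


print(is_low_complexity_peptide("Q" * 30, k=7))                # True
print(is_low_complexity_peptide("ACDEFGHIKLMNPQRSTVWY", k=7))  # False
```

The poly-Q peptide has 24 k-mer positions but a single distinct 7-mer, so it fails the requirement; the second peptide has 14 positions and 14 distinct 7-mers, so it passes.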

History

0.1.0 (2019-04-10)

First release on PyPI.

1.0.0 (2020-04-28)

Sencha release on PyPI.

Release history

1.0.3 (Oct 20, 2020), this version

1.0.2 (Oct 11, 2020)

1.0.0 (May 2, 2020)
