orpheum

What is orpheum?

Orpheum (formerly called sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.

Installation

The package can be installed from PyPI using pip:

pip install orpheum

Development install

To install the code and experiment with it locally, clone the GitHub repository and install it with pip:

git clone https://github.com/czbiohub/orpheum.git
cd orpheum

# The "." means "install *this*, the folder where I am now"
pip install .

Usage

Extract likely protein-coding reads from sequencing data

A reference proteome must be supplied as the first argument.

orpheum translate reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the "coding scores" to a csv or parquet file

The "coding score" of each read is calculated by translating the read in all six frames, then calculating the Jaccard index between each translated frame and the peptide database. The final coding score is the maximum Jaccard index across all six reading frames. To see the coding scores for all reads, use the --csv flag or the --parquet flag.

csv:

orpheum translate --csv coding_scores.csv reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

parquet:

orpheum translate --parquet coding_scores.parquet reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
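The coding score described above can be sketched in plain Python. This is a minimal illustration, not orpheum's actual implementation: it assumes the peptide database is a simple Python set of k-mers, and the helper names (translate, six_frames, kmers, jaccard, coding_score) are hypothetical. Orpheum itself may store and query the reference differently.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return three forward-strand and three reverse-complement translations."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (seq, reverse_complement)
        for offset in range(3)
    ]


def kmers(peptide, k):
    """Set of all length-k substrings of a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coding_score(read, reference_kmers, k):
    """Maximum Jaccard index across the six translated frames (hypothetical helper)."""
    return max(
        jaccard(kmers(frame, k), reference_kmers) for frame in six_frames(read)
    )


# Toy usage: the "reference proteome" is a single short peptide.
reference = kmers("MAKF", 2)
score = coding_score("ATGGCCAAGTTT", reference, k=2)
```

In this toy case the first forward frame translates to MAKF, which shares every k-mer with the reference, so that frame determines the maximum.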

Save the coding nucleotides to a fasta

By default, only the coding peptides are output. To also output the underlying nucleotide sequences, use the flag --coding-nucleotide-fasta.

orpheum translate --coding-nucleotide-fasta coding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the non-coding nucleotides to a fasta

To see the sequence of reads which were deemed non-coding, use the flag --noncoding-nucleotide-fasta.

orpheum translate --noncoding-nucleotide-fasta noncoding_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta

Save the low complexity nucleotides to a fasta

To see the sequence of reads whose nucleotide sequence has too low complexity to evaluate, use the flag --low-complexity-nucleotide-fasta. Low complexity is determined by the same method as the read trimmer fastp, which defines complexity as the fraction of bases that differ from the next base, or mathematically, how often seq[i] != seq[i+1]. Reads whose complexity falls below the threshold are considered low complexity. The default threshold is 0.3. As an example, the sequence CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACACCCCCAACACCC would be considered low complexity. While this sequence has many nucleotide k-mers, it is likely a result of a sequencing error and we ignore it.

orpheum translate --low-complexity-nucleotide-fasta low_complexity_nucleotides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
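As a rough sketch of the nucleotide complexity filter: fastp documents complexity as the percentage of bases that differ from the next base, and the check below follows that definition. The function names are hypothetical, and dividing by the full sequence length (rather than the number of adjacent pairs) is an assumption; orpheum's own code may differ in such details.

```python
def nucleotide_complexity(seq):
    """Fraction of bases that differ from the base that follows them."""
    if len(seq) < 2:
        return 0.0
    n_changes = sum(1 for i in range(len(seq) - 1) if seq[i] != seq[i + 1])
    return n_changes / len(seq)


def is_low_complexity_nucleotide(seq, threshold=0.3):
    """True when complexity falls below the threshold (0.3 is the stated default)."""
    return nucleotide_complexity(seq) < threshold


# The example read from the text: long runs of C with a few scattered A's.
read = (
    "CCCCCCCCCACCACCACCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCC"
    "ACCCCCCCACACACCCCCAACACCC"
)
low = is_low_complexity_nucleotide(read)
```

A homopolymer such as AAAAAAAAAA has complexity 0, while a read that changes base at nearly every position scores close to 1.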

Save the low complexity peptides to a fasta

Even if the nucleotide sequence passes the complexity filter, the peptide sequence may still be low complexity. As an example, all translated frames of the sequence CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG would be considered low complexity, as it translates to either QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ (5'3' Frame 1), SSSSSSSSSSSSSSSSSSSSSSSSSSSSS (5'3' Frame 2), AAAAAAAAAAAAAAAAAAAAAAAAAAAAA (5'3' Frame 3 and 3'5' Frame 3), LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL (3'5' Frame 1), or CCCCCCCCCCCCCCCCCCCCCCCCCCCCC (3'5' Frame 2). As these sequences have few k-mers and are difficult to assess for how "coding" they are, we ignore them. Unlike for nucleotides, where we look at runs of consecutive bases, we require the translated peptide to contain more than (L - k + 1)/2 distinct k-mers, where L is the length of the sequence and k is the k-mer size. To save the sequence of low-complexity peptides to a fasta, use the flag --low-complexity-peptides-fasta.

orpheum translate --low-complexity-peptides-fasta low_complexity_peptides.fasta reference-proteome.fa.gz *.fastq.gz > coding_peptides.fasta
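The peptide rule above can be sketched as a count of distinct k-mers compared against half the number of k-mer positions. This is only an illustration: the function names are hypothetical, the default k=7 is an assumed value rather than a documented one, and treating peptides shorter than k as low complexity is likewise an assumption.

```python
def distinct_kmers(peptide, k):
    """Set of distinct length-k substrings of a peptide."""
    return {peptide[i:i + k] for i in range(len(peptide) - k + 1)}


def is_low_complexity_peptide(peptide, k=7):
    """True unless the peptide has more than (L - k + 1) / 2 distinct k-mers."""
    n_positions = len(peptide) - k + 1
    if n_positions <= 0:
        # Assumption: a peptide too short to hold one k-mer cannot be assessed.
        return True
    return len(distinct_kmers(peptide, k)) <= n_positions / 2


# A poly-Q frame, like the first frame of the CAG repeat in the text.
poly_q = is_low_complexity_peptide("Q" * 30)
# A peptide in which every k-mer is different.
diverse = is_low_complexity_peptide("ACDEFGHIKLMNPQRSTVWY")
```

The poly-Q peptide has 24 k-mer positions but a single distinct k-mer, so it falls well under the cutoff of 12, whereas the second peptide has 14 positions and 14 distinct k-mers.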

History

0.1.0 (2019-04-10)

  • First release on PyPI.

1.0.0 (2020-04-28)

  • Sencha release on PyPI.
