MosaicProt CLI: Detect ORFs, Find Alternative ORFs, Generate Mosaic Proteins

What is MosaicProt?

The purpose of MosaicProt is to enable the de novo detection of chimeric peptide sequences produced by programmed ribosomal frameshifting (PRF). The tool can be used for the identification of individual PRF events and candidates for mosaic translation, i.e., sequences produced by multiple PRF events from a single transcript (hence the name MosaicProt). The tool can model potential chimeric sequences by analyzing the positional relationships between overlapping or closely spaced ORFs. It supports cases where altORFs:

  • Are embedded within refORFs.
  • Partially overlap with refORFs, spanning the border between a refORF and a UTR.
  • Are located entirely in a UTR, with simulated “long” frameshift events bridging the non-overlapping regions up to 10 nucleotides apart.

The simulation engine systematically explores multiple reading frames and frameshift values (±1 and ±2 nucleotides) to generate all plausible chimeric peptide combinations between overlapping or adjacent ORFs. For each ORF pair, it performs up to 30 one-nucleotide stepwise iterations to model potential frameshift junctions, simulating chimeric sequences of 40 amino acids in length. Sequences containing premature stop codons are automatically excluded, ensuring that only continuous, MS-compatible peptide models are retained.

MosaicProt is a command-line tool designed to:

  • Detect chimeric peptides and proteins from all possible ORFs of a defined length range in a transcriptome. A user can provide either a spliced transcriptome or precursor sequences, depending on the scope of a study.
  • Separate canonical (refProt) and non-canonical (altProt) sequences by comparing to a reference proteome. The reference proteome can be defined by a user or imported from a database.
  • Simulate chimeric protein sequences by combining segments of refProts and altProts. It can also model chimeric sequences from overlapping altProts.
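
The frameshift modeling described above can be illustrated with a minimal sketch. This is not MosaicProt's own code: the function names, the 20 + 20 split of the 40-aa model around the junction, and the use of the standard genetic code are assumptions made only for illustration. It models a single junction; how the tool itself steps through junction positions is not shown here.

```python
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def chimeric_peptide(nt, junction, shift, half_len=20):
    """Hypothetical helper: read half_len codons ending at `junction`,
    move the reading position by `shift` nt, then read half_len more codons.
    Returns None if the window leaves the sequence or contains a stop."""
    up_start = junction - 3 * half_len
    down_start = junction + shift
    down_end = down_start + 3 * half_len
    if up_start < 0 or down_end > len(nt):
        return None
    peptide = translate(nt[up_start:junction]) + translate(nt[down_start:down_end])
    return None if "*" in peptide else peptide


def chimeras_at(nt, junction, shifts=(-2, -1, 1, 2)):
    """Collect the stop-free models for every frameshift value at one junction."""
    models = {}
    for shift in shifts:
        peptide = chimeric_peptide(nt, junction, shift)
        if peptide is not None:
            models[shift] = peptide
    return models
```

Calling chimeras_at(sequence, junction) returns a dictionary that maps each frameshift value to its 40-aa model, leaving out any model that contains a stop codon.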

Available Commands

The tool has three commands, which correspond to the modules described in Figure 1 of the published manuscript (the ORF detector module, the altProt/refProt separator module, and the chimeric modeler module):

1. detect_ORFs

This command detects all potential ORFs from a transcriptome FASTA file and translates them into corresponding amino acid sequences.

Usage:

mosaicprot detect_ORFs --transcriptome_file <transcriptome.fasta> [--threshold 30] [--output_file_type fasta]

Parameters:

  • --transcriptome_file (required): Input transcriptome FASTA file.
  • --threshold: Minimum ORF length in amino acids (default: 30).
  • --output_file_type: Output format (fasta or xml, default: fasta).

The input file <transcriptome.fasta> contains an annotated or de novo sequenced transcriptome of any organism. The output file name is built by appending a suffix to the base name of the input file; with a threshold of 30, the suffix is “_min_30aa_ORFs.fasta”. For example, the input file “my_transcriptome.fasta” yields the output file “my_transcriptome_min_30aa_ORFs.fasta”. The suffix “min_30aa_ORFs” denotes regions (ORFs) that are free of in-frame stop codons and can be translated in silico into amino acid sequences longer than 29 aa. If xml is chosen as the output format, the output cannot be used with the next module, which accepts only FASTA files. The xml option is provided for export to other pipelines that require xml input.
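
As a rough illustration of the ORF definition and file naming above, the sketch below finds stop-free stretches of at least the threshold length and builds the output file name. It is a simplified stand-in, not the tool's implementation: scanning only the three forward frames, using the standard genetic code, and the helper names detect_orfs and output_name are all assumptions. The naming pattern for thresholds other than 30 is likewise inferred from the single example in the text.

```python
from itertools import product
from pathlib import Path

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def detect_orfs(nt, threshold=30):
    """Return (frame, peptide) for each stop-free stretch of at least
    `threshold` amino acids in the three forward frames (assumption)."""
    hits = []
    for frame in range(3):
        for segment in translate(nt[frame:]).split("*"):
            if len(segment) >= threshold:
                hits.append((frame, segment))
    return hits


def output_name(input_path, threshold=30):
    """Build the FASTA output name from the input base name (assumed pattern)."""
    return f"{Path(input_path).stem}_min_{threshold}aa_ORFs.fasta"
```

With the default threshold, output_name("my_transcriptome.fasta") gives "my_transcriptome_min_30aa_ORFs.fasta", matching the example in the text.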

2. separate_ORFs

This command separates ORF products into refProts and altProts based on a known or a user-defined reference proteome.

Usage:

mosaicprot separate_ORFs --ORFeome_file <orfs.fasta> --reference_proteome_file <ref.fasta> [--output_alt_file altProts.fasta] [--output_ref_file refProts.fasta]

Parameters:

  • --ORFeome_file (required): Input FASTA file with predicted ORF translations.
  • --reference_proteome_file (required): Input FASTA file with reference proteins.
  • --output_alt_file: Output file for alternative proteins (default: altProts.fasta).
  • --output_ref_file: Output file for reference proteins (default: refProts.fasta).

The input file <orfs.fasta> is the output of module 1 (detect_ORFs). The input file <ref.fasta> contains a canonical proteome downloaded from a database or a user-defined canonical proteome. The default output file names are altProts.fasta and refProts.fasta.
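
The separation step can be pictured with the following sketch. It assumes that an ORF product counts as a refProt only when its sequence is identical to an entry in the reference proteome; the criterion actually used by MosaicProt may differ, and the function names here are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}.
    The identifier is the first whitespace-delimited token of the header."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def separate(orf_products, reference):
    """Split ORF products into (refProts, altProts) by exact sequence match
    against the reference proteome (assumed criterion)."""
    reference_sequences = set(reference.values())
    ref_prots, alt_prots = {}, {}
    for identifier, sequence in orf_products.items():
        if sequence in reference_sequences:
            ref_prots[identifier] = sequence
        else:
            alt_prots[identifier] = sequence
    return ref_prots, alt_prots
```

Both returned dictionaries keep the identifiers of the ORF input, so they can be written back out as altProts.fasta and refProts.fasta.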

3. simulate_chimeric_proteins

This command simulates chimeric proteins by combining segments of refProts and altProts.

Usage:

mosaicprot simulate_chimeric_proteins \
--transcriptome_file <transcriptome.fasta> \
--refProts_file <refProts.fasta> \
--altProts_file <altProts.fasta> \
--candidate_altProt_list <altProt_candidates.txt> \
[--processor_num 1] \
[--repetition keep_first]

Parameters:

  • --transcriptome_file (required): Transcriptome FASTA file.
  • --refProts_file (required): FASTA file of reference proteins.
  • --altProts_file (required): FASTA file of all alternative proteins.
  • --candidate_altProt_list (required): List of identifiers of selected alternative proteins to include in simulation.
  • --processor_num: Number of processors to use for parallel execution (default: 1).
  • --repetition: Strategy to handle duplicates (default: keep_first).
    • keep_first: Keep only the first of duplicate chimeric proteins.
    • keep_all: Keep all duplicates.
    • drop_all: Remove all duplicates.

Output: A file named simulated_chimeric_proteins.fasta containing the simulated chimeric proteins.

The input file <transcriptome.fasta> is the same file that was used as input for the first module (detect_ORFs). The input files <refProts.fasta> and <altProts.fasta> are the output files of the second module (separate_ORFs). The input file <altProt_candidates.txt> contains a user-defined list of altProt identifiers, one per line. It may correspond to conserved altProts, MS-supported altProts, or both. Thus, it may contain either a subset of the altProt identifiers from <altProts.fasta> or the identifiers of the entire set of altProts, depending on the scope of the study.
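
Before running the simulation, it can help to check that every identifier in the candidate list is present in the altProts file. The sketch below is an optional pre-check, not part of MosaicProt; the helper names and the example identifiers are hypothetical.

```python
def read_candidates(text):
    """Read a candidate list with one identifier per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_candidates(candidates, alt_prot_ids):
    """Return the candidates that do not appear among the altProt identifiers."""
    known = set(alt_prot_ids)
    return [identifier for identifier in candidates if identifier not in known]
```

An empty result from missing_candidates means the list is a subset of the altProt identifiers, as the simulation step expects.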

The number of processors to use for parallel execution (default: 1) can be user-defined based on the availability of processors in the system.
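
The three commands can be chained from Python, for example as sketched below. The flags are those documented above. The helper names are hypothetical, and two things are assumed: that the outputs of the first two commands are written to the working directory under their default names, and that `mosaicprot` is available on the PATH. When no processor count is given, the sketch falls back to the number of CPUs reported by the system.

```python
import os
import subprocess
from pathlib import Path


def build_pipeline(transcriptome, reference, candidates, threshold=30, processors=None):
    """Return the three MosaicProt command lines as argument lists."""
    # Assumed: the ORF file is written to the working directory.
    orfs = f"{Path(transcriptome).stem}_min_{threshold}aa_ORFs.fasta"
    n_proc = processors or os.cpu_count() or 1
    return [
        ["mosaicprot", "detect_ORFs",
         "--transcriptome_file", transcriptome,
         "--threshold", str(threshold)],
        ["mosaicprot", "separate_ORFs",
         "--ORFeome_file", orfs,
         "--reference_proteome_file", reference],
        ["mosaicprot", "simulate_chimeric_proteins",
         "--transcriptome_file", transcriptome,
         "--refProts_file", "refProts.fasta",
         "--altProts_file", "altProts.fasta",
         "--candidate_altProt_list", candidates,
         "--processor_num", str(n_proc)],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is executed.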

The --repetition flag controls how duplicated models are handled. Such models arise from altORF-altORF pairs (as opposed to refORF-altORF pairs), because each ORF of the pair is treated once as the refORF and once as the altORF. The “drop_all” option is useful when the goal is to exclude chimeric models generated from altORF-altORF pairs.
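
The three strategies can be compared with a small sketch. It assumes that two models are duplicates when their sequences are identical; how MosaicProt itself defines a duplicate is not stated here, and the function name is hypothetical.

```python
from collections import Counter


def handle_duplicates(models, repetition="keep_first"):
    """Filter (identifier, sequence) pairs according to the chosen strategy.
    Duplicates are assumed to be models with identical sequences."""
    if repetition == "keep_all":
        return list(models)
    if repetition == "drop_all":
        counts = Counter(sequence for _, sequence in models)
        return [(i, s) for i, s in models if counts[s] == 1]
    if repetition == "keep_first":
        seen, kept = set(), []
        for identifier, sequence in models:
            if sequence not in seen:
                seen.add(sequence)
                kept.append((identifier, sequence))
        return kept
    raise ValueError(f"unknown repetition strategy: {repetition}")
```

With keep_first, one copy of each duplicated model survives; with drop_all, every copy is removed, which is what excludes models from altORF-altORF pairs.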


MosaicProt Installation Guide

Requirements

Before installing MosaicProt, ensure you have:

  1. Python 3.6 or higher

python --version

  2. pip package manager

pip --version

  3. System requirements (recommended):
  • 4GB+ RAM for large datasets
  • Multi-core CPU for parallel processing
  • 500MB disk space

Installation

  1. Install from PyPI

pip install mosaicprot

  2. For development/editable installation

git clone https://github.com/aliyurtsevenn/mosaicprot.git

cd mosaicprot

pip install -e .

MosaicProt was developed to advance research on mosaic translation and programmed ribosomal frameshifting. It has enabled the discovery of chimeric proteins across various transcript types (mRNA, ncRNA, rRNA, tRNA) and is adaptable to any annotated or de novo sequenced transcriptome. For biological context and related studies, see our publication: Çakır et al. (2024, preprint).


Citation

If you use MosaicProt in your research, please cite the following article:

Umut Çakır, Noujoud Gabed, Ali Yurtseven, Igor Kryvoruchko (2025).
A universal pipeline MosaicProt enables large-scale modeling and detection of chimeric protein sequences for studies on programmed ribosomal frameshifting.
bioRxiv (Cold Spring Harbor Laboratory). https://doi.org/XXXXXXX


To report bugs, ask questions, or suggest features, feel free to open an issue on GitHub. Your feedback and citations help us improve and sustain this tool.
