MosaicProt CLI: Detect ORFs, Find Alternative ORFs, Generate Mosaic Proteins

What is MosaicProt?

The purpose of MosaicProt is to enable the de novo detection of chimeric peptide sequences produced by programmed ribosomal frameshifting (PRF). The tool can be used for the identification of individual PRF events and candidates for mosaic translation, i.e., sequences produced by multiple PRF events from a single transcript (hence the name MosaicProt). The tool can model potential chimeric sequences by analyzing the positional relationships between overlapping or closely spaced ORFs. It supports cases where altORFs:

  • Are embedded within refORFs.
  • Partially overlap with refORFs, spanning the border between a refORF and a UTR.
  • Are located entirely in a UTR, with simulated “long” frameshift events bridging the non-overlapping regions up to 10 nucleotides apart.

The simulation engine systematically explores multiple reading frames and frameshift values (±1 and ±2 nucleotides) to generate all plausible chimeric peptide combinations between overlapping or adjacent ORFs. For each ORF pair, it performs up to 30 one-nucleotide stepwise iterations to model potential frameshift junctions, simulating chimeric sequences of 40 amino acids in length. Sequences containing premature stop codons are automatically excluded, ensuring that only continuous, MS-compatible peptide models are retained.

MosaicProt is a command-line tool designed to:

  • Detect chimeric peptides and proteins from all possible ORFs of a defined length range in a transcriptome. A user can provide either a spliced transcriptome or precursor sequences, depending on the scope of a study.
  • Separate canonical (refProt) and non-canonical (altProt) sequences by comparing to a reference proteome. The reference proteome can be defined by a user or imported from a database.
  • Simulate chimeric protein sequences by combining segments of refProts and altProts. It can also model chimeric sequences from overlapping altProts.
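The frameshift junctions described in the simulation paragraph above can be illustrated with a minimal sketch. This is not MosaicProt's implementation: the function names, the 20 + 20 residue split around the junction, and the convention that a positive shift skips nucleotides while a negative shift re-reads them are all assumptions made for illustration. The stepwise scan over up to 30 junction positions is not reproduced; it would amount to calling the function below once per candidate junction.

```python
from itertools import product

# Standard genetic code, codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate complete codons of a DNA string, starting at index 0."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))


def chimeras_at(nt, junction, half=20, shifts=(-2, -1, 1, 2)):
    """Return {shift: peptide} for one junction (hypothetical helper).

    The upstream part is the `half` codons ending at `junction`; the
    downstream part is `half` codons starting at `junction + shift`.
    Models that run off the sequence or contain a stop codon are skipped.
    """
    models = {}
    up_start = junction - 3 * half
    for shift in shifts:
        down_start = junction + shift
        down_end = down_start + 3 * half
        if up_start < 0 or down_end > len(nt):
            continue
        peptide = translate(nt[up_start:junction]) + translate(nt[down_start:down_end])
        if "*" not in peptide:
            models[shift] = peptide
    return models


# Toy sequence: 20 Ala codons, one extra nucleotide, then 21 Gly codons.
toy = "GCT" * 20 + "A" + "GGT" * 21
models = chimeras_at(toy, junction=60)
```

In this toy case the -1 shift reads the codon TAG at the junction, so that model is dropped, while the other three shifts each give a continuous 40-residue peptide.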

Available Commands

The tool has three commands, which correspond to the modules described in Figure 1 of the published manuscript (the ORF detector module, the altProt/refProt separator module, and the chimeric modeler module):

1. detect_ORFs

This command detects all potential ORFs from a transcriptome FASTA file and translates them into corresponding amino acid sequences.

Usage:

mosaicprot detect_ORFs --transcriptome_file <transcriptome.fasta> [--threshold 30] [--output_file_type fasta]

Parameters:

  • --transcriptome_file (required): Input transcriptome FASTA file.
  • --threshold: Minimum ORF length in amino acids (default: 30).
  • --output_file_type: Output format (fasta or xml, default: fasta).

The input file <transcriptome.fasta> contains an annotated or de novo sequenced transcriptome of any organism. The output file name is built by appending the suffix “_min_30aa_ORFs.fasta” to the base name of the input file when the threshold is 30. For example, the input file “my_transcriptome.fasta” yields the output file “my_transcriptome_min_30aa_ORFs.fasta”. The suffix “min_30aa_ORFs” denotes regions (ORFs) that are free of in-frame stop codons and can be translated in silico into sequences longer than 29 aa. If xml is chosen as the output format, the output cannot be used with the next module, which accepts only FASTA files. The xml option is provided to enable export to other pipelines that require xml files as input.
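The naming rule and the notion of a stop-free region can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: `output_name` and `stop_free_segments` are hypothetical helpers, only the three forward frames are scanned, no start codon is required, and unknown codons are translated as X. Whether MosaicProt makes the same choices is not stated in this description.

```python
from itertools import product
from pathlib import Path

# Standard genetic code, codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def output_name(input_path, threshold=30):
    """Mirror the documented naming rule for FASTA output."""
    return f"{Path(input_path).stem}_min_{threshold}aa_ORFs.fasta"


def stop_free_segments(nt, min_aa=30):
    """Return (frame, peptide) pairs for stop-free stretches >= min_aa."""
    nt = nt.upper().replace("U", "T")
    found = []
    for frame in range(3):
        protein = "".join(
            CODON_TABLE.get(nt[i:i + 3], "X")
            for i in range(frame, len(nt) - 2, 3)
        )
        # A stop codon ("*") ends one stretch and begins the next.
        found += [(frame, seg) for seg in protein.split("*") if len(seg) >= min_aa]
    return found


name = output_name("my_transcriptome.fasta")
toy = "GCT" * 30 + "TAA" + "GCT" * 5
segments = stop_free_segments(toy)
```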

2. separate_ORFs

This command separates ORF products into refProts and altProts based on a known or a user-defined reference proteome.

Usage:

mosaicprot separate_ORFs --ORFeome_file <orfs.fasta> --reference_proteome_file <ref.fasta> [--output_alt_file altProts.fasta] [--output_ref_file refProts.fasta]

Parameters:

  • --ORFeome_file (required): Input FASTA file with predicted ORF translations.
  • --reference_proteome_file (required): Input FASTA file with reference proteins.
  • --output_alt_file: Output file for alternative proteins (default: altProts.fasta).
  • --output_ref_file: Output file for reference proteins (default: refProts.fasta).

The input file <orfs.fasta> is the output of module 1 (detect_ORFs). The input file <ref.fasta> contains a canonical proteome downloaded from a database or a user-defined canonical proteome. The default output file names are altProts.fasta and refProts.fasta.
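Conceptually, the separation step partitions the predicted ORF translations by comparing them to the reference proteome. The sketch below assumes the simplest possible criterion, an exact sequence match; the actual matching rule used by separate_ORFs is not specified here, and `read_fasta` and `separate` are hypothetical helper names.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is taken as the first whitespace-delimited token of
    the header line (an assumption).
    """
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def separate(orfs, reference):
    """Split ORF translations into (refProts, altProts) by exact match."""
    ref_seqs = set(reference.values())
    ref = {k: v for k, v in orfs.items() if v in ref_seqs}
    alt = {k: v for k, v in orfs.items() if v not in ref_seqs}
    return ref, alt


orfs = read_fasta(">orf1 frame0\nMKT\nAAA\n>orf2\nMLLP\n")
reference = read_fasta(">P1\nMKTAAA\n")
ref_prots, alt_prots = separate(orfs, reference)
```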

3. simulate_chimeric_proteins

This command simulates chimeric proteins by combining segments of refProts and altProts.

Usage:

mosaicprot simulate_chimeric_proteins \
  --transcriptome_file <transcriptome.fasta> \
  --refProts_file <refProts.fasta> \
  --altProts_file <altProts.fasta> \
  --candidate_altProt_list <altProt_candidates.txt> \
  [--processor_num 1] \
  [--repetition keep_first]

Parameters:

  • --transcriptome_file (required): Transcriptome FASTA file.
  • --refProts_file (required): FASTA file of reference proteins.
  • --altProts_file (required): FASTA file of all alternative proteins.
  • --candidate_altProt_list (required): List of identifiers of selected alternative proteins to include in simulation.
  • --processor_num: Number of processors to use for parallel execution (default: 1).
  • --repetition: Strategy to handle duplicates (default: keep_first).
    • keep_first: Keep only the first of duplicate chimeric proteins.
    • keep_all: Keep all duplicates.
    • drop_all: Remove all duplicates.

Output: A file named simulated_chimeric_proteins.fasta containing the simulated chimeric proteins.

The input file <transcriptome.fasta> is the same file that was used as input for the first module (detect_ORFs). The input files <refProts.fasta> and <altProts.fasta> are the output files of the second module (separate_ORFs). The input file <altProt_candidates.txt> contains a user-defined list of altProt identifiers, one per line. It may correspond to conserved altProts, MS-supported altProts, or both. Thus, it may contain a subset of the altProt identifiers from <altProts.fasta> or the identifiers of the entire set of altProts, depending on the scope of a study.
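When every altProt should be considered, the candidate list can be derived from the FASTA headers. The sketch below assumes that an identifier is the first whitespace-delimited token of each header line; check this against the headers actually written by separate_ORFs. The helper names `fasta_ids` and `write_candidates` are hypothetical.

```python
from pathlib import Path


def fasta_ids(fasta_path):
    """Collect identifiers from the header lines of a FASTA file."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">") and len(line.strip()) > 1:
                ids.append(line[1:].split()[0])
    return ids


def write_candidates(ids, out_path):
    """Write one identifier per line, as the candidate list expects."""
    Path(out_path).write_text("\n".join(ids) + "\n")
```

For example, `write_candidates(fasta_ids("altProts.fasta"), "altProt_candidates.txt")` would list every altProt; filter the identifiers first to keep only conserved or MS-supported ones.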

The number of processors to use for parallel execution (default: 1) can be user-defined based on the availability of processors in the system.
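One way to pick the processor count is to query the machine from Python and assemble the command from the documented flags. This is only a convenience sketch: `build_simulate_command` is a hypothetical helper, and defaulting to all available cores is an assumption that may not suit shared systems.

```python
import os


def build_simulate_command(transcriptome, ref_prots, alt_prots, candidates,
                           processors=None, repetition="keep_first"):
    """Assemble the simulate_chimeric_proteins call as an argument list."""
    if processors is None:
        # os.cpu_count() can return None, so fall back to a single core.
        processors = os.cpu_count() or 1
    return [
        "mosaicprot", "simulate_chimeric_proteins",
        "--transcriptome_file", transcriptome,
        "--refProts_file", ref_prots,
        "--altProts_file", alt_prots,
        "--candidate_altProt_list", candidates,
        "--processor_num", str(processors),
        "--repetition", repetition,
    ]


cmd = build_simulate_command(
    "my_transcriptome.fasta", "refProts.fasta",
    "altProts.fasta", "altProt_candidates.txt",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once MosaicProt is installed.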

The flag [--repetition] helps deal with duplicated models. Such models arise from altORF-altORF pairs (as opposed to refORF-altORF pairs) where one of the ORFs is considered as a refORF and the other one as an altORF, and then the other way around. The “drop_all” option of handling such duplicated models can be useful when the goal is to exclude chimeric models generated from altORF-altORF pairs.
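The three strategies can be pictured with a small sketch. It assumes that two models count as duplicates when their sequences are identical, which is an assumption about the tool's behavior, and `handle_repetition` is a hypothetical function name.

```python
from collections import Counter


def handle_repetition(models, strategy="keep_first"):
    """Filter (identifier, sequence) pairs according to a strategy."""
    if strategy == "keep_all":
        return list(models)
    if strategy == "drop_all":
        # Remove every model whose sequence occurs more than once.
        counts = Counter(seq for _, seq in models)
        return [(name, seq) for name, seq in models if counts[seq] == 1]
    if strategy == "keep_first":
        seen, kept = set(), []
        for name, seq in models:
            if seq not in seen:
                seen.add(seq)
                kept.append((name, seq))
        return kept
    raise ValueError(f"unknown strategy: {strategy}")


models = [("a|b", "MKTL"), ("b|a", "MKTL"), ("r|a", "MPPQ")]
first = handle_repetition(models, "keep_first")
dropped = handle_repetition(models, "drop_all")
```

Here "a|b" and "b|a" stand for the same altORF-altORF pair modeled in both directions: keep_first retains one copy, while drop_all leaves only the refORF-altORF model.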


MosaicProt Installation Guide

Requirements

Before installing MosaicProt, ensure you have:

  1. Python 3.6 or higher

python --version

  2. pip package manager

pip --version

  3. System Requirements (recommended):
  • 4GB+ RAM for large datasets
  • Multi-core CPU for parallel processing
  • 500MB disk space

Installation

  1. Install from PyPI

pip install mosaicprot

  2. For development/editable installation

git clone https://github.com/aliyurtsevenn/mosaicprot.git

cd mosaicprot

pip install -e .

MosaicProt was developed to advance research on mosaic translation and programmed ribosomal frameshifting. It has enabled the discovery of chimeric proteins across various transcript types (mRNA, ncRNA, rRNA, tRNA) and is adaptable to any annotated or de novo sequenced transcriptome. For biological context and related studies, see our publication: Çakır et al. (2024, preprint).


Citation

If you use MosaicProt in your research, please cite the following article:

Umut Çakır, Noujoud Gabed, Ali Yurtseven, Igor Kryvoruchko (2025).
A universal pipeline MosaicProt enables large-scale modeling and detection of chimeric protein sequences for studies on programmed ribosomal frameshifting.
bioRxiv (Cold Spring Harbor Laboratory). https://doi.org/10.1101/2025.05.29.656767


To report bugs, ask questions, or suggest features, feel free to open an issue on GitHub. Your feedback and citations help us improve and sustain this tool.
