Skip to main content

MosaicProt CLI: Detect ORFs, Find Alternative ORFs, Generate Mosaic Proteins

Project description

What is MosaicProt?

The purpose of MosaicProt is to enable the de novo detection of chimeric peptide sequences produced by programmed ribosomal frameshifting (PRF). The tool can be used for the identification of individual PRF events and candidates for mosaic translation, i.e., sequences produced by multiple PRF events from a single transcript (hence the name MosaicProt). The tool can model potential chimeric sequences by analyzing the positional relationships between overlapping or closely spaced ORFs. It supports cases where altORFs:

  • Are embedded within refORFs.
  • Partially overlap with refORFs, spanning the border between a refORF and a UTR.
  • Are located entirely in a UTR, with simulated “long” frameshift events bridging the non-overlapping regions up to 10 nucleotides apart.

The simulation engine systematically explores multiple reading frames and frameshift values (±1 and ±2 nucleotides) to generate all plausible chimeric peptide combinations between overlapping or adjacent ORFs. For each ORF pair, it performs up to 30 one-nucleotide stepwise iterations to model potential frameshift junctions, simulating chimeric sequences of 40 amino acids in length. Sequences containing premature stop codons are automatically excluded, ensuring that only continuous, MS-compatible peptide models are retained. MosaicProt is a command-line tool designed to:

  • Detect chimeric peptides and proteins from all possible ORFs of a defined length range in a transcriptome. A user can provide either a spliced transcriptome or precursor sequences, depending on the scope of a study.
  • Separate canonical (refProt) and non-canonical (altProt) sequences by comparing to a reference proteome. The reference proteome can be defined by a user or imported from a database.
  • Simulate chimeric protein sequences by combining segments of refProts and altProts. It can also model chimeric sequences from overlapping altProts.

Available Commands

The tool has three commands, which correspond to the modules described in Figure 1 of the published manuscript (the ORF detector module, the altProt/refProt separator module, and the chimeric modeler module):

1. detect_ORFs

This command detects all potential ORFs from a transcriptome FASTA file and translates them into corresponding amino acid sequences.

Usage:

mosaicprot detect_ORFs --transcriptome_file <transcriptome.fasta> [--threshold 30] [--output_file_type fasta]

Parameters:

  • --transcriptome_file (required): Input transcriptome FASTA file.
  • --threshold: Minimum ORF length in amino acids (default: 30).
  • --output_file_type: Output format (fasta or xml, default: fasta).

The input file <transcriptome.fasta> contains an annotated or a de novo sequenced transcriptome of any organism. The output file name is built by the addition of a prefix “_min_30aa_ORFs.fasta” to the input file name, if the threshold is 30. For example, input file name: “my_transcriptome.fasta”; output file name: “my_transcriptome_min_30aa_ORFs.fasta”. The prefix “min_30aa_ORFs” stands for regions (ORF) that are free from in-frame stop codons and can be in silico translated to altProts longer than 29 aa. If the output file format is chosen to be xml, it cannot be used with the next module, which accepts only FASTA files. This xml option was provided to enable export to other pipelines that require xml files as inputs.

2. separate_ORFs

This command separates ORF products into refProts and altProts based on a known or a user-defined reference proteome.

Usage:

mosaicprot separate_ORFs --ORFeome_file <orfs.fasta> --reference_proteome_file <ref.fasta> [--output_alt_file altProts.fasta] [--output_ref_file refProts.fasta]

Parameters:

  • --ORFeome_file (required): Input FASTA file with predicted ORF translations.
  • --reference_proteome_file (required): Input FASTA file with reference proteins.
  • --output_alt_file: Output file for alternative proteins (default: altProts.fasta).
  • --output_ref_file: Output file for reference proteins (default: refProts.fasta).

The input file <orfs.fasta> is the output of module 1 (detect_ORFs). The input file <ref.fasta> contains a canonical proteome downloaded from a database or a user-defined canonical proteome. The default output file names are altProts.fasta and refProts.fasta.

3. simulate_chimeric_proteins

This command simulates chimeric proteins by combining segments of refProts and altProts.

Usage:

mosaicprot simulate_chimeric_proteins
--transcriptome_file <transcriptome.fasta>
--refProts_file <refProts.fasta>
--altProts_file <altProts.fasta>
--candidate_altProt_list <altProt_candidates.txt>
[--processor_num 1]
[--repetition keep_first]

Parameters:

  • --transcriptome_file (required): Transcriptome FASTA file.
  • --refProts_file (required): FASTA file of reference proteins.
  • --altProts_file (required): FASTA file of all alternative proteins.
  • --candidate_altProt_list (required): List of identifiers of selected alternative proteins to include in simulation.
  • --processor_num: Number of processors to use for parallel execution (default: 1).
  • --repetition: Strategy to handle duplicates (default: keep_first).
    • keep_first: Keep only the first of duplicate chimeric proteins.
    • keep_all: Keep all duplicates.
    • drop_all: Remove all duplicates.

Output: A file named simulated_chimeric_proteins.fasta containing the simulated chimeric proteins.

The input file <transcriptome.fasta> is the same file that was used as input for the first module (detect_ORFs). The input files <refProts.fasta> and <altProts.fasta> are the output files of the second module (separate_ORFs). The input file <altProt_candidates.txt> contains a user-defined list of altProt identifiers separated by a new line. It may correspond to conserved altProts, MS-supported altProts, or both. Thus, it may contain a subset of altProt identifiers from the file <altProts.fasta> or identifiers of the entire set of altProts, depending on the scope of a study.

The number of processors to use for parallel execution (default: 1) can be user-defined based on the availability of processors in the system.

The flag [--repetition] helps deal with duplicated models. Such models arise from altORF-altORF pairs (as opposed to refORF-altORF pairs) where one of the ORFs is considered as a refORF and the other one as an altORF, and then the other way around. The “drop_all” option of handling such duplicated models can be useful when the goal is to exclude chimeric models generated from altORF-altORF pairs.


MosaicProt Installation Guide

Requirements

Before installing MosaicProt, ensure you have:

  1. Python3.6 or higher

python --version

  1. pip package manager

pip --version

  1. System Requirements (recommended):
  • 4GB+ RAM for large datasets
  • Multi-core CPU for parallel processing
  • 500MB disk space

Installation

  1. Install from PyPI

pip install mosaicprot

  1. For development/editable installation

git clone https://github.com/aliyurtsevenn/mosaicprot.git

cd mosaicprot

pip install -e .

MosaicProt was developed to advance research on mosaic translation and programmed ribosomal frameshifting. It has enabled the discovery of chimeric proteins across various transcript types (mRNA, ncRNA, rRNA, tRNA) and is adaptable to any annotated or de novo sequenced transcriptome. For biological context and related studies, see our publication: Çakır et al. (2024, preprint).


Citation

If you use MosaicProt in your research, please cite the following article:

Umut Çakır, Noujoud Gabed, Ali Yurtseven, Igor Kryvoruchko (2025).
A universal pipeline MosaicProt enables large-scale modeling and detection of chimeric protein sequences for studies on programmed ribosomal frameshifting.
bioRxiv (Cold Spring Harbor Laboratory). https://doi.org/10.1101/2025.05.29.656767


To report bugs, ask questions, or suggest features, feel free to open an issue on GitHub. Your feedback and citations help us improve and sustain this tool.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mosaicprot-0.1.11.tar.gz (19.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mosaicprot-0.1.11-py3-none-any.whl (16.5 kB view details)

Uploaded Python 3

File details

Details for the file mosaicprot-0.1.11.tar.gz.

File metadata

  • Download URL: mosaicprot-0.1.11.tar.gz
  • Upload date:
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mosaicprot-0.1.11.tar.gz
Algorithm Hash digest
SHA256 22fe0dfcc1f92f38ac67b2b9dd39a6319763bf7956edab46c0ac49ba5ee4adb7
MD5 72d2b455fd6498cf94c156eaac5e7f52
BLAKE2b-256 3342428768c428a51934b0f4929902dd03dfc24703d91db610150df7a3319166

See more details on using hashes here.

Provenance

The following attestation bundles were made for mosaicprot-0.1.11.tar.gz:

Publisher: publish.yml on aliyurtsevenn/mosaicprot

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mosaicprot-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: mosaicprot-0.1.11-py3-none-any.whl
  • Upload date:
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mosaicprot-0.1.11-py3-none-any.whl
Algorithm Hash digest
SHA256 28cb81846ecde4741fae0a4a3ce647e35aedc0de8d0ad90b5bf52a052de31a53
MD5 60e6f5b4b4346ec744c3c180e18e116e
BLAKE2b-256 d3de62482c32dd3572483763bd1d86f2fee8e3fcb4af8b60156a5582b43cb661

See more details on using hashes here.

Provenance

The following attestation bundles were made for mosaicprot-0.1.11-py3-none-any.whl:

Publisher: publish.yml on aliyurtsevenn/mosaicprot

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page