End-to-end workflow for de novo protein sequencing based on InstaNovo

A de novo protein sequencing workflow

Introduction

InstaNexus is a generalizable, end-to-end workflow for direct protein sequencing, tailored to reconstruct full-length protein therapeutics such as antibodies and nanobodies. It integrates AI-driven de novo peptide sequencing with optimized assembly and scoring strategies to maximize accuracy, coverage, and functional relevance.

This pipeline enables robust reconstruction of critical protein regions, advancing applications in therapeutic discovery, immune profiling, and protein engineering.


Features

  • 🧬 Supports De Bruijn Graph and Greedy-based assembly
  • ⚗️ Handles multiple protease digestions (Trypsin, LysC, GluC, etc.)
  • 🧹 Integrated contaminant removal and confidence filtering
  • 🧩 Clustering, alignment, and consensus sequence reconstruction
  • 🔗 Integrates with external tools: MMseqs2 (clustering) and Clustal Omega (alignment)
  • 📊 Output-ready for downstream analysis and visualization
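To make the De Bruijn graph assembly mode more concrete, here is a minimal sketch of the general technique. It is not the InstaNexus implementation: the function names `build_de_bruijn` and `extend_contig`, the toy peptides, and the choice of k are all assumptions made for illustration, and the walk ignores incoming branches, which a real assembler would have to handle.

```python
from collections import defaultdict


def build_de_bruijn(peptides, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend a contig from `start` while the path is unambiguous."""
    contig, node, seen = start, start, {start}
    # graph.get avoids creating empty entries in the defaultdict.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping peptide reads (toy data).
peptides = ["MKWVTF", "WVTFIS", "TFISLL"]
graph = build_de_bruijn(peptides, k=4)
contig = extend_contig(graph, "MKW")
print(contig)  # MKWVTFISLL
```

Smaller k values join more reads but create more ambiguous branches, which is the usual trade-off behind a k-mer size parameter.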

Workflow Diagram

[Figure: InstaNexus workflow]


Repository Structure

| Folder / File | Description |
| --- | --- |
| docs/ | Sphinx documentation, tutorials, and images |
| fasta/ | FASTA reference and contaminant sequences |
| inputs/ | Example input CSV files |
| json/ | Metadata and parameter configuration files |
| outputs/ | Generated results (created during execution) |
| src/instanexus/ | Core InstaNexus package |
| src/instanexus/main.py | Runs the full pipeline |
| src/instanexus/preprocessing.py | Module for data cleaning |
| src/instanexus/assembly.py | Module for sequence assembly |
| src/instanexus/clustering.py | Module for clustering (mmseqs2) |
| src/instanexus/alignment.py | Module for alignment (clustalo) |
| src/instanexus/consensus.py | Module for consensus generation |
| src/instanexus/opt/ | Grid search and optimization workflows |
| tests/ | Pytest unit and integration tests |
| environment.linux.yml | Conda environment for Linux |
| environment.osx-arm64.yaml | Conda environment for macOS |
| pyproject.toml | Package metadata, dependencies, and entry point |
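As an illustration of what a consensus generation stage does in general, the sketch below takes a column-wise majority vote over sequences that have already been aligned. It is an assumption-laden example, not the code in `consensus.py`: the function name `majority_consensus`, the gap character, and the toy alignment are all hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Return the most common non-gap residue for each alignment column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("expected a non-empty list of equal-length sequences")
    consensus = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        if residues:  # skip columns that contain only gaps
            consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical aligned cluster members (toy data).
aligned = ["MKWV-FIS", "MKWVTFIS", "MKAVTFIS"]
consensus = majority_consensus(aligned)
print(consensus)  # MKWVTFIS
```

A plain majority vote treats every sequence equally; a production pipeline might instead weight residues by prediction confidence.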

Installation

InstaNexus requires Python 3.11+, Conda, MMseqs2, and Clustal Omega.

We strongly recommend installing these dependencies in a dedicated conda environment.

[!IMPORTANT] MMseqs2 and Clustal Omega are available through Conda, but compatibility depends on your system architecture.
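Because availability depends on the platform, it can help to confirm that the external tools are actually on your `PATH` before running the pipeline. The check below is a small sketch and not part of InstaNexus; the helper name `find_missing` is hypothetical, and it assumes the executables are installed under their usual names, `mmseqs` and `clustalo`.

```python
import shutil
import sys


def find_missing(tools=("mmseqs", "clustalo")):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


python_ok = sys.version_info >= (3, 11)
missing = find_missing()

print(f"Python 3.11+: {'yes' if python_ok else 'no'}")
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("MMseqs2 and Clustal Omega executables found.")
```

Run it inside the activated conda environment so that the environment's `bin` directory is included in the search.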


Getting Started

Choose one of the two options below. Option 1 installs the released package from PyPI; Option 2 clones the repository and sets up the Conda environment for development.

Option 1: Install from PyPI

  1. Create and activate your conda environment.
  2. Install the package directly from PyPI:
pip install instanexus

Option 2: Install from Source (for Developers)

If you want to modify or contribute to the code, you can install it from the source repository:

Clone the repository:

git clone git@github.com:Multiomics-Analytics-Group/InstaNexus.git
cd InstaNexus

Create and activate the Conda environment:

# For Linux
conda env create -f environment.linux.yml
# For macOS (Apple Silicon)
conda env create -f environment.osx-arm64.yaml

conda activate instanexus

Install the package in editable mode:

pip install -e .

Verify the installation

instanexus --help

Command-line usage

After installation, you can run the entire InstaNexus pipeline with the instanexus command, the entry point declared under [project.scripts] in pyproject.toml.

All parameters for preprocessing, assembly, clustering, and consensus are provided in a single call. The pipeline will automatically create a unique, timestamped output folder for that specific combination of parameters.

instanexus --help

Example: Run the full pipeline

This command runs the complete workflow:

  1. Preprocesses the input CSV.
  2. Assembles using dbg (De Bruijn graph).
  3. Clusters the resulting scaffolds.
  4. Aligns the clusters.
  5. Generates consensus sequences.

instanexus \
    --input-csv inputs/bsa.csv \
    --folder-outputs outputs \
    --metadata-json-path json/sample_metadata.json \
    --contaminants-fasta-path fasta/contaminants.fasta \
    --assembly-mode dbg \
    --conf 0.9 \
    --kmer-size 7 \
    --size-threshold 12 \
    --min-overlap 3 \
    --min-seq-id 0.85 \
    --coverage 0.8

The results for this specific run will be saved in a unique directory, such as: outputs/bsa/dbg_c0.9_ks7_mo3_ts12/
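The example path suggests a naming pattern built from the input file name and the run parameters. The sketch below reproduces that one example and nothing more. The mapping of `c`, `ks`, `mo`, and `ts` to confidence, k-mer size, minimum overlap, and size threshold is inferred from the command above, and the function `run_directory` is hypothetical rather than part of the package. Any timestamp the pipeline may add is not modeled.

```python
from pathlib import Path


def run_directory(folder_outputs, input_csv, assembly_mode, conf,
                  kmer_size, min_overlap, size_threshold):
    """Build a per-run output path following the pattern in the example."""
    sample = Path(input_csv).stem  # "inputs/bsa.csv" -> "bsa"
    tag = (f"{assembly_mode}_c{conf}_ks{kmer_size}"
           f"_mo{min_overlap}_ts{size_threshold}")
    return Path(folder_outputs) / sample / tag


path = run_directory("outputs", "inputs/bsa.csv", "dbg", 0.9, 7, 3, 12)
print(path.as_posix())  # outputs/bsa/dbg_c0.9_ks7_mo3_ts12
```

Encoding the parameters in the directory name keeps runs with different settings from overwriting each other, which is convenient when comparing grid-search results.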


License

This project is licensed under the MIT License.


Acknowledgments

InstaNexus was developed at DTU Biosustain and DTU Bioengineering.

We are grateful to the DTU Bioengineering Proteomics Core Facility for maintenance and operation of mass spectrometry instrumentation.

We also thank the Informatics Platform at DTU Biosustain for their support during the development and optimization of InstaNexus.

Special thanks to the users and developers of MMseqs2, Clustal Omega, and InstaNovo (see References).


References

  1. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988
  2. Sievers, F., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539 (2011). https://doi.org/10.1038/msb.2011.75
  3. Eloff, K., Kalogeropoulos, K., Mabona, A., Morell, O., Catzel, R., Rivera-de-Torre, E., ... & Jenkins, T. P. (2025). InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments. Nature Machine Intelligence, 1-15.

Citation

If you find this project useful in your research or work, please cite it as:

Reverenna M., Nielsen M. W., Wolff D. S., Lytra E., Colaianni P. D., Ljungars A., Laustsen A. H., Schoof E. M., Van Goey J., Jenkins T. P., Lukassen M. V., Santos A., Kalogeropoulos K. (2025). Generalizable direct protein sequencing with InstaNexus [Preprint]. bioRxiv. https://doi.org/10.1101/2025.07.25.666861
