End-to-end workflow for de novo protein sequencing based on InstaNovo
A de novo protein sequencing workflow
Table of Contents
- Introduction
- Features
- Workflow Diagram
- Repository Structure
- Installation
- Command-Line Usage
- Hyperparameter Optimization
- License
- Acknowledgments
- References
- Citation
Introduction
InstaNexus is a generalizable, end-to-end workflow for direct protein sequencing, tailored to reconstruct full-length protein therapeutics such as antibodies and nanobodies. It integrates AI-driven de novo peptide sequencing with optimized assembly and scoring strategies to maximize accuracy, coverage, and functional relevance.
This pipeline enables robust reconstruction of critical protein regions, advancing applications in therapeutic discovery, immune profiling, and protein engineering.
Features
- 🧬 Supports De Bruijn Graph and Greedy-based assembly
- ⚗️ Handles multiple protease digestions (Trypsin, LysC, GluC, etc.)
- 🧹 Integrated contaminant removal and confidence filtering
- 🧩 Clustering, alignment, and consensus sequence reconstruction
- 🔗 Integrates with external tools:
  - MMseqs2 for fast clustering
  - Clustal Omega for high-quality alignment
- 📊 Output-ready for downstream analysis and visualization
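To make the De Bruijn graph idea concrete, here is a minimal sketch of how overlapping peptides can be stitched into a longer contig. It is an illustration only, not the InstaNexus implementation: the function names `build_de_bruijn` and `extend_contig`, the toy peptides, and the choice of `k=5` are all assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(peptides, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop if the walk would enter a cycle
            break
        contig += nxt[-1]  # each step adds one residue
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping peptides (illustrative input, not real pipeline data)
peptides = ["DTHKSEIAH", "SEIAHRFKD", "HRFKDLGEE"]
graph = build_de_bruijn(peptides, k=5)
contig = extend_contig(graph, "DTHK")
print(contig)
```

The walk stops at branch points, dead ends, and cycles, which is why real assemblers add scoring and filtering on top of the raw graph. A larger `k` reduces spurious joins but needs longer overlaps between peptides.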
Workflow Diagram
Repository Structure
| Folder / File | Description |
|---|---|
| docs/ | Sphinx documentation, tutorials, and images |
| fasta/ | FASTA reference and contaminant sequences |
| inputs/ | Example input CSV files |
| json/ | Metadata and parameter configuration files |
| outputs/ | Generated results (created during execution) |
| src/instanexus/ | Core InstaNexus package |
| src/instanexus/main.py | Runs the full pipeline |
| src/instanexus/preprocessing.py | Module for data cleaning |
| src/instanexus/assembly.py | Module for sequence assembly |
| src/instanexus/clustering.py | Module for clustering (mmseqs2) |
| src/instanexus/alignment.py | Module for alignment (clustalo) |
| src/instanexus/consensus.py | Module for consensus generation |
| src/instanexus/opt/ | Grid search and optimization workflows |
| tests/ | Pytest unit and integration tests |
| environment.linux.yml | Conda environment for Linux |
| environment.osx-arm64.yaml | Conda environment for macOS |
| pyproject.toml | Package metadata, dependencies, and entry point |
Installation
InstaNexus requires Python 3.11+, Conda, MMseqs2, and Clustal Omega.
We strongly recommend installing these dependencies in a dedicated conda environment.
> [!IMPORTANT]
> MMseqs2 and Clustal Omega are available through Conda, but compatibility depends on your system architecture.
Getting Started
Choose one of the two options below: install the released package from PyPI, or clone the repository and set up the environment with Conda.
Option 1: Install from PyPI
- Create and activate your conda environment.
- Install the package directly from PyPI:
pip install instanexus
Option 2: Install from Source (for Developers)
If you want to modify or contribute to the code, you can install it from the source repository:
Clone the repository:
git clone git@github.com:Multiomics-Analytics-Group/InstaNexus.git
cd InstaNexus
Create and activate the Conda environment:
# For Linux
conda env create -f environment.linux.yml
# For macOS (Apple Silicon)
conda env create -f environment.osx-arm64.yaml
conda activate instanexus
Install the package in editable mode:
pip install -e .
Verify the installation
instanexus --help
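Because the pipeline also depends on MMseqs2 and Clustal Omega, it can help to confirm that their executables are visible before running anything. The snippet below is a small sketch that assumes the tools are installed under their usual executable names, `mmseqs` and `clustalo`. The helper `find_external_tools` is hypothetical and not part of InstaNexus.

```python
import shutil


def find_external_tools(names=("mmseqs", "clustalo")):
    """Return a mapping of executable name to its resolved path (None if missing)."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_external_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as missing, check that the Conda environment is activated and that the package is available for your architecture.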
Command-Line Usage
After installation, the instanexus command (registered through the [project.scripts] entry point in pyproject.toml) runs the entire InstaNexus pipeline.
All parameters for preprocessing, assembly, clustering, and consensus are provided in a single call. The pipeline will automatically create a unique, timestamped output folder for that specific combination of parameters.
instanexus --help
Example: Run the full pipeline
This command runs the complete workflow:
- Preprocesses the input CSV.
- Assembles using dbg (De Bruijn graph).
- Clusters the resulting scaffolds.
- Aligns the clusters.
- Generates consensus sequences.
instanexus \
--input-csv inputs/bsa.csv \
--folder-outputs outputs \
--metadata-json-path json/sample_metadata.json \
--contaminants-fasta-path fasta/contaminants.fasta \
--assembly-mode dbg \
--conf 0.9 \
--kmer-size 7 \
--size-threshold 12 \
--min-overlap 3 \
--min-seq-id 0.85 \
--coverage 0.8
The results for this specific run will be saved in a unique directory, such as: outputs/bsa/dbg_c0.9_ks7_mo3_ts12/
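For scripted or repeated runs, the same call can be assembled from Python. The sketch below only builds the argument list from the flags shown in the example command and prints it. The helpers `build_command` and `run_dir_name` are hypothetical, and the directory naming pattern is inferred from the single example path above, so the real naming logic may differ.

```python
import shlex


def build_command(params):
    """Turn a {flag: value} mapping into an argument list for `instanexus`."""
    argv = ["instanexus"]
    for flag, value in params.items():
        argv += [f"--{flag}", str(value)]
    return argv


def run_dir_name(mode, conf, kmer_size, min_overlap, size_threshold):
    """Hypothetical helper: pattern inferred from one example directory name."""
    return f"{mode}_c{conf}_ks{kmer_size}_mo{min_overlap}_ts{size_threshold}"


# Flags and values copied from the example command
params = {
    "input-csv": "inputs/bsa.csv",
    "folder-outputs": "outputs",
    "metadata-json-path": "json/sample_metadata.json",
    "contaminants-fasta-path": "fasta/contaminants.fasta",
    "assembly-mode": "dbg",
    "conf": 0.9,
    "kmer-size": 7,
    "size-threshold": 12,
    "min-overlap": 3,
    "min-seq-id": 0.85,
    "coverage": 0.8,
}

argv = build_command(params)
print(shlex.join(argv))  # copy-pasteable shell command
print(run_dir_name("dbg", 0.9, 7, 3, 12))

# To actually launch the pipeline (requires InstaNexus to be installed):
# import subprocess; subprocess.run(argv, check=True)
```

Keeping the parameters in a dictionary makes it straightforward to loop over several values of one flag, for example different k-mer sizes, while leaving the rest of the call unchanged.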
License
This project is licensed under the MIT License.
Acknowledgments
InstaNexus was developed at DTU Biosustain and DTU Bioengineering.
We are grateful to the DTU Bioengineering Proteomics Core Facility for maintenance and operation of mass spectrometry instrumentation.
We also thank the Informatics Platform at DTU Biosustain for their support during the development and optimization of InstaNexus.
Special thanks to the users and developers of the tools cited in the References below.
References
- Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988
- Sievers, F., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539 (2011). https://doi.org/10.1038/msb.2011.75
- Eloff, K., Kalogeropoulos, K., Mabona, A., Morell, O., Catzel, R., Rivera-de-Torre, E., ... & Jenkins, T. P. (2025). InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments. Nature Machine Intelligence, 1-15.
Citation
If you find this project useful in your research or work, please cite it as:
Reverenna M., Nielsen M. W., Wolff D. S., Lytra E., Colaianni P. D., Ljungars A., Laustsen A. H., Schoof E. M., Van Goey J., Jenkins T. P., Lukassen M. V., Santos A., Kalogeropoulos K. (2025). Generalizable direct protein sequencing with InstaNexus [Preprint]. bioRxiv. https://doi.org/10.1101/2025.07.25.666861