Encoding and embedding methods of dleamse, plus a writer for dleamse's faiss index (IndexIDMap type).

Project description

DLEAMSE

A Deep LEArning-based Mass Spectra Embedder for spectral similarity scoring.

DLEAMSE is based on a Siamese network and is trained and tested on a large dataset from PRIDE Cluster. This repository stores the encoder and embedder scripts that DLEAMSE uses to encode and embed spectra.

Training data set

The training and test data are built from a large spectral set from PRIDE Cluster, using high-confidence spectra retrieved from high-consistency clusters. We chose PRIDE Cluster data to train and test DLEAMSE for two reasons: 1. The spectra in high-consistency clusters are high-confidence spectra. 2. The spectral set from PRIDE Cluster covers more species and instrument types.

Two filters were used to retrieve high-confidence spectra. The first filter controls the quality of the collected clusters: we customized clustering-file-converter (https://github.com/spectra-cluster/clustering-file-converter) to retain the high-quality spectral clusters (cluster size >= 30, cluster ratio >= 0.8, and total ion current (TIC) >= 0.2). The second filter eliminates duplicate clusters assigned the same peptide sequence; only one cluster among the duplicates is kept, which ensures that the retained clusters come from different peptides. After filtering, 113,362 clusters were retained from PRIDE Cluster release 201504. The spectra in these clusters were acquired from the PRIDE Archive.
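The two filters described above can be sketched in a few lines of Python. This is only an illustration, not the customized clustering-file-converter: the `Cluster` record and its field names are hypothetical, and keeping the first cluster seen for each peptide is an assumption, since the text does not say which duplicate is chosen.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical record; field names are illustrative only.
    cluster_id: str
    peptide: str
    size: int
    ratio: float
    tic: float


def filter_clusters(clusters, min_size=30, min_ratio=0.8, min_tic=0.2):
    """Apply the quality filter, then keep one cluster per peptide."""
    seen_peptides = set()
    kept = []
    for c in clusters:
        # Filter 1: quality thresholds quoted in the text.
        if c.size < min_size or c.ratio < min_ratio or c.tic < min_tic:
            continue
        # Filter 2: drop clusters whose peptide was already retained
        # (assumption: the first one encountered is the one kept).
        if c.peptide in seen_peptides:
            continue
        seen_peptides.add(c.peptide)
        kept.append(c)
    return kept
```

Applying the quality filter first means a low-quality cluster never blocks a better cluster with the same peptide from being retained.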

Model and Training

In DLEAMSE, the Siamese network (Figure 1a) trains two identical embedding models (Figure 1c) with shared weights, and spectra are encoded by the same encoder (Figure 1b) before embedding. Based on the Euclidean distance between the pair of embedded spectra, the weights of the embedding model are learned with a contrastive loss function adapted from Hadsell et al. that penalizes far-apart same-label spectra (label=1) and nearby different-label spectra (label=0). Backpropagation from the loss function updates the weights in the network. The network is trained by stochastic gradient descent with the Adam update rule and a learning rate of 0.005. The code is implemented in Python 3 with the PyTorch framework.

[Figure 1: model]
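To make the loss concrete, here is a minimal pure-Python sketch of a contrastive loss for one pair of embedded spectra, using the label convention stated above (1 = same label, 0 = different label). It is not the DLEAMSE training code: the margin value of 1.0 and the 1/2 scaling factors are assumptions taken from the common form of the loss, and the function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, label, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    label=1: same-label pair, penalized by its squared distance.
    label=0: different-label pair, penalized only when closer than `margin`.
    The margin default is an assumption, not a value from the source.
    """
    d = euclidean_distance(u, v)
    if label == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

For a same-label pair the loss grows with distance, so training pulls the two embeddings together. For a different-label pair the loss is zero once the distance reaches the margin, so only pairs that sit too close are pushed apart.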

Testing

[Figure: loss and test]

Requirements

  • Python3.7 (or Anaconda3)
  • torch==1.0.0 (python -m pip install torch===1.0.0 torchvision===0.2.1 -f https://download.pytorch.org/whl/torch_stable.html)
  • pyteomics>=3.5.1
  • numpy>=1.13.3
  • numba>=0.45.0
  • faiss-gpu==1.5.3 (if you want to use faiss index making and searching function)
  • more_itertools==7.1.0

Installation

DLEAMSE's encoder and embedder have been packaged and uploaded to PyPI; the package name is dleamse.

python -m pip install dleamse

Usage

The model file of DLEAMSE: 080802_20_1000_NM500R_model.pkl
The 500 reference spectra used in our project: 500_rfs_spectra.mgf

Encode and Embed spectra, then write faiss index

# -*- coding:utf8 -*-
from dleamse.dleamse_encode_and_embed import encode_and_embed_spectra
from dleamse.dleamse_encode_and_embed import SiameseNetwork2
from dleamse.dleamse_faiss_index_writer import FaissWriteIndex

if __name__ == '__main__':
    # encode and embed spectra
    model = "./dleamse_model_references/080802_20_1000_NM500R_model.pkl"
    prj = "test"
    input_file = "PXD003552_61576_ArchiveSpectrum.json"
    reference_spectra = "./dleamse_model_references/0722_500_rf_spectra.mgf"

    embedded_spectra_data = encode_and_embed_spectra(model, prj, input_file, reference_spectra)

    # faiss index writer
    embedded_spectra_path = "."  # Path to the file storing the embedded spectra data, whose name ends with _embedded.txt
    index_ids_save_file = "index_ids_save.txt"
    index_save_file = "test_0325.index"

    index_writer = FaissWriteIndex()
    index_writer.create_index_for_embedded_spectra(embedded_spectra_path, index_ids_save_file, index_save_file)

DLEAMSE's Scripts

dleamse_encode_and_embed.py:

Encodes and embeds spectra into vectors. This script supports spectra files in .mgf, .mzML and .json format. By default, it generates two or three files: the spectra embedding vectors file, the spectra USI file and, if any spectra are missing a charge, a record file of those spectra. By default, the GPU is used, and the DLEAMSE model and the 500 reference spectra file are expected in the dleamse_model_references directory under the current directory.
In this example, the input spectra file is PXD003552_61576_ArchiveSpectrum.json, and the three generated files are: PXD003552_61576_ArchiveSpectrum_embedded.npy; PXD003552_61576_ArchiveSpectrum_spectrum_usi.txt; PXD003552_61576_ArchiveSpectrum_miss_record.txt (only if some spectra are missing a charge).

from dleamse.dleamse_encode_and_embed import encode_and_embed_spectra
from dleamse.dleamse_encode_and_embed import SiameseNetwork2
def test_encode_and_embeder():
    # encode and embedded spectra
    model = "./dleamse_model_references/080802_20_1000_NM500R_model.pkl"
    prj = "test"
    input_file = "PXD003552_61576_ArchiveSpectrum.json"
    reference_spectra = "./dleamse_model_references/0722_500_rf_spectra.mgf"
    # Name must match the argument passed below (was misspelled as output_embedd_file)
    output_embedded_file = "PXD003552_61576_ArchiveSpectrum_embedded.npy"
    embedded_vstack_data = encode_and_embed_spectra(model, prj, input_file, reference_spectra, output_embedded_file)

dleamse_index_writer.py:

from dleamse.dleamse_faiss_index_writer import FaissWriteIndex

def test_index_write():
    # faiss index writer

    embedded_vstack_data = "PXD003552_61576_ArchiveSpectrum_embedded.npy"
    index_ids_save_file = "index_ids_save.txt"
    index_save_file = "test_0325.index"

    index_writer = FaissWriteIndex()
    index_writer.create_index(embedded_vstack_data, index_ids_save_file, index_save_file)

search_vectors_against_index.py:

  • Range search: query 32-dimensional spectra vectors against a spectral library's index file. The default threshold is 0.1.
    This example searches the vectors in PXD003552_61576_ArchiveSpectrum_embedded.npy against the spectral library's index file (PXD003552_61576_ArchiveSpectrum.index) and writes the results to test.csv. The library index file (--index_file), the USI file of the spectral library (--index_usi_file), the vectors file to be searched (-i, --input_embedded_spectra), and the search result file (-o, --output) must be specified.
    python search_vectors_against_index.py --index_file=PXD003552_61576_ArchiveSpectrum.index --index_usi_file=PXD003552_61576_ArchiveSpectrum_spectra_usi.txt -i=PXD003552_61576_ArchiveSpectrum_embedded.npy -o=test.csv
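
The idea behind a range search can be illustrated with a brute-force sketch in plain Python. This is not the code of search_vectors_against_index.py and it does not use faiss. The function name is hypothetical, and comparing the plain Euclidean distance with a strict "less than" against the threshold is an assumption, since the page does not state which distance the 0.1 threshold applies to.

```python
import math


def range_search(queries, library, library_ids, threshold=0.1):
    """Return, for each query vector, the library hits closer than `threshold`.

    Each hit is a (library_id, distance) tuple; hits are sorted by distance.
    Brute force for illustration only; an index avoids scanning every vector.
    """
    results = []
    for q in queries:
        hits = []
        for vec, vec_id in zip(library, library_ids):
            # Assumption: plain Euclidean distance, strict comparison.
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, vec)))
            if dist < threshold:
                hits.append((vec_id, dist))
        hits.sort(key=lambda hit: hit[1])
        results.append(hits)
    return results
```

Unlike a k-nearest-neighbour query, a range search returns a variable number of hits per query, which can be empty when no library spectrum lies within the threshold.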
