Large-scale tandem mass spectrum clustering using fast nearest neighbor searching
falcon
The falcon spectrum clustering tool uses advanced algorithmic techniques for highly efficient processing of millions of MS/MS spectra. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively compare all spectra to each other. Finally, density-based clustering is performed to group similar spectra into clusters.
The software is available as open-source under the BSD license.
If you use falcon in your work, please cite the following publication:
- Wout Bittremieux, Kris Laukens, William Stafford Noble, Pieter C. Dorrestein. Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. publication pending (2021).
Installation
falcon requires Python 3.8+ and is available on the Linux and OSX platforms.
You can easily install falcon with pip:
pip install falcon-ms
Running falcon
falcon can be run from the command line, with settings specified as command-line arguments or set in an INI config file. falcon takes peak files (in the mzML, mzXML, or MGF format) as input and exports the clustering result as a comma-separated file with each MS/MS spectrum and its cluster label on a single line. Representative spectra for each cluster can optionally be exported to an MGF file.
Example falcon run with some relevant command-line arguments:
falcon peak/*.mzml falcon --export_representatives --precursor_tol 20 ppm --fragment_tol 0.05 --eps 0.10
This will cluster all MS/MS spectra in mzML files in the peak directory with the specified settings and write (i) the cluster assignments to the falcon.csv file, and (ii) the cluster representatives to the falcon.mgf file.
For detailed information on all available settings, run falcon -h or falcon --help.
Important settings
Here we provide information on the most important settings that influence the falcon clustering performance. All settings have sensible default values which should give good results for a wide variety of datasets.
Spectrum comparison
precursor_tol: The precursor mass tolerance and unit (in ppm or Dalton) to compare spectra to each other.
fragment_tol: The fragment mass tolerance (in Dalton) used during spectrum comparison.
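To illustrate the difference between the two precursor_tol units, here is a minimal sketch of a tolerance check. The function name within_precursor_tol and the choice of the second value as the ppm reference are assumptions made for this example; falcon's internal implementation may differ.

```python
def within_precursor_tol(mz_a, mz_b, tol, unit):
    """Return True if two precursor values match within the given tolerance.

    Hypothetical helper for illustration only, not part of falcon's API.
    """
    diff = abs(mz_a - mz_b)
    if unit == "ppm":
        # Relative difference in parts per million of the reference value.
        return diff / mz_b * 1e6 <= tol
    if unit == "Da":
        # Absolute difference in Dalton.
        return diff <= tol
    raise ValueError(f"unknown unit: {unit!r}")


print(within_precursor_tol(500.000, 500.008, 20, "ppm"))  # about 16 ppm apart
print(within_precursor_tol(500.000, 500.020, 20, "ppm"))  # about 40 ppm apart
```

A ppm tolerance scales with the precursor value, so the same setting is stricter in absolute terms for small precursors than for large ones, whereas a Dalton tolerance is constant.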
Clustering
eps: The maximum cosine distance between two spectra for them to be considered as neighbors of each other. This parameter crucially governs cluster purity (i.e., whether clusters contain spectra corresponding to only a single peptide). The ideal value of this parameter depends on the spectral characteristics of your data and on any optional spectrum preprocessing configured in falcon. Values between 0.05 and 0.15 will typically generate a pure clustering result. For more aggressive clustering, values up to 0.30 can be used.
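The effect of eps can be sketched with a small, self-contained DBSCAN over toy vectors. This is an illustrative pure-Python version that compares all pairs exhaustively; it is not falcon's implementation, and the min_samples parameter, its value of 2, and the toy vectors are assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(vectors, eps, min_samples=2):
    """Toy DBSCAN: returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    # Neighborhoods include the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


spectra = [
    [1.0, 0.0, 0.0],
    [0.95, 0.05, 0.0],  # cosine distance of about 0.001 to the first vector
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.3],    # cosine distance of about 0.051 to the third vector
    [0.0, 0.0, 1.0],
]
print(dbscan(spectra, eps=0.10))  # [0, 0, 1, 1, -1]
print(dbscan(spectra, eps=0.02))  # [0, 0, -1, -1, -1]
```

With the smaller eps the second pair is no longer close enough to be grouped, which shows how a stricter threshold trades fewer clustered spectra for purer clusters.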
Nearest neighbor indexing (see below)
n_probe: The maximum number of lists in the inverted index to inspect during querying. Inspecting fewer lists will run faster but will give slightly less accurate clustering results.
n_neighbors and n_neighbors_ann: The final number of neighbors to consider for each spectrum and the number of neighbors to retrieve during nearest neighbor searching, respectively. Querying fewer neighbors will run faster but will give slightly less accurate clustering results. n_neighbors_ann should be equal to or greater than n_neighbors.
hash_len: The length of the hashed vectors used for nearest neighbor searching. Larger vectors will reduce the number of hash collisions and more accurately approximate the true cosine distance, at the expense of longer nearest neighbor searching times and higher memory requirements.
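The trade-off behind n_probe can be sketched with a toy inverted index. The code below is a simplified illustration with two hand-picked representative vectors; the function names, the fixed centroids, and the data are assumptions made for this example and do not reflect falcon's actual indexing code.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def build_inverted_index(vectors, centroids):
    """Assign each vector to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(
            range(len(centroids)),
            key=lambda c: cosine_distance(vec, centroids[c]),
        )
        lists[nearest].append(idx)
    return lists


def query(vec, vectors, centroids, lists, n_probe, n_neighbors):
    """Search only the n_probe lists whose centroids are closest to vec."""
    order = sorted(
        range(len(centroids)),
        key=lambda c: cosine_distance(vec, centroids[c]),
    )
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    candidates.sort(key=lambda i: cosine_distance(vec, vectors[i]))
    return candidates[:n_neighbors]


centroids = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[1.0, 0.1], [0.75, 0.65], [0.65, 0.75], [0.1, 1.0]]
lists = build_inverted_index(vectors, centroids)

q = [0.72, 0.70]  # lies close to the boundary between the two cells
print(query(q, vectors, centroids, lists, n_probe=1, n_neighbors=2))  # [1, 0]
print(query(q, vectors, centroids, lists, n_probe=2, n_neighbors=2))  # [1, 2]
```

Probing a single list misses vector 2, which sits just across the cell boundary, while probing both lists recovers it at the cost of comparing against more candidates.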
Spectrum preprocessing
- There are several options to configure spectrum preprocessing prior to the clustering. See the command-line documentation for more information.
How does it work?
- High-resolution MS/MS spectra are converted to low-dimensional vectors using feature hashing. First, spectra are converted to sparse vectors using small mass bins to tightly capture their fragment masses. Next, the sparse, high-dimensional vectors are hashed to lower-dimensional vectors by using a hash function (the non-cryptographic MurmurHash3 algorithm) to map the mass bins separately to a small number of hash bins. This feature hashing approximately preserves the cosine similarity between vectors, so the similarity between hashed vectors can be used to approximate the similarity between the original spectra.
- Vectors are split into buckets based on the precursor m/z of the corresponding spectra to construct nearest neighbor indexes for highly efficient spectrum comparison. The spectrum vectors in each bucket are partitioned into data subspaces to create a Voronoi diagram, and all vectors are assigned to their nearest representative vector in an inverted index.
- A sparse pairwise distance matrix is computed by retrieving similarities to neighboring spectra using the nearest neighbor indexes. The accuracy and speed of similarity searching are governed by the number of neighboring cells to explore during searching: exploring more cells decreases the chance of missing a nearest neighbor in the high-dimensional space, at the expense of a longer searching time.
- Density-based clustering using the pairwise distance matrix is performed to find spectrum clusters. The DBSCAN algorithm is used to find spectra that are close to each other and that form a dense data subspace, and group them into clusters.
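The feature hashing step described in the first item above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses MD5 from the standard library as a stand-in hash function (the text above names MurmurHash3), and the function names, the bin width, and the hash length of 800 are hypothetical choices for this example only.

```python
import hashlib
import math


def hash_bin(bin_index, hash_len):
    # Stand-in hash (MD5) so the example needs only the standard library;
    # the description above names MurmurHash3 instead.
    digest = hashlib.md5(str(bin_index).encode()).digest()
    return int.from_bytes(digest[:8], "little") % hash_len


def spectrum_to_vector(peaks, bin_width, hash_len):
    """Bin (m/z, intensity) peaks and hash the bins into a unit-length vector."""
    vec = [0.0] * hash_len
    for mz, intensity in peaks:
        mass_bin = int(mz / bin_width)  # index of the fine-grained mass bin
        vec[hash_bin(mass_bin, hash_len)] += intensity
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))


spec_a = [(200.10, 100.0), (350.25, 50.0), (512.40, 25.0)]
spec_b = [(200.10, 90.0), (350.25, 55.0), (512.40, 30.0)]  # same peaks, new intensities
spec_c = [(180.07, 80.0), (421.33, 60.0), (633.90, 40.0)]  # unrelated peaks

vec_a, vec_b, vec_c = (
    spectrum_to_vector(s, bin_width=0.05, hash_len=800)
    for s in (spec_a, spec_b, spec_c)
)
print(round(cosine_similarity(vec_a, vec_b), 3))
print(round(cosine_similarity(vec_a, vec_c), 3))
```

Spectra that share fragment masses land in the same hash bins and keep a high similarity, while unrelated spectra only overlap when their mass bins happen to collide, which becomes less likely as the hash length grows.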
Contact
For more information you can visit the official code website or send an email to wbittremieux@health.ucsd.edu.