A datastructure for efficient storage of mass spectrometry data
Project description
ms_map
A hash map used for the sole purpose of storing mass spectrometry data
Background
What is mass spectrometry data
If you hadn't heard of mass spectrometry, I doubt you will get much out of this datastructure, but feel free to stick around and learn!
Mass spectrometry data is data that is generated from a mass spectrometer (duh). More specifically, this data structure is meant for identify peptides (short sequences of amino acids) from the list of numbers (floats) from a mass spectrometer run. Amino acids are described as a sequence masses.
The primary use of this data structure, to be even MORE specific, is by using singly and doubly charged masses to identify a peptide. The output, at its most basic form, is a list of floating point numbers. It might look something like the following
[99.023, 140.743, 209.887, 288.115, 402.778]
This sequence of floating point numbers describe a particular sequence of amino acids. The way this works is as follows:
Say our sequence of amino acids is: MALW
(don't shoot me I know the masses don't make sense). The above sequence of amino acids describe any arrangement of ion types and charges. For the sake of this example, we will limit our scope to what're called b
and y
ions with possible charges of 1
and 2
. The b
ions describe amino acids in a left-to-right fashion, and y
right-to-left. So for our example sequence, we can break it down to something like:
b1 b2 b3
M | A | L | W
y3 y2 y1
So M is described by both a b
ion and a y
ion. The difference is that b
is the mass of M
and y
is really the mass of ALW
. But you can see how these ions complement eachother.
To dive a bit deeper, charges, as the name suggests, are the actual charges of the molecules and amino acid chains from the mass spectrometer. The most common of which are 1 and 2. So we could have up to (in this case) 4 different combinations of ions (b+, b++, y+, y++) that describe each junction.
Hash map with ms data
If you don't know anything about the hash map data structure, look through this wikipedia page.
The hash map works as follows:
Insertions
A pair of a mass (a floating point number) and a sequence (string of amino acids) are added.
insert(mass, spectrum)
The integer value of the mass is used as the key. Then the pair of the mass, sequence, is inserted into the list at the entry of the integer mass.
input: 123.456, ABC
key: 123
inserting value: 123.456, ABC
Any value with the SAME mass is attached to the same mass value to avoid redunancy
table: {123: [(123.456, [ABC])]}
insert: 123.456, XYZ
table: {123: [(123.456, [ABC, XYZ])]}
Values that do NOT have the same mass are not inserted in the same spot
table: {123: [(123.456, [ABC, XYZ])]}
insert: 567.890, LMNOP
table: {123: [(123.456, [ABC, XYZ])], 567: [(567.890, [LMNOP])]}
Lookups
A lookup on the table is a lookup on masses, not sequences. Much like an insertion, the lookup starts with taking the integer value of a mass. This is used as the search key.
search(mass, tolerance)
The tolerance parameter is used to determine how much leeway is allowed when searching for a mass. Once the key is extracted, we can look through a list for values that fall within the upper and lower bounds of the search mass.
search parameters: 123.455, 20
tolerance: 20 ppm of 123.455 -> 0.0025
upper bound: 123.455 + .0025 = 123.4575
lower bound: 123.455 - .0025 = 123.4525
key: 123
table entries at 123: [(123.456, [ABC, XYZ]), (123.999, [YYY])]
table entries at 123 with values between 123.4525 and 123.4575: [(123.456, [ABC, XYZ])]
Then the seqeunces of the values that fall within the mass range are returned
output: [ABC, XYZ]
Installation
To install the python bindings as a package, run
pip install ms_map
To install the source code, run
git clone https://github.com/zmcgrath96/ms_map.git
To build the C++ code, run tests and build the python bindings, run
cd ms_map
$ms_map> chmod u+x build_py.sh
$ms_map> ./build_py.sh
Example
from ms_map import PyMSMap
if __name__ == '__main__':
map = PyMSMap()
mass = 316.308
sequence = "ZAM"
print("Adding mass with sequence to the map")
map.insert(mass, sequence)
mass = 218.095
sequence = "PSM"
print("Adding a different mass with a different sequence to the map")
map.insert(mass, sequence)
print("Searching for mass that we put in there")
results = map.search(316.302, 20)
print("Results:")
for result in results:
print(result)
API
- insert(mass: float, sequence: str) -> None: insert a mass with a sequence into the table
- search(mass: float, ppm_tol: int) -> list: search for the mass with an upper and lower bound value calculated from the ppm tolerance parameter. Returns all sequences with a mass in this range.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for ms_map-1.1.1-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 840ca325f2266efd3f50c24194148f4ff1d3666309adf0e40ef9d3decbcd00f8 |
|
MD5 | 67602ef2204974250b4101dea3606c68 |
|
BLAKE2b-256 | 56f6a71b41bc77f8d94aa8340734614489b97182f6671eaee9537af049d3d1c6 |