Skip to main content

A package for HRMS data analysis

Project description

mass-suite logo

Build Status Coverage Status License Stars Forks

Python Issues Dependencies File Size Last Commit

This package is initiated from the University of Washington eScience capstone project.

Mass-suite: Comprehensive Mass Spectrometry(MS) data analysis tool

Mass-suite is a package used for general MS data process and advanced data analysis.

It provided a full capacity including data import, alignment, noise removal, visualization and fragment searching using online database.

The mass-suite package is also providing a data mining option for the feature clustering using dilution datasets, aiming at dilution prediction, source tracking and matrix effect assessment.

Contributers: Ximin Hu, Derek Mar, Nozomi Suzuki, Bowei Zhang

Release date: 2020/06/24

Update

  • 2020/09/01: Feature extraction pipeline validated comparing with msdial pipeline output
  • 2020/09/07: Updated formula prediction and isotopic score function, currently support formula prediction for small molecules under 500 Da

Installation & major dependencies

Activating the virtual environment

  • Included within the root of the repository is a virtual environment pre-suited to run mss
    • The virtual environment is located within environment.yml
    • To create the virtual environment from the .yml file: conda env create -f environment.yml
    • To activate the virtual environment: conda activate mss

Mass-suite requires major dependencies:

  • Pandas 1.0.3 or higher
  • Numpy 1.16.5 or higher
  • Scikit-learn 0.23.0 or higher
  • Scipy 1.4.1 or higher
  • pymzml 2.4.6 or higher
  • plotly 4.5.2 or higher

Optional dependencies for machine-learning models:

  • Keras 2.3.1
  • Tensorflow 2.0.0
  • Pytorch 1.4.0 or higher

Please make sure the major dependencies are up to date before the installation.

The package can be installed & imported as following:

pip install mass-suite
import mss

Organization of the project

The project has the following major structure:

mass-suite/
  |- mss/
     |- __init__.py
     |- alignment.py
     |- dm.py
     |- frag.py
     |- mssdata.py
     |- mssmain.py
     |- version.py
     |- visreader.py
     |- *.pkl
     |- tests/
        |- __init__.py
        |- test_*.py
     |- dev/  
        |- *.ipynb
  |- script/
     |- mss_processor.py
  |- doc/
  |- example_data/
  |- README.md
  |- LICENSE
  |- environment.yml

Featured modules

MS data import and alignment

Supported formats:

  • .mzml

MS data will be imported and filtered to create a dataframe in readable format from the mzml file.

A random forest model is integrated in the function as optional assessing tool for peak quality assessment.

import

Data from different experiments will be aligned together according to user defined error range.

Supported export formats: .csv/.xlsx

alignment

There is also a script version of the functions that integrated the functions into a automated workflow, please see script/mss_processor.py

The automated workflow is excutable via command line:

python -m mss_processor.py

example:

processor

Currently the script only support .csv export.

MS data analysis

Supported formats: formats supported by Pandas

This is a data analysis module using cluster algorithm to find grouping features, detecting noises and mining for dilution prediction/source tracking.

ms_cluster function will grouping the features into clusters using selected algorithm. Dimension reduction and cluster report are optional within the function settings.

Current clustering algorithms:

  • DBSCAN
  • OPTICS
  • Non-parametric hypothesis test (please refer to trend_calc function)

example output:

cluster

cluster report output:

report

After clustering, modeling function is provided to predict the dilution rate from samples and catch the behavior of individual clusters

feature_model(d_transpose, cluster_algorithm=True, model_method='ensemble_bagging_svc', report=True)

User can define either to use all features for modeling or individual clusters for modeling

Currently supporting models: linear regression, random forest regression, ensemble bagging regression

Model assessment report is also optional upon function settings:

modeling

User can use the model and cluster information for new sample prediction:

predict

Simple source identification tool is also incorporated in this module:

source

Please refer to function documentation for more details.

MS data visualization & interactive plots(extended work)

Example for TIC plot

Molecule identification using fragment information (extended work)

All the online searching is using MoNA MassBank database.

frag

Project Data

All the data used in the project is giving credit to Center of Urban Water.

Reference:

Application of Nontarget High Resolution Mass Spectrometry Data to Quantitative Source Apportionment
Katherine T. Peter, Christopher Wu, Zhenyu Tian, and Edward P. Kolodziej
Environmental Science & Technology 2019 53 (21), 12257-12268
DOI: 10.1021/acs.est.9b04481

Model training and testing

All the data used for model training is under 'example_data' folder and can be repeatly trained by users according to the settings.

Testing and continuous integration

The majority of code is tested under unittest and pushed through travis CI.

Licensing

The package is open source and can be utilized under MIT license. Please find the detail in licence file.

Call for contributions

Mass-suite welcomes help from wide range of different backgrounds. As the package will keep developing, suggestions and improvements are always appreciated. If you are interested in testing the package using your own MS data, giving suggestions for adding modules or improving existing ones, have some question or seeking for cooperation, please contact the main contributor(xhu66@uw.edu).

Special Thanks

  • A big shout to Professor Edward Kolodziej, Professor Dave Beck, Katherine Peter, and the Center of Urban Water for all of their help during the development of this package.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mass-suite-1.0.2.tar.gz (29.8 MB view details)

Uploaded Source

Built Distribution

mass_suite-1.0.2-py3-none-any.whl (30.2 MB view details)

Uploaded Python 3

File details

Details for the file mass-suite-1.0.2.tar.gz.

File metadata

  • Download URL: mass-suite-1.0.2.tar.gz
  • Upload date:
  • Size: 29.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for mass-suite-1.0.2.tar.gz
Algorithm Hash digest
SHA256 be913c2f4f66afd03667bf116a1d34ee78e382b4b45169cb3b8b72246d025a9e
MD5 b413b72f3c7d98fd9ec44760de570ff5
BLAKE2b-256 7331f061185310c6e64849acaa58829f804197ec4f248a47d7da31b89506c366

See more details on using hashes here.

File details

Details for the file mass_suite-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: mass_suite-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 30.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for mass_suite-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 59547bda70ea69702154f68d508ccb67beac421fd0643eae297d7c0b53449dbb
MD5 070cafa3cf64abe0e47a3bf1aaa5ad03
BLAKE2b-256 755855762748435ecdd57de985ccf3654aa4dcd5dd3418d2dfb12653f3672dc5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page