A powerful GC/LC-HRMS data analysis tool
PyHRMS: Tools for working with High Resolution Mass Spectrometry (HRMS) data in Environmental Science
PyHRMS is a Python package designed to process high-resolution mass spectrometry data that is coupled with gas chromatography (GC) or liquid chromatography (LC). Its primary objective is to provide scientists with a user-friendly tool that can be used to read, process, and visualize LC/GC-HRMS data.
By utilizing PyHRMS, users can more easily analyze complex data sets, resulting in a more efficient and streamlined research process. Whether working with GC or LC coupled HRMS data, PyHRMS is a reliable and effective solution that can help researchers to achieve their scientific goals.
Contributor: Rui Wang
First release date: Nov.15.2021
Update
Aug.25.2024: pyhrms update log 0.9.3:
* Removed all files before processing.
* Added a new parameter: frag_rt_error = 0.02.
* Introduced a text file to record parameters.
pyhrms can be installed as follows:
pip install pyhrms
If you already have pyhrms installed and only want to upgrade to the latest version, run:
pip install pyhrms -U
pyhrms requires the following major dependencies:
numpy>=1.19.2
pandas>1.3.3
matplotlib>=3.3.2
pymzml>=2.4.7
scipy>=1.6.2
molmass==2021.6.18
tqdm>=4.62.3
openpyxl>=3.0.9
networkx>=2.6.3
scikit-learn>=1.0.2
pyopenms>=3.0.0
Features
PyHRMS provides the following functions:
Read raw LC/GC-HRMS data in mzML format;
Powerful and accurate peak picking for LC/GC-HRMS;
Alignment of retention time (rt) and mass-to-charge ratio (m/z, where z is the charge number of the ion) within a user-defined error range;
Accurate comparison of responses between two or more samples;
Convert profile data to centroid;
Parallel computing to improve efficiency;
Interactive visualization of raw mzML data;
Searching of local databases and MassBank;
MS quality evaluation for profile-mode MS data;
Processing of SWATH data.
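To make the alignment feature above more concrete, here is a minimal sketch of matching features from two samples within rt and m/z windows. It is not PyHRMS's implementation: the function name `align_features`, the tolerance values, and the units (minutes and Da) are assumptions chosen for illustration, and the second sample's values are invented.

```python
def align_features(reference, candidates, rt_error=0.1, mz_error=0.015):
    """Pair each reference feature with candidates inside both tolerance windows.

    Features are (rt, mz) tuples. The tolerances are absolute values
    (assumed here to be minutes and Da); this is an illustrative sketch only.
    """
    pairs = []
    for ref_rt, ref_mz in reference:
        for cand_rt, cand_mz in candidates:
            if abs(cand_rt - ref_rt) <= rt_error and abs(cand_mz - ref_mz) <= mz_error:
                pairs.append(((ref_rt, ref_mz), (cand_rt, cand_mz)))
    return pairs


# Hypothetical feature lists from two samples
sample_a = [(15.50, 241.0541), (10.11, 591.3243)]
sample_b = [(15.46, 241.0548), (10.40, 591.3240), (3.20, 120.0800)]

aligned = align_features(sample_a, sample_b)
for ref, cand in aligned:
    print(ref, '<->', cand)
```

Only the first feature of `sample_a` finds a partner here: the second one matches in m/z but falls outside the rt window, which is why both conditions are checked together.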
Papers Published Utilizing PyHRMS
Qiu, Y., Liu, L., Xu, C., Zhao, B., Lin, H., Liu, H., Xian, W., Yang, H., Wang, R., Yang, X., 2024. Farmland’s silent threat: Comprehensive multimedia assessment of micropollutants through non-targeted screening and targeted analysis in agricultural systems. J. Hazard. Mater. 476, 135064. https://doi.org/10.1016/j.jhazmat.2024.135064
Song, D., Tang, T., Wang, R., Liu, H., Xie, D., Zhao, B., Dang, Z., Lu, G., 2024. Enhancing compound confidence in suspect and non-target screening through machine learning-based retention time prediction. Environ. Pollut. 347, 123763. https://doi.org/10.1016/j.envpol.2024.123763
Chen, J., Tang, T., Li, Y., Wang, R., Chen, X., Song, D., Du, X., Tao, X., Zhou, J., Dang, Z., Lu, G., 2024. Non-targeted screening and photolysis transformation of tire-related compounds in roadway runoff. Sci. Total Environ. 924, 171622. https://doi.org/10.1016/j.scitotenv.2024.171622
Xia, D., Liu, L., Zhao, B., Xie, D., Lu, G., Wang, R., 2024. Application of Nontarget High-Resolution Mass Spectrometry Fingerprints for Qualitative and Quantitative Source Apportionment: A Real Case Study. Environ. Sci. Technol. 58, 727–738. https://doi.org/10.1021/acs.est.3c06688
Jia, W., Liu, H., Ma, Y., Huang, G., Liu, Y., Zhao, B., Xie, D., Huang, K., Wang, R., 2024. Reproducibility in nontarget screening (NTS) of environmental emerging contaminants: Assessing different HLB SPE cartridges and instruments. Sci. Total Environ. 912, 168971. https://doi.org/10.1016/j.scitotenv.2023.168971
Wang, R., Yan, Y., Liu, H., Li, Yanxi, Jin, M., Li, Yuqing, Tao, R., Chen, Q., Wang, X., Zhao, B., Xie, D., 2023. Integrating data dependent and data independent non-target screening methods for monitoring emerging contaminants in the Pearl River of Guangdong Province, China. Sci. Total Environ. 891, 164445. https://doi.org/10.1016/j.scitotenv.2023.164445
Jiang, X., Xue, Z., Chen, W., Xu, M., Liu, H., Liang, J., Zhang, L., Sun, Y., Liu, C., Yang, X., 2023. Biotransformation kinetics and pathways of typical synthetic progestins in soil microcosms. J. Hazard. Mater. 446, 130684. https://doi.org/10.1016/j.jhazmat.2022.130684
Liang, J., Wang, R., Liu, H., Xie, D., Tao, X., Zhou, J., Yin, H., Dang, Z., Lu, G., 2022. Unintentional formation of mixed chloro-bromo diphenyl ethers (PBCDEs), dibenzo-p-dioxins and dibenzofurans (PBCDD/Fs) from pyrolysis of polybrominated diphenyl ethers (PBDEs). Chemosphere 308, 136246. https://doi.org/10.1016/j.chemosphere.2022.136246
Xia, D., Liu, H., Lu, Y., Liu, Y., Liang, J., Xie, D., Lu, G., Qiu, J., Wang, R., 2023. Utility of a non-target screening method to explore the chlorination of similar sulfonamide antibiotics: Pathways and N–Cl intermediates. Sci. Total Environ. 858, 160042. https://doi.org/10.1016/j.scitotenv.2022.160042
Yang, X., Wang, R., He, Z., Dai, X., Jiang, X., Liu, H., Li, Y., 2023. Abiotic transformation of synthetic progestins in representative soil mineral suspensions. J. Environ. Sci. 127, 375–388. https://doi.org/10.1016/j.jes.2022.06.007
Liu, H., Wang, R., Zhao, B., Xie, D., 2024. Assessment for the data processing performance of non-target screening analysis based on high-resolution mass spectrometry. Sci. Total Environ. 908, 167967. https://doi.org/10.1016/j.scitotenv.2023.167967
Liu, H., Zhao, B., Jin, M., Wang, R., Ding, Z., Wang, X., Xu, W., Chen, Q., Tao, R., Fu, J., Xie, D., 2024. Anthropogenic-induced ecological risks on marine ecosystems indicated by characterizing emerging pollutants in Pearl River Estuary, China. Sci. Total Environ. 926, 172030. https://doi.org/10.1016/j.scitotenv.2024.172030
Licensing
The package is open source and can be used under the MIT license. Please see the license file for details.
PyHRMS documentation
I want to start using PyHRMS
from pyhrms import pyhrms as pms
Project structure:
pyhrms/
1. Basic functions
==================
|- multi_process/
|- first_process
|- sep_scans
|- gen_df
|- peak_picking
|- peak_finding
|- evaluate_ms
|- target_spec
|- spec_at_rt
|- interpolate_series
|- find_locators
|- cal_bg
|- isotope_distribution
|- split_peak_picking
|- remove_unnamed_columns
|- identify_isotopes
|- peak_alignment
|- gen_ref
|- second_process
|- peak_checking_area
|- peak_checking_area_split
|- DDA_to_DIA_result
|- fold_change_filter
|- concat_alignment
|- gen_DDA_ms2_df
|- ms_to_centroid
|- multi_process_database_matching
|- database_match
|- ms2_matching
|- ms2_matching
|- compare_frag
|- rt_matching
|- parent_tp_analysis
|- post_filter
|- remove_adducts_all
|- remove_adducts
|- summarize_results
|- summarized_results_concat
|- summarize_pos_neg_result
|- final_result_filter
|- isotope_matching
|- formula_to_distribution
|- isotope_score
2. Swath data processing
=========================
|- one_step_process_swath
|- swath_process
|- split_peak_picking_swath
|- swath_frag_extract
|- swath_frag_raw
|- extract
|- precursor_frag_peak_area
|- peak_checking_area_precursor_frag_swath
|- gen_ref_swath
|- eval2
|- swath_window_checking
3. Omics functions
==================
|- omics_final_area
|- omics_index_dict
|- omics_filter
|- map_values
|- PCA_analysis
|- omics_cmp_numbers
|- omics_cmp_total_area
|- omics_correcting_area
|- check_istd_quality
|- KMD_cal
4. FT-ICRMS data processing
===========================
|- FT_ICRMS_process
|- formula_prediction
|- draw_Van_Krevelen_diagrams
5. Ion mobility mass data processing
====================================
|- first_step_for_IMS
|- peak_picking_ion_mobility_DIA1
|- split_peak_picking2
6. Other functions
==================
|- one_step_process
|- one_step_process_DDA
|- get_ms2_from_DDA
|- extract_tic
|- ms_bg_removal
|- JsonToExcel
|- suspect_list_matching
|- rename_files
|- Calibration
|- get_frag_DIA
|- get_chinese_name
|- AIF_multi_ce
|- pubchem_search
|- draw_pie_chart
|- fingerprint_application
|- build_molecular_network
|- ISTD_evaluation
|- convert_db
|- get_chemical_name
|- calculate_mass_percentage
|- pubchem_search
|- get_correction_factor_waters
|- compare_ms_spectra
|- first_process_ms2
|- second_process_ms2
|- one_step_process_ms2
|- convert_df_to_mgf
Table of Contents
Quick start
Feature prioritization : multi_process()
Database matching : multi_process_database_matching()
Result filtering : post_filter()
Result summarizing : summarize_results()
Combining results of all samples : summarized_results_concat()
Combining results of pos & neg : summarize_pos_neg_result()
1. Quick start
1.1 Feature prioritization:
This function performs peak picking, peak alignment, and blank comparison in order to prioritize features that are unique to the sample compared with the blank. So that the program can distinguish the sample set from the control set, include one of the strings 'methanol', 'blank', or 'control' in the file names of your control set, and keep these strings out of the file names of your sample set.
path = '../Users/Desktop/my_HRMS_files'
company = 'Waters'
pms.multi_process(path, company, profile=True, control_group=['lab_blank', 'methanol'], processors=1,
                  ms2_analysis=True, area_threshold=200, filter_type=2)
The output file will have the suffix ‘_unique_cmps.xlsx’ and will be structured as follows:
new_index | rt | mz | intensity | S/N | area | …
---|---|---|---|---|---|---
15.48_241.05 | 15.5 | 241.0541 | 90817 | 1135.21 | 53476 | …
10.11_591.32 | 10.11 | 591.3243 | 78236 | 1738.58 | 12272 | …
… | … | … | … | … | … | …
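The split between control files and sample files described at the start of this section can be pictured with a short sketch. This is not how PyHRMS is known to do it internally: the helper `split_by_group` and the file names are hypothetical, and the sketch simply assumes that a file counts as a control when its name contains any string from `control_group`.

```python
def split_by_group(file_names, control_group):
    """Return (controls, samples) based on substrings in the file names.

    Illustrative assumption: a file is a control if its name contains
    any of the strings listed in control_group.
    """
    controls = [name for name in file_names
                if any(key in name for key in control_group)]
    samples = [name for name in file_names if name not in controls]
    return controls, samples


# Hypothetical file names in the data folder
files = ['lab_blank_01.mzML', 'methanol_01.mzML', 'site01.mzML', 'site02.mzML']
controls, samples = split_by_group(files, control_group=['lab_blank', 'methanol'])

print('controls:', controls)
print('samples:', samples)
```

With a naming scheme like this, a sample file that accidentally contains one of the control strings would be treated as a control, which is why the strings should be kept out of sample file names.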
1.2 Database matching
How to create a database using Excel?
Here is an example template for an Excel database of compounds:
Inchikey | Precursor | Frag | Formula | Smile | Mode | RT | Source | Source info
---|---|---|---|---|---|---|---|---
Inchikey1 | 211.1109 | [117.0459, 92.0506] | C13H13N3 | smile1 | pos | 15.36 | massbank | MoNA
Inchikey2 | 165.0425 | [135.0293, 135.0301] | C11H14N4O5 | smile2 | neg | 8.54 | massbank | MoNA
… | … | … | … | … | … | … | … | …
After setting up your local database, you can use the following function to match compounds and generate output files with the suffix “_rt_ms2_match.xlsx”.
import pandas as pd

path = '../Users/Desktop/my_HRMS_files'
database = pd.read_excel(r'../Users/Desktop/my_database.xlsx')
pms.multi_process_database_matching(path, database, processors=4, ms1_error=50, ms2_error=0.015, rt_error=0.1,
                                    mode='pos')
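The documentation does not state the units of the tolerance parameters. The sketch below assumes that `ms1_error=50` is a relative error in ppm, which is a common convention for precursor matching, and shows how such a check could be computed. The helper names `ppm_error` and `within_tolerance` and the observed m/z values are hypothetical and not part of the PyHRMS API.

```python
def ppm_error(observed_mz, reference_mz):
    """Relative mass error in parts per million (ppm)."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def within_tolerance(observed_mz, reference_mz, ms1_error=50):
    """True if the absolute ppm error is within ms1_error (assumed to be in ppm)."""
    return abs(ppm_error(observed_mz, reference_mz)) <= ms1_error


reference = 211.1109        # Precursor value from the database template
observed_close = 211.1150   # hypothetical measured m/z, about 19 ppm away
observed_far = 211.1300     # hypothetical measured m/z, about 90 ppm away

print(round(ppm_error(observed_close, reference), 1), within_tolerance(observed_close, reference))
print(round(ppm_error(observed_far, reference), 1), within_tolerance(observed_far, reference))
```

A relative tolerance scales with m/z, whereas an absolute tolerance such as `ms2_error=0.015` (presumably in Da) is the same width for every fragment.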
1.3 Result filtering
This function lets users filter results based on criteria such as p-value, fold change, intensity, and area. Any feature with a p-value greater than the user-defined threshold (e.g., 0.05) will be removed from the result dataframe. The filtered result will be automatically exported with a filename suffix “_filter.xlsx”.
path = r'../Users/Desktop/my_HRMS_files/excel_files_need_filter'
pms.post_filter(path, fold_change=5, p_value=0.05, i_threshold=500, area_threshold=500, drop=None)
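As a rough illustration of threshold-based filtering, the sketch below keeps only the features that pass all four criteria. It is not the `post_filter` implementation: the record keys `fold_change` and `p_value`, the direction of the fold-change, intensity, and area comparisons, and the example values for those two keys are assumptions. Only the removal of features whose p-value exceeds the threshold is taken from the description above.

```python
def filter_features(features, fold_change=5, p_value=0.05,
                    i_threshold=500, area_threshold=500):
    """Keep features that pass every threshold (illustrative sketch)."""
    kept = []
    for feature in features:
        if (feature['p_value'] <= p_value                # drop p-values above the threshold
                and feature['fold_change'] >= fold_change    # assumed: higher is better
                and feature['intensity'] >= i_threshold
                and feature['area'] >= area_threshold):
            kept.append(feature)
    return kept


# Intensity and area follow the example output table; the other values are hypothetical
features = [
    {'new_index': '15.48_241.05', 'intensity': 90817, 'area': 53476,
     'fold_change': 12.0, 'p_value': 0.01},
    {'new_index': '10.11_591.32', 'intensity': 78236, 'area': 12272,
     'fold_change': 8.0, 'p_value': 0.20},
    {'new_index': '5.00_300.10', 'intensity': 300, 'area': 400,
     'fold_change': 20.0, 'p_value': 0.001},
]

kept = filter_features(features)
print([feature['new_index'] for feature in kept])
```

The second feature is dropped because of its p-value and the third because its intensity and area fall below the thresholds, so a feature has to satisfy all criteria at once to survive.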
1.4 Single Result summarizing
This function collects identified features and ignores unidentified ones, returning a dataframe with the relevant information. It requires three input dataframes: a suspect list from the Norman network, an ecotoxicity database from the Norman network, and a compound category Excel file. For each identified feature, the function extracts the name, SMILES, CAS number, categories, and toxicity data and compiles them into a new dataframe, so users do not have to sift through the full result table manually.
df = pd.read_excel(r'../Users/Desktop/my_HRMS_files/sample_rt_ms2_match_filter.xlsx')
# db_category, suspect_list and db_toxicity are the three input dataframes described above
result_df = pms.summarize_results(df, db_category, suspect_list, db_toxicity)
How to build a category database?
Here is an example template for a category database:
Inchikey | category
---|---
AAEJJSZYNKXKSW-UHFFFAOYSA-N | ['PFAS']
AAIXLNBYXIVUKR-UHFFFAOYSA-N | ['PFAS']
… | ['..','..']
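A list written into an Excel cell, as in the category column of the template above, is typically read back as a plain string such as "['PFAS']". The sketch below shows one safe way to turn such a cell back into a Python list with the standard library. Whether PyHRMS parses the column this way is not stated in this documentation, and the helper `parse_categories` is hypothetical.

```python
import ast


def parse_categories(cell):
    """Turn a cell such as "['PFAS']" into a Python list of category names."""
    if isinstance(cell, list):
        return cell
    # literal_eval only evaluates Python literals, so it is safer than eval()
    return list(ast.literal_eval(cell))


# Rows as they might look after reading the category database (cells are strings)
rows = {
    'AAEJJSZYNKXKSW-UHFFFAOYSA-N': "['PFAS']",
    'AAIXLNBYXIVUKR-UHFFFAOYSA-N': "['PFAS']",
}

categories = {key: parse_categories(value) for key, value in rows.items()}
print(categories)
```

Note that this only works if the cells use straight quotes; typographic quotes introduced by a word processor would make the cell unparseable.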
1.5 Combining Results from Samples with specific ESI Polarity
The function iterates through all result files with specific ESI polarity (positive or negative) and summarizes the results, generating a new Excel file that contains the summarized information.
path = r'../Users/Desktop/my_HRMS_files/summarized_result'
all_name_index = ['site01','site02','site03','site04',...]
mode = 'pos'
result_df = pms.summarized_results_concat(path, all_name_index, mode)
1.6 Combining results of pos & neg
This function combines the summarized positive-mode and negative-mode results into one final result.
all_df_pos = pms.summarized_results_concat(path_pos, all_name_index, 'pos')
all_df_neg = pms.summarized_results_concat(path_neg, all_name_index, 'neg')
result_df = pms.summarize_pos_neg_result(all_df_pos, all_df_neg)
Acknowledgment
During the development of this package, I received valuable suggestions from Zhao Bo, Liu He, Xie Danping, Xia Di, and Zheng Jing at the South China Institute of Environmental Science, as well as from Lu Guining and Tang Ting at the South China University of Technology. I would also like to express my gratitude for the funding provided by the National Natural Science Foundation of China (Grant No. 22206133) and the National Key R&D Program of China (Project No. 2019YFC1804502).
Note
Please note that the documentation is currently a work in progress, and there is more content that is being written. I apologize for any inconvenience this may cause, but rest assured that I am continually updating the documentation to provide you with the most comprehensive guide to using PyHRMS.