pyhrms

A powerful GC/LC-HRMS data analysis tool

Project description

PyHRMS: Tools for working with High-Resolution Mass Spectrometry (HRMS) data in Environmental Science

PyHRMS is a Python package designed to process data from high-resolution mass spectrometry (HRMS) coupled with gas chromatography (GC) or liquid chromatography (LC). Its primary objective is to provide scientists with a user-friendly tool for reading, processing, and visualizing LC/GC-HRMS data.

By utilizing PyHRMS, users can more easily analyze complex data sets, resulting in a more efficient and streamlined research process. Whether working with GC or LC coupled HRMS data, PyHRMS is a reliable and effective solution that can help researchers to achieve their scientific goals.

Contributor: Rui Wang

First release date: Nov 15, 2021

Update

July 13, 2023: pyhrms 0.6.1 new features:

rewrote the fold_change_filter function

added the AIF_multi_ce function to process AIF data

added pubchem_search and draw_pie_chart

pyhrms can be installed as follows:

pip install pyhrms

To upgrade an existing installation to the latest version, run:

pip install pyhrms -U

pyhrms requires the following major dependencies:

numpy>=1.19.2
pandas>1.3.3
matplotlib>=3.3.2
pymzml>=2.4.7
scipy>=1.6.2
molmass>=2021.6.18
tqdm>=4.62.3
openpyxl>=3.0.9
networkx>=2.6.3
scikit-learn>=1.0.2

Features

PyHRMS provides the following functions:

Reading raw LC/GC-HRMS data in mzML format;
Powerful and accurate peak picking for LC/GC-HRMS data;
Alignment of retention time (rt) and mass-to-charge ratio (m/z, where z is the charge number of the ion) within a user-defined error range;
Accurate comparison of responses between two or more samples;
Conversion of profile data to centroid data;
Parallel computing to improve efficiency;
Interactive visualization of raw mzML data;
Searching against local databases and MassBank;
MS quality evaluation for profile-mode MS data;
Processing of SWATH data.
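To make the alignment feature concrete, here is a minimal sketch of grouping features from different samples whose rt and m/z agree within a tolerance. It is not PyHRMS's implementation: the function name align_features, the tolerance defaults, and the greedy grouping strategy are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Feature = Tuple[float, float]  # (retention time in min, m/z)


def align_features(
    samples: Dict[str, List[Feature]],
    rt_tol: float = 0.1,    # assumed rt tolerance, minutes
    mz_tol: float = 0.015,  # assumed m/z tolerance, Da
) -> List[Dict[str, Feature]]:
    """Greedily group features from different samples that agree in rt and m/z."""
    groups: List[Dict[str, Feature]] = []
    for name, features in samples.items():
        for rt, mz in features:
            for group in groups:
                # The first feature added to a group serves as its reference.
                ref_rt, ref_mz = next(iter(group.values()))
                if (name not in group
                        and abs(rt - ref_rt) <= rt_tol
                        and abs(mz - ref_mz) <= mz_tol):
                    group[name] = (rt, mz)
                    break
            else:
                # No existing group fits, so this feature starts a new one.
                groups.append({name: (rt, mz)})
    return groups


samples = {
    "site01": [(10.11, 591.3243), (15.50, 241.0541)],
    "site02": [(10.13, 591.3250), (12.00, 300.1000)],
}
aligned = align_features(samples)
for group in aligned:
    print(group)
```

A greedy pass like this is simple to follow but depends on the order of the samples; it is shown only to illustrate what "aligned within an error range" means.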

Paper Published Utilizing PyHRMS

Wang, R., Yan, Y., Liu, H., Li, Y., Jin, M., Li, Y., Tao, R., Chen, Q., Wang, X., Zhao, B., Xie, D., 2023. Integrating data dependent and data independent non-target screening methods for monitoring emerging contaminants in the Pearl River of Guangdong Province, China. Science of the Total Environment 891 (2023) 164445. http://dx.doi.org/10.1016/j.scitotenv.2023.164445
Jiang, X., Xue, Z., Chen, W., Xu, M., Liu, H., Liang, J., Zhang, L., Sun, Y., Liu, C., Yang, X., 2023. Biotransformation kinetics and pathways of typical synthetic progestins in soil microcosms. Journal of Hazardous Materials 446, 130684. https://doi.org/10.1016/j.jhazmat.2022.130684
Liang, J., Wang, R., Liu, H., Xie, D., Tao, X., Zhou, J., Yin, H., Dang, Z., Lu, G., 2022. Unintentional formation of mixed chloro-bromo diphenyl ethers (PBCDEs), dibenzo-p-dioxins and dibenzofurans (PBCDD/Fs) from pyrolysis of polybrominated diphenyl ethers (PBDEs). Chemosphere 308, 136246. https://doi.org/10.1016/j.chemosphere.2022.136246
Xia, D., Liu, H., Lu, Y., Liu, Y., Liang, J., Xie, D., Lu, G., Qiu, J., Wang, R., 2023. Utility of a non-target screening method to explore the chlorination of similar sulfonamide antibiotics: Pathways and N Cl intermediates. Science of The Total Environment 858, 160042. https://doi.org/10.1016/j.scitotenv.2022.160042
Yang, X., Wang, R., He, Z., 2023. Abiotic transformation of synthetic progestins in representative soil mineral suspension. Journal of Environmental Science 127, 375-388. https://doi.org/10.1016/j.jes.2022.06.007

Licensing

The package is open source and is released under the MIT license. See the license file for details.

PyHRMS documentation

I want to start using PyHRMS

import pandas as pd  # used by the examples below to read Excel files
from pyhrms import pyhrms as pms

Project structure:

pyhrms/
1. Basic functions
==================
|- multi_process/
   |- first_process
      |- sep_scans
      |- gen_df
      |- peak_picking
          |- peak_finding
          |- evaluate_ms
              |- target_spec
              |- spec_at_rt
              |- interpolate_series
          |- find_locators
          |- cal_bg
          |- isotope_distribution
      |- split_peak_picking
      |- remove_unnamed_columns
      |- identify_isotopes
   |- peak_alignment
      |- gen_ref
   |- second_process
      |- peak_checking_area
      |- peak_checking_area_split
   |- fold_change_filter
      |- concat_alignment
   |- gen_DDA_ms2_df
      |- ms_to_centroid
|- multi_process_database_matching
  |- database_match
      |- ms2_matching
          |- ms2_matching
              |- compare_frag
      |- rt_matching
|- parent_tp_analysis
|- post_filter
|- remove_adducts_all
  |- remove_adducts
|- summarize_results
|- summarized_results_concat
|- summarize_pos_neg_result
|- final_result_filter
|- isotope_matching
  |- formula_to_distribution


2. SWATH data processing
========================
|- swath_process
    |- split_peak_picking_swath
    |- swath_frag_extract
    |- swath_frag_raw
    |- extract

3. Omics functions
==================
|- omics_final_area
|- omics_index_dict
|- omics_filter
|- map_values
|- PCA_analysis
|- omics_cmp_numbers
|- omics_cmp_total_area
|- omics_correcting_area
|- check_istd_quality
|- KMD_cal

4. FT-ICRMS data processing
===========================
|- FT_ICRMS_process
  |- gen_possible_formula
  |- frag_correction
      |- formula_prediction
          |- append_list
  |- formula_sep

5. Other functions
==================
|- get_ms2_from_DDA
|- extract_tic
|- ms_bg_removal
|- JsonToExcel
|- suspect_list_matching
|- rename_files
|- Calibration
|- get_frag_DIA
|- AIF_multi_ce
|- pubchem_search
|- draw_pie_chart

Table of Contents

Quick start

Feature prioritization: multi_process()
Database matching: multi_process_database_matching()
Result filtering: post_filter()
Result summarizing: summarize_results()
Combining results of all samples: summarized_results_concat()
Combining results of pos & neg: summarize_pos_neg_result()

1. Quick start

1.1 Feature prioritization:

This function primarily includes peak picking, peak alignment, and blank comparison to prioritize features that are unique to the sample compared to the blank. So that the program can distinguish between the sample set and the control set, include a label such as ‘methanol’, ‘blank’, or ‘control’ in the file names of your control set, pass those labels in control_group, and keep these strings out of the file names of your sample set.

path = '../Users/Desktop/my_HRMS_files'
company = 'Waters'
pms.multi_process(path, company, profile=True, control_group=['lab_blank', 'methanol'], processors=1, ms2_analysis=True,
              area_threshold=200, filter_type=2)
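The file-naming rule above can be pictured with the following sketch. It assumes plain, case-sensitive substring matching of each control label against the file name; the helper split_by_control_labels and the example file names are hypothetical and are not part of PyHRMS.

```python
from typing import Iterable, List, Tuple


def split_by_control_labels(
    file_names: Iterable[str], control_group: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (control_files, sample_files) using substring matching (assumed)."""
    controls: List[str] = []
    samples: List[str] = []
    for name in file_names:
        # A file counts as a control if any label occurs in its name.
        if any(label in name for label in control_group):
            controls.append(name)
        else:
            samples.append(name)
    return controls, samples


files = ["lab_blank_01.mzML", "methanol_01.mzML", "site01_a.mzML", "site01_b.mzML"]
controls, samples = split_by_control_labels(files, ["lab_blank", "methanol"])
print("controls:", controls)
print("samples:", samples)
```

Running a check like this on your own file names before processing is a cheap way to confirm that no sample file accidentally contains a control label.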

Note

Parameters explanation:

path: The path to the folder containing the mzML files to be processed, for example ‘../Users/Desktop/my_HRMS_files’.
company: The vendor of the mass spectrometer used to acquire the data. Valid options are ‘Waters’, ‘Thermo’, ‘Sciex’, and ‘Agilent’.
profile: A Boolean value that indicates whether the data are in profile or centroid mode. True for profile mode, False for centroid mode.
processors: The number of processors used for parallel data processing. Note that if memory usage exceeds 90%, some Excel files may not be generated.
control_group (List[str]): A list of labels representing the control group. These labels are used to search for the relevant file names.
filter_type (int): Determines how samples are compared with controls.
    1: for data without triplicates; fold change is computed as the ratio of the sample area to the maximum control area.
    2: for data with triplicates; the function calculates p-values, and fold change is computed as the ratio of the mean sample area to the mean control area.
ms2_analysis: A Boolean value that indicates whether to perform DIA fragment analysis. Set to True to enable DIA fragment analysis, or False to disable it.
area_threshold: The minimum peak area threshold. Peaks with an area below this threshold are excluded from the analysis.
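The two fold-change definitions described for filter_type can be written out as a short sketch. This only restates the arithmetic given above; the function name fold_change is hypothetical, control areas are assumed to be non-zero, and the p-value calculation is left out because the statistical test PyHRMS uses is not stated here.

```python
from statistics import mean
from typing import Sequence


def fold_change(
    sample_areas: Sequence[float],
    control_areas: Sequence[float],
    filter_type: int,
) -> float:
    """Fold change as described for filter_type 1 and 2 (illustrative only)."""
    if filter_type == 1:
        # No replicates: the single sample area against the largest control area.
        return sample_areas[0] / max(control_areas)
    if filter_type == 2:
        # Replicates: mean sample area against mean control area.
        return mean(sample_areas) / mean(control_areas)
    raise ValueError("filter_type must be 1 or 2")


fc_single = fold_change([53476.0], [500.0, 250.0], filter_type=1)
fc_triplicate = fold_change([6000.0, 5000.0, 4000.0], [400.0, 500.0, 600.0], filter_type=2)
print(fc_single, fc_triplicate)
```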

The output file will have the suffix ‘_unique_cmps.xlsx’ and will be structured as follows:

new_index	rt	mz	intensity	S/N	area	…
15.48_241.05	15.5	241.0541	90817	1135.21	53476	…
10.11_591.32	10.11	591.3243	78236	1738.58	12272	…
…	…	…	…	…	…	…

Note

If you have any questions about the column names in the output files, you can refer to the explanations provided below:

Inchikey: Fixed-length format directly derived from International Chemical Identifier of a compound.
rt_error: Difference between the observed retention time and the retention time recorded in the database.
rt: Retention time of a compound.
mz: Observed mass of a compound.
new_index: An index assigned after alignment to each m/z and retention time pair.
MS2_spectra: MS/MS spectra of compounds from DDA analysis (if available).
ms1_error: Mass difference between the observed mass and the theoretical mass (unit: parts per million, i.e., ppm).
ms1_opt_error: Mass difference between the optimized mass and the theoretical mass (for profile data only). The optimized mass is obtained by taking the midpoint of the full width at half maximum (FWHM) of a mass peak.
frag_match_num: Number of matched fragments.
match_info: Information on the matched fragments. For example, {344.1007: 0.0026, 372.0975: 0.0004} means two fragments were matched, at 344.1007 and 372.0975 Da, with mass errors of 0.0026 and 0.0004 Da, respectively.
Source: database source.
MS2 mode: The fragments were obtained by DDA mode, DIA mode or both.
Smile: SMILES (Simplified Molecular-Input Line-Entry System) string of a compound.
CAS: a unique identification number assigned by the Chemical Abstracts Service (CAS).
name: compound name.
formula: compound formula.
Norman_SusDat_ID: Norman suspect database ID.
Sites: Sampling sites in the Pearl River where the compound was detected.
Confidence level: Confidence level for structure identification.
Mode: ESI mode in which the compound was detected. For example, {‘pos’: 17, ‘neg’: 40} means the compound was detected at 17 sampling sites in positive mode and at 40 sampling sites in negative mode.
sites_num: Number of sampling sites where the compound was detected.
category: category of detected compound.
usage: usage of detected compound.
Lowest PNEC Freshwater [ug/l]: Lowest predicted no-effect concentration in freshwater. These data were obtained from NORMAN ecotoxicology database.
conc(ng/L): Concentration range for detected compounds.
frag_DIA: This represents the fragment generated by analyzing data-independent acquisition (DIA) data.
iso_distribution: This contains information about isotopes. For example, {591.3243: 1.0, 592.3254: 0.168} means that the m/z 591.3243 has a relative abundance of 100%, while 592.3254 has a relative abundance of 16.8%.
resolution: This represents the resolution of the mass peak.
Ciso: This is the potential carbon isotope peak. If the rt&mz pair have a value in Ciso (e.g., ‘10.11_592.3254’ has a value ‘C13:10.11 _591.3243’ in Ciso), it means that 10.11_592.3254 might be the C13 isotope peak of 10.11_591.3243.
Cliso and Briso: These represent the potential chlorine and bromine isotope peaks, respectively. They work similarly to Ciso.
Na adducts and K adducts: These represent the potential sodium and potassium adduct peaks, respectively. If the rt&mz pair have a value in Na adducts (e.g., ‘9.99_598.2756’ has a value ‘Na adducts: 9.98 _576.2983’ in Na adducts), it means that 9.99_598.2756 might be the sodium adduct of 9.98_576.2983. K adducts work similarly.
Sample_area_mean: If duplicates/triplicates are available, this represents the average peak area for these samples.
Sample_area_std: If duplicates/triplicates are available, this represents the standard error for these samples’ peak areas.
p_value: If triplicates are available, this represents the p-value when comparing the control set and sample set.
fold_change: This represents the fold change value when comparing the peak area of the control set and sample set.
frag_DDA: This represents the MS/MS spectra of compounds from data-dependent acquisition (DDA) analysis, if available.
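A few of the columns above follow simple arithmetic, sketched below. The helper names are hypothetical, the sign convention for the ppm error (observed minus theoretical) is an assumption, and the C13 check assumes singly charged ions with illustrative tolerances; PyHRMS's own isotope logic may differ.

```python
from typing import Tuple

C13_C12_DIFF = 1.003355  # mass difference between 13C and 12C, in Da


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Mass error in parts per million (observed minus theoretical, assumed sign)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def split_new_index(new_index: str) -> Tuple[float, float]:
    """Split an 'rt_mz' index such as '10.11_591.3243' into (rt, mz)."""
    rt_text, mz_text = new_index.split("_")
    return float(rt_text), float(mz_text)


def might_be_c13_peak(
    candidate: str,
    parent: str,
    rt_tol: float = 0.05,   # assumed tolerance, minutes
    mz_tol: float = 0.005,  # assumed tolerance, Da
) -> bool:
    """True if candidate co-elutes with parent and sits one 13C spacing above it."""
    cand_rt, cand_mz = split_new_index(candidate)
    par_rt, par_mz = split_new_index(parent)
    return (abs(cand_rt - par_rt) <= rt_tol
            and abs((cand_mz - par_mz) - C13_C12_DIFF) <= mz_tol)


print(ppm_error(100.0001, 100.0))
print(split_new_index("10.11_591.3243"))
print(might_be_c13_peak("10.11_592.3254", "10.11_591.3243"))
```

The same pattern, with a different mass spacing, would apply to the chlorine, bromine, and adduct columns.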

1.2 Database matching

How to create a database using Excel?

Here is an example template for an Excel database of compounds:

Inchikey	Precursor	Frag	Formula	Smile	Mode	RT	Source	Source info
Inchikey1	211.1109	[117.0459, 92.0506]	C13H13N3	smile1	pos	15.36	massbank	MoNA
Inchikey2	165.0425	[135.0293, 135.0301]	C11H14N4O5	smile2	neg	8.54	massbank	MoNA
…	…	…	…	…	…	…	…	…

Note

To build a local database, you will need to create an Excel file with information about the compounds you want to include in the database. It is important to note that you should not change the names of the columns in the Excel file, as they are used to map the information to the appropriate fields in the database.

Inchikey: A fixed-length format derived from the International Chemical Identifier (InChI) of a compound. InChI is a standard way of representing chemical structures.
Precursor: The monoisotopic mass of a compound, which is neutral and does not include any additional atoms that would result in a positive or negative charge.
Frag: The fragments of a compound, represented as a list of values. For example, [117.0459, 92.0506] would represent two fragments with masses of 117.0459 and 92.0506.
Formula: The molecular formula of a compound, which describes the types and numbers of atoms present in the molecule.
Smile: The Simplified Molecular Input Line Entry System (SMILES) notation for a compound, which is a string representation of its chemical structure.
Mode: Indicates whether the ion mode for the compound is positive or negative.
RT: Retention time of a compound.
Source: The source of the compound’s information, such as a database or literature reference.
Source info: Any additional information about the source of the compound’s information, such as the name of the database or the publication where the information was found.
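Before passing a local database to PyHRMS, it can help to check each row against the template. The sketch below assumes rows are available as dictionaries (for example via pandas' DataFrame.to_dict("records")) and that Frag may arrive as text such as "[117.0459, 92.0506]"; the helper check_database_row is hypothetical and not part of PyHRMS.

```python
import ast
from typing import Dict, List

# Column names taken from the template above; they must not be renamed.
REQUIRED_COLUMNS = [
    "Inchikey", "Precursor", "Frag", "Formula", "Smile",
    "Mode", "RT", "Source", "Source info",
]


def check_database_row(row: Dict[str, object]) -> List[str]:
    """Return a list of problems found in one database row (empty if none)."""
    problems = [f"missing column: {col}" for col in REQUIRED_COLUMNS if col not in row]
    if problems:
        return problems
    if row["Mode"] not in ("pos", "neg"):
        problems.append("Mode must be 'pos' or 'neg'")
    frag = row["Frag"]
    if isinstance(frag, str):
        # Excel cells often hold the list as text; parse it safely.
        try:
            frag = ast.literal_eval(frag)
        except (ValueError, SyntaxError):
            frag = None
    if not (isinstance(frag, list) and all(isinstance(x, (int, float)) for x in frag)):
        problems.append("Frag must be a list of numbers, e.g. [117.0459, 92.0506]")
    return problems


good_row = {
    "Inchikey": "Inchikey1", "Precursor": 211.1109, "Frag": "[117.0459, 92.0506]",
    "Formula": "C13H13N3", "Smile": "smile1", "Mode": "pos", "RT": 15.36,
    "Source": "massbank", "Source info": "MoNA",
}
bad_row = dict(good_row, Mode="positive", Frag="117.0459; 92.0506")
print(check_database_row(good_row))
print(check_database_row(bad_row))
```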

After setting up your local database, you can use the following function to match compounds and generate output files with the suffix “_rt_ms2_match.xlsx”.

import pandas as pd

path = '../Users/Desktop/my_HRMS_files'
database = pd.read_excel(r'../Users/Desktop/my_database.xlsx')
pms.multi_process_database_matching(path, database, processors=4, ms1_error=50, ms2_error=0.015, rt_error=0.1,
                                    mode='pos')
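To illustrate what the three error arguments control, here is a sketch of matching one feature against one database entry. It is not the PyHRMS algorithm. It assumes ms1_error is in ppm, ms2_error in Da, and rt_error in minutes, and that the neutral Precursor mass is compared as [M+H]+ in positive mode or [M-H]- in negative mode; the feature keys mz, rt, and frags are hypothetical.

```python
from typing import Dict, List, Optional

PROTON = 1.007276  # proton mass in Da


def expected_mz(neutral_mass: float, mode: str) -> float:
    """m/z of the assumed ion: [M+H]+ for 'pos', [M-H]- for 'neg'."""
    return neutral_mass + PROTON if mode == "pos" else neutral_mass - PROTON


def match_fragments(
    observed: List[float], reference: List[float], ms2_error: float = 0.015
) -> Dict[float, float]:
    """Return {reference fragment: smallest absolute mass error} for matches."""
    matched: Dict[float, float] = {}
    for ref in reference:
        errors = [abs(obs - ref) for obs in observed]
        if errors and min(errors) <= ms2_error:
            matched[ref] = round(min(errors), 4)
    return matched


def match_entry(
    feature: dict, entry: dict,
    ms1_error: float = 50, ms2_error: float = 0.015, rt_error: float = 0.1,
) -> Optional[Dict[float, float]]:
    """Return matched fragments, or None if precursor mass or rt do not agree."""
    target = expected_mz(entry["Precursor"], entry["Mode"])
    ppm = abs(feature["mz"] - target) / target * 1e6
    if ppm > ms1_error or abs(feature["rt"] - entry["RT"]) > rt_error:
        return None
    return match_fragments(feature["frags"], entry["Frag"], ms2_error)


entry = {"Precursor": 211.1109, "Frag": [117.0459, 92.0506], "Mode": "pos", "RT": 15.36}
feature = {"mz": 212.1185, "rt": 15.40, "frags": [117.0462, 92.0510, 65.0390]}
match_info = match_entry(feature, entry)
print(match_info)
```

The returned dictionary has the same shape as the match_info column described earlier, and its length corresponds to frag_match_num.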

1.3 Result filtering

This function lets users filter results based on criteria such as p-value, fold change, intensity, and area. Any feature with a p-value greater than the user-defined threshold (e.g., 0.05) will be removed from the result dataframe. The filtered result will be automatically exported with a filename suffix “_filter.xlsx”.

path = r'../Users/Desktop/my_HRMS_files/excel_files_need_filter'
pms.post_filter(path, fold_change=5, p_value=0.05, i_threshold=500, area_threshold=500, drop=None)
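The filtering criteria can be expressed as a simple predicate, sketched below. Only the p-value rule (remove features above the threshold) is stated in this documentation; treating the other three thresholds as minimum values is an assumption, and the helper passes_filter is hypothetical.

```python
from typing import Dict, List


def passes_filter(
    row: Dict[str, float],
    fold_change: float = 5,
    p_value: float = 0.05,
    i_threshold: float = 500,
    area_threshold: float = 500,
) -> bool:
    """Keep a feature only if it meets every threshold (directions partly assumed)."""
    return (row["p_value"] <= p_value
            and row["fold_change"] >= fold_change
            and row["intensity"] >= i_threshold
            and row["area"] >= area_threshold)


rows: List[Dict[str, float]] = [
    {"p_value": 0.01, "fold_change": 12.0, "intensity": 90817, "area": 53476},
    {"p_value": 0.20, "fold_change": 12.0, "intensity": 78236, "area": 12272},
    {"p_value": 0.01, "fold_change": 2.0, "intensity": 78236, "area": 12272},
]
kept = [row for row in rows if passes_filter(row)]
print(len(kept), "of", len(rows), "features kept")
```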

1.4 Single Result summarizing

The function collects identified features and ignores unidentified ones, returning a dataframe with the relevant information. Besides the result dataframe, it requires three input dataframes: a suspect list from the Norman network, an ecotoxicity database from the Norman network, and a compound category table (an Excel file). For each identified feature, the function extracts the name, SMILES, CAS number, categories, and toxicity data and compiles them into a new dataframe, so users do not have to look these up manually in large data sets.

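df = pd.read_excel(r'../Users/Desktop/my_HRMS_files/sample_rt_ms2_match_filter.xlsx')
# db_category, suspect_list, and db_toxicity are dataframes loaded beforehand
# (category database, Norman suspect list, and Norman ecotoxicity database)
result_df = pms.summarize_results(df, db_category, suspect_list, db_toxicity)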

How to build a category database?

Here is an example template for a category database:

Inchikey	category
AAEJJSZYNKXKSW-UHFFFAOYSA-N	['PFAS']
AAIXLNBYXIVUKR-UHFFFAOYSA-N	['PFAS']
…	['..', '..']
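The role of the category database in summarizing can be sketched as a lookup by InChIKey. This is an illustration only: the helper attach_categories is hypothetical, the category cell is assumed to hold a Python-style list as text, and features without an InChIKey are treated as unidentified and skipped.

```python
import ast
from typing import Dict, List


def attach_categories(features: List[dict], category_rows: List[dict]) -> List[dict]:
    """Add a 'category' list to each identified feature; drop unidentified ones."""
    lookup: Dict[str, list] = {}
    for row in category_rows:
        value = row["category"]
        # Cells read from Excel are usually text such as "['PFAS']".
        lookup[row["Inchikey"]] = (
            ast.literal_eval(value) if isinstance(value, str) else list(value)
        )
    summarized = []
    for feature in features:
        key = feature.get("Inchikey")
        if not key:
            continue  # unidentified feature, ignored in the summary
        summarized.append({**feature, "category": lookup.get(key, [])})
    return summarized


category_rows = [
    {"Inchikey": "AAEJJSZYNKXKSW-UHFFFAOYSA-N", "category": "['PFAS']"},
    {"Inchikey": "AAIXLNBYXIVUKR-UHFFFAOYSA-N", "category": "['PFAS']"},
]
features = [
    {"new_index": "15.48_241.05", "Inchikey": "AAEJJSZYNKXKSW-UHFFFAOYSA-N"},
    {"new_index": "10.11_591.32", "Inchikey": None},
]
summary = attach_categories(features, category_rows)
print(summary)
```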

1.5 Combining Results from Samples with specific ESI Polarity

The function iterates through all result files with specific ESI polarity (positive or negative) and summarizes the results, generating a new Excel file that contains the summarized information.

path = r'../Users/Desktop/my_HRMS_files/summarized_result'
all_name_index = ['site01','site02','site03','site04',...]
mode = 'pos'
result_df = pms.summarized_results_concat(path, all_name_index, mode)
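One piece of information such a combined table carries is the list of sites where each compound was detected. The sketch below shows that bookkeeping in isolation; the helper sites_per_compound and its input layout (site name mapped to a list of InChIKeys) are assumptions, not the structure PyHRMS uses internally.

```python
from typing import Dict, List


def sites_per_compound(results_by_site: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each InChIKey to the sites where it was detected (assumed layout)."""
    sites: Dict[str, List[str]] = {}
    for site, inchikeys in results_by_site.items():
        # set() ensures a compound is counted once per site.
        for key in set(inchikeys):
            sites.setdefault(key, []).append(site)
    return sites


results_by_site = {
    "site01": ["AAEJJSZYNKXKSW-UHFFFAOYSA-N", "AAIXLNBYXIVUKR-UHFFFAOYSA-N"],
    "site02": ["AAEJJSZYNKXKSW-UHFFFAOYSA-N"],
}
detected = sites_per_compound(results_by_site)
for inchikey, site_list in detected.items():
    print(inchikey, site_list, len(site_list))
```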

1.6 Combining results of pos & neg

This function combines the summarized positive-mode results and the summarized negative-mode results into one final result.

all_df_pos = pms.summarized_results_concat(path_pos, all_name_index, 'pos')
all_df_neg = pms.summarized_results_concat(path_neg, all_name_index, 'neg')
result_df = pms.summarize_pos_neg_result(all_df_pos, all_df_neg)
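The Mode and sites_num columns described earlier suggest how the two polarities are merged per compound. The sketch below is one plausible reading, not the PyHRMS code: the helper combine_site_counts is hypothetical, and counting sites_num as the union of sites across both polarities is an assumption.

```python
from typing import Dict, List


def combine_site_counts(
    pos_sites: Dict[str, List[str]], neg_sites: Dict[str, List[str]]
) -> Dict[str, dict]:
    """Merge per-polarity site lists into Mode counts and a total site count."""
    combined: Dict[str, dict] = {}
    for key in sorted(set(pos_sites) | set(neg_sites)):
        pos = set(pos_sites.get(key, []))
        neg = set(neg_sites.get(key, []))
        combined[key] = {
            "Mode": {"pos": len(pos), "neg": len(neg)},
            "sites_num": len(pos | neg),  # assumed: distinct sites in either mode
        }
    return combined


pos_sites = {"AAEJJSZYNKXKSW-UHFFFAOYSA-N": ["site01", "site02"]}
neg_sites = {
    "AAEJJSZYNKXKSW-UHFFFAOYSA-N": ["site02", "site03"],
    "AAIXLNBYXIVUKR-UHFFFAOYSA-N": ["site01"],
}
merged = combine_site_counts(pos_sites, neg_sites)
print(merged)
```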

NOTE

Please note that the documentation is currently a work in progress, and there is more content that is being written. I apologize for any inconvenience this may cause, but rest assured that I am continually updating the documentation to provide you with the most comprehensive guide to using PyHRMS.

Release history

0.9.8

Oct 29, 2024

0.9.7

Oct 27, 2024

0.9.6

Oct 21, 2024

0.9.5

Sep 20, 2024

0.9.4

Sep 15, 2024

0.9.3

Aug 25, 2024

0.9.2

May 25, 2024

0.9.1

May 9, 2024

0.9.0

Apr 20, 2024

0.8.9

Apr 14, 2024

0.8.8

Mar 25, 2024

0.8.7

Mar 14, 2024

0.8.6

Mar 10, 2024

0.8.5

Mar 8, 2024

0.8.4

Mar 6, 2024

0.8.3

Jan 25, 2024

0.8.2

Jan 24, 2024

0.8.1

Jan 24, 2024

0.8.0

Jan 24, 2024

0.7.9

Jan 23, 2024

0.7.8

Jan 10, 2024

0.7.7

Jan 2, 2024

0.7.6

Dec 18, 2023

0.7.5

Dec 14, 2023

0.7.4

Nov 22, 2023

0.7.3

Oct 30, 2023

0.7.2

Oct 29, 2023

0.7.1

Oct 13, 2023

0.7.0

Oct 13, 2023

0.6.9

Sep 21, 2023

0.6.8

Sep 21, 2023

0.6.7

Aug 9, 2023

0.6.6

Aug 8, 2023

0.6.5

Jul 26, 2023

0.6.4

Jul 26, 2023

0.6.3

Jul 26, 2023

0.6.2

Jul 13, 2023

This version

0.6.1

Jul 13, 2023

0.6.0

Jul 2, 2023

0.5.9

Jun 29, 2023

0.5.8

Jun 14, 2023

0.5.7

May 14, 2023

0.5.6

May 5, 2023

0.5.5

Mar 23, 2023

0.5.4

Mar 13, 2023

0.5.3

Mar 7, 2023

0.5.2

Mar 6, 2023

0.5.1

Feb 15, 2023

0.4.0

Jun 27, 2022

0.3.9

May 29, 2022

0.3.8

Apr 19, 2022

0.3.7

Apr 6, 2022

0.3.6

Mar 25, 2022

0.3.5

Mar 16, 2022

0.3.4

Mar 15, 2022

0.3.3

Mar 9, 2022

0.3.2

Mar 6, 2022

0.3.1

Feb 28, 2022

0.2.10

Feb 17, 2022

0.2.9

Jan 31, 2022

0.2.7

Jan 6, 2022

0.2.6

Dec 29, 2021

0.2.5

Dec 28, 2021

0.2.4

Dec 27, 2021

0.2.3

Dec 26, 2021

0.2.2

Dec 26, 2021

0.2.1

Dec 23, 2021

0.2.0

Dec 20, 2021

0.1.9

Dec 17, 2021

0.1.8

Dec 13, 2021

0.1.7

Dec 13, 2021

0.1.6

Dec 9, 2021

0.1.5

Dec 8, 2021

0.1.4

Dec 4, 2021

0.1.3

Dec 3, 2021

0.1.2

Nov 26, 2021

0.1.1

Nov 25, 2021

0.1.0

Nov 22, 2021

Download files

Source Distribution

pyhrms-0.6.1.zip (77.1 kB)

Uploaded Jul 13, 2023 Source

Built Distribution

pyhrms-0.6.1-py3-none-any.whl (60.4 kB)

Uploaded Jul 13, 2023 Python 3

Hashes for pyhrms-0.6.1.zip
Algorithm	Hash digest
SHA256	`95ec5929a656973c30cbed08016f87f641f75963f14a1c1493ad0e0181773bd2`
MD5	`ec0b671e6dfad2b93c8353a507800e65`
BLAKE2b-256	`f1de054d8ab4a079702676c48f1ae7f4c39e45169ffbdc3d837b0a035aab532c`

Hashes for pyhrms-0.6.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d486cd35622d02c008505f753406941241844b3947e64230adf13b5abc31e42`
MD5	`607e120d3140e81fd91c8bbe91f4074c`
BLAKE2b-256	`0cf748be239f2a93c048a43f315804fee7f6d6f149de574db2bbbd062184f8d6`
