de novo sequencer for heterogeneous oligomer mixtures

Oligomer Soup Sequencing (OLIGOSS)

OLIGOSS is a package for de novo sequencing of linear oligomers from tandem mass spectrometry data.

Note: proper use of this package requires a good working knowledge of both the chemistry and analytical methods used to obtain results. It is not a magic bullet to solve all of your oligomer sequencing needs! Please ensure that all raw mass spectrometry data input to the package is of good quality. All results should be checked and validated wherever possible, especially if running data from a new instrument and / or oligomer class.

Installation

OLIGOSS is available from the Python Package Index (PyPI) and can be installed with pip:

pip install oligoss --user

Source Code

Source code can be viewed on GitHub: https://github.com/croningp/oligoss.git

System Requirements

OLIGOSS was developed and tested on Ubuntu 19.10, and should therefore be compatible with any Unix OS. As of version 0.0.2, OLIGOSS is incompatible with Windows. Windows compatibility will be introduced in a later version.

Dependencies

  • Python (version 3.6.0 or later)

This package was written in Python 3.7, but should be compatible with version 3.6 or later.

  • mzML

Mass spectrometry data must be in .mzML file format. mzML files can be generated from a variety of vendors using Proteowizard MS Convert, which is freely available here: http://proteowizard.sourceforge.net/download.html

  • mzmlripper

Graham Keenan's mzmlripper is required for converting mzML files into JSON format. For full documentation: https://pypi.org/project/mzmlripper/

To install mzmlripper:

pip install mzmlripper --user

Run

To run an OLIGOSS sequencing workflow, run the following command:

python -m oligoss -i input_params.json -r ripper_folder -o output_folder

  • input_params.json = input parameters file

This should contain all relevant input parameters for executing an OLIGOSS sequencing workflow (see Input Parameters, below).

  • ripper_folder = data directory. NOTE: this argument can either be passed in via the command line directly (as above) or specified in the input parameters file using the data_folder parameter.

This folder should contain input MS data in either mzML or ripper JSON format.

  • output_folder = output directory. NOTE: this argument can either be passed in via the command line directly (as above) or specified in the input parameters file using the output_folder parameter.

All output data will be dumped to this folder.
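
When several data folders need to be processed, the command above can be assembled from a script. The sketch below only builds the argument list from the three flags shown above (-i, -r, -o). The helper name build_oligoss_command and the example paths are hypothetical, and actually executing the command assumes OLIGOSS is installed in the active Python environment.

```python
import sys
from typing import List


def build_oligoss_command(params_file: str, data_dir: str, out_dir: str) -> List[str]:
    """Return the argument list for one OLIGOSS run (hypothetical helper)."""
    return [
        sys.executable, "-m", "oligoss",
        "-i", params_file,  # input parameters file
        "-r", data_dir,     # folder containing mzML or ripper JSON data
        "-o", out_dir,      # folder that receives all output data
    ]


command = build_oligoss_command("input_params.json", "ripper_folder", "output_folder")
print(" ".join(command[1:]))
# To execute the run: subprocess.run(command, check=True)
```

Using sys.executable keeps the run inside the same interpreter (and virtual environment) as the calling script.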

Sequencing Workflows

There is currently only one sequencing workflow available in OLIGOSS. The exhaustive screening workflow is to be used for sequencing oligomers with well-characterised fragmentation pathways and known monomer libraries. For oligomer classes with poorly characterised fragmentation pathways and / or data with unknown monomers, new workflows based on the previous mass difference screen and Kendrick Analysis will be coming shortly.

Configuration Files and Input Parameters

The ionization and fragmentation pathways of a set of oligomers, and the spectra observed for MS1 and MS2 hits, vary with:

  1. oligomer class: each oligomer class has its own set of possible ionization and fragmentation pathways. There are usually many different ways an oligomer can fragment.

  2. instrumentation: the ionization source and fragmentation method used determine which fragmentation pathways dominate for an oligomer class. The resolution, sensitivity and other scan parameters affect how many MS1 precursors and MS2 product ions will be detected and assigned to the correct sequence.

  3. analyte properties: the analyte matrix (e.g. pH, salts present), as well as the monomers used in each experiment will affect not only what oligomers are present but how they may ionize and fragment.

Fragmentation pathways are stored in polymer-specific configuration files for each class of oligomers (see Polymer Configs section). Default operating conditions for mass spectrometers are stored in instrument configuration files (see Instrument Configs section). All pre-configured instrument settings are used only as defaults, and can be overwritten in the input parameters file.

All parameters which vary between individual experiments are specified in the input parameters file (see Input Parameters section, below).

Input Parameters

Input parameter files must be in JSON format. The following parameters are required upon execution of every sequencing experiment:

  1. mode:
    • Description: specifies the charge of ions.
    • Type: str
    • Options: either "pos" or "neg" for positive and negative ion mode, respectively.
  2. monomers:
    • Description: monomers used in experiment.
    • Type: List[str]
    • Options: any combination of monomer one letter codes from polymer-specific configuration files.
  3. screening_method:
    • Description: specifies which sequencing workflow to use.
    • Type: str
    • Options: currently only option is "exhaustive".
  4. polymer_class:
    • Description: specifies the backbone of oligomers being fragmented.
    • Type: str
    • Options: any valid alias associated with a polymer config file (see Polymer Configs).
  5. silico:
    • Description: parameters that define in silico properties.
    • Type: dict
    • Options: many sub-parameters must be defined (see Silico Parameters).
  6. extractors:
    • Description: parameters that define properties essential to matching and filtering in MS/MS data.
    • Type: dict
    • Options: many sub-parameters must be defined (see Extractor Parameters).
  7. postprocess:
    • Description: parameters that define properties relevant to assigning confidence scores to sequences based on extracted data.
    • Type: dict
    • Options: many sub-parameters must be defined (see Postprocess Parameters).

In addition to these parameters, optional parameters for mass spec model and chromatography can be defined. NOTE: if instruments are not defined, additional parameters must be defined in silico, extractors and postprocess (see Instrument Configs).

  1. instrument:
  • Description: specifies the mass spec model used to acquire MS/MS data.
  • Type: str
  • Options: any valid alias for a pre-configured instrument file.
  1. chromatography:
  • Description: specifies the chromatography used to separate products prior to detection via MS.
  • Type: str
  • Options: any valid alias for a pre-configured chromatography file.
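
As a rough illustration of the layout described above, the following sketch builds an input parameters dictionary in Python, checks that the seven required top-level keys are present, and serialises it to JSON. All values are placeholders: the monomer codes, the "peptide" alias and the nearly empty sub-dictionaries are assumptions, and the key spelling screening_method (with an underscore) is assumed from the naming of the other parameters. A real file would need the sub-parameters described in the following sections, particularly if no instrument is specified.

```python
import json

# Illustrative values only; adapt them to your own experiment and configs.
input_params = {
    "mode": "pos",
    "monomers": ["A", "K"],             # assumed one-letter codes
    "screening_method": "exhaustive",
    "polymer_class": "peptide",         # assumed polymer config alias
    "silico": {"min_length": 1, "max_length": 5, "ms1": {}, "ms2": {}},
    "extractors": {},
    "postprocess": {"subsequence_weight": 0.5},
}

REQUIRED_KEYS = (
    "mode", "monomers", "screening_method", "polymer_class",
    "silico", "extractors", "postprocess",
)


def missing_required(params: dict) -> list:
    """Return the required top-level keys that are absent from params."""
    return [key for key in REQUIRED_KEYS if key not in params]


if missing_required(input_params):
    raise ValueError(f"missing parameters: {missing_required(input_params)}")

# Serialise to JSON text; write this string to input_params.json to use it.
json_text = json.dumps(input_params, indent=2)
print(json_text)
```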

Silico Parameters

Silico parameters define the properties required to construct theoretical MS1 precursors and MS2 product ions. There are general silico parameters, as well as parameters specific to MS1 and MS2 ions, defined in the silico.ms1 and silico.ms2 subparameters, respectively.

  1. min_length:
  • Description: specifies minimum length (in monomer units) of oligomers in product mixtures.
  • Type: int
  • Options: Any integer value. NOTE: length of target oligomers affects size of the sequence space to screen. Limits on sequence space for screening will depend on oligomer class, types of monomers, length of oligomers and computing resources available.
  • Default: no default available
  1. max_length:
  • Description: specifies maximum length (in monomer units) of oligomers in product mixtures.
  • Type: int
  • Options: Any integer value. NOTE: length of target oligomers affects size of the sequence space to screen. Limits on sequence space for screening will depend on oligomer class, types of monomers, length of oligomers and computing resources available.
  • Default: no default available
  1. isomeric_targets:
  • Description: specifies isomeric sequence pool for targeting. If specified, only sequences isomeric to one or more isomeric targets will be included in the screen.
  • Type: List[str]
  • Options: list of any sequence strings that would be generated from input monomers and length distribution.
  • Default: None (not required).
  1. modifications:
  • Description: specifies targets for any covalent modifications.
  • Type: Dict[str or int, List[str]]
  • Options: keys specify modification targets (terminal and sidechain).
    • Terminal Keys: -1 or "-1" / 0 or "0" for terminus -1 and 0, respectively.
    • Sidechain Keys: keys must be one-letter codes for monomers with compatible sidechains (see Polymer Configs).
    • Modification Values: list of modification strings. Strings must correspond to valid modification alias (see Polymer Configs).
    • Example: {"K": ["Ole", "Pal"], "0": ["Ole"]} (in a JSON input file, keys must be written as strings).
  • Default: None (not required).
  1. ms1:
  • Description: specifies parameters for MS1 silico ions.
  • Type: dict
  • Options: many sub-parameters (see MS1 Silico Parameters).
  1. ms2:
  • Description: specifies parameters for MS2 silico ions.
  • Type: dict
  • Options: many sub-parameters (see MS2 Silico Parameters).
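
To give a feel for how min_length and max_length affect the size of the sequence space, the sketch below counts the linear sequences that can be written from a monomer library. It is a simple upper bound that assumes any monomer can follow any other, and it ignores isomeric_targets, modifications, reactivity classes and symmetry. The function name is hypothetical and is not part of OLIGOSS.

```python
def sequence_space_size(n_monomers: int, min_length: int, max_length: int) -> int:
    """Count all linear sequences with lengths in [min_length, max_length]."""
    if n_monomers < 1 or min_length < 1 or max_length < min_length:
        raise ValueError("need n_monomers >= 1 and 1 <= min_length <= max_length")
    # n_monomers ** length sequences exist for each length
    return sum(n_monomers ** length for length in range(min_length, max_length + 1))


print(sequence_space_size(4, 1, 5))   # 4 monomers, lengths 1-5  -> 1364
print(sequence_space_size(20, 1, 5))  # 20 monomers, lengths 1-5 -> 3368420
```

The count grows exponentially with max_length, which is why the limits depend on the monomer library and the computing resources available.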

MS1 Silico Parameters

MS1 silico parameters define properties required for generating theoretical MS1 precursor ions.

  1. min_z:
  • Description: specifies minimum absolute charge of MS1 precursor ions.
  • Type: int
  • Options: Any integer value.
  • Default: default value specified by instrument. If no instrument is specified, default value == 1.
  1. max_z:
  • Description: specifies maximum absolute charge of MS1 precursor ions.
  • Type: int
  • Options: Any integer value >= min_z.
  • Default: None (not required). Instrument defaults can be specified in instrument config file.
  1. universal_sidechain_modifications:
  • Description: specifies whether all sidechains targeted for modification will be modified by one or more modifying agent.
  • Type: bool
  • Options: true or false.
  • Default: Default can be specified in polymer config file (see Polymer Configs). If no default is specified in config, default value == True.
  1. universal_terminal_modifications:
  • Description: specifies whether all termini targeted for modification will be modified by one or more modifying agent.
  • Type: bool
  • Options: true or false.
  • Default: Default can be specified in polymer config file (see Polymer Configs). If no default is specified in config, default value == True.
  1. max_neutral_losses:
  • Description: specifies cap on maximum number of neutral loss fragmentation events at MS1.
  • Type: int
  • Options: any integer value.
  • Default: Default can be specified in instrument config file (see Instrument Configs). If default is not specified in config, default value == None.
  1. adducts:
  • Description: specifies extrinsic ions present in analyte matrix that will affect MS1 ionization.
  • Type: List[str]
  • Options: list of ion strings. NOTE: all ions must be defined in Global Chemical Constants.
  • Default: varies with instrument and polymer class. If no instrument- or polymer-specific defaults, this must be specified.
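
The role of min_z and max_z can be illustrated with the textbook relation between a neutral mass and the m/z of its protonated ions. The sketch below is not the OLIGOSS implementation: it assumes positive mode, considers protons as the only charge carriers, and ignores other adducts, neutral losses and intrinsically charged monomers. The function names and the example mass are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in u


def protonated_mz(neutral_mass: float, z: int) -> float:
    """m/z of an [M + zH]z+ ion for a given neutral monoisotopic mass."""
    if z < 1:
        raise ValueError("z must be a positive absolute charge")
    return (neutral_mass + z * PROTON_MASS) / z


def precursor_mz_values(neutral_mass: float, min_z: int = 1, max_z: int = 1) -> dict:
    """Map each charge state in [min_z, max_z] to its theoretical m/z."""
    return {z: protonated_mz(neutral_mass, z) for z in range(min_z, max_z + 1)}


for z, mz in precursor_mz_values(500.0, min_z=1, max_z=3).items():
    print(f"z={z}: m/z {mz:.4f}")
```

Each extra charge state adds one more theoretical precursor per sequence, so a wide min_z to max_z range increases the size of the MS1 library.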

MS2 Silico Parameters

MS2 silico parameters define properties required for generating theoretical MS2 product ions.

  1. fragment_series:
  • Description: linear fragment types to be included in MS2 silico libraries.
  • Type: List[str]
  • Options: list of any valid MS2 fragment one letter codes in polymer config file (see Polymer Configs).
  • Defaults: defaults vary depending on specific instrument and polymer class. If no default is specified in instrument config, this MUST be supplied in input parameters.
  1. max_neutral_losses:
  • Description: specifies cap on maximum number of neutral loss fragmentation events at MS2.
  • Type: int
  • Options: any integer value.
  • Default: Default can be specified in instrument config file (see Instrument Configs). If default is not specified in config, default value == None.
  1. signatures:
  • Description: specifies signature ion types to be included in MS2 silico libraries.
  • Type: List[str]
  • Options: list of any valid signature ion types in polymer config file.
  • Defaults: defaults vary depending on instrument and polymer class. If no default is specified in config files, default value == None.
  1. min_z:
  • Description: specifies minimum absolute charge of MS2 product ions.
  • Type: int
  • Options: Any integer value.
  • Default: default value specified by instrument. If no instrument is specified, no fallback default value is available, so this must be specified in input parameters.

Extractor Parameters

Extractor parameters define properties required for screening observed MS and MS/MS data for theoretical silico ions, and also for filtering MS data prior to screening.

  1. error:
  • Description: specifies error threshold for matching theoretical ion m/z values to observed ions in MS/MS data.
  • Type: float
  • Options: Any float value between 0 and 1.
  • Default: Default value specified by instrument config. NOTE: no fallback default is available - must be specified in input parameters or instrument config.
  1. error_units:
  • Description: specifies units of error tolerance value.
  • Type: str
  • Options: "ppm" for relative error tolerance in parts per million, or "abs" for absolute error tolerance in mass units (u).
  • Default: Default value specified by instrument config. NOTE: no fallback default is available - must be specified in input parameters or instrument config.
  1. min_rt:
  • Description: specifies minimum retention time (in minutes) for data to be used for screening.
  • Type: float
  • Options: any valid float between 0 and length of acquisition run time.
  • Defaults: default value specified by chromatography config. If no chromatography is specified, default value == None (not required).
  1. max_rt:
  • Description: specifies maximum retention time (in minutes) for data to be used for screening.
  • Type: float
  • Options: any valid float between 0 and length of acquisition run time.
  • Defaults: default value specified by chromatography config. If no chromatography is specified, default value == None (not required).
  1. min_ms1_total_intensity:
  • Description: specifies minimum total intensity (taken as the sum of intensities in an EIC) for MS1 precursor ions to be confirmed.
  • Type: float
  • Options: Any float >= 0.
  • Defaults: Default value is specified by instrument config. If no instrument is specified, default value == None (not required).
  1. min_ms2_total_intensity:
  • Description: specifies minimum total intensity (taken as the sum of intensities in an EIC) for MS2 product ions to be confirmed.
  • Type: float
  • Options: Any float >= 0.
  • Defaults: Default value is specified by instrument config. If no instrument is specified, default value == None (not required).
  1. min_ms1_max_intensity:
  • Description: specifies minimum peak intensity (taken as the most intense signal in an EIC) for MS1 precursor ions to be confirmed.
  • Type: float
  • Options: Any float >= 0.
  • Default: Default value is specified by instrument config. If no instrument is specified, default value == None (not required).
  1. min_ms2_max_intensity:
  • Description: specifies minimum peak intensity (taken as the most intense signal in an EIC) for MS2 product ions to be confirmed.
  • Type: float
  • Options: Any float >= 0.
  • Default: Default value is specified by instrument config. If no instrument is specified, default value == None (not required).
  1. rt_units:
  • Description: retention time units in mzML files.
  • Type: str
  • Options: either "min" or "sec" for minutes and seconds, respectively.
  • Defaults: This value depends on the mass spec manufacturer, and should generally only be used in instrument config files. WARNING: please do not define this value in input parameters if possible. If you are unsure about the retention time units of your mass spec vendor, convert your mzML files to mzmlripper format directly and check the recorded retention times of the first and final spectra.
  1. pre_screen_filters:
  • Description: specifies retention time and intensity thresholds for pre-filtering ripper data before screening.
  • Type: Dict[str, float]
  • Options: several subparameters to define (see Pre-Screen Filters, below).
  • Default: None (not required).
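
The difference between the two error_units options can be sketched as follows. This is a generic tolerance check, not the matching code used by OLIGOSS: the function name is hypothetical, and the error value is assumed to be taken at face value in the stated units.

```python
def within_tolerance(theoretical_mz: float, observed_mz: float,
                     error: float, error_units: str) -> bool:
    """Check whether an observed m/z matches a theoretical m/z."""
    difference = abs(observed_mz - theoretical_mz)
    if error_units == "ppm":
        # relative error, in parts per million of the theoretical m/z
        return difference / theoretical_mz * 1e6 <= error
    if error_units == "abs":
        # absolute error, in mass units (u)
        return difference <= error
    raise ValueError('error_units must be "ppm" or "abs"')


# An observed ion 0.0002 u away from a theoretical m/z of 500 is 0.4 ppm off.
print(within_tolerance(500.0, 500.0002, error=0.5, error_units="ppm"))     # True
print(within_tolerance(500.0, 500.0002, error=0.0001, error_units="abs"))  # False
```

A ppm tolerance scales with m/z, so the same error value allows a wider absolute window for heavier ions, whereas an absolute tolerance is constant across the spectrum.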

Pre-Screen Filters

Pre-screen filters are used to remove irrelevant data from raw spectra before screening.

  1. min_ms1_max_intensity:
  • Description: specifies minimum peak intensity in raw MS1 spectra for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Defaults: None (not required).
  1. min_ms2_max_intensity:
  • Description: specifies minimum peak intensity in raw MS2 spectra for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Default: None (not required).
  1. min_ms1_total_intensity:
  • Description: specifies minimum total intensity in raw MS1 spectra for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Default: None (not required).
  1. min_ms2_total_intensity:
  • Description: specifies minimum total intensity in raw MS2 spectra for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Default: None (not required).
  1. min_rt:
  • Description: specifies minimum retention time for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Default: None (not required).
  1. max_rt:
  • Description: specifies maximum retention time for spectra to be included in later screening.
  • Type: float
  • Options: any valid float >= 0.
  • Default: None (not required).
  1. min_ms2_peak_abundance:
  • Description: specifies the minimum relative abundance of the most intense matched fragment in an MS2 spectrum. If the relative intensity of the most intense match is less than min_ms2_peak_abundance, any fragments found in this spectrum will be discarded.
  • Type: float
  • Options: any float in range 0-100.
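
The retention time and intensity thresholds above can be pictured as a simple per-spectrum test. In the sketch below, a spectrum is represented by a hypothetical dictionary with an "rt" value and a list of "intensities"; this is an assumption made for illustration and is not the mzmlripper JSON layout. The same function would be applied with the MS1 or MS2 thresholds depending on the spectrum level, and a threshold of None means the filter is not applied.

```python
from typing import Optional


def passes_pre_screen(spectrum: dict,
                      min_max_intensity: Optional[float] = None,
                      min_total_intensity: Optional[float] = None,
                      min_rt: Optional[float] = None,
                      max_rt: Optional[float] = None) -> bool:
    """Return True if a spectrum survives all specified pre-screen filters."""
    intensities = spectrum["intensities"]
    if min_rt is not None and spectrum["rt"] < min_rt:
        return False
    if max_rt is not None and spectrum["rt"] > max_rt:
        return False
    # most intense signal in the spectrum (0 for an empty spectrum)
    if min_max_intensity is not None and max(intensities, default=0.0) < min_max_intensity:
        return False
    # summed intensity of all signals in the spectrum
    if min_total_intensity is not None and sum(intensities) < min_total_intensity:
        return False
    return True


spectra = [
    {"rt": 0.5, "intensities": [2e4, 5e5]},   # too early
    {"rt": 3.2, "intensities": [1e3, 4e3]},   # too weak
    {"rt": 4.0, "intensities": [3e5, 9e5]},   # kept
]
kept = [s for s in spectra
        if passes_pre_screen(s, min_max_intensity=1e5, min_rt=1.0, max_rt=10.0)]
print(len(kept))  # 1
```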

Postprocess Parameters

Postprocess parameters are used to define properties relevant for assigning sequence confidence based on observed data and how it compares to silico data, and also for any postprocessing such as molecular assembly calculations and spectrum plots.

  1. core_linear_series:
  • Description: core linear fragment types that are to be used to assign confidence for confirmed sequences from MS2 data.
  • Type: List[str]
  • Options: list of any valid fragment types in polymer config file. NOTE: all fragment types in core_linear_series must also be present in silico.ms2.fragment_series. However, not all fragments in silico.ms2.fragment_series have to be in core_linear_series. Fragment types specified in silico but not in core_linear_series will still be screened and recorded, but not used to assign confidence.
  1. exclude_fragments:
  • Description: specifies fragments that are to be excluded from confidence calculations, even if they have been confirmed.
  • Type: List[str]
  • Options: List of any valid fragment ids in core_linear_series (see below).
  • Defaults: None (not required).
  1. optional_core_fragments:
  • Description: core fragments that are to be used in confidence calculations only if confirmed. If absent from extracted data, these fragments will be ignored in confidence assignments and final confidence score will not be lower as a result of their absence. However, if present, they will be used to calculate confidence.
  • Type: List[str]
  • Options: list of any fragment ids from core fragments.
  • Default: Default depends on instrument and polymer class. If no default is specified in instrument config, default value == None (not required).
  1. essential_fragments:
  • Description: fragments that are essential for assigning a sequence's confidence score > 0.
  • Type: List[str]
  • Options: list of any valid fragment ids from core linear fragments.
  • Default: default depends on instrument and polymer config. If not specified in configs, default value == None (not required).
  1. dominant_signature_cap:
  • Description: specifies maximum confidence score for sequences missing one or more expected "dominant" signature ions at MS2.
  • Type: float
  • Options: Any float in range 0-100.
  • Default: Default value varies by instrument and polymer class. If not specified in configs, default value == 0 (not required).
  1. subsequence_weight:
  • Description: weighting value for mean continuous fragment coverage in final confidence score for confirmed sequences.
  • Type: float
  • Options: any float in range 0-1. 0 = confidence based entirely on % confirmed fragments, 1 = confidence based entirely on mean continuous fragment coverage.
  • Default: No default. This must be specified in input parameters file.
  1. rt_bin (deprecated):
  • Description: minimum resolution (in minutes) between peaks in MS1 EICs.
  • Type: float
  • Options: Any valid float between 0 and length of acquisition time. NOTE: this is no longer used. It will be useful when revisiting quantification of individual sequences from EICs.
  1. ms2_rt_bin (deprecated):
  • Description: minimum resolution (in minutes) between peaks in MS2 EICs.
  • Type: float
  • Options: Any valid float between 0 and length of acquisition time. NOTE: this is no longer used. It will be useful when revisiting quantification of individual sequences from EICs.
  1. spectral_assignment_plots:
  • Description: specifies whether to plot annotated MS2 spectra for confirmed sequences.
  • Type: bool
  • Options: true or false. If true, plots will be saved as PNG images in output directory.
  • Default: false (not required).
  1. min_plot_confidence:
  • Description: specifies minimum confidence score for a sequence's spectral assignments to be plotted.
  • Type: float
  • Options: any valid float in range 0-100.
  • Default: 70.
  1. molecular_assembly:
  • Description: specifies parameters for calculating molecular assembly values for confirmed sequences.
  • Type: dict
  • Options: several subparameters (see Molecular Assembly Parameters).
  • Default: several defaults for subparameters.
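
One way to picture subsequence_weight is as a linear blend between the two measures named in its description. The sketch below assumes exactly that: a weighted average of % confirmed fragments and mean continuous fragment coverage, both on a 0-100 scale. The exact formula used by OLIGOSS is not stated here, so the function (a hypothetical name) should be read as an interpretation that agrees with the documented behaviour at weights 0 and 1, not as the package's scoring code.

```python
def weighted_confidence(percent_confirmed: float,
                        mean_continuous_coverage: float,
                        subsequence_weight: float) -> float:
    """Blend two 0-100 scores using subsequence_weight (assumed linear)."""
    if not 0 <= subsequence_weight <= 1:
        raise ValueError("subsequence_weight must be in range 0-1")
    return (subsequence_weight * mean_continuous_coverage
            + (1 - subsequence_weight) * percent_confirmed)


# weight 0: only % confirmed fragments counts
print(weighted_confidence(80.0, 60.0, 0.0))  # 80.0
# weight 1: only mean continuous coverage counts
print(weighted_confidence(80.0, 60.0, 1.0))  # 60.0
# weight 0.5: equal contribution from both
print(weighted_confidence(80.0, 60.0, 0.5))  # 70.0
```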

Molecular Assembly Parameters

Defines properties relevant for calculating Molecular Assembly for confirmed sequences.

  1. min_confidence:
  • Description: specifies minimum confidence score of sequence assignment for an MA score to be calculated for the sequence.
  • Type: float
  • Options: any valid float in range 0-100.
  • Default: 70 (not required).
  1. consensus:
  • Description: specifies whether to calculate MA from consensus spectra or individual spectra.
  • Type: bool
  • Options: true or false.
  • Default: true (not required).
  1. combine_precursors:
  • Description: specifies whether to combine MS2 spectra from multiple unique sequence precursors before calculating MA.
  • Type: bool
  • Options: true or false. NOTE: if true, this can only be done via consensus spectra and therefore consensus must also == true.
  • Default: false.
  1. min_peak_identity:
  • Description: specifies minimum fraction of spectra a peak must be found in to be included in consensus spectra.
  • Type: float
  • Options: any valid float in range 0-1. NOTE: be careful setting this value too high if combine_precursors == true (peak identity is likely to be lower for spectra with different precursors).
  • Default: 0.7 (not required).
  1. ppm_window:
  • Description: specifies window (in ppm, parts per million) for grouping precursors and MS2 product ions in consensus spectra.
  • Type: float
  • Options: any valid float >= 0.
  • Default: 5 (not required).
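
The defaults and the combine_precursors constraint listed above can be summarised in a short sketch. It merges user-supplied values over the documented defaults and rejects combinations that the descriptions rule out. The function name resolve_ma_params is hypothetical, and the checks are an illustration of the documented rules, not the validation performed by OLIGOSS.

```python
from typing import Optional

# Defaults as listed in the Molecular Assembly Parameters above.
MA_DEFAULTS = {
    "min_confidence": 70,
    "consensus": True,
    "combine_precursors": False,
    "min_peak_identity": 0.7,
    "ppm_window": 5,
}


def resolve_ma_params(user_params: Optional[dict] = None) -> dict:
    """Merge user values over the defaults and check documented constraints."""
    params = {**MA_DEFAULTS, **(user_params or {})}
    # combining precursors is only possible via consensus spectra
    if params["combine_precursors"] and not params["consensus"]:
        raise ValueError("combine_precursors requires consensus == true")
    if not 0 <= params["min_peak_identity"] <= 1:
        raise ValueError("min_peak_identity must be in range 0-1")
    if not 0 <= params["min_confidence"] <= 100:
        raise ValueError("min_confidence must be in range 0-100")
    return params


print(resolve_ma_params({"combine_precursors": True, "min_peak_identity": 0.5}))
```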

Polymer Configs

Polymer-specific configuration files define the full scope of possible ionization and fragmentation for an oligomer class. A proper configuration file should contain all information required to generate full possible sequence libraries for target oligomer class, as well as MS1 precursors and MS2 product ions. It should be divided into three sections:

  1. General Polymer Properties:

This defines properties required for generating MS1 precursor libraries.

  1. MS2 Fragmentation Properties:

This defines properties required for generating MS2 product ions.

  1. Modifications:

This defines any covalent modifications and their appropriate targets.

General Polymer Properties

  1. MONOMERS:
  • Description: defines monomer one-letter codes, their associated monoisotopic neutral masses and reactive functional groups.
  • Type: Dict[str, list]
  • Options: N/A
  • Example: {"A": [89.04768, [["amine", 1], ["carboxyl", 1]], "alanine"]} defines the amino acid monomer alanine, with a neutral monoisotopic mass of 89.04768, 1 reactive amine and 1 reactive carboxylic acid.
  1. MASS_DIFF:
  • Description: defines mass lost (or gained) upon addition of a monomer to an elongating chain.
  • Type: float or str
  • Options: any valid float corresponding to neutral monoisotopic mass difference in standard mass units (u) or a functional group string (e.g. "H2O") corresponding to the mass difference.
  1. ELONGATION:
  • Description: defines standard number of monomer additions per elongation event (in monomer units).
  • Type: int
  • Options: any valid integer >= 1.
  1. REACTIVITY_CLASSES:
  • Description: defines cross-reactivity of monomer functional groups.
  • Type: Dict[str, List[list]]
  • Options: keys must correspond to functional groups found in monomer library.
  • Example: The following example defines cross-reactivity of the "amine" functional group for an oligomer class, which in this case can react with either "carboxyl" or "hydroxyl" groups. The "amine" functional group is found in monomers "A", "B" and "C": {"amine": [["carboxyl", "hydroxyl"], ["A", "B", "C"]]}
  1. SYMMETRY:
  • Description: determines whether termini are functionally equivalent (i.e. whether forward sequence == reverse sequence).
  • Type: bool
  • Options: true or false.
  • Example: SYMMETRY = false for peptides as they have distinct C- and N-termini.
  1. LOSS_PRODUCTS:
  • Description: specifies any neutral loss fragmentation events that can occur for specific monomer sidechains. NOTE: it is assumed that these fragmentation events can either be a product of in-source CID at MS1 or standard MS2 CID.
  • Type: Dict[str, list]
  • Options: keys must correspond to valid monomer one-letter codes, values must be lists of either monoisotopic neutral mass losses (float) or functional group strings corresponding to these neutral losses.
  • Example: {"N": ["NH3", "H2O"]} for monomer "N" with possible neutral loss fragmentations corresponding to loss of either an ammonia ("NH3") or water ("H2O") mass.
  1. IONIZABLE_SIDECHAINS:
  • Description: specifies non-backbone ionization sites that occur at specific monomer sidechains.
  • Type: Dict[str, dict]
  • Options: keys must correspond to valid monomer one-letter codes. Values must be dictionaries with keys "pos" and "neg" defining possible ionization events for positive and negative mode, respectively.
  • Example: In the following example, monomer "K" can be ionized via proton addition in positive mode and the monomer "D" can be ionized via proton abstraction in negative mode: {"K": { "pos": ["H", 1, 1], "neg": null }, "D": { "pos": null, "neg": ["-H", 1, 1] }}
  1. INTRINSICALLY_CHARGED_MONOMERS:
  • Description: this defines monomers which have an intrinsic, non-exchangeable charge.
  • Type: Dict[str, int]
  • Options: keys must be valid monomer one-letter codes, with values equivalent to intrinsic charge state due to non-exchangeable ions.
  • Example: for a monomer "Z" with intrinsic charge of -2: {"Z": -2}
  1. SIDE_CHAIN_CROSSLINKS:
  • Description: this defines any monomer-monomer crosslinks that can occur, and their effects on MS1 ionization and MS2 fragmentation.
  • Type: Dict[str, dict]
  • Options: keys must correspond to valid monomer one-letter codes. Key-Value pairs in the subdict are as follows:
    • monomers:
      • Description: defines other monomers that can form sidechain crosslinks with target monomer.
      • Type: List[str]
      • Options: list of valid monomer one-letter codes.
    • crosslink_massdiff:
      • Description: defines mass lost or gained upon crosslinking.
      • Type: float or str
      • Options: either a float corresponding to neutral monoisotopic mass diff or a string representing functional group mass diff.
    • permissible_crosslink_charges:
      • Description: defines permissible charge states for crosslinked moiety at the sidechain(s) of crosslinked monomers.
      • Type: List[int]
      • Options: list of any valid integer corresponding to permissible charge states.
    • disrupt_ms2:
      • Description: specifies whether crosslinking event disrupts standard linear fragmentation along backbone.
      • Type: bool
      • Options: true or false.
  • Example: The following example defines cross-linking events for the monomer "K" which, in its non-crosslinked state can be ionized at its sidechain (see IONIZABLE_SIDECHAINS). It can crosslink with monomers "E" and "D" via sidechain links. However, this type of crosslinking event does not disrupt standard linear MS2 fragmentation pathways. {"K": {"monomers": ["E", "D"], "crosslink_massdiff": "H2O", "permissible_crosslink_charges": [0], "disrupt_ms2": false}}
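
To show how MONOMERS and MASS_DIFF combine, the sketch below computes the neutral monoisotopic mass of a short linear sequence. It assumes ELONGATION == 1, that MASS_DIFF is the mass lost at each of the n - 1 backbone links, and that there are no modifications or crosslinks. The alanine mass is taken from the MONOMERS example above; the glycine entry and the numeric mass of H2O are added here for illustration, and the function name is hypothetical.

```python
# Neutral monoisotopic monomer masses (u); "G" (glycine) is illustrative.
MONOMER_MASSES = {"A": 89.04768, "G": 75.03203}
MASS_DIFF = 18.010565  # H2O lost per backbone link (u)


def neutral_sequence_mass(sequence: str, monomer_masses: dict, mass_diff: float) -> float:
    """Neutral mass of a linear oligomer written in one-letter codes."""
    if not sequence:
        raise ValueError("sequence must contain at least one monomer")
    total = sum(monomer_masses[monomer] for monomer in sequence)
    # one mass_diff is lost for every link formed between adjacent monomers
    return total - (len(sequence) - 1) * mass_diff


print(round(neutral_sequence_mass("AG", MONOMER_MASSES, MASS_DIFF), 5))
```

For a single monomer no link is formed, so the function returns the monomer mass unchanged.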

MS2 Fragmentation Properties

MS2 fragmentation properties are required for defining possible MS2 fragmentation pathways for an oligomer class. This includes both linear fragment series and signature ion fragments.

  1. FRAG_SERIES:
  • Description: dictionary that defines all linear fragmentation pathways. Linear fragmentation pathways are defined as any fragment series indexed stepwise along the oligomer backbone.
  • Type: Dict[str, dict]
  • Options: keys must correspond to fragment series one-letter codes. Properties of individual fragment series are defined in subparameters (see Defining FRAG_SERIES, below).
  • Example: see Defining FRAG_SERIES section.
  1. MS2_SIGNATURE_IONS:
  • Description: this defines any monomer-specific signature ions that may occur. NOTE: the same fragmentation events that produce linear fragment series can also produce signature ions. However, OLIGOSS considers these as separate events due to the diversity of possible signature ions.
  • Type: Dict[str, list]
  • Options: Keys correspond to signature ion str codes; values are lists of monomer one-letter codes and corresponding free signature m/z values.
  • Example: {"Im": ["F", 120.0813], ...} defines "Im" signature fragment for monomer "F" with m/z 120.0813.
  1. MODIFICATIONS:
  • Description: defines any covalent modifications and possible modification sites.
  • Type: Dict[str, dict]
  • Options: Keys must be strings corresponding to modification three-letter codes. Values are subdicts defining modification properties (see Defining MODIFICATIONS, below).

Defining FRAG_SERIES

The FRAG_SERIES dict is used to define properties relevant to linear fragment series (i.e. fragment series that are indexed stepwise along the oligomer backbone).

  1. default_linear:
  • Description: specifies default linear fragmentation pathways to be included in silico libraries depending on the mass spec fragmentation method used to acquire data.
  • Type: Dict[str, List[str]]
  • Options: keys must correspond to valid fragmentation methods defined in Instrument Configs. Values are lists of linear fragment series codes for linear fragment series that are produced via the specified fragmentation method.
  • Example: In the case of "HCD" fragmentation producing "a", "b" and "y" MS2 fragments: {"HCD": ["b", "y", "a"]}.
  • NOTE: there is redundancy with Instrument Configs. Linear fragment series can also be specified for individual oligomer classes in instrument config files. These can also be overwritten directly in input parameters file.
  1. default_core:
  • Description: specifies default core linear fragment series (i.e. linear series used in confidence assignments) depending on the mass spec fragmentation method used to acquire data.
  • Type: Dict[str, List[str]]
  • Options: keys must correspond to valid fragmentation methods defined in Instrument Configs. Values are lists of linear fragment series codes for linear fragment series that are produced via the specified fragmentation method and are required for assigning confidence scores.
  • Example: In the case of "HCD" fragmentation producing core fragment series "b" and "y": {"HCD": ["b", "y"]}.
  • NOTE: there is redundancy with Instrument Configs. Linear fragment series can also be specified for individual oligomer classes in instrument config files. These can also be overwritten directly in the input parameters file.
  1. terminus:
  • Description: specifies "home" terminus from which linear fragment series is indexed.
  • Type: int
  • Options: either 0 or -1 for fragment series indexed from terminus 0 and -1, respectively.
  • NOTE: for oligomer classes with symmetry == False, terminus is irrelevant.
  1. mass_diff:
  • Description: specifies neutral mass difference between a fragment and its corresponding intact neutral sequence slice.
  • Type: str or float
  • Options: a valid float giving the mass difference in mass units (u), or a functional group string whose neutral monoisotopic mass gives the mass difference (e.g. "OH", "H2O").
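The two accepted forms of mass_diff can be illustrated with a minimal sketch. The helper name resolve_mass_diff and the small element table are assumptions made for this example and are not part of OLIGOSS, which may look functional group strings up in its own chemical constants instead of parsing them as formulae.

```python
import re

# Monoisotopic masses (u) of a few common elements; extend as needed.
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146}


def resolve_mass_diff(mass_diff):
    """Return mass_diff as a float (hypothetical helper, not OLIGOSS code)."""
    if isinstance(mass_diff, (int, float)):
        return float(mass_diff)
    total = 0.0
    # Each token is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", mass_diff):
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


print(round(resolve_mass_diff("H2O"), 4))  # 18.0106
print(round(resolve_mass_diff("OH"), 4))   # 17.0027
print(resolve_mass_diff(26.98709))         # 26.98709
```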
  1. fragmentation_unit:
  • Description: specifies increment of fragment indices when producing linear fragment series.
  • Type: Dict[str, int or str]
  • Options: keys must be "pos" and "neg", defining the fragmentation unit in positive and negative mode, respectively. Values must be either ints >= 0 or strings representing an int value (most commonly "ELONGATION_UNIT" if fragmentation_unit == ELONGATION_UNIT).
  • Example: fragmentation_unit will equal 1 or ELONGATION_UNIT for the majority of oligomer classes. A possible exception to this would be alternating copolymers with alternating backbone links.
  1. start:
  • Description: start position of fragment series relative to terminus.
  • Type: int
  • Options: any valid integer >= 0.
  • Example: 0 for a fragment series that begins immediately at terminus, 1, 2, 3 for fragment series that begins 1, 2 or 3 indices away from terminus.
  1. end:
  • Description: end position of fragment series relative to the other terminus (i.e. terminus 0 and -1 for terminus == -1 and terminus == 0, respectively).
  • Type: int
  • Options: any valid integer >= 0.
  • Example: 0 for a fragment series that terminates at final index on backbone, 1, 2, 3 for fragment series that terminates 1, 2 or 3 indexes away from final index on backbone.
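One possible reading of how fragmentation_unit, start and end combine is sketched below. The function fragment_indices is hypothetical and not taken from OLIGOSS. It assumes indices are counted away from the home terminus, that fragmentation_unit has already been resolved to an int for the current mode, and that the step is positive.

```python
def fragment_indices(seq_len, start=0, end=0, fragmentation_unit=1):
    """List backbone indices covered by a linear fragment series.

    Indices count away from the series' home terminus. This is an
    illustrative interpretation of start / end / fragmentation_unit.
    """
    if fragmentation_unit < 1:
        raise ValueError("this sketch needs a positive step along the backbone")
    return list(range(start, seq_len - end, fragmentation_unit))


# Six backbone indices, series covering every index.
print(fragment_indices(6))                        # [0, 1, 2, 3, 4, 5]
# Series that begins and terminates one index away from each terminus.
print(fragment_indices(6, start=1, end=1))        # [1, 2, 3, 4]
# Series that only fragments at every second backbone link.
print(fragment_indices(6, fragmentation_unit=2))  # [0, 2, 4]
```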
  1. intrinsic_charge:
  • Description: defines any non-exchangeable ions associated with fragments of a particular series.
  • Type: Dict[str, int]
  • Options: keys must be "pos" and "neg" for positive and negative mode, respectively. Values must be integers representing intrinsic charge value.
  • Example: {"pos": 1, "neg": null} for a fragment series with intrinsic charge of 1 in positive mode but no intrinsic charge in negative mode.
  • NOTE: do not confuse this with intrinsic_adduct. By definition MS2 fragment series are charged by default. However, this can be a result of either non-exchangeable or exchangeable ions. intrinsic_charge defines charge state due to non-exchangeable ions.
  • NOTE: NOT_REQUIRED. This property does not need to be defined if intrinsic_adduct is defined. However, at least one of these properties must be defined to account for fragment charge.
  1. intrinsic_adduct:
  • Description: defines any exchangeable ions associated with fragments of a particular series.
  • Type: Dict[str, str]
  • Options: keys must be "pos" and "neg" for positive and negative mode, respectively. Values must correspond to strings representing adducts stored in Global Chemical Constants.
  • Example: {"pos": "H", "neg": "-H"} for a fragment series that is intrinsically protonated in positive mode but deprotonated in negative mode.
  • NOTE: this property should only be used to define exchangeable ions (i.e. ions that can be swapped for extrinsic ions in sample matrix). Do not confuse with non-exchangeable ions, which are defined in intrinsic_charge.
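The rule that at least one of intrinsic_charge and intrinsic_adduct must be defined can be checked before running a workflow. The sketch below is a hypothetical validator, not part of OLIGOSS, and assumes a fragment series is held as a plain dict.

```python
CHARGE_PROPS = ("intrinsic_charge", "intrinsic_adduct")


def defined_charge_props(series):
    """Return the charge properties a series defines, or raise if none.

    Hypothetical check: each defined property must use exactly the keys
    "pos" and "neg".
    """
    found = [prop for prop in CHARGE_PROPS if series.get(prop) is not None]
    if not found:
        raise ValueError("define intrinsic_charge and/or intrinsic_adduct")
    for prop in found:
        if set(series[prop]) != {"pos", "neg"}:
            raise ValueError(f'{prop} keys must be "pos" and "neg"')
    return found


series = {"intrinsic_adduct": {"pos": "H", "neg": "-H"}}
print(defined_charge_props(series))  # ['intrinsic_adduct']
```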
  1. exceptions:
  • Description: for oligomer classes with mixed backbones (i.e. more than one backbone bond type that can be fragmented at MS2), fragmentation properties may differ depending on what type of bond is being fragmented at a particular index.
  • Type: Dict[str, dict]
  • Options: keys must be "pos" and "neg" to define exceptions to standard fragmentation rules in positive and negative mode, respectively. Values define exceptions to any combination of previously described MS2 fragmentation properties for linear fragment series.
  • Format: {mode (str): {func_group: {prop: {"positions": List[int], "start": int, "end": int, "exception_value": Value}}}}
    • mode == either "pos" or "neg" for positive or negative mode.
    • func_group == functional group that causes exception to standard fragmentation pathway.
    • prop == the property for which the exception may apply.
    • positions: defines list of indexes in subsequence at which exception applies. Some fragmentation exceptions only apply when the non-standard backbone link is in a particular position in the fragment subsequence.
    • start: defines start position at which exception applies, relative to home terminus.
    • end: specifies the number of indices away from the end terminus at which the exception no longer applies.
    • exception_value: the substituted value to use for the property if exception applies.
  • Example: The following example is for a fragment series with an exception to mass_diff in cases where a bond involving a "hydroxyA"-containing monomer is being fragmented. The exception applies when the "hydroxyA"-containing monomer occurs at the final index of the subsequence. The exception applies from the very first index of the fragment series but ends one index away from the end terminus: {"pos": {"hydroxyA": {"mass_diff": {"positions": [-1], "start": 0, "end": 1, "exception_value": 26.98709}}}}
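The exception format can be read as a lookup that substitutes exception_value when all conditions hold. The sketch below is an interpretation, not OLIGOSS code. The function property_value, the representation of a subsequence as one set of functional group names per monomer, and the default mass_diff of "OH" are all assumptions made for illustration.

```python
def property_value(series, prop, mode, subsequence_groups, frag_index, seq_len):
    """Return prop for one fragment, using an exception value if one applies.

    subsequence_groups: one set of functional group names per monomer in
        the fragment's subsequence (hypothetical representation).
    frag_index: index of the fragment, counted from the home terminus.
    seq_len: number of backbone indices in the full sequence.
    """
    mode_exceptions = (series.get("exceptions") or {}).get(mode) or {}
    n = len(subsequence_groups)
    for func_group, props in mode_exceptions.items():
        rule = props.get(prop)
        if rule is None:
            continue
        # Window in which the exception is active along the series.
        in_window = rule["start"] <= frag_index < seq_len - rule["end"]
        # The functional group must sit at one of the listed positions.
        at_position = any(
            func_group in subsequence_groups[pos]
            for pos in rule["positions"]
            if -n <= pos < n
        )
        if in_window and at_position:
            return rule["exception_value"]
    return series[prop]


series = {
    "mass_diff": "OH",  # placeholder default for illustration
    "exceptions": {"pos": {"hydroxyA": {"mass_diff": {
        "positions": [-1], "start": 0, "end": 1, "exception_value": 26.98709}}}},
}
hit = [set(), {"hydroxyA"}]   # hydroxyA monomer at final subsequence index
miss = [{"hydroxyA"}, set()]  # hydroxyA monomer elsewhere
print(property_value(series, "mass_diff", "pos", hit, 1, 5))   # 26.98709
print(property_value(series, "mass_diff", "pos", miss, 1, 5))  # OH
print(property_value(series, "mass_diff", "pos", hit, 4, 5))   # OH
```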

Instrument Configs

Instrument configuration files are used to store information on mass spectrometers used routinely for experiments. These can define resolution (in terms of error tolerance for matching peaks), sensitivity (in terms of minimum intensity thresholds for detection), and fragmentation methods.

  1. error:
  • Description: this defines the default error threshold for an instrument when matching peaks.
  • Type: float
  • Options: any valid float >= 0. This can correspond to relative error threshold (parts per million, ppm) or absolute error threshold (mass units, u).
  1. error_units:
  • Description: specifies units of default error.
  • Type: str
  • Options: either "ppm" or "abs" for relative and absolute error thresholding, respectively.
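The difference between relative and absolute thresholds can be shown with a short sketch. The function within_error is a hypothetical helper, not the OLIGOSS implementation, and it assumes ppm error is measured relative to the target m/z.

```python
def within_error(observed_mz, target_mz, error, error_units):
    """Return True if observed_mz matches target_mz within the threshold."""
    if error_units == "ppm":
        # Relative threshold: tolerance scales with the target m/z.
        tolerance = target_mz * error / 1e6
    elif error_units == "abs":
        # Absolute threshold in mass units (u).
        tolerance = error
    else:
        raise ValueError('error_units must be "ppm" or "abs"')
    return abs(observed_mz - target_mz) <= tolerance


print(within_error(120.0815, 120.0813, 5, "ppm"))   # True
print(within_error(120.0825, 120.0813, 5, "ppm"))   # False
print(within_error(120.09, 120.0813, 0.01, "abs"))  # True
```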
  1. rt_units:
  • Description: specifies the default retention time units in mzML files. This is a vendor-specific property outside the control of OLIGOSS (e.g. mzML files generated from Bruker mass specs have retention time units of seconds, while ThermoScientific mass spec units are in minutes).
  • Type: str
  • Options: either "min" or "sec" for minutes or seconds, respectively.
  • NOTE: retention time units are usually not specified in the raw mzML itself. If you are unsure which units your mzML files use, note that spectra in output rippers are sorted by retention time, so it should be straightforward to work out rt_units for your mass spec from the recorded retention times of the first and last spectra (assuming you know the total acquisition time).
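The check described in the note can be sketched as follows. Both functions are hypothetical helpers, not part of OLIGOSS or mzmlripper. They assume you have already read the retention time of the last spectrum from a ripper file and know the total acquisition time in minutes.

```python
def to_minutes(retention_time, rt_units):
    """Convert a retention time to minutes."""
    if rt_units == "min":
        return float(retention_time)
    if rt_units == "sec":
        return retention_time / 60.0
    raise ValueError('rt_units must be "min" or "sec"')


def guess_rt_units(last_retention_time, acquisition_minutes):
    """Pick the unit under which the final retention time best matches
    the known acquisition length."""
    as_min_error = abs(last_retention_time - acquisition_minutes)
    as_sec_error = abs(last_retention_time / 60.0 - acquisition_minutes)
    return "min" if as_min_error <= as_sec_error else "sec"


print(guess_rt_units(1795.2, 30))  # sec
print(guess_rt_units(29.9, 30))    # min
print(to_minutes(90, "sec"))       # 1.5
```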
  1. min_ms1_max_intensity:
  • Description: specifies the default minimum peak intensity an MS1 EIC must reach to be accepted as valid.
  • Type: float
  • Options: any valid float >= 0.
  1. min_ms2_max_intensity:
  • Description: specifies the default minimum peak intensity an MS2 EIC must reach to be accepted as valid.
  • Type: float
  • Options: any valid float >= 0.
  1. fragmentation:
  • Description: specifies fragmentation methods available at every stage of tandem mass spectrometry for an instrument.
  • Type: Dict[str, List[str] or str]
  • Options: keys must include "ms1", "ms2" and (optionally) "msn" for defining fragmentation methods at MS1, MS2 and MS3+ levels respectively.
  • Example: for a mass spec with "neutral" fragmentation (i.e. is-CID) at MS1, and "HCD" and "CID" at MS2-n: {"ms1": "neutral", "ms2": ["HCD", "CID", "neutral"], "msn": ["HCD", "CID", "neutral"]}.
  1. pre_screen_filters:
  • Description: specifies default intensity thresholds for pre-filtering spectra before screening.
  • Type: Dict[str, float]
  • Options: keys:
    • min_ms1_max_intensity: specifies minimum intensity of base peak for MS1 spectra to be included in screening.
    • min_ms2_max_intensity: specifies minimum intensity of dominant ion for MS2 spectra to be included in screening.
  1. polymer_classes:
  • Description: defines default ionization, fragmentation and some postprocessing parameters for individual oligomer classes when using the instrument.
  • Type: Dict[str, dict]
  • Options: keys must be valid polymer config aliases. Values are subdicts defining default parameters for silico_ms1, silico_ms2, extractors and postprocessing.
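To show how these properties fit together, the sketch below builds an instrument config as a Python dict and runs a few basic checks. All numeric values are placeholders and do not describe a real instrument. The flat layout, the empty polymer_classes dict and the check_instrument_config helper are assumptions for illustration, not the OLIGOSS schema or API.

```python
import json

# Placeholder values for illustration only.
instrument_config = {
    "error": 5.0,
    "error_units": "ppm",
    "rt_units": "min",
    "min_ms1_max_intensity": 1000.0,
    "min_ms2_max_intensity": 100.0,
    "fragmentation": {
        "ms1": "neutral",
        "ms2": ["HCD", "CID", "neutral"],
        "msn": ["HCD", "CID", "neutral"],
    },
    "pre_screen_filters": {
        "min_ms1_max_intensity": 1000.0,
        "min_ms2_max_intensity": 100.0,
    },
    "polymer_classes": {},  # per-class defaults omitted in this sketch
}


def check_instrument_config(config):
    """Return a list of problems found (empty if none). Hypothetical checks."""
    problems = []
    error = config.get("error")
    if not isinstance(error, (int, float)) or error < 0:
        problems.append("error must be a float >= 0")
    if config.get("error_units") not in ("ppm", "abs"):
        problems.append('error_units must be "ppm" or "abs"')
    if config.get("rt_units") not in ("min", "sec"):
        problems.append('rt_units must be "min" or "sec"')
    fragmentation = config.get("fragmentation") or {}
    for level in ("ms1", "ms2"):
        if level not in fragmentation:
            problems.append(f"fragmentation is missing {level!r}")
    return problems


print(check_instrument_config(instrument_config))  # []
# Serialise to JSON text, e.g. for saving as a config file.
config_json = json.dumps(instrument_config, indent=2)
```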
