A data analysis package for PI-ICR Mass Spectrometry

piicrgmms

A package for implementing Gaussian mixture models as a data analysis tool in PI-ICR mass spectrometry experiments. piicrgmms was first developed in the Fall of 2020 to be used in PI-ICR experiments at the Canadian Penning Trap (CPT) mass spectrometer at Argonne National Laboratory (Lemont, IL, U.S.). It was originally published as GMMClusteringAlgorithms, but was repackaged as piicrgmms in preparation for the publication of an upcoming journal article about its use.

At piicrgmms' core is a modified version of the 'mixture' module from the package scikit-learn. The modified version, sklearn_mixture_piicr, retains all the same components as the original version. In addition, it contains two classes with restricted fitting algorithms: a Gaussian mixture model fit using the expectation-maximization algorithm where the phase dimension of the component means is not a parameter, and a Bayesian Gaussian Mixture fit where the number of components is not a parameter.
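To make the idea of a restricted fit concrete, here is a minimal one-dimensional sketch of expectation-maximization in which the component means are held fixed and only the weights and widths are updated. This is not the code in sklearn_mixture_piicr, where only the phase dimension of the means is fixed. The function names (normal_pdf, em_fixed_means) and the synthetic data are assumptions made purely for illustration.

```python
import math
import random


def normal_pdf(x, mean, sigma):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_fixed_means(data, means, n_iter=50):
    """EM for a 1D Gaussian mixture whose means are NOT fit parameters.

    Hypothetical helper: only the weights and widths are updated.
    """
    k = len(means)
    weights = [1.0 / k] * k
    sigmas = [1.0] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
            total = sum(p)
            resp.append([value / total for value in p])
        # M-step: update weights and widths; the means stay untouched.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-12))
    return weights, sigmas


# Synthetic data: two well-separated groups centered at 0 and 5.
random.seed(0)
data = ([random.gauss(0.0, 0.5) for _ in range(300)]
        + [random.gauss(5.0, 0.3) for _ in range(100)])
weights, sigmas = em_fixed_means(data, means=[0.0, 5.0])
print(weights, sigmas)
```

Because the means never move, the fit has fewer free parameters than an unrestricted mixture, which is the general motivation for fixing a dimension that is already known from an earlier step.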

The rest of the package facilitates quick, intuitive use of the Gaussian mixture model algorithms through four classes, which also provide visualization methods for displaying results and debugging.

1. DataFrame

  • This class processes the raw data from the position-sensitive microchannel plate (PS-MCP) detector. It currently works only with List Mode (.lmf) files, the file type used by CoboldPC, the RoentDek software used at the CPT to record PS-MCP data. As attributes, it holds the processed data in both array and pandas DataFrame form, as well as any data cuts.

2. GaussianMixtureModel

  • This class fits Gaussian mixture models to the DataFrame object. As parameters, it takes:
    1. Cartesian/Polar coordinates
    2. Number of components to use
    3. Covariance matrix type
    4. Information criterion
  • Allows for 'strict' fits, i.e. fits in which the model is forced to use exactly the specified number of components.
  • Includes a progress bar when clustering data.

3. BayesianGaussianMixture

  • Identical to the GaussianMixtureModel class, but built on the BayesianGaussianMixture class from scikit-learn instead of the GaussianMixture class.
  • No progress bar.

4. PhaseFirstGaussianModel

  • Implements a fit in which the phase dimension is fit first, followed by a GMM fit to both spatial dimensions in which the phase dimension of the component means is held fixed. This type of fit was found to work especially well with data sets containing many species.
  • Only works with Polar coordinates.
  • Progress bar included.
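The 'information criterion' and 'covariance matrix type' parameters listed above interact: the covariance type sets how many free parameters a model has, and an information criterion penalizes that count. The sketch below uses the standard BIC and AIC formulas and scikit-learn's four covariance type names. Which criteria piicrgmms actually offers is not stated here, and the function names and log-likelihood values are hypothetical.

```python
import math


def gmm_free_parameters(n_components, n_dims, covariance_type="full"):
    """Count the free parameters of a Gaussian mixture model."""
    mean_params = n_components * n_dims
    weight_params = n_components - 1  # weights are constrained to sum to 1
    if covariance_type == "full":
        cov_params = n_components * n_dims * (n_dims + 1) // 2
    elif covariance_type == "tied":
        cov_params = n_dims * (n_dims + 1) // 2
    elif covariance_type == "diag":
        cov_params = n_components * n_dims
    elif covariance_type == "spherical":
        cov_params = n_components
    else:
        raise ValueError("unknown covariance_type: %r" % covariance_type)
    return mean_params + weight_params + cov_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Made-up total log-likelihoods for 2-, 3-, and 4-component fits
# to 500 two-dimensional samples (illustrative numbers only).
n_samples = 500
hypothetical_log_likelihoods = {2: -1250.0, 3: -1100.0, 4: -1095.0}
scores = {
    k: bic(ll, gmm_free_parameters(k, 2, "full"), n_samples)
    for k, ll in hypothetical_log_likelihoods.items()
}
best_n = min(scores, key=scores.get)
print(best_n, scores)
```

With these illustrative numbers, the fourth component improves the likelihood too little to justify its extra parameters, so the three-component model has the lowest BIC.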

Each model class also includes the ability to visualize results in several ways (clustering results, one-dimensional histograms, probability density functions) and the ability to copy fit results to the clipboard for pasting into an Excel spreadsheet.

Examples

DataFrame

As this package is designed to be used in PI-ICR experiments, and many such experiments already rely on RoentDek technology, it is assumed that data has already been collected and stored in .lmf files. The first step is to read and process the files, which can be done with the following code:

import piicrgmms.classes as pgc

# Raw string, so backslashes in the Windows path are not read as escapes such as \t
file = r'C:\here\is\the\path\to\the_file.lmf'
df = pgc.DataFrame(file)
df.process_lmf()

After processing the .lmf file, the object 'df' will have additional attributes. One of these is 'data_array_', which is a numpy array containing the locations of the ion hits on the detector. It has shape (n_samples, 4), and the four columns correspond to the x-, y-, radius, and phase dimensions, in that order.
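As an illustration of how the four columns relate to one another, the sketch below derives a radius and phase from Cartesian hit positions measured relative to a trap center. The helper to_polar is hypothetical and is not part of piicrgmms. The phase convention (zero along +x, range (-pi, pi]) and the 'deg' option are assumptions, since the text above only shows phase_units='rad'.

```python
import math


def to_polar(x, y, center=(0.0, 0.0), phase_units="deg"):
    """Return (radius, phase) of a hit relative to the trap center."""
    dx = x - center[0]
    dy = y - center[1]
    radius = math.hypot(dx, dy)
    phase = math.atan2(dy, dx)  # radians, measured from the +x axis
    if phase_units == "deg":
        phase = math.degrees(phase)
    return radius, phase


# Three example hits, with the trap center at (1, 1).
hits = [(2.0, 1.0), (1.0, 3.0), (0.0, 1.0)]
rows = [
    (x, y, *to_polar(x, y, center=(1.0, 1.0), phase_units="rad"))
    for x, y in hits
]
for row in rows:
    print(row)  # (x, y, radius, phase), the same column order as data_array_
```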

Other options, such as the trap center and data cuts, can be passed to the DataFrame initializer as keyword arguments. For example, to move the trap center to (1, 1) in Cartesian coordinates, give the trap center location an uncertainty of 0.02, restrict the data set to shots on the PS-MCP with at least 1 ion but fewer than 5, and output the phase dimension in radians:

center = (1, 1)
center_uncertainty = (0.02, 0.02)
ion_cut = (0, 5)
# Note that the ion_cut values are *not* inclusive.
df = pgc.DataFrame(file, center=center, center_unc=center_uncertainty,
                   ion_cut=ion_cut, phase_units='rad')
df.process_lmf()

Alternatively, the line StartTime, EndTime = df.process_lmf() processes the .lmf file and outputs the times that data recording began and ended.
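Since the ion_cut bounds are exclusive, ion_cut = (0, 5) keeps shots containing 1 to 4 ions. The sketch below shows that selection logic on a toy list of shots. The helper apply_ion_cut and the list-of-hits representation are assumptions for illustration, not the package's internals.

```python
def apply_ion_cut(shots, ion_cut):
    """Keep shots whose ion count lies strictly between the two bounds."""
    low, high = ion_cut
    return [shot for shot in shots if low < len(shot) < high]


# Each shot is a list of (x, y) hits recorded on the detector.
shots = [
    [],                          # 0 ions: rejected
    [(0.1, 0.2)],                # 1 ion: kept
    [(0.1, 0.2), (1.0, -0.5)],   # 2 ions: kept
    [(0.0, 0.0)] * 4,            # 4 ions: kept
    [(0.0, 0.0)] * 5,            # 5 ions: rejected (upper bound is exclusive)
    [(0.0, 0.0)] * 7,            # 7 ions: rejected
]
kept = apply_ion_cut(shots, ion_cut=(0, 5))
print([len(shot) for shot in kept])
```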

Following processing, the DataFrame object can export the data in more meaningful formats. The following block of code returns the 'data_array_' in the form of a pandas.ExcelWriter object, returns and shows a 2D histogram of the locations of the ion hits on the detector, and saves both objects:

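import matplotlib.pyplot as plt

spreadsheet = df.return_processed_data_excel()
# pandas >= 2.0 removed ExcelWriter.save(); use spreadsheet.close() there
spreadsheet.save()

fig, save_string = df.get_data_figure()
plt.savefig(save_string)
plt.show()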

GaussianMixtureModel / BayesianGaussianMixture / PhaseFirstGaussianModel

Once the data has been processed with df.process_lmf(), it is ready to be clustered. For example, to cluster it with an eight-component Gaussian mixture model in Cartesian coordinates:

model = pgc.GaussianMixtureModel(n_components=8, coordinates='Cartesian')
model.cluster_data(data_frame_object=df)

Other clustering algorithms are used by changing the model that is initialized to either BayesianGaussianMixture() or PhaseFirstGaussianModel(). The function cluster_data gives the model object several important attributes, such as:

  • model.centers_array_, which is a numpy array of shape (n_components, 9) containing the locations of the cluster centers. Each row corresponds to one cluster, and from left to right the values in each row give the x-location, x uncertainty, y-location, y uncertainty, radius location, radius uncertainty, phase location, phase uncertainty, and overall cluster uncertainty (x unc. and y unc. added in quadrature) of a particular cluster.
  • model.ips_, which is an array-like object of shape (n_components,) where each value is the number of ions in that cluster. The order of clusters is the same in every attribute, so the 0th row of model.centers_array_ corresponds to the 0th entry in model.ips_, and so on.
  • model.weights_, model.means_, model.covariances_ are the fit parameters of the Gaussian mixture model.
  • model.labels_, which holds the cluster assignment of each ion in the data set.
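Two of the quantities above are simple to reproduce by hand: the overall cluster uncertainty is the x and y uncertainties added in quadrature, and the ions per cluster can be counted from the cluster assignments. The sketch below shows both with hypothetical helper names and toy values. Whether piicrgmms computes model.ips_ in exactly this way is an assumption.

```python
import math
from collections import Counter


def cluster_uncertainty(x_unc, y_unc):
    """Add the x and y uncertainties in quadrature."""
    return math.hypot(x_unc, y_unc)


def ions_per_cluster(labels, n_components):
    """Count how many ions were assigned to each cluster index."""
    counts = Counter(labels)
    return [counts.get(k, 0) for k in range(n_components)]


# Toy cluster assignments for six ions and three clusters.
labels = [0, 1, 1, 2, 1, 0]
ips = ions_per_cluster(labels, n_components=3)
total_unc = cluster_uncertainty(0.03, 0.04)  # approximately 0.05
print(ips, total_unc)
```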

Other attributes from the fit are listed in the documentation.

The centers of the clusters don't have to be taken directly from the fit results. Instead, the centers of the clusters can be recalculated by running model.recalculate_centers_uncertainties(data_frame_object=df, indices=None), where the argument indices is either None if all centers are to be recalculated, or a list giving the cluster indices to change. This function works by taking the cluster assignments obtained from the model.cluster_data() command and fitting two one-dimensional Gaussians to each cluster in order to find the cluster centers and uncertainties. All attributes are then updated accordingly.
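The idea of recalculating a center from the cluster assignments can be sketched as follows. Note the swapped-in technique: instead of a curve fit, this uses the closed-form maximum-likelihood estimate of a one-dimensional Gaussian (sample mean, with the standard error of the mean as the uncertainty) along each axis. How piicrgmms performs its fits is not described here, and the helper names are hypothetical.

```python
import math
import statistics


def gaussian_center_1d(values):
    """Return (center, uncertainty) of a 1D Gaussian estimated from samples."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values, mu=mean)
    return mean, sigma / math.sqrt(len(values))


def recalc_centers(points, labels):
    """Estimate each cluster center from two independent 1D Gaussians."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centers = {}
    for label, members in clusters.items():
        xs = [p[0] for p in members]
        ys = [p[1] for p in members]
        # One Gaussian per axis: ((x, x_unc), (y, y_unc)).
        centers[label] = (gaussian_center_1d(xs), gaussian_center_1d(ys))
    return centers


# Toy ion hits with their cluster assignments.
points = [(0.9, 2.0), (1.1, 2.2), (1.0, 1.8), (4.0, -1.0), (4.2, -1.2)]
labels = [0, 0, 0, 1, 1]
centers = recalc_centers(points, labels)
for label, ((x, x_unc), (y, y_unc)) in sorted(centers.items()):
    print(label, x, x_unc, y, y_unc)
```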

Other useful functions associated with the model classes that can be called following clustering are:

  • model.cluster_data_strict(), which is the exact same as model.cluster_data() except it forces the model to have the number of components given by the argument n_components.

  • fig, save_string = model.get_results_fig(data_frame_object=df), which returns a plot of the ion hits showing their cluster assignments, the cluster centers, and cluster center uncertainties. It also returns a suggested string to use when saving the figure.

  • fig, save_string = model.cluster_merger(data_frame_object=df), which activates a GUI that allows for clusters to be merged together in the event that the Gaussian mixture model fails to cluster the data set reasonably. After running, follow the prompts on the command line to go through this process.

    • WARNING: This function should not be used under typical circumstances, as it overrides the mathematically-supported results from the Gaussian mixture models. While the models aren't perfect, their results should not be thrown away lightly in the event that they don't agree with a preferred clustering outcome.

Altogether, a typical block of code that reads the .lmf file, clusters the data, and displays and saves an image looks something like this:

import matplotlib.pyplot as plt
import piicrgmms.classes as pgc

# Set constants
xC = 1
yC = 1
xC_unc = 0.02
yC_unc = 0.02

center = (xC, yC)
center_unc = (xC_unc, yC_unc)

ion_cut = (0, 5)

file = r'C:\here\is\the\path\to\the_file.lmf'  # raw string keeps the backslashes literal

df = pgc.DataFrame(file=file, center=center, center_unc=center_unc, 
                   ion_cut=ion_cut, phase_units='rad')
StartTime, EndTime = df.process_lmf()

model = pgc.GaussianMixtureModel(n_components=8)
model.cluster_data(data_frame_object=df)
fig, save_string = model.get_results_fig(data_frame_object=df)
plt.savefig(save_string, bbox_inches='tight')
plt.show()

Installation

Dependencies

piicrgmms requires:

  • Python (>=3.6)
  • scikit-learn (>=0.23.2)
  • pandas (>=1.2.0)
  • matplotlib (>=3.3.0)
  • lmfit (>=1.0.0)
  • joblib (>=1.0.0)
  • tqdm (>=4.56.0)
  • pillow (>=8.1.0)
  • webcolors (>=1.11.1)

User Installation

Assuming Python and pip have already been installed, decide whether you want a system-wide or local installation, and which Python distribution (e.g. Anaconda) you want to install under. Then, open the Command Prompt (for regular Python distribution) or the Prompt for another distribution (e.g. Anaconda Prompt for Anaconda), and run either:

  • pip install piicrgmms for a system-wide installation (works for regular Python distributions only), OR
  • pip install --user piicrgmms for a local (per-user) installation. Note that pip's -U flag means --upgrade and does not select a local installation.

If you want to install in a virtual environment instead, then navigate to the virtual environment's directory, activate the virtual environment, and install with the commands above.

Source code

You can check out the latest source code with the command
git clone https://github.com/colinweber27/piicrgmms

