Semi-supervised Dimensionality Reduction for Multi-Class, Multi-Label Data

MCML

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use on single-cell datasets, though the method accepts any matrix as input.

The MCML modules provide the MCML and bMCML algorithms for dimensionality reduction. Both are autoencoder-based neural networks whose weights are optimized with label-aware cost functions. The MCML tools provide functions for quantitative analysis of inter- and intra-label distances between labeled groups, and nearest-neighbor metrics in the latent or ambient space.

Briefly, MCML adapts the Neighborhood Components Analysis (NCA) algorithm to use multiple classes of labels for each observation (cell), embedding observations that share labels close to each other. This essentially optimizes the latent space for k-Nearest Neighbors (KNN) classification.
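The idea behind an NCA-style objective can be illustrated with a small NumPy sketch. This is not the package's implementation: the function names `nca_score` and `multi_label_nca_score` are hypothetical, the score is evaluated on a fixed embedding rather than optimized through a network, and weighting the label classes equally is an assumption. Each point picks a neighbor with probability given by a softmax over negative squared distances, and the score is the expected probability of picking a neighbor with the same label. The dense pairwise computation is only suitable for small arrays.

```python
import numpy as np

def nca_score(latent, labels):
    """Mean probability that a point picks a same-label neighbor (higher is better)."""
    latent = np.asarray(latent, dtype=float)
    labels = np.asarray(labels)
    # Squared Euclidean distances between all pairs of observations
    sq = np.sum((latent[:, None, :] - latent[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)  # a point never selects itself
    # Softmax over negative distances gives neighbor-selection probabilities
    logits = -sq
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    return float(np.mean(np.sum(p * same, axis=1)))

def multi_label_nca_score(latent, label_sets):
    """Average the score over several label classes (e.g. cell type and sex)."""
    return float(np.mean([nca_score(latent, labels) for labels in label_sets]))
```

An embedding in which cells of the same type sit in tight, well-separated clusters scores close to 1 for the cell type labels. A label class that is mixed within those clusters scores lower.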

bMCML demonstrates targeted reconstruction error, which optimizes for recapitulation of intra-label distances (the pairwise distances between cells within the same label).
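To make "recapitulation of intra-label distances" concrete, the sketch below compares pairwise distances within each label in the ambient space against the same distances in a latent space. It is an illustration only: `intra_label_distance_error` is a hypothetical helper, and the actual bMCML loss may scale, normalize, or weight these distances differently. The dense pairwise computation is only suitable for small arrays.

```python
import numpy as np

def pairwise_dists(x):
    """Euclidean distances between all rows of x."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def intra_label_distance_error(ambient, latent, labels):
    """Mean squared gap between ambient and latent distances, within labels only."""
    ambient = np.asarray(ambient, dtype=float)
    latent = np.asarray(latent, dtype=float)
    labels = np.asarray(labels)
    errors = []
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        if idx.size < 2:
            continue  # a single cell has no pairwise distances
        d_amb = pairwise_dists(ambient[idx])
        d_lat = pairwise_dists(latent[idx])
        iu = np.triu_indices(idx.size, k=1)  # count each pair once
        errors.append((d_amb[iu] - d_lat[iu]) ** 2)
    return float(np.mean(np.concatenate(errors))) if errors else 0.0
```

An error of zero means every within-label distance is preserved exactly. Distances between cells with different labels do not enter the error at all, which is what makes the reconstruction "targeted".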

The tools include functions for inter- and intra-label distance calculations, as well as metrics on the labels of the k nearest neighbors of each observation. These can be computed on any latent or ambient space (matrix) input.
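As a rough illustration of a nearest-neighbor label metric, the sketch below computes, for each observation, the fraction of its k nearest neighbors that share its label. The helper `same_label_neighbor_fraction` is hypothetical and uses Euclidean distance as an assumption. It is not the implementation behind the functions in tools, and the dense distance matrix is only suitable for small inputs.

```python
import numpy as np

def same_label_neighbor_fraction(x, labels, k=3):
    """For each row of x, the fraction of its k nearest neighbors sharing its label."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Dense Euclidean distance matrix between all pairs of rows
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(d, np.inf)          # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    return np.mean(labels[nn] == labels[:, None], axis=1)
```

A value near 1 for most observations indicates that the space separates the labels well for KNN classification. The same function can be applied to an ambient matrix and to a latent embedding to compare the two.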

Requirements

You need Python 3.6 or later to run MCML. You can have multiple Python versions (2.x and 3.x) installed on the same system without problems.

In Ubuntu, Mint and Debian you can install Python 3 like this:

$ sudo apt-get install python3 python3-pip

For other Linux distributions, macOS and Windows, packages are available at

https://www.python.org/getit/

Quick start

MCML can be installed using pip:

$ python3 -m pip install -U MCML

If you want to run the latest version of the code, you can install from git:

$ python3 -m pip install -U git+https://github.com/pachterlab/MCML.git

Examples

Example data download:

$ wget --quiet https://caltech.box.com/shared/static/i66kelel9ouep3yw8bn2duudkqey190j
$ mv i66kelel9ouep3yw8bn2duudkqey190j mat.mtx
$ wget --quiet https://caltech.box.com/shared/static/dcmr36vmsxgcwneh0attqt0z6qm6vpg6
$ mv dcmr36vmsxgcwneh0attqt0z6qm6vpg6 metadata.csv

Extract matrix (obs x features) and labels for each obs:

>>> import pandas as pd
>>> import scipy.io as sio
>>> import numpy as np

>>> mat = sio.mmread('mat.mtx') #Is a centered and scaled matrix (scaling input is optional)
>>> mat.shape
(3850, 1999)

>>> meta = pd.read_csv('metadata.csv')
>>> meta.head()
            Unnamed: 0          sample_name  smartseq_cluster_id  smartseq_cluster  ... n_genes percent_mito pass_count_filter  pass_mito_filter
0  SM-GE4R2_S062_E1-50  SM-GE4R2_S062_E1-50                   46   Nr5a1_9|11 Rorb  ...    9772          0.0              True              True
1  SM-GE4SI_S356_E1-50  SM-GE4SI_S356_E1-50                   46   Nr5a1_9|11 Rorb  ...    8253          0.0              True              True
2  SM-GE4SI_S172_E1-50  SM-GE4SI_S172_E1-50                   46   Nr5a1_9|11 Rorb  ...    9394          0.0              True              True
3   LS-15034_S07_E1-50   LS-15034_S07_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10643          0.0              True              True
4   LS-15034_S28_E1-50   LS-15034_S28_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10550          0.0              True              True

>>> cellTypes = list(meta.smartseq_cluster)
>>> sexLabels = list(meta.sex_label)
>>> len(sexLabels)
3850
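Before fitting, it can help to confirm that every label list has exactly one entry per row of the matrix, since the label classes are stacked into a single array. The helper below is a hypothetical convenience, not part of MCML. It raises early on a length mismatch and returns an array of shape (number of label classes, number of observations). Converting labels to strings is an assumption made so that mixed types (for example, missing values read as NaN) end up in one consistent dtype.

```python
import numpy as np

def stack_labels(n_obs, *label_lists):
    """Stack per-observation label lists into an (n_classes, n_obs) string array."""
    for i, labels in enumerate(label_lists):
        if len(labels) != n_obs:
            raise ValueError(
                f"label list {i} has {len(labels)} entries, expected {n_obs}"
            )
    # Cast every label to str so all classes share one dtype
    return np.array([[str(lab) for lab in labels] for labels in label_lists])
```

With the example data this would be called as `stack_labels(mat.shape[0], cellTypes, sexLabels)`.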



To run the MCML algorithm for dimensionality reduction (Python 3):

>>> from MCML.modules import MCML, bMCML

>>> mcml = MCML(n_latent = 50, epochs = 100) #Initialize MCML class

>>> latentMCML = mcml.fit(mat, np.array([cellTypes,sexLabels]) , fracNCA = 0.8 , silent = True) #Run MCML
>>> latentMCML.shape
(3850, 50)

This incorporates both the cell type and sex labels into the latent space construction. Use plotLosses() to view the loss function components over the training epochs.

>>> mcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs



To run the bMCML algorithm for dimensionality reduction (Python 3):

>>> bmcml = bMCML(n_latent = 50, epochs = 100) #Initialize bMCML class


>>> latentbMCML = bmcml.fit(mat, np.array(cellTypes), np.array(sexLabels), silent=True) #Run bMCML
>>> latentbMCML.shape
(3850, 50)

>>> bmcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs

Here bMCML optimizes for the intra-label distances of the sex labels, i.e. the pairwise distances between cells of the same sex within each cell type.

For both bMCML and MCML objects, fit() can be replaced with trainTest() to train the algorithms on a subset of the full data and apply the learned weights to the remaining test data. This offers a way to assess overfitting.
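The general hold-out idea can be sketched independently of MCML. The helper `holdout_split` below is hypothetical and only shows how observations might be partitioned into disjoint train and test index sets. How trainTest() actually selects its subset, and which arguments it takes, is not shown here and should be checked with help().

```python
import numpy as np

def holdout_split(n_obs, test_frac=0.2, seed=0):
    """Return disjoint, sorted (train_idx, test_idx) arrays covering range(n_obs)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_obs)            # random ordering of observation indices
    n_test = int(round(n_obs * test_frac))   # number of held-out observations
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])
```

If a metric computed on the held-out observations is much worse than the same metric on the training observations, the model is likely overfitting.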



To use the metrics available in tools:

>>> from MCML import tools as tl

#Pairwise distances between centroids of cells in each label
>>> cDists = tl.getCentroidDists(mat, np.array(cellTypes)) 
>>> len(cDists)
784

#Avg pairwise distances between cells of *both* sexes, for each cell type
>>> interDists = tl.getInterVar(mat, np.array(cellTypes), np.array(sexLabels))  
>>> len(interDists)
27

#Avg pairwise distances between cells of the *same* sex, for each cell type
>>> intraDists = tl.getIntraVar(mat, np.array(cellTypes), np.array(sexLabels)) 
>>> len(intraDists)
53

#Fraction of neighbors for each cell with same label as cell itself (also returns which labels neighbors have)
>>> neighbor_fracs, which_labels = tl.frac_unique_neighbors(mat, np.array(cellTypes), metric = 1,neighbors = 30)

#Get nearest neighbors for any embedding
>>> orig_neigh = tl.getNeighbors(mat, n_neigh = 15, p=1)
>>> latent_neigh = tl.getNeighbors(latentMCML, n_neigh = 15, p=1)

#Get Jaccard distance between latent and ambient nearest neighbors
>>> jac_dists = tl.getJaccard(orig_neigh, latent_neigh)
>>> len(jac_dists)
3850
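For intuition, the Jaccard distance between two neighbor sets A and B is 1 - |A ∩ B| / |A ∪ B|. It is 0 when an observation keeps exactly the same neighbors in both spaces and 1 when it keeps none. The sketch below is a plain NumPy illustration with hypothetical helper names (`knn_indices`, `jaccard_distances`) and Euclidean distance as an assumption. It is not the code behind tl.getNeighbors or tl.getJaccard, and it is only suitable for small inputs.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbors (Euclidean) of each row, excluding itself."""
    x = np.asarray(x, dtype=float)
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def jaccard_distances(neigh_a, neigh_b):
    """Per-observation Jaccard distance between two sets of neighbor indices (k >= 1)."""
    out = []
    for a, b in zip(neigh_a, neigh_b):
        a, b = set(map(int, a)), set(map(int, b))
        out.append(1.0 - len(a & b) / len(a | b))
    return np.array(out)
```

Averaging these distances over all observations gives a single summary of how well an embedding preserves local neighborhoods of the ambient space.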



To see further details of all inputs and outputs for all functions use:

>>> help(MCML)
>>> help(bMCML)
>>> help(tl)

License

MCML is licensed under the terms of the BSD License (see the file LICENSE).
