SSNMF
SSNMF contains class for (SS)NMF model and several multiplicative update methods to train different models.
Documentation
The NMF model consists of the data matrix to be factorized, X, and the factor matrices, A and S. Each model may also include a label matrix, Y, a classification factor matrix, B, and a classification weight parameter, lam (these three variables are empty if Y is not input). These parameters define the objective function of the model:
1. ‖X − AS‖_{F}^{2}
2. D(X‖AS)
3. ‖X − AS‖_{F}^{2} + λ‖Y − BS‖_{F}^{2}
4. ‖X − AS‖_{F}^{2} + λD(Y‖BS)
5. D(X‖AS) + λ‖Y − BS‖_{F}^{2}
6. D(X‖AS) + λD(Y‖BS)

Parameters
 X : numpy array or torch.Tensor
Data matrix of size m x n.
 k : int
Number of topics.
 modelNum : int, optional
Number indicating which of the above models the user intends to train (the default is 1).
 A : numpy array or torch.Tensor, optional
Initialization for the left factor matrix of X, of size m x k (the default is a matrix with uniform random entries).
 S : numpy array or torch.Tensor, optional
Initialization for the right factor matrix of X, of size k x n (the default is a matrix with uniform random entries).
 Y : numpy array or torch.Tensor, optional
Label matrix of size p x n (the default is None).
 B : numpy array or torch.Tensor, optional
Initialization for the left factor matrix of Y, of size p x k (the default is a matrix with uniform random entries if Y is not None, None otherwise).
 lam : float, optional
Weight parameter for the classification term in the objective (the default is 1 if Y is not None, None otherwise).
 W : numpy array or torch.Tensor, optional
Missing-data indicator matrix of the same size as X (the default is a matrix of all ones).
 L : numpy array or torch.Tensor, optional
Missing-label indicator matrix of the same size as Y (the default is a matrix of all ones if Y is not None, None otherwise).
 tol : float, optional
Tolerance for the relative-error stopping criterion (i.e., the method stops when the difference between consecutive relative errors falls below tol).
 str : string, private
A flag indicating whether the model was initialized with a NumPy array or a PyTorch tensor.

Methods
 mult(numiters = 10, saveerrs = True)
Train the selected model via numiters multiplicative updates.
 accuracy()
Compute the classification accuracy of a supervised model (using Y, B, and S).
 fronorm(Z, D, S, M)
Compute the Frobenius norm ‖Z − DS‖_{F}. M is a missing-data indicator matrix of the same size as Z (the default is a matrix of all ones).
 Idiv(Z, D, S, M)
Compute the I-divergence D(Z‖DS). M is a missing-data indicator matrix of the same size as Z (the default is a matrix of all ones).
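To make the two error measures concrete, here is a short NumPy sketch of what fronorm and Idiv compute. This is an illustrative sketch, not the library's implementation; the function names fronorm_sketch and idiv_sketch, and the small eps guard against division by zero and log of zero, are our own additions:

```python
import numpy as np

def fronorm_sketch(Z, D, S, M=None):
    # Masked Frobenius norm ||M * (Z - DS)||_F, where * is entrywise.
    M = np.ones_like(Z) if M is None else M
    return np.linalg.norm(M * (Z - D @ S))

def idiv_sketch(Z, D, S, M=None, eps=1e-10):
    # Masked I-divergence D(Z || DS) = sum of Z*log(Z/DS) - Z + DS
    # over the entries selected by the indicator M.
    M = np.ones_like(Z) if M is None else M
    R = D @ S
    return float(np.sum(M * (Z * np.log((Z + eps) / (R + eps)) - Z + R)))
```

Both quantities are zero exactly when Z = DS on the unmasked entries, which is why they serve as reconstruction errors.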
Installation
To install SSNMF, run this command in your terminal:
$ pip install -U ssnmf
This is the preferred method to install SSNMF, as it will always install the most recent stable release.
If you don't have pip installed, these installation instructions can guide you through the process.
Usage
First, import the ssnmf package and the relevant class SSNMF. We also import numpy, scipy, and torch for experimentation.
>>> import ssnmf
>>> from ssnmf import SSNMF
>>> import numpy as np
>>> import torch
>>> import scipy
>>> import scipy.sparse as sparse
>>> import scipy.optimize
SSNMF accepts either a NumPy array or a PyTorch tensor to initialize an (SS)NMF model.
If a model is initialized with a PyTorch tensor, the GPU may be used to run the model.
To run (SS)NMF on a GPU, users should have PyTorch installed on their devices. To test whether a GPU is available, run the following code. If it returns True, the GPU will be used to run the model; otherwise the CPU will be used.
>>> torch.cuda.is_available()
1. Training an unsupervised model without missing data using a NumPy array.
Declare an unsupervised NMF model ‖X − AS‖_{F}^{2} with data matrix X and number of topics k.
>>> X = np.random.rand(100,100)
>>> k = 10
>>> model = SSNMF(X,k,modelNum=1)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error ‖X − AS‖_{F}^{2} / ‖X − A_{0}S_{0}‖_{F}^{2}.
>>> rel_error = model.fronorm(model.X, model.A, model.S, model.W)**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> print("the initial relative reconstruction error is ", rel_error)
Run the multiplicative updates method for this unsupervised model for N iterations. This method tries to minimize the objective function ‖X − AS‖_{F}^{2}.
>>> N = 100
>>> [errs] = model.mult(numiters = N, saveerrs = True)
This method attempts to update the factor matrices N times; the actual number of updates depends on both N and the tolerance. You can check how many iterations the model actually ran and how much the relative reconstruction error improved.
>>> size = errs.shape[0]
>>> print("number of iterations that this model runs: ", size)
>>> rel_error = errs[size - 1]**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> print("the final relative reconstruction error is ", rel_error)
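For readers curious what happens inside mult() for this model, the classical Lee-Seung multiplicative updates for ‖X − AS‖_{F}^{2} can be sketched in a few lines. This is a minimal sketch under standard assumptions, not the library's mult(), which additionally handles masks, labels, and the stopping tolerance; the name mult_sketch and the eps guard are our own:

```python
import numpy as np

def mult_sketch(X, k, numiters=100, eps=1e-10):
    # Alternate the Lee-Seung multiplicative updates, which keep A and S
    # nonnegative and monotonically decrease ||X - AS||_F^2.
    m, n = X.shape
    rng = np.random.default_rng(0)
    A, S = rng.random((m, k)), rng.random((k, n))
    for _ in range(numiters):
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        S *= (A.T @ X) / (A.T @ A @ S + eps)
    return A, S
```

Because each update multiplies by a nonnegative ratio, nonnegativity of the factors is preserved automatically, with no projection step.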
2. Training an unsupervised model without missing data using a PyTorch tensor.
Declare an unsupervised NMF model D(X‖AS) with data matrix X and number of topics k.
>>> d = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
>>> X = torch.rand(100, 100, dtype=torch.float, device=d)
>>> k = 10
>>> model = SSNMF(X,k,modelNum=2)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error D(X‖AS)/D(X‖A_{0}S_{0}).
>>> rel_error = model.Idiv(model.X, model.A, model.S, model.W)/model.Idiv(model.X, A0, S0, model.W)
>>> print("the initial relative reconstruction error is ", rel_error)
Run the multiplicative updates method for this unsupervised model for N iterations. This method tries to minimize the objective function D(X‖AS).
>>> N = 100
>>> [errs] = model.mult(numiters = N, saveerrs = True)
This method attempts to update the factor matrices N times; the actual number of updates depends on both N and the tolerance. You can check how many iterations the model actually ran and how much the relative reconstruction error improved.
>>> size = errs.shape[0]
>>> print("number of iterations that this model runs: ", size)
>>> rel_error = errs[size - 1]/model.Idiv(model.X, A0, S0, model.W)
>>> print("the final relative reconstruction error is ", rel_error)
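The multiplicative updates for the I-divergence objective D(X‖AS) differ from the Frobenius case. The classical Lee-Seung form can be sketched as below; again this is an illustrative NumPy sketch (the library's mult() also handles masks, labels, tolerance, and tensor inputs), and the name mult_idiv_sketch is ours:

```python
import numpy as np

def mult_idiv_sketch(X, k, numiters=100, eps=1e-10):
    # Lee-Seung multiplicative updates for D(X || AS): each factor is
    # scaled by the ratio X/(AS) pushed through the other factor,
    # normalized by that factor's row/column sums.
    m, n = X.shape
    rng = np.random.default_rng(0)
    A, S = rng.random((m, k)), rng.random((k, n))
    for _ in range(numiters):
        A *= ((X / (A @ S + eps)) @ S.T) / (S.sum(axis=1) + eps)
        S *= (A.T @ (X / (A @ S + eps))) / (A.sum(axis=0)[:, None] + eps)
    return A, S
```

These updates monotonically decrease the I-divergence, which is why running more iterations (up to the tolerance) can only help.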
3. Training a supervised model without missing data using a NumPy array.
We begin by generating some synthetic data for testing.
>>> labelmat = np.concatenate((np.concatenate((np.ones([1,10]),np.zeros([1,30])),axis=1),np.concatenate((np.zeros([1,10]),np.ones([1,10]),np.zeros([1,20])),axis=1),np.concatenate((np.zeros([1,20]),np.ones([1,10]),np.zeros([1,10])),axis=1),np.concatenate((np.zeros([1,30]),np.ones([1,10])),axis=1)))
>>> B = sparse.random(4,10,density=0.2).toarray()
>>> S = np.zeros([10,40])
>>> for i in range(40):
... S[:,i] = scipy.optimize.nnls(B,labelmat[:,i])[0]
>>> A = np.random.rand(40,10)
>>> X = A @ S
Declare a supervised SSNMF model ‖X − AS‖_{F}^{2} + λ‖Y − BS‖_{F}^{2} with data matrix X, number of topics k, label matrix Y, and weight parameter λ.
>>> k = 10
>>> model = SSNMF(X,k,Y = labelmat,lam=100*np.linalg.norm(X,'fro'),modelNum=3)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error ‖X − AS‖_{F}^{2} / ‖X − A_{0}S_{0}‖_{F}^{2} and the classification accuracy.
>>> rel_error = model.fronorm(model.X, model.A, model.S, model.W)**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> acc = model.accuracy()
>>> print("the initial relative reconstruction error is ", rel_error)
>>> print("the initial classifier's accuracy is ", acc)
Run the multiplicative updates method for this supervised model for N iterations. This method tries to minimize the objective function ‖X − AS‖_{F}^{2} + λ‖Y − BS‖_{F}^{2}. It also saves the errors and accuracies at each iteration.
>>> N = 100
>>> [errs,reconerrs,classerrs,classaccs] = model.mult(numiters = N,saveerrs = True)
This method updates the factor matrices N times. You can see how much the relative reconstruction error and classification accuracy improve.
>>> size = reconerrs.shape[0]
>>> rel_error = reconerrs[size - 1]**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> acc = classaccs[size - 1]
>>> print("number of iterations that this model runs: ", size)
>>> print("the final relative reconstruction error is ", rel_error)
>>> print("the final classifier's accuracy is ", acc)
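For intuition about the accuracy() method, one natural reading is that each column (document) is assigned the class with the largest entry of B @ S and compared against the one-hot columns of Y. The sketch below implements that reading; the library's own rule may differ in details, and the name accuracy_sketch is ours:

```python
import numpy as np

def accuracy_sketch(Y, B, S):
    # Predicted class per column: row index of the largest entry of B @ S.
    pred = np.argmax(B @ S, axis=0)
    # True class per column: row index of the 1 in the one-hot label matrix Y.
    truth = np.argmax(Y, axis=0)
    return float(np.mean(pred == truth))
```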
4. Training a supervised model without missing data using a PyTorch tensor.
Generate some synthetic data for testing.
>>> labelmat = np.concatenate((np.concatenate((np.ones([1,10]),np.zeros([1,30])),axis=1),np.concatenate((np.zeros([1,10]),np.ones([1,10]),np.zeros([1,20])),axis=1),np.concatenate((np.zeros([1,20]),np.ones([1,10]),np.zeros([1,10])),axis=1),np.concatenate((np.zeros([1,30]),np.ones([1,10])),axis=1)))
>>> B = sparse.random(4,10,density=0.2).toarray()
>>> S = np.zeros([10,40])
>>> for i in range(40):
... S[:,i] = scipy.optimize.nnls(B,labelmat[:,i])[0]
>>> A = np.random.rand(40,10)
>>> X = A @ S
Define a simple function to convert a NumPy array to a PyTorch tensor.
Parameter m : the NumPy array to be converted to a PyTorch tensor
Parameter device : the device of the PyTorch tensor (e.g., GPU or CPU)
(Important notice: converting a NumPy array with torch.from_numpy() may lose precision. We use it here only because the data is artificially generated to ensure X can be decomposed into A and S. If you apply the ssnmf model with PyTorch to your own real data, store the data as PyTorch tensors to avoid precision loss.)
>>> import copy
>>> def getTensor(m, device):
...     mt = torch.from_numpy(copy.deepcopy(m))
...     mt = mt.type(torch.FloatTensor)
...     mt = mt.to(device)
...     return mt
Declare a supervised SSNMF model ‖X − AS‖_{F}^{2} + λD(Y‖BS) with data matrix X, number of topics k, label matrix Y, and weight parameter λ.
>>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
>>> Xt = getTensor(X, device)
>>> Yt = getTensor(labelmat, device)
>>> k = 10
>>> model = SSNMF(Xt,k,Y = Yt,lam=100*torch.norm(Xt), modelNum=4)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error ‖X − AS‖_{F}^{2} / ‖X − A_{0}S_{0}‖_{F}^{2} and the classification accuracy.
>>> rel_error = model.fronorm(model.X, model.A, model.S, model.W)**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> acc = model.accuracy()
>>> print("the initial relative reconstruction error is ", rel_error)
>>> print("the initial classifier's accuracy is ", acc)
Run the multiplicative updates method for this supervised model for N iterations. This method tries to minimize the objective function ‖X − AS‖_{F}^{2} + λD(Y‖BS). It also saves the errors and accuracies at each iteration.
>>> N = 100
>>> [errs,reconerrs,classerrs,classaccs] = model.mult(numiters = N,saveerrs = True)
This method updates the factor matrices N times. You can see how much the relative reconstruction error and classification accuracy improve.
>>> size = reconerrs.shape[0]
>>> rel_error = reconerrs[size - 1]**2/model.fronorm(model.X, A0, S0, model.W)**2
>>> acc = classaccs[size - 1]
>>> print("number of iterations that this model runs: ", size)
>>> print("the final relative reconstruction error is ", rel_error)
>>> print("the final classifier's accuracy is ", acc)
5. Training a supervised model with missing data using a NumPy array.
Generate some synthetic data for testing.
>>> labelmat = np.concatenate((np.concatenate((np.ones([1,10]),np.zeros([1,30])),axis=1),np.concatenate((np.zeros([1,10]),np.ones([1,10]),np.zeros([1,20])),axis=1),np.concatenate((np.zeros([1,20]),np.ones([1,10]),np.zeros([1,10])),axis=1),np.concatenate((np.zeros([1,30]),np.ones([1,10])),axis=1)))
>>> B = sparse.random(4,10,density=0.2).toarray()
>>> S = np.zeros([10,40])
>>> for i in range(40):
... S[:,i] = scipy.optimize.nnls(B,labelmat[:,i])[0]
>>> A = np.random.rand(40,10)
>>> X = A @ S
Define a simple function to generate a W matrix (missing-data indicator matrix).
Parameter X : the matrix with missing data
Parameter per : the percentage of missing data in X (e.g., per=10 means 10% of the entries of X are missing)
(Important notice: this function only illustrates how to use the ssnmf model when X has missing data. In practical applications, use your own missing-data indicator matrix based on your real data.)
>>> import random
>>> import itertools
>>> def getW(X, per):
...     num = round(per/100 * X.shape[0] * X.shape[1])
...     W = np.ones(shape = X.shape)
...     row = [i for i in range(X.shape[0])]
...     column = [i for i in range(X.shape[1])]
...     index = random.sample(list(itertools.product(row, column)), num)
...     for i in range(num):
...         W[index[i][0]][index[i][1]] = 0
...     return W
Declare a supervised SSNMF model D(X‖AS) + λ‖Y − BS‖_{F}^{2} with data matrix X, number of topics k, label matrix Y, missing-data indicator matrix W, and weight parameter λ.
>>> k = 10
>>> W0 = getW(X, 10)
>>> model = SSNMF(X,k,Y = labelmat,lam=100*np.linalg.norm(X,'fro'), W = W0, modelNum=5)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error D(X‖AS)/D(X‖A_{0}S_{0}) and the classification accuracy.
>>> rel_error = model.Idiv(model.X, model.A, model.S, model.W)/model.Idiv(model.X, A0, S0, model.W)
>>> acc = model.accuracy()
>>> print("the initial relative reconstruction error is ", rel_error)
>>> print("the initial classifier's accuracy is ", acc)
Run the multiplicative updates method for this supervised model for N iterations. This method tries to minimize the objective function D(X‖AS) + λ‖Y − BS‖_{F}^{2}. It also saves the errors and accuracies at each iteration.
>>> N = 100
>>> [errs,reconerrs,classerrs,classaccs] = model.mult(numiters = N,saveerrs = True)
This method updates the factor matrices N times. You can see how much the relative reconstruction error and classification accuracy improve.
>>> size = reconerrs.shape[0]
>>> rel_error = reconerrs[size - 1]/model.Idiv(model.X, A0, S0, model.W)
>>> acc = classaccs[size - 1]
>>> print("number of iterations that this model runs: ", size)
>>> print("the final relative reconstruction error is ", rel_error)
>>> print("the final classifier's accuracy is ", acc)
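The role of the indicator matrix W in the updates can be sketched for the simpler masked Frobenius objective ‖W ∘ (X − AS)‖_{F}^{2}: W simply multiplies both X and the current reconstruction inside each update, so zeroed entries never influence the factors. This is an illustrative sketch of weighted NMF, not the library's combined objective (which adds the label term); the name mult_masked_sketch is ours:

```python
import numpy as np

def mult_masked_sketch(X, W, k, numiters=100, eps=1e-10):
    # Multiplicative updates for ||W * (X - AS)||_F^2, where * is
    # entrywise: missing entries (W == 0) are ignored by the fit.
    m, n = X.shape
    rng = np.random.default_rng(0)
    A, S = rng.random((m, k)), rng.random((k, n))
    for _ in range(numiters):
        A *= ((W * X) @ S.T) / ((W * (A @ S)) @ S.T + eps)
        S *= (A.T @ (W * X)) / (A.T @ (W * (A @ S)) + eps)
    return A, S
```

Note that the factorization can still produce plausible values at the masked positions, since those entries of A @ S are determined by the observed ones.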
6. Training a supervised model with missing labels using a PyTorch tensor.
Generate some synthetic data for testing.
>>> labelmat = np.concatenate((np.concatenate((np.ones([1,10]),np.zeros([1,30])),axis=1),np.concatenate((np.zeros([1,10]),np.ones([1,10]),np.zeros([1,20])),axis=1),np.concatenate((np.zeros([1,20]),np.ones([1,10]),np.zeros([1,10])),axis=1),np.concatenate((np.zeros([1,30]),np.ones([1,10])),axis=1)))
>>> B = sparse.random(4,10,density=0.2).toarray()
>>> S = np.zeros([10,40])
>>> for i in range(40):
... S[:,i] = scipy.optimize.nnls(B,labelmat[:,i])[0]
>>> A = np.random.rand(40,10)
>>> X = A @ S
Define a simple function to generate an L matrix (missing-label indicator matrix).
Parameter Y : the label matrix with missing labels
Parameter per : the percentage of missing labels in Y (e.g., per=10 means the labels of 10% of the columns of Y are missing)
(Important notice: this function only illustrates how to use the ssnmf model when the label matrix Y has missing labels. In practical applications, use your own missing-label indicator matrix based on your real data.)
>>> import random
>>> def getL(Y, per):
...     num = round(per/100 * Y.shape[1])
...     L = np.ones(shape = Y.shape)
...     column = [i for i in range(Y.shape[1])]
...     index = random.sample(column, num)
...     L[:,index] = 0
...     return L
Declare a supervised SSNMF model D(X‖AS) + λD(Y‖BS) with data matrix X, number of topics k, label matrix Y, missing-label indicator matrix L, and weight parameter λ.
>>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
>>> Xt = getTensor(X, device) ## getTensor() defined in section 4, Training a supervised model without missing data using a PyTorch tensor
>>> Yt = getTensor(labelmat, device)
>>> L0 = getTensor(getL(labelmat, 10), device)
>>> k = 10
>>> model = SSNMF(Xt,k,Y = Yt,lam=100*torch.norm(Xt), L=L0, modelNum=6)
>>> A0 = model.A
>>> S0 = model.S
You may access the factor matrices initialized in the model, e.g., to check the relative reconstruction error D(X‖AS)/D(X‖A_{0}S_{0}) and the classification accuracy.
>>> rel_error = model.Idiv(model.X, model.A, model.S, model.W)/model.Idiv(model.X, A0, S0, model.W)
>>> acc = model.accuracy()
>>> print("the initial relative reconstruction error is ", rel_error)
>>> print("the initial classifier's accuracy is ", acc)
Run the multiplicative updates method for this supervised model for N iterations. This method tries to minimize the objective function D(X‖AS) + λD(Y‖BS). It also saves the errors and accuracies at each iteration.
>>> N = 100
>>> [errs,reconerrs,classerrs,classaccs] = model.mult(numiters = N,saveerrs = True)
This method updates the factor matrices N times. You can see how much the relative reconstruction error and classification accuracy improve.
>>> size = reconerrs.shape[0]
>>> rel_error = reconerrs[size - 1]/model.Idiv(model.X, A0, S0, model.W)
>>> acc = classaccs[size - 1]
>>> print("number of iterations that this model runs: ", size)
>>> print("the final relative reconstruction error is ", rel_error)
>>> print("the final classifier's accuracy is ", acc)
Citing
If you use our code in an academic setting, please consider citing the following paper.
J. Haddock, L. Kassab, S. Li, A. Kryshchenko, R. Grotheer, E. Sizikova, C. Wang, T. Merkh, R. W. M. A. Madushani, M. Ahn, D. Needell, and K. Leonard, "Semisupervised Nonnegative Matrix Factorization Models for Topic Modeling in Learning Tasks." Submitted, 2020.
Development
See CONTRIBUTING.md for information related to developing the code.