Performs large-scale CMF on flexible data layouts.

Large-scale Collective Matrix Factorization (lsCMF)

This is a package implementing the data integration methodology described in "Large-scale Data Integration using Matrix Denoising and Geometric Factor Matching" (Held, 2024, arXiv:2405.10036 [stat.ME]).

Install development version

To install the development version of the package run

pip install git+https://github.com/cyianor/lscmf.git

Install stable version

To install the stable version of the package run

pip install lscmf

Usage example

A simple usage example is shown below:

import lscmf
from numpy.random import default_rng

# Control randomness
rng = default_rng(42)

# Simulate some data
# - `viewdims`: Dimensions of each view
# - `factor_scales`: The strength/singular value of each factor. 
#                    The diagonal of the D matrices in the paper.
# - `snr`: Signal-to-noise ratio; controls the variance of the noise
#          added to each true signal
#
# The function below generates orthogonal matrices V_i and uses the
# supplied D_ij to form signal matrices V_i D_ij V_j^T. Noise with
# residual variance controlled by the signal-to-noise ratio is added.
xs_sim = lscmf.simulate(
    viewdims={0: 500, 1: 250, 2: 250},
    factor_scales={
        (0, 1): [3.0, 2.5, 2.0, 0.0, 0.0],
        (0, 2): [2.8, 0.0, 0.0, 2.0, 0.0],
        (1, 2): [1.2, 0.0, 5.0, 0.0, 1.1],
    },
    snr=1.0,
    rng=rng,
)

# `xs_sim` is a dictionary containing
# - "xs_truth", the true signal matrices
# - "xs", the noisy data
# - "vs", the simulated orthogonal factors

# Create the lscmf object and fit the model to data
est = lscmf.LargeScaleCMF().fit(xs_sim["xs"])
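To make the simulated model above more concrete, the following is a minimal NumPy sketch of the construction described in the comments: matrices V_i with orthonormal columns, signal matrices V_i D_ij V_j^T, and additive Gaussian noise. It is not the code behind lscmf.simulate. The helper random_orthonormal is hypothetical, and the way snr is turned into a noise variance (mean squared signal entry divided by snr) is an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_orthonormal(n, k, rng):
    """Return an n x k matrix with orthonormal columns (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q


# Same layout as in the usage example
viewdims = {0: 500, 1: 250, 2: 250}
factor_scales = {
    (0, 1): [3.0, 2.5, 2.0, 0.0, 0.0],
    (0, 2): [2.8, 0.0, 0.0, 2.0, 0.0],
    (1, 2): [1.2, 0.0, 5.0, 0.0, 1.1],
}
snr = 1.0
k = 5

# One factor matrix V_i per view, shared by all data matrices involving view i
vs = {i: random_orthonormal(n, k, rng) for i, n in viewdims.items()}

xs_truth, xs = {}, {}
for (i, j), d in factor_scales.items():
    signal = vs[i] @ np.diag(d) @ vs[j].T
    # Assumption: noise variance = mean squared signal entry / snr
    sigma = np.sqrt(np.mean(signal**2) / snr)
    xs_truth[(i, j)] = signal
    xs[(i, j)] = signal + sigma * rng.standard_normal(signal.shape)
```

A zero entry in factor_scales means that the corresponding factor does not contribute to that data matrix, so the non-zero singular values of each xs_truth entry are exactly the non-zero scales supplied for it.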

After fitting, estimates of the model parameters are stored in the LargeScaleCMF object. The estimated singular values can be accessed as shown below.

est.ds_
{(0,
  1): array([-2.98687823,  0.        ,  1.96015864,  0.        ,  2.47498787]),
 (0,
  2): array([-2.78861131,  1.96604697,  0.        ,  0.        ,  0.        ]),
 (1, 2): array([1.13272996, 0.        , 4.98917765, 1.00988163, 0.        ])}
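The estimated singular values agree with the simulated factor_scales only up to sign and up to the order of the integrated factors. As a rough sketch of how the two could be lined up, the code below copies the printed values and the true scales into arrays (one row per view pair) and searches all column permutations for the one that minimises the squared difference of absolute values. The function match_columns is a hypothetical helper, not part of lscmf, and the brute-force search is only practical for a handful of factors.

```python
from itertools import permutations

import numpy as np

# Rows: view pairs (0, 1), (0, 2), (1, 2); columns: factors
truth = np.array([
    [3.0, 2.5, 2.0, 0.0, 0.0],
    [2.8, 0.0, 0.0, 2.0, 0.0],
    [1.2, 0.0, 5.0, 0.0, 1.1],
])
est = np.array([
    [-2.98687823, 0.0, 1.96015864, 0.0, 2.47498787],
    [-2.78861131, 1.96604697, 0.0, 0.0, 0.0],
    [1.13272996, 0.0, 4.98917765, 1.00988163, 0.0],
])


def match_columns(est, truth):
    """For each estimated column, return the index of the matched true column.

    Signs are ignored by comparing absolute values (hypothetical helper).
    """
    k = truth.shape[1]
    best = min(
        permutations(range(k)),
        key=lambda p: np.sum((np.abs(est) - truth[:, list(p)]) ** 2),
    )
    return list(best)


perm = match_columns(est, truth)
print(perm)  # [0, 3, 2, 4, 1]
```

For the values printed above, estimated columns 0 to 4 are paired with true columns 0, 3, 2, 4 and 1, respectively.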

The estimated factors are stored in est.vs_. The example below compares the third integrated factor of view 1 with the fifth simulated factor.

import matplotlib.pyplot as plt

cos_angle = (est.vs_[1][:, 2] * xs_sim["vs"][1][:, 4]).sum()
print(f"Scalar product between estimated and true factor: {cos_angle:.2f}")

fig = plt.figure(figsize=(8, 3), dpi=100)
ax = fig.add_subplot(111)
# Negate integrated factor since it was estimated as -v instead of v
ax.hist((-est.vs_[1][:, 2]) - xs_sim["vs"][1][:, 4], bins=30)
ax.set_xlabel("Element-wise difference between estimated and true factor")
ax.set_ylabel("Frequency")
ax.set_title("View 1, Integrated factor 3, True factor 5")
Scalar product between estimated and true factor: 0.00

[Figure: histogram of the element-wise difference between the estimated and the true factor]
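Because factors are recovered only up to sign and ordering, it can help to look at all pairwise scalar products at once before comparing individual columns. The sketch below uses synthetic stand-ins, not lscmf output: v_true has orthonormal columns, and v_est is a permuted, sign-flipped and slightly perturbed copy. The permutation, signs and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" factors with orthonormal columns
v_true, _ = np.linalg.qr(rng.standard_normal((250, 5)))

# Synthetic "estimate": permuted, sign-flipped and slightly noisy
perm = [0, 3, 2, 4, 1]
signs = np.array([-1.0, 1.0, 1.0, 1.0, -1.0])
v_est = v_true[:, perm] * signs + 0.005 * rng.standard_normal((250, 5))

# Cosine of the angle between every estimated and every true factor
v_est_unit = v_est / np.linalg.norm(v_est, axis=0)
cos = v_est_unit.T @ v_true  # columns of v_true already have unit norm

# For each estimated factor: best matching true factor and its sign
best_match = np.abs(cos).argmax(axis=1)
best_sign = np.sign(cos[np.arange(cos.shape[0]), best_match])

for j, (t, s) in enumerate(zip(best_match, best_sign)):
    print(f"estimated factor {j} ~ {'+' if s > 0 else '-'} true factor {t}")
```

A scalar product near 0 for a given pair indicates that the two columns do not correspond to each other, while a value near -1 indicates a match with flipped sign.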

A raw graph-based interface exists as well.

# Create a view graph to hold the data layout
G = lscmf.ViewGraph()
# Add data
# - `names` need to be provided as an iterable.
#   These are in general arbitrary, however, in case of
#   repeated layers, each layer requires a unique name.
# - `xs` is an iterable over the input data matrices, in the same
#   order as `names`
# - `viewrels` is an iterable containing tuples describing the
#    relationships between views contained in data matrices.
G.add_data_from(["x01", "x02", "x12"], xs_sim["xs"].values(), [(0, 1), (0, 2), (1, 2)])

# Once data is added to the view graph, joint matrices for each
# view need to be formed and denoising needs to be performed.
# Different types of shrinkers can be used for denoising and they
# depend on the type of loss assumed for reconstruction of the
# signal. See Gavish and Donoho (2017) for details.
# In the paper, Frobenius loss is assumed, and therefore the resulting
# `FrobeniusShrinker` is used here.
lscmf.precompute(G, lscmf.FrobeniusShrinker)

# Finally, matching of factors for each view and merging of the
# factor match graphs is performed. This function returns
# the final merged factor match graph.
H = lscmf.match_factors(G)
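As background for the denoising step mentioned in the comments above, the following sketch applies the optimal singular value shrinker for Frobenius loss from Gavish and Donoho (2017) to a single noisy matrix. It does not show how lscmf.FrobeniusShrinker is implemented. The functions frobenius_shrink and denoise are hypothetical names, and the sketch assumes that the noise standard deviation sigma is known.

```python
import numpy as np


def frobenius_shrink(y, beta):
    """Optimal Frobenius-loss shrinker for singular values y.

    Assumes an m x n matrix with beta = m / n <= 1 and noise entries
    of variance 1 / n. Values at or below the bulk edge 1 + sqrt(beta)
    are set to zero.
    """
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    above = y > 1.0 + np.sqrt(beta)
    ya = y[above]
    inner = np.maximum((ya**2 - beta - 1.0) ** 2 - 4.0 * beta, 0.0)
    out[above] = np.sqrt(inner) / ya
    return out


def denoise(x, sigma):
    """Shrink the singular values of x, given the noise level sigma."""
    m, n = x.shape
    if m > n:
        return denoise(x.T, sigma).T
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scale = sigma * np.sqrt(n)  # rescale so that the noise variance is 1 / n
    s_shrunk = scale * frobenius_shrink(s / scale, m / n)
    return (u * s_shrunk) @ vt


# Small demonstration on a rank-one signal
rng = np.random.default_rng(0)
m, n, sigma = 100, 200, 0.1
u0 = rng.standard_normal(m)
u0 /= np.linalg.norm(u0)
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
signal = 10.0 * np.outer(u0, v0)
x = signal + sigma * rng.standard_normal((m, n))
x_hat = denoise(x, sigma)

print("error before denoising:", np.linalg.norm(x - signal))
print("error after denoising: ", np.linalg.norm(x_hat - signal))
```

Singular values that fall inside the noise bulk are set to zero, and the remaining ones are shrunk towards zero to compensate for the inflation caused by the noise.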

Matches in the factor match graph can be inspected. Each MatchNode contains a data_edge, which is a MultiEdge(name, viewrel) identifying an input matrix added during view graph construction, and a factor, which is the index of a factor within that data matrix. The keys of the dictionary are the integrated factors.

In the example below, MatchNode(data_edge=MultiEdge(x01, (0, 1)), factor=0) is the first factor of data matrix $X_{01}$, which is associated with integrated factor 0.

The numbering of integrated factors is arbitrary and may be non-consecutive.

H.graph
defaultdict(set,
            {0: {MatchNode(data_edge=MultiEdge(x01, (0, 1)), factor=0),
              MatchNode(data_edge=MultiEdge(x02, (0, 2)), factor=0),
              MatchNode(data_edge=MultiEdge(x12, (1, 2)), factor=1)},
             2: {MatchNode(data_edge=MultiEdge(x02, (0, 2)), factor=1)},
             1: {MatchNode(data_edge=MultiEdge(x01, (0, 1)), factor=2),
              MatchNode(data_edge=MultiEdge(x12, (1, 2)), factor=0)},
             3: {MatchNode(data_edge=MultiEdge(x12, (1, 2)), factor=2)},
             4: {MatchNode(data_edge=MultiEdge(x01, (0, 1)), factor=1)}})
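The dictionary above maps integrated factors to sets of match nodes. For some purposes the inverse view is more convenient: for each data matrix, which integrated factor does each of its own factors belong to? The sketch below rebuilds the printed graph with namedtuple stand-ins for MatchNode and MultiEdge, so that it runs without lscmf. The attribute names name, viewrel, data_edge and factor follow the description in the text, and it is an assumption that the real objects expose them in the same way.

```python
from collections import defaultdict, namedtuple

# Stand-ins for the lscmf classes (assumed attribute names)
MultiEdge = namedtuple("MultiEdge", ["name", "viewrel"])
MatchNode = namedtuple("MatchNode", ["data_edge", "factor"])

x01 = MultiEdge("x01", (0, 1))
x02 = MultiEdge("x02", (0, 2))
x12 = MultiEdge("x12", (1, 2))

# Copy of the factor match graph printed above
graph = defaultdict(set, {
    0: {MatchNode(x01, 0), MatchNode(x02, 0), MatchNode(x12, 1)},
    2: {MatchNode(x02, 1)},
    1: {MatchNode(x01, 2), MatchNode(x12, 0)},
    3: {MatchNode(x12, 2)},
    4: {MatchNode(x01, 1)},
})

# Invert: data matrix name -> {factor in that matrix: integrated factor}
by_matrix = defaultdict(dict)
for integrated, nodes in graph.items():
    for node in nodes:
        by_matrix[node.data_edge.name][node.factor] = integrated

for name in sorted(by_matrix):
    print(name, dict(sorted(by_matrix[name].items())))
# x01 {0: 0, 1: 4, 2: 1}
# x02 {0: 0, 1: 2}
# x12 {0: 1, 1: 0, 2: 3}

# Integrated factors that appear in more than one data matrix
shared = {k for k, nodes in graph.items() if len(nodes) > 1}
print(sorted(shared))  # [0, 1]
```

Integrated factors with a single match node are specific to one data matrix, while those with several nodes are shared between the corresponding matrices.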

Reconstruction of $D_{ij}$ and $V_i$ is performed in LargeScaleCMF.fit() and it is recommended to use that interface unless the graph interface is required for a specific reason.
