
Dimensionality reduction by preserving clusters and correlations.


PCC - Dimensionality reduction with very high global structure preservation

License: MIT


Preprint: https://arxiv.org/abs/2503.07609
Authors: Jacob Gildenblat, Jens Pahnke

pip install pccdr

from pcc import PCUMAP
pcumap_embedding = PCUMAP(device='cuda').fit_transform(X)

⭐ This is a Python package for dimensionality reduction (DR) with high global structure preservation.

⭐ That means that, unlike in popular DR methods such as UMAP, the distances between transformed points actually mean something.

⭐ Use PCUMAP to enhance the widely used UMAP method with global structure preservation.

⭐ Or use our own PCC objective, which results in extremely high global structure preservation and competitive local structure preservation.

For Spearman correlation support, install torchsort (pip install torchsort).

A few visual examples

Image | Description
Fashion MNIST | An example on the Fashion-MNIST dataset
MSI | An application to Mass Spectrometry Imaging
Macosko single cell dataset | An application illustrating global structure preservation on the Macosko single-cell dataset, compared to UMAP

PCC is built on the idea of sampling reference points, measuring the distances of all data points from these reference points, and maximizing the correlation between these distances in the high-dimensional data and in the transformed low-dimensional data.
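To make the idea concrete, here is a minimal NumPy sketch of the quantity being maximized: the Pearson correlation between distances to sampled reference points, measured in the original space and in the embedding. It is an illustration under assumptions, not the package's implementation; the function name `reference_distance_correlation` and its parameters `num_refs` and `seed` are hypothetical.

```python
import numpy as np

def reference_distance_correlation(X_high, X_low, num_refs=32, seed=0):
    """Mean Pearson correlation between distances to sampled reference points,
    computed in the original space (X_high) and in the embedding (X_low)."""
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    # Hypothetical sampling scheme: pick reference points uniformly from the data.
    ref_idx = rng.choice(n, size=min(num_refs, n), replace=False)

    # Distances from every point to each reference point: shape (n, num_refs).
    d_high = np.linalg.norm(X_high[:, None, :] - X_high[None, ref_idx, :], axis=-1)
    d_low = np.linalg.norm(X_low[:, None, :] - X_low[None, ref_idx, :], axis=-1)

    # Pearson correlation per reference point (column-wise), then averaged.
    d_high_c = d_high - d_high.mean(axis=0)
    d_low_c = d_low - d_low.mean(axis=0)
    num = (d_high_c * d_low_c).sum(axis=0)
    den = np.sqrt((d_high_c ** 2).sum(axis=0) * (d_low_c ** 2).sum(axis=0))
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# A uniformly scaled copy keeps every distance proportional, so the score is maximal.
score_scaled = reference_distance_correlation(X, 3.0 * X)
# An unrelated random 2D layout carries no information about the original distances.
score_random = reference_distance_correlation(X, rng.normal(size=(200, 2)))
```

A DR method that maximizes this kind of score is rewarded for keeping far points far and near points near relative to the reference points, which is what global structure preservation means here.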

Usage examples

See examples/macosko.ipynb for a more detailed explanation and more usage examples.

There are two modes:

Plugging into UMAP to get a meaningful transformation where distances between points mean something

Here we use the excellent recent TorchDR library and plug our objective into UMAP.

from pcc import PCUMAP
pcumap_embedding = PCUMAP(device='cuda', n_components=2).fit_transform(X)
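One way to check whether distances in an embedding "mean something" is to compare all pairwise distances before and after the transformation using Spearman rank correlation. The sketch below is a generic evaluation helper written in plain NumPy, not part of the pcc package; the names `pairwise_distances`, `ranks`, and `global_structure_score` are hypothetical, and it assumes both inputs are NumPy arrays with one row per point.

```python
import numpy as np

def pairwise_distances(X):
    """Vector of Euclidean distances between all pairs i < j."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(X.shape[0], k=1)
    return dist[i, j]

def ranks(values):
    """Ordinal ranks 0..n-1 (ties broken by position, adequate for continuous data)."""
    order = np.argsort(values, kind="stable")
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def global_structure_score(X_high, X_low):
    """Spearman correlation between pairwise distances before and after DR,
    computed as the Pearson correlation of the distance ranks."""
    r_high = ranks(pairwise_distances(X_high))
    r_low = ranks(pairwise_distances(X_low))
    return float(np.corrcoef(r_high, r_low)[0, 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
same_order = global_structure_score(X, 2.0 * X)            # scaling preserves distance ranks
shuffled = global_structure_score(X, rng.permutation(X))   # rows reassigned at random
```

With a real embedding, the call would look like `global_structure_score(X, pcumap_embedding)`, converting the embedding to a NumPy array first if the library returns another array type. The dense distance matrix grows quadratically with the number of points, so for large datasets it is common to score a random subsample.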

PCC as a standalone DR method with a multi-task objective

This optimizes a multi-task objective with two parts: a local structure preservation loss that tries to predict which clusters points belong to, and a global structure preservation loss that maximizes the correlations between the distances of all points to sampled reference points.

First, let's cluster the points with different clustering models:

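import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)

# Cluster X at several granularities; each run yields one label per data point.
clusters = []
n_clusters_list = [4, 8, 16, 32, 64]
for n_clusters in n_clusters_list:
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
    cluster_labels = kmeans.fit_predict(X)
    clusters.append(cluster_labels)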

Then we can call PCC:

# PCC is assumed to be importable from the same pcc module as PCUMAP.
from pcc import PCC

pcc_reducer = PCC(n_components=2, num_epochs=2000, num_points=1000, pearson=True,
                  spearman=False, beta=5, k_epoch=2)
pcc_embedding = pcc_reducer.fit_transform(X, clusters)
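The multi-task objective described in this section can be sketched in NumPy as a sum of two terms: a cluster-prediction loss per clustering (local structure) and a correlation loss over distances to reference points (global structure). This is only an illustration of the general shape of such an objective. How the package actually parameterizes and weights the terms is not described here, so the linear classifier `heads`, the `global_weight` factor, and all function names below are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under a row-wise softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def correlation_loss(X_high, X_low, ref_idx):
    """1 - mean Pearson correlation of distances to the reference points."""
    d_high = np.linalg.norm(X_high[:, None] - X_high[None, ref_idx], axis=-1)
    d_low = np.linalg.norm(X_low[:, None] - X_low[None, ref_idx], axis=-1)
    d_high = d_high - d_high.mean(axis=0)
    d_low = d_low - d_low.mean(axis=0)
    corr = (d_high * d_low).sum(axis=0) / np.sqrt(
        (d_high ** 2).sum(axis=0) * (d_low ** 2).sum(axis=0))
    return float(1.0 - corr.mean())

def multi_task_loss(X_high, X_low, clusters, heads, ref_idx, global_weight=1.0):
    """Local term: one (hypothetical) linear classifier head per clustering,
    predicting cluster labels from the embedding. Global term: correlation loss."""
    local = np.mean([softmax_cross_entropy(X_low @ W, labels)
                     for W, labels in zip(heads, clusters)])
    return float(local + global_weight * correlation_loss(X_high, X_low, ref_idx))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Z = X[:, :2]                                             # stand-in 2D embedding
clusters = [rng.integers(0, k, size=50) for k in (4, 8)] # stand-in cluster labels
heads = [np.zeros((2, k)) for k in (4, 8)]               # untrained heads: uniform predictions
ref_idx = rng.choice(50, size=10, replace=False)
loss = multi_task_loss(X, Z, clusters, heads, ref_idx)
local_only = multi_task_loss(X, Z, clusters, heads, ref_idx, global_weight=0.0)
```

Using several clusterings at different granularities, as in the KMeans loop above, gives the local term signals at multiple scales, while the correlation term ties the overall layout to the original distances.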
