Skip to main content

Tensorized (and parallelizable) pytorch implementation of the algorithm for intrinsic dimension estimation : MLE and TWO-NN

Project description

Tensorized (and parallelizable) pytorch implementation of the algorithm for intrinsic dimension estimation :

1. Maximum Likelihood Estimation appoach

Calculates intrinsic dimension of the provided data points with the Maximum Likelihood Estimation appoach.

References:

  • [1] Elizaveta Levina and Peter J Bickel. Maximum likelihood estimation of intrinsic dimension. \

In Advances in neural information processing systems, pp. 777–784, 2005. \

https://www.stat.berkeley.edu/~bickel/mldim.pdf \

  • [2] David J.C. MacKay and Zoubin Ghahramani. Comments on ‘maximum likelihood estimation of intrinsic dimension’

by e. levina and p. bickel (2004), 2005. http://www.inference.org.uk/mackay/dimension/

  • [3] THE INTRINSIC DIMENSION OF IMAGES AND ITS IMPACT ON LEARNING, Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, Tom Goldstein \

https://openreview.net/pdf?id=XJk19XzGq2J

One of the main approaches to intrinsic dimension estimation is to examine a neighborhood around each point in the dataset, and compute the Euclidean distance to the $k^{th}$ nearest neighbor. Assuming that density is constant within small neighborhoods, the Maximum Likelihood Estimation (MLE) of [1] uses a Poisson process to model the number of points found by random sampling within a given radius around each sample point. By relating the rate of this process to the surface area of the sphere, the likelihood equations yield an estimate of the ID at a given point $x$ as:

$$\hat{m}k(x) = \bigg[ \frac{1}{k-1} \sum{j=1}^{k-1} log \frac{T_k(x)}{T_j(x)} \bigg]^{-1}$$

where $T_j(x)$ is the Euclidean ($l_2$) distance from $x$ to its $j^{th}$ nearest neighbor. [1] propose to average the local estimates at each point to obtain a global estimate ($n$ is the number of sample)

$$\hat{m}k = \frac{1}{n} \sum{i=1}^{n} \hat{m}_k(x_i)$$

[2] suggestion a correction based on averaging of inverses

$$\hat{m}k = \bigg[ \frac{1}{n} \sum{i=1}^{n} \hat{m}k(x_i)^{-1} \bigg]^{-1} = \bigg[ \frac{1}{n(k-1)} \sum{i=1}^{n} \sum_{j=1}^{k-1} log \frac{T_k(x_i)}{T_j(x_i)} \bigg]^{-1}$$

2. TWO-NN

Calculates intrinsic dimension of the provided data points with the TWO-NN algorithm.

References:

  • E. Facco, M. d’Errico, A. Rodriguez & A. Laio \

Estimating the intrinsic dimension of datasets by a minimal neighborhood information \

https://doi.org/10.1038/s41598-017-11873-y

2-NN algorithm :

  1. Compute the pairwise distances for each point in the dataset $i = 1, …, N$.

  2. For each point $i$ find the two shortest distances $r_1$ and $r_2$.

  3. For each point $i$ compute $µ_i = \frac{r_1}{r_2 }$

  4. Compute the empirical cumulate $F^{emp}(μ)$ by sorting the values of $μ$ in an ascending order through a permutation $σ$, then define $F^{emp}(μ_{σ(i)}) = \frac{1}{N}$

  5. Fit the points of the plane given by coordinates ${(log(μ_i), −log(1−F^{emp}(μ_i))) \ | \ i=1,..., N}$ with a straight line passing through the origin.

3. Installation

pip install intrinsics_dimension

4. Get started

from intrinsics_dimension mle_id, twonn_numpy, twonn_pytorch



n, dim = 512, 1024

data = torch.randn(n, dim)

d1 = mle_id(data, k=2, averaging_of_inverses = False)

d2 = mle_id(data, k=2, averaging_of_inverses = True)

d3 = twonn_numpy(data.numpy(), return_xy=False)

d4 = twonn_pytorch(data, return_xy=False)

print(d1, d2, d3, d4)



data[1] = data[0] # make distance(data[1], data[0]) = 0

d1 = mle_id(data, k=2, averaging_of_inverses = False)

d2 = mle_id(data, k=2, averaging_of_inverses = True)

d3 = twonn_numpy(data.numpy(), return_xy=False)

d4 = twonn_pytorch(data, return_xy=False)

print(d1, d2, d3, d4)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

intrinsics_dimension-0.1.0.tar.gz (6.4 kB view details)

Uploaded Source

Built Distribution

intrinsics_dimension-0.1.0-py3-none-any.whl (7.5 kB view details)

Uploaded Python 3

File details

Details for the file intrinsics_dimension-0.1.0.tar.gz.

File metadata

  • Download URL: intrinsics_dimension-0.1.0.tar.gz
  • Upload date:
  • Size: 6.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.7

File hashes

Hashes for intrinsics_dimension-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e0aa07b6a54f0909fbe8aeb8b05db5fec297bde356d3bc9547383f746afe9774
MD5 b2c467bad4ed6b069f9f3710a803e378
BLAKE2b-256 95ebfbabf5552392a4767b6319985ee1482c280cf364abcf9cf1f91647ad8622

See more details on using hashes here.

File details

Details for the file intrinsics_dimension-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for intrinsics_dimension-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 bf760e14d32b249c89c03183ec5a88918bde33f0514a489b7580babb5a65edcc
MD5 3bd8e942a04d3a5a309f99ec18c139d7
BLAKE2b-256 93aeb3d0b40bfbfec0ca239613514c9ef3fb4f61e4b4eff57d42fb686a0242ce

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page