Estimate the optimal number of components in PCA-based dimension reduction.

Project description

Estimate the optimal number of components in a PCA, using the SHEM procedure: Split-Half Eigenvector Matching.

The get_n_components function estimates the true (or "generating") number of principal components. Scree, elbow, and knee criteria applied to the eigenvalue curve are common, but they are known to be fallible heuristics. The rationale of this alternative procedure is that true principal components should be found in random split halves of the data. The estimate is therefore based on the similarity of eigenvectors between split halves, not on the shape of the eigenvalue curve: a separation is made between components with high versus low split-half similarity.

Detail: For an nd-array X, with shape == (nObservations, nVariables), a number of random splits of the data into two halves are performed. For each half separately, a PCA is performed via eigendecomposition of the covariance matrix of that half. Each of the first half's eigenvectors is matched to the most similar of the second half's eigenvectors. Similarity is measured via the dot product. For each split, the vector of similarities is sorted from high to low, and the sorted vectors are averaged over all random splits. Finally, the optimal separation between the high and low similarities is determined by a basic between-within variance criterion. An estimate of zero components is possible.
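
The steps above can be sketched in NumPy as follows. This is an illustrative reconstruction from the description, not the package's own code: the function names, the n_splits and seed parameters, the use of the absolute dot product (an eigenvector's sign is arbitrary), and the exact form of the between-within criterion are all assumptions. The sketch also only considers separations that leave at least one component on each side, so it never returns zero components.

```python
import numpy as np

def sorted_eigenvectors(X):
    """Eigenvectors (as columns) of the covariance matrix of X, largest eigenvalue first."""
    cov = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvectors[:, order]

def mean_sorted_similarities(X, n_splits=20, seed=0):
    """Average, over random split halves, of the sorted best-match similarities."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = X.shape
    total = np.zeros(n_vars)
    for _ in range(n_splits):
        perm = rng.permutation(n_obs)
        V1 = sorted_eigenvectors(X[perm[: n_obs // 2]])
        V2 = sorted_eigenvectors(X[perm[n_obs // 2 :]])
        # Assumption: absolute dot product, because eigenvector sign is arbitrary.
        similarity = np.abs(V1.T @ V2)
        # Each first-half eigenvector keeps its most similar second-half partner.
        best_match = similarity.max(axis=1)
        total += np.sort(best_match)[::-1]
    return total / n_splits

def best_separation(similarities):
    """Number of 'high' similarities under an assumed between/within ratio criterion."""
    n = len(similarities)
    grand_mean = similarities.mean()
    best_k, best_ratio = 1, -np.inf
    for k in range(1, n):
        high, low = similarities[:k], similarities[k:]
        between = (k * (high.mean() - grand_mean) ** 2
                   + (n - k) * (low.mean() - grand_mean) ** 2)
        within = ((high - high.mean()) ** 2).sum() + ((low - low.mean()) ** 2).sum()
        ratio = between / (within + 1e-12)  # small constant avoids division by zero
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k
```

Calling best_separation(mean_sorted_similarities(X)) would then give this sketch's estimate of the number of components; the package's own result may differ.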

Usage:

O = teg_get_best_n.get_n_components(X)

This returns a dictionary with the estimated number of components in O['nComponents'], as well as the eigenvalues (O['eigenvalues']) and eigenvectors (O['eigenvectors']).
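
As one possible follow-up, the returned dictionary could be used to project the data onto the retained components. The sketch below does not call the package; it builds a stand-in dictionary with the same three keys. It assumes the eigenvectors are stored as columns, ordered from largest to smallest eigenvalue, which should be checked against the actual output before relying on it. The helper name project_onto_components is hypothetical.

```python
import numpy as np

def project_onto_components(X, result):
    """Project centered data onto the first result['nComponents'] eigenvectors.

    Assumption: result['eigenvectors'] holds eigenvectors as columns,
    sorted by descending eigenvalue.
    """
    k = int(result["nComponents"])
    V = np.asarray(result["eigenvectors"])
    X_centered = X - X.mean(axis=0)
    return X_centered @ V[:, :k]

# Stand-in for the dictionary described above, built directly with NumPy.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(w)[::-1]
O = {"nComponents": 2, "eigenvalues": w[order], "eigenvectors": V[:, order]}

scores = project_onto_components(X, O)
print(scores.shape)  # (200, 2)
```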

Example.py contains tests on simulated data that check how well the true number of latent variables used to generate the data is recovered.
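
Example.py itself is not reproduced here. As a rough illustration of the idea, simulated data with a known number of latent variables could be generated along these lines; the function name simulate_data, its parameters, and the choice of orthonormal loadings with distinct latent scales are assumptions made for this sketch.

```python
import numpy as np

def simulate_data(n_obs=1000, n_vars=20, n_latent=3, noise_sd=0.5, seed=0):
    """Return X with shape (n_obs, n_vars) driven by n_latent latent variables plus noise."""
    rng = np.random.default_rng(seed)
    # Orthonormal loading vectors (columns of Q), one per latent variable.
    Q, _ = np.linalg.qr(rng.standard_normal((n_vars, n_latent)))
    # Distinct standard deviations so the latent components are separable.
    scales = np.arange(n_latent, 0, -1) + 1.0
    latent = rng.standard_normal((n_obs, n_latent)) * scales
    noise = noise_sd * rng.standard_normal((n_obs, n_vars))
    return latent @ Q.T + noise

X = simulate_data()
# X could then be passed to get_n_components and the estimate compared with n_latent.
```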

Download files

Source Distribution

teg_get_best_n-0.0.3.tar.gz (3.7 kB view details)

Uploaded Source

Built Distribution

teg_get_best_n-0.0.3-py3-none-any.whl (4.0 kB view details)

Uploaded Python 3
