GPU-accelerated Incremental PCA using PyTorch, with sklearn-compatible API

Incremental PCA for PyTorch

Supports Python 3.8+. Released under the MIT License.

This package implements incremental principal component analysis (PCA) in PyTorch behind a scikit-learn compatible API. Because the model is fit one batch at a time, PCA can be performed on datasets that are too large to fit in memory.

Features

  • GPU Acceleration: Perform PCA on GPUs for significant speedups on large datasets
  • Memory Efficient: Process data in batches to handle datasets larger than available RAM/VRAM ("out of core")
  • sklearn Compatible: Drop-in replacement with familiar fit, transform, fit_transform API
  • Streaming Support: Use partial_fit for online learning from data streams
  • Lazy arrays / numpy memmap Support: Efficiently process arrays on disk and memory-mapped files

Installation

pip install incremental-pca-torch

From Source

git clone https://github.com/RichieHakim/incremental_pca.git
cd incremental_pca
pip install -e ".[dev]"

Quick Start

import numpy as np
from incremental_pca_torch import IncrementalPCA

# Create some data
X = np.random.randn(10000, 500).astype(np.float32)

# Fit incrementally using GPU
ipca = IncrementalPCA(
    n_components=50, 
    batch_size=256, 
    device='cuda'  # Use 'cpu' if no GPU available
)
ipca.fit(X)

# Transform new data
X_transformed = ipca.transform(X)
print(f"Reduced shape: {X_transformed.shape}")  # (10000, 50)

# Reconstruct data
X_reconstructed = ipca.inverse_transform(X_transformed)

Streaming Data with partial_fit

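import numpy as np
from incremental_pca_torch import IncrementalPCA

def data_generator(n_chunks=20, chunk_size=512, n_features=500):
    # Placeholder stream: replace with your own chunk source
    for _ in range(n_chunks):
        yield np.random.randn(chunk_size, n_features).astype(np.float32)

# For streaming or very large datasets
ipca = IncrementalPCA(n_components=50, device='cuda')

# Process data in chunks
for chunk in data_generator():
    ipca.partial_fit(chunk)

# Use the fitted model
new_data = np.random.randn(100, 500).astype(np.float32)
X_transformed = ipca.transform(new_data)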

Using with Memory-Mapped Arrays

import numpy as np
from incremental_pca_torch import IncrementalPCA

# Memory-mapped files work seamlessly
X_mmap = np.load('large_data.npy', mmap_mode='r')  # an existing .npy file on disk

ipca = IncrementalPCA(n_components=50, batch_size=256, device='cuda')
ipca.fit(X_mmap)  # Loads only one batch at a time
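
To see why a memory-mapped array keeps memory use bounded, the following sketch writes a small .npy file and walks over it in batches with plain NumPy. It does not call this package: the file location, the shapes, and the iter_batches helper are illustrative assumptions, and the package's internal batching may differ.

```python
import os
import tempfile

import numpy as np


def iter_batches(X, batch_size):
    """Yield consecutive row slices of X (hypothetical helper, not the package's API)."""
    for start in range(0, X.shape[0], batch_size):
        # Slicing a memmap touches only these rows on disk; np.asarray copies
        # them into an ordinary in-memory array.
        yield np.asarray(X[start:start + batch_size])


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "large_data.npy")

# Write a small stand-in file; a real dataset would be far larger.
writer = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(1000, 20))
writer[:] = np.random.default_rng(0).standard_normal((1000, 20)).astype(np.float32)
writer.flush()
del writer

# Re-open read-only as a memory map and iterate without loading the whole file.
X_mmap = np.load(path, mmap_mode="r")
batch_shapes = [batch.shape for batch in iter_batches(X_mmap, batch_size=256)]
print(batch_shapes)  # [(256, 20), (256, 20), (256, 20), (232, 20)]
```

Only one batch is materialized at a time, so peak memory depends on batch_size rather than on the size of the file.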

API Reference

IncrementalPCA

IncrementalPCA(
    n_components=None,     # Number of components (default: min(n_samples, n_features))
    whiten=False,          # Scale components to unit variance
    batch_size=128,        # Samples per batch for fit/transform
    device='cpu',          # 'cpu', 'cuda', 'cuda:0', 'mps', etc.
    dtype=torch.float32,   # torch.float32 or torch.float64
    whiten_eps=1e-7,       # Numerical stability for whitening
    verbose=False,         # Show progress bars
)

Methods

Method                  Description
fit(X)                  Fit the model to data X in batches
partial_fit(X)          Incrementally update model with a single batch
transform(X)            Project data onto principal components
inverse_transform(X)    Reconstruct data from components
fit_transform(X)        Fit and transform in one call
Attributes (after fitting)

Attribute                    Description
components_                  Principal axes, shape (n_components, n_features)
mean_                        Per-feature mean, shape (n_features,)
explained_variance_          Variance per component
explained_variance_ratio_    Fraction of total variance per component
n_samples_seen_              Total samples processed
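
A common use of explained_variance_ratio_ is deciding how many components to keep. The sketch below computes the same quantities by hand with NumPy from a full-batch SVD; the synthetic data, the 90% target, and the variable names are illustrative assumptions rather than anything prescribed by this package.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose feature scales decay, so a few components dominate.
scales = np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])
X = rng.standard_normal((500, 6)) * scales

Xc = X - X.mean(axis=0)
_, S, _ = np.linalg.svd(Xc, full_matrices=False)

# Variance along each principal axis, using the n - 1 (sample) convention.
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Smallest k whose cumulative ratio reaches the target.
target = 0.90
cumulative = np.cumsum(explained_variance_ratio)
k = int(np.searchsorted(cumulative, target) + 1)
print(k, cumulative.round(3))
```

The same cumulative-sum selection can be applied directly to the explained_variance_ratio_ attribute of a fitted model.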

Benchmarks

These benchmarks compare this package against sklearn.decomposition.IncrementalPCA, with both running on CPU.

Configuration: 10,000 samples × 500 features → 50 components

Fit Performance

Batch Size    Torch (s)    sklearn (s)    Speedup
64            0.708        0.663          0.94x
128           0.581        0.579          1.00x
256           0.670        0.612          0.91x
512           0.699        0.633          0.91x
1024          0.585        0.548          0.94x
2048          0.535        0.480          0.90x
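
The Speedup column is consistent with the sklearn time divided by the Torch time, so values below 1.00x mean the Torch implementation was slower. This reading is inferred from the numbers rather than stated in the source; the short check below recomputes the column from the fit timings above.

```python
# (batch size, Torch seconds, sklearn seconds) copied from the fit table above.
fit_timings = [
    (64, 0.708, 0.663),
    (128, 0.581, 0.579),
    (256, 0.670, 0.612),
    (512, 0.699, 0.633),
    (1024, 0.585, 0.548),
    (2048, 0.535, 0.480),
]

speedups = {}
for batch_size, torch_s, sklearn_s in fit_timings:
    # Speedup > 1 means Torch is faster; < 1 means sklearn is faster.
    speedups[batch_size] = sklearn_s / torch_s
    print(f"{batch_size:>5}  {speedups[batch_size]:.2f}x")
```

The transform ratios do not reproduce exactly from their rounded timings (for example, 0.028 / 0.008 = 3.50 versus the reported 3.64x), presumably because they were computed before the timings were rounded.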

Transform Performance

Batch Size    Torch (s)    sklearn (s)    Speedup
64            0.008        0.028          3.64x
512           0.011        0.028          2.47x
1024          0.007        0.028          3.72x
2048          0.013        0.028          2.08x

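Note: On CPU, fit time is comparable to sklearn (0.90x to 1.00x in the runs above), while transform is roughly 2x to 3.7x faster. The main advantage of this package is GPU acceleration, which is where large speedups on big datasets are expected; the tables above cover CPU timings only.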

Algorithm

This implementation uses the incremental SVD algorithm from Ross et al. (2008), which:

  1. Updates running statistics using Welford's algorithm for numerically stable online mean and variance computation
  2. Constructs an augmented matrix combining previous components with new centered data
  3. Performs SVD on the augmented matrix to update components
  4. Applies deterministic sign flipping for reproducibility
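
Step 1 can be illustrated with a batched form of Welford's update (the pairwise merge often attributed to Chan et al.). This is a minimal NumPy sketch, not this package's code; the function name and argument layout are assumptions.

```python
import numpy as np


def update_mean_var(mean, m2, n_seen, batch):
    """Merge one batch into running per-feature statistics.

    mean: running mean, m2: running sum of squared deviations from the mean,
    n_seen: samples merged so far. Returns the updated (mean, m2, n_seen).
    """
    n_batch = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    batch_m2 = ((batch - batch_mean) ** 2).sum(axis=0)

    n_total = n_seen + n_batch
    delta = batch_mean - mean
    new_mean = mean + delta * (n_batch / n_total)
    # The cross term accounts for the shift between the two means.
    new_m2 = m2 + batch_m2 + delta**2 * (n_seen * n_batch / n_total)
    return new_mean, new_m2, n_total


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) * 3.0 + 100.0

mean, m2, n_seen = np.zeros(5), np.zeros(5), 0
for start in range(0, X.shape[0], 128):
    mean, m2, n_seen = update_mean_var(mean, m2, n_seen, X[start:start + 128])

variance = m2 / (n_seen - 1)  # sample variance
print(np.allclose(mean, X.mean(axis=0)), np.allclose(variance, X.var(axis=0, ddof=1)))
```

Tracking the sum of squared deviations instead of the sum of squares avoids the cancellation that occurs when the mean is large relative to the spread.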

The algorithm mirrors sklearn's IncrementalPCA implementation, and the test suite verifies that its results agree with sklearn's across batch sizes.
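
Steps 2 to 4 of the list above can be sketched in NumPy as follows. The sketch is patterned on the update used by scikit-learn's IncrementalPCA and is not this package's code; partial_fit_step and the state dictionary are hypothetical names.

```python
import numpy as np


def partial_fit_step(state, batch, n_components):
    """One incremental PCA update; state holds mean, components, singular values, n_seen."""
    n_batch = batch.shape[0]
    n_seen = state["n_seen"]
    n_total = n_seen + n_batch
    batch_mean = batch.mean(axis=0)
    new_mean = (n_seen * state["mean"] + n_batch * batch_mean) / n_total

    if n_seen == 0:
        A = batch - batch_mean
    else:
        # Row that corrects for the shift between the old mean and the batch mean.
        correction = np.sqrt(n_seen * n_batch / n_total) * (state["mean"] - batch_mean)
        # Augmented matrix: previous components scaled by their singular values,
        # the centered batch, and the mean-correction row.
        A = np.vstack([
            state["singular_values"][:, None] * state["components"],
            batch - batch_mean,
            correction[None, :],
        ])

    _, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Deterministic signs: make the largest-magnitude entry of each axis positive.
    idx = np.argmax(np.abs(Vt), axis=1)
    signs = np.sign(Vt[np.arange(Vt.shape[0]), idx])
    Vt = Vt * signs[:, None]

    return {
        "mean": new_mean,
        "components": Vt[:n_components],
        "singular_values": S[:n_components],
        "n_seen": n_total,
    }


rng = np.random.default_rng(0)
X = rng.standard_normal((600, 8))
state = {"mean": np.zeros(8), "components": None, "singular_values": None, "n_seen": 0}
for start in range(0, 600, 100):
    state = partial_fit_step(state, X[start:start + 100], n_components=8)

# With all 8 components kept nothing is truncated, so the singular values
# should agree with a full-batch SVD of the centered data.
S_full = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.allclose(state["singular_values"], S_full))
```

When n_components is smaller than the number of features, each step discards the trailing directions, so the incremental result becomes an approximation of the full-batch PCA.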

Testing

Run the test suite:

pytest tests/ -v

The test suite includes:

  • Comparison against sklearn PCA (full-batch mode)
  • Comparison against sklearn IncrementalPCA (various batch sizes)
  • Batch size sensitivity tests
  • Whitening tests
  • Numerical stability tests
  • Edge case handling
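
Because principal axes are only defined up to sign, comparisons between two PCA implementations are usually made sign-invariant. The following is a hypothetical helper in the spirit of the checks listed above; it is not taken from this package's test suite.

```python
import numpy as np


def assert_components_close(a, b, atol=1e-6):
    """Assert two (n_components, n_features) matrices match up to a per-row sign flip."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    assert a.shape == b.shape, f"shape mismatch: {a.shape} vs {b.shape}"
    # The sign of each row-wise dot product shows whether that row is flipped.
    row_signs = np.sign(np.sum(a * b, axis=1, keepdims=True))
    row_signs[row_signs == 0] = 1.0
    np.testing.assert_allclose(a * row_signs, b, atol=atol, rtol=0)


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
reference = Q[:3]
flipped = reference * np.array([[1.0], [-1.0], [1.0]])  # second axis has opposite sign

assert_components_close(flipped, reference)  # passes despite the sign flip
```

A helper like this can be called from a pytest test function after fitting both implementations on the same data.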

License

MIT License - see LICENSE for details.

References

  • Ross, D. A., Lim, J., Lin, R. S., & Yang, M. H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1), 125-141.
  • scikit-learn IncrementalPCA documentation
