GPU-accelerated Incremental PCA using PyTorch, with sklearn-compatible API
Incremental PCA for PyTorch
Incremental Principal Component Analysis (PCA) using PyTorch. This package provides a scikit-learn compatible API that fits PCA one batch at a time, so PCA can be performed on datasets that are too large to fit in memory.
Features
- GPU Acceleration: Perform PCA on GPUs for significant speedups on large datasets
- Memory Efficient: Process data in batches to handle datasets larger than available RAM/VRAM ("out of core")
- sklearn Compatible: Drop-in replacement with familiar fit, transform, fit_transform API
- Streaming Support: Use partial_fit for online learning from data streams
- Lazy arrays / numpy memmap Support: Efficiently process arrays on disk and memory-mapped files
Installation
pip install incremental-pca-torch
From Source
git clone https://github.com/RichieHakim/incremental_pca.git
cd incremental_pca
pip install -e ".[dev]"
Quick Start
import numpy as np
from incremental_pca_torch import IncrementalPCA
# Create some data
X = np.random.randn(10000, 500).astype(np.float32)
# Fit incrementally using GPU
ipca = IncrementalPCA(
    n_components=50,
    batch_size=256,
    device='cuda'  # Use 'cpu' if no GPU available
)
ipca.fit(X)
# Transform new data
X_transformed = ipca.transform(X)
print(f"Reduced shape: {X_transformed.shape}") # (10000, 50)
# Reconstruct data
X_reconstructed = ipca.inverse_transform(X_transformed)
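For intuition, the transform and inverse_transform calls above correspond to a projection onto the principal axes and a projection back. The following is a NumPy-only sketch that uses a full (non-incremental) SVD and no whitening; the helper names pca_fit, pca_transform, and pca_inverse_transform are illustrative and are not part of this package.

```python
import numpy as np

def pca_fit(X, n_components):
    """Full-batch PCA via SVD; returns (mean, components)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    # Center the data, then project onto the principal axes.
    return (X - mean) @ components.T

def pca_inverse_transform(Z, mean, components):
    # Map back to feature space and restore the mean.
    return Z @ components + mean

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
mean, components = pca_fit(X, n_components=5)
Z = pca_transform(X, mean, components)
X_hat = pca_inverse_transform(Z, mean, components)
print(Z.shape, X_hat.shape)  # (1000, 5) (1000, 20)
```

With fewer components than features the reconstruction is lossy; keeping all components recovers the input up to floating-point error.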
Streaming Data with partial_fit
# For streaming or very large datasets
ipca = IncrementalPCA(n_components=50, device='cuda')
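# Process data in chunks (data_generator is a placeholder for your own chunk iterator)
for chunk in data_generator():
    ipca.partial_fit(chunk)
# Use the fitted model (new_data is a placeholder array with the same number of features)
X_transformed = ipca.transform(new_data)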
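The loop above calls data_generator(), which the README does not define. One hypothetical way to write it for an in-memory or memory-mapped array is sketched below. The min_rows parameter is an assumption: scikit-learn's IncrementalPCA rejects batches with fewer rows than n_components, and this sketch assumes the same constraint applies here, so a too-short trailing chunk is folded into the previous one.

```python
import numpy as np

def data_generator(X, chunk_size=256, min_rows=50):
    """Yield consecutive row chunks of X (hypothetical helper).

    A trailing chunk with fewer than min_rows rows is merged into the
    previous chunk instead of being yielded on its own.
    """
    n = X.shape[0]
    starts = list(range(0, n, chunk_size))
    if len(starts) > 1 and n - starts[-1] < min_rows:
        starts.pop()  # the previous chunk will extend to the end
    for i, start in enumerate(starts):
        stop = starts[i + 1] if i + 1 < len(starts) else n
        yield X[start:stop]

X = np.zeros((1030, 3), dtype=np.float32)
print([chunk.shape[0] for chunk in data_generator(X)])  # [256, 256, 256, 262]
```

Set min_rows to at least the n_components value passed to IncrementalPCA.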
Using with Memory-Mapped Arrays
import numpy as np
from incremental_pca_torch import IncrementalPCA
# Memory-mapped files work seamlessly
X_mmap = np.load('large_data.npy', mmap_mode='r')
ipca = IncrementalPCA(n_components=50, batch_size=256, device='cuda')
ipca.fit(X_mmap) # Loads only one batch at a time
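To try the memory-mapped workflow without an existing file, a .npy file can be written chunk by chunk so that the full array never sits in RAM. This is a sketch using NumPy's open_memmap; the helper name write_npy_in_chunks, the temporary path, and the random contents are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

def write_npy_in_chunks(path, n_samples, n_features, chunk_size=1000, seed=0):
    """Create a .npy file on disk, filling it one chunk at a time."""
    rng = np.random.default_rng(seed)
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(n_samples, n_features)
    )
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        out[start:stop] = rng.standard_normal(
            (stop - start, n_features), dtype=np.float32
        )
    out.flush()
    del out  # release the writable mapping

path = os.path.join(tempfile.mkdtemp(), "large_data.npy")
write_npy_in_chunks(path, n_samples=5000, n_features=32)

# Reopen read-only; rows are read from disk only when sliced.
X_mmap = np.load(path, mmap_mode="r")
print(type(X_mmap).__name__, X_mmap.shape)  # memmap (5000, 32)
```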
API Reference
IncrementalPCA
IncrementalPCA(
    n_components=None,    # Number of components (default: min(n_samples, n_features))
    whiten=False,         # Scale components to unit variance
    batch_size=128,       # Samples per batch for fit/transform
    device='cpu',         # 'cpu', 'cuda', 'cuda:0', 'mps', etc.
    dtype=torch.float32,  # torch.float32 or torch.float64
    whiten_eps=1e-7,      # Numerical stability for whitening
    verbose=False,        # Show progress bars
)
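The whiten option rescales each projected component to unit variance. In scikit-learn this means dividing the projection by the square root of the explained variance. How whiten_eps enters the computation in this package is not documented here, so adding it under the square root in the NumPy sketch below is an assumption, and whiten_projection is an illustrative name.

```python
import numpy as np

def whiten_projection(X, mean, components, explained_variance, eps=1e-7):
    """Project X and scale each component to (approximately) unit variance."""
    Z = (X - mean) @ components.T
    # eps guards against division by zero for near-constant components
    # (assumed role of whiten_eps).
    return Z / np.sqrt(explained_variance + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10)) * np.linspace(1.0, 5.0, 10)
mean = X.mean(axis=0)
_, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
explained_variance = S**2 / (X.shape[0] - 1)

Z = whiten_projection(X, mean, Vt[:4], explained_variance[:4])
print(Z.var(axis=0, ddof=1))  # each entry is close to 1.0
```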
Methods
| Method | Description |
|---|---|
| fit(X) | Fit the model to data X in batches |
| partial_fit(X) | Incrementally update model with a single batch |
| transform(X) | Project data onto principal components |
| inverse_transform(X) | Reconstruct data from components |
| fit_transform(X) | Fit and transform in one call |
Attributes (after fitting)
| Attribute | Description |
|---|---|
| components_ | Principal axes, shape (n_components, n_features) |
| mean_ | Per-feature mean, shape (n_features,) |
| explained_variance_ | Variance per component |
| explained_variance_ratio_ | Fraction of total variance per component |
| n_samples_seen_ | Total samples processed |
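A common use of explained_variance_ratio_ is choosing how many components to keep. The NumPy sketch below shows how the two variance attributes relate (variance per component, then each variance divided by the total) and picks the smallest number of components whose cumulative ratio reaches a target. The function names and the 0.95 target are illustrative assumptions, and the quantities are computed with a full SVD rather than by this package.

```python
import numpy as np

def variance_summary(X):
    """Return (explained_variance, explained_variance_ratio) for all components."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    return explained_variance, explained_variance / explained_variance.sum()

def components_for_variance(ratio, target=0.95):
    """Smallest k such that the first k ratios sum to at least target."""
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratio))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])
variance, ratio = variance_summary(X)
k = components_for_variance(ratio, target=0.95)
print(k, ratio[:k].sum())
```

With a fitted model, the same selection could be applied to its explained_variance_ratio_ array.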
Benchmarks
Benchmarks comparing against sklearn.decomposition.IncrementalPCA on CPU.
Configuration: 10,000 samples × 500 features → 50 components
Fit Performance
| Batch Size | Torch (s) | sklearn (s) | Speedup |
|---|---|---|---|
| 64 | 0.708 | 0.663 | 0.94x |
| 128 | 0.581 | 0.579 | 1.00x |
| 256 | 0.670 | 0.612 | 0.91x |
| 512 | 0.699 | 0.633 | 0.91x |
| 1024 | 0.585 | 0.548 | 0.94x |
| 2048 | 0.535 | 0.480 | 0.90x |
Transform Performance
| Batch Size | Torch (s) | sklearn (s) | Speedup |
|---|---|---|---|
| 64 | 0.008 | 0.028 | 3.64x |
| 512 | 0.011 | 0.028 | 2.47x |
| 1024 | 0.007 | 0.028 | 3.72x |
| 2048 | 0.013 | 0.028 | 2.08x |
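Note: On CPU, fit time is within about 10% of sklearn (0.90x to 1.00x in the fit table), while transform is roughly 2x to 3.7x faster. The main advantage of this package is GPU acceleration, which can provide significant speedups for large datasets; the tables above report CPU timings only.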
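The Speedup column appears to be the sklearn time divided by the Torch time (for example, 0.663 / 0.708 ≈ 0.94 for fit at batch size 64). The script used to produce these numbers is not shown, so the following is only a sketch of how similar timings could be collected; time_call and speedup are hypothetical helper names.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over several runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(torch_seconds, sklearn_seconds):
    """Values above 1.0 mean the Torch implementation was faster."""
    return sklearn_seconds / torch_seconds

print(f"{speedup(0.708, 0.663):.2f}x")  # 0.94x
```

When timing on a GPU, CUDA kernels run asynchronously, so a call such as torch.cuda.synchronize() is needed before reading the clock to get meaningful numbers.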
Algorithm
This implementation uses the incremental SVD algorithm from Ross et al. (2008), which:
- Updates running statistics using Welford's algorithm for numerically stable online mean and variance computation
- Constructs an augmented matrix combining previous components with new centered data
- Performs SVD on the augmented matrix to update components
- Applies deterministic sign flipping for reproducibility
The algorithm matches sklearn's IncrementalPCA implementation exactly (verified via comprehensive test suite).
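The update described above can be sketched in NumPy as follows. This mirrors the update used by scikit-learn's IncrementalPCA, which the package states it matches; it is not this package's PyTorch code. The names incremental_update and sign_flip and the dictionary-based state are illustrative assumptions. For brevity the sketch updates only the running mean (pooled-mean form) and omits the per-feature running variance.

```python
import numpy as np

def sign_flip(Vt):
    """Make the largest-magnitude entry of each component positive."""
    idx = np.argmax(np.abs(Vt), axis=1)
    signs = np.sign(Vt[np.arange(Vt.shape[0]), idx])
    signs[signs == 0] = 1.0
    return Vt * signs[:, None]

def incremental_update(state, X, n_components):
    """Fold one batch X into the PCA state (pass state=None for the first batch)."""
    X = np.asarray(X, dtype=np.float64)
    n_new = X.shape[0]
    batch_mean = X.mean(axis=0)
    if state is None:
        n_total, mean = n_new, batch_mean
        A = X - batch_mean
    else:
        n_old = state["n_samples_seen"]
        n_total = n_old + n_new
        mean = (n_old * state["mean"] + n_new * batch_mean) / n_total
        # Extra row accounting for the shift between old and new means.
        correction = np.sqrt(n_old * n_new / n_total) * (state["mean"] - batch_mean)
        # Augmented matrix: scaled previous components, centered batch, correction.
        A = np.vstack([
            state["singular_values"][:, None] * state["components"],
            X - batch_mean,
            correction[None, :],
        ])
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    Vt = sign_flip(Vt)
    return {
        "n_samples_seen": n_total,
        "mean": mean,
        "components": Vt[:n_components],
        "singular_values": S[:n_components],
        "explained_variance": S[:n_components] ** 2 / (n_total - 1),
    }

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 8)) @ rng.standard_normal((8, 8)) + 3.0
state = None
for start in range(0, 600, 100):
    state = incremental_update(state, X[start:start + 100], n_components=8)
print(state["n_samples_seen"])  # 600
```

When all components are kept, as in this demo, the incremental result agrees with a full-batch SVD up to floating-point error; with fewer components each update discards some variance, so the result is an approximation.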
Testing
Run the test suite:
pytest tests/ -v
The test suite includes:
- Comparison against sklearn PCA (full-batch mode)
- Comparison against sklearn IncrementalPCA (various batch sizes)
- Batch size sensitivity tests
- Whitening tests
- Numerical stability tests
- Edge case handling
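Comparisons between PCA implementations usually need to tolerate sign differences, because a principal axis and its negation describe the same direction. The helper below is a hypothetical sketch of such a check in NumPy; it is not taken from this package's test suite, and the tolerance is an assumed value.

```python
import numpy as np

def components_match(A, B, atol=1e-5):
    """True if each row of A equals the matching row of B up to sign."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    if A.shape != B.shape:
        return False
    # Align the sign of each row of B with the corresponding row of A.
    signs = np.sign(np.sum(A * B, axis=1, keepdims=True))
    signs[signs == 0] = 1.0
    return bool(np.allclose(A, signs * B, atol=atol))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
flipped = Q * np.array([1, -1, 1, -1, 1, 1])[:, None]
print(components_match(Q, flipped))  # True
```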
License
MIT License - see LICENSE for details.
References
- Ross, D. A., Lim, J., Lin, R. S., & Yang, M. H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1), 125-141.
- scikit-learn IncrementalPCA documentation