
Adaptive PCA with parallel scaling and dimensionality reduction

AdaptivePCA

AdaptivePCA is a Python library that optimizes Principal Component Analysis (PCA) for large datasets through an adaptive mechanism. It selects the more suitable scaling method for the data, choosing between StandardScaler and MinMaxScaler, and combines that scaling with PCA to preserve the most significant information while reducing dimensionality. An auto-stop feature halts component selection once the specified variance threshold is reached, keeping the search computationally efficient. This adaptive strategy identifies the optimal number of principal components with little manual tuning, making AdaptivePCA useful for improving machine learning model performance and streamlining data visualization tasks.

Features

  • Adaptive Scaling Selection: Dynamically selects between StandardScaler and MinMaxScaler to identify the more effective scaling method for the data, optimizing information retention during dimensionality reduction (see the sketch after this list).
  • Automatic Component Optimization: Automatically adjusts the number of principal components to achieve a specified variance threshold, preserving maximum data variance with minimal components.
  • Efficient Parallel Processing: Leverages parallel computation to accelerate scaling and component evaluation, enhancing performance on large datasets.
  • Early Stop for Efficiency: Stops further component evaluation once the specified variance threshold is reached, making the process computationally efficient.
  • Seamless Integration: Easily integrates into data science workflows, enhancing compatibility with machine learning pipelines and data visualization tasks.
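
To make the adaptive mechanism concrete, below is a minimal sketch of the idea using plain scikit-learn. It is not AdaptivePCA's internal implementation: the selection criterion (keeping the scaler that reaches the variance threshold with the fewest components) and the helper name sketch_adaptive_pca are assumptions for illustration only.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def sketch_adaptive_pca(X, variance_threshold=0.95, max_components=50):
    # Try both candidate scalers and keep the one that reaches the
    # variance threshold with the fewest principal components.
    best = None
    for scaler in (StandardScaler(), MinMaxScaler()):
        X_scaled = scaler.fit_transform(X)
        n_max = min(max_components, X_scaled.shape[0], X_scaled.shape[1])
        pca = PCA(n_components=n_max).fit(X_scaled)
        cum_var = np.cumsum(pca.explained_variance_ratio_)
        # Early stop: first component count that reaches the threshold
        hits = np.flatnonzero(cum_var >= variance_threshold)
        n_components = int(hits[0]) + 1 if hits.size else n_max
        if best is None or n_components < best[2]:
            best = (scaler, pca, n_components)
    scaler, pca, n_components = best
    return pca.transform(scaler.transform(X))[:, :n_components]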

Installation

Install from the PyPI repository:

pip install adaptivepca

Clone this repository and install the package using pip:

git clone https://github.com/nqmn/adaptivepca.git
cd adaptivepca
pip install .

Usage

import pandas as pd
from adaptivepca import AdaptivePCA

# Load your data (example)
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=['Label'])  # Features
y = data['Label']  # Target variable (Optional)

# Initialize and fit AdaptivePCA
# Make sure to use a cleaned dataset (e.g. remove or impute missing values)
adaptive_pca = AdaptivePCA(variance_threshold=0.95, max_components=50, scaler_test=True, verbose=1)
X_reduced = adaptive_pca.fit_transform(X)
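
The reduced features can then be fed to any downstream estimator. The classifier below is illustrative and not part of adaptivepca; it only shows one way to use X_reduced together with the optional target y.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Evaluate a classifier on the reduced feature matrix (assumes y holds class labels)
clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(clf, X_reduced, y, cv=5)
print(f"Mean CV accuracy on reduced features: {scores.mean():.3f}")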

Performance Comparison: AdaptivePCA vs. Traditional PCA

Speed

AdaptivePCA leverages parallel processing to evaluate scaling and PCA component selection concurrently. In our tests, AdaptivePCA achieved up to a 95% reduction in processing time compared to the traditional PCA method. This is especially useful when working with high-dimensional data, where traditional methods may take significantly longer due to sequential grid search.

Explained Variance

Both AdaptivePCA and traditional PCA achieve similar levels of explained variance, with AdaptivePCA dynamically selecting the number of components based on a defined variance threshold. Traditional PCA, on the other hand, requires manual parameter tuning, which can be time-consuming.
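
For reference, the manual workflow this replaces looks roughly like the snippet below: the scaler is fixed by hand and the component count is read off the cumulative explained variance. The 0.95 threshold mirrors the default above; everything else is illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Manual baseline: scale, fit a full PCA, then pick the smallest number of
# components whose cumulative explained variance reaches 0.95.
X_scaled = StandardScaler().fit_transform(X)
full_pca = PCA().fit(X_scaled)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.95)) + 1
X_manual = PCA(n_components=n_components).fit_transform(X_scaled)
print(f"Manual PCA kept {n_components} components "
      f"({cum_var[n_components - 1]:.3f} cumulative variance)")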

Effect Size

Using Cohen's d and statistical tests, we observed significant effect sizes in processing time, favoring AdaptivePCA. In practical terms, this means that AdaptivePCA provides substantial improvements in performance while maintaining equivalent or higher levels of accuracy in explained variance coverage.

Parameters

  • variance_threshold: float, default=0.95
    The cumulative variance explained threshold to determine the optimal number of components.

  • max_components: int, default=10
    The maximum number of components to consider. Set it higher (e.g. 50, as in the usage example above) for a more comprehensive evaluation.

  • scaler_test: bool, default=True
    Controls whether candidate scaling methods are tested, which reduces runtime when scaler comparison isn't required. Added in version 1.0.3.

  • verbose: int, default=0
    Controls the level of output. verbose=1 displays the component-wise explained variance scores for each scaler; verbose=0 suppresses intermediate output and shows only the final best configuration found after processing all scalers. Useful for debugging or fine-tuning PCA settings. Added in version 1.0.6. (A usage example follows this list.)

Methods

  • fit(X): Fits the AdaptivePCA model to the data X.
  • transform(X): Transforms the data X using the fitted PCA model.
  • fit_transform(X): Fits and transforms the data in one step (see the sketch below).
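
A short sketch of these methods in a train/test workflow; the split uses scikit-learn and the parameter values are illustrative.

from sklearn.model_selection import train_test_split
from adaptivepca import AdaptivePCA

# Learn the projection on training data only, then reuse it on held-out data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
adaptive_pca = AdaptivePCA(variance_threshold=0.95, max_components=50)
adaptive_pca.fit(X_train)
X_train_reduced = adaptive_pca.transform(X_train)
X_test_reduced = adaptive_pca.transform(X_test)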

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request to discuss your changes.

Acknowledgments

This project makes use of the scikit-learn, numpy, and pandas libraries for data processing and machine learning.

Version Update Log

  • 1.0.3 - Added flexibility in scaling; fixed error handling when max_components exceeds the available number of features or samples.
  • 1.0.6 - Added the verbose parameter as a constructor argument, with a default value of 0.
