
Adaptive feature reduction system that intelligently determines the optimal preprocessing and dimensionality reduction approach.

Project description

ASPIRE (Adaptive Scaler and PCA with Intelligent REduction)

Previously known as AdaptivePCA

ASPIRE is an enhanced preprocessing and dimensionality reduction system that intelligently adapts to data characteristics through statistical analysis. The model combines adaptive scaling selection with optimized Principal Component Analysis (PCA) to provide an efficient and robust feature reduction solution with minimal computational costs.

ASPIRE represents a significant advancement in automated feature engineering, offering a robust solution for dimensionality reduction while maintaining data integrity and model performance.

Core Functionality

ASPIRE employs a two-stage adaptive approach:

  1. Intelligent Preprocessing

    • Comprehensive preprocessing: handles numeric feature selection, missing values, infinities, and NaNs
    • Performs feature-wise normality testing using the Shapiro-Wilk test
    • Automatically selects the optimal scaler based on the data distribution
  2. Dynamic Dimensionality Reduction

    • Determines the optimal number of PCA components while maintaining a specified variance threshold
    • Stops early to ensure computational efficiency
    • Adapts to dataset dimensions and characteristics
    • Provides comprehensive validation of the reduction effectiveness

Overall Design Pattern

Data → Preprocessing → Scaler Selection → PCA Optimization → Validation → Prediction

Key Advantages

  • Automation: Eliminates manual preprocessing decisions through data-driven selection
  • Adaptivity: Adjusts preprocessing and reduction strategies based on data characteristics
  • Efficiency: Optimizes computational resources while maintaining data integrity
  • Validation: Includes built-in performance comparison framework
  • Transparency: Provides detailed insights into selection decisions and performance metrics

Installation

Install from the PyPI repository:

pip install adaptivepca

Alternatively, clone this repository and install the package using pip:

git clone https://github.com/nqmn/adaptivepca.git
cd adaptivepca
pip install .

Usage

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from adaptivepca import AdaptivePCA

# Load your data (example)
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=['Label'])  # Features
y = data['Label']  # Target variable (Optional)

# Initialize and fit the model to determine the optimal scaler and PCA configuration
adaptive_pca = AdaptivePCA(variance_threshold=0.95, max_components=50, variance_ratio=0.5, normality_ratio=0.05, verbose=1)
adaptive_pca.fit(X)

# Optional - Validate with a classifier with full and reduced dataset performance
classifier = adaptive_pca.validate_with_classifier(X, y, classifier=DecisionTreeClassifier, test_size=0.2, cv=5)

# Optional - Run prediction with the classifier; outputs confusion matrix, classification report, inference time, FPR, FAR, specificity, AUC-ROC, and MCC
adaptive_pca.predict_with_classifier(X, y)

# Optional - View model configurations
adaptive_pca.view_config()

# Optional - Export the model in joblib format
adaptive_pca.export_model("your_model_name.joblib", classifier)

Key Components

Initialization Parameters

  • variance_threshold: Target explained variance (default: 0.95)
  • max_components: Maximum PCA components to consider (default: 50)
  • variance_ratio: Variance ratio threshold (default: 5.0)
  • normality_ratio: P-value threshold for Shapiro-Wilk test (default: 0.05)
  • verbose: Logging detail level (default: 0)

Methods

  • fit(X): Fits the AdaptivePCA model to the data X.
  • transform(X): Transforms the data X using the fitted PCA model.
  • fit_transform(X): Fits and transforms the data in one step.
  • validate_with_classifier(X, y, classifier=None, cv=5, test_size=0.2): Tests model performance.
  • predict_with_classifier(X, y): Makes predictions using trained classifier.
  • export_model(model_name, classifier): Saves model to file.
  • view_config(): Shows current configuration.

Main Algorithm Flow

1. Data Preprocessing

Input: DataFrame X
Output: Clean DataFrame

# Clean the data
Keep only numeric columns
For each column:
    Replace infinities with max/min values
    Fill missing values with column mean
Remove constant columns (variance = 0)
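
A minimal sketch of this cleaning step in pandas (illustrative only; clean_data is a hypothetical helper, not the package's internal code):

import numpy as np
import pandas as pd

def clean_data(X: pd.DataFrame) -> pd.DataFrame:
    # Keep only numeric columns
    X = X.select_dtypes(include=[np.number]).copy()
    for col in X.columns:
        # Replace infinities with the column's finite max/min
        finite = X.loc[np.isfinite(X[col]), col]
        X[col] = X[col].replace({np.inf: finite.max(), -np.inf: finite.min()})
        # Fill missing values with the column mean
        X[col] = X[col].fillna(X[col].mean())
    # Remove constant columns (variance = 0)
    return X.loc[:, X.var() > 0]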

2. Scaler Selection

Input: DataFrame X
Output: Best scaler for the data

For each column:
    Take sample of up to 5000 points
    Test if data is normal (Shapiro-Wilk test)
    Count if normal or not normal

If more normal columns:
    Return StandardScaler
Else:
    Return MinMaxScaler
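
The same logic as a short sketch, using scipy's shapiro and scikit-learn's scalers (select_scaler is a hypothetical helper; the package's actual implementation may differ):

from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def select_scaler(X, alpha=0.05, sample_size=5000):
    normal = not_normal = 0
    for col in X.columns:
        sample = X[col].dropna()
        if len(sample) > sample_size:
            sample = sample.sample(n=sample_size, random_state=0)
        # p-value above alpha: cannot reject normality
        _, p_value = shapiro(sample)
        if p_value > alpha:
            normal += 1
        else:
            not_normal += 1
    # Standardize mostly-normal data; otherwise scale to [0, 1]
    return StandardScaler() if normal > not_normal else MinMaxScaler()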

3. PCA Optimal Component Selection

Input: Scaled data X
Output: Optimal number of components

Set max_components = min(50, number of features)
Try components from 1 to max_components:
    Calculate explained variance
    If variance >= threshold (default 95%):
        Return current number of components
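
A sketch of this search with scikit-learn (find_optimal_components is a hypothetical helper). Because PCA components are nested, fitting once at the largest candidate and scanning the cumulative explained variance gives the same answer as growing the component count one step at a time:

import numpy as np
from sklearn.decomposition import PCA

def find_optimal_components(X_scaled, variance_threshold=0.95, max_components=50):
    k_max = min(max_components, X_scaled.shape[0], X_scaled.shape[1])
    pca = PCA(n_components=k_max).fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    for k, covered in enumerate(cumulative, start=1):
        # Early stop at the first k that meets the variance threshold
        if covered >= variance_threshold:
            return k
    return k_max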

4. Main Fitting Process

Input: DataFrame X
Output: Fitted model

Clean the data
Choose and fit best scaler
Scale the data
Find best number of components
Save configuration
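
Putting the three stages together, fitting might look like this (a sketch reusing the hypothetical helpers above, not the package's actual fit method):

from sklearn.decomposition import PCA

def fit(X, variance_threshold=0.95, max_components=50):
    X_clean = clean_data(X)                      # 1. clean the data
    scaler = select_scaler(X_clean)              # 2. choose the best scaler
    X_scaled = scaler.fit_transform(X_clean)
    k = find_optimal_components(X_scaled, variance_threshold, max_components)
    pca = PCA(n_components=k).fit(X_scaled)      # 3. fit PCA at the chosen k
    # Save the configuration needed to transform new data later
    return {"columns": list(X_clean.columns), "scaler": scaler, "pca": pca}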

5. Transform Data

Input: New DataFrame X
Output: Reduced data

Clean the data using saved settings
Scale data using saved scaler
Reduce dimensions using saved components
Return reduced data
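
Continuing the sketch, transforming new data simply replays the saved settings in order (this transform is hypothetical, paired with the fit sketch above):

def transform(model, X_new):
    X_clean = clean_data(X_new)
    # Align to the columns seen during fitting
    X_clean = X_clean.reindex(columns=model["columns"])
    X_scaled = model["scaler"].transform(X_clean)
    return model["pca"].transform(X_scaled)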

6. Validation

Input: Data X, Labels y
Output: Performance metrics

# Compare original vs reduced data
Train model on original data
Train model on reduced data
Compare:
    - Accuracy
    - Speed
    - Memory usage
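
A sketch of this comparison using cross-validation (validate is a hypothetical helper; memory measurement is omitted for brevity):

import time
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def validate(X_full, X_reduced, y, cv=5):
    clf = DecisionTreeClassifier(random_state=0)
    for name, data in [("full", X_full), ("reduced", X_reduced)]:
        start = time.perf_counter()
        scores = cross_val_score(clf, data, y, cv=cv)
        elapsed = time.perf_counter() - start
        print(f"{name}: accuracy={scores.mean():.4f}, time={elapsed:.2f}s")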

7. Prediction

Input: New data X
Output: Predictions

Clean the data
Apply scaling
Reduce dimensions
Make predictions
Return results and metrics
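
A sketch of the prediction step covering a subset of the listed metrics (predict_and_report is hypothetical and reuses the transform sketch above):

from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef

def predict_and_report(model, clf, X_new, y_true):
    # Clean, scale, and reduce with the saved settings, then predict
    y_pred = clf.predict(transform(model, X_new))
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred))
    print("MCC:", matthews_corrcoef(y_true, y_pred))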

8. Save Model

Input: Model name, Trained model
Output: Saved file

Collect:
    - Parameters
    - Scaler
    - Components
    - Column names
Save everything to file
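
Since the package exports to joblib format (see export_model in the Usage section), the saved bundle might be assembled roughly like this (a sketch; the actual file layout is the package's own):

import joblib

def export_model(path, model, classifier=None):
    # Bundle parameters, scaler, PCA components, and column names together
    joblib.dump(
        {
            "columns": model["columns"],
            "scaler": model["scaler"],
            "pca": model["pca"],
            "classifier": classifier,
        },
        path,
    )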

Key Features

Adaptivity Mechanisms

  1. Scaler Selection:

    • Based on feature-wise normality tests
    • Considers data distribution characteristics
    • Defaults sensibly for edge cases
  2. Component Selection:

    • Adapts to data dimensions
    • Respects variance threshold
    • Limits maximum components
  3. Validation:

    • Supports both cross-validation and train-test split
    • Compares performance with original data
    • Measures computational efficiency gains

Error Handling

  • Handles zero-variance features
  • Manages missing values through mean imputation
  • Validates presence of numeric columns
  • Ensures fit before transform

Complexity Analysis

  • Time Complexity: O(n * d^2) for PCA computation
  • Space Complexity: O(n * d) for data storage, where n = number of samples and d = number of features

Use Cases

ASPIRE is particularly valuable for:

  • Machine learning pipelines requiring automated preprocessing
  • High-dimensional data analysis
  • Feature engineering optimization
  • Model performance enhancement
  • Exploratory data analysis

Technical Foundation

The system integrates:

  • Statistical testing for data distribution analysis
  • Adaptive scaling techniques
  • Principal Component Analysis
  • Machine learning validation frameworks
  • Performance optimization methods

Performance Comparison: AdaptivePCA vs. Traditional PCA Optimization (GridSearch)

Speed

AdaptivePCA leverages parallel processing to evaluate scaling and PCA component selection concurrently. In our tests, AdaptivePCA achieved up to a 95% reduction in processing time compared to the traditional PCA method. This is especially useful when working with high-dimensional data, where traditional methods may take significantly longer due to sequential grid search.

Explained Variance

Both AdaptivePCA and traditional PCA achieve similar levels of explained variance, with AdaptivePCA dynamically selecting the number of components based on a defined variance threshold. Traditional PCA, on the other hand, requires manual parameter tuning, which can be time-consuming.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request to discuss your changes.

Acknowledgments

This project makes use of the scikit-learn, numpy, and pandas libraries for data processing and machine learning.

Version Update Log

  • 1.0.3 - Added flexibility in scaling; fixed error handling when max_components exceeds the available number of features or samples.
  • 1.0.6 - Added verbose parameter to the initializer, with a default value of 0.
  • 1.1.0 - Added classifier-based validation and prediction; cleaned up the code.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adaptivepca-1.1.0.tar.gz (13.7 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

adaptivepca-1.1.0-py3-none-any.whl (10.9 kB)


File details

Details for the file adaptivepca-1.1.0.tar.gz.

File metadata

  • Download URL: adaptivepca-1.1.0.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for adaptivepca-1.1.0.tar.gz:

  • SHA256: 7937658a25ffbb82a37422844adbbc02723a877fbbf367246335d95fcbbcc12c
  • MD5: 6582e59b2780a6bcdf13b57a29f2579c
  • BLAKE2b-256: d984f9d68f41c2a772d48dc5ea84036a3d538084d7151c99265c0c208aecea23


File details

Details for the file adaptivepca-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: adaptivepca-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for adaptivepca-1.1.0-py3-none-any.whl:

  • SHA256: 03a2da0f93eae8187ac4feb4ef93e497874226f5be417e34ea6f1a78abb8e8a7
  • MD5: 4ecae2ce9788a240b58b6f11c61732be
  • BLAKE2b-256: 9badb44dffe9328ec883754043d8eb44f024e3a61d8b71827d5995094c087189

