An advanced PCA implementation with adaptive feature scaling and preprocessing

These details have not been verified by PyPI

Project links

Homepage

Project description

ASPIRE (Adaptive Scaler and PCA with Intelligent REduction)

Previously known as AdaptivePCA

ASPIRE is an advanced implementation of Principal Component Analysis (PCA) that provides intelligent feature scaling, comprehensive preprocessing, and built-in validation capabilities. It automatically adapts to your data's characteristics to deliver optimal dimensionality reduction results.

Core Functionality

AdaptivePCA employs a comprehensive preprocessing and analysis approach:

Intelligent Preprocessing
- Comprehensive data cleaning and preprocessing
- Handles outliers using IQR method
- Manages infinity values and missing data
- Feature-wise normality testing using Shapiro-Wilk test
- Automatic selection between StandardScaler and MinMaxScaler
- Class imbalance detection and handling via SMOTE
Dynamic Dimensionality Reduction
- Determines optimal number of PCA components based on variance threshold
- Considers eigenvalue thresholds for component selection
- Adapts to dataset characteristics
- Built-in validation framework

The algorithm's key innovation lies in its adaptive nature, particularly in:

Automatic selection between StandardScaler and MinMaxScaler based on feature distributions
Dynamic component selection based on cumulative variance threshold
Integrated preprocessing pipeline with outlier handling and missing value imputation
Automatic class imbalance detection and correction
Comprehensive validation framework with efficiency metrics

This implementation provides an end-to-end solution for dimensionality reduction while handling common data challenges automatically.

Overall Design Pattern

Data â†’ Preprocessing â†’ Scaler Selection â†’ PCA Optimization â†’ Validation â†’ Prediction

Dependencies

numpy>=1.19.0
pandas>=1.2.0
scikit-learn>=0.24.0
lightgbm>=3.0.0
imbalanced-learn>=0.8.0
scipy>=1.6.0

Installation

Install dependencies:

pip install scikit-learn numpy pandas lightgbm scipy imbalanced-learn

Instal from Pypi repository:

pip install adaptivepca

Clone this repository and install the package using pip:

git clone https://github.com/nqmn/adaptivepca.git
cd adaptivepca
pip install .

Usage

Basic Usage

# Load your data
data = pd.read_csv("your_dataset.csv")
X = data.drop(['Label'])
y = data['Label']

# Initialize AdaptivePCA
adaptive_pca = AdaptivePCA()
X_preprocessed, y_preprocessed, smote_applied = adaptive_pca.preprocess_data(X, y)
adaptive_pca.fit(X_preprocessed, y_preprocessed, smote_applied)
adaptive_pca.validate_with_classifier(X_preprocessed, y_preprocessed)
adaptive_pca.predict_with_classifier(X_preprocessed, y_preprocessed)
adaptive_pca.export_model('your_model_name.joblib')

Advanced Usage

import pandas as pd
from adaptivepca import AdaptivePCA
from sklearn.tree import DecisionTreeClassifier

# Load your data
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=['Label'])  # Features
y = data['Label']  # Target variable

# Initialize AdaptivePCA
adaptive_pca = AdaptivePCA(
    variance_threshold=0.95,
    max_components=50,
    min_eigenvalue_threshold=1e-4,
    normality_ratio=0.05,
    verbose=1
)
# Run Preprocessing
X_preprocessed, y_preprocessed, smote_applied = adaptive_pca.preprocess_data(X, y)

# Fit AdaptivePCA
adaptive_pca.fit(X_preprocessed, y_preprocessed, smote_applied)


# Optional - Validate with a classifier with full and reduced dataset performance
adaptive_pca.validate_with_classifier(X, y, classifier=DecisionTreeClassifier(), test_size=0.2, cv=5)

# Optional - Run prediction with classifier, show output of confusion matrix, classification report,
#  inference time, fpr, far, specificity, auc-roc, mcc
adaptive_pca.predict_with_classifier(X, y)

# Optional - Export the model in joblib format
adaptive_pca.export_model("your_model_name.joblib")

Key Components

Initialization Parameters

variance_threshold: Minimum cumulative explained variance (default: 0.95)
max_components: Maximum PCA components to consider (default: 50)
min_eigenvalue_threshold: Minimum eigenvalue cutoff (default: 1e-4)
normality_ratio: P-value threshold for Shapiro-Wilk test (default: 0.05)
verbose: Logging detail level (default: 0)

Preprocessing Pipeline

Data Cleaning

Selection of numeric columns only
Handles outliers using IQR methods (clips values outside 1.5*IQR)
Replaces infinitiy values with finite extremes
Imputes missing values using mean strategy

Feature Scaling Selection

Perform Shapiro-Wilk test on each feature
Counts features better suited for StandardScaler and MinMaxScaler
Applies majority voting to select final scaler

Class Balance Handling

Perform chi-squared test for class imbalance
Applies SMOTE if significant imbalance detected (p<0.05)

PCA Optimization Algorithm

Find optimal components meeting variance threshold: max variance_threshold and min_eigenvalue_threshold

Validation Framework

Classification validation on full and reduced dataset
Performance metrics: Accuracy comparison, time efficiency, ROC-AUC score, detailed classification report

Key Mathematical Components

Feature normality testing

Shapiro-Wilk test for normality

Class imbalance detection

Chi-squared test for class balance

Methods

fit(X): Fits the AdaptivePCA model to the data X.
preprocess_data(X): Run preprocessing pipeline.
validate_with_classifier(X, y, classifier=None, cv=5, test_size=0.2): Tests model performance.
predict_with_classifier(X, y): Makes predictions using trained classifier.
export_model(model_name, classifier): Saves model to file.

Use Cases

ASPIRE is particularly valuable for:

Machine learning pipelines requiring automated preprocessing
High-dimensional data analysis
Feature engineering optimization
Model performance enhancement
Exploratory data analysis

Technical Foundation

The system integrates:

Statistical testing for data distribution analysis
Adaptive scaling techniques
Principal Component Analysis
Machine learning validation frameworks
Performance optimization methods

Performance Comparison: AdaptivePCA vs. Traditional PCA Optimization (GridSearch)

Speed

AdaptivePCA adaptively selects the optimal configuration based on data-driven rules, which is less computationally intense than the exhaustive search performed by grid search. In our tests, AdaptivePCA achieved up to a 90% reduction in processing time compared to the traditional PCA method. This is especially useful when working with high-dimensional data, where traditional methods may take significantly longer due to sequential grid search.

Explained Variance

Both AdaptivePCA and traditional PCA achieve similar levels of explained variance, with AdaptivePCA dynamically selecting the number of components based on a defined variance threshold. Traditional PCA, on the other hand, requires manual parameter tuning, which can be time-consuming.

Performance on Different Dataset (Full & Reduced Dataset)

Most datasets maintain high accuracy, with reduced datasets achieving similar scores to full datasets in nearly all cases. Additionally, the reduced datasets significantly decrease processing time, with time reductions ranging from 1.85% to 58.03%. This indicates that reduced datasets can offer substantial efficiency benefits, especially for larger datasets.

Dataset	Score (Acc)	Time (s)	Gain (%)
insdn_ddos_binary_01.ds (full)	1.000000	1.5492	-
insdn_ddos_binary_01.ds (reduced)	1.000000	0.6502	58.03
hldddosdn_hlddos_combine_binary.ds (full)	1.000000	30.3948	-
hldddosdn_hlddos_combine_binary.ds (reduced)	1.000000	14.4875	52.34
cicddos2019_tcpudp_combine_d1_binary_rus.ds (full)	1.000000	1.6453	-
cicddos2019_tcpudp_combine_d1_binary_rus.ds (reduced)	1.000000	0.7371	55.20
mendeley_ddos_sdn_binary_19.ds (full)	1.000000	0.9839	-
mendeley_ddos_sdn_binary_19.ds (reduced)	0.942738	0.9355	4.93
Wednesday-workingHours.pcap_ISCX.csv (full)	0.921126	39.7610	-
Wednesday-workingHours.pcap_ISCX.csv (reduced)	0.970010	28.8390	27.47
LR-HR DDoS 2024 Dataset for SDN-Based Networks.csv (full)	0.999982	0.7314	-
LR-HR DDoS 2024 Dataset for SDN-Based Networks.csv (reduced)	0.999982	0.5131	29.84
dataset_sdn.csv (full)	1.000000	1.0547	-
dataset_sdn.csv (reduced)	0.932359	1.0352	1.85

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request to discuss your changes.

Acknowledgments

This project makes use of the scikit-learn, numpy, and pandas libraries for data processing and machine learning.

Version Update Log

1.0.3 - Added flexibility in scaling, fix error handling when max_components exceeding the available number of features or samples.
1.0.6 - Added Parameter verbose as an argument to init, with a default value of 0.
1.1.0 - Added validation, prediction with classifier, clean up the code.
1.1.3 - Revamped the code. Refer to description above.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

1.1.3

Oct 31, 2024

1.1.0

Oct 29, 2024

1.0.6

Oct 27, 2024

1.0.3

Oct 27, 2024

1.0.2

Oct 26, 2024

1.0.1

Oct 26, 2024

1.0.0

Oct 26, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adaptivepca-1.1.3.tar.gz (14.7 kB view details)

Uploaded Oct 31, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

adaptivepca-1.1.3-py3-none-any.whl (11.8 kB view details)

Uploaded Oct 31, 2024 Python 3

File details

Details for the file adaptivepca-1.1.3.tar.gz.

File metadata

Download URL: adaptivepca-1.1.3.tar.gz
Upload date: Oct 31, 2024
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for adaptivepca-1.1.3.tar.gz
Algorithm	Hash digest
SHA256	`ae9a9cf9a0d172461a57fdce4937db721b8ffc704c37f9aaf311e08f29d31b2d`
MD5	`376f8cdeaca362e3386593ae9ec4a697`
BLAKE2b-256	`9e7e5be9f87760316af236adfb445ab1eb33c4c8a0a72641dc3cc38a31691f9c`

See more details on using hashes here.

File details

Details for the file adaptivepca-1.1.3-py3-none-any.whl.

File metadata

Download URL: adaptivepca-1.1.3-py3-none-any.whl
Upload date: Oct 31, 2024
Size: 11.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for adaptivepca-1.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0eaa8c26db49f7031843479aea18bb6d8ef38078c1ec0f19ed0394484e2f4160`
MD5	`1c5b0a7e023db0d3b895f97aeed8400d`
BLAKE2b-256	`9491aa8190c2476eb883a5e754d5486a6b90b9ba79bc123709887a762a6b05e8`

See more details on using hashes here.

adaptivepca 1.1.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ASPIRE (Adaptive Scaler and PCA with Intelligent REduction)

Core Functionality

Overall Design Pattern

Dependencies

Installation

Usage

Basic Usage

Advanced Usage

Key Components

Initialization Parameters

Preprocessing Pipeline

Data Cleaning

Feature Scaling Selection

Class Balance Handling

PCA Optimization Algorithm

Validation Framework

Key Mathematical Components

Feature normality testing

Class imbalance detection

Methods

Use Cases

Technical Foundation

Performance Comparison: AdaptivePCA vs. Traditional PCA Optimization (GridSearch)

Speed

Explained Variance

Performance on Different Dataset (Full & Reduced Dataset)

License

Contributing

Acknowledgments

Version Update Log

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes