A Python library for autoencoder-based dimensionality reduction with a scikit-learn compatible interface.
Project description
NCDR - Neural Component Dimensionality Reduction
A Python library for autoencoder-based dimensionality reduction with a scikit-learn compatible interface.
Features
- Scikit-learn Compatible: Implements
fit(),transform(),predict(), andfit_transform()methods - PyTorch Lightning Backend: Deep learning framework with automatic GPU support
- Configurable Architecture: Customizable encoder/decoder layers, activation functions, and training parameters
- Automatic Standardization: Optional feature scaling for improved training stability
- Validation Support: Built-in train/validation splits for monitoring training progress
- Multiple Activation Functions: Support for ReLU, Tanh, Sigmoid, LeakyReLU, ELU, and GELU
Installation
uv add pyncdr
# or
pip install pyncdr
Dependencies
- Python ≥ 3.12
- PyTorch Lightning ≥ 2.5.5
- scikit-learn ≥ 1.7.2
- torch ≥ 2.0.0
- numpy ≥ 1.21.0
Quick Start
from ncdr import NCDR
from sklearn.datasets import make_classification
import numpy as np
# Generate sample data
X, _ = make_classification(n_samples=1000, n_features=50, n_informative=30, random_state=42)
# Create and train autoencoder
ncdr = NCDR(
hidden_dims=[64, 32, 16], # Encoder layer sizes
latent_dim=8, # Bottleneck dimension
max_epochs=50, # Training epochs
random_state=42
)
# Fit and transform data
X_reduced = ncdr.fit_transform(X)
print(f"Original shape: {X.shape}, Reduced shape: {X_reduced.shape}")
# Reconstruct original data
X_reconstructed = ncdr.predict(X)
reconstruction_error = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {reconstruction_error:.4f}")
Advanced Usage
Custom Architecture
ncdr = NCDR(
hidden_dims=[128, 64, 32, 16], # Deep architecture
latent_dim=10, # 10D latent space
activation="tanh", # Tanh activation
dropout_rate=0.2, # 20% dropout for regularization
learning_rate=1e-3, # Learning rate
weight_decay=1e-4, # L2 regularization
batch_size=64, # Batch size
max_epochs=100, # Training epochs
validation_split=0.2, # 20% validation split
standardize=True, # Feature standardization
random_state=42, # Reproducibility
trainer_kwargs={"accelerator": "gpu", "devices": 1} # GPU training
)
Integration with Scikit-learn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('ncdr', NCDR(latent_dim=5, max_epochs=50, standardize=False))
])
# Split data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
# Fit pipeline and transform
X_train_reduced = pipeline.fit_transform(X_train)
X_test_reduced = pipeline.transform(X_test)
Dimensionality Reduction Workflow
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Reduce to 2D for visualization
ncdr = NCDR(latent_dim=2, max_epochs=100, random_state=42)
X_2d = ncdr.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.title('NCDR: Iris Dataset (2D)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
# Compare reconstruction quality
X_reconstructed = ncdr.predict(X)
mse_per_feature = np.mean((X - X_reconstructed) ** 2, axis=0)
plt.subplot(1, 2, 2)
plt.bar(range(len(mse_per_feature)), mse_per_feature)
plt.title('Reconstruction Error by Feature')
plt.xlabel('Feature Index')
plt.ylabel('MSE')
plt.tight_layout()
plt.show()
API Reference
NCDR Class
Parameters
- hidden_dims (list of int, default=[64, 32]): Hidden layer dimensions for encoder
- latent_dim (int, default=10): Dimension of latent space
- learning_rate (float, default=1e-3): Learning rate for optimization
- activation (str, default='relu'): Activation function ('relu', 'tanh', 'sigmoid', 'leaky_relu', 'elu', 'gelu')
- dropout_rate (float, default=0.0): Dropout rate for regularization
- weight_decay (float, default=0.0): L2 regularization weight decay
- batch_size (int, default=32): Training batch size
- max_epochs (int, default=100): Maximum training epochs
- validation_split (float, default=0.2): Fraction of data for validation
- standardize (bool, default=True): Whether to standardize features
- random_state (int, optional): Random seed for reproducibility
- trainer_kwargs (dict, optional): Additional PyTorch Lightning Trainer arguments
Methods
- fit(X, y=None): Train the autoencoder on data X
- transform(X): Transform data to latent representation
- fit_transform(X, y=None): Fit and transform in one step
- inverse_transform(X): Reconstruct data from latent representation
- predict(X): Reconstruct input data (alias for encode→decode)
- score(X, y=None): Return negative reconstruction MSE
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
encdr-1.0.0.tar.gz
(70.1 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
encdr-1.0.0-py3-none-any.whl
(8.4 kB
view details)
File details
Details for the file encdr-1.0.0.tar.gz.
File metadata
- Download URL: encdr-1.0.0.tar.gz
- Upload date:
- Size: 70.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fed003598a98efa183730615150dca9b0c49b9d63e8f8267081704e02f9554d8
|
|
| MD5 |
00f7bbcae8610d3aa980677e182f6725
|
|
| BLAKE2b-256 |
adc3200304027b2b13aaba831fc4d2612de8c5a28fe64ee59a51bc49ef608485
|
File details
Details for the file encdr-1.0.0-py3-none-any.whl.
File metadata
- Download URL: encdr-1.0.0-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
94321fedb29b0f4feb9c1751a970be3ea77a9a8960228758c3f95bd2661c9930
|
|
| MD5 |
0651dc72e3929833309c72d596d21a35
|
|
| BLAKE2b-256 |
fcabf4251a3b6e1659f3f1537906da05b69f3c5e3004288f10c1f12dbf2f7a29
|