A professional Python toolkit for estimating intrinsic dimensionality and computing Dimensionality Reduction Ratio (DRR) metrics
Intrinsic Dimension Analysis with DRR Metrics
A professional Python toolkit for estimating the intrinsic dimensionality of datasets and computing Dimensionality Reduction Ratio (DRR) metrics. This implementation is based on the correlation function approach from Levina & Bickel (2005) with enhancements for large-scale dataset processing.
Quick Start
# Install the package
pip install drr
# Process all datasets from configuration file
drr batch datasets.txt
# Process a single dataset
drr single data/config/Apache_AllMeasurements.csv
# Use custom parameters with debug logging
drr --log-level DEBUG batch datasets.txt --max-samples 5000 --metric euclidean
Table of Contents
- Overview
- Features
- Installation
- Usage
- Dataset Configuration
- Algorithm Details
- DRR Metrics
- API Reference
- Results
- Contributing
Overview
This toolkit implements the Levina-Bickel correlation function method for intrinsic dimension estimation, enhanced with:
- DRR (Dimensionality Reduction Ratio) metric: DRR = 1 - (I/R)
- Large dataset handling with intelligent sampling strategies
- Batch processing capabilities for multiple datasets
- Professional logging and error handling
- Resume functionality for interrupted processing jobs
What is Intrinsic Dimension?
The intrinsic dimension of a dataset is the minimum number of parameters needed to represent the data without significant information loss. While a dataset might exist in a high-dimensional space (raw dimension R), its true complexity might be much lower (intrinsic dimension I).
What is DRR?
Dimensionality Reduction Ratio (DRR) quantifies how much dimensionality reduction is possible:
DRR = 1 - (I/R)
- High DRR (>0.5): Significant dimensionality reduction possible
- Low DRR (<0.3): Dataset complexity is close to its raw dimensionality
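As a small worked example of the formula, the sketch below computes DRR for a made-up dataset with 20 raw features and an intrinsic dimension of 5. The function name `dimensionality_reduction_ratio` is hypothetical and is not part of the package API.

```python
def dimensionality_reduction_ratio(intrinsic_dim: float, raw_dim: int) -> float:
    """Return DRR = 1 - I/R for intrinsic dimension I and raw dimension R."""
    if raw_dim <= 0:
        raise ValueError("raw_dim must be a positive number of features")
    return 1.0 - intrinsic_dim / raw_dim


# Illustrative values only: 20 features, structure captured by about 5 parameters.
print(round(dimensionality_reduction_ratio(5, 20), 3))  # 0.75
```

A value of 0.75 falls in the "High DRR" range above, meaning most of the raw dimensions are redundant.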
Features
Core Capabilities
- Intrinsic dimension estimation using correlation function analysis
- DRR metric computation for dataset complexity analysis
- Batch processing of multiple datasets from configuration files
- Large dataset optimization with multi-level sampling
- Resume functionality for interrupted processing jobs
Technical Features
- Professional architecture with modular design
- Comprehensive logging with configurable levels
- Robust error handling and validation
- Progress tracking and status reporting
- CSV results export with detailed metrics
Data Processing
- Automatic preprocessing (categorical encoding, missing value handling)
- Goal variable detection and removal
- Distance metric selection (L1, L2, Euclidean, Manhattan, Cosine)
- Intelligent sampling for datasets >50K rows
Installation
From PyPI (Recommended)
# Install the latest stable version
pip install drr
# Install with development dependencies
pip install drr[dev]
# Install with all optional dependencies
pip install drr[all]
From Source
# Clone the repository
git clone https://github.com/andre-motta/dimensionality_reduction_ratio.git
cd dimensionality_reduction_ratio
# Install in development mode
pip install -e .
# Or install with development dependencies
pip install -e .[dev]
Prerequisites
- Python 3.11+
- pip (Python package installer)
Verify Installation
# Test the command-line interface
drr --help
# Or if installed from source
cd src
python -m drr --help
Dependencies
This project uses the following key libraries:
- Click: Modern command-line interface framework
- NumPy: Numerical computing library
- Pandas: Data manipulation and analysis
- SciPy: Scientific computing library
- Matplotlib: Plotting library
Usage
Command Line Interface
Batch Processing
Process multiple datasets from a configuration file:
drr batch datasets.txt
With custom parameters:
drr --log-level DEBUG batch datasets.txt \
--max-samples 5000 \
--metric euclidean \
--data-root data
Single Dataset Processing
Process an individual dataset:
drr single data/config/Apache_AllMeasurements.csv
With custom parameters:
drr single data/config/Apache_AllMeasurements.csv \
--max-samples 3000 \
--metric manhattan
Global Options
- --log-level: Logging level (DEBUG, INFO, WARNING, ERROR)
- --log-file: Optional log file path
Batch Command Options
- datasets_file: Path to configuration file listing datasets to process
- --data-root: Root directory for dataset files (default: ../data)
- --max-samples: Maximum samples for large datasets (default: 2000)
- --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)
Single Command Options
- dataset_path: Path to the dataset file to process
- --max-samples: Maximum samples for large datasets (default: 2000)
- --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)
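To make the --metric choices concrete, here is a minimal pure-Python sketch of the three underlying distances. It assumes that l1 and manhattan name the same distance, that l2 and euclidean name the same distance, and that cosine means one minus cosine similarity. The `METRICS` table and the function names are hypothetical and are not taken from the package.

```python
import math


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Assumed alias table: the CLI option names mapped to the functions above.
METRICS = {
    "l1": manhattan,
    "manhattan": manhattan,
    "l2": euclidean,
    "euclidean": euclidean,
    "cosine": cosine,
}

print(METRICS["l1"]([0, 0], [3, 4]))  # 7
print(METRICS["l2"]([0, 0], [3, 4]))  # 5.0
```

The choice of metric changes the pairwise distances that feed the estimator, so the same dataset can yield different intrinsic dimension estimates under different metrics.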
Python API
Single Dataset Analysis
import drr
# Simple usage with convenience function
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # Your dataset
original_dims, intrinsic_dim, drr_value = drr.estimate_intrinsic_dimension(data)
print(f"Raw dimensions: {original_dims}")
print(f"Intrinsic dimension: {intrinsic_dim}")
print(f"DRR: {drr_value:.3f}")
# Advanced usage with classes
estimator = drr.IntrinsicDimensionEstimator(max_samples=2000, distance_metric='euclidean')
processor = drr.DataProcessor()
# Process dataset from file
data, metadata = processor.process_dataset('data/config/Apache_AllMeasurements.csv')
original_dims, intrinsic_dim, drr_value = estimator.estimate(data)
Batch Processing
import drr
# Initialize batch processor
processor = drr.BatchProcessor(
    results_file="results/my_results.csv",
    max_samples=2000,
    distance_metric='manhattan'
)
# Process all datasets
results = processor.process_datasets_from_file('datasets.txt')
print(f"Processed {results['successful']} datasets successfully")
Dataset Configuration
The datasets.txt file defines which datasets to process using a hierarchical structure:
Format
# Configuration section
config
    Apache_AllMeasurements
    HSMGP_num
    SQL_AllMeasurements
# Classification datasets
classify
    breastcancer
    diabetes
    german
# Software measurement datasets
mvn
    training_set/mvn_training
    test_set/mvn_test
Rules
- Section headers have no indentation
- Dataset names are indented (spaces or tabs)
- Comments start with #
- File paths are relative to the data_root directory
- CSV extension is automatically added
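The rules above can be sketched as a small parser. This is an illustration only, not the package's loader: the function name `parse_datasets_config` is hypothetical, and the way a section name becomes a subdirectory of the data root is an assumption inferred from the example path data/config/Apache_AllMeasurements.csv.

```python
from pathlib import PurePosixPath


def parse_datasets_config(text, data_root="data"):
    """Turn a datasets.txt-style listing into a list of CSV paths."""
    paths = []
    section = None
    for raw_line in text.splitlines():
        stripped = raw_line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if raw_line[0] in " \t":
            # Indented line: a dataset name inside the current section.
            if section is None:
                raise ValueError(f"dataset {stripped!r} appears before any section")
            # Assumption: the section name is a subdirectory of data_root.
            paths.append(str(PurePosixPath(data_root) / section / (stripped + ".csv")))
        else:
            # Unindented line: a new section header.
            section = stripped
    return paths


EXAMPLE = """\
# Configuration section
config
    Apache_AllMeasurements
mvn
    training_set/mvn_training
"""

for path in parse_datasets_config(EXAMPLE):
    print(path)
```

Under these assumptions the example prints data/config/Apache_AllMeasurements.csv followed by data/mvn/training_set/mvn_training.csv.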
Algorithm Details
Correlation Function Method
The algorithm estimates intrinsic dimension using the correlation function approach:
- Distance Computation: Calculate pairwise distances between data points
- Correlation Function: C(r) = (2 * I) / (n * (n-1)), where I here denotes the number of pairs with distance ≤ r (not the intrinsic dimension) and n is the number of data points
- Log-Log Analysis: Fit a linear regression to log(C(r)) vs log(r)
- Dimension Estimation: The slope approximates the intrinsic dimension
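The four steps above can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it uses Euclidean distance, a hand-picked list of radii, no sampling, and an ordinary least-squares fit. The function name `correlation_dimension` and its parameters are hypothetical.

```python
import math


def correlation_dimension(points, radii):
    """Estimate dimension as the slope of log C(r) against log r."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Step 1: pairwise Euclidean distances over all unordered pairs (i < j).
    distances = [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    log_r, log_c = [], []
    for r in radii:
        # Step 2: C(r) = 2 * (pairs with distance <= r) / (n * (n - 1)).
        pair_count = sum(1 for d in distances if d <= r)
        if pair_count == 0:
            continue  # log(0) is undefined, so skip radii with no pairs
        c_r = 2.0 * pair_count / (n * (n - 1))
        log_r.append(math.log(r))
        log_c.append(math.log(c_r))
    if len(log_r) < 2:
        raise ValueError("need at least two radii with a nonzero pair count")
    # Steps 3 and 4: least-squares slope of log C(r) on log r.
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_c) / len(log_c)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_c))
    sxx = sum((x - mean_x) ** 2 for x in log_r)
    return sxy / sxx


# 200 evenly spaced points on a straight line embedded in 3 raw dimensions.
line = [(t / 199, 2 * t / 199, -t / 199) for t in range(200)]
estimate = correlation_dimension(line, radii=[0.05, 0.1, 0.2, 0.4])
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

Although each point has three coordinates, the points lie on a line, so the estimate comes out near 1. The result is sensitive to the chosen radii: radii that are too large run into the edges of the data and pull the slope down.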
DRR Metrics
Understanding DRR Values
DRR = 1 - (I/R) where:
- I: Intrinsic dimension (estimated)
- R: Raw dimension (number of features)
- DRR: Dimensionality Reduction Ratio
Interpretation Guidelines
| DRR Range | Interpretation | Example Dataset Type |
|---|---|---|
| 0.0 - 0.2 | Low reduction potential | Behavior/performance data |
| 0.2 - 0.4 | Moderate reduction | Mixed datasets |
| 0.4 - 0.6 | Good reduction potential | Configuration data |
| 0.6 - 1.0 | High reduction potential | Highly correlated features |
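The table can be turned into a small lookup, sketched below. The function name `interpret_drr` is hypothetical. Because adjacent ranges in the table share their boundary values, the sketch assumes each range includes its lower bound and excludes its upper bound, except the last range, which includes 1.0. Rejecting values outside 0.0 to 1.0 is also an assumption.

```python
def interpret_drr(drr_value: float) -> str:
    """Map a DRR value to the interpretation label from the guideline table."""
    if not 0.0 <= drr_value <= 1.0:
        raise ValueError("the guideline table only covers DRR values in [0, 1]")
    if drr_value < 0.2:
        return "Low reduction potential"
    if drr_value < 0.4:
        return "Moderate reduction"
    if drr_value < 0.6:
        return "Good reduction potential"
    return "High reduction potential"


print(interpret_drr(0.75))  # High reduction potential
```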
Results
Sample Output
===============================================
RESULTS FOR: Apache_AllMeasurements.csv
===============================================
Original Dimensions (R): 43
Intrinsic Dimension (I): 12
DRR (1 - I/R): 0.721
Data Quality: 72.1% dimensionality reduction
===============================================
Directory Structure
dimensionality_reduction_ratio/
├── src/                        # Source code modules
│   ├── main.py                 # Command-line entry point
│   ├── intrinsic_dimension.py  # Core algorithm
│   ├── data_processor.py       # Data preprocessing
│   └── batch_processor.py      # Batch processing
├── config/                     # Configuration files
│   ├── datasets.txt            # Dataset configuration
│   └── test_datasets.txt       # Test configuration
├── data/                       # Dataset files
├── results/                    # Output files
├── logs/                       # Log files
├── examples/                   # Usage examples
│   └── example_usage.py        # API usage examples
└── README.md                   # This documentation
Testing
Validate Installation
# Test the command-line interface
drr --help
drr batch --help
drr single --help
# Test with sample data
drr single data/optimize/config/SS-A.csv
# Test batch processing (small subset)
drr batch config/test_datasets.txt
Repository
GitHub Repository: https://github.com/andre-motta/dimensionality_reduction_ratio
For questions or support, please open an issue in the repository or contact the maintainers.