A professional Python toolkit for estimating intrinsic dimensionality and computing Dimensionality Reduction Ratio (DRR) metrics
DRR: Dimensionality Reduction Ratio Toolkit
"Less Noise, More Signal: DRR for Better Optimizations of SE Tasks"
A research-backed approach to predicting when lightweight algorithms suffice
A professional Python toolkit for estimating intrinsic dimensionality and computing Dimensionality Reduction Ratio (DRR) metrics. This implementation is based on research from NC State University showing that DRR can predict when simple algorithms match the results of complex AI methods while running orders of magnitude faster.
Research Background
This toolkit implements the methodology from our research paper "Less Noise, More Signal: DRR for Better Optimizations of SE Tasks" which demonstrates that:
- 89% of Software Engineering datasets satisfy the DRR threshold for simplified optimization
- Simple methods can be 100x faster than state-of-the-art optimizers when DRR > 1/3
- SE data has lower intrinsic complexity (median 3.1 dimensions) compared to general ML data (median 5 dimensions)
What is DRR?
The Dimensionality Reduction Ratio (DRR) is a metric that quantifies how much dimensionality reduction is possible in a dataset:
DRR = 1 - (I/R)
Where:
- I = Intrinsic dimension (estimated using correlation function analysis)
- R = Raw dimension (number of original features)
Research Finding: The 1/3 Threshold
Our research shows that when DRR > 1/3, simple algorithms can achieve the same performance as complex state-of-the-art optimizers but run two orders of magnitude faster.
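As a small worked example of the formula and the 1/3 threshold, the sketch below computes DRR from a raw and an intrinsic dimension. The helper names `drr_value` and `passes_threshold` are illustrative assumptions, not part of the package API; the inputs (R = 43, I = 12) are the Apache_AllMeasurements values shown in the sample output of this README.

```python
DRR_THRESHOLD = 1 / 3  # threshold reported in the paper


def drr_value(intrinsic_dim: float, raw_dim: int) -> float:
    """Return DRR = 1 - (I / R) for intrinsic dimension I and raw dimension R."""
    if raw_dim <= 0:
        raise ValueError("raw_dim must be positive")
    return 1 - intrinsic_dim / raw_dim


def passes_threshold(drr: float) -> bool:
    """True when DRR is strictly above the 1/3 threshold."""
    return drr > DRR_THRESHOLD


# Hypothetical helper calls; the numbers come from the sample output section.
value = drr_value(intrinsic_dim=12, raw_dim=43)
print(f"DRR = {value:.3f}")      # DRR = 0.721
print(passes_threshold(value))   # True
```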
Quick Start
# Install the package
pip install drr
# Process all datasets from configuration file
drr batch datasets.txt
# Process a single dataset
drr single data/config/Apache_AllMeasurements.csv
# Use custom parameters with debug logging
drr --log-level DEBUG batch datasets.txt --max-samples 5000 --metric euclidean
Table of Contents
- Overview
- Features
- Installation
- Usage
- Dataset Configuration
- Algorithm Details
- DRR Metrics
- API Reference
- Results
- Contributing
Overview
This toolkit implements the Levina-Bickel correlation function method for intrinsic dimension estimation, enhanced with:
- DRR (Dimensionality Reduction Ratio) metric: DRR = 1 - (I/R)
- Large dataset handling with intelligent sampling strategies
- Batch processing capabilities for multiple datasets
- Professional logging and error handling
- Resume functionality for interrupted processing jobs
What is Intrinsic Dimension?
The intrinsic dimension of a dataset is the minimum number of parameters needed to represent the data without significant information loss. While a dataset might exist in a high-dimensional space (raw dimension R), its true complexity might be much lower (intrinsic dimension I).
What is DRR?
Dimensionality Reduction Ratio (DRR) quantifies how much dimensionality reduction is possible:
DRR = 1 - (I/R)
- High DRR (>0.5): Significant dimensionality reduction possible
- Low DRR (<0.3): Dataset complexity is close to its raw dimensionality
Features
Core Capabilities
- Intrinsic dimension estimation using correlation function analysis
- DRR metric computation for dataset complexity analysis
- Batch processing of multiple datasets from configuration files
- Large dataset optimization with multi-level sampling
- Resume functionality for interrupted processing jobs
Technical Features
- Professional architecture with modular design
- Comprehensive logging with configurable levels
- Robust error handling and validation
- Progress tracking and status reporting
- CSV results export with detailed metrics
Data Processing
- Automatic preprocessing (categorical encoding, missing value handling)
- Goal variable detection and removal
- Distance metric selection (L1, L2, Euclidean, Manhattan, Cosine)
- Intelligent sampling for datasets >50K rows
Installation
From PyPI (Recommended)
# Install the latest stable version
pip install drr
# Install with development dependencies (quoted so shells such as zsh do not expand the brackets)
pip install "drr[dev]"
# Install with all optional dependencies
pip install "drr[all]"
From Source
# Clone the repository
git clone https://github.com/andre-motta/dimensionality_reduction_ratio.git
cd dimensionality_reduction_ratio
# Install in development mode
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"
Prerequisites
- Python 3.11+
- pip (Python package installer)
Verify Installation
# Test the command-line interface
drr --help
# Or if installed from source
cd src
python -m drr --help
Dependencies
This project uses the following key libraries:
- Click: Modern command-line interface framework
- NumPy: Numerical computing library
- Pandas: Data manipulation and analysis
- SciPy: Scientific computing library
- Matplotlib: Plotting library
Usage
Command Line Interface
Batch Processing
Process multiple datasets from a configuration file:
drr batch datasets.txt
With custom parameters:
drr --log-level DEBUG batch datasets.txt \
--max-samples 5000 \
--metric euclidean \
--data-root data
Single Dataset Processing
Process an individual dataset:
drr single data/config/Apache_AllMeasurements.csv
With custom parameters:
drr single data/config/Apache_AllMeasurements.csv \
--max-samples 3000 \
--metric manhattan
Global Options
- --log-level: Logging level (DEBUG, INFO, WARNING, ERROR)
- --log-file: Optional log file path
Batch Command Options
- datasets_file: Path to configuration file listing datasets to process
- --data-root: Root directory for dataset files (default: ../data)
- --max-samples: Maximum samples for large datasets (default: 2000)
- --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)
Single Command Options
- dataset_path: Path to the dataset file to process
- --max-samples: Maximum samples for large datasets (default: 2000)
- --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)
Python API
Single Dataset Analysis
import drr
# Simple usage with convenience function
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # Your dataset
original_dims, intrinsic_dim, drr_value = drr.estimate_intrinsic_dimension(data)
print(f"Raw dimensions: {original_dims}")
print(f"Intrinsic dimension: {intrinsic_dim}")
print(f"DRR: {drr_value:.3f}")
# Advanced usage with classes
estimator = drr.IntrinsicDimensionEstimator(max_samples=2000, distance_metric='euclidean')
processor = drr.DataProcessor()
# Process dataset from file
data, metadata = processor.process_dataset('data/config/Apache_AllMeasurements.csv')
original_dims, intrinsic_dim, drr_value = estimator.estimate(data)
Batch Processing
import drr
# Initialize batch processor
processor = drr.BatchProcessor(
    results_file="results/my_results.csv",
    max_samples=2000,
    distance_metric='manhattan'
)
# Process all datasets
results = processor.process_datasets_from_file('datasets.txt')
print(f"Processed {results['successful']} datasets successfully")
Dataset Configuration
The datasets.txt file defines which datasets to process using a hierarchical structure:
Format
# Configuration section
config
    Apache_AllMeasurements
    HSMGP_num
    SQL_AllMeasurements
# Classification datasets
classify
    breastcancer
    diabetes
    german
# Software measurement datasets
mvn
    training_set/mvn_training
    test_set/mvn_test
Rules
- Section headers have no indentation
- Dataset names are indented (spaces or tabs)
- Comments start with #
- File paths are relative to the data_root directory
- The CSV extension is added automatically
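The rules above can be sketched as a small parser. This is an illustration only, not the package's own loader: the function name `parse_datasets_text` is hypothetical, and mapping each section header to a subdirectory of the data root is an assumption inferred from the CLI example path `data/config/Apache_AllMeasurements.csv`.

```python
from pathlib import Path


def parse_datasets_text(text: str, data_root: str = "data") -> list[Path]:
    """Turn the hierarchical listing into dataset paths under data_root.

    Assumption: a section header names a subdirectory of data_root.
    """
    paths = []
    section = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue  # skip blank and comment-only lines
        if line[0] in " \t":
            # Indented line: a dataset inside the current section
            if section is None:
                raise ValueError(f"dataset listed before any section: {raw_line!r}")
            paths.append(Path(data_root) / section / (line.strip() + ".csv"))
        else:
            # Unindented line: a new section header
            section = line.strip()
    return paths


example = """\
# Configuration section
config
    Apache_AllMeasurements
    HSMGP_num
mvn
    training_set/mvn_training
"""

for path in parse_datasets_text(example):
    print(path.as_posix())
```

Keeping the parser free of file I/O makes it easy to check a configuration before starting a long batch run.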
Algorithm Details
Fractal-Based Intrinsic Dimension Estimation
This toolkit implements the correlation function method used in our research, which leverages fractal geometry concepts to estimate intrinsic dimensionality:
- Distance Analysis: Calculate pairwise distances between data points
- Correlation Function: For radius r, compute C(r) = (2 * I) / (n * (n-1)), where I is the number of pairs with distance ≤ r (in this formula I is a pair count, not the intrinsic dimension)
- Fractal Dimension: Analyze how the number of points scales with distance radius
- Gradient Analysis: The maximum gradient of log(C(r)) vs log(r) approximates the intrinsic dimension
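The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the function names are hypothetical, the distance is Euclidean, the radii are supplied by the caller, and the gradient is approximated by finite differences between consecutive radii. Sampling, preprocessing, and other metrics are left out.

```python
import bisect
import math
from itertools import combinations


def sorted_pairwise_distances(points):
    """Step 1: Euclidean distance for every pair of points, sorted ascending."""
    return sorted(math.dist(a, b) for a, b in combinations(points, 2))


def correlation_integral(sorted_dists, n, r):
    """Step 2: C(r) = 2 * (pairs with distance <= r) / (n * (n - 1))."""
    close_pairs = bisect.bisect_right(sorted_dists, r)
    return 2 * close_pairs / (n * (n - 1))


def max_log_log_slope(points, radii):
    """Steps 3-4: largest slope of log C(r) versus log r over consecutive radii."""
    n = len(points)
    dists = sorted_pairwise_distances(points)
    curve = []
    for r in sorted(radii):
        c = correlation_integral(dists, n, r)
        if c > 0:  # log is undefined when no pair falls within r
            curve.append((math.log(r), math.log(c)))
    slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(curve, curve[1:])]
    if not slopes:
        raise ValueError("need at least two radii with a non-zero pair count")
    return max(slopes)


# Points on a straight line embedded in 3-D: raw dimension 3, intrinsic dimension 1.
points = [(t, 2 * t, -t) for t in (i / 299 for i in range(300))]
radii = [0.1 * 1.5 ** k for k in range(6)]
estimate = max_log_log_slope(points, radii)
print(f"raw dimension: 3, estimated intrinsic dimension: {estimate:.2f}")
```

For this synthetic line the estimate comes out close to 1 even though the raw dimension is 3, which is the gap that DRR measures.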
Key Advantages
- Accuracy: More precise than traditional PCA-based methods
- Robustness: Less sensitive to noise and outliers
- Scalability: Efficient for large datasets through intelligent sampling
- Research-Validated: Proven effective on 24+ real-world datasets
The Research Breakthrough
Our methodology fixes critical errors in previous approaches:
- Previous work suggested simple algorithms when I < 4
- Our research found many counter-examples to this threshold
- New threshold: DRR > 1/3 provides much more accurate predictions
DRR Metrics & Research Insights
Understanding DRR Values
DRR = 1 - (I/R) where:
- I: Intrinsic dimension (estimated)
- R: Raw dimension (number of features)
- DRR: Dimensionality Reduction Ratio
Research-Based Interpretation Guidelines
| DRR Range | Algorithm Recommendation | Performance Insight | SE Data Examples |
|---|---|---|---|
| > 0.67 | Use simple methods | 100x faster, same quality | Software configuration (SS-B, SS-D) |
| 0.33 - 0.67 | Simple methods often sufficient | 10-50x speedup possible | Most SE optimization tasks |
| < 0.33 | Complex methods may be needed | Intrinsic complexity requires sophisticated algorithms | General ML datasets |
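The table can be read as a simple lookup. The sketch below is one way to encode it; the function name `recommend` is hypothetical, and treating the band boundaries 0.33 and 0.67 as belonging to the middle band is an assumption, since the table does not say which band contains them.

```python
def recommend(drr: float) -> str:
    """Map a DRR value to the recommendation column of the table above."""
    if drr > 0.67:
        return "Use simple methods"
    if drr >= 0.33:  # assumption: the boundary values fall in the middle band
        return "Simple methods often sufficient"
    return "Complex methods may be needed"


# 0.721 is the Apache_AllMeasurements value from the sample output;
# 0.50 and 0.20 are made-up values chosen only to exercise the other bands.
for value in (0.721, 0.50, 0.20):
    print(f"DRR {value:.3f}: {recommend(value)}")
```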
Key Research Findings
SE vs General ML Data:
- Median SE intrinsic dimensionality: 3.1 dimensions
- Median general ML intrinsic dimensionality: 5.0 dimensions
- Conclusion: SE problems are inherently less complex
Performance Implications:
- 89% of SE datasets satisfy DRR > 1/3 threshold
- Simple algorithms (30 samples) perform as well as complex ones (3000 samples)
- Speedup: 2 orders of magnitude (seconds vs 20 minutes)
Results
Sample Output
===============================================
RESULTS FOR: Apache_AllMeasurements.csv
===============================================
Original Dimensions (R): 43
Intrinsic Dimension (I): 12
DRR (1 - I/R): 0.721
Data Quality: 72.1% dimensionality reduction
===============================================
Directory Structure
dimensionality_reduction_ratio/
├── src/                        # Source code modules
│   ├── main.py                 # Command-line entry point
│   ├── intrinsic_dimension.py  # Core algorithm
│   ├── data_processor.py       # Data preprocessing
│   └── batch_processor.py      # Batch processing
├── config/                     # Configuration files
│   ├── datasets.txt            # Dataset configuration
│   └── test_datasets.txt       # Test configuration
├── data/                       # Dataset files
├── results/                    # Output files
├── logs/                       # Log files
├── examples/                   # Usage examples
│   └── example_usage.py        # API usage examples
└── README.md                   # This documentation
Testing
Validate Installation
# Test the command-line interface
drr --help
drr batch --help
drr single --help
# Test with sample data
drr single data/optimize/config/SS-A.csv
# Test batch processing (small subset)
drr batch config/test_datasets.txt
Citation
If you use this toolkit in your research, please cite our paper:
@article{lustosa2025drr,
title={Less Noise, More Signal: DRR for Better Optimizations of SE Tasks},
author={Andre Lustosa and Tim Menzies},
journal={arXiv preprint arXiv:2503.21086},
year={2025},
url={https://arxiv.org/abs/2503.21086}
}
Research Links
- Paper: arXiv:2503.21086
- Research Code: GitHub Repository
- Institution: North Carolina State University, Department of Computer Science
- Authors: Andre Lustosa, Tim Menzies (Fellow, IEEE)
Repository
GitHub Repository: https://github.com/andre-motta/dimensionality_reduction_ratio
For questions, issues, or contributions, please visit the repository or contact the maintainers.