A professional Python toolkit for estimating intrinsic dimensionality and computing Dimensionality Reduction Ratio (DRR) metrics

Intrinsic Dimension Analysis with DRR Metrics

CI | Python 3.11+ | License: Unlicense | Code style: black | Coverage: 82%

A professional Python toolkit for estimating the intrinsic dimensionality of datasets and computing Dimensionality Reduction Ratio (DRR) metrics. This implementation is based on the correlation function approach from Levina & Bickel (2005) with enhancements for large-scale dataset processing.

🚀 Quick Start

# Install the package
pip install drr

# Process all datasets from configuration file
drr batch datasets.txt

# Process a single dataset
drr single data/config/Apache_AllMeasurements.csv

# Use custom parameters with debug logging
drr --log-level DEBUG batch datasets.txt --max-samples 5000 --metric euclidean

🔍 Overview

This toolkit implements the Levina-Bickel correlation function method for intrinsic dimension estimation, enhanced with:

  • DRR (Dimensionality Reduction Ratio) metric: DRR = 1 - (I/R), where I is the estimated intrinsic dimension and R is the raw dimension
  • Large dataset handling with intelligent sampling strategies
  • Batch processing capabilities for multiple datasets
  • Professional logging and error handling
  • Resume functionality for interrupted processing jobs

What is Intrinsic Dimension?

The intrinsic dimension of a dataset is the minimum number of parameters needed to represent the data without significant information loss. While a dataset might exist in a high-dimensional space (raw dimension R), its true complexity might be much lower (intrinsic dimension I).

What is DRR?

Dimensionality Reduction Ratio (DRR) quantifies how much dimensionality reduction is possible:

  • DRR = 1 - (I/R)
  • High DRR (>0.5): Significant dimensionality reduction possible
  • Low DRR (<0.3): Dataset complexity is close to its raw dimensionality
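The formula above is simple enough to check by hand. The following is a minimal sketch; `compute_drr` is a hypothetical helper written for illustration and is not part of the package's documented API. The inputs R = 43 and I = 12 are the figures shown in this README's sample output.

```python
def compute_drr(intrinsic_dim: float, raw_dim: int) -> float:
    """Dimensionality Reduction Ratio: DRR = 1 - (I / R)."""
    if raw_dim <= 0:
        raise ValueError("raw_dim must be a positive number of features")
    return 1.0 - intrinsic_dim / raw_dim


# R = 43 raw features, I = 12 estimated intrinsic dimensions.
print(f"DRR: {compute_drr(12, 43):.3f}")  # DRR: 0.721
```

A DRR of 0.721 falls in the "high" range described above: roughly 72% of the raw dimensions are redundant under this estimate.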

✨ Features

Core Capabilities

  • 🔬 Intrinsic dimension estimation using correlation function analysis
  • 📊 DRR metric computation for dataset complexity analysis
  • 🗂️ Batch processing of multiple datasets from configuration files
  • 📈 Large dataset optimization with multi-level sampling
  • 🔧 Resume functionality for interrupted processing jobs

Technical Features

  • 🏗️ Professional architecture with modular design
  • 📝 Comprehensive logging with configurable levels
  • 🛡️ Robust error handling and validation
  • 🔄 Progress tracking and status reporting
  • 📊 CSV results export with detailed metrics

Data Processing

  • 🧹 Automatic preprocessing (categorical encoding, missing value handling)
  • 🎯 Goal variable detection and removal
  • 📏 Distance metric selection (L1, L2, Euclidean, Manhattan, Cosine)
  • 🔀 Intelligent sampling for datasets >50K rows
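This README does not describe how the sampling for large datasets works internally. As a rough illustration of the general idea only, the sketch below caps a dataset at a fixed number of rows using plain uniform sampling without replacement. The function name `subsample_rows` and the `seed` parameter are hypothetical; the default of 2000 simply mirrors the documented `--max-samples` default.

```python
import random


def subsample_rows(rows, max_samples=2000, seed=0):
    """Return at most max_samples rows, drawn uniformly without replacement."""
    if len(rows) <= max_samples:
        return list(rows)  # small datasets are returned unchanged
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return rng.sample(rows, max_samples)


# A synthetic dataset with 60,000 rows and 2 columns.
rows = [[i, 2 * i] for i in range(60_000)]
sample = subsample_rows(rows)
print(len(sample))  # 2000
```

Capping the row count matters because pairwise-distance methods scale quadratically with the number of points.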

🛠️ Installation

From PyPI (Recommended)

# Install the latest stable version
pip install drr

# Install with development dependencies
pip install drr[dev]

# Install with all optional dependencies
pip install drr[all]

From Source

# Clone the repository
git clone https://github.com/andre-motta/dimensionality_reduction_ratio.git
cd dimensionality_reduction_ratio

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e .[dev]

Prerequisites

  • Python 3.11+
  • pip (Python package installer)

Verify Installation

# Test the command-line interface
drr --help

# Or if installed from source
cd src
python -m drr --help

Dependencies

This project uses the following key libraries:

  • Click: Modern command-line interface framework
  • NumPy: Numerical computing library
  • Pandas: Data manipulation and analysis
  • SciPy: Scientific computing library
  • Matplotlib: Plotting library

📖 Usage

Command Line Interface

Batch Processing

Process multiple datasets from a configuration file:

drr batch datasets.txt

With custom parameters:

drr --log-level DEBUG batch datasets.txt \
    --max-samples 5000 \
    --metric euclidean \
    --data-root data

Single Dataset Processing

Process an individual dataset:

drr single data/config/Apache_AllMeasurements.csv

With custom parameters:

drr single data/config/Apache_AllMeasurements.csv \
    --max-samples 3000 \
    --metric manhattan

Global Options

  • --log-level: Logging level (DEBUG, INFO, WARNING, ERROR)
  • --log-file: Optional log file path

Batch Command Options

  • datasets_file: Path to configuration file listing datasets to process
  • --data-root: Root directory for dataset files (default: ../data)
  • --max-samples: Maximum samples for large datasets (default: 2000)
  • --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)

Single Command Options

  • dataset_path: Path to the dataset file to process
  • --max-samples: Maximum samples for large datasets (default: 2000)
  • --metric: Distance metric (l1, l2, euclidean, manhattan, cosine)

Python API

Single Dataset Analysis

import drr

# Simple usage with convenience function
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # Your dataset
original_dims, intrinsic_dim, drr_value = drr.estimate_intrinsic_dimension(data)

print(f"Raw dimensions: {original_dims}")
print(f"Intrinsic dimension: {intrinsic_dim}")
print(f"DRR: {drr_value:.3f}")

# Advanced usage with classes
estimator = drr.IntrinsicDimensionEstimator(max_samples=2000, distance_metric='euclidean')
processor = drr.DataProcessor()

# Process dataset from file
data, metadata = processor.process_dataset('data/config/Apache_AllMeasurements.csv')
original_dims, intrinsic_dim, drr_value = estimator.estimate(data)

Batch Processing

import drr

# Initialize batch processor
processor = drr.BatchProcessor(
    results_file="results/my_results.csv",
    max_samples=2000,
    distance_metric='manhattan'
)

# Process all datasets
results = processor.process_datasets_from_file('datasets.txt')
print(f"Processed {results['successful']} datasets successfully")

📁 Dataset Configuration

The datasets.txt file defines which datasets to process using a hierarchical structure:

Format

# Configuration section
config
    Apache_AllMeasurements
    HSMGP_num
    SQL_AllMeasurements

# Classification datasets  
classify
    breastcancer
    diabetes
    german

# Software measurement datasets
mvn
    training_set/mvn_training
    test_set/mvn_test

Rules

  1. Section headers have no indentation
  2. Dataset names are indented (spaces or tabs)
  3. Comments start with #
  4. File paths are relative to the data root directory (--data-root)
  5. The .csv extension is appended automatically
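To make the rules above concrete, here is a minimal sketch of a parser for this format. It is not the package's own loader: `parse_datasets_config` is a hypothetical name, and resolving each dataset to `<data_root>/<section>/<name>.csv` is an assumption inferred from the example path `data/config/Apache_AllMeasurements.csv` used elsewhere in this README.

```python
from pathlib import Path


def parse_datasets_config(text: str, data_root: str = "data") -> dict[str, list[Path]]:
    """Map each section header to the CSV paths of its indented datasets."""
    sections: dict[str, list[Path]] = {}
    current = None
    for raw_line in text.splitlines():
        stripped = raw_line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # rule 3: skip blank lines and comments
        if raw_line[0] in " \t":
            # rule 2: indented lines are dataset names
            if current is None:
                raise ValueError(f"dataset {stripped!r} appears before any section")
            # rules 4 and 5: resolve under the data root and append .csv
            # (placing the section in the path is an assumption)
            sections[current].append(Path(data_root) / current / f"{stripped}.csv")
        else:
            # rule 1: unindented lines are section headers
            current = stripped
            sections.setdefault(current, [])
    return sections


example = """\
# Configuration section
config
    Apache_AllMeasurements
    HSMGP_num

mvn
\ttraining_set/mvn_training
"""
for section, paths in parse_datasets_config(example).items():
    print(section, [p.as_posix() for p in paths])
```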

🔬 Algorithm Details

Correlation Function Method

The algorithm estimates intrinsic dimension using the correlation function approach:

  1. Distance Computation: Calculate pairwise distances between data points
  2. Correlation Function: C(r) = (2 * I) / (n * (n-1)), where n is the number of points and I is the number of point pairs with distance ≤ r (here I is a pair count, not the intrinsic dimension)
  3. Log-Log Analysis: Fit linear regression to log(C(r)) vs log(r)
  4. Dimension Estimation: The slope approximates the intrinsic dimension
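The four steps above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the package's implementation: the function name `correlation_dimension`, the choice of 12 log-spaced radii, and the radius range (1st to 20th percentile of the pairwise distances) are all assumptions, since this README does not say how radii are selected.

```python
import bisect
import math
import random


def correlation_dimension(points, num_radii=12):
    """Estimate intrinsic dimension as the slope of log C(r) versus log r."""
    n = len(points)
    # Step 1: pairwise Euclidean distances, one entry per unordered pair.
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    positive = [d for d in dists if d > 0]
    # Assumed radius range: 1st to 20th percentile of the positive distances.
    r_min = positive[int(0.01 * len(positive))]
    r_max = positive[int(0.20 * len(positive))]

    log_r, log_c = [], []
    for k in range(num_radii):
        # Radii spaced evenly on a log scale between r_min and r_max.
        r = r_min * (r_max / r_min) ** (k / (num_radii - 1))
        # Step 2: C(r) = 2 * (pairs with distance <= r) / (n * (n - 1)).
        pair_count = bisect.bisect_right(dists, r)
        log_r.append(math.log(r))
        log_c.append(math.log(2.0 * pair_count / (n * (n - 1))))

    # Steps 3 and 4: least-squares slope of log C(r) against log r.
    mean_x = sum(log_r) / num_radii
    mean_y = sum(log_c) / num_radii
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_c))
    sxx = sum((x - mean_x) ** 2 for x in log_r)
    return sxy / sxx


rng = random.Random(0)
# 300 points on a straight line embedded in 3 raw dimensions.
line = [(t, 2 * t, -t) for t in (rng.random() for _ in range(300))]
slope = correlation_dimension(line)
print(f"estimated intrinsic dimension: {slope:.2f}")
```

For this data the slope should come out close to 1 even though each point has three coordinates, which is the gap between raw and intrinsic dimension that DRR measures. The estimate is sensitive to the radius range, so results on real data depend on that choice.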

📊 DRR Metrics

Understanding DRR Values

DRR = 1 - (I/R) where:

  • I: Intrinsic dimension (estimated)
  • R: Raw dimension (number of features)
  • DRR: Dimensionality Reduction Ratio

Interpretation Guidelines

DRR Range | Interpretation           | Example Dataset Type
0.0 - 0.2 | Low reduction potential  | Behavior/performance data
0.2 - 0.4 | Moderate reduction       | Mixed datasets
0.4 - 0.6 | Good reduction potential | Configuration data
0.6 - 1.0 | High reduction potential | Highly correlated features
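The guideline bands can be turned into a small lookup. In the sketch below, `interpret_drr` is a hypothetical helper, not a documented function of the package. The table's ranges share endpoints (0.2 appears in two rows), so treating each band as including its lower bound and excluding its upper bound is an assumption.

```python
def interpret_drr(drr_value: float) -> str:
    """Map a DRR value to the interpretation bands from the table above."""
    # Assumed convention: each band includes its lower bound only.
    if drr_value < 0.2:
        return "Low reduction potential"
    if drr_value < 0.4:
        return "Moderate reduction"
    if drr_value < 0.6:
        return "Good reduction potential"
    return "High reduction potential"


for value in (0.1, 0.2, 0.5, 0.721):
    print(f"{value:.3f}: {interpret_drr(value)}")
```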

📈 Results

Sample Output

===============================================
RESULTS FOR: Apache_AllMeasurements.csv
===============================================
Original Dimensions (R): 43
Intrinsic Dimension (I): 12
DRR (1 - I/R): 0.721
Data Quality: 72.1% dimensionality reduction
===============================================

🗂️ Directory Structure

dimensionality_reduction_ratio/
├── src/                      # Source code modules
│   ├── main.py              # Command-line entry point
│   ├── intrinsic_dimension.py  # Core algorithm
│   ├── data_processor.py    # Data preprocessing
│   └── batch_processor.py   # Batch processing
├── config/                   # Configuration files
│   ├── datasets.txt         # Dataset configuration
│   └── test_datasets.txt    # Test configuration
├── data/                     # Dataset files
├── results/                  # Output files
├── logs/                     # Log files
├── examples/                 # Usage examples
│   └── example_usage.py     # API usage examples
└── README.md                # This documentation

🧪 Testing

Validate Installation

# Test the command-line interface
drr --help
drr batch --help 
drr single --help

# Test with sample data
drr single data/optimize/config/SS-A.csv

# Test batch processing (small subset)
drr batch config/test_datasets.txt

🔗 Repository

GitHub Repository: https://github.com/andre-motta/dimensionality_reduction_ratio

For questions or support, please open an issue in the repository or contact the maintainers.
