Genetic Algorithm for Unsupervised Feature Selection for Clustering
GAUFS
GAUFS (Genetic Algorithm for Unsupervised Feature Selection) is a Python library for unsupervised feature selection designed to identify the most relevant features for clustering without requiring labeled data. It combines genetic algorithms with clustering experiments to perform dimensionality reduction while simultaneously estimating the optimal number of clusters.
This library accompanies the research work presented in the paper:
GAUFS: Genetic Algorithm for Unsupervised Feature Selection for Clustering
Note: To reproduce the results presented in the paper and the experimental setup used for comparison with alternative methods, please use the paper-reproducibility branch of this repository.
Table of Contents
- Key Features
- Installation
- Quick Start: Basic Gaufs Usage
- How GAUFS Works
- Main Configuration Parameters for GAUFS
- Output Files
- Synthetic Data Generators
- Custom Fitness
- Examples
- Documentation
- Project Structure
- Acknowledgments
- License
- Library Authors and Contact Information
- Support
Key Features
- Fully Unsupervised: No labeled data required for feature selection
- Automatic Cluster Estimation: Simultaneously identifies optimal features and number of clusters
- Flexible architecture: GAUFS can work with custom clustering algorithms and evaluation metrics, allowing optimization of internal metrics (without relying on labels) and optionally external metrics when true labels are available for evaluation.
- Synthetic data generators: Includes the Spheres and Corners generators introduced in the paper, designed for testing feature selection under controlled clustering scenarios and benchmarking.
- Comprehensive Output: Automatic generation of plots, CSV files, and JSON results
- Reproducible: Seed-based random state control for consistent results
Installation
GAUFS is available on PyPI and can be installed using pip:
```shell
pip install gaufs
```
Requirements:
- Python >=3.11,<3.14
- numpy >=2.4.0,<3.0.0
- pandas >=2.3.3,<3.0.0
- scipy >=1.16.3,<2.0.0
- matplotlib >=3.10.8,<4.0.0
- scikit-learn >=1.8.0,<2.0.0
- DEAP >=1.4.3,<2.0.0 (used for the genetic algorithm)
Quick Start: Basic Gaufs Usage
```python
import pandas as pd
from gaufs import Gaufs

# Load your unlabeled data
data = pd.read_csv('your_data.csv')

# Initialize GAUFS with default parameters
gaufs = Gaufs(unlabeled_data=data)

# Run the complete algorithm
optimal_solution, fitness = gaufs.run()

# Extract results
selected_features = optimal_solution[0]  # Binary list (1 = selected, 0 = not selected)
optimal_clusters = optimal_solution[1]   # Optimal number of clusters

print(f"Selected {sum(selected_features)} out of {len(selected_features)} features")
print(f"Optimal number of clusters: {optimal_clusters}")
print(f"Fitness score: {fitness}")
```
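The binary mask in `optimal_solution[0]` aligns with your DataFrame's columns, so the selected feature names can be recovered with a simple filter. A minimal stdlib-only sketch (the feature names and mask below are illustrative, not actual GAUFS output):

```python
# Hypothetical values: in practice these come from gaufs.run() and your data's columns
feature_names = ["age", "income", "noise_a", "height", "noise_b"]
selected_mask = [1, 1, 0, 1, 0]  # 1 = feature kept, 0 = feature discarded

# Keep only the features flagged as relevant
selected = [name for name, keep in zip(feature_names, selected_mask) if keep]
print(selected)  # ['age', 'income', 'height']
```

With a pandas DataFrame, the same mask can be applied via `data.loc[:, [bool(b) for b in selected_mask]]`.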
How GAUFS Works
GAUFS operates in two main phases:
1. Genetic Search Phase
- Runs multiple independent genetic algorithm executions
- Each execution evolves feature subsets across different numbers of clusters
- Evaluates clustering quality using the specified metric (default: Silhouette Score)
- Computes variable significance scores based on selection frequency and quality
2. Variable Weight Analysis Phase
- Analyzes results from all genetic searches
- Combines fitness values and significance thresholds using weighted averaging
- Applies exponential decay to importance differences
- Automatically selects the optimal feature subset and number of clusters
- Outputs metrics graphs to help users make more informed decisions when balancing dimensionality reduction and cluster quality.
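The weighted-averaging step can be illustrated with a small sketch. The function name and the exact combination rule below are assumptions for illustration only; `w` plays the role of the `fitness_weight_over_threshold` parameter:

```python
# Illustrative sketch: combine a fitness value and a significance threshold
# into one variable-importance score via weighted averaging.
# This combination rule is an assumption, not GAUFS's exact implementation.
def combined_importance(fitness: float, threshold: float, w: float = 0.5) -> float:
    """w weights fitness against the significance threshold (w in [0, 1])."""
    return w * fitness + (1.0 - w) * threshold

print(combined_importance(0.8, 0.6, w=0.5))  # 0.7 with equal weighting
```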
The algorithm produces comprehensive outputs including:
- Selected feature subset
- Optimal number of clusters
- Fitness scores and significance metrics
- Visualization plots (2D and 3D)
- Detailed CSV files and JSON dictionaries
Main Configuration Parameters for GAUFS
| Parameter | Type | Default | Description |
|---|---|---|---|
| `seed` | int | None | Random seed for reproducibility. Default: random integer between 0 and 10000 if None. |
| `unlabeled_data` | pd.DataFrame or None | None | Input dataset without labels. If None, creates an empty DataFrame. |
| `num_genetic_executions` | int | 1 | Number of independent Genetic Algorithm runs. Must be ≥ 1. |
| `ngen` | int | 150 if num_vars ≤ 100, else 300 | Number of generations per GA execution. Must be ≥ 1. |
| `npop` | int | 1500 if num_vars ≤ 100, else 7000 | Population size. Must be ≥ 1. |
| `cxpb` | float | 0.8 | Crossover probability for genetic operations. Range: [0.0, 1.0]. |
| `cxpb_rest_of_genes` | float | 0.5 | Crossover probability for the rest of the generations after the initial ones. Range: [0.0, 1.0]. |
| `mutpb` | float | 0.1 | Mutation probability for genetic operations. Range: [0.0, 1.0]. |
| `convergence_generations` | int | 50 | Generations without improvement before early stopping. Must be ≥ 1. |
| `hof_size` | int or None | None | Hall of Fame size (absolute number of best solutions to retain). Overrides `hof_alpha_beta` if provided. Must be ≥ 1 or None. |
| `hof_alpha_beta` | tuple | (0.1, 0.2) | (alpha, beta) used for automatic Hall of Fame size calculation if `hof_size` is None. Range: [0.0, 1.0], beta ≥ alpha. |
| `clustering_method` | ClusteringExperiment | HierarchicalExperiment(linkage='ward') | Clustering algorithm instance. Must implement ClusteringExperiment. |
| `evaluation_metric` | EvaluationMetric | SilhouetteScore() | Metric for evaluating clustering quality. Must implement EvaluationMetric. |
| `cluster_number_search_band` | tuple | (2, 26) | Range of cluster numbers to explore as (min_inclusive, max_exclusive). Must satisfy 2 ≤ min < max ≤ number of samples. |
| `fitness_weight_over_threshold` | float | 0.5 | Weight for fitness vs. threshold in variable importance computation. Range: [0.0, 1.0]. |
| `exponential_decay_factor` | float | 1.0 | Exponential decay factor for the automatic solution selector; 0 means no decay. Formula: δ_i / (1 + (N / exp(exponential_decay_factor * i))). Must be ≥ 0.0. |
| `max_number_selections_for_ponderation` | int or None | 2 * num_vars | Max selections from the Hall of Fame used for weight computation. Must be ≥ 1 or None. |
| `verbose` | bool | True | Whether to print logs during execution. |
| `generate_genetics_log_files` | bool | True | Whether to generate log files with GA execution details. |
| `graph_evolution` | bool | True | Whether to generate graphs of best and average fitness during GA evolution. |
| `generate_files_with_results` | bool | True | Whether to generate files with results and plots. |
| `output_directory` | str or None | "./out/" if None | Path where generated files, including plots, are stored. |
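The decay formula stated for `exponential_decay_factor` can be computed directly. Here is a small sketch of that formula; the function name and the sample values of N and the δ_i differences are made-up inputs for illustration:

```python
import math

def decayed_difference(delta_i: float, i: int, n: int, decay: float = 1.0) -> float:
    """Apply the formula from the table: delta_i / (1 + N / exp(decay * i))."""
    return delta_i / (1.0 + n / math.exp(decay * i))

# Illustrative deltas: with decay > 0, later indices (larger i) are damped less
deltas = [0.5, 0.5, 0.5]
scaled = [decayed_difference(d, i, n=10) for i, d in enumerate(deltas)]
```

Note that with `decay = 0`, `exp(0 * i)` is always 1, so every difference is scaled by the same factor 1 / (1 + N).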
Output Files
All outputs are automatically saved under the specified output_directory (default ./out/), organized by GA run and type of analysis.
GA Execution Folders
Each independent GA run with a specific random seed creates a folder named GA_Seed_<seed>/ containing:
- `fitness_evolution.png` – Evolution of fitness across generations.
- `genetic_algorithm_log.txt` – Detailed log of the GA execution.
- `hall_of_fame.txt` – Best solutions found during the run.
- `hall_of_fame_counter.txt` – Frequency count of hall-of-fame solutions.
Results Folder
The results/ folder contains aggregated analysis and visualizations:
- `analysis_by_number_of_variables.png` – Key plot that helps users make informed decisions when balancing dimensionality reduction and clustering quality.
- `3D_plot_vars_clusters_fitness.png` – 3D plot of variables, clusters, and fitness values.
- `dictionaries_variables_weight_analysis.json` – Variable selection importances and related metrics, as described in the paper.
- `optimal_variable_selection.csv` – Selected optimal subset of features.
- `optimal_variable_selection_and_number_of_clusters.txt` – Recommended feature subset and number of clusters.
- `variable_significances.csv` – Weight of each variable.
Comparison Plots
- `comparison_fitness_vs_given_metric.png` – Fitness values of solutions compared against a target metric (e.g., AMI). Generated with `get_plot_comparing_solution_with_another_metric`.
Synthetic Data Generators
In addition, GAUFS provides two types of synthetic data generators for clustering benchmarking, as presented in the paper.
Note: Points within each cluster are scattered around the cluster center, either following a normal distribution or a uniform distribution within a maximum radius.
DataSpheres Generator
Generates ball-shaped clusters with centers distributed across the feature space:
```python
from gaufs import DataGenerator

# Generate ball-shaped clusters
data_balls = DataGenerator.generate_data_spheres(
    num_useful_features=5,
    num_clusters=4,
    num_samples_per_cluster=200,
    num_dummy_unif=10,  # Add 10 uniform noise features
    num_dummy_beta=5,   # Add 5 beta-distributed noise features
    seed=42,
)
```
DataCorners Generator
Creates simplex-structured clusters whose centers form orthogonal vertices:
```python
# Generate simplex-structured clusters
data_corners = DataGenerator.generate_data_corners(
    num_useful_features=3,  # Will create 4 clusters (n + 1)
    num_samples_per_cluster=150,
    num_dummy_unif=5,
    seed=42,
)
```
Key Differences:
- DataSpheres: Cluster centers are placed on a grid in the feature space – good for general clustering scenarios
- DataCorners: Clusters form a simplex structure – useful for testing dimensionality reduction and feature selection, as clusters are well separated when projected onto the useful dimensions and none of the `num_useful_features` is redundant.
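To make the simplex structure concrete, here is a stdlib-only conceptual sketch, not the `DataGenerator` implementation: with n useful features, the n + 1 cluster centers can be taken as the origin plus the n one-hot unit vectors, and points are then scattered around each center (here with normal noise, one of the two scattering modes mentioned above). All names and the noise scale are illustrative assumptions:

```python
import random

def corner_centers(num_useful_features: int) -> list[list[float]]:
    """Origin plus one unit vector per axis: n + 1 simplex vertices."""
    origin = [0.0] * num_useful_features
    units = [[1.0 if j == i else 0.0 for j in range(num_useful_features)]
             for i in range(num_useful_features)]
    return [origin] + units

def sample_cluster(center, num_samples, sigma=0.05, seed=42):
    """Scatter points around a center with Gaussian noise in each dimension."""
    rng = random.Random(seed)
    return [[c + rng.gauss(0.0, sigma) for c in center] for _ in range(num_samples)]

centers = corner_centers(3)  # 4 centers for 3 useful features
print(len(centers))  # 4
```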
Custom Fitness
Clustering Algorithms
GAUFS provides built-in clustering algorithms and supports custom implementations through class extension.
Available clustering methods:
- `HierarchicalExperiment` (default) – Agglomerative clustering with Ward, Complete, Average, or Single linkage
- `KmeansExperiment` – K-means clustering

You can extend the `ClusteringExperiment` base class to integrate any clustering algorithm.
Evaluation Metrics
GAUFS supports both internal and external metrics for evaluating clustering quality, and allows custom metric implementation.
Internal Metrics (unsupervised - don't require true labels):
- `SilhouetteScore` (default)
- `CalinskiHarabaszScore`
- `DaviesBouldinScore`
- `DaviesBouldinScoreForMaximization`
- `DunnScore`
- `SSEScore`
- `SSEScoreForMaximization`
External Metrics (supervised - require true labels for evaluation):
- `AdjustedRandIndexScore`
- `AdjustedMutualInformationScore`
- `NMIScore`
- `VMeasureScore`
- `FowlkesMallowsScore`
- `FScore`
- `HScore`
- `Chi2DobPertScore`
Key difference: Internal metrics optimize clustering without labels (true unsupervised learning), while external metrics are used for validation and comparison when ground truth is available.
Note: You can extend the EvaluationMetric base class to implement custom metrics.
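As a concrete illustration of what an internal metric measures, here is a stdlib-only sketch of within-cluster SSE (lower is better). This is independent of GAUFS's own `SSEScore` implementation; the function name and sample data are assumptions for illustration:

```python
def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    sse = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        sse += sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in members)
    return sse

# Two tight 1D clusters: each contributes 0.02, so the total is ~0.04
pts = [[0.0], [0.2], [10.0], [10.2]]
print(within_cluster_sse(pts, [0, 0, 1, 1]))
```

Because it needs no true labels, a score like this can drive the genetic search itself, while the external metrics above serve only for validation afterwards.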
Examples
Two comprehensive demo scripts are provided to illustrate GAUFS capabilities:
Demo 1: Basic Usage with Corner Distribution (demo/demo1.py)
This example demonstrates the standard GAUFS workflow using synthetic data with a simplex (corner) structure:
- Data characteristics:
  - 4 useful clustering features
  - 2 uniform noise features + 2 beta-distributed noise features
  - 3 clusters forming a corner/simplex structure
  - 50 samples per cluster (150 total)
- Workflow:
  1. Generates synthetic data using `DataGenerator.generate_data_corners()`
  2. Runs GAUFS with default settings (unsupervised mode)
  3. Compares results against ground truth using Adjusted Mutual Information
  4. Produces visualization plots and analysis outputs
Demo 2: Advanced Configuration with Spherical Clusters (demo/demo2.py)
This example showcases GAUFS in a supervised scenario with custom configuration:
- Data characteristics:
  - 2 useful clustering features
  - 4 clusters with spherical distribution
  - 1 uniform noise feature + 1 beta-distributed noise feature
  - 50 samples per cluster (200 total)
- Advanced features demonstrated:
  - Custom clustering method (`KmeansExperiment`)
  - External evaluation metric (AMI with known labels)
  - Tighter cluster search range (3–5 clusters). As explained in the paper, we recommend not reducing the cluster search range to a single value, even when the number of true labels is known.
  - Comparison with alternative metrics (NMI)
To run the demos:
```shell
# Clone the repository
git clone https://github.com/salva24/GAUFS.git
cd GAUFS

# Install the package
pip install -e .

# Run demo 1 (corners and basic usage)
python demo/demo1.py

# Run demo 2 (spheres and advanced configuration)
python demo/demo2.py
```
Both demos generate comprehensive outputs including plots, analysis files, and performance metrics in the `examples/out/` directory.
Documentation
GAUFS includes comprehensive Sphinx documentation. To build the documentation locally:
```shell
# Clone the repository
git clone https://github.com/salva24/GAUFS.git
cd GAUFS

# Install the package with documentation dependencies
pip install -e ".[docs]"

# Build the HTML documentation
python -m sphinx -b html docs/source docs/build/html

# Open the documentation in your browser
# On Linux/Mac:
open docs/build/html/index.html
# On Windows:
start docs/build/html/index.html
```
Project Structure
```
gaufs/
├── src/ # Source code folder
│ └── gaufs/ # Main package
│ ├── __init__.py # Main API exports
│ ├── gaufs.py # Core GAUFS algorithm
│ ├── data_generator.py # Synthetic data generators
│ ├── clustering_experiments/ # Clustering implementations
│ │ ├── __init__.py
│ │ ├── base.py # Base class
│ │ ├── hierarchical.py # Hierarchical clustering
│ │ └── kmeans.py # K-means clustering
│ ├── evaluation_metrics/ # Evaluation metrics
│ │ ├── __init__.py
│ │ ├── base.py # Base class
│ │ ├── external.py # External metrics (ARI, AMI, etc.)
│ │ ├── internal.py # Internal metrics (Silhouette, etc.)
│ │ └── utils.py # Private utility functions
│ ├── genetic_search.py # Private Genetic Algorithm implementation
│ └── utils.py # Private helper functions and functions to read csv files
├── tests/ # Tests
│ └── test_main.py # Execution test
├── examples/ # Demo examples
│ ├── demo1.py # First demo script
│ ├── datasets/ # Folder for datasets
│ └── out/ # Folder for GAUFS output and results
├── docs/ # Documentation
├── .github/ # GitHub workflows
│ └── workflows/
│ ├── publish.yml # Publishing workflow
│ └── tests-build.yml # CI tests workflow
├── LICENSE # Apache 2.0 License
├── NOTICE.txt # Attribution information
├── README.md # This file
├── pyproject.toml # Package setup
└── .gitignore # Git ignore rules
```
Acknowledgments
This work has been developed by researchers from MINERVA AI-Lab, Institute of Computer Engineering, University of Seville, Spain.
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Additional attribution and authorship information is provided in the NOTICE file.
Library Authors and Contact Information
Author: Salvador de la Torre Gonzalez
Email: delatorregonzalezsalvador at gmail.com
Co-authors:
- Antonio Bello Castro
- José M. Núñez Portero
Support
For questions, issues, or feature requests of this open-source software:
- Open an issue on GitHub
- Contact the author via email
Happy Clustering! 🧬📊