Genetic Algorithm for Unsupervised Feature Selection for Clustering

GAUFS

GAUFS (Genetic Algorithm for Unsupervised Feature Selection) is a Python library for unsupervised feature selection designed to identify the most relevant features for clustering without requiring labeled data. It combines genetic algorithms with clustering experiments to perform dimensionality reduction while simultaneously estimating the optimal number of clusters.

This library accompanies the research work presented in the paper:

GAUFS: Genetic Algorithm for Unsupervised Feature Selection for Clustering

Note: To reproduce the results presented in the paper and the experimental setup used for comparison with alternative methods, please use the paper-reproducibility branch of this repository.

Table of Contents

Key Features
Installation
Quick Start: Basic Gaufs Usage
How GAUFS Works
Main Configuration Parameters for GAUFS
Output Files
Synthetic Data Generators
- DataSpheres Generator
- DataCorners Generator
Custom Fitness
- Clustering Algorithms
- Evaluation Metrics
Examples
- Demo 1: Basic Usage with Corner Distribution
- Demo 2: Advanced Configuration with Spherical Clusters
Documentation
Project Structure
Acknowledgments
License
Library Authors and Contact Information
Support

Key Features

Fully Unsupervised: No labeled data required for feature selection
Automatic Cluster Estimation: Simultaneously identifies optimal features and number of clusters
Flexible Architecture: GAUFS can work with custom clustering algorithms and evaluation metrics, allowing optimization of internal metrics (without relying on labels) and optionally external metrics when true labels are available for evaluation.
Synthetic Data Generators: Includes the Spheres and Corners generators introduced in the paper, designed for testing feature selection under controlled clustering scenarios and benchmarking.
Comprehensive Output: Automatic generation of plots, CSV files, and JSON results
Reproducible: Seed-based random state control for consistent results

Installation

GAUFS is available on PyPI and can be installed using pip:

pip install gaufs

Requirements:

Python >=3.11,<3.14
numpy >=2.4.0,<3.0.0
pandas >=2.3.3,<3.0.0
scipy >=1.16.3,<2.0.0
matplotlib >=3.10.8,<4.0.0
scikit-learn >=1.8.0,<2.0.0
DEAP >=1.4.3,<2.0.0 (used for the genetic algorithm)

Quick Start: Basic Gaufs Usage

import pandas as pd
from gaufs import Gaufs

# Load your unlabeled data
data = pd.read_csv('your_data.csv')

# Initialize GAUFS with default parameters
gaufs = Gaufs(unlabeled_data=data)

# Run the complete algorithm
optimal_solution, fitness = gaufs.run()

# Extract results
selected_features = optimal_solution[0]  # Binary list (1=selected, 0=not selected)
optimal_clusters = optimal_solution[1]   # Optimal number of clusters

print(f"Selected {sum(selected_features)} out of {len(selected_features)} features")
print(f"Optimal number of clusters: {optimal_clusters}")
print(f"Fitness score: {fitness}")
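
To turn the binary mask into column names, a small helper like the one below may be useful. It is a sketch that assumes the mask order matches the column order of the input DataFrame; `selected_column_names` is a hypothetical helper written for illustration, not part of the GAUFS API.

```python
import pandas as pd


def selected_column_names(data: pd.DataFrame, mask) -> list:
    """Return the column names flagged with 1 in a binary feature mask.

    Hypothetical helper: assumes mask[i] refers to data.columns[i].
    """
    if len(mask) != data.shape[1]:
        raise ValueError("mask length must equal the number of columns")
    return [column for column, keep in zip(data.columns, mask) if keep]


# Toy data standing in for 'your_data.csv'
data = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
names = selected_column_names(data, [1, 0, 1])
reduced = data[names]  # DataFrame restricted to the selected features
```

The reduced DataFrame can then be passed to any downstream clustering step.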

How GAUFS Works

GAUFS operates in two main phases:

1. Genetic Search Phase

Runs multiple independent genetic algorithm executions
Each execution evolves feature subsets across different numbers of clusters
Evaluates clustering quality using the specified metric (default: Silhouette Score)
Computes variable significance scores based on selection frequency and quality
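
The evaluation step in this phase can be sketched with scikit-learn alone. The snippet below scores one candidate, a binary feature mask plus a number of clusters, using Ward hierarchical clustering and the Silhouette Score, which are the documented defaults. It is an illustration of the idea only: `candidate_fitness` is a hypothetical function, and the library's internal evaluation may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def candidate_fitness(X, mask, n_clusters):
    """Silhouette of a Ward clustering computed on the masked features only."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return -1.0  # no feature selected: return the worst silhouette value
    X_sub = X[:, mask]
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    labels = model.fit_predict(X_sub)
    return float(silhouette_score(X_sub, labels))


# Toy data: one informative feature (two tight groups) and one noise feature
rng = np.random.default_rng(0)
informative = np.concatenate(
    [rng.normal(0.0, 0.1, (30, 1)), rng.normal(5.0, 0.1, (30, 1))]
)
noise = rng.uniform(0.0, 5.0, (60, 1))
X = np.hstack([informative, noise])

score_useful = candidate_fitness(X, [1, 0], n_clusters=2)
score_noise = candidate_fitness(X, [0, 1], n_clusters=2)
```

On this toy data the mask that keeps only the informative feature scores higher than the mask that keeps only the noise feature, which is the signal a genetic search can exploit.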

2. Variable Weight Analysis Phase

Analyzes results from all genetic searches
Combines fitness values and significance thresholds using weighted averaging
Applies exponential decay to importance differences
Automatically selects the optimal feature subset and number of clusters
Outputs metric graphs that help users balance dimensionality reduction against clustering quality
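
The two numeric ingredients of this phase can be sketched as plain functions. The decay expression follows the formula given for `exponential_decay_factor` in the parameter table, δ_i / (1 + (N / exp(exponential_decay_factor * i))). What δ_i, N, and i stand for is defined in the paper, so here they are ordinary arguments. The convex combination in `weighted_importance` is an assumption about what "weighted averaging" with `fitness_weight_over_threshold` means; both function names are hypothetical.

```python
import math


def decayed_difference(delta_i, n, i, exponential_decay_factor=1.0):
    """delta_i / (1 + n / exp(factor * i)), as written in the parameter table."""
    if exponential_decay_factor < 0.0:
        raise ValueError("exponential_decay_factor must be >= 0.0")
    return delta_i / (1.0 + n / math.exp(exponential_decay_factor * i))


def weighted_importance(fitness, threshold, fitness_weight_over_threshold=0.5):
    """Assumed convex combination of a fitness value and a threshold value."""
    w = fitness_weight_over_threshold
    if not 0.0 <= w <= 1.0:
        raise ValueError("fitness_weight_over_threshold must be in [0.0, 1.0]")
    return w * fitness + (1.0 - w) * threshold


# The same raw difference is damped strongly at small i and less at larger i
values = [decayed_difference(1.0, n=4, i=i) for i in range(4)]

# With a factor of 0 the result no longer depends on i
flat = [decayed_difference(1.0, n=4, i=i, exponential_decay_factor=0.0) for i in range(4)]
```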

The algorithm produces comprehensive outputs including:

Selected feature subset
Optimal number of clusters
Fitness scores and significance metrics
Visualization plots (2D and 3D)
Detailed CSV files and JSON dictionaries

Main Configuration Parameters for GAUFS

Parameter	Type	Default	Description
`seed`	int	None	Random seed for reproducibility. Default: random integer between 0 and 10000 if None.
`unlabeled_data`	pd.DataFrame or None	None	Input dataset without labels. If None, creates empty DataFrame.
`num_genetic_executions`	int	1	Number of independent Genetic Algorithm runs. Must be ≥ 1.
`ngen`	int	150 (auto 150 if `num_vars` ≤ 100, else 300)	Number of generations per GA execution. Must be ≥ 1.
`npop`	int	1500 (auto 1500 if `num_vars` ≤ 100, else 7000)	Population size. Must be ≥ 1.
`cxpb`	float	0.8	Crossover probability for genetic operations. Range: [0.0, 1.0].
`cxpb_rest_of_genes`	float	0.5	Crossover probability for the rest of generations after initial ones. Range: [0.0, 1.0].
`mutpb`	float	0.1	Mutation probability for genetic operations. Range: [0.0, 1.0].
`convergence_generations`	int	50	Generations without improvement before early stopping. Must be ≥ 1.
`hof_size`	int or None	None	Hall of Fame size (absolute number of best solutions to retain). Overrides `hof_alpha_beta` if provided. Must be ≥ 1 or None.
`hof_alpha_beta`	tuple	(0.1, 0.2)	`(alpha, beta)` used for automatic Hall of Fame size calculation if `hof_size` is None. Range: [0.0, 1.0], beta ≥ alpha.
`clustering_method`	ClusteringExperiment	HierarchicalExperiment(linkage='ward')	Clustering algorithm instance. Must implement `ClusteringExperiment`.
`evaluation_metric`	EvaluationMetric	SilhouetteScore()	Metric for evaluating clustering quality. Must implement `EvaluationMetric`.
`cluster_number_search_band`	tuple	(2, 26)	Range of cluster numbers to explore as (min_inclusive, max_exclusive). Must satisfy 2 ≤ min < max ≤ number of samples.
`fitness_weight_over_threshold`	float	0.5	Weight for fitness vs threshold in variable importance computation. Range: [0.0, 1.0].
`exponential_decay_factor`	float	1.0	Exponential decay factor for automatic solution selector. 0 means no decay. Formula: δ_i / (1 + (N / exp(exponential_decay_factor * i))). Must be ≥ 0.0.
`max_number_selections_for_ponderation`	int or None	2 * num_vars	Max selections from Hall of Fame for weight computation. Must be ≥ 1 or None.
`verbose`	bool	True	Whether to print logs during execution.
`generate_genetics_log_files`	bool	True	Whether to generate log files with GA execution details.
`graph_evolution`	bool	True	Whether to generate graphs of best and average fitness during GA evolution.
`generate_files_with_results`	bool	True	Whether to generate files with results and plots.
`output_directory`	str or None	"./out/" if None	Path to store generated files including plots.
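
The documented ranges can be checked before a long run starts. The sketch below validates a configuration dictionary against the constraints listed in the table; `check_gaufs_config` is a hypothetical helper, not a GAUFS function, and it covers only a subset of the parameters.

```python
def check_gaufs_config(config, num_samples):
    """Return a list of violations of the ranges documented in the table."""
    problems = []
    for name in ("cxpb", "cxpb_rest_of_genes", "mutpb", "fitness_weight_over_threshold"):
        value = config.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be in [0.0, 1.0]")
    for name in ("num_genetic_executions", "ngen", "npop", "convergence_generations"):
        value = config.get(name)
        if value is not None and value < 1:
            problems.append(f"{name} must be >= 1")
    hof = config.get("hof_alpha_beta")
    if hof is not None:
        alpha, beta = hof
        if not 0.0 <= alpha <= beta <= 1.0:
            problems.append("hof_alpha_beta needs 0.0 <= alpha <= beta <= 1.0")
    band = config.get("cluster_number_search_band")
    if band is not None:
        low, high = band
        if not 2 <= low < high <= num_samples:
            problems.append("cluster_number_search_band needs 2 <= min < max <= samples")
    decay = config.get("exponential_decay_factor")
    if decay is not None and decay < 0.0:
        problems.append("exponential_decay_factor must be >= 0.0")
    return problems


config = {
    "seed": 42,
    "num_genetic_executions": 3,
    "cxpb": 0.8,
    "mutpb": 0.1,
    "cluster_number_search_band": (3, 6),  # max is exclusive
}
problems = check_gaufs_config(config, num_samples=200)
explored = list(range(*config["cluster_number_search_band"]))  # cluster counts 3, 4, 5
```

If the list of problems is empty, the dictionary can presumably be forwarded as `Gaufs(unlabeled_data=data, **config)`, assuming the table entries are constructor keyword arguments in the same way `unlabeled_data` is in the Quick Start.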

Output Files

All outputs are automatically saved under the specified output_directory (default ./out/), organized by GA run and type of analysis.

GA Execution Folders

Each independent GA run with a specific random seed creates a folder named GA_Seed_<seed>/ containing:

fitness_evolution.png – Evolution of fitness across generations.
genetic_algorithm_log.txt – Detailed log of the GA execution.
hall_of_fame.txt – Best solutions found during the run.
hall_of_fame_counter.txt – Frequency count of hall-of-fame solutions.

Results Folder

The results/ folder contains aggregated analysis and visualizations:

analysis_by_number_of_variables.png – Key file that helps users make informed decisions when balancing dimensionality reduction and clustering quality.
3D_plot_vars_clusters_fitness.png – 3D plot of variables, clusters, and fitness values.
dictionaries_variables_weight_analysis.json – Variable selection importances and related metrics as described in the paper.
optimal_variable_selection.csv – Selected optimal subset of features.
optimal_variable_selection_and_number_of_clusters.txt – Recommended feature subset and number of clusters.
variable_significances.csv – Weight of each variable.

Comparison Plots

comparison_fitness_vs_given_metric.png – Shows the fitness values of solutions compared to a target metric (e.g., AMI). Generated with get_plot_comparing_solution_with_another_metric.
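
To check that a run produced everything listed above, the expected paths can be assembled with `pathlib`. This sketch assumes that the `GA_Seed_<seed>/` folders and the `results/` folder sit directly under `output_directory`; the exact nesting is not stated here, and both helpers are hypothetical.

```python
from pathlib import Path

GA_RUN_FILES = (
    "fitness_evolution.png",
    "genetic_algorithm_log.txt",
    "hall_of_fame.txt",
    "hall_of_fame_counter.txt",
)
RESULT_FILES = (
    "analysis_by_number_of_variables.png",
    "3D_plot_vars_clusters_fitness.png",
    "dictionaries_variables_weight_analysis.json",
    "optimal_variable_selection.csv",
    "optimal_variable_selection_and_number_of_clusters.txt",
    "variable_significances.csv",
)


def expected_output_files(output_directory="./out/", seeds=()):
    """List the documented output paths for the given GA seeds."""
    root = Path(output_directory)
    paths = [root / f"GA_Seed_{seed}" / name for seed in seeds for name in GA_RUN_FILES]
    paths += [root / "results" / name for name in RESULT_FILES]
    return paths


def missing_output_files(output_directory="./out/", seeds=()):
    """Return the documented paths that do not exist on disk."""
    return [p for p in expected_output_files(output_directory, seeds) if not p.exists()]


paths = expected_output_files("./out/", seeds=[42])
```

Some of these files are only written when the corresponding flags (`generate_genetics_log_files`, `graph_evolution`, `generate_files_with_results`) are enabled, so a missing file is not necessarily an error.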

Synthetic Data Generators

In addition, GAUFS provides two types of synthetic data generators for clustering benchmarking, as presented in the paper.

Note: Points within each cluster are scattered around the cluster center, either following a normal distribution or a uniform distribution within a maximum radius.

DataSpheres Generator

Generates ball-shaped clusters with centers distributed across the feature space:

from gaufs import DataGenerator

# Generate ball-shaped clusters
data_balls = DataGenerator.generate_data_spheres(
    num_useful_features=5,
    num_clusters=4,
    num_samples_per_cluster=200,
    num_dummy_unif=10,    # Add 10 uniform noise features
    num_dummy_beta=5,     # Add 5 beta-distributed noise features
    seed=42
)
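
The note above on how points scatter around a cluster center can be illustrated with NumPy. The functions below draw points either uniformly inside a ball of a given radius or from an isotropic normal distribution. They are a minimal sketch of the two distributions, not the generator's actual implementation, and the names are hypothetical.

```python
import numpy as np


def scatter_uniform_ball(center, radius, num_points, rng):
    """Draw points uniformly inside a ball of the given radius around center."""
    center = np.asarray(center, dtype=float)
    dim = center.size
    # Random directions: normalized Gaussian vectors are uniform on the sphere
    directions = rng.normal(size=(num_points, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii scaled by U^(1/dim) so that the ball's volume is covered uniformly
    radii = radius * rng.uniform(size=(num_points, 1)) ** (1.0 / dim)
    return center + directions * radii


def scatter_normal(center, std, num_points, rng):
    """Draw points from an isotropic normal distribution around center."""
    center = np.asarray(center, dtype=float)
    return rng.normal(loc=center, scale=std, size=(num_points, center.size))


rng = np.random.default_rng(42)
center = np.array([1.0, 2.0, 3.0])
ball_points = scatter_uniform_ball(center, radius=0.5, num_points=200, rng=rng)
normal_points = scatter_normal(center, std=0.2, num_points=200, rng=rng)
distances = np.linalg.norm(ball_points - center, axis=1)
```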

DataCorners Generator

Creates simplex-structured clusters whose centers form orthogonal vertices:

# Generate simplex-structured clusters
data_corners = DataGenerator.generate_data_corners(
    num_useful_features=3,  # Will create 4 clusters (n+1)
    num_samples_per_cluster=150,
    num_dummy_unif=5,
    seed=42
)

Key Differences:

DataSpheres: Clusters are placed in a grid in the feature space - good for general clustering scenarios
DataCorners: Clusters form a simplex structure - useful for testing dimensionality reduction and feature selection as clusters are well-separated when projected onto useful dimensions and none of the num_useful_features is redundant.
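
One plausible reading of the corner layout, offered as an assumption rather than the generator's actual code, places one center at the origin and one along each useful axis. That gives n+1 centers for n useful features and shows why no useful feature is redundant: dropping any one of them makes two centers coincide.

```python
import numpy as np


def corner_centers(num_useful_features, edge_length=1.0):
    """Hypothetical layout: the origin plus one center on each coordinate axis."""
    origin = np.zeros((1, num_useful_features))
    axes = edge_length * np.eye(num_useful_features)
    return np.vstack([origin, axes])


centers = corner_centers(3)  # 4 centers in 3 dimensions

# Drop the first useful feature: the center on that axis falls onto the origin
projected = np.delete(centers, 0, axis=1)
distinct_after_drop = len(np.unique(projected, axis=0))
```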

Custom Fitness

Clustering Algorithms

GAUFS provides built-in clustering algorithms and supports custom implementations through class extension.

Available clustering methods:

HierarchicalExperiment (default) - Agglomerative clustering with Ward, Complete, Average or Single linkage
KmeansExperiment - K-means clustering
You can extend the ClusteringExperiment base class to integrate any clustering algorithm.

Evaluation Metrics

GAUFS supports both internal and external metrics for evaluating clustering quality, and allows custom metric implementation.

Internal Metrics (unsupervised - don't require true labels):

SilhouetteScore (default)
CalinskiHarabaszScore
DaviesBouldinScore
DaviesBouldinScoreForMaximization
DunnScore
SSEScore
SSEScoreForMaximization

External Metrics (supervised - require true labels for evaluation):

AdjustedRandIndexScore
AdjustedMutualInformationScore
NMIScore
VMeasureScore
FowlkesMallowsScore
FScore
HScore
Chi2
DobPertScore

Key difference: Internal metrics optimize clustering without labels (true unsupervised learning), while external metrics are used for validation and comparison when ground truth is available.

Note: You can extend the EvaluationMetric base class to implement custom metrics.
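
The internal versus external distinction can be seen with the scikit-learn functions that correspond to two of the metrics listed above. This sketch calls scikit-learn directly and does not use the GAUFS metric classes: the Silhouette Score needs only the data and the predicted labels, while Adjusted Mutual Information also needs the ground truth.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

# Two well-separated groups with known membership
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (40, 2)), rng.normal(4.0, 0.2, (40, 2))])
true_labels = np.repeat([0, 1], 40)

predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metric: computed from the data and the predicted labels only
internal = silhouette_score(X, predicted)

# External metric: compares predicted labels against the ground truth
external = adjusted_mutual_info_score(true_labels, predicted)
```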

Examples

Two comprehensive demo scripts are provided to illustrate GAUFS capabilities:

Demo 1: Basic Usage with Corner Distribution (`demo/demo1.py`)

This example demonstrates the standard GAUFS workflow using synthetic data with a simplex (corner) structure:

Data characteristics:
- 4 useful clustering features
- 2 uniform noise features + 2 beta-distributed noise features
- 3 clusters forming a corner/simplex structure
- 50 samples per cluster (150 total)
Workflow:
- Generates synthetic data using DataGenerator.generate_data_corners()
- Runs GAUFS with default settings (unsupervised mode)
- Compares results against ground truth using Adjusted Mutual Information
- Produces visualization plots and analysis outputs

Demo 2: Advanced Configuration with Spherical Clusters (`demo/demo2.py`)

This example showcases GAUFS in a supervised scenario with custom configuration:

Data characteristics:
- 2 useful clustering features
- 4 clusters with spherical distribution
- 1 uniform noise feature + 1 beta-distributed noise feature
- 50 samples per cluster (200 total)
Advanced features demonstrated:
- Custom clustering method (KmeansExperiment)
- External evaluation metric (AMI with known labels)
- Tighter cluster search range (3–5 clusters). As explained in the paper, we recommend not reducing the cluster search range to a single value, even when the number of true labels is known.
- Comparison with alternative metrics (NMI)

To run the demos:

# Clone the repository
git clone https://github.com/salva24/GAUFS.git
cd GAUFS

# Install the package
pip install -e .

# Run demo 1 (corners and basic usage)
python demo/demo1.py

# Run demo 2 (spheres and advanced configuration)
python demo/demo2.py

Both demos generate comprehensive outputs including plots, analysis files, and performance metrics in the examples/out/ directory.

Documentation

GAUFS includes comprehensive Sphinx documentation. To build the documentation locally:

# Clone the repository
git clone https://github.com/salva24/GAUFS.git
cd GAUFS

# Install the package with documentation dependencies
pip install -e ".[docs]"

# Build the HTML documentation
python -m sphinx -b html docs/source docs/build/html

# Open the documentation in your browser
# On macOS:
open docs/build/html/index.html
# On Linux:
xdg-open docs/build/html/index.html
# On Windows:
start docs\build\html\index.html

Project Structure

gaufs/
├── src/                               # Source code folder
│   └── gaufs/                         # Main package
│       ├── __init__.py                # Main API exports
│       ├── gaufs.py                   # Core GAUFS algorithm
│       ├── data_generator.py          # Synthetic data generators
│       ├── clustering_experiments/    # Clustering implementations
│       │   ├── __init__.py
│       │   ├── base.py                # Base class
│       │   ├── hierarchical.py        # Hierarchical clustering
│       │   └── kmeans.py              # K-means clustering
│       ├── evaluation_metrics/        # Evaluation metrics
│       │   ├── __init__.py
│       │   ├── base.py                # Base class
│       │   ├── external.py            # External metrics (ARI, AMI, etc.)
│       │   ├── internal.py            # Internal metrics (Silhouette, etc.)
│       │   └── utils.py               # Private utility functions
│       ├── genetic_search.py          # Private Genetic Algorithm implementation
│       └── utils.py                   # Private helper functions and functions to read csv files
├── tests/                             # Tests
│   └── test_main.py                   # Execution test
├── examples/                          # Demo examples
│   ├── demo1.py                       # First demo script
│   ├── datasets/                      # Folder for datasets
│   └── out/                           # Folder for GAUFS output and results
├── docs/                              # Documentation
├── .github/                           # GitHub workflows
│   └── workflows/
│       ├── publish.yml                # Publishing workflow
│       └── tests-build.yml            # CI tests workflow
├── LICENSE                            # Apache 2.0 License
├── NOTICE.txt                         # Attribution information
├── README.md                          # This file
├── pyproject.toml                     # Package setup
└── .gitignore                         # Git ignore rules

Acknowledgments

This work has been developed by researchers from MINERVA AI-Lab, Institute of Computer Engineering, University of Seville, Spain.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Additional attribution and authorship information is provided in the NOTICE file.

Library Authors and Contact Information

Author: Salvador de la Torre Gonzalez
Email: delatorregonzalezsalvador at gmail.com

Co-authors:

Antonio Bello Castro
José M. Núñez Portero

Support

For questions, issues, or feature requests about this open-source software:

Open an issue on GitHub
Contact the author via email

Happy Clustering! 🧬📊
