
A package for genetic algorithm-based feature selection with a Random Forest classifier

Project description

GA Optimization

A package for genetic algorithm-based feature selection.

Overview

GA Optimization is a Python package that uses a genetic algorithm to perform feature selection. It helps identify the most relevant features for a machine learning model, which can improve model performance and reduce overfitting.

How It Works

Genetic Algorithms (GAs) are inspired by the process of natural selection and are used to find approximate solutions to optimization and search problems. Here's a brief overview of how the GA in this package works:

Initialization

The GA starts by initializing a population of individuals. Each individual represents a potential solution and is encoded as a list of binary values (genes). Each gene in the individual corresponds to a feature in the dataset. A value of 1 indicates that the feature is selected, while a value of 0 indicates that the feature is not selected.
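
The encoding described above can be sketched in plain Python. This is an illustration only, not the package's own code: the function names, the feature names, and the seed are all hypothetical.

```python
import random

def random_individual(n_features, rng):
    """Return n_features random bits; 1 = feature selected, 0 = not selected."""
    return [rng.randint(0, 1) for _ in range(n_features)]

def init_population(pop_size, n_features, rng):
    """Build a population of pop_size random individuals."""
    return [random_individual(n_features, rng) for _ in range(pop_size)]

def selected_features(individual, feature_names):
    """Map an individual's bits back to the names of the selected features."""
    return [name for bit, name in zip(individual, feature_names) if bit == 1]

rng = random.Random(42)  # seeded only to make the sketch reproducible
feature_names = ["age", "income", "height", "weight"]  # hypothetical columns
population = init_population(pop_size=5, n_features=len(feature_names), rng=rng)
```

Each individual has exactly one gene per feature, so an individual such as `[1, 0, 1, 0]` would select `age` and `height` in this hypothetical dataset.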

Fitness Evaluation

The fitness of each individual is evaluated using a fitness function. In this package, the fitness function trains a Random Forest classifier using only the selected features and evaluates its accuracy on the training data. The accuracy score is used as the fitness value.
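
A rough sketch of such a fitness function follows. To stay dependency-free, the model is passed in as `train_and_score`, a hypothetical callable that stands in for "train a Random Forest on these columns and return its accuracy". The guard for an individual with no selected features and the toy scorer are assumptions of this sketch, not documented package behaviour. The one-element tuple follows the DEAP convention for fitness values.

```python
def mask_columns(rows, individual):
    """Keep only the columns whose gene is 1."""
    return [[value for value, bit in zip(row, individual) if bit == 1]
            for row in rows]

def evaluate(individual, rows, labels, train_and_score):
    """Score an individual; returns a one-element tuple (DEAP convention).

    train_and_score is a hypothetical callable (rows, labels) -> float.
    """
    if sum(individual) == 0:
        # Assumption: an individual that selects no features gets the worst score.
        return (0.0,)
    reduced = mask_columns(rows, individual)
    return (train_and_score(reduced, labels),)

def toy_score(reduced_rows, labels):
    """Stand-in scorer: fraction of rows whose first kept column equals the label."""
    hits = sum(1 for row, label in zip(reduced_rows, labels) if row[0] == label)
    return hits / len(labels)

rows = [[0, 1, 5], [1, 0, 7], [1, 1, 2], [0, 0, 9]]  # toy data, 3 features
labels = [0, 1, 1, 0]
fitness = evaluate([1, 0, 0], rows, labels, toy_score)
```

Because the score described above is measured on the training data, it may reward feature subsets that overfit; scoring on held-out data or with cross-validation is a common alternative.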

Selection

Selection is the process of choosing individuals from the current population to create offspring for the next generation. This package uses tournament selection, where a subset of individuals is chosen at random, and the best individual from this subset is selected.
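
Tournament selection can be sketched as below. The function names and the `tournament_size` parameter are hypothetical, and this version draws the contenders of one tournament without replacement, which is an assumption rather than a statement about the package.

```python
import random

def tournament_select(population, fitnesses, tournament_size, rng):
    """Pick tournament_size random contenders and return the fittest one."""
    contenders = rng.sample(range(len(population)), tournament_size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]

def select(population, fitnesses, k, tournament_size, rng):
    """Run k independent tournaments to choose k parents."""
    return [tournament_select(population, fitnesses, tournament_size, rng)
            for _ in range(k)]

rng = random.Random(0)
population = [[0, 0, 0], [1, 0, 1], [1, 1, 1]]
fitnesses = [0.2, 0.9, 0.5]  # toy accuracy scores
parents = select(population, fitnesses, k=4, tournament_size=2, rng=rng)
```

A larger tournament size increases selection pressure: fitter individuals win more often, at the cost of less diversity among the parents.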

Crossover (Recombination)

Crossover is the process of combining two parent individuals to create offspring. This package uses two-point crossover, where two points are selected on the parent individuals' genes, and the genes between these points are swapped.
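
As an illustration, two-point crossover on list-encoded individuals might look like this. The function name is hypothetical, and the sketch assumes both parents have the same length of at least three genes.

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the genes between two random cut points; returns two new children."""
    size = min(len(parent_a), len(parent_b))
    # Two distinct cut points strictly inside the individual.
    point1, point2 = sorted(rng.sample(range(1, size), 2))
    child_a = parent_a[:point1] + parent_b[point1:point2] + parent_a[point2:]
    child_b = parent_b[:point1] + parent_a[point1:point2] + parent_b[point2:]
    return child_a, child_b

rng = random.Random(1)
parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1, 1]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
```

With an all-zeros and an all-ones parent, the swapped segment is easy to see: `child_a` contains a single contiguous run of ones, and `child_b` is its complement.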

Mutation

Mutation introduces random changes to an individual's genes to maintain genetic diversity within the population. This package uses flip-bit mutation, where each gene has a probability of being flipped (i.e., changed from 0 to 1 or from 1 to 0).
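
A minimal sketch of flip-bit mutation is shown below. The function name and the `flip_prob` parameter are hypothetical, and the per-gene probability the package actually uses is not stated here.

```python
import random

def flip_bit_mutation(individual, flip_prob, rng):
    """Return a copy in which each gene is flipped independently with flip_prob."""
    return [1 - gene if rng.random() < flip_prob else gene
            for gene in individual]

rng = random.Random(7)
original = [1, 0, 1, 1, 0]
mutated = flip_bit_mutation(original, flip_prob=0.2, rng=rng)
```

With `flip_prob=0.0` the individual is returned unchanged, and with `flip_prob=1.0` every gene is flipped; small values in between keep most of the individual intact while still introducing variation.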

Genetic Algorithm Process

The GA iterates through the following steps for a fixed number of generations or until a stopping criterion is met:

  1. Fitness Evaluation: Evaluate the fitness of each individual in the population.
  2. Selection: Select individuals to create offspring.
  3. Crossover: Apply crossover to create offspring.
  4. Mutation: Apply mutation to the offspring.
  5. Replacement: Replace the old population with the new offspring.
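
The five steps above can be put together in a small, self-contained loop. This is a sketch under stated assumptions, not the package's implementation: the function name, all default parameter values, and the toy fitness (agreement with a fixed target mask, standing in for Random Forest accuracy) are hypothetical. Tracking the best individual seen so far is an addition for convenience.

```python
import random

def run_toy_ga(fitness, n_genes, pop_size=20, generations=15, cx_prob=0.5,
               mut_prob=0.2, flip_prob=0.05, tournament_size=3, seed=0):
    """Run a simple generational GA and return the best individual seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for generation in range(generations):
        # 1. Fitness evaluation
        scores = [fitness(ind) for ind in population]
        # 2. Selection (tournament); winners are copied so edits stay local
        offspring = []
        for slot in range(pop_size):
            contenders = rng.sample(range(pop_size), tournament_size)
            winner = max(contenders, key=lambda i: scores[i])
            offspring.append(list(population[winner]))
        # 3. Crossover (two-point) on adjacent pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < cx_prob:
                p1, p2 = sorted(rng.sample(range(1, n_genes), 2))
                a, b = offspring[i], offspring[i + 1]
                a[p1:p2], b[p1:p2] = b[p1:p2], a[p1:p2]
        # 4. Mutation (flip-bit)
        for ind in offspring:
            if rng.random() < mut_prob:
                for j in range(n_genes):
                    if rng.random() < flip_prob:
                        ind[j] = 1 - ind[j]
        # 5. Replacement: the offspring become the next population
        population = offspring
        best = max(population + [best], key=fitness)
    return best

target = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical "ideal" feature mask

def toy_fitness(individual):
    """Number of genes that agree with the target mask."""
    return sum(1 for gene, want in zip(individual, target) if gene == want)

best = run_toy_ga(toy_fitness, n_genes=len(target))
```

In the package, the fitness call would be the expensive part, since every evaluation trains a classifier; that is why the population size and the number of generations directly drive the running time.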

Using Random Forest as a Classifier

In this package, a Random Forest classifier is used as the underlying model to evaluate the fitness of each individual. Random Forest is an ensemble learning method that constructs multiple decision trees and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is chosen for its robustness and ability to handle large datasets with high dimensionality.
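
The "mode of the classes" idea can be illustrated with a tiny majority vote. The three threshold functions below are hypothetical stand-ins for trained decision trees. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by hard voting, so this sketch shows the textbook description, not that library's exact behaviour.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Ask every tree for a prediction and return the majority class."""
    return majority_vote([tree(sample) for tree in trees])

# Hypothetical stand-ins for trained trees: simple threshold rules.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 2.0 else 0,
    lambda x: 1 if x[0] + x[1] > 3.0 else 0,
]

prediction = forest_predict(trees, [0.9, 1.0])
```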

Installation

You can install the package via pip:

```bash
pip install ga_optimization
```

Alternatively, you can clone the repository and install the dependencies:

```bash
git clone https://github.com/suraj385/ga_optimization.git
cd ga_optimization
pip install -r requirements.txt
```

Usage

Here is a basic example of how to use the GA Optimization package:

```python
import pandas as pd

from optimization.feature_extraction import (
    prepare_data,
    setup_deap,
    run_ga,
    save_selected_data,
)

# Parameters
data_path = "train_data.csv"
target_column = "Class"
sample_frac = 0.2      # Fraction of the data to use for training
population_size = 10   # Adjust to your requirements
generations = 2        # Small value for a quick test; use more in practice

# Load the original data
original_data = pd.read_csv(data_path)

# Sample the data
sampled_data = original_data.sample(frac=sample_frac, random_state=42)

# Prepare the data
X, y = prepare_data(sampled_data, target_column)

# Set up DEAP
toolbox = setup_deap(X)

# Run the genetic algorithm to select features
selected_columns = run_ga(
    toolbox,
    X,
    y,
    sampled_data.columns,
    population_size=population_size,
    generations=generations,
)

# Save the original data restricted to the selected features
save_selected_data(
    original_data,
    selected_columns,
    target_column,
    output_path="train_data_selected1.csv",
)
```

License

This project is licensed under the MIT License - see the LICENSE file for details.


This version

0.1

Download files


Source Distribution

ga_optimization-0.1.tar.gz (5.3 kB view details)

Uploaded Source

Built Distribution


ga_optimization-0.1-py3-none-any.whl (5.5 kB view details)

Uploaded Python 3

File details

Details for the file ga_optimization-0.1.tar.gz.

File metadata

  • Download URL: ga_optimization-0.1.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.13

File hashes

Hashes for ga_optimization-0.1.tar.gz
Algorithm Hash digest
SHA256 849d980c30d0b453bb059f8b795d5441761d49fa7af7a1cdb951b0de181e1498
MD5 839b9242032ac0a9fc84a53a0af6487e
BLAKE2b-256 f55cf3775deb670ad4782c9b14b2d73a381114358b3c6a2d7caffe52b8e64c78


File details

Details for the file ga_optimization-0.1-py3-none-any.whl.

File metadata

  • Download URL: ga_optimization-0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.13

File hashes

Hashes for ga_optimization-0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d9be7718cfd0913a766609658edf6c67de2d13b57e6c9f306a7c28cff8f7035c
MD5 9d03beb772b9db7ff1bcc12a5dbd7005
BLAKE2b-256 19c00a53bd9051eb0228dd98174ac5b576ddb0af8160b3265dd48217b6908d2a

