Feature-Gen: A Robust Feature Engineering Framework

Feature-Gen is a Python-based library designed to simplify and optimize feature engineering for classification tasks. By integrating genetic algorithms, ensemble learning, and advanced feature transformations, Feature-Gen enables the discovery of feature subsets that maximize model performance while ensuring interpretability. It supports efficient processing through multithreading and multiprocessing, making it scalable for large datasets.


Key Features

  • Automated Feature Engineering: Automatically identifies and optimizes feature subsets for classification tasks.
  • Advanced Transformations: Includes transformations like logarithmic, square, cubic, sigmoid, and tanh to uncover complex, non-linear relationships.
  • Multi-objective Optimization: Leverages the NSGA-II genetic algorithm to optimize both classification accuracy and feature subset size.
  • Ensemble Learning Integration: Combines Logistic Regression, SVM, and XGBoost to ensure diverse model perspectives.
  • Flexible Ensemble Methods: Supports strategies like Majority Voting, Weighted Averaging, and Greedy Selection for robust feature evaluation.
  • Scalable Architecture: Uses multithreading and multiprocessing to handle large datasets efficiently.
  • Extensive Validation: Tested on over 100 datasets, demonstrating robustness and adaptability across domains.
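The transformations named above can be illustrated with a minimal sketch. It is not Feature-Gen's own code: the helper names (`safe_log`, `expand_features`, `TRANSFORMS`), the derived-column naming scheme, and the choice of `log1p(|x|)` to keep the logarithm defined for zero and negative values are all assumptions made for illustration.

```python
import math


def safe_log(x: float) -> float:
    # Assumption: log1p(|x|) keeps the transform defined for zero and negative inputs.
    return math.log1p(abs(x))


def sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow for large |x|).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical registry of the five transformations listed in the README.
TRANSFORMS = {
    "log": safe_log,
    "square": lambda x: x * x,
    "cube": lambda x: x ** 3,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
}


def expand_features(columns: dict[str, list[float]]) -> dict[str, list[float]]:
    """Return the original columns plus one derived column per transform."""
    expanded = dict(columns)
    for name, values in columns.items():
        for t_name, fn in TRANSFORMS.items():
            expanded[f"{t_name}({name})"] = [fn(v) for v in values]
    return expanded


toy = {"alcohol": [12.0, 13.5], "ash": [2.1, 2.4]}
expanded = expand_features(toy)
print(sorted(expanded))
```

Each original column yields five candidate columns, so the search space grows sixfold before any selection takes place. This is why a subset-selection step is needed afterwards.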

Installation

Install the library directly from PyPI:

pip install feature-gen

Getting Started

Example Usage

The following example demonstrates how to use Feature-Gen to perform feature engineering:

# Example Dataset
import pandas as pd
from feature_gen.feature_gen_master import FeatureGenMaster
from feature_gen.implementation.constants import EnsembleMethod
from sklearn.datasets import load_wine

# Load and prepare dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Show the current features
print(df.columns)

# Initialize FeatureGenMaster
f_g = FeatureGenMaster(df, 'target')

# Start the feature engineering process
f_g.start(
    ensemble_methods=[EnsembleMethod.GREEDY, EnsembleMethod.WEIGHTED_AVERAGING]
)

# Retrieve results
print("Best New Features:", f_g.get_best_new_features())
print("Best Original Features:", f_g.get_best_original_features())
print("All Ensemble Methods Scores:", f_g.get_all_ensemble_methods_scores())

The following example demonstrates how to use Feature-Gen with full control over the library's parameters:

import pandas as pd
from sklearn.datasets import load_wine

from feature_gen.feature_gen_master import FeatureGenMaster
from feature_gen.implementation.constants import EnsembleMethod

all_ensemble_methods = [
    EnsembleMethod.GREEDY,
    EnsembleMethod.WEIGHTED_MAJORITY_VOTING
]

# Load the Wine dataset
data = load_wine()

# Create a DataFrame with the features and target
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Show the current features
print(df.columns)

f_g = FeatureGenMaster(df, 'target', min_number_of_target_unique_values=20)

f_g.start(
    ensemble_methods=all_ensemble_methods,
    random_state=42,
    max_iter=50,
    C=1e5,
    solver='liblinear',
    gamma=1,
    n_components=100,
    sgd_loss='hinge',
    sgd_max_iter=1000,
    sgd_tol=1e-2,
    xgb_n_estimators=100,
    generations_num=2,
    bootstrap_samples_count=1,
    first_population_size=4
)

print('Best new features', f_g.get_best_new_features())
print('Best original features', f_g.get_best_original_features())
print('Best all features', f_g.get_all_best_features())
print("All ensemble methods scores", f_g.get_all_ensemble_methods_scores())

Framework Architecture

1. Micro-Step Genetic Algorithm

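  • Bootstrap Sampling: Generates three independent bootstrap samples to ensure robustness and diversity (the bootstrap_samples_count argument in the full-control example above suggests this count is configurable).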
  • Population Initialization: Creates a population of binary chromosomes representing feature subsets.
  • Evaluation Metrics:
    • Maximizes the F1 score of an ensemble model (Logistic Regression, SVM, XGBoost).
    • Minimizes the number of selected features for interpretability.
  • Genetic Operations:
    • Selection: Binary tournament selection to choose the best chromosomes.
    • Crossover: Uniform crossover for generating offspring.
    • Mutation: Flip-bit mutation to introduce diversity.
    • Population Update: Combines parents and offspring using NSGA-II for multi-objective optimization.
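The genetic operations listed above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: all function names are hypothetical, and `toy_fitness` is a made-up scalar score standing in for the ensemble F1 evaluation. Real NSGA-II tournaments compare Pareto rank and crowding distance rather than a single number.

```python
import random

Chromosome = list[int]  # 1 = feature selected, 0 = feature dropped


def init_population(size: int, n_features: int, rng: random.Random) -> list[Chromosome]:
    # Random binary chromosomes, one bit per candidate feature.
    return [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(size)]


def binary_tournament(population, fitness, rng):
    # Draw two distinct individuals and keep the fitter one.
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]


def uniform_crossover(p1, p2, rng):
    # Each gene is copied from either parent with equal probability.
    return [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]


def flip_bit_mutation(chromosome, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in chromosome]


def toy_fitness(chromosome):
    # Hypothetical stand-in for the ensemble F1 score: rewards features 0 and 2,
    # with a small penalty per selected feature.
    useful = (0, 2)
    return sum(chromosome[i] for i in useful) - 0.1 * sum(chromosome)


rng = random.Random(42)
population = init_population(size=4, n_features=6, rng=rng)
fitness = [toy_fitness(c) for c in population]

offspring = []
for _ in range(len(population)):
    p1 = binary_tournament(population, fitness, rng)
    p2 = binary_tournament(population, fitness, rng)
    child = flip_bit_mutation(uniform_crossover(p1, p2, rng), rate=0.1, rng=rng)
    offspring.append(child)

print(offspring)
```

In the framework as described, parents and offspring would then be merged and filtered by NSGA-II. The sketch stops before that step.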

2. Macro-Step Genetic Algorithm

  • Feature Aggregation: Combines feature subsets from the micro-step using union logic.
  • Global Optimization: Refines the macro-feature set using NSGA-II.
  • Final Feature Set: Outputs an optimal feature set balancing accuracy and interpretability.
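As a rough sketch of the macro step, the union of micro-step subsets and the Pareto comparison behind NSGA-II might look like the following. The function names and the sample scores are invented for illustration. Full NSGA-II also ranks successive fronts and uses crowding distance, which this sketch omits.

```python
def union_of_subsets(subsets):
    """Gene-wise OR: a feature is kept if any micro-step subset selected it."""
    return [int(any(genes)) for genes in zip(*subsets)]


def dominates(a, b):
    """a and b are (f1, n_features) pairs: higher F1 and fewer features are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(scores):
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(o, s) for j, o in enumerate(scores) if j != i)
    ]


# Hypothetical outputs of three micro-step runs over four features.
micro_results = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0]]
macro_mask = union_of_subsets(micro_results)

# Hypothetical (F1, number of features) scores for four candidate subsets.
candidates = [(0.90, 5), (0.92, 8), (0.88, 5), (0.92, 6)]
front = pareto_front(candidates)

print(macro_mask, front)
```

Here the second candidate is dominated by the fourth (same F1, more features) and the third by the first (lower F1, same size). The front therefore keeps only the two subsets that represent a genuine trade-off between accuracy and size.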

Core Features and Functionality

  • Multithreading and Multiprocessing:

    • Uses multithreading for concurrent evaluations and multiprocessing for parallelizing resource-intensive tasks.
    • Ensures scalability and efficient execution for large datasets.
  • Built-in Ensemble Methods:

    • Supports flexible aggregation strategies like Majority Voting, Weighted Averaging, and Greedy Selection.
  • Advanced Feature Transformations:

    • Includes transformations such as logarithmic, square, cubic, sigmoid, and tanh to capture non-linear relationships.
  • Extensive Validation:

    • Tested on over 100 datasets, ensuring robustness and reliability.
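Two of the aggregation strategies named above, Majority Voting and Weighted Averaging, can be illustrated generically. This is a sketch of the general techniques only. How Feature-Gen applies them internally is not documented here, and the function names, predictions, and weights below are made up.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one label list per model; returns the most common label per sample."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]


def weighted_average(probabilities, weights):
    """probabilities: one probability list per model; returns the weighted mean per sample."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, sample)) / total
        for sample in zip(*probabilities)
    ]


# Hypothetical class predictions from three models on four samples.
lr_labels = [0, 1, 1, 0]
svm_labels = [0, 1, 0, 0]
xgb_labels = [1, 1, 1, 0]
voted = majority_vote([lr_labels, svm_labels, xgb_labels])

# Hypothetical positive-class probabilities on two samples; the third model counts double.
lr_proba = [0.2, 0.9]
svm_proba = [0.4, 0.7]
xgb_proba = [0.6, 0.8]
averaged = weighted_average([lr_proba, svm_proba, xgb_proba], weights=[1, 1, 2])

print(voted, averaged)
```

With an odd number of models and binary labels, majority voting never ties. With more classes or an even number of models, a tie-breaking rule would be needed.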

Strengths

  1. Robust Optimization: Balances competing objectives through the micro-macro genetic algorithm.
  2. Integration of Transformations: Enhances predictive performance by uncovering non-linear relationships.
  3. Generalizability: Ensures applicability across linear, boundary-based, and non-linear problems.
  4. Interpretability: Achieves significant feature set reductions without compromising accuracy.

Future Directions

  1. Scalability Enhancements: Expand support for distributed systems to handle even larger datasets.
  2. Dynamic Transformation Framework: Introduce dataset-specific transformation selection for enhanced adaptability.
  3. Additional Ensemble Methods: Integrate more aggregation strategies to improve robustness and flexibility.
  4. User Interface: Develop visualization tools for better insights into feature engineering results.

Contributing

Contributions are welcome! For major changes, please open an issue to discuss proposed updates. Ensure all pull requests align with the project's goals and include relevant tests.


License

This project is licensed under the MIT License. See the LICENSE file for more details.
