Feature-Gen: A Robust Feature Engineering Framework
Feature-Gen is a Python-based library designed to simplify and optimize feature engineering for classification tasks. By integrating genetic algorithms, ensemble learning, and advanced feature transformations, Feature-Gen enables the discovery of feature subsets that maximize model performance while ensuring interpretability. It supports efficient processing through multithreading and multiprocessing, making it scalable for large datasets.
Key Features
- Automated Feature Engineering: Automatically identifies and optimizes feature subsets for classification tasks.
- Advanced Transformations: Includes transformations like logarithmic, square, cubic, sigmoid, and tanh to uncover complex, non-linear relationships.
- Multi-objective Optimization: Leverages the NSGA-II genetic algorithm to optimize both classification accuracy and feature subset size.
- Ensemble Learning Integration: Combines Logistic Regression, SVM, and XGBoost to ensure diverse model perspectives.
- Flexible Ensemble Methods: Supports strategies like Majority Voting, Weighted Averaging, and Greedy Selection for robust feature evaluation.
- Scalable Architecture: Uses multithreading and multiprocessing to handle large datasets efficiently.
- Extensive Validation: Tested on over 100 datasets, demonstrating robustness and adaptability across domains.
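To make the transformations listed above concrete, the sketch below applies them to a single feature column with NumPy. This is an illustration of the idea only; the helper name and column name are hypothetical and are not part of Feature-Gen's API.

```python
import numpy as np
import pandas as pd

def apply_transformations(series: pd.Series) -> pd.DataFrame:
    """Apply log, square, cubic, sigmoid, and tanh to one feature column.
    Illustrative sketch -- not Feature-Gen's internal implementation."""
    x = series.to_numpy(dtype=float)
    return pd.DataFrame({
        f"{series.name}_log": np.log1p(np.abs(x)),       # log1p avoids log(0)
        f"{series.name}_square": x ** 2,
        f"{series.name}_cubic": x ** 3,
        f"{series.name}_sigmoid": 1.0 / (1.0 + np.exp(-x)),
        f"{series.name}_tanh": np.tanh(x),
    })

transformed = apply_transformations(pd.Series([0.0, 1.0, 2.0], name="alcohol"))
print(transformed.columns.tolist())
```

Each derived column can then be evaluated alongside the originals, which is how non-linear relationships become visible to an otherwise linear model.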
Installation
Install the library directly from PyPI:
```bash
pip install feature-gen
```
Getting Started
Example Usage
The following example demonstrates how to use Feature-Gen to perform feature engineering with the default settings:
```python
import pandas as pd
from sklearn.datasets import load_wine

from feature_gen.feature_gen_master import FeatureGenMaster
from feature_gen.implementation.constants import EnsembleMethod

# Load and prepare the dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Show the current features
print(df.columns)

# Initialize FeatureGenMaster
f_g = FeatureGenMaster(df, 'target')

# Start the feature engineering process
f_g.start(
    ensemble_methods=[EnsembleMethod.GREEDY, EnsembleMethod.WEIGHTED_AVERAGING]
)

# Retrieve results
print("Best New Features:", f_g.get_best_new_features())
print("Best Original Features:", f_g.get_best_original_features())
print("All Ensemble Methods Scores:", f_g.get_all_ensemble_methods_scores())
```
The following example demonstrates how to use Feature-Gen with full control over the library's parameters:
```python
import pandas as pd
from sklearn.datasets import load_wine

from feature_gen.feature_gen_master import FeatureGenMaster
from feature_gen.implementation.constants import EnsembleMethod

all_ensemble_methods = [
    EnsembleMethod.GREEDY,
    EnsembleMethod.WEIGHTED_MAJORITY_VOTING
]

# Load the Wine dataset
data = load_wine()

# Create a DataFrame with the features and target
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Show the current features
print(df.columns)

f_g = FeatureGenMaster(df, 'target', min_number_of_target_unique_values=20)
f_g.start(
    ensemble_methods=all_ensemble_methods,
    random_state=42,
    max_iter=50,
    C=1e5,
    solver='liblinear',
    gamma=1,
    n_components=100,
    sgd_loss='hinge',
    sgd_max_iter=1000,
    sgd_tol=1e-2,
    xgb_n_estimators=100,
    generations_num=2,
    bootstrap_samples_count=1,
    first_population_size=4
)

print('Best new features:', f_g.get_best_new_features())
print('Best original features:', f_g.get_best_original_features())
print('Best all features:', f_g.get_all_best_features())
print('All ensemble methods scores:', f_g.get_all_ensemble_methods_scores())
```
Framework Architecture
1. Micro-Step Genetic Algorithm
- Bootstrap Sampling: Generates three independent bootstrap samples to ensure robustness and diversity.
- Population Initialization: Creates a population of binary chromosomes representing feature subsets.
- Evaluation Metrics:
- Maximizes the F1 score of an ensemble model (Logistic Regression, SVM, XGBoost).
- Minimizes the number of selected features for interpretability.
- Genetic Operations:
- Selection: Binary tournament selection to choose the best chromosomes.
- Crossover: Uniform crossover for generating offspring.
- Mutation: Flip-bit mutation to introduce diversity.
- Population Update: Combines parents and offspring using NSGA-II for multi-objective optimization.
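The genetic operations above can be sketched in plain Python. This is a minimal illustration of the concepts (binary feature masks, binary tournament selection, uniform crossover, flip-bit mutation); the function names, population, and fitness values are hypothetical and do not reflect Feature-Gen's internal code.

```python
import random

random.seed(42)

def tournament_select(population, fitness):
    """Binary tournament: sample two chromosomes, keep the fitter one."""
    a, b = random.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]

def uniform_crossover(p1, p2):
    """Each gene is inherited from either parent with equal probability."""
    return [g1 if random.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]

def flip_bit_mutation(chrom, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if random.random() < rate else g for g in chrom]

# A chromosome is a binary mask over features: 1 = keep, 0 = drop.
population = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0]]
fitness = [0.80, 0.75, 0.90]  # e.g. ensemble F1 scores (hypothetical)

parent1 = tournament_select(population, fitness)
parent2 = tournament_select(population, fitness)
child = flip_bit_mutation(uniform_crossover(parent1, parent2))
print(child)
```

In the real framework each chromosome is scored on both objectives (F1 score and subset size), and NSGA-II's non-dominated sorting decides which parents and offspring survive.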
2. Macro-Step Genetic Algorithm
- Feature Aggregation: Combines feature subsets from the micro-step using union logic.
- Global Optimization: Refines the macro-feature set using NSGA-II.
- Final Feature Set: Outputs an optimal feature set balancing accuracy and interpretability.
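The union-based feature aggregation in the macro step can be illustrated in a few lines; the feature names below are hypothetical.

```python
# Feature subsets selected during the micro step on three bootstrap samples
subsets = [
    {"alcohol", "proline"},
    {"proline", "hue"},
    {"alcohol", "flavanoids"},
]

# The macro step starts from the union of all micro-step subsets
macro_features = set().union(*subsets)
print(sorted(macro_features))  # -> ['alcohol', 'flavanoids', 'hue', 'proline']
```

The union keeps every feature that proved useful on at least one bootstrap sample; the subsequent NSGA-II pass then prunes this superset back down.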
Core Features and Functionality
- Multithreading and Multiprocessing: Uses multithreading for concurrent evaluations and multiprocessing for parallelizing resource-intensive tasks, ensuring scalability and efficient execution on large datasets.
- Built-in Ensemble Methods: Supports flexible aggregation strategies like Majority Voting, Weighted Averaging, and Greedy Selection.
- Advanced Feature Transformations: Includes transformations such as logarithmic, sigmoid, and tanh to capture non-linear relationships.
- Extensive Validation: Tested on over 100 datasets, ensuring robustness and reliability.
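As a minimal sketch of how an aggregation strategy like Majority Voting combines per-sample predictions from the three base models, consider the following; this mirrors the idea, not the library's exact implementation.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label most base models agreed on (ties: first seen wins)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions for one sample from
# Logistic Regression, SVM, and XGBoost respectively
print(majority_vote([1, 1, 0]))  # -> 1
```

Weighted Averaging generalizes this by weighting each model's vote, and Greedy Selection instead adds models to the ensemble one at a time while the validation score improves.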
Strengths
- Robust Optimization: Balances competing objectives through the micro-macro genetic algorithm.
- Integration of Transformations: Enhances predictive performance by uncovering non-linear relationships.
- Generalizability: Ensures applicability across linear, boundary-based, and non-linear problems.
- Interpretability: Achieves significant feature set reductions without compromising accuracy.
Future Directions
- Scalability Enhancements: Expand support for distributed systems to handle even larger datasets.
- Dynamic Transformation Framework: Introduce dataset-specific transformation selection for enhanced adaptability.
- Additional Ensemble Methods: Integrate more aggregation strategies to improve robustness and flexibility.
- User Interface: Develop visualization tools for better insights into feature engineering results.
Resources
- Documentation: Available on the project's PyPI page.
- Source Code: Hosted in the project's development repository.
Contributing
Contributions are welcome! For major changes, please open an issue to discuss proposed updates. Ensure all pull requests align with the project's goals and include relevant tests.
License
This project is licensed under the MIT License. See the LICENSE file for more details.