Feature Engineering Suite

Feature Engineering Suite is a Python library for common feature engineering tasks (feature selection, standardization, log transformation, and categorical encoding), designed to integrate with Scikit-learn pipelines.

Installation

First, ensure you have the necessary files (setup.py and the feature_engineering_suite directory) structured correctly.

Navigate to the root directory (the one containing setup.py) in your terminal and run this command to create a source distribution:

python setup.py sdist

This will create a dist directory containing a file like feature_engineering_suite-0.1.0.tar.gz. You can now install your package using pip:

pip install dist/feature_engineering_suite-0.1.0.tar.gz

How to Use

The example below walks through a complete workflow: building sample data, selecting features, transforming numerical columns, and encoding categorical columns.

1. Sample Data

Let's start with a sample dataset.

import pandas as pd  
import numpy as np

# Create a sample DataFrame for a classification problem  
data = {  
    'age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],  
    'salary': [50000, 60000, 75000, 90000, 110000, 135000, 160000, 180000, 210000, 240000],  
    'years_experience': [2, 5, 8, 12, 15, 18, 22, 25, 28, 30],  
    'department': ['HR', 'IT', 'Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],  
    'education': ['Bachelor', 'Master', 'Bachelor', 'PhD', 'Master', 'Bachelor', 'PhD', 'Master', 'Bachelor', 'PhD'],  
    'purchased_premium': [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]  
}  
df = pd.DataFrame(data)  
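# Add a noisy near-copy of 'salary' so the correlation filter has something to find
rng = np.random.default_rng(42)  # fixed seed so the example is reproducible
df['salary_correlated'] = df['salary'] * 1.1 + rng.normal(0, 5000, df.shape[0])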

X = df.drop('purchased_premium', axis=1)  
y = df['purchased_premium']

2. Feature Selection

First, let's identify the most important and least redundant features.

from feature_engineering_suite import FeatureSelector

# Get feature importance scores  
importance = FeatureSelector.get_feature_importance(X.select_dtypes(include=np.number), y, task='classification')  
print("--- Feature Importance ---")  
print(importance)

# Find and remove highly correlated features  
corr_selector = FeatureSelector(correlation_threshold=0.9)  
corr_selector.fit(X.select_dtypes(include=np.number))  
print(f"\n--- Features to Drop (Correlation > 0.9) ---\n{corr_selector.features_to_drop_}")
X_uncorrelated = corr_selector.transform(X)
print(f"\nShape of X before dropping correlated features: {X.shape}")
print(f"Shape of X after dropping correlated features: {X_uncorrelated.shape}")
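
To make the idea of a correlation threshold concrete, here is a minimal sketch in plain pandas. It is not the library's implementation: the helper name find_correlated_columns and the toy data are assumptions made for illustration, and FeatureSelector may use a different rule for deciding which column of a correlated pair to drop.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Hypothetical helper: list columns that correlate strongly with an earlier column."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a, so it is redundant
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to a
})
to_drop = find_correlated_columns(demo)
demo_reduced = demo.drop(columns=to_drop)
print(to_drop)
```

In this sketch the later column of each correlated pair is the one dropped, so column order decides which feature survives.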

3. Transformation and Standardization

Now, let's apply transformations to the numerical features.

from feature_engineering_suite import Standardizer, LogTransformer

# Apply standard scaling to 'age' and 'years_experience'  
standardizer = Standardizer(columns=['age', 'years_experience'])  
X_scaled = standardizer.fit_transform(X_uncorrelated)

# Apply log transformation to the 'salary' column  
log_transformer = LogTransformer(columns=['salary'])  
X_final_numeric = log_transformer.fit_transform(X_scaled)

print("\n--- Data After Transformations ---")  
print(X_final_numeric.head())
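
For reference, the arithmetic behind these two steps can be sketched with NumPy alone. This is an illustration under assumptions, not the library's code: it is not stated here whether Standardizer uses the population standard deviation (as below) or the sample one, or whether LogTransformer applies log(x) or log(1 + x).

```python
import numpy as np

ages = np.array([25, 30, 35, 40, 45], dtype=float)
# Z-score: subtract the mean, then divide by the (population) standard deviation
ages_scaled = (ages - ages.mean()) / ages.std()

salaries = np.array([50000, 60000, 75000, 90000, 110000], dtype=float)
# log1p computes log(1 + x), which stays defined when x is 0
salaries_log = np.log1p(salaries)

print(ages_scaled)
print(salaries_log)
```

After scaling, the column has mean 0 and standard deviation 1. The log transform compresses the long right tail that salary-like columns often have.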

4. Categorical Encoding

Finally, let's encode the categorical features.

from feature_engineering_suite import Encoder

# Define an ordinal mapping for the 'education' column  
education_map = {'Bachelor': 1, 'Master': 2, 'PhD': 3}

# Use the Encoder for both one-hot and ordinal encoding  
# We will one-hot encode 'department' and ordinally encode 'education'

# One-hot encode department  
onehot_encoder = Encoder(method='onehot', columns=['department'])  
X_encoded = onehot_encoder.fit_transform(X_final_numeric)

# Ordinal encode education  
ordinal_encoder = Encoder(method='ordinal', columns=['education'], mapping={'education': education_map})  
X_fully_processed = ordinal_encoder.fit_transform(X_encoded)

print("\n--- Fully Processed DataFrame ---")  
print(X_fully_processed.head())  
print(f"\nFinal shape of processed data: {X_fully_processed.shape}")
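
As a point of comparison, the same two encodings can be sketched with pandas directly. The small frame below is made up for illustration, and the column names produced by the library's Encoder may differ from those pandas generates.

```python
import pandas as pd

demo = pd.DataFrame({
    "department": ["HR", "IT", "Sales", "IT"],
    "education": ["Bachelor", "Master", "Bachelor", "PhD"],
})
education_map = {"Bachelor": 1, "Master": 2, "PhD": 3}

# One-hot: one 0/1 indicator column per department value
encoded = pd.get_dummies(demo, columns=["department"], dtype=int)
# Ordinal: replace each label with its rank from the mapping
encoded["education"] = encoded["education"].map(education_map)

print(encoded)
```

One-hot encoding suits 'department' because its values have no natural order. The explicit mapping suits 'education' because Bachelor < Master < PhD is a meaningful ranking.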

Together, these classes provide building blocks for reproducible feature engineering pipelines.
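
The description above says the library is designed to integrate with Scikit-learn pipelines. The sketch below shows what that integration typically looks like for any transformer that follows the fit/transform convention. ColumnLogTransformer is a hypothetical stand-in written for this example, not a class from the library, and whether the library's own classes can be dropped into a Pipeline in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnLogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: applies log(1 + x) to the named columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # A log transform has nothing to learn from the data
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = np.log1p(X[col])
        return X


toy_X = pd.DataFrame({
    "salary": [50000, 60000, 75000, 90000, 110000, 135000],
    "age": [25, 30, 35, 40, 45, 50],
})
toy_y = pd.Series([0, 0, 1, 0, 1, 1])

# Each step is fitted in order; the last step is the estimator
pipe = Pipeline([
    ("log_salary", ColumnLogTransformer(columns=["salary"])),
    ("model", LogisticRegression()),
])
pipe.fit(toy_X, toy_y)
toy_pred = pipe.predict(toy_X)
```

Wrapping preprocessing and the model in one Pipeline means the same transformations are applied at fit time and at prediction time.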
