A comprehensive Python toolkit for feature engineering and data analysis
Reason this release was yanked:
newer better
MLToolkit
A comprehensive Python toolkit for feature engineering and rudimentary data analysis to prepare dataframes for machine learning.
Installation
Install from GitHub
pip install git+https://github.com/bluelion1999/feature_engineering_tk.git
Install from PyPI (coming soon)
pip install feature-engineering-tk
Install from source
git clone https://github.com/bluelion1999/feature_engineering_tk.git
cd feature_engineering_tk
pip install -e .
Development Installation
For development with additional tools:
git clone https://github.com/bluelion1999/feature_engineering_tk.git
cd feature_engineering_tk
pip install -e ".[dev]"
Breaking Changes (v2.0.0)
Version 2.0.0 introduces important breaking changes. Please review carefully before upgrading.
Inplace Parameter Default Changed
The inplace parameter default has changed from True to False for all methods in DataPreprocessor and FeatureEngineer. This aligns with pandas conventions and prevents accidental data mutations.
Before (v1.x):
preprocessor = DataPreprocessor(df)
preprocessor.handle_missing_values(strategy='mean') # Modified internal df by default
cleaned_df = preprocessor.get_dataframe()
After (v2.0.0):
preprocessor = DataPreprocessor(df)
# Option 1: Explicitly use inplace=True (old behavior)
preprocessor.handle_missing_values(strategy='mean', inplace=True)
cleaned_df = preprocessor.get_dataframe()
# Option 2: Capture returned DataFrame (recommended)
cleaned_df = preprocessor.handle_missing_values(strategy='mean', inplace=False)
Migration Guide:
If you were relying on the implicit inplace=True behavior, you have two options:
- Add inplace=True to all method calls (quick fix):
preprocessor.handle_missing_values(strategy='mean', inplace=True)
preprocessor.remove_duplicates(inplace=True)
- Refactor to use returned DataFrames (recommended, more pandas-like):
df = preprocessor.handle_missing_values(strategy='mean')
df = preprocessor.remove_duplicates()
Affected Classes:
- DataPreprocessor: All transformation methods
- FeatureEngineer: All encoding, scaling, and feature creation methods
Not Affected:
- DataAnalyzer: Read-only, no inplace operations
- FeatureSelector: Uses a different pattern with apply_selection()
See CHANGELOG.md for full list of changes.
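The difference between the two inplace modes can be illustrated with a small stand-in class. This is a minimal sketch, not the toolkit's code: TinyPreprocessor and fill_mean are hypothetical names, and returning None when inplace=True is an assumption borrowed from the pandas convention rather than something this README confirms.

```python
import pandas as pd


class TinyPreprocessor:
    """Hypothetical stand-in used only to illustrate the inplace pattern."""

    def __init__(self, df):
        # Keep a private copy so the caller's DataFrame is never touched.
        self.df = df.copy()

    def fill_mean(self, column, inplace=False):
        # Work on a copy, then decide whether to store or return it.
        filled = self.df.copy()
        filled[column] = filled[column].fillna(filled[column].mean())
        if inplace:
            self.df = filled  # mutate internal state
            return None       # assumption: mirrors the pandas convention
        return filled         # internal state stays unchanged


df = pd.DataFrame({"age": [20.0, None, 40.0]})
pre = TinyPreprocessor(df)

# Default (inplace=False): the result is returned, internal state is unchanged.
result = pre.fill_mean("age")
missing_after_default = int(pre.df["age"].isna().sum())

# Explicit inplace=True: internal state is updated.
pre.fill_mean("age", inplace=True)
missing_after_inplace = int(pre.df["age"].isna().sum())

print(result["age"].tolist())
print(missing_after_default, missing_after_inplace)
```

With a default of inplace=False, a call whose return value is discarded has no lasting effect, which is the main thing to check for when upgrading from v1.x.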
Modules
- data_analysis.py: Exploratory data analysis and visualization
- feature_engineering.py: Feature transformation and creation
- preprocessing.py: Data cleaning and preprocessing
- feature_selection.py: Feature selection methods
Quick Start
import pandas as pd
from feature_engineering_tk import DataAnalyzer, FeatureEngineer, DataPreprocessor, FeatureSelector, quick_analysis
# Load your data
df = pd.read_csv('your_data.csv')
# Quick analysis
quick_analysis(df)
Usage Examples
1. Data Analysis
from feature_engineering_tk import DataAnalyzer
# Initialize analyzer
analyzer = DataAnalyzer(df)
# Get basic information
info = analyzer.get_basic_info()
print(f"Shape: {info['shape']}")
print(f"Memory: {info['memory_usage_mb']:.2f} MB")
# Check missing values
missing = analyzer.get_missing_summary()
print(missing)
# Get numeric summary statistics
numeric_stats = analyzer.get_numeric_summary()
print(numeric_stats)
# Get categorical summary
cat_stats = analyzer.get_categorical_summary()
print(cat_stats)
# Find high correlations
high_corr = analyzer.get_high_correlations(threshold=0.7)
print(high_corr)
# Detect outliers using IQR method
outliers_iqr = analyzer.detect_outliers_iqr(columns=['age', 'salary'], multiplier=1.5)
# Detect outliers using Z-score method
outliers_zscore = analyzer.detect_outliers_zscore(columns=['age', 'salary'], threshold=3.0)
# Visualizations
analyzer.plot_missing_values()
analyzer.plot_correlation_heatmap()
analyzer.plot_distributions(columns=['age', 'salary', 'score'])
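For readers unfamiliar with the two outlier rules named above, here is a minimal sketch of how they are commonly computed in plain pandas. The helper names iqr_outlier_mask and zscore_outlier_mask are hypothetical, and the toolkit's own implementation and return type may differ.

```python
import pandas as pd


def iqr_outlier_mask(series, multiplier=1.5):
    """Flag values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - multiplier * iqr) | (series > q3 + multiplier * iqr)


def zscore_outlier_mask(series, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std()  # sample std (ddof=1)
    return z.abs() > threshold


values = pd.Series([10, 12, 11, 13, 12, 11, 12, 10, 13, 12, 11, 100], dtype=float)
iqr_mask = iqr_outlier_mask(values)
z_mask = zscore_outlier_mask(values)

print(values[iqr_mask].tolist())  # [100.0]
print(values[z_mask].tolist())    # [100.0]
```

One caveat on the Z-score rule: with n observations, no absolute z-score (computed with the sample standard deviation) can exceed (n - 1) / sqrt(n), so a threshold of 3.0 cannot flag anything in samples of 10 or fewer rows.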
2. Data Preprocessing
from feature_engineering_tk import DataPreprocessor
# Initialize preprocessor
preprocessor = DataPreprocessor(df)
# Handle missing values
# (inplace=True updates the preprocessor's internal DataFrame; the v2.0.0 default is inplace=False)
preprocessor.handle_missing_values(strategy='mean', columns=['age', 'salary'], inplace=True)
preprocessor.handle_missing_values(strategy='mode', columns=['category'], inplace=True)
preprocessor.handle_missing_values(strategy='median', columns=['score'], inplace=True)
# Remove duplicates
preprocessor.remove_duplicates(inplace=True)
# Handle outliers
preprocessor.handle_outliers(
    columns=['salary', 'age'],
    method='iqr',
    action='cap',
    multiplier=1.5,
    inplace=True
)
# Convert data types
preprocessor.convert_dtypes({
    'date': 'datetime',
    'category': 'category',
    'price': 'float64'
}, inplace=True)
# Clip values to range
preprocessor.clip_values('age', lower=0, upper=120, inplace=True)
# Remove constant columns
preprocessor.remove_constant_columns(inplace=True)
# Remove high cardinality columns
preprocessor.remove_high_cardinality_columns(threshold=0.95, inplace=True)
# Filter rows based on condition
preprocessor.filter_rows(lambda df: df['age'] > 18, inplace=True)
# Drop columns
preprocessor.drop_columns(['id', 'temp_column'], inplace=True)
# Rename columns
preprocessor.rename_columns({'old_name': 'new_name'}, inplace=True)
# Apply custom function
preprocessor.apply_custom_function('text', lambda x: x.lower(), new_column='text_lower', inplace=True)
# Get cleaned dataframe
cleaned_df = preprocessor.get_dataframe()
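As a point of comparison, the imputation strategies and the IQR 'cap' action used above can be sketched in plain pandas. The helpers impute and cap_outliers_iqr are hypothetical names for illustration; this is not the toolkit's implementation, only one common way to express the same ideas.

```python
import pandas as pd


def impute(series, strategy="mean"):
    """Fill missing values with the mean, median, or most frequent value."""
    if strategy == "mean":
        fill = series.mean()
    elif strategy == "median":
        fill = series.median()
    elif strategy == "mode":
        fill = series.mode().iloc[0]  # mode() ignores NaN by default
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return series.fillna(fill)


def cap_outliers_iqr(series, multiplier=1.5):
    """Clip values to the IQR fences instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr)


demo = pd.DataFrame({
    "salary": [10.0, 10.0, 11.0, 11.0, 11.0, 12.0, 12.0, 12.0, 12.0, 13.0, 13.0, 100.0],
    "category": ["a", "b", "a", None, "a", "b", "a", "a", "b", "a", "b", "a"],
})
demo["category"] = impute(demo["category"], strategy="mode")
demo["salary"] = cap_outliers_iqr(demo["salary"])
mean_filled = impute(pd.Series([1.0, None, 3.0]), strategy="mean")

print(demo["salary"].max())  # 14.125
print(mean_filled.tolist())  # [1.0, 2.0, 3.0]
```

Capping keeps every row, which matters when the dataset is small or when the extreme rows carry information in other columns.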
3. Feature Engineering
from feature_engineering_tk import FeatureEngineer
# Initialize feature engineer
engineer = FeatureEngineer(df)
# Label encoding
# (inplace=True updates the engineer's internal DataFrame; the v2.0.0 default is inplace=False)
engineer.encode_categorical_label(columns=['gender', 'city'], inplace=True)
# One-hot encoding
engineer.encode_categorical_onehot(
    columns=['country', 'department'],
    drop_first=True,
    prefix={'country': 'cnt', 'department': 'dept'},
    inplace=True
)
# Ordinal encoding
engineer.encode_categorical_ordinal(
    column='education',
    categories=['High School', 'Bachelor', 'Master', 'PhD'],
    inplace=True
)
# Scale features
engineer.scale_features(columns=['age', 'salary'], method='standard', inplace=True)
engineer.scale_features(columns=['price', 'quantity'], method='minmax', inplace=True)
engineer.scale_features(columns=['income'], method='robust', inplace=True)
# Create polynomial features
engineer.create_polynomial_features(
    columns=['feature1', 'feature2'],
    degree=2,
    interaction_only=False,
    inplace=True
)
# Create binning (quantile-based, then explicit bin edges)
engineer.create_binning(
    column='age',
    bins=5,
    strategy='quantile',
    labels=['Very Young', 'Young', 'Middle', 'Senior', 'Very Senior'],
    inplace=True
)
engineer.create_binning(
    column='salary',
    bins=[0, 30000, 60000, 100000, 200000],
    labels=['Low', 'Medium', 'High', 'Very High'],
    inplace=True
)
# Log transformation
engineer.create_log_transform(columns=['salary', 'revenue'], inplace=True)
# Square root transformation
engineer.create_sqrt_transform(columns=['area', 'population'], inplace=True)
# Extract datetime features
engineer.create_datetime_features(
    column='date',
    features=['year', 'month', 'day', 'dayofweek', 'quarter', 'is_weekend'],
    inplace=True
)
# Create aggregations (single and multiple group-by keys)
engineer.create_aggregations(
    group_by='city',
    agg_column='salary',
    agg_funcs=['mean', 'median', 'std'],
    inplace=True
)
engineer.create_aggregations(
    group_by=['department', 'level'],
    agg_column='performance_score',
    agg_funcs=['mean', 'max', 'min'],
    inplace=True
)
# Create ratio features
engineer.create_ratio_features(
    numerator='profit',
    denominator='revenue',
    name='profit_margin',
    inplace=True
)
# Create flag features (from a callable condition, then from a value match)
engineer.create_flag_features(
    column='age',
    condition=lambda x: x >= 65,
    flag_name='is_senior',
    inplace=True
)
engineer.create_flag_features(
    column='status',
    condition='active',
    flag_name='is_active',
    inplace=True
)
# Get engineered dataframe
engineered_df = engineer.get_dataframe()
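To make a few of these transformations concrete, the sketch below reproduces datetime extraction, a group-level aggregation, and a ratio feature in plain pandas. The output column names (date_dayofweek, salary_mean_by_city, and so on) and the choice to turn zero denominators into NaN are assumptions made for illustration; the toolkit may name columns and handle division by zero differently.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "city": ["A", "A", "B"],
    "salary": [100.0, 200.0, 300.0],
    "profit": [10.0, 20.0, 0.0],
    "revenue": [100.0, 0.0, 50.0],
    "date": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-04-01"]),
})
out = raw.copy()

# Datetime features: Monday is 0, so Saturday and Sunday are 5 and 6.
out["date_dayofweek"] = out["date"].dt.dayofweek
out["date_quarter"] = out["date"].dt.quarter
out["date_is_weekend"] = (out["date"].dt.dayofweek >= 5).astype(int)

# Aggregation feature: broadcast each group's mean back onto its rows.
out["salary_mean_by_city"] = out.groupby("city")["salary"].transform("mean")

# Ratio feature: replace zero denominators with NaN to avoid infinities.
out["profit_margin"] = out["profit"] / out["revenue"].replace(0, np.nan)

print(out[["date_dayofweek", "date_is_weekend", "salary_mean_by_city", "profit_margin"]])
```

Using transform rather than agg keeps the result aligned with the original rows, so the aggregate can be assigned directly as a new column.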
4. Feature Selection
from feature_engineering_tk import FeatureSelector, select_features_auto
# Initialize feature selector
selector = FeatureSelector(df, target_column='target')
# Select by variance
selected = selector.select_by_variance(threshold=0.01)
print(f"Features with variance > 0.01: {selected}")
# Remove highly correlated features
selected = selector.select_by_correlation(threshold=0.8, method='pearson')
print(f"Features after correlation filter: {selected}")
# Select top k features correlated with target
selected = selector.select_by_target_correlation(k=10, method='pearson')
print(f"Top 10 features correlated with target: {selected}")
# Statistical test selection
selected = selector.select_by_statistical_test(
k=15,
task='classification',
score_func='f_classif'
)
print(f"Top 15 features by statistical test: {selected}")
# Feature importance using Random Forest
selected = selector.select_by_importance(
k=10,
task='classification',
n_estimators=100,
random_state=42
)
print(f"Top 10 features by importance: {selected}")
# Select by missing values threshold
selected = selector.select_by_missing_values(threshold=0.3)
print(f"Features with < 30% missing: {selected}")
# Get feature importance dataframe
importance_df = selector.get_feature_importance_df()
print(importance_df)
# Apply selection to get new dataframe
selected_df = selector.apply_selection(keep_target=True)
# Automatic feature selection pipeline
auto_selected_df = select_features_auto(
df,
target_column='target',
task='classification',
max_features=20,
variance_threshold=0.01,
correlation_threshold=0.9
)
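The variance and correlation filters can be sketched in a few lines of pandas and NumPy. The functions variance_filter and correlation_filter are hypothetical and only illustrate the general technique (drop near-constant columns, then drop the later column of each highly correlated pair); they are not the toolkit's implementation, which may break ties differently.

```python
import numpy as np
import pandas as pd


def variance_filter(df, threshold=0.01):
    """Keep columns whose variance is strictly above the threshold."""
    variances = df.var()
    return [col for col in df.columns if variances[col] > threshold]


def correlation_filter(df, threshold=0.9):
    """Drop the later column of every pair with |correlation| above the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in df.columns if col not in to_drop]


features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],   # almost a multiple of "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],    # correlation with "a" is -0.8
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
})

after_variance = variance_filter(features, threshold=0.01)
after_correlation = correlation_filter(features[after_variance], threshold=0.9)

print(after_variance)     # ['a', 'b', 'c']
print(after_correlation)  # ['a', 'c']
```

Running the variance filter first also avoids undefined correlations, since a constant column has no defined Pearson correlation with anything.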
5. Complete Pipeline Example
import pandas as pd
from feature_engineering_tk import DataAnalyzer, DataPreprocessor, FeatureEngineer, FeatureSelector, quick_analysis
# Load data
df = pd.read_csv('data.csv')
# Step 1: Analyze
print("Analyzing data...")
analyzer = DataAnalyzer(df)
quick_analysis(df)
# Step 2: Preprocess
print("\nPreprocessing data...")
preprocessor = DataPreprocessor(df)
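# inplace=True so that get_dataframe() below returns the cleaned data (v2.0.0 defaults to inplace=False)
preprocessor.handle_missing_values(strategy='mean', columns=['numeric_col'], inplace=True)
preprocessor.handle_missing_values(strategy='mode', columns=['categorical_col'], inplace=True)
preprocessor.remove_duplicates(inplace=True)
preprocessor.handle_outliers(columns=['salary'], method='iqr', action='cap', inplace=True)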
df_clean = preprocessor.get_dataframe()
# Step 3: Feature Engineering
print("\nEngineering features...")
engineer = FeatureEngineer(df_clean)
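# inplace=True so that get_dataframe() below returns the engineered data (v2.0.0 defaults to inplace=False)
engineer.encode_categorical_onehot(columns=['category'], drop_first=True, inplace=True)
engineer.scale_features(columns=['age', 'salary'], method='standard', inplace=True)
engineer.create_datetime_features(column='date', features=['year', 'month', 'dayofweek'], inplace=True)
engineer.create_ratio_features('profit', 'revenue', 'profit_margin', inplace=True)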
df_engineered = engineer.get_dataframe()
# Step 4: Feature Selection
print("\nSelecting features...")
selector = FeatureSelector(df_engineered, target_column='target')
selected_features = selector.select_by_importance(k=15, task='classification')
df_final = selector.apply_selection(keep_target=True)
print(f"\nFinal dataset shape: {df_final.shape}")
print(f"Selected features: {selected_features}")
# Ready for ML!
X = df_final.drop('target', axis=1)
y = df_final['target']
API Reference
DataAnalyzer
- get_basic_info(): Get basic dataframe information
- get_missing_summary(): Get summary of missing values
- get_numeric_summary(): Get statistics for numeric columns
- get_categorical_summary(): Get summary for categorical columns
- detect_outliers_iqr(): Detect outliers using IQR method
- detect_outliers_zscore(): Detect outliers using Z-score
- get_correlation_matrix(): Get correlation matrix
- get_high_correlations(): Find highly correlated feature pairs
- get_cardinality_info(): Get cardinality information
- plot_missing_values(): Visualize missing values
- plot_correlation_heatmap(): Plot correlation heatmap
- plot_distributions(): Plot feature distributions
DataPreprocessor
- handle_missing_values(): Handle missing values with various strategies
- remove_duplicates(): Remove duplicate rows
- handle_outliers(): Handle outliers
- convert_dtypes(): Convert column data types
- clip_values(): Clip values to range
- remove_constant_columns(): Remove constant columns
- remove_high_cardinality_columns(): Remove high cardinality columns
- filter_rows(): Filter rows by condition
- drop_columns(): Drop specified columns
- rename_columns(): Rename columns
- apply_custom_function(): Apply custom transformation
FeatureEngineer
- encode_categorical_label(): Label encoding
- encode_categorical_onehot(): One-hot encoding
- encode_categorical_ordinal(): Ordinal encoding
- scale_features(): Scale features (standard, minmax, robust)
- create_polynomial_features(): Create polynomial features
- create_binning(): Bin continuous features
- create_log_transform(): Apply log transformation
- create_sqrt_transform(): Apply square root transformation
- create_datetime_features(): Extract datetime features
- create_aggregations(): Create aggregation features
- create_ratio_features(): Create ratio features
- create_flag_features(): Create binary flag features
FeatureSelector
- select_by_variance(): Select by variance threshold
- select_by_correlation(): Remove highly correlated features
- select_by_target_correlation(): Select by correlation with target
- select_by_statistical_test(): Select using statistical tests
- select_by_importance(): Select by feature importance
- select_by_missing_values(): Select by missing value threshold
- get_feature_importance_df(): Get feature scores dataframe
- apply_selection(): Apply selection to dataframe
License
MIT
File details
Details for the file feature_engineering_tk-2.0.0.tar.gz.
File metadata
- Download URL: feature_engineering_tk-2.0.0.tar.gz
- Upload date:
- Size: 34.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28f833cad54161f614d6e8466d5527494896920c366e95c7434ed49fdd1e5b64 |
| MD5 | 112392c41f0125b43cfcffbbb08fd0bd |
| BLAKE2b-256 | cfea47f2bcd59171e5f21a3c6c0eccf1188f8793ff005cf0b36fb1a66d826928 |
File details
Details for the file feature_engineering_tk-2.0.0-py3-none-any.whl.
File metadata
- Download URL: feature_engineering_tk-2.0.0-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9f3204b84bc0cab03108657d09f1e43c3a6395742806fabe9436425b0bb31c4f |
| MD5 | 2dcb4ccdcf5efac342f9031fd93f8065 |
| BLAKE2b-256 | 375906aecc3a0c5f063e38c7c6c11a06c2790180c168a8704e91c240aca55629 |