FeatureRefiner
A no-code solution for performing data transformations such as imputation, encoding, scaling, and feature creation, with an intuitive interface for interactive DataFrame manipulation and easy CSV export.
FeatureRefiner is a Python package for feature engineering that provides tools for data transformation, imputation, encoding, scaling, and feature creation. It ships with an interactive Streamlit interface that lets users easily apply these transformations to their datasets.
Features
- Create polynomial features
- Handle and extract date-time features
- Encode categorical data using various encoding techniques
- Impute missing values with different strategies
- Normalize and scale data using multiple scaling methods
- Interactive Streamlit interface for easy usage
Installation
It's recommended to install FeatureRefiner in a virtual environment to manage dependencies effectively and avoid conflicts with other projects.
1. Set Up a Virtual Environment
Using Python's built-in venv module (available in Python 3.3 and above):
- Create a virtual environment:
python -m venv env
Replace env with your preferred name for the virtual environment.
- Activate the virtual environment:
- On Windows:
env\Scripts\activate
- On macOS/Linux:
source env/bin/activate
2. Install FeatureRefiner
Once the virtual environment is activated, install FeatureRefiner using pip:
pip install FeatureRefiner
Quick Start
After installing the package, run the FeatureRefiner interface using:
run-FeatureRefiner
This will open a Streamlit app where you can upload your dataset and start applying transformations.
Usage
Command-Line Interface
To launch the Streamlit app, simply use the command:
run-FeatureRefiner
Importing Modules in Python
You can also use FeatureRefiner modules directly in your Python scripts:
from FeatureRefiner.imputation import MissingValueImputation
from FeatureRefiner.encoding import FeatureEncoding
from FeatureRefiner.scaling import DataNormalize
from FeatureRefiner.date_time_features import DateTimeExtractor
from FeatureRefiner.create_features import PolynomialFeaturesTransformer
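For illustration, the modules can be chained on a single DataFrame. The following is a minimal sketch with made-up column names: it imputes a missing value, label-encodes a categorical column, and standardizes the result, using only the classes and signatures documented in the sections that follow.
import numpy as np
import pandas as pd
from FeatureRefiner.imputation import MissingValueImputation
from FeatureRefiner.encoding import FeatureEncoding
from FeatureRefiner.scaling import DataNormalize
# Toy DataFrame with a missing numeric value and a categorical column
df = pd.DataFrame({'age': [25.0, np.nan, 40.0], 'city': ['NY', 'LA', 'NY']})
# 1. Fill the missing age with the column mean
df = MissingValueImputation(strategies={'age': 'mean'}).fit_transform(df)
# 2. Convert the categorical column to integer codes
df = FeatureEncoding(df).label_encode(['city'])
# 3. Standardize all columns to zero mean and unit variance
df = DataNormalize().scale(df, method='standard')
print(df)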
Modules Overview
The FeatureRefiner package provides several modules for different data transformation tasks:
- create_features.py - Generate polynomial features.
- date_time_features.py - Extract and handle date-time related features.
- encoding.py - Encode categorical features using techniques like Label Encoding and One-Hot Encoding.
- imputation.py - Handle missing values with multiple imputation strategies.
- scaling.py - Normalize and scale numerical features.
Each of these modules is described in detail below.
1. create_features.py
The create_features.py module provides functionality to generate polynomial features from numeric columns in a pandas DataFrame. The PolynomialFeaturesTransformer class supports creating polynomial combinations of the input features up to a specified degree, enhancing the feature set for predictive modeling.
Key Features
- Degree Specification: Allows setting the degree of polynomial features during initialization or transformation.
- Numeric Column Filtering: Automatically filters and processes only the numeric columns in the DataFrame.
- Error Handling: Provides robust error handling for invalid inputs, including non-numeric data and improper degree values.
Supported Transformations
- Polynomial Feature Creation: Generates polynomial combinations of input features based on the specified degree.
Example Usage:
from FeatureRefiner.create_features import PolynomialFeaturesTransformer
import pandas as pd
# Example DataFrame
data = {'feature1': [1, 2, 3], 'feature2': [4, 5, 6]}
df = pd.DataFrame(data)
# Initialize the PolynomialFeaturesTransformer object
transformer = PolynomialFeaturesTransformer(degree=2)
# Transform the DataFrame to include polynomial features
transformed_df = transformer.fit_transform(df)
print(transformed_df)
Methods
- __init__(degree): Initializes the transformer with the specified degree of polynomial features.
- fit_transform(df, degree=None): Fits the transformer to the numeric columns of the DataFrame and generates polynomial features. Optionally updates the polynomial degree.
- _validate_input(df): Validates the input DataFrame, ensuring it contains only numeric columns and no categorical data.
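Since fit_transform accepts an optional degree argument, the degree set at initialization can be overridden per call. A minimal sketch:
from FeatureRefiner.create_features import PolynomialFeaturesTransformer
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
# Initialized with degree 2, overridden to degree 3 at transform time
transformer = PolynomialFeaturesTransformer(degree=2)
cubic_df = transformer.fit_transform(df, degree=3)
print(cubic_df)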
2. date_time_features.py
The date_time_features.py module provides functionality to extract and parse datetime components from a specified column in a pandas DataFrame. The DateTimeExtractor class supports extracting the year, month, day, and day of the week from a datetime column.
Key Features
- Date Parsing: Handles multiple date formats for parsing datetime data.
- Component Extraction: Extracts year, month, day, and day of the week from a datetime column.
Supported Extractors
- Year Extraction: Adds a new column year with the extracted year.
- Month Extraction: Adds a new column month with the extracted month.
- Day Extraction: Adds a new column day with the extracted day.
- Day of Week Extraction: Adds a new column day_of_week with the extracted day of the week.
Example Usage
from FeatureRefiner.date_time_features import DateTimeExtractor
import pandas as pd
# Example DataFrame
data = {'date': ['2024-01-01', '2024-02-14', '2024-03-21']}
df = pd.DataFrame(data)
# Initialize the DateTimeExtractor object
extractor = DateTimeExtractor(df, datetime_col='date')
# Extract all datetime components
result_df = extractor.extract_all()
print(result_df)
Methods
- _parse_date(date_str): Tries to parse a date string using multiple formats.
- extract_year(): Extracts the year from the datetime column and adds it as a new column named year.
- extract_month(): Extracts the month from the datetime column and adds it as a new column named month.
- extract_day(): Extracts the day from the datetime column and adds it as a new column named day.
- extract_day_of_week(): Extracts the day of the week from the datetime column and adds it as a new column named day_of_week.
- extract_all(): Extracts the year, month, day, and day of the week from the datetime column and adds them as new columns.
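When only some components are needed, the individual extractors can be called instead of extract_all(). A minimal sketch, assuming each call adds its column to the DataFrame passed at construction:
from FeatureRefiner.date_time_features import DateTimeExtractor
import pandas as pd
df = pd.DataFrame({'date': ['2024-01-01', '2024-02-14']})
extractor = DateTimeExtractor(df, datetime_col='date')
# Add only the year and day_of_week columns instead of all four
extractor.extract_year()
extractor.extract_day_of_week()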
3. encoding.py
The encoding.py module provides functionality to encode categorical features in a pandas DataFrame using Label Encoding and One-Hot Encoding. The FeatureEncoding class offers methods for converting categorical data into a numerical format suitable for machine learning algorithms.
Key Features:
- Label Encoding: Converts categorical text data into numerical data by assigning a unique integer to each category.
- One-Hot Encoding: Converts categorical data into a binary matrix, creating a new column for each unique category.
Supported Encoders:
- LabelEncoder: Converts each category to a unique integer.
- OneHotEncoder: Converts categorical data into a binary matrix, with an option to drop the first category to avoid multicollinearity.
Example Usage:
from FeatureRefiner.encoding import FeatureEncoding
import pandas as pd
# Example DataFrame
data = {'Color': ['Red', 'Blue', 'Green'], 'Size': ['S', 'M', 'L']}
df = pd.DataFrame(data)
# Initialize the FeatureEncoding object
encoder = FeatureEncoding(df)
# Apply Label Encoding
df_label_encoded = encoder.label_encode(['Color'])
# Apply One-Hot Encoding
df_one_hot_encoded = encoder.one_hot_encode(['Size'])
print(df_label_encoded)
print(df_one_hot_encoded)
Methods:
- label_encode(columns: list) -> pd.DataFrame: Apply Label Encoding to the specified columns.
- one_hot_encode(columns: list) -> pd.DataFrame: Apply One-Hot Encoding to the specified columns, concatenate the encoded columns with the original DataFrame, and drop the original columns.
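Both methods take a list of column names, so several categorical columns can be encoded in one call. A minimal sketch:
from FeatureRefiner.encoding import FeatureEncoding
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'], 'Size': ['S', 'M', 'L']})
# Label-encode both categorical columns in a single call
encoder = FeatureEncoding(df)
df_encoded = encoder.label_encode(['Color', 'Size'])
print(df_encoded)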
4. imputation.py
The imputation.py module provides functionality for handling missing values in a pandas DataFrame using various imputation strategies. The MissingValueImputation class offers methods to fill missing values based on the specified strategies.
Key Features:
- Flexible Imputation: Allows for multiple imputation strategies such as mean, median, mode, or custom values.
- Column-Specific Strategies: Supports different strategies for different columns.
- Fit and Transform: Includes methods for fitting the imputation model and transforming data in a single step.
Supported Strategies:
- Mean: Fills missing values with the mean of the column (only applicable to numeric columns).
- Median: Fills missing values with the median of the column (only applicable to numeric columns).
- Mode: Fills missing values with the mode of the column.
- Custom Values: Allows specifying a custom value for imputation.
Example Usage:
from FeatureRefiner.imputation import MissingValueImputation
import numpy as np
import pandas as pd
# Example DataFrame
data = {'A': [1, 2, np.nan, 4, 5], 'B': [10, np.nan, 30, np.nan, 50]}
df = pd.DataFrame(data)
# Define imputation strategies
strategies = {
'A': 'mean',
'B': 25
}
# Initialize the MissingValueImputation object
imputer = MissingValueImputation(strategies=strategies)
# Fit and transform the DataFrame
imputed_df = imputer.fit_transform(df)
print(imputed_df)
Methods:
- _compute_fill_value(df: pd.DataFrame, column: str, strategy: Union[str, int, float]) -> Union[float, str]: Computes the fill value based on the imputation strategy for a given column.
- fit(df: pd.DataFrame) -> 'MissingValueImputation': Computes the fill values for missing data based on the provided strategies.
- transform(df: pd.DataFrame) -> pd.DataFrame: Applies the imputation to the DataFrame using the computed fill values.
- fit_transform(df: pd.DataFrame) -> pd.DataFrame: Computes the fill values and applies the imputation to the DataFrame in one step.
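Because fit and transform are separate steps, fill values computed on one DataFrame can, presumably, be reused on another with the same columns (for example, a train/test split). A minimal sketch using the signatures above:
from FeatureRefiner.imputation import MissingValueImputation
import numpy as np
import pandas as pd
train = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0]})
test = pd.DataFrame({'A': [np.nan, 10.0]})
# Learn the mean of column A on the training data
imputer = MissingValueImputation(strategies={'A': 'mean'})
imputer.fit(train)
# Apply the same learned fill value to both splits
train_imputed = imputer.transform(train)
test_imputed = imputer.transform(test)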
5. scaling.py
The scaling.py module provides functionality to scale and normalize numerical data in a pandas DataFrame using various scaling techniques from scikit-learn. The DataNormalize class offers methods backed by several scikit-learn scalers, such as StandardScaler, MinMaxScaler, RobustScaler, and others.
Key Features:
- General Data Scaling: Scales all numerical columns in the DataFrame.
- Column-Specific Scaling: Allows scaling specific columns within the DataFrame.
- Multiple Scalers Supported: Supports different scaling methods such as standardization, normalization, robust scaling, and more.
Supported Scalers:
- StandardScaler (standard): Scales data to have zero mean and unit variance.
- MinMaxScaler (minmax): Scales data to a specified range (default is 0 to 1).
- RobustScaler (robust): Scales data using statistics that are robust to outliers.
- MaxAbsScaler (maxabs): Scales data to the range [-1, 1] based on the maximum absolute value.
- Normalizer (l2): Scales each sample individually to have unit norm (L2 norm).
- QuantileTransformer (quantile): Transforms features to follow a uniform or normal distribution.
- PowerTransformer (power): Applies a power transformation to make data more Gaussian-like.
Example Usage:
from FeatureRefiner.scaling import DataNormalize
import pandas as pd
# Example DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Initialize the DataNormalize object
scaler = DataNormalize()
# Scale the entire DataFrame using MinMaxScaler
scaled_df = scaler.scale(df, method='minmax')
print(scaled_df)
Methods:
- scale(df: pd.DataFrame, method: str = 'standard') -> pd.DataFrame: Scales the entire DataFrame using the specified method.
- scale_columns(df: pd.DataFrame, columns: list, method: str = 'standard') -> pd.DataFrame: Scales specific columns of the DataFrame using the specified method.
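To leave some columns untouched (for example, a label column), scale_columns restricts scaling to the listed columns. A minimal sketch using the signature above:
from FeatureRefiner.scaling import DataNormalize
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30], 'label': [0, 1, 0]})
# Standardize only A and B; 'label' is left unscaled
scaler = DataNormalize()
scaled_df = scaler.scale_columns(df, columns=['A', 'B'], method='standard')
print(scaled_df)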
Requirements
Before installing, please make sure you have the following packages installed:
- Python >= 3.7
- Streamlit
- Pandas
- NumPy
- scikit-learn
- st-aggrid
For more detailed information, see the requirements.txt file.
Contributing
We welcome contributions! Please read our Contributing Guidelines for more details.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Acknowledgements
Special thanks to all the libraries and frameworks that have helped in developing this package.