RobustPreprocessor cleans numeric datasets (outliers, missing values, infinity values, redundant features) before further analysis or modeling.
RobustPreprocessor
RobustPreprocessor is a Python package for comprehensive and flexible data preprocessing. It is designed to clean and prepare numeric datasets for machine learning and analysis by handling outliers, missing values, infinity values, and redundant features.
Features
- Outlier Handling:
  - Interquartile Range (IQR) clipping.
  - Z-score filtering.
- Infinity Value Handling:
  - Replace with finite extremes.
  - Drop rows containing infinity.
  - Replace with a specific default value (e.g., 0).
- Missing Value Imputation:
  - Mean, median, or most frequent value strategies.
- Feature Removal:
  - Drop constant or near-constant features.
- Visualization:
  - Plot feature distributions with histograms.
- Execution Logging:
  - Outputs a JSON summary of preprocessing steps, including dropped columns and execution time.
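To make the two outlier strategies concrete, here is a minimal sketch in plain pandas. It is not the package's implementation: the helper names `iqr_clip` and `zscore_filter`, the 1.5 IQR multiplier, and the threshold of 3 standard deviations are conventional choices assumed for illustration, and the package may use different defaults.

```python
import pandas as pd


def iqr_clip(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def zscore_filter(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows where every column is within `threshold` standard deviations
    of its column mean (hypothetical helper).

    Rows with NaN, and all rows if a column has zero variance, are dropped,
    because their z-scores are NaN and fail the comparison.
    """
    z = (df - df.mean()) / df.std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]


clipped = iqr_clip(pd.Series([1, 2, 3, 1000, 5]))
filtered = zscore_filter(pd.DataFrame({"a": [0] * 20 + [100]}))
```

Clipping keeps every row and pulls extreme values back to the fence, while Z-score filtering removes whole rows, so the two methods can return frames of different lengths.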
Installation
Install via pip:
pip install robustpreprocessor
Or clone the repository and install locally:
git clone https://github.com/nqmn/robustpreprocessor.git
cd robustpreprocessor
pip install .
Usage
Import the Package
import numpy as np
import pandas as pd
from robustpreprocessor import RobustPreprocessor

# Example dataset: an outlier, a missing value, a constant column, and infinities
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 1000, 5],
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Initialize the preprocessor
preprocessor = RobustPreprocessor(verbose=True)

# Preprocess the dataset
cleaned_data = preprocessor.preprocess(
    data,
    outlier_method="IQR",
    infinity_handling="set_value",
    missing_value_strategy="mean",
    feature_removal_criteria="constant"
)

# View the cleaned data
print(cleaned_data)
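As a way to reason about what these options do, the sketch below applies the non-outlier steps by hand in plain pandas to the same example frame. It is an approximation under stated assumptions, not the package's code: it assumes "set_value" means replacing infinities with 0, that "constant" means a column with a single distinct value, and it skips the outlier step, so the numbers need not match the package's output.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "feature_1": [1, 2, 3, 1000, 5],
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Infinity handling (assumed "set_value" behavior): replace +/-inf with 0.
no_inf = data.replace([np.inf, -np.inf], 0)

# Missing-value imputation ("mean"): fill NaN with each column's mean.
imputed = no_inf.fillna(no_inf.mean())

# Feature removal ("constant"): drop columns with a single distinct value.
constant_cols = [c for c in imputed.columns if imputed[c].nunique() <= 1]
result = imputed.drop(columns=constant_cols)

print(result)
```

Here `feature_3` is the only constant column, and the missing entry in `feature_2` is filled with the mean of the remaining four values.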
Visualization
Use the plot_feature_distributions method to visualize the distributions of numeric features:
preprocessor.plot_feature_distributions(cleaned_data)
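For readers who want a comparable plot without the package, a rough equivalent can be built with `DataFrame.hist` from pandas and matplotlib. The function name `plot_histograms` and the bin count are assumptions for illustration; the package's method may lay out or style its figures differently.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df: pd.DataFrame, bins: int = 20):
    """Draw one histogram per numeric column (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    axes = numeric.hist(bins=bins, figsize=(8, 6))
    plt.tight_layout()
    return axes


axes = plot_histograms(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
plt.savefig("feature_distributions.png")
```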
Logging and Execution Summary
After preprocessing, the class outputs a JSON log summarizing the steps taken, including execution time and dropped columns:
{
    "process_type": "RobustPreprocessor",
    "user_selections": {
        "outlier_method": "IQR",
        "infinity_handling": "set_value",
        "missing_value_strategy": "mean",
        "feature_removal_criteria": "constant"
    },
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "infinity_handling": "Replaced infinity values with a set value (0).",
        "missing_value_imputation": "Imputed missing values with mean strategy.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
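If the log text has been captured as a string, it can be inspected with the standard `json` module. The sketch below assumes a log shaped like the sample above (shortened here); how the text is captured is not covered, and `summarize_log` is a hypothetical helper, not part of the package.

```python
import json

# A shortened copy of the sample log, held in a string for illustration.
log_text = """
{
    "process_type": "RobustPreprocessor",
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
"""


def summarize_log(text: str) -> list[str]:
    """Turn the JSON log into one readable line per step (hypothetical helper)."""
    log = json.loads(text)
    lines = [f"{step}: {msg}" for step, msg in log["steps_executed"].items()]
    lines.append(f"dropped columns: {log['dropped_columns']}")
    lines.append(f"elapsed: {log['execution_time_seconds']:.4f} s")
    return lines


summary = summarize_log(log_text)
print("\n".join(summary))
```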
Dependencies
numpy
pandas
scikit-learn
matplotlib
scipy
Install them with:
pip install -r requirements.txt
Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Commit your changes with clear descriptions.
- Submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Support
If you encounter issues or have questions, feel free to open an issue on GitHub.
Acknowledgments
- Inspired by common challenges in data preprocessing.
- Thanks to the contributors and the open-source community!