RobustPreprocessor cleans numeric datasets so they are ready for further analysis or modeling.

RobustPreprocessor

RobustPreprocessor is a Python package for comprehensive and flexible data preprocessing. It is designed to clean and prepare numeric datasets for machine learning and analysis by handling outliers, missing values, infinity values, and redundant features.


Features

  • Outlier Handling:
    • Interquartile Range (IQR) clipping.
    • Z-score filtering.
  • Infinity Value Handling:
    • Replace with finite extremes.
    • Drop rows containing infinity.
    • Replace with a specific default value (e.g., 0).
  • Missing Value Imputation:
    • Mean, median, or most frequent value strategies.
  • Feature Removal:
    • Drop constant or near-constant features.
  • Visualization:
    • Plot feature distributions with histograms.
  • Execution Logging:
    • Outputs a JSON summary of preprocessing steps, including dropped columns and execution time.
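
To make the two outlier strategies listed above concrete, here is a minimal sketch in plain pandas. It is not the package's implementation: the helper names `clip_iqr` and `filter_zscore`, the 1.5 IQR multiplier, and the z-score thresholds are assumptions chosen for illustration.

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def filter_zscore(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Keep only rows whose |z-score| in `column` is within `threshold` (hypothetical helper)."""
    z = (df[column] - df[column].mean()) / df[column].std(ddof=0)
    return df[z.abs() <= threshold]


values = pd.Series([1, 2, 3, 1000, 5], dtype=float, name="feature_1")

# IQR clipping keeps every row but pulls the extreme value back to the upper fence.
clipped = clip_iqr(values)
print(clipped.tolist())  # -> [1.0, 2.0, 3.0, 9.5, 5.0]

# Z-score filtering removes rows instead of modifying them.
kept = filter_zscore(values.to_frame(), "feature_1", threshold=1.5)
print(len(kept))  # -> 4
```

Clipping preserves the row count, while filtering shrinks the dataset. Note that with only five rows the population z-score cannot exceed 2, so the common threshold of 3 would remove nothing in this tiny example; the sketch uses 1.5 for that reason.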

Installation

Install via pip:

pip install robustpreprocessor

Or clone the repository and install locally:

git clone https://github.com/nqmn/robustpreprocessor.git
cd robustpreprocessor
pip install .

Usage

Import the Package

from robustpreprocessor import RobustPreprocessor
import numpy as np  # needed for np.inf in the example dataset
import pandas as pd

# Example dataset
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 1000, 5],
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Initialize the preprocessor
preprocessor = RobustPreprocessor(verbose=True)

# Preprocess the dataset
cleaned_data = preprocessor.preprocess(
    data,
    outlier_method="IQR",
    infinity_handling="set_value",
    missing_value_strategy="mean",
    feature_removal_criteria="constant"
)

# View the cleaned data
print(cleaned_data)
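
For readers who want to see what the options selected above roughly correspond to, the following sketch performs comparable steps with plain pandas and NumPy. It is an approximation written for illustration, not the package's internals; the variable names and the exact order of steps are assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "feature_2": [1, 2, None, 4, 5],
    "feature_3": [0, 0, 0, 0, 0],
    "feature_4": [1, 2, np.inf, -np.inf, 5],
})

# Infinity handling, "set_value" style: replace +inf and -inf with a fixed value (0 here).
no_inf = data.replace([np.inf, -np.inf], 0)

# Missing-value imputation, "mean" style: fill each NaN with its column mean.
imputed = no_inf.fillna(no_inf.mean())

# Feature removal, "constant" style: drop columns that hold a single unique value.
constant_cols = [col for col in imputed.columns if imputed[col].nunique() <= 1]
result = imputed.drop(columns=constant_cols)

print(constant_cols)  # -> ['feature_3']
print(result)
```

Replacing infinities before imputing matters in this sketch: a column mean computed while infinite values are still present would itself be infinite or NaN.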

Visualization

Use the plot_feature_distributions method to visualize the distributions of numeric features:

preprocessor.plot_feature_distributions(cleaned_data)
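
If you need more control over the figure, a similar per-feature histogram grid can be drawn directly with matplotlib. This is a sketch under assumptions, not a description of what plot_feature_distributions does internally: the bin count, figure size, and sample data are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 9.5, 5.0],
    "feature_2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Only numeric columns can be binned into a histogram.
numeric = df.select_dtypes(include="number")

fig, axes = plt.subplots(
    1, len(numeric.columns), figsize=(4 * len(numeric.columns), 3), squeeze=False
)
for ax, col in zip(axes[0], numeric.columns):
    ax.hist(numeric[col].dropna(), bins=10)  # one histogram per feature
    ax.set_title(col)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.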

Logging and Execution Summary

After preprocessing, the class outputs a JSON log summarizing the steps taken, including execution time and dropped columns:

{
    "process_type": "RobustPreprocessor",
    "user_selections": {
        "outlier_method": "IQR",
        "infinity_handling": "set_value",
        "missing_value_strategy": "mean",
        "feature_removal_criteria": "constant"
    },
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "infinity_handling": "Replaced infinity values with a set value (0).",
        "missing_value_imputation": "Imputed missing values with mean strategy.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
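
A summary in this shape is easy to inspect programmatically. The sketch below assumes the JSON text has already been captured in a string (how the package emits the log, for example to stdout or a file, is not specified here) and uses only the standard library.

```python
import json

# Assumption: the log text was captured from wherever the package writes it.
log_text = """
{
    "process_type": "RobustPreprocessor",
    "user_selections": {
        "outlier_method": "IQR",
        "infinity_handling": "set_value",
        "missing_value_strategy": "mean",
        "feature_removal_criteria": "constant"
    },
    "steps_executed": {
        "select_numeric_columns": "Selected 4 numeric columns",
        "outlier_handling": "Handled outliers using IQR method.",
        "infinity_handling": "Replaced infinity values with a set value (0).",
        "missing_value_imputation": "Imputed missing values with mean strategy.",
        "feature_removal": "Dropped 1 constant columns."
    },
    "dropped_columns": 1,
    "execution_time_seconds": 0.1234
}
"""

log = json.loads(log_text)

# JSON object order is preserved, so this lists the steps in the order they were logged.
executed = list(log["steps_executed"])
for step in executed:
    print(f"{step}: {log['steps_executed'][step]}")

print("dropped columns:", log["dropped_columns"])
```

Checking `dropped_columns` and `user_selections` in an automated pipeline is one way to confirm that a run used the intended configuration.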

Dependencies

  • numpy
  • pandas
  • scikit-learn
  • matplotlib
  • scipy

From a clone of the repository, install them with:

pip install -r requirements.txt

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Commit your changes with clear descriptions.
  4. Submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Support

If you encounter issues or have questions, feel free to open an issue on GitHub.


Acknowledgments

  • Inspired by common challenges in data preprocessing.
  • Thanks to the contributors and the open-source community!
