AutoPrep is an automated preprocessing pipeline with univariate anomaly marking

AutoPrep - Automated Preprocessing Pipeline with Univariate Anomaly Marking

This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features to identify univariate anomalies.

[Figure: Structure of Preprocessing Pipeline]

Installation

pip install AutoPrep

Dependencies

  • scikit-learn
  • category_encoders
  • bitstring
  • ydata_profiling

Basic Usage

To use the pipeline, import AutoPrep, initialize it with the column types of your data, and call preprocess on a DataFrame. Here is a basic example:

############## dummy data #############
import pandas as pd

data = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'R2D2'],
    'Age': [25, 30, 35, 90],
    'Salary': [50000.00, 60000.50, 75000.75, 80000.00],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, True]
})
########################################


from Autoprep import AutoPrep

# Declare which columns are nominal, which hold datetimes,
# and which should be checked for patterns.
pipeline = AutoPrep(
    nominal_columns=["ID", "Name", "Is Manager"],
    datetime_columns=["Hire Date"],
    pattern_recognition_columns=["Name"]
)
X_output = pipeline.preprocess(df=data)

# Optional extras (commented out):
# pipeline.get_profiling(X=data)
# pipeline.visualize_pipeline_structure_html()

The preprocessed DataFrame is returned by preprocess and is available as:

X_output

> Output:
    col_1  col_2  ...   col_n
1   data   ...    ...   data   
2   data   ...    ...   data  
... ...    ...    ...   ...   

Highlights ⭐

📌 Implementation of univariate methods / Detection of univariate anomalies

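Both methods (modified Z-score and Tukey's method) are robust to outliers: they rely on the median and quartiles rather than the mean and standard deviation, so the location estimate is not biased by the extreme values being searched for. They also complement multivariate anomaly detection algorithms by flagging anomalies that occur in a single column.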
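To make the two methods concrete, here is a minimal standard-library sketch. It is not AutoPrep's implementation: the function names, the 3.5 cutoff for the modified Z-score, the 1.5 fence multiplier, and the handling of a zero MAD are common textbook choices assumed here for illustration.

```python
import statistics


def modified_z_outliers(values, threshold=3.5):
    """Flag values whose modified Z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD)
    instead of the mean and standard deviation. The 3.5 threshold is a
    commonly used default, assumed here.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # Assumption for this sketch: with zero spread, flag nothing.
        return [False for _ in values]
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


def tukey_outliers(values, k=1.5):
    """Flag values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


ages = [25, 30, 35, 90]  # the 'Age' column from the example above
print(modified_z_outliers(ages))  # [False, False, False, True]
print(tukey_outliers(ages))       # [False, False, False, True]
```

Both functions flag only the value 90, even though that value inflates the mean of the column to 45.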

📌 BinaryEncoder instead of OneHotEncoder for nominal columns / Big Data and Performance

Recent research shows that binary encoding achieves results similar to one-hot encoding for nominal columns while producing significantly fewer dimensions.

  • John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." In: Journal of Big Data 7.1 (2020), pp. 1–41. See Tables 2 and 4.
  • Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." In: World Conference on Information Systems and Technologies. 2021, pp. 146–155. See p. 151.
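The dimensionality argument can be illustrated with a small sketch. This is a simplified, hypothetical helper (binary_encode is not part of AutoPrep or category_encoders): it assigns each distinct category an integer code starting at 1 and writes that code as fixed-width bits, so the number of columns grows logarithmically with the number of categories instead of linearly.

```python
def binary_encode(categories):
    """Return (mapping, width): each distinct category mapped to a bit list.

    Codes start at 1; code 0 is left unused in this sketch. The width is
    the number of bits needed for the largest code.
    """
    distinct = sorted(set(categories))
    width = max(1, len(distinct).bit_length())
    mapping = {
        cat: [int(bit) for bit in format(code, f"0{width}b")]
        for code, cat in enumerate(distinct, start=1)
    }
    return mapping, width


names = ['Alice', 'Bob', 'Charlie', 'R2D2']
mapping, width = binary_encode(names)
print(width)            # 3 columns, versus 4 for one-hot encoding
print(mapping['R2D2'])  # [1, 0, 0]

# With 1000 distinct categories, one-hot needs 1000 columns.
_, wide = binary_encode([str(i) for i in range(1000)])
print(wide)             # 10
```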

📌 Transformation of time series data and standardization of data with RobustScaler / Normalization for better prediction results
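By default, scikit-learn's RobustScaler centers each column on its median and divides by its interquartile range. The following standard-library sketch shows that calculation on a single column; the function name robust_scale and the fallback for a zero interquartile range are assumptions made for this example, and it does not show how AutoPrep wires the scaler into its pipeline.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / IQR, using the 25th and 75th percentiles."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Fallback chosen for this sketch: avoid dividing by zero.
        iqr = 1.0
    return [(v - median) / iqr for v in values]


ages = [25, 30, 35, 90]
print(robust_scale(ages))  # [-0.375, -0.125, 0.125, 2.875]
```

The outlier 90 remains clearly separated after scaling, while the three typical values stay close to zero because the median and quartiles are barely affected by it.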

📌 Labeling of NaN values in an extra column instead of removing them / No loss of information
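The idea of labeling missing values rather than dropping rows can be sketched as follows. This is an illustration only: the helper label_missing and the "_missing" column suffix are hypothetical, and the column names AutoPrep actually produces may differ.

```python
import math


def label_missing(rows, column):
    """Add a '<column>_missing' flag to every row instead of dropping rows."""
    labeled = []
    for row in rows:
        value = row.get(column)
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        # Keep all original fields and add the indicator next to them.
        labeled.append({**row, f"{column}_missing": int(is_missing)})
    return labeled


rows = [
    {"Name": "Alice", "Age": 25.0},
    {"Name": "Bob", "Age": float("nan")},
    {"Name": "Charlie", "Age": None},
]
for row in label_missing(rows, "Age"):
    print(row["Name"], row["Age_missing"])
```

All three rows are kept, so the other columns of the rows with a missing Age remain available to downstream models.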


Pipeline - Built-in Logic

[Figure: Logic of Pipeline]



Feel free to contribute 🙂
