AutoPrep is an automated preprocessing pipeline with univariate anomaly marking

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

AutoPrep - Automated Preprocessing Pipeline with Univariate Anomaly Indicators


This pipeline focuses on data preprocessing, standardization, and cleaning, with additional features for identifying univariate anomalies.

[Figure: Structure of Preprocessing Pipeline]

pip install AutoPrep

Dependencies

scikit-learn
category_encoders
bitstring
ydata_profiling

Basic Usage

To use the pipeline, import the required libraries and initialize AutoPrep. Here is a basic example:

import pandas as pd
import numpy as np

# Small example frame with deliberately messy values
data = {
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 42],  # mixed types: strings and an integer
    'Rank': ['A', 'B', 'C', 'D'],
    'Age': [25, 30, 35, np.nan],  # contains a missing value
    'Salary': [50000.00, 60000.50, 75000.75, 80000.00],
    'Hire Date': pd.to_datetime(['2020-01-15', '2019-05-22', '2018-08-30', '2021-04-12']),
    'Is Manager': [False, True, False, ""]  # booleans plus an empty string
}
data = pd.DataFrame(data)

from AutoPrep import AutoPrep

pipeline = AutoPrep(
    nominal_columns=["ID", "Name", "Is Manager", "Age"],
    datetime_columns=["Hire Date"],
    pattern_recognition_columns=["Name"],
    scaler_option_num="standard",
    deactivate_missing_indicator=True
)

# Automated preprocessing of the data
X_output_preprocessed = pipeline.fit_transform(df=data)

# Automated preprocessing plus anomaly detection (uses the pyod library)
X_output_anomalies = pipeline.find_anomalies(df=data)

# Optional: profile the DataFrame or visualize the pipeline structure
# pipeline.get_profiling(X=data)
# pipeline.visualize_pipeline_structure_html()

Highlights ⭐

📌 Implementation of univariate methods / Detection of univariate anomalies

Both methods (the modified Z-score and Tukey's method) are robust to outliers: they rely on the median and quartiles rather than the mean, so the location estimate is not biased by extreme values. They also help multivariate anomaly detection algorithms identify univariate anomalies.
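
To illustrate the two univariate methods named above, here is a minimal NumPy sketch. It is not AutoPrep's implementation: the function names, the 0.6745 constant with a 3.5 cutoff for the modified Z-score, and the 1.5 × IQR fence for Tukey's method are common textbook conventions assumed here for illustration.

```python
import numpy as np


def modified_z_scores(values):
    """Modified Z-score: deviations from the median, scaled by the MAD."""
    x = np.asarray(values, dtype=float)
    median = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - median))  # median absolute deviation
    if mad == 0:
        # All deviations are zero at the median; nothing can be flagged.
        return np.zeros_like(x)
    return 0.6745 * (x - median) / mad


def tukey_outlier_mask(values, k=1.5):
    """Tukey's fences: flag values more than k * IQR outside the quartiles."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Hypothetical salary column with one extreme value appended.
salaries = [50000.0, 60000.5, 75000.75, 80000.0, 1_000_000.0]
mod_z_mask = np.abs(modified_z_scores(salaries)) > 3.5
tukey_mask = tukey_outlier_mask(salaries)
```

Both masks flag only the last value. Because the median and quartiles barely move when that value is added, the thresholds stay anchored to the bulk of the data.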

📌 BinaryEncoder instead of OneHotEncoder for nominal columns / Big Data and Performance

Recent research reports comparable results when nominal columns are binary-encoded instead of one-hot encoded, with significantly fewer dimensions.

- John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks." Journal of Big Data 7.1 (2020), pp. 1–41. See Tables 2 and 4.
- Diogo Seca and João Mendes-Moreira. "Benchmark of Encoders of Nominal Features for Regression." World Conference on Information Systems and Technologies, 2021, pp. 146–155. See p. 151.
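
The dimensionality argument can be made concrete with a small, dependency-free sketch. This is a hand-rolled illustration of the idea behind binary encoding, not the `category_encoders` implementation; the helper `binary_encode` and the choice to reserve code 0 for unseen categories are assumptions made for this example.

```python
import math


def binary_encode(values):
    """Map each category to an integer code and return its binary digits."""
    categories = sorted(set(values))
    # Codes start at 1 so that 0 stays free for unseen categories (assumption).
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    width = max(1, math.ceil(math.log2(len(categories) + 1)))
    return [
        [(codes[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]


# Hypothetical nominal column with ten distinct categories.
ranks = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
encoded = binary_encode(ranks)

one_hot_width = len(set(ranks))  # one column per category
binary_width = len(encoded[0])   # roughly log2 of the category count
```

For ten categories, one-hot encoding needs ten columns while this binary scheme needs four, and the gap widens quickly as cardinality grows.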

📌 Transformation of time series data and standardization of data with RobustScaler / Normalization for better prediction results
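
As a rough sketch of what this highlight describes, the snippet below splits a datetime column into numeric parts and applies robust scaling by hand. It is not taken from AutoPrep: the derived column names are hypothetical, and the scaling formula (subtract the median, divide by the interquartile range) mirrors what scikit-learn's RobustScaler does with its default settings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Salary": [50000.00, 60000.50, 75000.75, 80000.00],
    "Hire Date": pd.to_datetime(
        ["2020-01-15", "2019-05-22", "2018-08-30", "2021-04-12"]
    ),
})

# Split the datetime column into numeric parts (hypothetical column names).
df["Hire Date_year"] = df["Hire Date"].dt.year
df["Hire Date_month"] = df["Hire Date"].dt.month
df["Hire Date_day"] = df["Hire Date"].dt.day

# Robust scaling: center on the median and divide by the interquartile range,
# so a single extreme salary would not dominate the scale.
median = df["Salary"].median()
q1, q3 = df["Salary"].quantile([0.25, 0.75])
df["Salary_scaled"] = (df["Salary"] - median) / (q3 - q1)
```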

📌 Labeling of NaN values in an extra column instead of removing them / No loss of information
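
The idea of labeling missing values rather than dropping rows can be sketched with plain pandas. This is an illustration under assumptions, not AutoPrep's code: the indicator column name `Age_missing` and the choice of median imputation are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35, np.nan]})

# Record where values were missing before filling them in.
df["Age_missing"] = df["Age"].isna().astype(int)

# Fill the gap (median imputation is an assumption) so the row is kept.
df["Age"] = df["Age"].fillna(df["Age"].median())
```

All four rows survive, and a downstream model can still see that the last age was originally missing.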

Pipeline - Built-in Logic

[Figure: Logic of Pipeline]

Feel free to contribute 🙂

Reference

https://www.researchgate.net/publication/379640146_Detektion_von_Anomalien_in_der_Datenqualitatskontrolle_mittels_unuberwachter_Ansatze (German Thesis)

Further Information

This preprocessing pipeline is built on sklearn's Pipeline and Transformer concepts:
- Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- Transformer: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html

Release history

- 3.0.0: Sep 30, 2024 (this version)
- 2.0.3: Aug 27, 2024
- 2.0.2: Aug 19, 2024
- 2.0.1: Aug 18, 2024
- 2.0.0: Aug 17, 2024
- 1.8.5: Aug 16, 2024
- 1.8.4: Aug 16, 2024
- 1.8.3: Aug 14, 2024
- 1.8.2: Aug 13, 2024
- 1.8.1: Aug 12, 2024
- 1.8: Aug 12, 2024
- 0.1: Aug 11, 2024

Download files

Source Distribution

autoprep-2.0.3.tar.gz (19.6 kB)

Uploaded Aug 27, 2024 Source

Built Distribution

AutoPrep-2.0.3-py3-none-any.whl (31.6 kB)

Uploaded Aug 27, 2024 Python 3

Hashes for autoprep-2.0.3.tar.gz
Algorithm	Hash digest
SHA256	`8537b7b03556c57be9039e786b7996b9b28393dde43cbf07efd70b10517307fb`
MD5	`31e74341e08c330796c9f2a82b41ecc9`
BLAKE2b-256	`cedfe113ba48c0bd7883b5c4c2ac64083ecfda2aa26265c0c6e6b2e1013a85df`

Hashes for AutoPrep-2.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32ebfa3badb8665c331912776d42ac3a3c59e9a16d24028324ebde42bbbcf1a7`
MD5	`8133bda75cb98f52129c215254f65448`
BLAKE2b-256	`6e02b7b28936a2a33e7978918ee6c50b410abc413d1436793eec1cf017c70298`