Data preprocessing and cleaning tools for data science projects

ADCL-Automatic-Data-Cleaning Project

Overview

ADCL-Automatic-Data-Cleaning is a Python package for automated data cleaning. It covers the preprocessing tasks common in data science and machine learning workflows, with an emphasis on deep learning techniques.

Features

  • Data Preprocessing: Standardize, normalize, and format your data for machine learning models.
  • Missing Value Imputation: Implements various techniques for handling missing data in both cross-sectional and time-series datasets.
  • Outlier Detection: Identifies and manages outliers using multiple strategies, improving the robustness of your models.
  • Encoding and Transformation: Converts categorical data into a machine-readable format using various encoding techniques.
  • Time Series Handling: Special functions for processing time-dependent data.
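To make the difference between standardizing and normalizing concrete, here is a minimal, dependency-free sketch. It is not ADCL's implementation; the function names `standardize` and `min_max_normalize` are hypothetical and exist only for this illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column cannot be stretched; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(standardize(heights_cm))
print(min_max_normalize(heights_cm))
```

Standardization keeps the shape of the distribution but centers it, while min-max normalization pins the smallest value to 0 and the largest to 1.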

Repository Structure

  • data_preprocessing/: Contains the core library file data_preprocessing.py with all preprocessing functions.
  • examples/: Includes example_usage.ipynb, a Jupyter notebook demonstrating how to use the preprocessing functions.
  • missing_values_imputation_test/: Contains notebooks for testing missing value imputation across different data types.
  • outlier_detection_test/: Contains notebooks for testing outlier detection across different data types.
  • LICENSE: The project is open-sourced under the MIT license.

Installation

To install ADCL directly from PyPI, run the following command:

pip install adcl
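After installing, you may want to confirm which version is present. The sketch below uses only the Python standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical and not part of ADCL.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# "adcl" is the distribution name used in the pip command above.
print(installed_version("adcl"))
```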

Usage

Data Preprocessing

Preprocess a dataset by importing process_data from the adcl package (the preprocessing functions are defined in data_preprocessing.py). For example:

from adcl import process_data
filepath = 'path_to_your_data.csv'
df_train, df_test, y_column_name, date_col = process_data(train_input=filepath)

Missing Value Handling

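Handle missing values by choosing an imputation method from the library. An example for time-series data, where X_train and X_test are your training and test DataFrames and date_col is the datetime column returned by process_data: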

from adcl import missing_values_handling
X_train_mis, X_test_mis = missing_values_handling(df_train=X_train, df_test=X_test, datetime_col=date_col, imputation_method='auto')
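This README does not document what imputation_method='auto' selects. As a point of reference only, one common baseline for time-series gaps is a forward fill. The sketch below is a plain-Python illustration of that baseline, not ADCL's code; `forward_fill` is a hypothetical name.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value.

    Leading gaps have no earlier observation, so they take the first
    observed value instead (a backward fill for the head of the series).
    """
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)

    # Any None still present sits before the first observation.
    first_observed = next((v for v in values if v is not None), None)
    return [first_observed if v is None else v for v in filled]


readings = [None, 1.0, None, None, 4.0, None]
print(forward_fill(readings))
```

Forward filling respects time order, which is why it is often preferred over a column mean for time-dependent data: each gap is filled using only values observed earlier.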

Outlier Detection

Detect outliers by choosing a detection method from the library. An example for time-series data:

from adcl import outlier_detection
X_train_out, X_test_out = outlier_detection(X_train=X_train, X_test=X_test, datetime_col=date_col,
                                    method='auto', nu=0.05, kernel='rbf', gamma='scale',
                                    n_neighbors=20, contamination='auto', n_estimators=100,
                                    encoding_dim=8, epochs=50, batch_size=32,
                                    window_size=20, dtw_window=None)
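The README does not say how each method works internally or what window_size means to ADCL. To illustrate the general idea of window-based outlier detection on a time series, here is a minimal sketch that flags a point when it lies far from the mean of the preceding window. The function `rolling_zscore_outliers` and its `threshold` parameter are hypothetical and not part of ADCL.

```python
from statistics import mean, pstdev


def rolling_zscore_outliers(series, window_size=5, threshold=3.0):
    """Return indices of points that deviate strongly from the preceding window."""
    flagged = []
    for i in range(window_size, len(series)):
        window = series[i - window_size:i]
        mu = mean(window)
        sigma = pstdev(window)
        if sigma == 0:
            # A flat window gives no estimate of spread, so skip this point.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8, 10.2]
print(rolling_zscore_outliers(readings, window_size=5, threshold=3.0))  # prints [6]
```

Note that once an extreme value enters the window it inflates the window's spread, so later points are judged more leniently. More robust statistics (for example the median) reduce that effect.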

Categorical Variables Encoding

Encode categorical variables by choosing an encoding method from the library. An example using label encoding:

from adcl import encode_data
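# Python does not allow a positional argument after keyword arguments, so the
# first three arguments are passed positionally here. This assumes the order
# (df_train, df_test, target column name) implied by the original call.
X_train_enc, X_test_enc = encode_data(X_train, X_test, y_column_name,
                encoding_method='label', nu=0.05, kernel='rbf', gamma='scale',
                n_neighbors=20, contamination='auto', n_estimators=100,
                encoding_dim=8, epochs=50, batch_size=32)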
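For readers unfamiliar with label encoding, the sketch below shows the idea in plain Python: learn an integer code per category from the training data, then apply the same mapping to the test data. It is an illustration only, not ADCL's implementation; the function names and the choice of -1 for unseen categories are assumptions made for this example.

```python
def fit_label_encoder(train_values):
    """Map each distinct training category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(train_values)))}


def apply_label_encoder(mapping, values, unknown_code=-1):
    """Encode values with a fitted mapping; unseen categories get unknown_code."""
    return [mapping.get(v, unknown_code) for v in values]


train_colors = ["red", "green", "blue", "green"]
test_colors = ["blue", "purple", "red"]

mapping = fit_label_encoder(train_colors)
print(apply_label_encoder(mapping, train_colors))  # prints [2, 1, 0, 1]
print(apply_label_encoder(mapping, test_colors))   # prints [0, -1, 2]
```

Fitting on the training set alone keeps the codes consistent between the two sets and avoids leaking information from the test set.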

Example Notebooks

For detailed examples, see example_usage.ipynb in the examples/ directory, which demonstrates how to use the preprocessing functions.

Contributing

Contributions are welcome! If you have suggestions for improving the library, feel free to fork the repository and submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any queries or further information, please contact steve19992@mail.ru.

Download files

Source Distribution

adcl-0.1.7.tar.gz (16.4 kB)

Uploaded Source

Built Distribution

adcl-0.1.7-py3-none-any.whl (15.2 kB)

Uploaded Python 3

File details

Details for the file adcl-0.1.7.tar.gz.

File metadata

  • Download URL: adcl-0.1.7.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.9

File hashes

Hashes for adcl-0.1.7.tar.gz
Algorithm Hash digest
SHA256 b98f09929b3657061260ad3d430b8ccba204084788966a39fe605d83b961657c
MD5 3d8270358e8bbf3f714d1935466a61c3
BLAKE2b-256 ae81a3f52475a5f11e9c8350a83acb31a7666a05e8b01408d8de118d102f2f0f

File details

Details for the file adcl-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: adcl-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 15.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.7.9

File hashes

Hashes for adcl-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 f3b0c73db3fdcebcc875f1e4d207295cc5bcd4b17be840bed0b6e9747427004d
MD5 291a1f9ad7577452315afc2bcc38d3a0
BLAKE2b-256 42d25ab106ee767eb3ce7cc5b27b2e74188cb1e510ff453ce8545cafb9a1caa1
