A high-level library for automatic preprocessing of tabular data
AutoDataPreprocess
AutoDataPreprocess is a comprehensive Python library for automated data preprocessing. It provides a wide range of tools and techniques to clean, transform, and prepare data for machine learning models.
Features
- Data loading from various sources (CSV, JSON, Excel, HTML, XML, Pickle, SQL, API)
- Basic data analysis and visualization
- Data cleaning (handling missing values, outliers, duplicates)
- Feature engineering
- Encoding of categorical variables (onehot, label, ordinal, target, woe, james_stein, catboost, binary)
- Scaling and normalization
- Dimensionality reduction
- Feature selection
- Handling imbalanced data
- Time series preprocessing
- Data anonymization
Installation
You can install AutoDataPreprocess using pip:
pip install autodatapreprocess
Quick Start
from autodatapreprocess import AutoDataPreprocess
# Load data
adp = AutoDataPreprocess('your_data_file.csv')
# Perform basic analysis
adp.basic_analysis()
# Clean the data
cleaned_data = adp.clean(missing='mean', outliers='iqr')
# Perform feature engineering
engineered_data = adp.fe(target_column='target', polynomial_degree=2)
# Encode categorical variables
encoded_data = adp.encode(methods={'category_column': 'onehot'})
# Scale the data
scaled_data = adp.scale(method='standard')
Detailed Usage
Data Loading
Load data from various sources:
# From CSV
adp = AutoDataPreprocess('data.csv')
# From SQL
adp = AutoDataPreprocess(sql_query="SELECT * FROM table", sql_connection_string="your_connection_string")
# From API
adp = AutoDataPreprocess(api_url="https://api.example.com/data", api_params={"key": "value"})
Data Cleaning
Clean your data with various options:
cleaned_data = adp.clean(
    missing='mean',
    outliers='iqr',
    drop_threshold=0.7,
    date_format='%Y-%m-%d',
    remove_duplicates=True
)
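For readers unfamiliar with the option names, the sketch below shows what mean imputation and the IQR rule conventionally mean, written in plain Python. It is not AutoDataPreprocess's implementation: the helper names `impute_mean` and `iqr_bounds` are hypothetical, the 1.5 multiplier is the textbook default rather than a documented library setting, and whether the library drops or clips flagged values is not stated here, so the example only flags them.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with one missing entry and one extreme value.
raw = [10.0, 12.0, None, 11.0, 13.0, 95.0]
filled = impute_mean(raw)
low, high = iqr_bounds(filled)
outliers = [v for v in filled if v < low or v > high]
print(filled, outliers)
```

Note that the imputed mean is itself pulled upward by the extreme value, which is one reason the order of imputation and outlier handling matters.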
Feature Engineering
Perform feature engineering:
engineered_data = adp.fe(
    target_column='target',
    polynomial_degree=2,
    interaction_only=False,
    bin_numeric=True,
    num_bins=5,
    cyclical_features=['month', 'day_of_week'],
    text_columns=['description'],
    date_columns=['date']
)
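The `cyclical_features` option suggests the usual sine/cosine encoding for periodic values such as months. The following is a minimal sketch of that technique under that assumption; `cyclical_encode` is a hypothetical helper, not a function exported by the library.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Months on a 12-step cycle: December and January should end up close.
dec = cyclical_encode(12, 12)
jan = cyclical_encode(1, 12)
jun = cyclical_encode(6, 12)

print(math.dist(dec, jan))  # small: adjacent months
print(math.dist(dec, jun))  # large: opposite sides of the year
```

With a raw integer encoding, December (12) and January (1) look far apart; on the circle they are neighbours, which is the point of the transformation.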
Encoding
Encode categorical variables:
encoded_data = adp.encode(
    methods={
        'category1': 'onehot',
        'category2': 'label',
        'category3': 'target'
    },
    target_column='target'
)
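To illustrate why `target_column` is needed for the `'target'` method but not for `'onehot'`, here is a plain-Python sketch of both encodings. The helpers `one_hot` and `target_mean_encode` and the `is_` column prefix are hypothetical, and the library's actual output layout may differ.

```python
from collections import defaultdict

def one_hot(values):
    """One indicator column per category; needs no target."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def target_mean_encode(values, targets):
    """Replace each category with the mean target observed for it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        totals[v] += t
        counts[v] += 1
    means = {c: totals[c] / counts[c] for c in counts}
    return [means[v] for v in values]

colors = ["red", "blue", "red", "green"]
y = [1, 0, 0, 1]

print(one_hot(colors)[0])
print(target_mean_encode(colors, y))
```

This naive target encoding uses the same rows to compute and apply the means, so it leaks the target; production encoders typically add smoothing or out-of-fold fitting.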
Scaling and Normalization
Scale or normalize your data:
scaled_data = adp.scale(method='standard')
normalized_data = adp.normalize(method='l2')
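The two calls are easy to confuse. Under the common convention (assumed here, not confirmed for this library), standard scaling works per column and L2 normalization works per row. A minimal sketch with hypothetical helper names:

```python
import math
import statistics

def standard_scale(column):
    """Centre a column to mean 0 and scale it to unit (population) std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]

def l2_normalize(row):
    """Rescale one sample so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

heights = [150.0, 160.0, 170.0]   # one feature across three samples
scaled = standard_scale(heights)
unit = l2_normalize([3.0, 4.0])   # one sample across two features
print(scaled, unit)
```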
Dimensionality Reduction
Reduce the dimensionality of your data:
reduced_data = adp.dimreduction(method='pca', n_components=5)
Feature Selection
Select the most important features:
selected_data = adp.feature_selection(
    target_column='target',
    method='correlation',
    correlation_threshold=0.8
)
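The text does not say whether `correlation_threshold` applies to correlation between features or between each feature and the target. The sketch below assumes the first reading, a common one for a threshold as high as 0.8: keep a feature only if it is not too strongly correlated with one already kept. `pearson` and `drop_correlated` are hypothetical helpers.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def drop_correlated(features, threshold=0.8):
    """Greedily keep features whose |r| with every kept feature <= threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],   # near-duplicate of height_cm
    "age": [40.0, 22.0, 35.0, 28.0],
}
kept = drop_correlated(features)
print(kept)
```

The greedy pass depends on column order: the first of two redundant features is the one that survives.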
Handling Imbalanced Data
Balance your dataset:
balanced_data = adp.balance_data(
    target_column='target',
    method='smote',
    sampling_strategy='auto'
)
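SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that single step in plain Python; it is a simplified illustration, not the library's code, and `smote_sample` and its parameters are hypothetical.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the chosen base point (excluding itself)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(42)
synthetic = [smote_sample(minority, rng=rng) for _ in range(3)]
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.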
Time Series Preprocessing
Preprocess time series data:
preprocessed_ts_data = adp.time_series_preprocessing(
    time_column='date',
    freq='D',
    method='mean',
    detrend_columns=['value'],
    seasonality_columns=['value'],
    lag_columns=['value'],
    lags=[1, 7, 30]
)
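As an illustration of what `lag_columns` and `lags` typically produce, the sketch below builds lagged copies of a series in plain Python. The helper `lag` and the `value_lag_k` column names are hypothetical; the names the library actually generates are not documented here.

```python
def lag(values, k):
    """Shift a series forward by k steps; the first k slots have no history."""
    return [values[i - k] if i >= k else None for i in range(len(values))]

daily = [5, 6, 7, 8, 9]
lagged = {f"value_lag_{k}": lag(daily, k) for k in (1, 3)}
print(lagged)
```

Each lag of k leaves k leading rows without a value, so with `lags=[1, 7, 30]` the first 30 rows of a daily series would be incomplete and usually need to be dropped or filled.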
Data Anonymization
Anonymize sensitive data:
anonymized_data = adp.apply_anonymization(
    columns=['sensitive_column'],
    method='hash',
    hash_algorithm='sha256'
)
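Hash-based anonymization replaces each value with a fixed-length digest. The sketch below shows the idea with Python's standard `hashlib`; `hash_value` and its optional `salt` argument are hypothetical and not part of the library's documented interface.

```python
import hashlib

def hash_value(value, algorithm="sha256", salt=""):
    """Return the hex digest of a value, optionally prefixed with a salt."""
    digest = hashlib.new(algorithm)
    digest.update((salt + str(value)).encode("utf-8"))
    return digest.hexdigest()

emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
hashed = [hash_value(e) for e in emails]
print(hashed[0])
```

Equal inputs map to equal digests, so joins and group-bys still work on the hashed column. Unsalted hashes of low-entropy values such as phone numbers can be recovered by brute force, so a secret salt is advisable for such columns.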
Download files
File details
Details for the file AutoDataPreprocess-0.1.4.tar.gz.
File metadata
- Download URL: AutoDataPreprocess-0.1.4.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a6e148e22d6d578e49f922829d24a47663625918b129bde3b6111bb894545298 |
| MD5 | 25fd1714eaa970f5e4e40d7ee7a0ea0f |
| BLAKE2b-256 | f4c260f77d3b94aaf522b6b477bc3dd28814a5aed70e9db20fa3abc97018efb2 |
File details
Details for the file AutoDataPreprocess-0.1.4-py3-none-any.whl.
File metadata
- Download URL: AutoDataPreprocess-0.1.4-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 653e1b9be6c98f6a795934f1dbfd6374f54368aa3875a7840d11779fa3ab110f |
| MD5 | 583d734c10839b256761ff14ed17557d |
| BLAKE2b-256 | 9af5ca7c0af60034c9e738043b52aca258f8755817403802e8cd00781764856f |