A high-level library for automatic preprocessing of tabular data
AutoDataPreprocess
AutoDataPreprocess is a comprehensive Python library for automated data preprocessing. It provides a wide range of tools and techniques to clean, transform, and prepare data for machine learning models.
Features
- Data loading from various sources (CSV, JSON, Excel, HTML, XML, Pickle, SQL, API)
- Basic data analysis and visualization
- Data cleaning (handling missing values, outliers, duplicates)
- Feature engineering
- Encoding of categorical variables (one-hot, label, ordinal, target, WOE, James-Stein, CatBoost, binary)
- Scaling and normalization
- Dimensionality reduction
- Feature selection
- Handling imbalanced data
- Time series preprocessing
- Data anonymization
Installation
You can install AutoDataPreprocess using pip:

```shell
pip install autodatapreprocess
```
Quick Start
```python
from autodatapreprocess import AutoDataPreprocess

# Load data
adp = AutoDataPreprocess('your_data_file.csv')

# Perform basic analysis
adp.basic_analysis()

# Clean the data
cleaned_data = adp.clean(missing='mean', outliers='iqr')

# Perform feature engineering
engineered_data = adp.fe(target_column='target', polynomial_degree=2)

# Encode categorical variables
encoded_data = adp.encode(methods={'category_column': 'onehot'})

# Scale the data
scaled_data = adp.scale(method='standard')
```
Detailed Usage
Data Loading
Load data from various sources:
```python
# From CSV
adp = AutoDataPreprocess('data.csv')

# From SQL
adp = AutoDataPreprocess(sql_query="SELECT * FROM table", sql_connection_string="your_connection_string")

# From API
adp = AutoDataPreprocess(api_url="https://api.example.com/data", api_params={"key": "value"})
```
Data Cleaning
Clean your data with various options:
```python
cleaned_data = adp.clean(
    missing='mean',
    outliers='iqr',
    drop_threshold=0.7,
    date_format='%Y-%m-%d',
    remove_duplicates=True
)
```
Feature Engineering
Perform feature engineering:
```python
engineered_data = adp.fe(
    target_column='target',
    polynomial_degree=2,
    interaction_only=False,
    bin_numeric=True,
    num_bins=5,
    cyclical_features=['month', 'day_of_week'],
    text_columns=['description'],
    date_columns=['date']
)
```
Encoding
Encode categorical variables:
```python
encoded_data = adp.encode(
    methods={
        'category1': 'onehot',
        'category2': 'label',
        'category3': 'target'
    },
    target_column='target'
)
```
Scaling and Normalization
Scale or normalize your data:
```python
scaled_data = adp.scale(method='standard')
normalized_data = adp.normalize(method='l2')
```
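The two operations differ: standard scaling works per column (zero mean, unit variance), while L2 normalization works per row (unit Euclidean norm). A quick NumPy sketch of both conventions, as commonly defined:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Standard scaling: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# L2 normalization: each row rescaled to unit Euclidean norm
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)
```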
Dimensionality Reduction
Reduce the dimensionality of your data:
```python
reduced_data = adp.dimreduction(method='pca', n_components=5)
```
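Under `method='pca'`, the data is projected onto its top principal components. A minimal sketch of PCA via SVD of the centered data (the textbook formulation, not necessarily how this library computes it):

```python
import numpy as np

def pca(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project X onto its top principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X = np.random.default_rng(0).normal(size=(100, 5))
Z = pca(X, n_components=2)  # shape (100, 2)
```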
Feature Selection
Select the most important features:
```python
selected_data = adp.feature_selection(
    target_column='target',
    method='correlation',
    correlation_threshold=0.8
)
```
Handling Imbalanced Data
Balance your dataset:
```python
balanced_data = adp.balance_data(
    target_column='target',
    method='smote',
    sampling_strategy='auto'
)
```
Time Series Preprocessing
Preprocess time series data:
```python
preprocessed_ts_data = adp.time_series_preprocessing(
    time_column='date',
    freq='D',
    method='mean',
    detrend_columns=['value'],
    seasonality_columns=['value'],
    lag_columns=['value'],
    lags=[1, 7, 30]
)
```
Data Anonymization
Anonymize sensitive data:
```python
anonymized_data = adp.apply_anonymization(
    columns=['sensitive_column'],
    method='hash',
    hash_algorithm='sha256'
)
```
Hashes for AutoDataPreprocess-0.1.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | c149aff939dbbb2bc3b2c75434dd0465885b69a831d133ed638638e36d66f2e0 |
| MD5 | 17c01255a0e85c17ef37660a872e7bfb |
| BLAKE2b-256 | 85869c75332f0809a6511bbbebd1724e7b6f4fc63cc77280f9e4c9aeed1f2756 |