
veda_lib

A Python library designed to streamline the transition from raw data to machine learning models.
veda_lib automates and simplifies data preprocessing, cleaning, and balancing, addressing the time-consuming and complex aspects of these tasks to provide clean, ready-to-use data for your models.


Installation

First, install veda_lib using pip:

pip install veda_lib

How to use?

After installing veda_lib, import it into your project and start utilizing its modules to prepare your data. Below is a summary of the key functionalities provided by each module:

1. Preprocessor Module

  • Functions:

    • Removing null values
    • Handling duplicates
    • Imputing missing values with appropriate methods
  • Usage: Ideal for initial data cleaning and preprocessing steps.

  • Parameters:

    • keep (str/bool, default='first')
      How to keep duplicates. Options: ['first', 'last', False].

    • min_cat_percent (float, default=5)
      Convert a column to categorical if the percentage of unique values is below this threshold.

    • datalosspercent (float, default=10)
      Maximum acceptable % of data loss during cleaning.

    • min_var (float, default=0.04)
      Row deletion threshold. Columns with missing proportion > min_var are ignored.

    • min_col_threshold (float, default=0.65)
      Column deletion threshold. Drop columns with missing proportion > threshold.

    • var_diff (float, default=0.05)
      Maximum allowable variance change (numerical imputation).

    • mod_diff (float, default=0.05)
      Threshold for mode dominance (categorical imputation).

    • numerical_column (list/None, default=None)
      List of numerical column names (if not auto-detected).

    • categorical_column (list/None, default=None)
      List of categorical column names (if not auto-detected).

    • temporal_column (list/None, default=None)
      List of temporal column names (if any).

    • temporal_type (str, default='interpolate')
      Strategy for temporal imputation. Options: ['bfill', 'ffill', 'interpolate'].

    • n_neighbors (int, default=5)
      Number of neighbors for multivariate imputation (KNN-based).

    • label_encoding_type (str, default='onehot')
      Encoding strategy for categorical features. Options: ['onehot', 'labelencode'].


2. OutlierHandler Module

  • Functions:

    • Handling outliers by either removing or capping them
    • Customizable based on the nature of your data
  • Usage: Useful for managing data skewness and ensuring robust model performance.

  • Parameters:

    • tests (list, default=['skew-kurtosis']) Tests used to check whether the data follows a normal distribution. Options:

      • shapiro: Tests the null hypothesis that the data was drawn from a normal distribution.
      • skew-kurtosis: Skewness measures asymmetry in the data (a normal distribution has skewness of approximately 0); kurtosis measures "peakedness" (a normal distribution has kurtosis of approximately 3).
      • kstest: Compares the sample distribution with a theoretical normal distribution.
      • anderson: Checks how well the data fits a normal distribution, with more weight on the tails.
      • jarque-bera: Checks whether skewness and kurtosis match those of a normal distribution.
    • method (str, default='default') Outlier detection strategy. Options:

      • default: Adaptive pipeline (Dip Test + DBSCAN | Isolation Forest | LOF | Normal Rule).
      • isolation forest: Always uses Isolation Forest.
      • lof: Always uses Local Outlier Factor.
    • handle (str, default='capping') Strategy for handling detected outliers. Options:

      • capping: Replace values beyond the 3-sigma limits with the boundary values.
      • trimming: Drop rows containing outliers.
      • winsorization: Clip values at the limits.
    • minlen (int, default=5000) Minimum dataset size above which the Shapiro test is applied.

    • skew_thresh (float, default=1) Absolute skewness threshold. Values greater than this indicate non-normal distribution.

    • kurt_thresh (float, default=1) Absolute deviation from kurtosis=3 (normal distribution). Values greater than this indicate non-normal distribution.
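
The skew-kurtosis test and the capping strategy described above can be sketched in plain Python (an illustration of the documented thresholds, not veda_lib's actual code):

```python
import statistics

def skew_kurtosis(data):
    """Population skewness and kurtosis (normal data: skewness ~0, kurtosis ~3)."""
    n = len(data)
    mean = sum(data) / n
    std = statistics.pstdev(data)
    skew = sum((x - mean) ** 3 for x in data) / (n * std ** 3)
    kurt = sum((x - mean) ** 4 for x in data) / (n * std ** 4)
    return skew, kurt

def is_normal(data, skew_thresh=1.0, kurt_thresh=1.0):
    """Apply the documented skew_thresh / kurt_thresh cutoffs."""
    skew, kurt = skew_kurtosis(data)
    return abs(skew) <= skew_thresh and abs(kurt - 3) <= kurt_thresh

def cap_outliers(data):
    """'capping': replace values beyond mean +/- 3*std with the boundary value."""
    mean = sum(data) / len(data)
    std = statistics.pstdev(data)
    lo, hi = mean - 3 * std, mean + 3 * std
    return [min(max(x, lo), hi) for x in data]
```

For a roughly symmetric sample such as [1, 2, 2, 3, 3, 3, 4, 4, 5], is_normal returns True, and cap_outliers leaves in-range data untouched while pulling extreme values back to the 3-sigma boundary.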


3. FeatureSelector Module

  • Functions:

    • Selecting important features from the dataset
    • Tailored selection based on the nature of the data
  • Usage: Helps in reducing dimensionality and focusing on the most impactful features.

  • Parameters:

    • percentile (float, default=90) Percentile threshold (0–100) for selecting features most correlated with the target variable. Higher values select fewer features with stronger correlations.

    • threshold (float, default=0.9) Cumulative mutual information threshold (0–1) that determines the optimal number of features to select. A higher threshold selects more features.

    • cv (int, default=5) Number of cross-validation folds for selecting the best Lasso regularization strength (alpha). Must be a positive integer.
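
The cumulative mutual-information threshold can be illustrated with a small sketch (a hypothetical helper, not part of veda_lib; it assumes per-feature MI scores have already been computed, e.g. with sklearn's mutual_info_classif):

```python
def n_features_by_cumulative_mi(mi_scores, threshold=0.9):
    """Pick the smallest k whose top-k mutual-information scores
    account for at least `threshold` of the total MI."""
    scores = sorted(mi_scores, reverse=True)
    total = sum(scores)
    cumulative = 0.0
    for k, score in enumerate(scores, start=1):
        cumulative += score
        if cumulative / total >= threshold:
            return k
    return len(scores)

print(n_features_by_cumulative_mi([0.5, 0.3, 0.15, 0.05]))  # 3
```

With the default threshold=0.9, the two strongest features cover only 80% of the total MI, so a third is added; a higher threshold selects more features, as documented.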


4. DimensionReducer Module

  • Functions:

    • Reducing data dimensionality using appropriate techniques
  • Usage: Crucial for addressing the curse of dimensionality and improving model efficiency.

  • Parameters:

    • variance_threshold (float, default=0.95) Fraction of variance to preserve during PCA/autoencoder training.

    • prioritize_reproducibility (bool, default=True) Ensures deterministic results by fixing random seeds.

    • min_neighbors (int, default=5) Minimum number of neighbors; controls local structure preservation.

    • max_neighbors (int, default=50) Maximum number of neighbors; prevents over-smoothing of high-dimensional manifolds.

    • min_dim (int, default=10) Minimum encoding dimension for Autoencoders.

    • max_dim (int, default=100) Maximum encoding dimension for Autoencoders.

    • hidden_layers (int, default=1) Number of hidden layers in Autoencoder.

    • optimizer (str, default='adam') Optimizer used for training Autoencoders.

    • loss (str, default='mean_squared_error') Loss function for Autoencoder reconstruction.

    • min_epochs (int, default=20) Minimum number of epochs for Autoencoder training.

    • max_epochs (int, default=100) Maximum epochs allowed for training Autoencoders.

    • min_batch_size (int, default=32) Smallest batch size for Autoencoder training.

    • max_batch_size (int, default=256) Largest batch size allowed for Autoencoder training.
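
How variance_threshold might translate into a component count can be sketched from explained-variance ratios (illustrative only; veda_lib's actual selection logic may differ, though this mirrors how sklearn's PCA interprets a float n_components):

```python
def n_components_for_variance(explained_ratios, variance_threshold=0.95):
    """Smallest number of leading components whose explained-variance
    ratios sum to at least `variance_threshold`."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_ratios, start=1):
        cumulative += ratio
        if cumulative >= variance_threshold:
            return k
    return len(explained_ratios)

print(n_components_for_variance([0.7, 0.2, 0.06, 0.04]))  # 3
```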


5. BalanceData Module

  • Functions:

    • Balancing class distribution in imbalanced datasets
    • Methods chosen based on data characteristics
  • Usage: Essential for improving model fairness and performance on imbalanced datasets.

  • Parameters:

    • threshold (float, default=0.5) Minimum acceptable ratio of minority to majority class. If the imbalance ratio is greater than or equal to this threshold, no resampling is performed.

    • classification (bool, default=None) Whether the task is classification. Options: [True, False]
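
The threshold check can be sketched as follows (an illustration of the documented rule, not veda_lib's code):

```python
from collections import Counter

def needs_resampling(y, threshold=0.5):
    """Return True when the minority/majority class ratio falls below
    `threshold`, i.e. when resampling would be triggered."""
    counts = Counter(y)
    ratio = min(counts.values()) / max(counts.values())
    return ratio < threshold

print(needs_resampling([0] * 90 + [1] * 10))  # True  (ratio ~0.11)
print(needs_resampling([0] * 60 + [1] * 40))  # False (ratio ~0.67)
```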


6. Veda Module

  • Functions:

    • Integrates all the above functionalities into a single pipeline
  • Usage: Pass your raw data through this module to perform comprehensive EDA and get fully preprocessed, cleaned, and balanced data ready for model training.

  • Parameters:

    • classification (bool, default=None) Whether the task is classification. Options: [True, False]

Importing

  • Here is an example of importing Veda from veda_lib.Veda. Set classification to True if the problem is classification, otherwise set it to False.
from veda_lib import Veda
eda = Veda.Veda(classification=True)
X, y, outliers, strategy, model = eda.fit_transform(X, y)
  • Returns:

    • X - Transformed feature set after complete processing.
    • y - Transformed target variable.
    • outliers - Detected outliers from the data.
    • strategy - Automatically selected balancing strategy ("none", "oversample", "combine", "anomaly", "ensemble").
    • model - The fitted balancing model/sampler (e.g., SMOTE, IsolationForest, RandomForestClassifier), or None if not applicable.
  • Here is an example of importing DataPreprocessor from veda_lib.Preprocessor, using the default parameter values.

from veda_lib import Preprocessor
preprocessor = Preprocessor.DataPreprocessor()
X, y = preprocessor.fit_transform(X, y)
  • Returns:

    • X - Transformed feature set after preprocessing.
    • y - Transformed target variable.
  • Here is an example of importing OutlierPreprocessor from veda_lib.OutlierHandler, using the default parameter values.

from veda_lib import OutlierHandler
outlier_preprocessor = OutlierHandler.OutlierPreprocessor()
X, y, outliers = outlier_preprocessor.fit_transform(X, y)
  • Returns:

    • X - Transformed feature set after handling outliers.
    • y - Transformed target variable.
    • outliers - Detected outliers from the data.
  • Here is an example of importing FeatureSelection from veda_lib.FeatureSelector, using the default parameter values.

from veda_lib import FeatureSelector
selector = FeatureSelector.FeatureSelection()
X, y = selector.fit_transform(X, y)
  • Returns:

    • X - Transformed feature set after feature selection.
    • y - Transformed target variable.
  • Here is an example of importing DimensionReducer from veda_lib.DimensionReducer, using the default parameter values.

from veda_lib import DimensionReducer
reducer = DimensionReducer.DimensionReducer()
X, y = reducer.fit_transform(X, y)
  • Returns:

    • X - Transformed feature set after reducing dimensions.
    • y - Transformed target variable.
  • Here is an example of importing AdaptiveBalancer from veda_lib.BalanceData, using the default parameter values.

from veda_lib import BalanceData
balancer = BalanceData.AdaptiveBalancer(classification=True)
X, y, strategy, model = balancer.fit_transform(X, y)
  • Returns:
    • X - Transformed feature set after balancing.
    • y - Transformed target variable.
    • strategy - Automatically selected balancing strategy ("none", "oversample", "combine", "anomaly", "ensemble").
    • model - The fitted balancing model/sampler (e.g., SMOTE, IsolationForest, RandomForestClassifier), or None if not applicable.

Contributing

I welcome contributions to veda_lib! If you have a bug report, feature suggestion, or want to contribute code, please open an issue or pull request on GitHub.


License

veda_lib is licensed under the Apache License Version 2.0. See the LICENSE file for more details.
