A comprehensive data processing library.

Data Tools Package

A comprehensive library for data preprocessing in AI development, focusing on scalability, usability, and modular design.

Features

  • Data Loading: Efficiently load datasets in various formats.
  • Data Cleaning: Handle missing values, outliers, and duplicates.
  • Feature Engineering: Create new features using advanced techniques.
  • Categorical Processing: One-hot and label encoding for categorical variables.
  • Scaling: Normalize and standardize numerical features.
  • Outlier Handling: Detect and remove outliers using the interquartile range (IQR).
  • Text Processing: Clean, tokenize, and vectorize text data.
  • Time Series Processing: Create time-based features and resample data.
  • Image Processing: Load, resize, normalize, and convert images.
  • Image Augmentation: Apply transformations to increase the diversity of your training dataset.
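
To make two of the features above concrete, here is a minimal standard-library sketch of IQR-based outlier removal and min-max normalization. It is not the package's own implementation: the function names `remove_outliers_iqr` and `min_max_scale`, the `k=1.5` fence multiplier, and the sample data are assumptions chosen for illustration.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    if len(values) < 2:
        # Too few points to estimate quartiles; return the data unchanged.
        return list(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


readings = [10, 12, 11, 13, 12, 95]        # illustrative data, 95 is the outlier
filtered = remove_outliers_iqr(readings)   # 95 falls outside the fences and is dropped
scaled = min_max_scale(filtered)           # smallest value maps to 0.0, largest to 1.0
```

Removing outliers before scaling matters here: a single extreme value such as 95 would otherwise compress every other reading into a narrow band near 0.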

Usage

from dataprocessor import DataLoader, DataCleaner, FeatureEngineer, ImageProcessor, ImageAugmenter

# Example usage of the package
loader = DataLoader()
data = loader.load_csv("data.csv")

cleaner = DataCleaner()
cleaned_data = cleaner.clean(data)

# Image processing example
image = ImageProcessor.load_image("path/to/image.jpg")
resized_image = ImageProcessor.resize_image(image, (224, 224))
normalized_image = ImageProcessor.normalize_image(resized_image)

# Image augmentation example
augmented_image = ImageAugmenter.augment_image(normalized_image)

Testing

poetry run pytest

TODO: restructure

package/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── cd.yml
├── src/
│   └── dataprocessor/
│       ├── __init__.py
│       ├── loaders/                   # Data loading modules
│       │   └── data_loader.py         # Load various data formats (CSV, JSON, etc.)
│       ├── cleaners/                  # Data cleaning modules
│       │   ├── data_cleaner.py        # Clean and preprocess data
│       │   ├── outlier_handler.py      # Outlier detection and handling
│       │   ├── scaling.py             # Scaling/normalization techniques
│       │   └── categorical_processor.py # Handling categorical data
│       ├── transformers/               # Data transformation modules
│       │   ├── feature_engineer.py      # Feature engineering tools
│       │   ├── text_processor.py         # Text data processing (tokenization, cleaning)
│       │   ├── time_series_processor.py  # Time series specific tools (windowing, etc.)
│       │   ├── image_processor.py        # Image preprocessing (resizing, normalization)
│       │   └── image_augmenter.py        # Data augmentation techniques for images
│       ├── evaluators/                  # Evaluation modules
│       │   └── evaluator.py             # Evaluation metrics and tools
│       ├── visualizers/                 # Visualization modules
│       │   └── visualizer.py            # Visualization tools (plots, charts)
│       ├── pipelines/                   # Pipeline modules
│       │   ├── pipeline.py               # Pipelines for chaining transformations
│       │   └── config.py                 # Configuration management for reproducibility
│       └── utils.py                     # Utility functions (logging, file handling)
├── tests/
│   ├── test_loaders/
│   │   └── test_data_loader.py
│   ├── test_cleaners/
│   │   ├── test_data_cleaner.py
│   │   ├── test_outlier_handler.py
│   │   ├── test_scaling.py
│   │   └── test_categorical_processor.py
│   ├── test_transformers/
│   │   ├── test_feature_engineer.py
│   │   ├── test_text_processor.py
│   │   ├── test_time_series_processor.py
│   │   ├── test_image_processor.py
│   │   └── test_image_augmenter.py
│   ├── test_evaluators/
│   │   └── test_evaluator.py
│   ├── test_visualizers/
│   │   └── test_visualizer.py
│   ├── test_pipelines/
│   │   └── test_pipeline.py
│   ├── test_audio_processor.py
│   └── test_tabular_processor.py
├── README.md
├── CONTRIBUTING.md                  # Guidelines for contributing to the package
├── CHANGELOG.md                     # Changelog for tracking updates and changes
├── examples/                        # Directory for example notebooks or scripts
│   ├── example_data_loading.py
│   ├── example_feature_engineering.py
│   └── example_visualization.py
├── requirements.txt                 # List of dependencies for the package
└── pyproject.toml
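
The layout above lists `pipelines/pipeline.py` for chaining transformations. As a rough idea of what such chaining could look like, here is a small sketch. The `Pipeline` class, its `run` method, and the two step functions are hypothetical and are not taken from the package.

```python
from typing import Any, Callable, Iterable, List

# A step is any callable that takes a list and returns a new list.
Step = Callable[[List[Any]], List[Any]]


class Pipeline:
    """Apply a sequence of steps in order (hypothetical sketch)."""

    def __init__(self, steps: Iterable[Step]):
        self.steps = list(steps)

    def run(self, data: List[Any]) -> List[Any]:
        for step in self.steps:
            data = step(data)
        return data


def drop_missing(values: List[Any]) -> List[Any]:
    """Remove None entries."""
    return [v for v in values if v is not None]


def drop_duplicates(values: List[Any]) -> List[Any]:
    """Remove repeated entries while keeping first-seen order."""
    seen = set()
    unique = []
    for v in values:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique


pipeline = Pipeline([drop_missing, drop_duplicates])
result = pipeline.run([3, None, 1, 3, 2, None, 1])  # [3, 1, 2]
```

Keeping each step a plain function makes it easy to test steps in isolation and to reorder them without touching the pipeline itself.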

Download files

Source Distribution

dataprocessor_vb-0.1.0.tar.gz (7.3 kB)

Uploaded Source

Built Distribution

dataprocessor_vb-0.1.0-py3-none-any.whl (9.2 kB)

Uploaded Python 3

File details

Details for the file dataprocessor_vb-0.1.0.tar.gz.

File metadata

  • Download URL: dataprocessor_vb-0.1.0.tar.gz
  • Upload date:
  • Size: 7.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.13.0 Darwin/23.3.0

File hashes

Hashes for dataprocessor_vb-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7c5bb9579b8f3c6fccc1afcb2e5499585966cbda703655eb4183b63b44699236
MD5 58de2d0fa4179c243c6290cebc46de04
BLAKE2b-256 ae1b6ead9b38328d11fffe33a8626939faeea8aab341e96699ebd890f65bfcc2
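
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. The following is a minimal sketch: the helper names `sha256_of_file` and `matches_published_hash` are assumptions, and the constant simply copies the published SHA256 value for the source distribution.

```python
import hashlib

# SHA256 digest published above for dataprocessor_vb-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "7c5bb9579b8f3c6fccc1afcb2e5499585966cbda703655eb4183b63b44699236"
)


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("dataprocessor_vb-0.1.0.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` for an intact download saved under that name in the current directory.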

File details

Details for the file dataprocessor_vb-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dataprocessor_vb-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.13.0 Darwin/23.3.0

File hashes

Hashes for dataprocessor_vb-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2aa8ceb1e581cc95b4c3aef867f218b213cbdea037dc47434275bca89183577e
MD5 189a20e6e0c46f79570a959858b41e8a
BLAKE2b-256 92b84940751f273e9c6f6415460d66cca4cb8e32b64790eaa76af2a0d2e227cd
