Skip to main content

Generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks

Project description

Synthetic Data Generation for Tabular, Classification, and Time-Series Labels

This repository contains a Python-based framework for generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks. It is designed to help researchers, data scientists, and machine learning engineers create high-quality, realistic datasets for training and evaluating their models while ensuring privacy and compliance with data protection regulations.

Features

  1. Tabular Data Generation: Easily generate synthetic tabular datasets with customizable column types, distribution patterns, and correlations between variables.

  2. Classification Data Generation: Create datasets for binary or multi-class classification tasks, controlling class imbalance and feature importance.

  3. Time-Series Data Generation: Generate synthetic time-series datasets with user-defined seasonality, trend, and noise components.

  4. Data Privacy: Ensure data privacy by using differential privacy techniques and limiting the degree of similarity between the original and synthetic datasets.

  5. Flexible and Extensible: The framework is designed to be easily extended and adapted to a wide range of data generation tasks, with support for custom data generation modules and integration with other data generation tools.

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/syntheticdataset/synthetic-dataset.git

cd synthetic-dataset

pip install -r requirements.txt

Usage

Refer to the provided examples and documentation for guidance on how to generate synthetic datasets for your specific use case.

from synthetic_data import TabularDataGenerator, ClassificationDataGenerator, TimeSeriesDataGenerator

# Tabular data generation

tabular_gen = TabularDataGenerator(num_rows=1000)

tabular_data = tabular_gen.generate()



# Classification data generation

classification_gen = ClassificationDataGenerator(num_samples=1000, num_classes=3)

classification_data, labels = classification_gen.generate()



# Time-series data generation

time_series_gen = TimeSeriesDataGenerator(num_samples=1000, seasonal_period=12)

time_series_data = time_series_gen.generate()

Contributing

Please read the CONTRIBUTING.md file for details on how to contribute to the project. We welcome pull requests, bug reports, and feature requests.

License

This project is licensed under the MIT License - Licence see the file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

synthetic-dataset-0.0.0.2.tar.gz (3.5 kB view hashes)

Uploaded Source

Built Distribution

synthetic_dataset-0.0.0.2-py3-none-any.whl (3.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page