Generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks

Synthetic Data Generation for Tabular, Classification, and Time-Series Labels

This repository contains a Python-based framework for generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks. It is designed to help researchers, data scientists, and machine learning engineers create high-quality, realistic datasets for training and evaluating their models while ensuring privacy and compliance with data protection regulations.

Features

  1. Tabular Data Generation: Easily generate synthetic tabular datasets with customizable column types, distribution patterns, and correlations between variables.

  2. Classification Data Generation: Create datasets for binary or multi-class classification tasks, controlling class imbalance and feature importance.

  3. Time-Series Data Generation: Generate synthetic time-series datasets with user-defined seasonality, trend, and noise components.

  4. Data Privacy: Ensure data privacy by using differential privacy techniques and limiting the degree of similarity between the original and synthetic datasets.

  5. Flexible and Extensible: The framework is designed to be easily extended and adapted to a wide range of data generation tasks, with support for custom data generation modules and integration with other data generation tools.
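The differential privacy feature above can be illustrated with the classic Laplace mechanism. The sketch below is not taken from this framework's code: the names `laplace_noise` and `private_mean`, and the choice to release a clipped mean, are assumptions made purely for illustration, and it uses only the Python standard library.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise with the given scale.

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Hypothetical helper: epsilon-DP mean of values clipped to [lower, upper].

    Assumes the number of records is public and that neighbouring
    datasets differ by replacing a single record.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record changes the clipped mean by at most this amount.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 29, 52, 38]
noisy_mean = private_mean(ages, lower=18, upper=90, epsilon=1.0, rng=rng)
```

A smaller `epsilon` adds more noise and gives stronger privacy; clipping bounds the influence of any single record so that the sensitivity is finite.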

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/syntheticdataset/synthetic-dataset.git

cd synthetic-dataset

pip install -r requirements.txt

Usage

Import the generator for your task, configure it through its constructor arguments, and call generate(). Refer to the provided examples and documentation for guidance on your specific use case.

from synthetic_data import TabularDataGenerator, ClassificationDataGenerator, TimeSeriesDataGenerator

# Tabular data generation
tabular_gen = TabularDataGenerator(num_rows=1000)
tabular_data = tabular_gen.generate()

# Classification data generation
classification_gen = ClassificationDataGenerator(num_samples=1000, num_classes=3)
classification_data, labels = classification_gen.generate()

# Time-series data generation
time_series_gen = TimeSeriesDataGenerator(num_samples=1000, seasonal_period=12)
time_series_data = time_series_gen.generate()
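
The internals of TimeSeriesDataGenerator are not shown here. As a rough illustration of how a series with trend, seasonality, and noise components can be assembled, the following standard-library sketch sums the three. The function name `synthetic_series` and its `slope`, `amplitude`, `noise_std`, and `seed` parameters are assumptions for illustration, not part of the package's API.

```python
import math
import random


def synthetic_series(num_samples, seasonal_period, slope=0.05,
                     amplitude=1.0, noise_std=0.2, seed=0):
    """Hypothetical sketch: additive trend + seasonality + Gaussian noise."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    series = []
    for t in range(num_samples):
        trend = slope * t
        # One full sine cycle every `seasonal_period` steps.
        seasonality = amplitude * math.sin(2.0 * math.pi * t / seasonal_period)
        noise = rng.gauss(0.0, noise_std)
        series.append(trend + seasonality + noise)
    return series


# Mirrors the arguments used in the usage snippet: 1000 points, period 12.
series = synthetic_series(num_samples=1000, seasonal_period=12)
```

An additive model is only one option; a multiplicative form (trend times seasonality) may suit data whose seasonal swings grow with the level of the series.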

Contributing

Please read the CONTRIBUTING.md file for details on how to contribute to the project. We welcome pull requests, bug reports, and feature requests.

License

This project is licensed under the MIT License; see the license file for details.

Download files

Source Distribution

synthetic-dataset-0.0.0.1.tar.gz (3.4 kB)

Uploaded Source

Built Distribution

synthetic_dataset-0.0.0.1-py3-none-any.whl (3.2 kB)

Uploaded Python 3

File details

Details for the file synthetic-dataset-0.0.0.1.tar.gz.

File metadata

  • Download URL: synthetic-dataset-0.0.0.1.tar.gz
  • Upload date:
  • Size: 3.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.13

File hashes

Hashes for synthetic-dataset-0.0.0.1.tar.gz:

  • SHA256: 83c98b15132320a3e6699fec6ae1930ecc27c58e3f286ccde2fb0496eef60662
  • MD5: bc40ec985d341ec4f1a94f47498254f1
  • BLAKE2b-256: 1059370d0cbe244482733a6d2038c5ba440b3893ca7d5c5429a6c2f6a2bb7e0f

File details

Details for the file synthetic_dataset-0.0.0.1-py3-none-any.whl.

File hashes

Hashes for synthetic_dataset-0.0.0.1-py3-none-any.whl:

  • SHA256: 9300d380488242362ad41321740fb3189a5de7ab3b6860b2840a8f08f573e9cd
  • MD5: 0e29be21780234e5a583900171c37ee9
  • BLAKE2b-256: b6a527d09b8f81165f10598145fbf64895882ec4345dcfda3db2d28678a0af10
