Generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks

Project description

Synthetic Data Generation for Tabular, Classification, and Time-Series Labels

This repository contains a Python-based framework for generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks. It is designed to help researchers, data scientists, and machine learning engineers create high-quality, realistic datasets for training and evaluating their models while ensuring privacy and compliance with data protection regulations.

Features

  1. Tabular Data Generation: Easily generate synthetic tabular datasets with customizable column types, distribution patterns, and correlations between variables.

  2. Classification Data Generation: Create datasets for binary or multi-class classification tasks, controlling class imbalance and feature importance.

  3. Time-Series Data Generation: Generate synthetic time-series datasets with user-defined seasonality, trend, and noise components.

  4. Data Privacy: Ensure data privacy by using differential privacy techniques and limiting the degree of similarity between the original and synthetic datasets.

  5. Flexible and Extensible: The framework is designed to be easily extended and adapted to a wide range of data generation tasks, with support for custom data generation modules and integration with other data generation tools.
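The README does not say which differential privacy technique the framework uses. As one illustration of the idea behind feature 4, the sketch below applies the Laplace mechanism to a bounded mean using only NumPy. The function name `dp_mean` and all of its parameters are hypothetical and are not part of this package's API.

```python
import numpy as np


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Illustrative epsilon-DP mean via the Laplace mechanism.

    Hypothetical helper, not taken from the synthetic-dataset package.
    Assumes the number of records is public and fixed.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        raise ValueError("values must not be empty")
    rng = np.random.default_rng() if rng is None else rng

    # Clip every record into [lower, upper] so one record has bounded influence.
    clipped = np.clip(values, lower, upper)

    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / clipped.size

    # Laplace noise with scale sensitivity / epsilon gives epsilon-DP for this query.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)


ages = [23, 35, 41, 29, 52, 38]
private_mean = dp_mean(ages, lower=18, upper=90, epsilon=1.0)
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger one stays closer to the true mean.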

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/syntheticdataset/synthetic-dataset.git

cd synthetic-dataset

pip install -r requirements.txt

Usage

The example below shows the basic usage of each generator. Refer to the provided examples and documentation for guidance on your specific use case.

# Import the three generator classes
from synthetic_data import TabularDataGenerator, ClassificationDataGenerator, TimeSeriesDataGenerator

# Tabular data generation: 1,000 rows
tabular_gen = TabularDataGenerator(num_rows=1000)
tabular_data = tabular_gen.generate()

# Classification data generation: 1,000 samples across 3 classes;
# generate() returns the features together with their labels
classification_gen = ClassificationDataGenerator(num_samples=1000, num_classes=3)
classification_data, labels = classification_gen.generate()

# Time-series data generation: 1,000 samples with a seasonal period of 12
time_series_gen = TimeSeriesDataGenerator(num_samples=1000, seasonal_period=12)
time_series_data = time_series_gen.generate()
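The README does not describe how the time-series generator combines its components. For intuition, here is a minimal standalone NumPy sketch of an additive series built from trend, seasonality, and noise. It reuses the `num_samples` and `seasonal_period` names from the example above, while `make_series`, `trend_slope`, `amplitude`, `noise_std`, and `seed` are hypothetical names chosen for illustration and may not match the package's implementation.

```python
import numpy as np


def make_series(num_samples=1000, seasonal_period=12, trend_slope=0.05,
                amplitude=2.0, noise_std=0.5, seed=0):
    """Additive synthetic series: linear trend + sinusoidal seasonality + Gaussian noise.

    Hypothetical helper, not taken from the synthetic-dataset package.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_samples)

    # Linear trend that grows by trend_slope per time step.
    trend = trend_slope * t

    # Sinusoid that repeats every seasonal_period steps.
    seasonality = amplitude * np.sin(2 * np.pi * t / seasonal_period)

    # Independent Gaussian noise for each time step.
    noise = rng.normal(0.0, noise_std, size=num_samples)

    return trend + seasonality + noise


series = make_series()
```

Setting `trend_slope` and `noise_std` to zero leaves only the seasonal component, which makes the period easy to check by eye.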

Contributing

Please read the CONTRIBUTING.md file for details on how to contribute to the project. We welcome pull requests, bug reports, and feature requests.

License

This project is licensed under the MIT License. See the Licence file for details.

Download files

Source Distribution

synthetic-dataset-0.0.0.2.tar.gz (3.5 kB)

Uploaded Source

Built Distribution

synthetic_dataset-0.0.0.2-py3-none-any.whl (3.3 kB)

Uploaded Python 3

File details

Details for the file synthetic-dataset-0.0.0.2.tar.gz.

File metadata

  • Download URL: synthetic-dataset-0.0.0.2.tar.gz
  • Upload date:
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.13

File hashes

Hashes for synthetic-dataset-0.0.0.2.tar.gz
Algorithm    Hash digest
SHA256       41b8ab040623c3b440fc518275a1260c82e1282c172d0603e044a4c910b3125d
MD5          6716b0b014950fce8ad53fdad1dd9898
BLAKE2b-256  26ea2f021b6a2a16c960aece62899bfc33fc19fe2516d07d3ad88f6cfa4bbc27

File details

Details for the file synthetic_dataset-0.0.0.2-py3-none-any.whl.

File hashes

Hashes for synthetic_dataset-0.0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       ea0bfa4cd8b0039e0b78c70e24e7e9fa053eabb7125ece98cb1851df587bfbc0
MD5          57de5825ef26283b781b9e474fdb5428
BLAKE2b-256  2e324614b7ca4899ff2a5ab1bf39f04bb0344654d0dbba14ba72d16a124ff8fc
