Generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks
Synthetic Data Generation for Tabular, Classification, and Time-Series Labels
This repository contains a Python-based framework for generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks. It is designed to help researchers, data scientists, and machine learning engineers create high-quality, realistic datasets for training and evaluating their models while ensuring privacy and compliance with data protection regulations.
Features
- Tabular Data Generation: Easily generate synthetic tabular datasets with customizable column types, distribution patterns, and correlations between variables.
- Classification Data Generation: Create datasets for binary or multi-class classification tasks, controlling class imbalance and feature importance.
- Time-Series Data Generation: Generate synthetic time-series datasets with user-defined seasonality, trend, and noise components.
- Data Privacy: Ensure data privacy by using differential privacy techniques and limiting the degree of similarity between the original and synthetic datasets.
- Flexible and Extensible: The framework is designed to be easily extended and adapted to a wide range of data generation tasks, with support for custom data generation modules and integration with other data generation tools.
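To make the time-series feature more concrete, here is a minimal standalone sketch of an additive series built from a linear trend, a sinusoidal seasonal term, and Gaussian noise. It uses only the Python standard library and does not call this package; the function name `make_series` and all of its parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def make_series(n, period=12, slope=0.05, amplitude=1.0, noise_std=0.1, seed=0):
    """Return n points of an additive series: trend + seasonality + noise.

    Hypothetical helper for illustration; not part of the package API.
    """
    rng = random.Random(seed)  # local RNG so repeated calls are reproducible
    series = []
    for t in range(n):
        trend = slope * t                                       # linear trend
        seasonal = amplitude * math.sin(2 * math.pi * t / period)  # repeats every `period` steps
        noise = rng.gauss(0.0, noise_std)                       # zero-mean Gaussian noise
        series.append(trend + seasonal + noise)
    return series


# Four full seasonal cycles of length 12
series = make_series(n=48, period=12)
```

Setting `noise_std=0.0` leaves only the deterministic trend and seasonal parts, which is a convenient way to check each component separately.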
Installation
Clone the repository and install the required dependencies:
git clone https://github.com/syntheticdataset/synthetic-dataset.git
cd synthetic-dataset
pip install -r requirements.txt
Usage
Refer to the provided examples and documentation for guidance on how to generate synthetic datasets for your specific use case.
from synthetic_data import TabularDataGenerator, ClassificationDataGenerator, TimeSeriesDataGenerator
# Tabular data generation
tabular_gen = TabularDataGenerator(num_rows=1000)
tabular_data = tabular_gen.generate()
# Classification data generation
classification_gen = ClassificationDataGenerator(num_samples=1000, num_classes=3)
classification_data, labels = classification_gen.generate()
# Time-series data generation
time_series_gen = TimeSeriesDataGenerator(num_samples=1000, seasonal_period=12)
time_series_data = time_series_gen.generate()
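The snippet above shows only the number of samples and classes. As an independent illustration of what controlling class imbalance can look like, the following sketch draws labels with user-chosen class weights and samples Gaussian features around a per-class center. It relies only on the standard library; `make_imbalanced`, `class_weights`, and the choice of class centers are assumptions made for this example and are not taken from the package.

```python
import random


def make_imbalanced(n, class_weights, n_features=2, spread=1.0, seed=0):
    """Return (features, labels) with labels drawn according to class_weights.

    Hypothetical helper for illustration; not part of the package API.
    """
    rng = random.Random(seed)
    classes = list(range(len(class_weights)))
    # Draw each label independently with the requested class probabilities
    labels = rng.choices(classes, weights=class_weights, k=n)
    # Class k is centered at 3*k in every feature dimension (arbitrary choice)
    features = [
        [rng.gauss(3.0 * label, spread) for _ in range(n_features)]
        for label in labels
    ]
    return features, labels


# Three classes with roughly a 70/20/10 split
X, y = make_imbalanced(1000, class_weights=[0.7, 0.2, 0.1])
```

Because labels are sampled rather than assigned in fixed counts, the realized class proportions only approximate the requested weights.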
Contributing
Please read the CONTRIBUTING.md file for details on how to contribute to the project. We welcome pull requests, bug reports, and feature requests.
License
This project is licensed under the MIT License; see the license file in the repository for details.
File details
Details for the file synthetic-dataset-0.0.0.1.tar.gz.
File metadata
- Download URL: synthetic-dataset-0.0.0.1.tar.gz
- Upload date:
- Size: 3.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 83c98b15132320a3e6699fec6ae1930ecc27c58e3f286ccde2fb0496eef60662
MD5 | bc40ec985d341ec4f1a94f47498254f1
BLAKE2b-256 | 1059370d0cbe244482733a6d2038c5ba440b3893ca7d5c5429a6c2f6a2bb7e0f
File details
Details for the file synthetic_dataset-0.0.0.1-py3-none-any.whl.
File metadata
- Download URL: synthetic_dataset-0.0.0.1-py3-none-any.whl
- Upload date:
- Size: 3.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9300d380488242362ad41321740fb3189a5de7ab3b6860b2840a8f08f573e9cd
MD5 | 0e29be21780234e5a583900171c37ee9
BLAKE2b-256 | b6a527d09b8f81165f10598145fbf64895882ec4345dcfda3db2d28678a0af10
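A downloaded file can be checked against the SHA256 digests listed above. The sketch below uses `hashlib` from the standard library; the helper names `sha256_of` and `matches` are hypothetical, and the path passed to them is whatever location the file was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns b"" at end of file
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA-256 digest equals `expected_hex`."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("synthetic-dataset-0.0.0.1.tar.gz", "<SHA256 from the table>")` should return `True` for an unmodified copy of the source distribution.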