A package for generating synthetic datasets

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Data Creator Class - README

The DataCreator class is a Python utility that generates synthetic datasets with different distributions and correlation structures for use cases such as data analysis, testing, and prototyping machine learning models. It can create both numerical and categorical features and offers options for adding biases, missing values, outliers, and noise to the dataset. Please read the whole README.md before using the package.

Package name: data-creator, version: 0.1.0

Nice to know:

The DataCreator creates datasets based on the strength and correlation between features and between features and the target variable. The values of level 0 features are drawn from distributions (the available distribution types are explained later in this README.md). Features of level 1 or higher are derived through correlations. Please keep in mind that the last variable should be defined as the target variable. Fig. 1 illustrates features with different strengths (variable network).
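The level idea can be sketched with plain NumPy. This is only an illustration under assumptions, not DataCreator's own implementation: the feature names (`age`, `income`, `approved`) and all coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1000

# Level 0: values drawn directly from a distribution (hypothetical feature "age").
age = rng.normal(loc=40.0, scale=10.0, size=n)

# Level 1: derived from a level 0 feature through a linear relationship plus noise.
income = 1200.0 * age + rng.normal(loc=0.0, scale=5000.0, size=n)

# Target (last variable): derived from the level 1 feature.
approved = (income > np.median(income)).astype(int)

# Pearson correlation between the level 0 and the level 1 feature.
corr_age_income = np.corrcoef(age, income)[0, 1]
print(corr_age_income)
```

A smaller noise scale relative to the coefficient yields a stronger correlation between the two features.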

Requirements

Before using the DataCreator class, ensure you have the following prerequisites:

Python (>=3.6)

Pandas (>=1.0.0)

NumPy (>=1.18.0)

To make using the DataCreator easier and faster, decide on the following points before you start:

define the number of features you want to use
define the strength between the features and the target variable
define the correlation between the features and the target variable
define the distribution type that the level 0 features should have

Getting Started

Install the DataCreator class with pip: pip install data-creator

Import the required libraries and the DataCreator class:

	import numpy as np
	import pandas as pd
	from data_creator import DataCreator

Create an instance of the DataCreator class with the desired parameters:

An example for reference:

	samples = 1000 # Number of samples in the dataset
	num_feat = 5 # Number of features in the dataset
	biased = True # Add biases to specific features (e.g., gender, race)
	missing_values = True # Add missing values to the dataset
	outliers = True # Add outliers to the dataset
	noise = True # Add noise to the dataset
	topic = "loan" # The topic or name of the dataset

	generator = DataCreator(samples, num_feat, biased, missing_values, outliers, noise, topic)
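This README does not describe how the missing_values, outliers, and noise options work internally. As a rough illustration of what such post-processing can look like, here is a pandas/NumPy sketch. It is not the package's implementation; the column name `amount` and all magnitudes are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"amount": rng.normal(1000.0, 100.0, size=200)})

# Noise: add small Gaussian jitter to every value.
df["amount"] += rng.normal(0.0, 5.0, size=len(df))

# Outliers: scale a few randomly chosen rows far away from the bulk of the data.
outlier_idx = rng.choice(df.index, size=5, replace=False)
df.loc[outlier_idx, "amount"] *= 10

# Missing values: blank out a random subset that does not overlap the outliers.
candidates = df.index.difference(outlier_idx)
missing_idx = rng.choice(candidates, size=10, replace=False)
df.loc[missing_idx, "amount"] = np.nan

n_missing = int(df["amount"].isna().sum())
n_outliers = int((df["amount"] > 5000).sum())
print(n_missing, n_outliers)
```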

Access the generated synthetic dataset

Calling the method generate_Data() creates the synthetic dataset and returns it; in the example below, it is stored in the biasedData variable.

The dataset is in pandas DataFrame format, so you can use standard DataFrame operations to explore and analyze the data.

	biasedData = generator.generate_Data()
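Because generate_Data() asks for input interactively, the following sketch uses a small hand-built DataFrame as a stand-in for its return value. The column names and values are hypothetical; only the pandas calls are standard.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by generate_Data(); columns are hypothetical.
biasedData = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "Income": [42000.0, np.nan, 51000.0, 39000.0],
    "Approved": [1, 0, 1, 0],
})

# First rows and column types.
print(biasedData.head())
print(biasedData.dtypes)

# Number of missing values per column.
missing_per_column = biasedData.isna().sum()
print(missing_per_column)

# Mean of the target per group, a quick way to spot a bias in the data.
approval_by_gender = biasedData.groupby("Gender")["Approved"].mean()
print(approval_by_gender)
```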

When the code is executed, the terminal displays prompts that you need to answer.

Please follow the prompts to specify the characteristics of each feature. Some important notes:

- For each feature, enter its name and whether it is categorical or numerical.

- If the dataset should be biased (biased = True), your dataset should include the features Race and Gender; the utility will then automatically generate data based on predefined biases.

For numerical and categorical features, you can select a distribution type:

  Normal Distribution

  Uniform Distribution

  Binomial Distribution

  Exponential Distribution

  Multinomial Distribution (requires specifying probabilities for each category)
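For orientation, the distribution types listed above can be sampled with NumPy as follows. This sketch only shows what each distribution looks like in code; all parameter values and category labels are hypothetical, and DataCreator may parameterize them differently.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1000

labels = np.array(["A", "B", "C"])
probabilities = [0.5, 0.3, 0.2]  # one probability per category, summing to 1

# One multinomial trial per row yields a one-hot vector; argmax picks the category.
one_hot = rng.multinomial(1, probabilities, size=n)

samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=n),
    "uniform": rng.uniform(low=0.0, high=1.0, size=n),
    "binomial": rng.binomial(n=10, p=0.3, size=n),
    "exponential": rng.exponential(scale=2.0, size=n),
    "multinomial": labels[one_hot.argmax(axis=1)],
}

for name, values in samples.items():
    print(name, values[:5])
```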

For numerical features with correlation, you can choose between:

  Linear Correlation with other numerical features

  Quadratic Correlation with other numerical features

  Exponential Correlation with other numerical features

  Polynomial Correlation (currently in progress)
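The three available numerical correlation types can be pictured as different functions of an existing feature plus noise. The following NumPy sketch rests on that assumption; the coefficients and the noise level are hypothetical and are not taken from the package.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
x = rng.uniform(0.0, 5.0, size=500)      # existing numerical feature
noise = rng.normal(0.0, 0.1, size=500)   # small random disturbance

linear = 2.0 * x + 1.0 + noise           # y = a*x + b
quadratic = 0.5 * x**2 + noise           # y = a*x^2
exponential = np.exp(0.5 * x) + noise    # y = exp(a*x)

# Each derived feature correlates strongly with its own transform of x.
corr_linear = np.corrcoef(x, linear)[0, 1]
corr_quadratic = np.corrcoef(x**2, quadratic)[0, 1]
corr_exponential = np.corrcoef(np.exp(0.5 * x), exponential)[0, 1]
print(corr_linear, corr_quadratic, corr_exponential)
```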

For a categorical feature, you can choose the correlation type "categorical".

The last feature should be defined and handled as the target variable. This enables you to use the dataset for training an ML model.

Once all features are specified, the utility will generate the dataset, including any requested biases, missing values, outliers, and noise. The resulting dataset will be saved as a CSV file in the "datasets" folder within your project directory, with the name "generated_dataset_topic.csv".
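Writing a DataFrame into a "datasets" folder and reading it back can be sketched with pandas as follows. A temporary directory stands in for the project directory and a small hand-built DataFrame stands in for the generated dataset, so both are assumptions; the file name is copied literally from the description above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a generated dataset; the real one comes from generate_Data().
df = pd.DataFrame({"feature_1": [0.1, 0.2, 0.3], "target": [0, 1, 0]})

# A temporary directory stands in for the project directory.
project_dir = tempfile.mkdtemp()
out_dir = os.path.join(project_dir, "datasets")
os.makedirs(out_dir, exist_ok=True)

# Save without the index column, then load the file again.
csv_path = os.path.join(out_dir, "generated_dataset_topic.csv")
df.to_csv(csv_path, index=False)

loaded = pd.read_csv(csv_path)
print(loaded.shape)
```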

Contributions

Contributions to the DataCreator class are welcome. Feel free to open issues or submit pull requests on the GitHub repository.

Happy data generation!

Release history

Version 0.1.0, released Aug 4, 2023

Download files

Source Distribution: synthetic-data-creator-0.1.0.tar.gz (8.7 kB), uploaded Aug 4, 2023

Built Distribution: synthetic_data_creator-0.1.0-py3-none-any.whl (7.4 kB), uploaded Aug 4, 2023, Python 3

Hashes for synthetic-data-creator-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`e083d8232fafb2612cbd15688373131ecb9b233183d6a17058d7f829617c2aeb`
MD5	`aba92574fcbabafa2536e40024458f63`
BLAKE2b-256	`fd013541e27eb23e0e788538963cb15032cac7bb074e027d7ee8e92b034757c5`

Hashes for synthetic_data_creator-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`afae37367c3bc99f901bd7e2247439deac2a5a6da582392a153e0f4297a61405`
MD5	`d3c0097d04797f6536b220a6e8a6f104`
BLAKE2b-256	`0eb87f17a48e131b3f0535a583e02a75fba70500223a41a2a084b28341764555`