
A package for generating synthetic datasets


Data Creator Class - README

The DataCreator class is a Python utility for generating synthetic datasets with different distributions and correlation structures, for use cases such as data analysis, testing, and prototyping machine learning models. It can create both numerical and categorical features and can optionally add biases, missing values, outliers, and noise to the dataset. Please read the whole README.md before using the package.

Package name: data-creator, version 0.1.0

Nice to know:

The DataCreator builds datasets from the strength and correlation of the features with each other and with the target variable. The values of level 0 features are drawn from distributions (the available distributions are listed later in this README). Features of level 1 or higher are created through correlations. Keep in mind that the last variable should be defined as the target variable. Fig. 1 illustrates features with different strengths (variable network).

Requirements

Before using the DataCreator class, ensure you have the following prerequisites:

Python (>=3.6)

Pandas (>=1.0.0)

NumPy (>=1.18.0)

To make using the DataCreator easier and faster, decide on the following points beforehand:

  • define the number of features you want to use

  • define the strength between the features and the target variable

  • define the correlation between the features and the target variable

  • define the distribution type which level 0 features should have
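The points above can be written down before running the tool. The following is a minimal planning sketch, not part of the package: the dictionary keys, the feature names, and the check_plan helper are all hypothetical, and DataCreator does not read this structure (it asks for the same information through prompts).

```python
# Hypothetical planning sheet; DataCreator itself does not consume this structure.
feature_plan = [
    {"name": "age", "kind": "numerical", "level": 0, "distribution": "normal"},
    {"name": "income", "kind": "numerical", "level": 1,
     "correlation": "linear", "depends_on": ["age"]},
    {"name": "approved", "kind": "categorical", "level": 2,
     "correlation": "categorical", "depends_on": ["income"], "target": True},
]


def check_plan(plan):
    """Return a list of problems found in the plan (empty if none)."""
    problems = []
    names = [feat["name"] for feat in plan]
    for i, feat in enumerate(plan):
        # Level 0 features are drawn from a distribution.
        if feat["level"] == 0 and "distribution" not in feat:
            problems.append(f"{feat['name']}: level 0 feature needs a distribution")
        # Level 1 or higher features are built from correlations.
        if feat["level"] > 0 and "correlation" not in feat:
            problems.append(f"{feat['name']}: level {feat['level']} feature needs a correlation")
        # A feature can only depend on features defined before it.
        for dep in feat.get("depends_on", []):
            if dep not in names[:i]:
                problems.append(f"{feat['name']}: depends on unknown feature {dep}")
    # The last feature should be the target variable.
    if not plan or not plan[-1].get("target"):
        problems.append("last feature should be the target variable")
    return problems


num_feat = len(feature_plan)  # value to pass as the number of features
print(num_feat, check_plan(feature_plan))
```

Having the plan at hand makes it quicker to answer the prompts consistently.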

Getting Started

  1. Install the package with pip: pip install data-creator

  2. Import the required libraries and the DataCreator class:

     import numpy as np
    
     import pandas as pd
    
     from data_creator import DataCreator
    
  3. Create an instance of the DataCreator class with the desired parameters:

An example for reference:

	samples = 1000         # Number of samples in the dataset
	num_feat = 5           # Number of features in the dataset
	biased = True          # Add biases to specific features (e.g., gender, race)
	missing_values = True  # Add missing values to the dataset
	outliers = True        # Add outliers to the dataset
	noise = True           # Add noise to the dataset
	topic = "loan"         # The topic or name of the dataset

	generator = DataCreator(samples, num_feat, biased, missing_values, outliers, noise, topic)
  4. Access the generated synthetic dataset:

Calling the generate_Data() method returns the generated synthetic dataset, which is stored here in the biasedData variable.

The dataset is in pandas DataFrame format, so you can use standard DataFrame operations to explore and analyze the data.

	biasedData = generator.generate_Data()
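As a rough illustration of the standard DataFrame operations mentioned above, the sketch below uses a small stand-in DataFrame instead of calling generate_Data(), since that method is interactive. The column names and values are made up for illustration; only the pandas calls carry over to a real generated dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by generator.generate_Data();
# the columns below are invented for this example.
rng = np.random.default_rng(0)
biasedData = pd.DataFrame({
    "age": rng.normal(40, 10, size=200),
    "income": rng.exponential(3000, size=200),
    "approved": rng.integers(0, 2, size=200),
})
# Simulate a few missing values in one column.
biasedData.loc[biasedData.index[::25], "income"] = np.nan

print(biasedData.shape)         # number of rows and columns
print(biasedData.dtypes)        # data type of each column
print(biasedData.isna().sum())  # missing values per column
print(biasedData.describe())    # summary statistics for numeric columns

# The last column is treated as the target variable.
features = biasedData.iloc[:, :-1]
target = biasedData.iloc[:, -1]
```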
  5. When the code runs, the terminal displays a series of prompts.

Follow the prompts to specify the characteristics of each feature. Some important points:

  • For each feature, enter its name and whether it is categorical or numerical.

  • If the dataset should be biased (biased = True), your dataset should include the features Race and Gender. The utility will automatically generate data based on predefined biases.

  • For numerical and categorical features, you can select a distribution type:

      Normal Distribution
    
      Uniform Distribution
    
      Binomial Distribution
    
      Exponential Distribution
    
      Multinomial Distribution (requires specifying probabilities for each category)
    
  • For numerical features with correlation, you can choose between:

      Linear Correlation with other numerical features
    
      Quadratic Correlation with other numerical features
    
      Exponential Correlation with other numerical features
    
      Polynomial Correlation (currently in progress)
    
  • For a categorical feature, you can choose the correlation type "categorical".

  • The last feature should be defined and handled as the target variable. This lets you use the dataset to train ML models.
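To make the distribution and correlation options above more concrete, here is a minimal NumPy sketch of what such features can look like. It is not the package's implementation: the parameter values, the coefficients, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Level 0 features: sampled directly from a distribution.
normal_feat = rng.normal(loc=50.0, scale=10.0, size=n)
uniform_feat = rng.uniform(low=0.0, high=1.0, size=n)
binomial_feat = rng.binomial(n=10, p=0.3, size=n)
exponential_feat = rng.exponential(scale=2.0, size=n)

# Multinomial-style categorical feature: one category per sample,
# drawn with a user-specified probability for each category.
categories = np.array(["low", "medium", "high"])
probs = [0.5, 0.3, 0.2]
categorical_feat = rng.choice(categories, size=n, p=probs)

# Level 1 features: derived from a level 0 feature plus a little noise.
x = (normal_feat - normal_feat.mean()) / normal_feat.std()  # standardized input
noise = rng.normal(scale=0.1, size=n)
linear_feat = 2.0 * x + noise
quadratic_feat = x ** 2 + noise
exponential_corr_feat = np.exp(0.5 * x) + noise

# Pearson correlation between the input and the linearly derived feature.
print(np.corrcoef(x, linear_feat)[0, 1])
```

Note that the Pearson coefficient only measures linear association, so a quadratic relationship around a centered input can show a coefficient near zero even though the feature depends strongly on its input.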

Once all features are specified, the utility will generate the dataset, including any requested biases, missing values, outliers, and noise. The resulting dataset will be saved as a CSV file in the "datasets" folder within your project directory, with the name "generated_dataset_topic.csv".
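For readers who want a feel for what missing values, outliers, and noise do to a column, the following sketch applies all three to a numeric column with pandas and NumPy. How DataCreator does this internally is not documented here, so the perturb function, its parameters, and the fractions used are hypothetical.

```python
import numpy as np
import pandas as pd


def perturb(df, column, rng, missing_frac=0.05, outlier_frac=0.01, noise_std=0.1):
    """Return a copy of df with noise, outliers, and missing values in one numeric column."""
    out = df.copy()
    values = out[column].to_numpy(dtype=float, copy=True)
    n = len(values)
    std = values.std()

    # Noise: Gaussian jitter scaled by the column's standard deviation.
    values += rng.normal(scale=noise_std * std, size=n)

    # Outliers: push a random subset of rows 10 standard deviations upward.
    out_idx = rng.choice(n, size=int(outlier_frac * n), replace=False)
    values[out_idx] += 10 * std

    # Missing values: replace another random subset with NaN.
    miss_idx = rng.choice(n, size=int(missing_frac * n), replace=False)
    values[miss_idx] = np.nan

    out[column] = values
    return out


rng = np.random.default_rng(1)
clean = pd.DataFrame({"income": rng.normal(3000, 500, size=1000)})
dirty = perturb(clean, "income", rng)
print(dirty["income"].isna().sum())
```

The original DataFrame is left untouched, which makes it easy to compare the clean and perturbed versions side by side.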

Contributions

Contributions to the DataCreator class are welcome. Feel free to open issues or submit pull requests on the GitHub repository.

Happy data generation!
