
Indox Synthetic Data Generation (GAN-tensorflow)

Project description

IndoxGen-Tensor: Advanced GAN-based Synthetic Data Generation Framework



Official Website · Documentation · Discord


Overview

IndoxGen-Tensor is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by TensorFlow. This module extends the capabilities of IndoxGen by providing a robust, TensorFlow-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.

Key Features

  • GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
  • TensorFlow Integration: Built on TensorFlow for efficient, GPU-accelerated training and generation.
  • Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
  • Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
  • Training Monitoring: Built-in patience-based early stopping for optimal model training.
  • Scalable Generation: Efficiently generate large volumes of synthetic data post-training.

Installation

pip install indoxgen-tensor

Quick Start Guide

Basic Usage

from indoxGen_tensor import TabularGANConfig, TabularGANTrainer
import pandas as pd

# Load your data
data = pd.read_csv("data/Adult.csv")

# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]

# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)

# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)

# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)

Advanced Techniques

Customizing the GAN Architecture

custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)

custom_trainer = TabularGANTrainer(
    config=custom_config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
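For a rough sense of how the layer lists affect model capacity, you can do a back-of-the-envelope parameter count. The sketch below assumes plain fully connected layers, which is an assumption about the library's internals, and uses a hypothetical output width of 100 features:

```python
# Back-of-the-envelope parameter count for a stack of fully connected layers.
# This is an illustration only; IndoxGen-Tensor's actual architecture may differ.
def dense_param_count(input_dim, layer_sizes, output_dim):
    """Total weights + biases for a dense stack input_dim -> layer_sizes -> output_dim."""
    total = 0
    prev = input_dim
    for size in list(layer_sizes) + [output_dim]:
        total += prev * size + size  # weight matrix + bias vector
        prev = size
    return total

# Generator from the custom config: noise (300) -> [256, 512, 1024, 512] -> 100 features (hypothetical)
print(dense_param_count(300, [256, 512, 1024, 512], 100))
```

Deeper and wider stacks grow the parameter count quickly, which usually calls for more epochs and a larger batch size, as in the custom config above.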

Handling Imbalanced Datasets

# Suppose one class of 'target_column' is underrepresented in your original data
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)

# Adjust generation or sampling to match desired distribution
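One hedged way to act on that last comment, assuming pandas DataFrames and a placeholder `target_column`, is to resample the oversized synthetic pool until the class mix matches a desired distribution:

```python
import pandas as pd

# Sketch: resample a synthetic pool so `column` follows `desired` proportions.
# 'target_column' and the class labels are placeholders, not library API.
def match_distribution(synthetic, column, desired, n_samples, seed=0):
    """Draw n_samples rows so that `column` follows the `desired` proportions."""
    parts = []
    for cls, frac in desired.items():
        pool = synthetic[synthetic[column] == cls]
        n = int(round(frac * n_samples))
        # sample with replacement in case the pool is smaller than requested
        parts.append(pool.sample(n=n, replace=True, random_state=seed))
    # concatenate, then shuffle so classes are interleaved
    return (pd.concat(parts, ignore_index=True)
              .sample(frac=1, random_state=seed)
              .reset_index(drop=True))

# Toy example: boost the rare class to 50%
toy = pd.DataFrame({"target_column": ["common"] * 90 + ["rare"] * 10, "x": range(100)})
balanced = match_distribution(toy, "target_column", {"common": 0.5, "rare": 0.5}, 200)
print(balanced["target_column"].value_counts(normalize=True))
```

Sampling with replacement duplicates rare rows rather than creating new ones; generating a larger synthetic pool first (as in the snippet above) gives the resampler more distinct rows to draw from.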

Configuration and Customization

The TabularGANConfig class allows for extensive customization:

  • input_dim: Dimension of the input noise vector
  • generator_layers and discriminator_layers: List of layer sizes for the generator and discriminator
  • learning_rate, beta_1, beta_2: Adam optimizer parameters
  • batch_size, epochs: Training configuration
  • n_critic: Number of discriminator updates per generator update

Refer to the API documentation for a comprehensive list of configuration options.
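The `n_critic` schedule can be illustrated with a toy loop; the update steps here are stand-ins, not IndoxGen-Tensor internals:

```python
# Illustrative sketch of the n_critic schedule (WGAN-style): the discriminator
# (critic) is updated n_critic times for every single generator update.
def training_schedule(steps, n_critic):
    """Return the sequence of updates performed over `steps` generator steps."""
    updates = []
    for _ in range(steps):
        for _ in range(n_critic):
            updates.append("discriminator")  # critic update on a real/fake batch
        updates.append("generator")          # one generator update
    return updates

schedule = training_schedule(steps=2, n_critic=5)
print(schedule.count("discriminator"), schedule.count("generator"))  # prints: 10 2
```

With the quick-start config (`n_critic=5`), the discriminator therefore sees five batches for every generator step, a common way to keep the critic ahead of the generator.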

Best Practices

  1. Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
  2. Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
  3. Validation: Regularly compare the distribution of synthetic data with the original dataset.
  4. Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
  5. Scalability: For large datasets, consider using distributed training capabilities of TensorFlow.
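Practice 3 (validation) can be started with a few lines of pandas. The column names and data below are illustrative, not tied to any particular dataset:

```python
import numpy as np
import pandas as pd

# Minimal validation sketch: compare summary statistics of a numeric column
# and category proportions between real and synthetic data.
def numeric_drift(real, synth, column):
    """Absolute difference in mean and std for one numeric column."""
    return (abs(real[column].mean() - synth[column].mean()),
            abs(real[column].std() - synth[column].std()))

def category_drift(real, synth, column):
    """Largest absolute difference in category proportions."""
    p = real[column].value_counts(normalize=True)
    q = synth[column].value_counts(normalize=True)
    return p.subtract(q, fill_value=0).abs().max()

# Toy stand-ins for the real dataset and the trainer's output
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(40, 10, 1000), "gender": rng.choice(["M", "F"], 1000)})
synth = pd.DataFrame({"age": rng.normal(41, 11, 1000), "gender": rng.choice(["M", "F"], 1000)})
print(numeric_drift(real, synth, "age"), category_drift(real, synth, "gender"))
```

Large drift values suggest the GAN has not captured the column's distribution and the configuration (layers, epochs, `n_critic`) may need tuning.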

Roadmap

  • Implement basic GAN architecture for tabular data
  • Add support for mixed data types (categorical, continuous, integer)
  • Integrate early stopping and training history
  • Implement more advanced GAN variants (WGAN, CGAN)
  • Add built-in privacy preserving mechanisms
  • Develop automated hyperparameter tuning
  • Create visualization tools for synthetic data quality assessment
  • Implement distributed training support for large-scale datasets

Contributing

We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.

License

IndoxGen-Tensor is released under the MIT License. See LICENSE.md for more details.


IndoxGen-Tensor - Advancing Synthetic Data Generation with GAN Technology

