Indox Synthetic Data Generation (GAN-pytorch)

Project description

IndoxGen-Torch: Advanced GAN-based Synthetic Data Generation Framework


Official Website | Documentation | Discord

NEW: Subscribe to our mailing list for updates and news!

Overview

IndoxGen-Torch is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by PyTorch. This module extends the capabilities of IndoxGen by providing a robust, PyTorch-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.

Key Features

  • GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
  • PyTorch Integration: Built on PyTorch for efficient, GPU-accelerated training and generation.
  • Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
  • Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
  • Training Monitoring: Built-in patience-based early stopping for optimal model training.
  • Scalable Generation: Efficiently generate large volumes of synthetic data post-training.

Installation

pip install IndoxGen-Torch

Quick Start Guide

Basic Usage

from indoxGen_pytorch import TabularGANConfig, TabularGANTrainer
import pandas as pd

# Load your data
data = pd.read_csv("data/Adult.csv")

# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]

# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)

# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)

# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
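After generation, it is worth sanity-checking the output before downstream use. The sketch below uses only pandas; the `sanity_check` helper and the toy DataFrames are illustrative stand-ins, not part of the IndoxGen-Torch API — substitute your original data and the output of `generate_samples`:

```python
import pandas as pd

def sanity_check(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    """Quick post-generation checks: matching column layout and
    per-column mean gaps for numeric features."""
    report = {"columns_match": list(real.columns) == list(synth.columns)}
    for col in real.select_dtypes("number").columns:
        report[f"{col}_mean_gap"] = abs(real[col].mean() - synth[col].mean())
    return report

# Toy stand-ins for the real dataset and the generated samples
real = pd.DataFrame({"age": [30, 40, 50], "income": ["a", "b", "a"]})
synth = pd.DataFrame({"age": [31, 39, 52], "income": ["a", "a", "b"]})
print(sanity_check(real, synth))
```

Large mean gaps or mismatched columns are an early signal that training needs more epochs or a different configuration.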

Advanced Techniques

Customizing the GAN Architecture

custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)

custom_trainer = TabularGANTrainer(config=custom_config, ...)  # pass column definitions as in Basic Usage

Handling Imbalanced Datasets

# Compare the class balance of the original data with the synthetic output
original_class_distribution = data['target_column'].value_counts(normalize=True)

# Generate a large synthetic pool, then check whether the GAN preserved the balance
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
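Comparing distributions only diagnoses imbalance; one way to correct it is to over-generate and then resample the synthetic pool per class. A minimal sketch using only pandas — the `rebalance` helper and the toy DataFrame are illustrative, not part of the library:

```python
import pandas as pd

def rebalance(synthetic: pd.DataFrame, target_col: str, n_per_class: int,
              seed: int = 0) -> pd.DataFrame:
    """Resample a synthetic pool so each class appears n_per_class times.

    Classes with fewer rows than n_per_class are sampled with replacement.
    """
    parts = []
    for _, group in synthetic.groupby(target_col):
        parts.append(group.sample(n=n_per_class,
                                  replace=len(group) < n_per_class,
                                  random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy stand-in for a trainer.generate_samples(...) result
pool = pd.DataFrame({"income": ["<=50K"] * 90 + [">50K"] * 10,
                     "age": range(100)})
balanced = rebalance(pool, "income", n_per_class=50)
print(balanced["income"].value_counts().to_dict())  # each class: 50 rows
```

Over-generating first (as in the `generate_samples(100000)` call above) keeps the minority-class resampling from repeating too few distinct rows.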

Configuration and Customization

The TabularGANConfig class allows for extensive customization:

  • input_dim: Dimension of the input noise vector
  • generator_layers and discriminator_layers: List of layer sizes for the generator and discriminator
  • learning_rate, beta_1, beta_2: Adam optimizer parameters
  • batch_size, epochs: Training configuration
  • n_critic: Number of discriminator updates per generator update

Refer to the API documentation for a comprehensive list of configuration options.

Best Practices

  1. Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
  2. Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
  3. Validation: Regularly compare the distribution of synthetic data with the original dataset.
  4. Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
  5. Scalability: For large datasets, consider PyTorch's distributed training capabilities (e.g., DistributedDataParallel).
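For the validation practice above, a simple distribution comparison needs no extra dependencies. The sketch below computes the total variation distance between a categorical column in the real and synthetic data; the `categorical_tvd` helper is illustrative, not part of the library:

```python
import pandas as pd

def categorical_tvd(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between two categorical distributions
    (0 = identical, 1 = disjoint supports)."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

# Toy stand-ins for a column from the real and synthetic DataFrames
real = pd.Series(["a"] * 60 + ["b"] * 40)
synth = pd.Series(["a"] * 50 + ["b"] * 50)
print(categorical_tvd(real, synth))  # ≈ 0.1
```

Tracking a metric like this per column across training runs gives a concrete target for the hyperparameter tuning suggested above.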

Roadmap

  • Implement basic GAN architecture for tabular data
  • Add support for mixed data types (categorical, continuous, integer)
  • Integrate early stopping and training history
  • Implement more advanced GAN variants (WGAN, CGAN)
  • Add built-in privacy preserving mechanisms
  • Develop automated hyperparameter tuning
  • Create visualization tools for synthetic data quality assessment
  • Implement distributed training support for large-scale datasets

Contributing

We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.

License

IndoxGen-Torch is released under the MIT License. See LICENSE.md for more details.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

indoxgen_torch-0.0.9.tar.gz (31.2 kB)

Uploaded Source

Built Distribution

indoxGen_torch-0.0.9-py3-none-any.whl (32.4 kB)

Uploaded Python 3

File details

Details for the file indoxgen_torch-0.0.9.tar.gz.

File metadata

  • Download URL: indoxgen_torch-0.0.9.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.0

File hashes

Hashes for indoxgen_torch-0.0.9.tar.gz
Algorithm Hash digest
SHA256 44f5619b22d29be8deab50d040cd6aa2f9938d5ce678fbf06073c1143c1cb1bf
MD5 9df208f482005db6fd41592209917417
BLAKE2b-256 cf7cf17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57

See more details on using hashes here.

File details

Details for the file indoxGen_torch-0.0.9-py3-none-any.whl.

File metadata

File hashes

Hashes for indoxGen_torch-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 c314d81cead763586a4b2cf54aa7a01123363b14a3cfbaff0acb81e1440aec69
MD5 024e214efe4d458b8f59b729249916b6
BLAKE2b-256 988868f47b7f0ecf6c1e5f33dff5a52682c7920a74c39b886fc215ca8d1bc14e