Indox Synthetic Data Generation (GAN-pytorch)
IndoxGen-Torch: Advanced GAN-based Synthetic Data Generation Framework
Official Website • Documentation • Discord
NEW: Subscribe to our mailing list for updates and news!
Overview
IndoxGen-Torch is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by PyTorch. This module extends the capabilities of IndoxGen by providing a robust, PyTorch-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.
Key Features
- GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- PyTorch Integration: Built on PyTorch for efficient, GPU-accelerated training and generation.
- Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
- Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- Training Monitoring: Built-in patience-based early stopping for optimal model training.
- Scalable Generation: Efficiently generate large volumes of synthetic data post-training.
Installation
pip install IndoxGen-Torch
Quick Start Guide
Basic Usage
from indoxGen_pytorch import TabularGANConfig, TabularGANTrainer
import pandas as pd
# Load your data
data = pd.read_csv("data/Adult.csv")
# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)
# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)
# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
Advanced Techniques
Customizing the GAN Architecture
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)
custom_trainer = TabularGANTrainer(config=custom_config, ...)
Handling Imbalanced Datasets
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
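Once both distributions are computed, a common next step is to resample the synthetic data so its class proportions match the original. The sketch below shows one way to do this with plain pandas; the `target_column` name and the toy DataFrames are illustrative stand-ins, not part of the IndoxGen-Torch API:

```python
import pandas as pd

# Toy stand-ins for the original data and an imbalanced synthetic sample
original = pd.DataFrame({"target_column": ["a"] * 70 + ["b"] * 30})
synthetic = pd.DataFrame({"target_column": ["a"] * 40 + ["b"] * 60})

target_dist = original["target_column"].value_counts(normalize=True)
n_total = len(synthetic)

# Sample each class from the synthetic pool to match the original proportions;
# sample with replacement when the pool is smaller than the requested count.
parts = []
for cls, frac in target_dist.items():
    pool = synthetic[synthetic["target_column"] == cls]
    n = int(round(frac * n_total))
    parts.append(pool.sample(n=n, replace=len(pool) < n, random_state=0))
rebalanced = pd.concat(parts, ignore_index=True)

print(rebalanced["target_column"].value_counts(normalize=True).to_dict())
```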
Configuration and Customization
The TabularGANConfig class allows for extensive customization:
- input_dim: dimension of the input noise vector
- generator_layers and discriminator_layers: lists of layer sizes for the generator and discriminator
- learning_rate, beta_1, beta_2: Adam optimizer parameters
- batch_size, epochs: training configuration
- n_critic: number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
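To illustrate what n_critic controls, here is a minimal WGAN-style training loop in PyTorch in which the discriminator (critic) is updated n_critic times for every generator update. The tiny 2-feature models and random "real" batch below are illustrative only and are not IndoxGen-Torch's internals:

```python
import torch
from torch import nn

input_dim, n_critic, steps = 8, 5, 4
G = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.9))

real = torch.randn(64, 2)  # stand-in for a batch of encoded real rows
d_updates = g_updates = 0
for _ in range(steps):
    for _ in range(n_critic):  # critic is updated n_critic times...
        noise = torch.randn(64, input_dim)
        loss_d = D(G(noise).detach()).mean() - D(real).mean()
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        d_updates += 1
    noise = torch.randn(64, input_dim)  # ...then the generator once
    loss_g = -D(G(noise)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    g_updates += 1

print(d_updates, g_updates)  # 20 4
```

Larger n_critic values keep the critic closer to optimal between generator steps, at the cost of more computation per generator update.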
Best Practices
- Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
- Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
- Validation: Regularly compare the distribution of synthetic data with the original dataset.
- Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
- Scalability: For large datasets, consider using PyTorch's distributed training capabilities (e.g., DistributedDataParallel).
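The validation practice above can be sketched as a distribution-distance check. The snippet below computes the two-sample Kolmogorov-Smirnov distance (the maximum gap between empirical CDFs) in pure NumPy; the random arrays are stand-ins for a numeric column such as "age", and no IndoxGen-Torch API is involved:

```python
import numpy as np

def ks_distance(a, b):
    """Max absolute difference between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real = rng.normal(38, 12, size=5000)        # stand-in for a real "age" column
good_synth = rng.normal(38, 12, size=5000)  # well-matched synthetic sample
bad_synth = rng.normal(55, 12, size=5000)   # badly-matched synthetic sample

print(ks_distance(real, good_synth))  # small: distributions agree
print(ks_distance(real, bad_synth))   # large: mismatch worth investigating
```

A distance near zero suggests the synthetic column tracks the original; a large value flags a column whose distribution drifted during training.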
Roadmap
- Implement basic GAN architecture for tabular data
- Add support for mixed data types (categorical, continuous, integer)
- Integrate early stopping and training history
- Implement more advanced GAN variants (WGAN, CGAN)
- Add built-in privacy preserving mechanisms
- Develop automated hyperparameter tuning
- Create visualization tools for synthetic data quality assessment
- Implement distributed training support for large-scale datasets
Contributing
We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.
License
IndoxGen-Torch is released under the MIT License. See LICENSE.md for more details.