Indox Synthetic Data Generation (GAN-pytorch)
IndoxGen-Torch: Advanced GAN-based Synthetic Data Generation Framework
Official Website • Documentation • Discord
NEW: Subscribe to our mailing list for updates and news!
Overview
IndoxGen-Torch is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by PyTorch. This module extends the capabilities of IndoxGen by providing a robust, PyTorch-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.
Key Features
- GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- PyTorch Integration: Built on PyTorch for efficient, GPU-accelerated training and generation.
- Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
- Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- Training Monitoring: Built-in patience-based early stopping for optimal model training.
- Scalable Generation: Efficiently generate large volumes of synthetic data post-training.
Installation
pip install IndoxGen-Torch
Quick Start Guide
Basic Usage
from indoxGen_pytorch import TabularGANConfig, TabularGANTrainer
import pandas as pd
# Load your data
data = pd.read_csv("data/Adult.csv")
# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)
# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)
# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
Advanced Techniques
Customizing the GAN Architecture
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)
custom_trainer = TabularGANTrainer(config=custom_config, ...)
Handling Imbalanced Datasets
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
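Once both distributions are computed, a common next step is to resample the synthetic data so its class proportions match the original. The sketch below shows one way to do this with plain pandas; the `target_column` name and the toy DataFrames are illustrative stand-ins, not part of the IndoxGen-Torch API:

```python
import pandas as pd

# Toy stand-ins for the original data and an imbalanced synthetic sample
original = pd.DataFrame({"target_column": ["a"] * 70 + ["b"] * 30})
synthetic = pd.DataFrame({"target_column": ["a"] * 40 + ["b"] * 60})

target_dist = original["target_column"].value_counts(normalize=True)
n_total = len(synthetic)

# Sample each class from the synthetic pool to match the original proportions;
# sample with replacement when the pool is smaller than the requested count.
parts = []
for cls, frac in target_dist.items():
    pool = synthetic[synthetic["target_column"] == cls]
    n = int(round(frac * n_total))
    parts.append(pool.sample(n=n, replace=len(pool) < n, random_state=0))
rebalanced = pd.concat(parts, ignore_index=True)

print(rebalanced["target_column"].value_counts(normalize=True).to_dict())
```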
Configuration and Customization
The TabularGANConfig class allows for extensive customization:
- input_dim: dimension of the input noise vector
- generator_layers and discriminator_layers: lists of layer sizes for the generator and discriminator
- learning_rate, beta_1, beta_2: Adam optimizer parameters
- batch_size, epochs: training configuration
- n_critic: number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
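To illustrate what n_critic controls, here is a minimal WGAN-style training loop in PyTorch in which the discriminator (critic) is updated n_critic times for every generator update. The tiny 2-feature models and random "real" batch below are illustrative only and are not IndoxGen-Torch's internals:

```python
import torch
from torch import nn

input_dim, n_critic, steps = 8, 5, 4
G = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.9))

real = torch.randn(64, 2)  # stand-in for a batch of encoded real rows
d_updates = g_updates = 0
for _ in range(steps):
    for _ in range(n_critic):  # critic is updated n_critic times...
        noise = torch.randn(64, input_dim)
        loss_d = D(G(noise).detach()).mean() - D(real).mean()
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        d_updates += 1
    noise = torch.randn(64, input_dim)  # ...then the generator once
    loss_g = -D(G(noise)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    g_updates += 1

print(d_updates, g_updates)  # 20 4
```

Larger n_critic values keep the critic closer to optimal between generator steps, at the cost of more computation per generator update.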
Best Practices
- Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
- Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
- Validation: Regularly compare the distribution of synthetic data with the original dataset.
- Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
- Scalability: For large datasets, consider using PyTorch's distributed training capabilities (e.g., DistributedDataParallel).
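The validation practice above can be sketched as a distribution-distance check. The snippet below computes the two-sample Kolmogorov-Smirnov distance (the maximum gap between empirical CDFs) in pure NumPy; the random arrays are stand-ins for a numeric column such as "age", and no IndoxGen-Torch API is involved:

```python
import numpy as np

def ks_distance(a, b):
    """Max absolute difference between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real = rng.normal(38, 12, size=5000)        # stand-in for a real "age" column
good_synth = rng.normal(38, 12, size=5000)  # well-matched synthetic sample
bad_synth = rng.normal(55, 12, size=5000)   # badly-matched synthetic sample

print(ks_distance(real, good_synth))  # small: distributions agree
print(ks_distance(real, bad_synth))   # large: mismatch worth investigating
```

A distance near zero suggests the synthetic column tracks the original; a large value flags a column whose distribution drifted during training.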
Roadmap
- Implement basic GAN architecture for tabular data
- Add support for mixed data types (categorical, continuous, integer)
- Integrate early stopping and training history
- Implement more advanced GAN variants (WGAN, CGAN)
- Add built-in privacy preserving mechanisms
- Develop automated hyperparameter tuning
- Create visualization tools for synthetic data quality assessment
- Implement distributed training support for large-scale datasets
Contributing
We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.
License
IndoxGen-Torch is released under the MIT License. See LICENSE.md for more details.