Indox Synthetic Data Generation (GAN-tensorflow)
IndoxGen-Tensor: Advanced GAN-based Synthetic Data Generation Framework
Official Website • Documentation • Discord
NEW: Subscribe to our mailing list for updates and news!
Overview
IndoxGen-Tensor is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by TensorFlow. This module extends the capabilities of IndoxGen by providing a robust, TensorFlow-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.
Key Features
- GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- TensorFlow Integration: Built on TensorFlow for efficient, GPU-accelerated training and generation.
- Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
- Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- Training Monitoring: Built-in patience-based early stopping for optimal model training.
- Scalable Generation: Efficiently generate large volumes of synthetic data post-training.
Installation
pip install indoxgen-tensor
Quick Start Guide
Basic Usage
from indoxGen_tensor import TabularGANConfig, TabularGANTrainer
import pandas as pd
# Load your data
data = pd.read_csv("data/Adult.csv")
# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)
# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)
# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
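The generated samples are used as a pandas DataFrame in the examples that follow, so standard pandas tooling applies for a quick sanity check and export. A minimal sketch (the output path is only an illustrative choice, not part of the library):
# Quick sanity check on the generated DataFrame
print(synthetic_data.shape)
print(synthetic_data.head())
# Compare a categorical column's frequencies against the original data
print(data["income"].value_counts(normalize=True))
print(synthetic_data["income"].value_counts(normalize=True))
# Persist the synthetic dataset for downstream use (illustrative path)
synthetic_data.to_csv("data/Adult_synthetic.csv", index=False)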
Advanced Techniques
Customizing the GAN Architecture
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)
custom_trainer = TabularGANTrainer(config=custom_config, ...)  # plus the same column arguments as in Basic Usage
Handling Imbalanced Datasets
# Assuming 'rare_class' is underrepresented in your original data
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
# Adjust generation or sampling to match desired distribution
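One simple way to adjust, using only pandas, is to oversample from the trained generator and then resample each class to a target share. This is a sketch rather than part of the IndoxGen-Tensor API; the class labels and shares below are placeholders:
# Placeholder target shares; replace with the classes in your data
# (e.g. boosting the underrepresented 'rare_class')
desired_shares = {"rare_class": 0.5, "majority_class": 0.5}
n_total = 50000
# Draw a larger pool, then keep a per-class quota and shuffle
pool = trainer.generate_samples(200000)
parts = [
    pool[pool["target_column"] == cls].sample(
        n=int(round(share * n_total)), replace=True, random_state=42
    )
    for cls, share in desired_shares.items()
]
balanced = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)
print(balanced["target_column"].value_counts(normalize=True))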
Configuration and Customization
The TabularGANConfig class allows for extensive customization:
- input_dim: Dimension of the input noise vector
- generator_layers and discriminator_layers: Lists of layer sizes for the generator and discriminator
- learning_rate, beta_1, beta_2: Adam optimizer parameters
- batch_size, epochs: Training configuration
- n_critic: Number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
Best Practices
- Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
- Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
- Validation: Regularly compare the distributions of the synthetic data with those of the original dataset (a quick check is sketched after this list).
- Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
- Scalability: For large datasets, consider using distributed training capabilities of TensorFlow.
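For the validation point above, a lightweight check is to compare per-column statistics between the real and synthetic data. A minimal sketch using pandas plus SciPy's two-sample Kolmogorov-Smirnov test (SciPy is an extra dependency used only for this check; data and synthetic_data are the objects from the Quick Start example):
from scipy.stats import ks_2samp
# Numeric columns: two-sample KS test per column
for col in ["age", "hours-per-week"]:
    stat, p_value = ks_2samp(data[col], synthetic_data[col])
    print(f"{col}: KS statistic={stat:.3f}, p-value={p_value:.3f}")
# Categorical columns: compare normalized frequencies side by side
for col in ["workclass", "income"]:
    comparison = pd.DataFrame({
        "real": data[col].value_counts(normalize=True),
        "synthetic": synthetic_data[col].value_counts(normalize=True),
    })
    print(comparison)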
Roadmap
- Implement basic GAN architecture for tabular data
- Add support for mixed data types (categorical, continuous, integer)
- Integrate early stopping and training history
- Implement more advanced GAN variants (WGAN, CGAN)
- Add built-in privacy preserving mechanisms
- Develop automated hyperparameter tuning
- Create visualization tools for synthetic data quality assessment
- Implement distributed training support for large-scale datasets
Contributing
We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.
License
IndoxGen-Tensor is released under the MIT License. See LICENSE.md for more details.
IndoxGen-Tensor - Advancing Synthetic Data Generation with GAN Technology
Hashes for indoxGen_tensor-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | aa84cba244d6a5aa0505e939d506b2e14c9a029e0fa70c871f9dc6f6a17d3cdb |
| MD5 | 4041743a6fd69fcb468d0cf1d4da7fe8 |
| BLAKE2b-256 | ac28db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8 |