Indox Synthetic Data Generation (GAN-TensorFlow)
IndoxGen-Tensor: Advanced GAN-based Synthetic Data Generation Framework
Official Website • Documentation • Discord
NEW: Subscribe to our mailing list for updates and news!
Overview
IndoxGen-Tensor is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by TensorFlow. This module extends the capabilities of IndoxGen by providing a robust, TensorFlow-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.
Key Features
- GAN-based Generation: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- TensorFlow Integration: Built on TensorFlow for efficient, GPU-accelerated training and generation.
- Flexible Data Handling: Supports categorical, mixed, and integer columns for versatile data modeling.
- Customizable Architecture: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- Training Monitoring: Built-in patience-based early stopping for optimal model training.
- Scalable Generation: Efficiently generate large volumes of synthetic data post-training.
Installation
pip install indoxgen-tensor
Quick Start Guide
Basic Usage
from indoxGen_tensor import TabularGANConfig, TabularGANTrainer
import pandas as pd
# Load your data
data = pd.read_csv("data/Adult.csv")
# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)
# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)
# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
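The generated samples can then be inspected and persisted with standard pandas operations. A minimal follow-up sketch, assuming generate_samples returns a pandas DataFrame (as the imbalanced-data example below also assumes); the output path is illustrative:
print(synthetic_data.shape)   # number of generated rows and columns
print(synthetic_data.head())  # spot-check a few generated records
synthetic_data.to_csv("data/Adult_synthetic.csv", index=False)  # persist for downstream use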
Advanced Techniques
Customizing the GAN Architecture
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)
custom_trainer = TabularGANTrainer(config=custom_config, ...)
Handling Imbalanced Datasets
# Assuming 'rare_class' is underrepresented in your original data
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
# Adjust generation or sampling to match desired distribution
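One way to perform that adjustment is to oversample the synthetic rows class by class until a target distribution is reached. This uses plain pandas rather than any IndoxGen-Tensor API; target_column and the class shares below are illustrative assumptions:
import pandas as pd

target_shares = {"rare_class": 0.5, "common_class": 0.5}  # desired class proportions (illustrative)
n_total = 100_000
parts = []
for cls, share in target_shares.items():
    cls_rows = synthetic_data[synthetic_data["target_column"] == cls]
    n_needed = int(n_total * share)
    # Sample with replacement when the GAN produced fewer rows of this class than needed
    parts.append(cls_rows.sample(n=n_needed, replace=len(cls_rows) < n_needed, random_state=42))
balanced_synthetic = pd.concat(parts, ignore_index=True)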
Configuration and Customization
The TabularGANConfig class allows for extensive customization:
- input_dim: Dimension of the input noise vector
- generator_layers and discriminator_layers: Lists of layer sizes for the generator and discriminator
- learning_rate, beta_1, beta_2: Adam optimizer parameters
- batch_size, epochs: Training configuration
- n_critic: Number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
Best Practices
- Data Preprocessing: Ensure your data is properly cleaned and normalized before training.
- Hyperparameter Tuning: Experiment with different configurations to find the optimal setup for your dataset.
- Validation: Regularly compare the distribution of synthetic data with the original dataset (see the sketch after this list).
- Privacy Considerations: Implement differential privacy techniques when dealing with sensitive data.
- Scalability: For large datasets, consider using TensorFlow's distributed training capabilities.
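For the validation practice above, a quick distribution comparison needs nothing beyond pandas and SciPy. A minimal sketch, assuming data, synthetic_data, categorical_columns, and integer_columns are defined as in the Quick Start; the two-sample Kolmogorov-Smirnov test is a generic statistical check, not an IndoxGen-Tensor feature:
from scipy.stats import ks_2samp

# Categorical columns: largest absolute gap between normalized category frequencies
for col in categorical_columns:
    real_freq = data[col].value_counts(normalize=True)
    synth_freq = synthetic_data[col].value_counts(normalize=True)
    gap = real_freq.subtract(synth_freq, fill_value=0).abs().max()
    print(f"{col}: max frequency gap = {gap:.3f}")

# Numeric columns: two-sample Kolmogorov-Smirnov test on the raw values
for col in integer_columns:
    stat, p_value = ks_2samp(data[col].dropna(), synthetic_data[col].dropna())
    print(f"{col}: KS statistic = {stat:.3f}, p-value = {p_value:.3f}")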
Roadmap
- Implement basic GAN architecture for tabular data
- Add support for mixed data types (categorical, continuous, integer)
- Integrate early stopping and training history
- Implement more advanced GAN variants (WGAN, CGAN)
- Add built-in privacy preserving mechanisms
- Develop automated hyperparameter tuning
- Create visualization tools for synthetic data quality assessment
- Implement distributed training support for large-scale datasets
Contributing
We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.
License
IndoxGen-Tensor is released under the MIT License. See LICENSE.md for more details.
IndoxGen-Tensor - Advancing Synthetic Data Generation with GAN Technology
File details
Details for the file indoxgen_tensor-0.1.0.tar.gz.
File metadata
- Download URL: indoxgen_tensor-0.1.0.tar.gz
- Upload date:
- Size: 30.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 778d86079f7080ced26708e2462b0f1399b4e5b9cffbe384da5f6f1fcfda7837
MD5 | 7df6276f712d55f7d5be6360801286e5
BLAKE2b-256 | 4e25f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c
File details
Details for the file indoxGen_tensor-0.1.0-py3-none-any.whl.
File metadata
- Download URL: indoxGen_tensor-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | aa84cba244d6a5aa0505e939d506b2e14c9a029e0fa70c871f9dc6f6a17d3cdb
MD5 | 4041743a6fd69fcb468d0cf1d4da7fe8
BLAKE2b-256 | ac28db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8