Package for Synthetic Data Generation using Distributional Learning of VAE
DistVAE-Tabular
DistVAE is a novel approach to distributional learning in the VAE framework. It focuses on accurately capturing the underlying distribution of the observed dataset through nonparametric CDF estimation.
We utilize the continuous ranked probability score (CRPS), a strictly proper scoring rule, as the reconstruction loss while preserving the mathematical derivation of the lower bound of the data log-likelihood. Additionally, we introduce a synthetic data generation mechanism that effectively preserves differential privacy.
For a detailed explanation of the method, see our paper (link).
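To make the reconstruction loss more concrete, here is a minimal sketch of the CRPS for a forecast represented by a finite sample, using the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|. It is only an illustration of the scoring rule, not the package's implementation, and the function name `empirical_crps` is hypothetical.

```python
def empirical_crps(samples, y):
    """CRPS of the empirical distribution of `samples` at observation `y`.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, where X and X' are
    independent draws from F. Lower is better.
    NOTE: illustrative helper, not part of distvae_tabular.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    # average distance between the forecast draws and the observation
    term_obs = sum(abs(x - y) for x in samples) / n
    # average distance between all pairs of forecast draws (spread of F)
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# A point-mass forecast reduces to the absolute error.
print(empirical_crps([2.0, 2.0, 2.0], 3.0))  # 1.0
# A forecast spread around the observation scores better than a distant one.
print(empirical_crps([0.0, 1.0], 0.5))       # 0.25
```

The double sum is quadratic in the number of samples, which is fine for a sketch but would need a sorted-sample formula for large inputs.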
1. Installation
Install using pip:
pip install distvae-tabular
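To confirm that the installation succeeded, one option is to query the installed distribution metadata from Python. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("distvae-tabular"))
```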
2. Usage
from distvae_tabular import distvae
distvae.DistVAE # DistVAE model
distvae.generate_data # generate synthetic data
- See example.ipynb for a detailed example and its results with the loan dataset.
- Download link for the loan dataset: https://www.kaggle.com/datasets/teertha/personal-loan-modeling
Example
"""device setting"""
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
"""load dataset and specify column types"""
import pandas as pd
data = pd.read_csv('./loan.csv')
continuous_features = [
    'Age',
    'Experience',
    'Income',
    'CCAvg',
    'Mortgage',
]
categorical_features = [
    'Family',
    'Personal Loan',
    'Securities Account',
    'CD Account',
    'Online',
    'CreditCard',
]
integer_features = [
    'Age',
    'Experience',
    'Income',
    'Mortgage',
]
"""DistVAE"""
from distvae_tabular import distvae
# use a separate name for the model so the imported module is not shadowed
model = distvae.DistVAE(
    data=data,
    continuous_features=continuous_features,
    categorical_features=categorical_features,
    integer_features=integer_features,
    epochs=5,  # for quick checking (default is 1000)
)
"""training"""
model.train()
"""generate synthetic data"""
syndata = model.generate_data(100)
syndata
"""generate synthetic data with Differential Privacy"""
syndata = model.generate_data(100, lambda_=0.1)
syndata
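Before training, it can help to sanity-check the three column lists. The sketch below is a hypothetical helper (`check_feature_lists` is not part of this package). It assumes, based only on the example above, that continuous and categorical columns should not overlap and that every integer feature is also listed as continuous; the package itself may or may not enforce these rules.

```python
def check_feature_lists(columns, continuous, categorical, integer):
    """Return a list of human-readable problems; an empty list means OK.

    NOTE: the rules checked here are assumptions inferred from the example,
    not documented requirements of distvae_tabular.
    """
    problems = []
    known = set(columns)
    # every declared feature should exist in the dataset
    for name in list(continuous) + list(categorical):
        if name not in known:
            problems.append(f"missing column: {name}")
    # a column should have exactly one type
    for name in sorted(set(continuous) & set(categorical)):
        problems.append(f"both continuous and categorical: {name}")
    # integer features are treated here as a subset of continuous ones
    for name in integer:
        if name not in continuous:
            problems.append(f"integer feature not listed as continuous: {name}")
    return problems


continuous = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
categorical = ['Family', 'Personal Loan', 'Securities Account',
               'CD Account', 'Online', 'CreditCard']
integer = ['Age', 'Experience', 'Income', 'Mortgage']
# Stand-in for list(data.columns); the real file may contain more columns.
columns = continuous + categorical

print(check_feature_lists(columns, continuous, categorical, integer))
```

With a real DataFrame, `list(data.columns)` would take the place of the stand-in `columns` list.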
Citation
If you use this code or package, please cite our associated paper:
@article{an2024distributional,
  title={Distributional learning of variational AutoEncoder: application to synthetic data generation},
  author={An, Seunghwan and Jeon, Jong-June},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}