Package for Synthetic Data Generation using Distributional Learning of VAE

DistVAE-Tabular

DistVAE is a novel approach to distributional learning in the VAE framework. It focuses on accurately capturing the underlying distribution of the observed dataset through nonparametric CDF estimation.

We utilize the continuous ranked probability score (CRPS), a strictly proper scoring rule, as the reconstruction loss while preserving the mathematical derivation of the lower bound of the data log-likelihood. Additionally, we introduce a synthetic data generation mechanism that effectively preserves differential privacy.

For a detailed explanation of the method, see our paper (link).
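To make the CRPS mentioned above concrete, here is a minimal sketch of the score for an empirical (sample-based) forecast. It is an illustration only, not the package's internal loss: the function name `crps_ensemble` is hypothetical, and it relies on the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from F.

```python
def crps_ensemble(samples, y):
    """Empirical CRPS of a sample-based forecast against observation y.

    Hypothetical helper for illustration; not part of distvae_tabular.
    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| with F taken to be the
    empirical distribution of `samples`.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must be non-empty")
    # Average distance between forecast samples and the observation.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between forecast samples (spread term).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# A point forecast reduces to the absolute error.
print(crps_ensemble([0.0], 2.0))       # 2.0
# A two-point forecast centred on the observation.
print(crps_ensemble([0.0, 2.0], 1.0))  # 0.5
```

Lower values are better, and the score is zero only when the forecast puts all of its mass on the observed value.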

1. Installation

Install using pip:

pip install distvae-tabular

2. Usage

from distvae_tabular import distvae
distvae.DistVAE # DistVAE model
distvae.generate_data # generate synthetic data

Example

"""device setting"""
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

"""load dataset and specify column types"""
import pandas as pd
data = pd.read_csv('./loan.csv') 
continuous_features = [
    'Age',
    'Experience',
    'Income', 
    'CCAvg',
    'Mortgage',
]
categorical_features = [
    'Family',
    'Personal Loan',
    'Securities Account',
    'CD Account',
    'Online',
    'CreditCard'
]
integer_features = [
    'Age',
    'Experience',
    'Income', 
    'Mortgage'
]

"""DistVAE"""
from distvae_tabular import distvae

# use a separate name so the imported `distvae` module is not shadowed
model = distvae.DistVAE(
    data=data,
    continuous_features=continuous_features,
    categorical_features=categorical_features,
    integer_features=integer_features,
    epochs=5 # for quick checking (default is 1000)
)

"""training"""
model.train()

"""generate synthetic data"""
syndata = model.generate_data(100)
syndata

"""generate synthetic data with Differential Privacy"""
syndata = model.generate_data(100, lambda_=0.1)
syndata
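Before constructing the model, it can help to sanity-check the column lists from the example. The sketch below is a hypothetical helper (`check_feature_lists` is not part of `distvae_tabular`). It assumes, based only on the example above, that integer features are a subset of the continuous features and that continuous and categorical features do not overlap. Whether the package enforces these rules itself is not stated here.

```python
def check_feature_lists(columns, continuous, categorical, integer):
    """Return a list of human-readable problems found in the feature lists.

    Hypothetical helper for illustration; the rules checked here are
    assumptions inferred from the example, not documented requirements.
    """
    problems = []
    known = set(columns)
    groups = (
        ("continuous", continuous),
        ("categorical", categorical),
        ("integer", integer),
    )
    # Every listed feature should be an actual column of the data.
    for name, features in groups:
        missing = [f for f in features if f not in known]
        if missing:
            problems.append(f"{name} features not in data: {missing}")
    # A column should not be declared both continuous and categorical.
    overlap = sorted(set(continuous) & set(categorical))
    if overlap:
        problems.append(f"both continuous and categorical: {overlap}")
    # Assumed rule: integer features are continuous features with integer values.
    stray = [f for f in integer if f not in continuous]
    if stray:
        problems.append(f"integer features not listed as continuous: {stray}")
    return problems


# Column names taken from the example above.
columns = [
    'Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Family',
    'Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard',
]
continuous_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
categorical_features = [
    'Family', 'Personal Loan', 'Securities Account',
    'CD Account', 'Online', 'CreditCard',
]
integer_features = ['Age', 'Experience', 'Income', 'Mortgage']

print(check_feature_lists(
    columns, continuous_features, categorical_features, integer_features
))  # []
```

With a real dataset, `list(data.columns)` can be passed as the first argument in place of the hard-coded `columns` list.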

Citation

If you use this code or package, please cite our associated paper:

@article{an2024distributional,
  title={Distributional learning of variational AutoEncoder: application to synthetic data generation},
  author={An, Seunghwan and Jeon, Jong-June},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
