A tool for relational data generation
RelGen
RelGen is short for Relation Generation. The tool generates relational data for databases. The name is also a play on words: "Rel" sounds like "Real", reflecting the goal of generating relational data that closely resembles real data.
Overview
RelGen is a Python library for generating realistic relational data. It uses a range of deep generative models and algorithms to learn the data distribution from real data and generate high-quality synthetic data.
Figure: RelGen Overall Architecture
Features
Generate relational data using deep generative models. RelGen provides several deep generative models, including Generative Adversarial Networks (GANs), autoregressive (AR) models, and diffusion models.
Generate data for multiple relational tables. RelGen can generate data for multiple related tables in a database so that the distribution of each generated table is close to that of its original table, and the join of the generated tables is similar to the join of the original tables.
Evaluate the quality of generated relational data. RelGen evaluates the generated relational data in terms of fidelity, privacy, and diversity, and visualizes its quality using histograms and t-SNE plots.
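To make the multi-table goal above more concrete, here is a minimal sketch of one property a good multi-table generator should preserve: the distribution of how many child rows each parent row has (its fan-out). This is not RelGen's own evaluation code. The function name `fanout_distribution` and the toy keys are hypothetical and use only the Python standard library.

```python
from collections import Counter


def fanout_distribution(parent_keys, child_foreign_keys):
    """Return {number_of_children: fraction_of_parents} for one parent/child pair."""
    total = len(parent_keys)
    if total == 0:
        return {}
    children_per_parent = Counter(child_foreign_keys)
    # Parents that never appear as a foreign key have a fan-out of 0.
    fanouts = Counter(children_per_parent.get(key, 0) for key in parent_keys)
    return {count: n / total for count, n in sorted(fanouts.items())}


# Hypothetical toy keys, not taken from any RelGen dataset.
real_parents = [1, 2, 3, 4]
real_children_fk = [1, 1, 2, 4, 4, 4]
synth_parents = [10, 11, 12, 13]
synth_children_fk = [10, 11, 11, 13, 13, 13]

real_fanout = fanout_distribution(real_parents, real_children_fk)
synth_fanout = fanout_distribution(synth_parents, synth_children_fk)
print(real_fanout)
print(synth_fanout)
```

If the two distributions differ strongly, the joined synthetic table will have a different shape from the joined real table even when each individual table looks right on its own.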
Installation
RelGen requires Python version 3.7 or later.
RelGen requires torch version 1.7.0 or later. To use RelGen with a GPU, make sure your CUDA or cudatoolkit version is 9.2 or later, which requires NVIDIA driver version >= 396.26 (Linux) or >= 397.44 (Windows 10).
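The requirements above can be checked before installing. The following is a rough sketch, not part of RelGen: `version_tuple` and `meets_minimum` are hypothetical helpers written for this example, and the torch check only runs if torch is already installed.

```python
import re
import sys


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    # Drop any local suffix after '+', then keep the first three numbers.
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


python_ok = sys.version_info >= (3, 7)
print("Python >= 3.7:", python_ok)

try:
    import torch  # optional: only reported when torch is installed

    print("torch >= 1.7.0:", meets_minimum(torch.__version__, "1.7.0"))
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("torch is not installed")
```

The sketch does not check the NVIDIA driver version; that still has to be checked with the vendor's tools.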
Install from conda
Install from source
git clone https://github.com/ruc-datalab/RelGen.git && cd RelGen
pip install -r requirements.txt
Quick-Start
Load Dataset
Load a demo dataset to get started. This dataset is a single table of census data.
Load metadata for the census dataset.
from relgen.data.metadata import Metadata
metadata = Metadata()
metadata.load_from_json("datasets/census/metadata.json")
Load data for the census dataset.
import pandas as pd
data = {
"census": pd.read_csv("datasets/census/census.csv")
}
Encapsulate the census dataset and process it.
from relgen.data.dataset import Dataset
dataset = Dataset(metadata)
dataset.fit(data)
Generating Data
Train the synthesizer.
from relgen.synthesizer.arsynthesizer import MADESynthesizer
synthesizer = MADESynthesizer(dataset)
synthesizer.fit(data)
Generate relational data.
sampled_data = synthesizer.sample()
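For intuition about what an autoregressive synthesizer does, here is a toy sketch of the underlying idea: factor the joint distribution of a row as p(c1) * p(c2 | c1) * ... and sample one column at a time. It is not how MADESynthesizer is implemented. It estimates the conditionals by simple counting, whereas a neural model such as MADE parameterizes them with a network. The function names and toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_autoregressive(rows):
    """Count how often each value follows each prefix of earlier column values."""
    conditionals = defaultdict(Counter)
    for row in rows:
        for i, value in enumerate(row):
            conditionals[tuple(row[:i])][value] += 1
    return conditionals


def sample_row(conditionals, num_columns, rng):
    """Draw one row column by column, conditioning on the values drawn so far."""
    prefix = ()
    for _ in range(num_columns):
        counts = conditionals[prefix]
        values = list(counts)
        weights = [counts[v] for v in values]
        prefix += (rng.choices(values, weights=weights)[0],)
    return prefix


# Hypothetical toy table with two categorical columns.
rows = [
    ("young", "student"),
    ("young", "student"),
    ("young", "employee"),
    ("old", "retired"),
    ("old", "employee"),
]

model = fit_autoregressive(rows)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [sample_row(model, 2, rng) for _ in range(5)]
print(samples)
```

Because this version only counts, it can only reproduce value combinations that appear in the training rows. Learned conditionals are what allow a real synthesizer to generalize beyond them.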
Evaluating Data
Compare the real data with the generated data to evaluate the quality of the generated data.
from relgen.evaluator import Evaluator
evaluator = Evaluator(data["census"], sampled_data["census"])
Show histograms comparing the distributions of the real data and the generated data.
evaluator.eval_histogram(columns=["age", "sex", "relationship"])
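A histogram comparison can also be summarized as a single number. The sketch below computes the total variation distance between the category frequencies of a real column and a generated column. It only illustrates the idea and is not necessarily the metric RelGen's Evaluator uses. The function names and toy values are hypothetical.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each distinct value."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute frequency differences: 0 is identical, 1 is disjoint."""
    p = category_frequencies(real_values)
    q = category_frequencies(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy columns, not taken from the census dataset.
real_sex = ["Male", "Male", "Female", "Female"]
synthetic_sex = ["Male", "Male", "Male", "Female"]

tvd = total_variation_distance(real_sex, synthetic_sex)
print(tvd)  # 0.25
```

A value close to 0 means the two histograms nearly coincide for that column. It says nothing about correlations between columns, which is why plots such as t-SNE are useful in addition.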
Show a t-SNE plot comparing the distributions of the real data and the generated data.
evaluator.eval_tsne()
Download files
File details
Details for the file relgen-0.1.0.tar.gz.
File metadata
- Download URL: relgen-0.1.0.tar.gz
- Upload date:
- Size: 58.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.8.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16379ae8ee8b71b82d7de66c54a71a38c7a936ea15886e0e40658b7c57925b82 |
| MD5 | d346e70ee3dc87fc432cdd96be52fe1c |
| BLAKE2b-256 | 2cdd9fcd591b25a72eb3986397d1e70a8a90a2e5f3ebf986b7f502f1e92414fe |
File details
Details for the file relgen-0.1.0-py3-none-any.whl.
File metadata
- Download URL: relgen-0.1.0-py3-none-any.whl
- Upload date:
- Size: 71.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.8.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3077ac6a8836d1397ac348b10c94efc133436c15c283b5029bb0b218085f3529 |
| MD5 | 4ee4f4ec7b21e9fd744a77de6a661100 |
| BLAKE2b-256 | 40498843218b6d60f97a54d6c39c6523d6bd24e50b09eb81b85033cca8f634e1 |