Automated generative modeling and sampling

These details have been verified by PyPI

Maintainers

amontanez24 fealho francesh kveerama lajohn mit_dai_lab npatki pvkdeveloper rwedge-datacebo

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

An open source project from the Data to AI Lab at MIT.

SDV - Synthetic Data Vault

Automated generative modeling and sampling

Free software: MIT license
Documentation: https://HDI-Project.github.io/SDV

Summary

The goal of the Synthetic Data Vault (SDV) is to allow data scientists to navigate, model and sample relational databases. The main access point of the library is the SDV class, which wraps the functionality of the three core classes: the DataNavigator, the Modeler and the Sampler.

Using these classes, users can get easy access to information about the relational database, create generative models for tables in the database and sample rows from these models to produce synthetic data.

Installation

Install with pip

The easiest way to install SDV is with pip:

pip install sdv

Install from sources

You can also clone the repository and install it from source:

git clone git@github.com:HDI-Project/SDV.git

After cloning the repository, it's recommended that you create a virtualenv. In this example, we will create it using virtualenvwrapper:

cd SDV
mkvirtualenv -p $(which python3.6) -a $(pwd) sdv

After creating the virtualenv and activating it, you can install the project by running the following command:

make install

For development, use the following command instead, which will install some additional dependencies for code linting and testing.

make install-develop

Usage Example

Below is a short example of how to use SDV to model and sample a dataset composed of relational tables.

NOTE: To run this example, make sure you have cloned the repository and run these commands inside it, as they rely on some of the demo data included in it.

Using the SDV class

The easiest way to use SDV in Python is using the SDV class imported from the root of the package:

>>> from sdv import SDV

>>> data_vault = SDV('tests/data/meta.json')
>>> data_vault.fit()
>>> samples = data_vault.sample_all()
>>> for dataset in samples:
...    print(samples[dataset].head(3), '\n')
   CUSTOMER_ID  CUST_POSTAL_CODE  PHONE_NUMBER1  CREDIT_LIMIT COUNTRY
0            0           61026.0   5.410825e+09        1017.0  FRANCE
1            1           20166.0   7.446005e+09        1316.0      US
2            2           11371.0   8.993345e+09        1839.0      US

   ORDER_ID  CUSTOMER_ID  ORDER_TOTAL
0         0            0       1251.0
1         1            0       1691.0
2         2            0       1126.0

   ORDER_ITEM_ID  ORDER_ID  PRODUCT_ID  UNIT_PRICE  QUANTITY
0              0         0         9.0        20.0       0.0
1              1         0         8.0        79.0       3.0
2              2         0         8.0        66.0       1.0

With this, we are able to generate synthetic samples of data. The only argument we pass to SDV is the path to a JSON file containing information about the different tables, their fields and their relationships. Further explanation of how to generate this file can be found in the docs.

After instantiating the class, we call the fit() method to transform and model the data. After that, we are ready to sample rows, tables or the whole database.

Using each class manually

The modelling and sampling process using SDV follows these steps:

1. We use a DataNavigator instance to extract relevant information from the dataset, as well as to transform its contents into numeric values.
2. The DataNavigator is then used to create a Modeler instance, which uses the information in the DataNavigator to create generative models of the tables.
3. The Modeler instance can be passed to a Sampler to sample rows of synthetic data.

Using the DataNavigator

The DataNavigator can be used to extract useful information about a dataset, such as the relationships between tables. Here we will use it to load the test data from the CSV files and apply some transformations to it.

First, we will create an instance of CSVDataLoader, which loads the data and prepares it for use with the DataNavigator. To create an instance of the CSVDataLoader class, the path to the meta.json file must be provided.

>>> from sdv import CSVDataLoader
>>> data_loader = CSVDataLoader('tests/data/meta.json')

The load_data() function can then be used to create an instance of a DataNavigator.

>>> data_navigator = data_loader.load_data()

The DataNavigator stores the data as a dictionary mapping the table names to a tuple of the data itself (represented as a pandas.DataFrame) and the meta information for that table. You can access the data using the following commands:

>>> customer_table = data_navigator.tables['DEMO_CUSTOMERS']
>>> customer_data = customer_table.data
>>> customer_data.head(3).T

                           0           1           2
CUSTOMER_ID               50           4    97338810
CUST_POSTAL_CODE       11371       63145        6096
PHONE_NUMBER1     6175553295  8605551835  7035552143
CREDIT_LIMIT            1000         500        1000
COUNTRY                   UK          US      CANADA

>>> customers_meta = customer_table.meta
>>> customers_meta.keys()
dict_keys(['fields', 'headers', 'name', 'path', 'primary_key', 'use'])
>>> customers_meta['fields']
  {'CUSTOMER_ID': {'name': 'CUSTOMER_ID',
  'subtype': 'integer',
  'type': 'number',
  'uniques': 0,
  'regex': '^[0-9]{10}$'},
 'CUST_POSTAL_CODE': {'name': 'CUST_POSTAL_CODE',
  'subtype': 'integer',
  'type': 'number',
  'uniques': 0},
 'PHONE_NUMBER1': {'name': 'PHONE_NUMBER1',
  'subtype': 'integer',
  'type': 'number',
  'uniques': 0},
 'CREDIT_LIMIT': {'name': 'CREDIT_LIMIT',
  'subtype': 'integer',
  'type': 'number',
  'uniques': 0},
 'COUNTRY': {'name': 'COUNTRY', 'type': 'categorical', 'uniques': 0}}
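The field metadata is a plain nested dictionary, so it can be inspected with ordinary Python. As a minimal sketch, the snippet below groups column names by their declared type. The `fields` dictionary is an abbreviated copy of the output shown above, and `columns_by_type` is a hypothetical helper written for this illustration, not part of SDV.

```python
# Abbreviated copy of the customers_meta['fields'] output shown above.
fields = {
    'CUSTOMER_ID': {'name': 'CUSTOMER_ID', 'type': 'number', 'subtype': 'integer'},
    'CREDIT_LIMIT': {'name': 'CREDIT_LIMIT', 'type': 'number', 'subtype': 'integer'},
    'COUNTRY': {'name': 'COUNTRY', 'type': 'categorical'},
}


def columns_by_type(fields):
    """Group column names by their declared 'type' (hypothetical helper)."""
    grouped = {}
    for name, spec in fields.items():
        grouped.setdefault(spec['type'], []).append(name)
    return grouped


grouped = columns_by_type(fields)
print(grouped)
# {'number': ['CUSTOMER_ID', 'CREDIT_LIMIT'], 'categorical': ['COUNTRY']}
```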

You can also use the data navigator to get parents or children of a table.

>>> data_navigator.get_parents('DEMO_ORDERS')
{'DEMO_CUSTOMERS'}

>>> data_navigator.get_children('DEMO_CUSTOMERS')
{'DEMO_ORDERS'}
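To illustrate what parent and child lookups amount to, here is a minimal sketch that derives both mappings from a list of foreign keys. The `foreign_keys` structure and the `build_relationships` helper are assumptions made for this example and do not reflect how the DataNavigator stores relationships internally. The link from DEMO_ORDER_ITEMS to DEMO_ORDERS is inferred from the ORDER_ID column in the sampled output.

```python
# Hypothetical description of foreign keys: child table -> [(column, parent table)].
foreign_keys = {
    'DEMO_ORDERS': [('CUSTOMER_ID', 'DEMO_CUSTOMERS')],
    'DEMO_ORDER_ITEMS': [('ORDER_ID', 'DEMO_ORDERS')],
}


def build_relationships(foreign_keys):
    """Return (parents, children) dictionaries mapping table name -> set of tables."""
    parents, children = {}, {}
    for child, references in foreign_keys.items():
        for _column, parent in references:
            parents.setdefault(child, set()).add(parent)
            children.setdefault(parent, set()).add(child)
    return parents, children


parents, children = build_relationships(foreign_keys)
print(parents.get('DEMO_ORDERS', set()))       # {'DEMO_CUSTOMERS'}
print(children.get('DEMO_CUSTOMERS', set()))   # {'DEMO_ORDERS'}
```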

Finally, we can use the transform_data() function to apply transformations from the RDT library to our data. If no transformations are provided, the function converts all categorical and datetime types to numeric values by default. It returns a dictionary mapping each table name to the transformed data, represented as a pandas.DataFrame.

>>> transformed_data = data_navigator.transform_data()
>>> transformed_data['DEMO_CUSTOMERS'].head(3).T
                             0             1             2
CUSTOMER_ID       5.000000e+01  4.000000e+00  9.733881e+07
CUST_POSTAL_CODE  1.137100e+04  6.314500e+04  6.096000e+03
PHONE_NUMBER1     6.175553e+09  8.605552e+09  7.035552e+09
CREDIT_LIMIT      1.000000e+03  5.000000e+02  1.000000e+03
COUNTRY           5.617796e-01  8.718027e-01  5.492714e-02
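In the output above, the COUNTRY column has become a number between 0 and 1. One common way to achieve this, sketched below, is to give every category a sub-interval of [0, 1] proportional to its frequency and to replace each value with a number drawn from its interval. This is an illustration under that assumption and is not RDT's actual implementation; the function names and the sample data are made up.

```python
import random
from collections import Counter


def fit_intervals(values):
    """Assign each category a sub-interval of [0, 1] sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals, start = {}, 0.0
    for category, count in counts.most_common():
        end = min(1.0, start + count / total)
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals, rng):
    """Replace each category with a random number inside its interval."""
    return [rng.uniform(*intervals[value]) for value in values]


def decode(numbers, intervals):
    """Map each number back to the category whose interval contains it."""
    categories = list(intervals)
    decoded = []
    for x in numbers:
        match = categories[-1]  # fallback for a value equal to the upper bound
        for category in categories:
            low, high = intervals[category]
            if low <= x < high:
                match = category
                break
        decoded.append(match)
    return decoded


# Made-up sample column, not the demo data.
countries = ['UK', 'US', 'CANADA', 'US', 'FRANCE', 'US', 'CANADA']
intervals = fit_intervals(countries)
encoded = encode(countries, intervals, random.Random(0))
print(decode(encoded, intervals) == countries)
```

Because the mapping is reversible, numeric values sampled from a model can be turned back into categories afterwards.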

Using the Modeler

The Modeler can be used to recursively model the data. This is important because the tables in the data have relationships between them, which should also be modeled in order to have reliable sampling. Let's look at the test data as an example. There are three tables in this dataset: DEMO_CUSTOMERS, DEMO_ORDERS and DEMO_ORDER_ITEMS.

The DEMO_ORDERS table has a field labelled CUSTOMER_ID, which references the CUSTOMER_ID field (the primary key) of the DEMO_CUSTOMERS table. SDV aims to model not only the data, but these relationships as well. The Modeler class is responsible for carrying out this task.

To do so, first import the Modeler and create an instance of the class. The Modeler must be given the DataNavigator and, optionally, the type of model to use. If no model type is provided, it uses a Gaussian copula from copulas.multivariate by default. Note that for the Modeler to work, the DataNavigator must have already transformed its data.

>>> from sdv import Modeler
>>> modeler = Modeler(data_navigator)

Then you can model the entire database. The modeler will store models for every table in the dataset.

>>> modeler.model_database()

The models that were created for each table can be accessed using the following command:

>>> customers_model = modeler.models['DEMO_CUSTOMERS']
>>> print(customers_model)
CUSTOMER_ID
==============
Distribution Type: Gaussian
Variable name: CUSTOMER_ID
Mean: 22198555.57142857
Standard deviation: 36178958.000449404

CUST_POSTAL_CODE
==============
Distribution Type: Gaussian
Variable name: CUST_POSTAL_CODE
Mean: 34062.71428571428
Standard deviation: 25473.85661931119

PHONE_NUMBER1
==============
Distribution Type: Gaussian
Variable name: PHONE_NUMBER1
Mean: 6464124184.428572
Standard deviation: 1272684276.6679976

...

The output above shows the parameters that were stored for every column in the customers table.
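As a rough illustration of what a per-column mean and standard deviation represent, the sketch below fits a Gaussian to a single column and draws new values from it, using only the standard library. The data is made up, and using the population standard deviation is a choice made for this example. Note that this covers only one column in isolation; a copula model additionally captures the dependencies between columns, which this sketch does not attempt.

```python
import random
import statistics

# Made-up values standing in for a numeric column such as CREDIT_LIMIT.
credit_limits = [1000, 500, 1000, 2400, 700, 1500, 900]

# Fit: a univariate Gaussian is fully described by these two parameters.
mean = statistics.mean(credit_limits)
std = statistics.pstdev(credit_limits)

# Sample: draw synthetic values from the fitted distribution.
rng = random.Random(42)
synthetic = [rng.gauss(mean, std) for _ in range(5)]

print('Mean:', mean)
print('Standard deviation:', std)
print('Synthetic values:', synthetic)
```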

The modeler can also be saved to a file using the save() method. This will save a pickle file at the specified path.

>>> modeler.save('demo_model.pkl')

If you have stored a model in a previous session using the command above, you can load the model using the load() method:

>>> modeler = Modeler.load('demo_model.pkl')
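Since the file is a pickle, saving and loading follow the usual round trip of Python's pickle module. The sketch below shows that round trip with a plain dictionary of parameters standing in for a fitted model; `save_model` and `load_model` are hypothetical helpers, not SDV functions. As with any pickle file, only load files that come from a source you trust.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize an object to a pickle file (hypothetical helper)."""
    with open(path, 'wb') as pickle_file:
        pickle.dump(model, pickle_file)


def load_model(path):
    """Load an object back from a pickle file (hypothetical helper)."""
    with open(path, 'rb') as pickle_file:
        return pickle.load(pickle_file)


# A plain dict standing in for fitted model parameters (made-up values).
toy_model = {'CREDIT_LIMIT': {'mean': 1142.86, 'std': 577.9}}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'demo_model.pkl')
    save_model(toy_model, path)
    restored = load_model(path)

print(restored == toy_model)  # True
```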

Using the Sampler

The Sampler takes in a Modeler and DataNavigator. Using the models created in the last step, the Sampler can recursively move through the tables in the dataset, and sample synthetic data. It can be used to sample rows from specified tables, sample an entire table at once or sample the whole database.

Let's do an example with our dataset. First import the Sampler and create an instance of the class.

>>> from sdv import Sampler
>>> sampler = Sampler(data_navigator, modeler)

To sample rows from a table, use the sample_rows() method. Note that before sampling from a child table, one of its parent tables must already have been sampled.

>>> sampler.sample_rows('DEMO_CUSTOMERS', 1).T
                            0
CUSTOMER_ID                 0
CUST_POSTAL_CODE        44462
PHONE_NUMBER1     7.45576e+09
CREDIT_LIMIT              976
COUNTRY                    US
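The requirement to sample a parent before its children follows from the foreign keys: every child row has to point at a parent row that already exists. Here is a minimal sketch of that ordering in plain Python. The functions, the distributions and their parameters are invented for this illustration and are unrelated to the Sampler's internals.

```python
import random

rng = random.Random(0)


def sample_customers(num_rows):
    """Sample parent rows; the primary key is simply a running index."""
    return [
        {'CUSTOMER_ID': index, 'CREDIT_LIMIT': round(rng.gauss(1000, 300))}
        for index in range(num_rows)
    ]


def sample_orders(customers, orders_per_customer):
    """Sample child rows, each one referencing an existing parent row."""
    if not customers:
        raise ValueError('Sample the parent table before its children.')
    orders = []
    for customer in customers:
        for _ in range(orders_per_customer):
            orders.append({
                'ORDER_ID': len(orders),
                'CUSTOMER_ID': customer['CUSTOMER_ID'],  # foreign key
                'ORDER_TOTAL': round(rng.gauss(1400, 200)),
            })
    return orders


customers = sample_customers(3)
orders = sample_orders(customers, orders_per_customer=2)
print(len(customers), len(orders))  # 3 6
```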

To sample a whole table, use sample_table(). This will create as many rows as there were in the original table.

>>> sampler.sample_table('DEMO_CUSTOMERS')
   CUSTOMER_ID  CUST_POSTAL_CODE  PHONE_NUMBER1  CREDIT_LIMIT COUNTRY
0            0           27937.0   8.095336e+09        1029.0  CANADA
1            1           18183.0   2.761015e+09         891.0  CANADA
2            2           16402.0   4.956798e+09        1313.0   SPAIN
3            3            7116.0   8.072395e+09        1124.0  FRANCE
4            4             368.0   4.330203e+09        1186.0  FRANCE
5            5           64304.0   6.256936e+09        1113.0      US
6            6           94698.0   8.271224e+09        1086.0  CANADA

Finally, the entire database can be sampled using sample_all(num_rows). The num_rows parameter specifies how many child rows to create per parent row. This function returns a dictionary mapping table names to the generated dataframes.

>>> samples = sampler.sample_all()
>>> for dataset in samples:
...     print(samples[dataset].head(3), '\n')
   CUSTOMER_ID  CUST_POSTAL_CODE  PHONE_NUMBER1  CREDIT_LIMIT COUNTRY
0            0           46038.0   7.779893e+09         673.0      UK
1            1           21063.0   6.511387e+09         808.0   SPAIN
2            2           24494.0   5.703648e+09         757.0   SPAIN

   ORDER_ID  CUSTOMER_ID  ORDER_TOTAL
0         0            0       1520.0
1         1            0       1217.0
2         2            0       1375.0

   ORDER_ITEM_ID  ORDER_ID  PRODUCT_ID  UNIT_PRICE  QUANTITY
0              0         0        17.0        94.0       3.0
1              1         0        14.0        44.0       3.0
2              2         0        20.0        78.0       3.0
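A quick sanity check on a sampled database is to confirm that every foreign key in a child table points at an existing parent row. The sketch below does this on plain lists of dictionaries that mirror the key columns of the output above. The `relations` list and the `orphaned_rows` helper are assumptions made for this example; with real output you would first convert each dataframe, for instance with its to_dict('records') method.

```python
# Key columns copied from the sampled output above (other columns omitted).
samples = {
    'DEMO_CUSTOMERS': [{'CUSTOMER_ID': 0}, {'CUSTOMER_ID': 1}, {'CUSTOMER_ID': 2}],
    'DEMO_ORDERS': [
        {'ORDER_ID': 0, 'CUSTOMER_ID': 0},
        {'ORDER_ID': 1, 'CUSTOMER_ID': 0},
        {'ORDER_ID': 2, 'CUSTOMER_ID': 0},
    ],
    'DEMO_ORDER_ITEMS': [
        {'ORDER_ITEM_ID': 0, 'ORDER_ID': 0},
        {'ORDER_ITEM_ID': 1, 'ORDER_ID': 0},
        {'ORDER_ITEM_ID': 2, 'ORDER_ID': 0},
    ],
}

# Hypothetical relation list: (child table, foreign key, parent table, primary key).
relations = [
    ('DEMO_ORDERS', 'CUSTOMER_ID', 'DEMO_CUSTOMERS', 'CUSTOMER_ID'),
    ('DEMO_ORDER_ITEMS', 'ORDER_ID', 'DEMO_ORDERS', 'ORDER_ID'),
]


def orphaned_rows(samples, relations):
    """Return child rows whose foreign key matches no parent row, per table."""
    orphans = {}
    for child, foreign_key, parent, primary_key in relations:
        parent_keys = {row[primary_key] for row in samples[parent]}
        missing = [row for row in samples[child] if row[foreign_key] not in parent_keys]
        if missing:
            orphans[child] = missing
    return orphans


print(orphaned_rows(samples, relations))  # {}
```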

History

0.1.1 - Anonymization of data

Add warnings when trying to model an unsupported dataset structure. GH#73
Add option to anonymize data. GH#51
Add support for modeling data with different distributions, when using GaussianMultivariate model. GH#68
Add support for VineCopulas as a model. GH#71
Improve GaussianMultivariate parameter sampling, avoiding warnings and invalid parameters. GH#58
Fix issue that sometimes caused sampled categorical values to be mixed with numerical values. GH#81
Improve the validation of extensions. GH#69
Update examples. GH#61
Replaced Table class with a NamedTuple. GH#92
Fix inconsistent dependencies and add upper bound to dependencies. GH#96
Fix error when merging extension in Modeler.CPA when running examples. GH#86

0.1.0 - First Release

First release on PyPI.

Release history

1.12.1

Apr 19, 2024

1.12.1.dev1 pre-release

Apr 19, 2024

1.12.1.dev0 pre-release

Apr 19, 2024

1.12.0

Apr 16, 2024

1.12.0.dev0 pre-release

Apr 12, 2024

1.11.0

Mar 21, 2024

1.11.0.dev0 pre-release

Mar 21, 2024

1.10.0

Feb 15, 2024

1.10.0.dev0 pre-release

Feb 15, 2024

1.9.0

Jan 11, 2024

1.9.0.dev0 pre-release

Jan 11, 2024

1.8.0

Dec 5, 2023

1.8.0.dev0 pre-release

Dec 4, 2023

1.7.0

Nov 16, 2023

1.7.0.dev0 pre-release

Nov 15, 2023

1.6.0

Nov 7, 2023

1.6.0.dev1 pre-release

Nov 7, 2023

1.6.0.dev0 pre-release

Nov 6, 2023

1.5.0

Oct 13, 2023

1.5.0.dev0 pre-release

Oct 11, 2023

1.4.0

Aug 23, 2023

1.4.0.dev1 pre-release

Aug 23, 2023

1.4.0.dev0 pre-release

Aug 22, 2023

1.3.0

Aug 14, 2023

1.3.0.dev1 pre-release

Aug 14, 2023

1.3.0.dev0 pre-release

Aug 13, 2023

1.2.2.dev1 pre-release

Aug 2, 2023

1.2.2.dev0 pre-release

Jul 21, 2023

1.2.1

Jul 13, 2023

1.2.1.dev0 pre-release

Jul 10, 2023

1.2.0

Jun 7, 2023

1.2.0.dev1 pre-release

Jun 7, 2023

1.2.0.dev0 pre-release

Jun 6, 2023

1.1.0

May 10, 2023

1.1.0.dev0 pre-release

May 10, 2023

1.0.1

Apr 20, 2023

1.0.1.dev0 pre-release

Apr 19, 2023

1.0.0

Mar 28, 2023

1.0.0rc0 pre-release

Mar 28, 2023

1.0.0b1 pre-release

Mar 20, 2023

1.0.0b0 pre-release

Feb 24, 2023

0.18.0

Jan 24, 2023

0.18.0.dev0 pre-release

Jan 23, 2023

0.17.2

Dec 8, 2022

0.17.2.dev0 pre-release

Dec 8, 2022

0.17.1

Sep 29, 2022

0.17.1.dev0 pre-release

Sep 29, 2022

0.17.0

Sep 9, 2022

0.17.0.dev2 pre-release

Sep 8, 2022

0.17.0.dev1 pre-release

Aug 19, 2022

0.17.0.dev0 pre-release

Aug 16, 2022

0.16.0

Jul 22, 2022

0.16.0.dev5 pre-release

Jul 22, 2022

0.16.0.dev4 pre-release

Jul 21, 2022

0.16.0.dev3 pre-release

Jul 19, 2022

0.16.0.dev2 pre-release

Jul 15, 2022

0.16.0.dev1 pre-release

Jul 8, 2022

0.16.0.dev0 pre-release

Jul 1, 2022

0.15.0

May 25, 2022

0.15.0.dev1 pre-release

May 25, 2022

0.15.0.dev0 pre-release

May 24, 2022

0.14.1

May 3, 2022

0.14.1.dev0 pre-release

May 3, 2022

0.14.0

Mar 21, 2022

0.14.0.dev2 pre-release

Mar 14, 2022

0.14.0.dev1 pre-release

Mar 9, 2022

0.14.0.dev0 pre-release

Mar 4, 2022

0.13.1

Dec 22, 2021

0.13.1.dev0 pre-release

Dec 22, 2021

0.13.0

Nov 22, 2021

0.13.0.dev0 pre-release

Nov 20, 2021

0.12.1

Oct 12, 2021

0.12.1.dev0 pre-release

Oct 12, 2021

0.12.0

Aug 19, 2021

0.12.0.dev1 pre-release

Aug 17, 2021

0.12.0.dev0 pre-release

Aug 13, 2021

0.11.0

Jul 12, 2021

0.11.0.dev0 pre-release

Jul 7, 2021

0.10.1

Jun 11, 2021

0.10.1.dev0 pre-release

Jun 10, 2021

0.10.0

May 21, 2021

0.10.0.dev0 pre-release

May 21, 2021

0.9.1

Apr 29, 2021

0.9.1.dev1 pre-release

Apr 29, 2021

0.9.1.dev0 pre-release

Apr 28, 2021

0.9.0

Apr 1, 2021

0.9.0.dev0 pre-release

Mar 31, 2021

0.8.0

Feb 24, 2021

0.8.0.dev0 pre-release

Feb 24, 2021

0.7.0

Jan 28, 2021

0.7.0.dev1 pre-release

Jan 27, 2021

0.7.0.dev0 pre-release

Jan 27, 2021

0.6.2.dev2 pre-release

Jan 27, 2021

0.6.2.dev1 pre-release

Jan 25, 2021

0.6.2.dev0 pre-release

Jan 20, 2021

0.6.1

Dec 31, 2020

0.6.0

Dec 22, 2020

0.6.0.dev0 pre-release

Dec 22, 2020

0.5.0

Nov 25, 2020

0.5.0.dev0 pre-release

Nov 25, 2020

0.4.6.dev2 pre-release

Nov 16, 2020

0.4.6.dev1 pre-release

Nov 9, 2020

0.4.6.dev0 pre-release

Nov 4, 2020

0.4.5

Oct 17, 2020

0.4.4

Oct 6, 2020

0.4.4.dev0 pre-release

Oct 6, 2020

0.4.3

Sep 28, 2020

0.4.2

Sep 19, 2020

0.4.1

Sep 7, 2020

0.4.1.dev0 pre-release

Sep 7, 2020

0.4.0

Aug 8, 2020

0.4.0.dev0 pre-release

Aug 8, 2020

0.3.6

Jul 23, 2020

0.3.6.dev0 pre-release

Jul 23, 2020

0.3.5

Jul 9, 2020

0.3.4

Jul 4, 2020

0.3.4.dev0 pre-release

Jul 4, 2020

0.3.3

Jun 26, 2020

0.3.2

Feb 3, 2020

0.3.1

Jan 22, 2020

0.3.0

Dec 23, 2019

0.2.2

Dec 10, 2019

0.2.1

Nov 25, 2019

0.2.0

Nov 11, 2019

0.2.0.dev0 pre-release

Nov 6, 2019

0.1.2

Sep 18, 2019

This version

0.1.1

Apr 2, 2019

0.1.0

Sep 27, 2018

0.0.0

Jun 28, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdv-0.1.1.tar.gz (46.8 kB)

Uploaded Apr 2, 2019 Source

Built Distribution

sdv-0.1.1-py2.py3-none-any.whl (21.2 kB)

Uploaded Apr 2, 2019 Python 2 Python 3

Hashes for sdv-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3ed63653ce7128b84e55d5f0f6971433ed2f497ee0e87a61cae703b221822eaa`
MD5	`d5563373c67541682cc15b8fe676ea19`
BLAKE2b-256	`77d0074efa2583b6987265321e7ea2d52c5a53fdfc7c6d94509861a0e2d40389`

Hashes for sdv-0.1.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`48491970c42003838bd79d4e9219d258102ec8e48076975ed8766d47a1460871`
MD5	`bc7b43309f37048018745bc91d645a5a`
BLAKE2b-256	`f9b818f7cff31bc9bf695c27f43449103c871d7e700227435a8c65d1e57afa56`