Data generation tool.

Pure Synthetic Data Generation with a PostgreSQL and Python-based Tool

Description

A lightweight tool written in Python that works with PostgreSQL and its pg_stats view to generate fully synthetic data that look as realistic as possible.

The tool reads PostgreSQL's pg_stats view (a readable view over the pg_statistic catalog) and uses, for each column, the most common values, their frequencies, the average width of the values, the number of distinct values, and similar statistics. The algorithm combines these properties to generate fully synthetic data that contain no actual values or fragments of the "real" data, yet closely match its "shape" and statistical properties.
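As a rough illustration of this idea (a minimal sketch, not the tool's actual algorithm), the code below generates values for one text column from statistics shaped like a pg_stats row. The keys most_common_vals, most_common_freqs, n_distinct and avg_width are real pg_stats columns; the function names, the sample statistics, and the choice of random lowercase strings as stand-ins are assumptions made for this example.

```python
import random
import string


def random_token(rng, length):
    """Return a random lowercase string of the given length (at least 1)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(max(1, length)))


def synthesize_column(stats, row_count, seed=0):
    """Generate row_count synthetic text values from pg_stats-like statistics."""
    rng = random.Random(seed)

    # pg_stats reports n_distinct as a negative fraction of the row count
    # when the number of distinct values is expected to grow with the table.
    n_distinct = stats["n_distinct"]
    if n_distinct < 0:
        n_distinct = round(-n_distinct * row_count)

    mcv = stats["most_common_vals"]
    freqs = stats["most_common_freqs"]

    # Replace each real common value with a random stand-in of the same length,
    # so the frequency profile survives but the original value does not.
    stand_ins = [random_token(rng, len(v)) for v in mcv]

    # Pool of stand-ins for the remaining, less common values.
    # avg_width is in bytes; treating it as a character count is an approximation.
    rest_size = max(1, int(n_distinct) - len(mcv))
    rest_pool = [random_token(rng, stats["avg_width"]) for _ in range(rest_size)]
    rest_weight = max(0.0, 1.0 - sum(freqs))

    values = stand_ins + rest_pool
    weights = list(freqs) + [rest_weight / rest_size] * rest_size
    return rng.choices(values, weights=weights, k=row_count)


# Hypothetical statistics for a single column.
example_stats = {
    "most_common_vals": ["Hard", "Clay", "Grass"],
    "most_common_freqs": [0.55, 0.30, 0.10],
    "n_distinct": 5,
    "avg_width": 5,
}
rows = synthesize_column(example_stats, row_count=1000)
```

The result keeps the frequency profile and value lengths of the column while none of the original strings appear in the output. A real implementation would also need to handle numeric and date types, NULL fractions, and key constraints.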

Constraints

Types

There are some rules on the supported datatypes.

  1. For columns without any external constraints, all PostgreSQL types are supported
  2. Columns with primary key constraints can only be of numeric, text or date type (and their subtypes and variations)
  3. Columns with foreign key constraints can only be of numeric, text or date type (and their subtypes and variations)
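One way to picture these rules is as a check against the typcategory code that PostgreSQL stores for every type in the pg_type catalog (N for numeric, S for string, D for date/time). The sketch below assumes that "numeric, text or date type (and subtypes)" corresponds to exactly these three categories, which the tool may define differently; the function name and the sample columns are hypothetical.

```python
# typcategory codes from the pg_type catalog: N = numeric, S = string, D = date/time.
# Assumption: these three categories stand in for "numeric, text or date".
KEY_CATEGORIES = {"N", "S", "D"}


def is_supported(typcategory, is_key):
    """Return True if a column of this type category can be generated."""
    if not is_key:
        return True  # rule 1: columns without constraints accept every type
    return typcategory in KEY_CATEGORIES  # rules 2 and 3: key columns are restricted


# Hypothetical columns: (name, typcategory, part of a primary or foreign key?)
columns = [
    ("player_id", "N", True),
    ("name", "S", False),
    ("location", "G", True),  # geometric type used as a key
]
unsupported = [name for name, cat, is_key in columns if not is_supported(cat, is_key)]
```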

Data Import

Prerequisite:

Make sure the database tennis_atp_2020 exists on your Postgres instance so that import.bat runs without errors:

psql -U postgres
postgres=# CREATE DATABASE tennis_atp_2020; 

Importing the test dataset for the tool: The "Tennis_ATP" test dataset can be found inside resources/. On Windows, set it up with the import.bat file; on other platforms, import the .csv files directly into Postgres.
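For a direct import without import.bat, one possible approach is to issue a psql \copy command per .csv file. The sketch below only builds those commands. It assumes that each target table already exists, is named after its file, and that the files have a header row; none of this is confirmed by the repository, and the function name and the file name used in the example are hypothetical.

```python
from pathlib import Path


def copy_commands(csv_paths, database="tennis_atp_2020", user="postgres"):
    """Build one psql \\copy command (as an argument list) per .csv file."""
    commands = []
    for csv_path in sorted(Path(p) for p in csv_paths):
        table = csv_path.stem  # assumption: table name equals file name
        meta = (
            f"\\copy {table} FROM '{csv_path.as_posix()}' "
            "WITH (FORMAT csv, HEADER true)"
        )
        commands.append(["psql", "-U", user, "-d", database, "-c", meta])
    return commands


# Hypothetical file name, for illustration only.
for cmd in copy_commands(["resources/players.csv"]):
    print(" ".join(cmd))
```

To import every file in the directory, pass Path("resources").glob("*.csv") as csv_paths and hand each argument list to subprocess.run(cmd, check=True).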

Installation

  • Clone the repository into your desired directory.
  • Inside the cloned directory, install all the required libraries/dependencies:
pip install -r requirements.txt 

  • Run the tool from a terminal while the PostgreSQL server is up and running.

Usage

Tool arguments:

  • DBNAMEGEN - Name of the database to be created
  • -show/--show - Shows database stats (default)
  • -generate/--generate - Generates new synthesized data to database DBNAMEGEN
  • -mf/--mf - Multiplication factor for the generated synthetic data (default: 1.0)
  • -tables/--tables - Name(s) of table(s) to be filled, separated with ',', ignoring other tables (default: fill all tables)
  • -O/--owner - Owner of new database (default: same as user)
  • -v/--version - Show version information, then quit
  • -h/--help - Show tool help, then quit

Connection options:

  • DBNAMEIN - Name of the existing database to connect to
  • -H/--hostname - Name of the PostgreSQL server (default: localhost)
  • -P/--port - Port of the PostgreSQL server (default: 5432)
  • -U/--user - PostgreSQL server username
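To make the -mf and -tables options more concrete, the sketch below shows one plausible reading of them: -tables narrows the set of tables to fill, and -mf scales each table's source row count. Whether the tool computes row counts exactly this way is an assumption; the function name and the table names are hypothetical.

```python
def plan_generation(source_rows, mf=1.0, tables=None):
    """Map each selected table to the number of synthetic rows to create."""
    if tables:
        # Accept a comma-separated list and tolerate spaces around the names.
        wanted = [t.strip() for t in tables.split(",") if t.strip()]
    else:
        wanted = list(source_rows)  # default: fill all tables
    # Assumption: the multiplication factor scales the source row count.
    return {t: round(source_rows[t] * mf) for t in wanted}


# Hypothetical row counts of the source database.
source_rows = {"players": 200, "matches": 1500, "rankings": 4000}
plan = plan_generation(source_rows, mf=0.5, tables="players, matches")
```

With these inputs, plan maps players to 100 rows and matches to 750 rows, and rankings is skipped.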

Some usage examples:

  • python pgsynthdata.py test postgres -show
    • Connects to database test, host=localhost, port=5432, default user with password postgres
    • Shows statistics of the database test
  • python pgsynthdata.py dbin dbgen pw1234 -H myHost -P 8070 -U testuser -generate
    • Connects to database dbin, host=myHost, port=8070, user=testuser with password pw1234
    • Creates new database dbgen and generates synthetic data into it
  • python pgsynthdata.py dbin dbgencreate pw123 -U myUser -generate -tables myTable1,myTable2
    • Connects to database dbin, host=localhost, port=5432, user=myUser with password pw123
    • Creates new database dbgencreate with synthetic data on tables: myTable1 and myTable2
  • python pgsynthdata.py --help
    • Show the help information of the tool

Contributions

Contributions are welcome; please check the issues list for ideas.

Author

Documentation and adjustments by Etienne Baumgartner.

The tool was written by Kevin Ammann, based on preliminary work by Labian Gashi.

Feedback is welcome.
