Data generation tool.
Pure Synthetic Data Generation with a PostgreSQL and Python-based Tool
Description
A lightweight Python tool that works with PostgreSQL and its pg_stats view to generate fully synthetic data that looks as realistic as possible.
The tool reads PostgreSQL's pg_stats view (a readable view over the pg_statistic catalog) and uses the per-column statistics stored there: the most common values, their frequencies in the dataset, the average width of the column values, the number of distinct values, and so on. The algorithm combines these properties to generate fully synthetic data that contains no actual values or fragments of the "real" data, yet closely resembles it in "shape" and statistical properties.
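To make the idea concrete, here is a minimal sketch of how column statistics of this kind could drive value generation. It is not the tool's actual algorithm: the function names `synthesize_column` and `random_text` are hypothetical, only text columns are handled, and `avg_width` is treated as a target string length, which is a simplification. The query constant uses real pg_stats column names but is only shown for orientation and is not executed; its `%s` placeholders assume a psycopg2-style driver.

```python
import random
import string

# Statistics the description mentions, as exposed by the pg_stats view.
# Shown for orientation only; this sketch never connects to a database.
PG_STATS_QUERY = """
SELECT attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = %s AND tablename = %s
"""


def random_text(width, rng):
    """Return a random lowercase string of the given width."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(width))


def synthesize_column(n_rows, avg_width, n_distinct, common_freqs,
                      null_frac=0.0, seed=None):
    """Generate n_rows synthetic text values from column statistics.

    Only the frequencies of the most common values are used, never the
    values themselves (hypothetical helper, not the tool's API).
    """
    rng = random.Random(seed)
    width = max(1, int(avg_width))

    # In pg_stats a negative n_distinct is a fraction of the row count.
    distinct = round(-n_distinct * n_rows) if n_distinct < 0 else int(n_distinct)
    distinct = max(distinct, len(common_freqs), 1)
    distinct = min(distinct, 26 ** width)  # cannot exceed the value space

    # Build a pool of unique synthetic values in a reproducible order.
    pool = set()
    while len(pool) < distinct:
        pool.add(random_text(width, rng))
    pool = sorted(pool)
    rng.shuffle(pool)

    # The first values play the role of the "most common values";
    # the remaining probability mass is spread evenly over the rest.
    n_common = min(len(common_freqs), len(pool))
    common, rare = pool[:n_common], pool[n_common:]
    rest = max(0.0, 1.0 - null_frac - sum(common_freqs))
    rare_weight = rest / len(rare) if rare else 0.0

    population = common + rare
    weights = list(common_freqs[:n_common]) + [rare_weight] * len(rare)
    if null_frac > 0:
        population.append(None)
        weights.append(null_frac)
    return rng.choices(population, weights=weights, k=n_rows)
```

For example, `synthesize_column(1000, avg_width=6, n_distinct=20, common_freqs=[0.5, 0.2], seed=1)` returns 1000 six-letter strings drawn from at most 20 distinct values, with one value appearing in roughly half of the rows.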
Constraints
Types
The supported data types depend on the constraints on the data:
- To generate data that has no external constraints, all PostgreSQL types are supported
- Data with primary key constraints can only be of numeric, text or date type (and their subtypes and variations)
- Data with foreign key constraints can only be of numeric, text or date type (and their subtypes and variations)
Data Import
Prerequisite:
Make sure the database tennis_atp_2020 exists on your Postgres instance so that import.bat runs without errors:
psql -U postgres
postgres=# CREATE DATABASE tennis_atp_2020;
Importing the test dataset for the tool: The "Tennis_ATP" test dataset can be found inside resources/. Set it up by running the import.bat file (on Windows) or by importing the .csv files directly into Postgres.
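For the direct .csv import, one option is PostgreSQL's COPY command. The sketch below only builds the statement text for a given file. It assumes that each table is named after its .csv file, that the tables already exist (COPY does not create them), and that the files have a header row; none of this is confirmed by the project. The helper names and the file name in the usage note are hypothetical.

```python
from pathlib import PurePath


def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def copy_statement(csv_path, header=True):
    """Build a COPY ... FROM STDIN statement for one .csv file.

    Assumption: the target table has the same name as the file
    (without the .csv extension) and already exists.
    """
    table = PurePath(csv_path).stem
    options = "FORMAT csv, HEADER true" if header else "FORMAT csv"
    return f"COPY {quote_ident(table)} FROM STDIN WITH ({options})"
```

For a hypothetical file `resources/atp_players.csv` this yields `COPY "atp_players" FROM STDIN WITH (FORMAT csv, HEADER true)`. A statement like this can be run with a driver that streams the file to the server, for example psycopg2's `cursor.copy_expert(sql, file)`, or the equivalent `\copy` meta-command can be used in psql.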
Installation
- Clone the repository into your desired directory.
- Inside the cloned directory, install all the required libraries/dependencies:
pip install -r requirements.txt
- Make sure the PostgreSQL server is up and running, then run the tool from a terminal shell
Usage
Tool arguments:
- DBNAMEGEN - Name of the database to be created
- -show/--show - Shows database stats (default)
- -generate/--generate - Generates new synthetic data into database DBNAMEGEN
- -mf/--mf - Multiplication factor for the generated synthetic data (default: 1.0)
- -tables/--tables - Name(s) of table(s) to be filled, separated by ','; all other tables are ignored (default: fill all tables)
- -O/--owner - Owner of new database (default: same as user)
- -v/--version - Show version information, then quit
- -h/--help - Show tool help, then quit
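The -mf and -tables arguments can be read as a small planning step before generation. The sketch below assumes that the multiplication factor scales the number of rows generated per table; the documentation does not spell this out, so treat it as one plausible interpretation. The function name `plan_generation` and the table names in the usage note are hypothetical.

```python
def plan_generation(row_counts, mf=1.0, tables=None):
    """Return the number of rows to generate for each selected table.

    row_counts: mapping of table name to row count in the source database.
    mf:         multiplication factor (assumed here to scale row counts).
    tables:     table names to fill, or None to fill all tables.
    """
    if mf < 0:
        raise ValueError("multiplication factor must not be negative")
    selected = list(row_counts) if tables is None else list(tables)
    unknown = [t for t in selected if t not in row_counts]
    if unknown:
        raise ValueError(f"unknown table(s): {', '.join(unknown)}")
    # Tables that are not selected are simply left out of the plan.
    return {t: round(row_counts[t] * mf) for t in selected}
```

With source counts `{"players": 100, "matches": 250}`, calling `plan_generation(counts, mf=2)` plans 200 and 500 rows, while `plan_generation(counts, mf=0.5, tables=["players"])` plans 50 rows for `players` only.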
Connection options:
- DBNAMEIN - Name of the existing database to connect to
- -H/--hostname - Name of the PostgreSQL server (default: localhost)
- -P/--port - Port of the PostgreSQL server (default: 5432)
- -U/--user - PostgreSQL server username
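The options above can be mirrored with Python's argparse module. This is a sketch of one way to parse such a command line, not the tool's actual parser. In particular, the positional handling is an assumption inferred from the usage examples: the first positional is DBNAMEIN, the last one is the password, and an optional DBNAMEGEN may sit between them. The names `build_parser`, `parse_cli` and `split_tables` are hypothetical.

```python
import argparse


def split_tables(value):
    """Turn 'a,b' into ['a', 'b'], ignoring surrounding whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="pgsynthdata.py")
    parser.add_argument("dbnamein", metavar="DBNAMEIN",
                        help="name of the existing database to connect to")
    # Assumption: the remaining positionals are [DBNAMEGEN] PASSWORD.
    parser.add_argument("rest", nargs="+", metavar="ARG",
                        help="optional DBNAMEGEN followed by the password")
    parser.add_argument("-show", "--show", action="store_true",
                        help="show database stats (default)")
    parser.add_argument("-generate", "--generate", action="store_true",
                        help="generate synthetic data into DBNAMEGEN")
    parser.add_argument("-mf", "--mf", type=float, default=1.0,
                        help="multiplication factor (default: 1.0)")
    parser.add_argument("-tables", "--tables", type=split_tables, default=None,
                        help="comma-separated tables to fill (default: all)")
    parser.add_argument("-O", "--owner", help="owner of the new database")
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s (sketch)")
    parser.add_argument("-H", "--hostname", default="localhost")
    parser.add_argument("-P", "--port", type=int, default=5432)
    parser.add_argument("-U", "--user")
    return parser


def parse_cli(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    *maybe_gen, password = args.rest
    if len(maybe_gen) > 1:
        parser.error("expected DBNAMEIN [DBNAMEGEN] PASSWORD")
    args.dbnamegen = maybe_gen[0] if maybe_gen else None
    args.password = password
    # -show is the default mode unless -generate is given.
    args.mode = "generate" if args.generate else "show"
    return args
```

argparse adds -h/--help on its own, and it accepts single-dash long options such as -show alongside the double-dash forms, which matches the option spelling documented above.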
Some usage examples:
- python pgsynthdata.py test postgres -show
- Connects to database test, host=localhost, port=5432, default user with password postgres
- Shows statistics of the database test
- python pgsynthdata.py dbin dbgen pw1234 -H myHost -P 8070 -U testuser -generate
- Connects to database dbin, host=myHost, port=8070, user=testuser with password pw1234
- Creates new database dbgen and generates synthetic data into it
- python pgsynthdata.py dbin dbgencreate pw123 -U myUser -generate -tables myTable1, myTable2
- Connects to database dbin, host=localhost, port=5432, user=myUser with password pw123
- Creates new database dbgencreate with synthetic data in tables myTable1 and myTable2
- python pgsynthdata.py --help
- Shows the help information of the tool
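The connection settings described in these examples map naturally onto a libpq keyword/value connection string. The sketch below shows that mapping. Whether the tool builds its connection this way is an assumption, and `build_conninfo` is a hypothetical helper.

```python
def _quote(value):
    """Quote a conninfo value: escape backslashes and single quotes."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def build_conninfo(dbname, password, hostname="localhost", port=5432, user=None):
    """Build a libpq keyword/value connection string.

    When user is None the keyword is omitted, so libpq falls back to
    its own default user.
    """
    params = {
        "dbname": dbname,
        "host": hostname,
        "port": port,
        "user": user,
        "password": password,
    }
    return " ".join(f"{key}={_quote(value)}"
                    for key, value in params.items() if value is not None)
```

The first example corresponds to `build_conninfo("test", "postgres")`, which gives `dbname='test' host='localhost' port='5432' password='postgres'`. A string in this format can be passed to a libpq-based driver such as psycopg2 via `psycopg2.connect(dsn)`.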
Contributions
Contributions are welcome. Please check the issues list for ideas on future work.
Author
Documentation and adjustments by Etienne Baumgartner.
The tool was written by Kevin Ammann, based on preliminary work by Labian Gashi.
Feedback is welcome.
Hashes for pgsynthdata-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6f48ced1d71dd9e613f82de0b8d8ecb4a36c21088c90eb6fd8d0a93a9db61413
MD5 | e00572f31fb6d75d1380483bd708e9ac
BLAKE2b-256 | 00ea9d0175729ea70568a17a36e5c7bea479cc2f71ba49f8c54d01b340924a1c