A package for generating tabluar datasets

These details have not been verified by PyPI

Project links

Homepage

Project description

Datamaker

Rapidly build tabular datasets in the following formats:

SQL Tables (Generate code for SQL CREATE and INSERT queries)
CSV
JSON

Supports: PostgreSQL, MySQL, MS SQL, SQLite

Getting Started

Datamaker uses core python 3 and has been extensively tested on python 3.6+.

Prerequisites

Python 3.6+

Installing

pip install datamaker

Build a dataset

In this example we build a SQL table with 50 rows and two columns: id, name. The create and insert queries will be written to the file example.sql.

from datamaker.SQLTableGenerator import SQLTableGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.sql"

gen = SQLTableGen(table_name, num_rows, column_defs, output_path)

gen.build_sql_doc()

Documentation

Input parameters

table_name - This is the name of the table to be created
num_rows - The number of rows to insert into the table
column_defs - A dictionary object with key value pairs.
path - The path to the directory and file name where you want to export the sql queries or dataset for the specified format.

column_defs is a dictionary object of the form:

{column_name: {"col_type": type, ...}}

Note that col_type is always specified for each column. The definitions of the parameters for colum_defs is given below.

col_type: int, float, varchar, bool
varchar_length: int (required for MySQL and MS SQL)
is_primary_key: bool - True if this field is the primary key
is_name: bool - True if values for this column should come from the names collection
is_email: bool - True if this column is an email. Note that you must have at least one column with is_name = True to use is_email.
is_date: bool - True if this column is a date. Default format is YYYY-mm-dd
is_timestamp: bool - True if this column is a timestamp. Default format is YYYY-mm-dd HH:MM:SS
is_country: bool - True of this column is a country code. Codes are ISO Alpha-2.
is_state: bool - True if this column is a US state abbreviation.
is_address: bool - True if this column is a street address.
bounds: tuple(lowerbnd, upperbnd) - Forces lowerbnd <= column values <= upperbnd
fixed_collection: tuple(elem1, elem2, ..., elemn) - Forces column values to be elements of this tuple
fixed_map: tuple(column_name, {key: value}) - Forces values for this column to be associated to the value of another column
is_nullable: bool - True if this column can have null values
proportion_null: float - The approximate proportion of expected null values in this column. Note that is_nullable must be true to use proportion_null

Note that using is_primary_key assumes the column you're defining is an integer column starting at 1 and incrementing by 1 for each row.

If you're planning on running queries on a MySQL or MS SQL Server machine then you need to specify the length of each varchar field using the varchar_length parameter.

CSV and JSON generators

Datamaker can also be used to generate CSV and JSON files.

To create a CSV file add the following import statement to your script.

from datamaker.CSVGenerator import CSVGen

And to create a JSON file add the following import to your script.

from datamaker.JSONGenerator import JSONGen

The input parameters are no different for CSVGen and JSONGen are the same as given above. If we want to create the Example table given in the example code above as a CSV or JSON document we only to change the import statement and instantiate the CSVGen or JSONGen object using the same column_defs dictionary. Then call the appropriate build method for the given class. The CSVGen class uses build_csv_doc and the JSONGen class uses build_json_doc.

from datamaker.CSVGenerator import CSVGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.csv"

gen = CSVGen(table_name, num_rows, column_defs, output_path)

gen.build_csv_doc()

Here's the same Example table built as a JSON document.

from datamaker.JSONGenerator import JSONGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.json"

gen = JSONGen(table_name, num_rows, column_defs, output_path)

gen.build_json_doc()

A more complex example

In this example we're generating SQL code to create a MySQL table which tracks orders placed for an imaginary online store. We'll create a table with the following columns:

id - the primary key for this table
name - the name associated to the user who placed this order
order_ts - the timestamp associated with this order
product_id - the id for the product ordered
price - the cost of the product being ordered. Since we want every row with a given product_id to have the same cost we use a fixed map to implement this column

Note that in this example we're defining the output path as a valid directory instead of a path to a file. In this case datamaker will automatically generate the name of the file as:

table_name.current_date.sql

Where table_name is the name of the table and current_date is today's date.

from datamaker.SQLTableGenerator import SQLTableGen


table_name = "Orders"
num_rows = 500
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "varchar_length": 50, "is_name": True},
    "order_ts": {"col_type": "varchar", "varchar_length": 26, "is_timestamp": True},
    "product_id": {"col_type": "int", "bounds": (1, 5)},
    "price": {"col_type": "float", "fixed_map": ("product_id", {1: 5.99, 2: 10.99, 3: 0.99, 4: 99.99, 5: 20})},
}
output_path = "."

gen = SQLTableGen(table_name, num_rows, column_defs, output_path)

gen.build_sql_doc()

Future improvements:

Add support for sequential fields
Add support for column type inference
Add support for tsv generation

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

1.0.1

Oct 3, 2020

This version

0.1.4

Nov 12, 2019

0.1.3

Nov 5, 2019

0.1.2

Nov 4, 2019

0.1.1

Oct 26, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datamaker-0.1.4.tar.gz (16.4 kB view hashes)

Uploaded Nov 12, 2019 Source

Built Distribution

datamaker-0.1.4-py3-none-any.whl (23.9 kB view hashes)

Uploaded Nov 12, 2019 Python 3

Hashes for datamaker-0.1.4.tar.gz

Hashes for datamaker-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`054474c370e18361cd901b727fc769c2666d4d5345424766b58abb89117018c7`
MD5	`ddb028fac927f83c0cf0068d9a2ca31d`
BLAKE2b-256	`da8e3bdba817b341be0fdf24d36a921c12f0f30bddf66b5a4b8bbf5da6e1ea72`

Hashes for datamaker-0.1.4-py3-none-any.whl

Hashes for datamaker-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`023c932a4e098d9ef7150b19afcfe61bb8ab0d648fddb5bcf7659f7c1c90a871`
MD5	`bdea91d5ee79649991dca624d5b11571`
BLAKE2b-256	`7db16d851805492c64a02ab130fd5ffa8906df61caad3ab91dd5d05a785230ab`