A package for generating tabular datasets

Datamaker

Rapidly build tabular datasets in the following formats:

  • SQL Tables (Generate code for SQL CREATE and INSERT queries)
  • CSV
  • JSON

Supports: PostgreSQL, MySQL, MS SQL, SQLite

Getting Started

Datamaker uses only core Python 3 and has been extensively tested on Python 3.6+.

Prerequisites

Python 3.6+

Installing

pip install datamaker

Build a dataset

In this example we build a SQL table with 50 rows and two columns: id and name. The CREATE and INSERT queries will be written to the file example.sql.

from datamaker.SQLTableGenerator import SQLTableGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.sql"

gen = SQLTableGen(table_name, num_rows, column_defs, output_path)

gen.build_sql_doc()

Documentation

Input parameters

  • table_name - This is the name of the table to be created
  • num_rows - The number of rows to insert into the table
  • column_defs - A dictionary mapping each column name to a dictionary of column parameters (see below)
  • output_path - The path to the directory and file name where you want to export the SQL queries or dataset for the specified format

column_defs is a dictionary object of the form:

{column_name: {"col_type": type, ...}}

Note that col_type must always be specified for each column. The parameters available in column_defs are defined below.

  • col_type: int, float, varchar, bool
  • varchar_length: int (required for MySQL and MS SQL)
  • is_primary_key: bool - True if this field is the primary key
  • is_name: bool - True if values for this column should come from the names collection
  • is_email: bool - True if this column is an email. Note that you must have at least one column with is_name = True to use is_email.
  • is_date: bool - True if this column is a date. Default format is YYYY-mm-dd
  • is_timestamp: bool - True if this column is a timestamp. Default format is YYYY-mm-dd HH:MM:SS
  • is_country: bool - True if this column is a country code. Codes are ISO Alpha-2.
  • is_state: bool - True if this column is a US state abbreviation.
  • is_address: bool - True if this column is a street address.
  • bounds: tuple(lowerbnd, upperbnd) - Forces lowerbnd <= column values <= upperbnd
  • fixed_collection: tuple(elem1, elem2, ..., elemn) - Forces column values to be elements of this tuple
  • fixed_map: tuple(column_name, {key: value}) - Forces values for this column to be associated to the value of another column
  • is_nullable: bool - True if this column can have null values
  • proportion_null: float - The approximate proportion of null values expected in this column. Note that is_nullable must be True to use proportion_null
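To make the optional parameters concrete, here is an illustrative column_defs dictionary that combines several of them. The column names and values below are invented for this example and are not part of datamaker itself.

```python
# Illustrative column_defs combining several optional parameters.
# Column names and values here are made up for demonstration.
column_defs = {
    # Integer primary key: values start at 1 and increment by 1 per row.
    "id": {"col_type": "int", "is_primary_key": True},
    # Values drawn from the built-in names collection.
    "name": {"col_type": "varchar", "varchar_length": 50, "is_name": True},
    # bounds forces 18 <= value <= 65.
    "age": {"col_type": "int", "bounds": (18, 65)},
    # fixed_collection restricts values to elements of this tuple.
    "status": {"col_type": "varchar", "varchar_length": 10,
               "fixed_collection": ("active", "inactive", "banned")},
    # Roughly 20% of the values in this column will be null.
    "last_login": {"col_type": "varchar", "varchar_length": 26,
                   "is_timestamp": True, "is_nullable": True,
                   "proportion_null": 0.2},
}
```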

Note that using is_primary_key assumes the column you're defining is an integer column starting at 1 and incrementing by 1 for each row.

If you're planning on running queries on a MySQL or MS SQL Server machine then you need to specify the length of each varchar field using the varchar_length parameter.

CSV and JSON generators

Datamaker can also be used to generate CSV and JSON files.

To create a CSV file add the following import statement to your script.

from datamaker.CSVGenerator import CSVGen

And to create a JSON file add the following import to your script.

from datamaker.JSONGenerator import JSONGen

The input parameters for CSVGen and JSONGen are the same as those given above. To create the Example table from the earlier code as a CSV or JSON document, we only need to change the import statement, instantiate the CSVGen or JSONGen object with the same column_defs dictionary, and then call the appropriate build method for the given class: the CSVGen class uses build_csv_doc and the JSONGen class uses build_json_doc.

from datamaker.CSVGenerator import CSVGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.csv"

gen = CSVGen(table_name, num_rows, column_defs, output_path)

gen.build_csv_doc()

Here's the same Example table built as a JSON document.

from datamaker.JSONGenerator import JSONGen


table_name = "Example"
num_rows = 50
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "is_name": True}
}
output_path = "example.json"

gen = JSONGen(table_name, num_rows, column_defs, output_path)

gen.build_json_doc()

A more complex example

In this example we're generating SQL code to create a MySQL table which tracks orders placed for an imaginary online store. We'll create a table with the following columns:

  • id - the primary key for this table
  • name - the name associated to the user who placed this order
  • order_ts - the timestamp associated with this order
  • product_id - the id for the product ordered
  • price - the cost of the product being ordered. Since we want every row with a given product_id to have the same price, we use a fixed_map to implement this column

Note that in this example we're defining the output path as a valid directory instead of a path to a file. In this case datamaker will automatically generate the name of the file as:

table_name.current_date.sql

Where table_name is the name of the table and current_date is today's date.
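As a rough sketch, a filename of that shape could be assembled as follows. The use of an ISO-formatted date here is an assumption for illustration, not datamaker's documented behavior.

```python
import datetime

def auto_filename(table_name, extension="sql"):
    """Sketch of a table_name.current_date.sql style filename.
    The ISO date format is an assumption for illustration."""
    current_date = datetime.date.today().isoformat()
    return f"{table_name}.{current_date}.{extension}"

# e.g. auto_filename("Orders") -> "Orders.<today's date>.sql"
```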

from datamaker.SQLTableGenerator import SQLTableGen


table_name = "Orders"
num_rows = 500
column_defs = {
    "id": {"col_type": "int", "is_primary_key": True},
    "name": {"col_type": "varchar", "varchar_length": 50, "is_name": True},
    "order_ts": {"col_type": "varchar", "varchar_length": 26, "is_timestamp": True},
    "product_id": {"col_type": "int", "bounds": (1, 5)},
    "price": {"col_type": "float", "fixed_map": ("product_id", {1: 5.99, 2: 10.99, 3: 0.99, 4: 99.99, 5: 20})},
}
output_path = "."

gen = SQLTableGen(table_name, num_rows, column_defs, output_path)

gen.build_sql_doc()
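A fixed_map behaves like a lookup keyed on another column's generated value. Roughly, the interaction between bounds on product_id and the fixed_map on price can be pictured as below. This is an illustrative sketch, not datamaker's actual implementation.

```python
import random

# The fixed_map from the example: product_id -> price.
price_map = {1: 5.99, 2: 10.99, 3: 0.99, 4: 99.99, 5: 20}

def make_order_row(bounds=(1, 5)):
    """Sketch of how bounds and fixed_map interact for one row."""
    # bounds forces lowerbnd <= product_id <= upperbnd.
    product_id = random.randint(bounds[0], bounds[1])
    # fixed_map ties price to the generated product_id, so every
    # row with the same product_id gets the same price.
    return {"product_id": product_id, "price": price_map[product_id]}
```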

Future improvements:

  • Add support for sequential fields
  • Add support for column type inference
  • Add support for TSV generation
