Skip to main content

Generate delimited sample data with a simple schema.

Project description

Datagen helps you create sample delimited data using a simple schema format. It runs on Python 2.6-3.4 and particularly well on PyPy.

Installation

pip install datagen

Or:

$ git clone https://github.com/toddwilson/datagen.git
$ cd datagen
$ python setup.py install

Usage

usage: datagen [-h] [-d DELIMITER] [–with-header] -n NUM_ROWS -s SCHEMA [output]

1. Create a schema file

$ cat > schema.txt <<EOL
    #name      type[argument]
    id         int[6]
    first      firstname
    last       lastname
    email      email
    dob        date[after=1945-01-01, before=2001-01-01]
    password   string[8]
    is_active  bool
    language   randomset[python,ruby,go,java,c,js,brainfuck]
EOL

2. Make data

$ datagen -s schema.txt -n 5 --with-header
id|first|last|email|dob|password|is_active|language
238476|Velma|Medrano|sxLYZTnPf@ACLoxOVjUu.edu|1948-01-12|KmAcXnnS|1|python
202490|Kathy|Wellman|pAXx@MQcPrkMdNMcZa.com|1960-11-12|BwtZnRUN|1|java
905703|Fern|Odell|iCQ@KtN.mil|1972-12-12|ipVagvEB|0|c
130211|Khadijah|Sheffield|KBPf@ibR.edu|1961-02-02|ijAVDWUY|0|java
643257|Patricia|Cummings|vaZqWhl@YcVvZXx.int|1960-05-01|GJdImZaw|0|ruby

3. Actually start working on what you should be working on

Types

bool: 1 or 0 randomly.

int[length]: Random unsigned integer.

Params:

  • length: max-length

Example:

number  numberint[3]

509
49
783

incrementing_int: Automatically incrementing unsigned integer.

Example:

id  incrementing_int

1
2
3

string[length]: Random case-insensitive string.

Params:

  • length: max-length

Example:

code  string[4]

FiwH
Acbj
EtGM

randomset[list]: Random member from a list

Params:

  • set: a comma-separated list of values

Example:

country  randomset[US,UK,MX,CA,NZ]

MX
US
CA

ipv4: IPv4 address

Example:

ip  ipv4

18.149.184.112
66.170.176.163
186.49.28.83

date: ISO 8601 date (YYYY-MM-DD)

Params:

  • before: ISO 8601 date top limit

  • after: ISO 8601 bottom limit

Example:

start_date  date[after=2013-01-01, before=2014-01-01]

2013-10-05
2013-01-10
2013-05-14

datetime: ISO 8601 datetime (YYYY-MM-DD)

Params:

  • before: ISO 8601 datetime top limit

  • after: ISO 8601 bottomtime limit

Example:

start_at  datetime[after=2013-01-01T00:00:00, before=2014-01-01T00:00:00]

2013-10-03T13:00:23
2013-05-12T00:00:06
2013-09-20T03:18:02

ssn: 9-digit Social Security Number

Example:

ssn  ssn

421-87-2421
889-27-3485
861-33-1570

firstname: Randomized first name (from top names in US Census data)

Example:

first  firstname

Todd
Jessika
Dustin

lastname: Randomized last name (from top names in US Census data)

Example:

last  lastname

Rivers
Akins
Reardon

zipcode: 5-digit zipcode

Example:

zip  zipcode

47245
59502
20191

state: US States (2 letter)

Example:

state  state

ID
KY
AK

email: Email address

Example:

email  email

QnqfpcP@PIbsLUKq.org
SNgOqbQ@YSpfbZQP.int
asRooN@qjxukNUhLr.com

Adding Your Own Types

It’s really easy to add your own types to use in a schema file. Just create a method that accepts a single argument and decorate it with datagen.types.reg_type.

Example:

<my_datagen.py>

from random import uniform
from datagen.types import reg_type
from datagen import main


@register_type("price")  # the decorator sets the name of the type
def price(arg):  # the method must accept one argument (even if not used)
    return round(uniform(0, 100), 2)


if __name__ == '__main__':
    main()

<schema.txt>

item_id   int[5]
price     price
$ python my_datagen.py -s schema.txt -n 3
41746|7.32
4077|40.55
12814|43.82

Adding Arguments to Your Types

<my_datagen.py>

from random import uniform
from datagen.types import register_type, type_arg
from datagen import main


@type_arg("price")  # Use the same name as the type defined in reg_type()
def price_argument(arg):  # This method is passed the contents of what's in price[]
    return int(arg)  # This will get passed to price() when iterating


@register_type("price")  # the decorator sets the name of the type
def price(max_price):  # the method must accept one argument (even if not used)
    return round(uniform(0, max_price), 2)


if __name__ == '__main__':
    main()

<schema.txt>

item_id   int[5]
price     price[10]
$ python my_datagen.py -s schema.txt -n 3
66995|5.08
5894|7.86
53659|9.26

Performance

If you need datagen to write faster, use PyPy:

$ time python my_datagen.py -s schema.txt -n 1000000 > test_data
python my_datagen.py -s schema.txt -n 1000000 > test_data  7.87s user 0.07s system 99% cpu 7.950 total

$ time pypy my_datagen.py -s schema.txt -n 1000000 > test_data
pypy my_datagen.py -s schema.txt -n 1000000 > test_data  2.79s user 0.06s system 99% cpu 2.863 total

Project details


Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page