Dummy data generator focusing on customisability and maintained relationships for mocking data pipelines

███╗   ███╗ ██████╗  ██████╗██╗  ██╗██████╗ ██╗██████╗ ███████╗
████╗ ████║██╔═══██╗██╔════╝██║ ██╔╝██╔══██╗██║██╔══██╗██╔════╝
██╔████╔██║██║   ██║██║     █████╔╝ ██████╔╝██║██████╔╝█████╗
██║╚██╔╝██║██║   ██║██║     ██╔═██╗ ██╔═══╝ ██║██╔═══╝ ██╔══╝
██║ ╚═╝ ██║╚██████╔╝╚██████╗██║  ██╗██║     ██║██║     ███████╗
╚═╝     ╚═╝ ╚═════╝  ╚═════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝     ╚══════╝


MockPipe

There are plenty of sample databases out there, and plenty of ways to generate dummy data (e.g. faker, which this project uses), but I couldn't find much for dynamically generating realistic data that mimics the scenarios one might actually see coming out of an operational system's CDC feed. This is an attempt to create a utility/library that can be used to set up such scenarios.

From a yaml config a set of sample tables can be defined, using dummy default values for any newly generated rows along with a set of actions that can be performed with a certain frequency.

The dummy values actually invoke the Faker library to generate somewhat realistic entries, along with support for other data types that may refer to existing values within the table or other tables so that relationships can be maintained.

Data is persisted to a DuckDB database, so outputs are kept between executions and remain available for any other analysis or queries you may want to run.

Features

  • Dynamic Data Generation: Generate sample tables from a YAML configuration, using dummy default values for newly generated rows.

  • Faker Integration: Leverage the Faker library to create realistic entries.

  • Relationship Maintenance: Support for data types that refer to existing values within the same table or other tables, ensuring relationships are preserved.

  • Action Frequency: Define a set of actions to be performed with a certain frequency.

  • Persistence: Data is persisted in a DuckDB database, allowing outputs to be saved between executions and enabling further analysis or queries.

Installation

To install Mockpipe, you can use pip:

pip install mockpipe

Basic Usage

import mockpipe

# Define your YAML configuration
yaml_config = """
tables:
  - name: users
    columns:
      - name: id
        type: integer
        primary_key: true
      - name: name
        type: string
        faker: name
      - name: email
        type: string
        faker: email
actions:
  - table: users
    action: insert
    frequency: 1.0
"""

# Initialize Mockpipe with the configuration
mp = mockpipe.Mockpipe(yaml_config)

Command line Usage

Usage: mockpipe [OPTIONS]

Options:
  --config_create     generate a sample config file
  --config PATH       path to yaml config file
  --steps INTEGER     Number of steps to execute initially
  --run-time INTEGER  Time to run the mockpipe process in seconds
  --version           Show the version and exit.
  --help              Show this message and exit.

Config Specification

Top Level Keys

| key | value type | allowed values | default value | sample | explanation |
|---|---|---|---|---|---|
| db_path | path | any | mockpipe.db | sample.db | path of the DuckDB database |
| delete_behaviour | string | [soft, hard] | soft | soft | whether deleted records are marked as deleted with 'D' (soft) or actually removed from the persisted db (hard) |
| inter_action_delay | float | 0.0 -> | 0.5 | 0.1 | delay between each action |
| output | table | | | | output format (see Output) |
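
To illustrate the difference between the two delete_behaviour settings, here is a minimal sketch that applies a soft and a hard delete to an in-memory list of rows. The `op` marker column and the `delete_row` helper are assumptions made for this example only; MockPipe's actual column names and internals may differ.

```python
from typing import Dict, List

Row = Dict[str, object]


def delete_row(rows: List[Row], row_id: int, behaviour: str = "soft") -> List[Row]:
    """Return a new row list with the row `row_id` deleted.

    `op` is a hypothetical marker column used only in this sketch.
    """
    if behaviour == "soft":
        # Soft delete: keep the row, but flag it as deleted with 'D'.
        return [{**r, "op": "D"} if r["id"] == row_id else dict(r) for r in rows]
    if behaviour == "hard":
        # Hard delete: drop the row entirely.
        return [dict(r) for r in rows if r["id"] != row_id]
    raise ValueError(f"unknown delete_behaviour: {behaviour!r}")


rows = [{"id": 1, "name": "a", "op": "I"}, {"id": 2, "name": "b", "op": "I"}]
soft = delete_row(rows, 1, "soft")  # two rows remain, the first flagged 'D'
hard = delete_row(rows, 1, "hard")  # only the row with id 2 remains
```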

Output

| key | value type | allowed values | default value | sample | explanation |
|---|---|---|---|---|---|
| format | string | [json, csv] | json | json | file format of the output |
| path | path | any | extract | extract | folder path for output |
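
The allowed values in the two tables above can be checked before a run. The following is a rough sketch of such a check against a config that has already been loaded into a Python dict; `validate_top_level` is a hypothetical helper, not part of MockPipe, and the nesting of `format` and `path` under `output` is assumed from the tables.

```python
from typing import Any, Dict, List


def validate_top_level(config: Dict[str, Any]) -> List[str]:
    """Collect human-readable problems found in the top-level keys."""
    problems = []
    # Missing keys fall back to the documented defaults.
    if config.get("delete_behaviour", "soft") not in ("soft", "hard"):
        problems.append("delete_behaviour must be 'soft' or 'hard'")
    delay = config.get("inter_action_delay", 0.5)
    if not isinstance(delay, (int, float)) or isinstance(delay, bool) or delay < 0:
        problems.append("inter_action_delay must be a number >= 0.0")
    output = config.get("output", {})
    if output.get("format", "json") not in ("json", "csv"):
        problems.append("output.format must be 'json' or 'csv'")
    return problems


config = {
    "db_path": "sample.db",
    "delete_behaviour": "soft",
    "inter_action_delay": 0.1,
    "output": {"format": "json", "path": "extract"},
}
problems = validate_top_level(config)  # empty list when the config is valid
```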

Tables

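| key | value type | allowed values | default value | sample | explanation |
|---|---|---|---|---|---|
| name | string | any | N/A | employees | table name used. Also used for output |
| fields | table | | | | list of fields in the table (see Fields) |
| actions | table | | | | list of actions within the table (see Actions) |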

Fields

| key | value type | allowed values | default value | sample | explanation | note |
|---|---|---|---|---|---|---|
| name | string | any | N/A | order_date | field name used. Also used for output | |
| type | string | [string, int, float, boolean] | N/A | string | data type of the field | |
| value | string | [increment, static(*), table_random(), fake.*] | N/A | fake.date_between | how the value is generated for new rows | See 'Field Values' |
| arguments | list | any | N/A | "-1y", "today" | arguments to pass to faker functions | See 'Field Values' |

Actions

| key | value type | allowed values | default value | sample | explanation | note |
|---|---|---|---|---|---|---|
| name | string | any | N/A | update_order_status | name of action | |
| field | string | any | N/A | order_status | field which gets updated | |
| action | string | [create, delete, set] | N/A | set | type of action to perform | |
| value | string | [increment, static(*), table_random(), fake.*] | N/A | fake.random_element | value to set field to | |
| arguments | list | any | N/A | ('pending', 'completed', 'shipped', 'delivered') | if using faker, arguments to pass | |
| frequency | float | 0->1 | N/A | 0.25 | relative frequency of action | |
| where_condition | string | <table>.<value> == <condition> | N/A | products.product_id == table_random(products, product_id, 0) | where condition to limit which rows in table to apply action to | See Where Condition |
| action_condition | string | EFFECT_ONLY | N/A | EFFECT_ONLY | used to specify if the action is only ever to be invoked by another action (i.e., an effect) | |
| effect | string | <table>.<action>(<target_col>=<source_col>, …) | N/A | product.product_count(order_id=order_id) | after the specified action is executed, another action can be invoked, passing values onwards to the next action | See Effects |
| effect_count | [int, string] | 0->max(int), inherit | N/A | inherit | if effect is set, how many times to invoke the next effect | See Effects |
| effect_count_random | string | <min>,<max> | N/A | 1,5 | if effect is set, how many times to invoke the next effect, given as a <min>,<max> range | See Effects |
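
One plausible reading of "relative frequency" is a weighted random choice between the actions of a table. The sketch below shows that reading with the standard library only; the action list, the `pick_action` helper, and the way EFFECT_ONLY actions are skipped are illustrative assumptions, not MockPipe's actual scheduler.

```python
import random
from collections import Counter

# Hypothetical actions for one table, shaped like the keys documented above.
actions = [
    {"name": "create", "action": "create", "frequency": 0.25},
    {"name": "update_order_status", "action": "set", "frequency": 0.5},
    {"name": "remove", "action": "delete", "frequency": 0.25},
    {"name": "product_count", "action": "set", "action_condition": "EFFECT_ONLY"},
]


def pick_action(actions, rng):
    """Pick one action, weighted by its relative frequency."""
    # EFFECT_ONLY actions are never picked directly, only invoked as effects.
    candidates = [a for a in actions if a.get("action_condition") != "EFFECT_ONLY"]
    weights = [a["frequency"] for a in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded so the run is repeatable
counts = Counter(pick_action(actions, rng)["name"] for _ in range(10_000))
```

With these weights, update_order_status should be picked about twice as often as either of the other two actions.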

Field Values

| type | explanation | syntax | examples |
|---|---|---|---|
| increment | Only works for integer fields. Increments the value by 1 for each new row generated. | increment | increment |
| static | Sets a static value on each new row generated. This can be any value you want, but it will be the same for every row generated. | static(<value>) | static(false), static(100), static('pending') |
| table_random | Selects a random value from the specified table for each new row generated. Only non-deleted rows are selected. It is important to set a default value in case the table is empty. | table_random(<table_name>, <column_name>, <default_value>) | table_random(products, product_id, 0) |
| fake.* | Generates a value using the faker library. The arguments key can be used to pass arguments to the faker function. | fake.<faker_function> | fake.company |
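
The behaviour described for increment, static and table_random can be sketched in a few lines of plain Python. This is only an illustration of the semantics: the factory functions, the starting value of 1, and the `deleted` flag on rows are assumptions for this example, and fake.* is left out because it simply delegates to Faker.

```python
import random
from itertools import count
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def make_increment(start: int = 1) -> Callable[[], int]:
    """increment: return start, start + 1, ... on successive calls."""
    counter = count(start)
    return lambda: next(counter)


def make_static(value: Any) -> Callable[[], Any]:
    """static(<value>): return the same value for every new row."""
    return lambda: value


def make_table_random(tables: Dict[str, List[Row]], table: str, column: str,
                      default: Any, rng: random.Random) -> Callable[[], Any]:
    """table_random(<table>, <column>, <default>): pick from non-deleted rows."""
    def pick() -> Any:
        live = [r[column] for r in tables.get(table, []) if not r.get("deleted")]
        # Fall back to the default while the table has no usable rows.
        return rng.choice(live) if live else default
    return pick


tables: Dict[str, List[Row]] = {"products": []}
next_id = make_increment()
status = make_static("pending")
product = make_table_random(tables, "products", "product_id", 0, random.Random(0))

empty_pick = product()  # the products table is empty, so the default is used
tables["products"].append({"product_id": 7, "deleted": False})
tables["products"].append({"product_id": 8, "deleted": True})
live_pick = product()  # only the non-deleted row is eligible
```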

Effects

An effect specifies that, after an action is executed, another action is invoked, with values passed onwards to it. This is useful for chaining actions together to create one-to-one or one-to-many relationships. You can also specify how many times the next action is invoked.

effect:

  • explanation: Which action to invoke after the current action is executed.

  • syntax: <table>.<action>(<target_col>=<source_col>, <target_col>=<source_col>, ...)

  • example: effect: product.product_count(order_id=order_id)
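
To make the effect syntax concrete, here is a small sketch that splits an effect expression into its table, action and column mapping, then uses the mapping to carry values from a source row to the next action. `parse_effect` and `Effect` are hypothetical names for this example; MockPipe's own parsing may work differently.

```python
import re
from typing import Dict, NamedTuple


class Effect(NamedTuple):
    table: str
    action: str
    mapping: Dict[str, str]  # target_col -> source_col


_EFFECT_RE = re.compile(r"^\s*(\w+)\.(\w+)\((.*)\)\s*$")


def parse_effect(expr: str) -> Effect:
    """Parse '<table>.<action>(<target_col>=<source_col>, ...)'."""
    match = _EFFECT_RE.match(expr)
    if match is None:
        raise ValueError(f"not a valid effect expression: {expr!r}")
    table, action, args = match.groups()
    mapping = {}
    # Split the argument list on commas and skip empty pieces.
    for pair in filter(None, (p.strip() for p in args.split(","))):
        target, sep, source = pair.partition("=")
        if not sep or not target.strip() or not source.strip():
            raise ValueError(f"expected <target_col>=<source_col>, got {pair!r}")
        mapping[target.strip()] = source.strip()
    return Effect(table, action, mapping)


effect = parse_effect("product.product_count(order_id=order_id)")

# Values handed to the next action, taken from the row the first action touched.
source_row = {"order_id": 17, "order_status": "pending"}
passed_on = {target: source_row[source] for target, source in effect.mapping.items()}
```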

effect_count:

  • explanation: If the effect is set, how many times to invoke the next effect. Note: cannot be used together with effect_count_random.

  • syntax: <int>

  • example: 1

effect_count_random:

  • explanation: If the effect is set, how many times to invoke the next effect, given as a <min>,<max> range. Note: cannot be used together with effect_count.

  • syntax: <min>,<max>

  • example: 1,5
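
The two count settings could be resolved along the following lines. Several details here are assumptions rather than documented behaviour: that the <min>,<max> bounds are inclusive, that `inherit` reuses the count passed down from the invoking action, and that an effect runs once when neither key is set. `resolve_effect_count` is a hypothetical helper.

```python
import random
from typing import Optional, Union


def resolve_effect_count(effect_count: Optional[Union[int, str]] = None,
                         effect_count_random: Optional[str] = None,
                         inherited: int = 1,
                         rng: Optional[random.Random] = None) -> int:
    """Work out how many times the next effect should be invoked."""
    if effect_count is not None and effect_count_random is not None:
        raise ValueError("effect_count and effect_count_random cannot be combined")
    if effect_count_random is not None:
        low, high = (int(part) for part in effect_count_random.split(","))
        # Assumption: both bounds are inclusive.
        return (rng or random).randint(low, high)
    if effect_count == "inherit":
        # Assumption: reuse the count handed down by the invoking action.
        return inherited
    if effect_count is None:
        # Assumption: run the effect once when no count is configured.
        return 1
    return int(effect_count)


fixed = resolve_effect_count(effect_count=3)
handed_down = resolve_effect_count(effect_count="inherit", inherited=4)
sampled = resolve_effect_count(effect_count_random="1,5", rng=random.Random(0))
```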

action_condition:

  • explanation: Used to specify if the action is only ever to be invoked by another action (i.e., an effect).

  • syntax: EFFECT_ONLY

  • example: EFFECT_ONLY

Where Condition

  • explanation: The where condition is used to limit which rows in the table an action is applied to. It can be set to a filter, e.g. where status == 'pending', or it can perform a lookup to another table to get the value to filter on.

  • syntax: <table>.<value> == / != / >= / <= / > / < <condition>

  • table_random condition syntax: table_random(<table_name>, <column_name>, <default_value>)

  • static syntax: static(<value>)

  • table_random example: products.product_id == table_random(orders, product_id, 0)

  • static example: products.product_id == static(1)
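
As an illustration of how a where condition narrows the rows an action applies to, the sketch below parses the static form and filters an in-memory list of rows. The helpers `parse_where`, `resolve_static` and `filter_rows` are hypothetical; the table_random form is omitted because it needs live table data.

```python
import ast
import operator
import re

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
# Two-character operators are listed first so '>=' is not read as '>'.
_WHERE_RE = re.compile(r"^\s*(\w+)\.(\w+)\s*(==|!=|>=|<=|>|<)\s*(.+?)\s*$")


def parse_where(expr):
    """Split '<table>.<value> <op> <condition>' into its four parts."""
    match = _WHERE_RE.match(expr)
    if match is None:
        raise ValueError(f"not a valid where condition: {expr!r}")
    table, column, op, condition = match.groups()
    return table, column, _OPS[op], condition


def resolve_static(condition):
    """Turn 'static(<value>)' into a Python value."""
    match = re.fullmatch(r"static\((.*)\)", condition.strip())
    if match is None:
        raise ValueError(f"not a static condition: {condition!r}")
    raw = match.group(1).strip()
    if raw.lower() in ("true", "false"):  # YAML-style booleans
        return raw.lower() == "true"
    return ast.literal_eval(raw)


def filter_rows(rows, column, op, value):
    """Keep only the rows whose column satisfies the comparison."""
    return [row for row in rows if op(row[column], value)]


table, column, op, condition = parse_where("products.product_id == static(1)")
products = [{"product_id": 1}, {"product_id": 2}]
matching = filter_rows(products, column, op, resolve_static(condition))
```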

Future Enhancements

  • improved yaml config validation

  • improved logging

  • increased test coverage

  • simplify action usage and allow for duckdb functions

  • support additional data output formats (e.g. xml, parquet)

  • better sql typing support

Contributing

Contributions are welcome. Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements


Source Distribution

mockpipe-0.0.3.tar.gz (33.1 kB)

Uploaded Source

Built Distribution

mockpipe-0.0.3-py3-none-any.whl (24.7 kB)

Uploaded Python 3

File details

Details for the file mockpipe-0.0.3.tar.gz.

File metadata

  • Download URL: mockpipe-0.0.3.tar.gz
  • Upload date:
  • Size: 33.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for mockpipe-0.0.3.tar.gz
Algorithm Hash digest
SHA256 ee1403aaf178bd8be97fae2d5093ee3d2fe1c20d9619c37728b2dfaafac5ea56
MD5 1390aa314a36316c9d5f192ae8a21778
BLAKE2b-256 ac9f2fddccca76fa68e15aa718f292e5172e2b7400f3bc7030bfa4b308fffe0c

Provenance

The following attestation bundles were made for mockpipe-0.0.3.tar.gz:

Publisher: build.yml on BenskiBoy/mockpipe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mockpipe-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: mockpipe-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 24.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for mockpipe-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 4111b33b22bf4db175d528a4d18f37d309552d6e32a4688abac0f2a17d824885
MD5 a7209ecb56d66471d3dd5389fa38165e
BLAKE2b-256 4d172274a6878c2b5ddc63766dc7547304f1718c6d60b70e60ad6e5c1e315fc7

Provenance

The following attestation bundles were made for mockpipe-0.0.3-py3-none-any.whl:

Publisher: build.yml on BenskiBoy/mockpipe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
