ODSynth

ODSynth generates samples of synthetic data for you, based on the expected schema of your data. This project may be used for generating data for:

  • Seeding your ETL applications
  • Benchmarking of ETL applications
  • Producing data in various formats (JSON, delimited text, XML, etc.)

With the plugin system, developers can use their 'providers' locally in their own applications.

Core Idea

See the core idea for this project here.

How it works

  1. Specify a schema. See an example here. The providers specify the type of data to be generated (for example, first_name and last_name).
  2. Use the schema to generate data in memory or publish data to a medium.

Installation

A proper Python package for this application is not yet available, so users must clone the repo and install the package locally.

git clone https://github.com/kbaafi/data-synthesizer.git
cd data-synthesizer
# Optional
# python -m venv venv
pip install -e .

Basic Usage

Use 'synth' to generate json data

synth --schema-spec-file=../schema.yaml --format=json --num-samples=3

Use 'synth' to generate csv data

synth --schema-spec-file=../flat_schema.yaml --format=txt --num-samples=3 --formatter-arg delimiter=comma

The delimiter may be one of 'comma', 'tab' or 'pipe'.
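To make the key=value formatter arguments and the named delimiters concrete, here is a minimal sketch using only the standard library. It is not ODSynth's implementation: `DELIMITERS`, `parse_formatter_args` and `to_delimited` are hypothetical names, and the default of 'comma' is an assumption.

```python
import csv
import io

# Hypothetical mapping from the documented delimiter names to characters.
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def parse_formatter_args(args):
    """Turn ["key=value", ...] into a dict; None or an empty list gives {}."""
    parsed = {}
    for item in args or []:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value
    return parsed


def to_delimited(rows, formatter_args=None):
    """Render a list of flat dicts as delimited text with a header row."""
    options = parse_formatter_args(formatter_args)
    name = options.get("delimiter", "comma")  # assumed default
    if name not in DELIMITERS:
        raise ValueError(f"delimiter must be one of {sorted(DELIMITERS)}")
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),
        delimiter=DELIMITERS[name],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"first_name": "Ada", "age": 36}, {"first_name": "Alan", "age": 41}]
print(to_delimited(rows, ["delimiter=pipe"]))
```

Only flat (tabular) rows are handled here, which matches the note that delimited output expects a tabular schema.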

Use the API in your own code

from odsynth.schema import Schema


def generate_data():
    num_samples = 3
    batch_size = 5  # Batch size can be greater than num_samples
    output_format = "txt"  # Format can be json, xml, txt or pandas
    # Depending on the formatter, args may need to be provided. Default is None
    formatter_args = ["delimiter=comma"]
    # The CSV formatter expects a tabular schema. The XML, JSON, Pandas and
    # Base formatters can accept hierarchical data
    schema_spec_file = "./sample_schema/flat_schema.yaml"

    generator = Schema(schema_file=schema_spec_file).build_generator(
        num_examples=num_samples,
        batch_size=batch_size,
        format=output_format,
        formatter_args=formatter_args,
    )
    data = generator.get_data()

    # Prints generated data in csv format
    print(data)


if __name__ == "__main__":
    generate_data()

Use 'publish' to load synthetic data to local disc in XML format

Publish 100 samples of the schema specified in flat_schema.yaml, with 10 examples per batch.

publish --schema-spec-file=../flat_schema.yaml --format=xml --writer=local_disc --writer-arg output_dir=../odsynth_out --num-samples=100 --batch-size=10

For more on the data generator and the data publisher, see the help pages for synth and publish: synth --help or publish --help
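The relationship between --num-samples and --batch-size can be illustrated with a short sketch. How ODSynth sizes its final batch is not documented here, so the behaviour below (a smaller last batch when the total is not a multiple of the batch size) is an assumption, and `batch_sizes` is a hypothetical helper.

```python
def batch_sizes(num_samples, batch_size):
    """Yield the size of each batch needed to cover num_samples records."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    remaining = num_samples
    while remaining > 0:
        size = min(batch_size, remaining)  # the last batch may be smaller
        yield size
        remaining -= size


# 100 samples with 10 examples per batch gives ten full batches.
print(list(batch_sizes(100, 10)))
# A batch size larger than the sample count gives one short batch.
print(list(batch_sizes(3, 5)))
```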

Schemas and Providers

An example schema is shown below. This schema simulates the scenario of a parent responsible for up to 5 children. Providers are responsible for generating the primitive fields that make up the record. An example of a provider that generates a random integer can be found here.

fields:
  parent_firstname:
    provider: first_name
  parent_lastname:
    provider: last_name
  children:
    fields:
      firstname:
        provider: first_name
      lastname:
        provider: last_name
    max_count: 5
    is_array: true
  parent_age:
    provider: random_int
    provider_args:
      min: 25
      max: 55
  parent_ssn:
    provider: ssn

This schema is expected to generate a data point that looks like this:

{
    "parent_firstname": "Christopher",
    "parent_lastname": "Villegas",
    "children": [
        {"firstname": "Jason", "lastname": "Rogers"},
        {"firstname": "Andrea", "lastname": "Young"},
        {"firstname": "Michelle", "lastname": "Kaiser"}
    ],
    "parent_age": 43,
    "parent_ssn": "269-11-8507"
}
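As a rough illustration of how a schema like the one above could be turned into a record, the sketch below walks the schema recursively. It is not ODSynth's code: the `PROVIDERS` table uses standard-library stand-ins rather than Faker, the schema is written as a Python dict instead of being loaded from YAML, and choosing between 0 and max_count array items is an assumption.

```python
import random


def random_int(rng, min, max):
    # Parameter names mirror the provider_args keys in the schema.
    return rng.randint(min, max)


def ssn(rng):
    return f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


# Hypothetical provider registry: name -> callable taking a random.Random.
PROVIDERS = {
    "first_name": lambda rng: rng.choice(["Jason", "Andrea", "Michelle"]),
    "last_name": lambda rng: rng.choice(["Rogers", "Young", "Kaiser"]),
    "random_int": random_int,
    "ssn": ssn,
}

SCHEMA = {
    "fields": {
        "parent_firstname": {"provider": "first_name"},
        "parent_lastname": {"provider": "last_name"},
        "children": {
            "fields": {
                "firstname": {"provider": "first_name"},
                "lastname": {"provider": "last_name"},
            },
            "max_count": 5,
            "is_array": True,
        },
        "parent_age": {
            "provider": "random_int",
            "provider_args": {"min": 25, "max": 55},
        },
        "parent_ssn": {"provider": "ssn"},
    }
}


def generate(spec, rng):
    """Generate one value for a schema node (composite, array or primitive)."""
    if "fields" in spec:
        def one():
            return {name: generate(child, rng) for name, child in spec["fields"].items()}

        if spec.get("is_array"):
            count = rng.randint(0, spec.get("max_count", 1))  # assumed range
            return [one() for _ in range(count)]
        return one()
    provider = PROVIDERS[spec["provider"]]
    return provider(rng, **spec.get("provider_args", {}))


print(generate(SCHEMA, random.Random()))
```

Passing a seeded `random.Random` instead makes the output repeatable, which can be convenient when seeding tests.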

Currently ODSynth implements the following Providers from Faker

We hope to be able to develop more Providers in the future.

Formatters

Generated data can be formatted for use in memory or for storage on disk. The API example above lists json, xml, txt and pandas as accepted format values.
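To show what formatting a hierarchical record might involve, here is a small standard-library sketch that renders the same records as JSON and as XML. The function names, the root and record tag names, and the choice to repeat an element for each array item are all assumptions, not ODSynth's actual output layout.

```python
import json
import xml.etree.ElementTree as ET


def to_json(records):
    """Serialize a list of records to a JSON string."""
    return json.dumps(records, indent=2)


def _append(parent, name, value):
    """Add value under parent as one or more elements called name."""
    if isinstance(value, list):
        for item in value:  # one repeated element per array item
            _append(parent, name, item)
    elif isinstance(value, dict):
        node = ET.SubElement(parent, name)
        for key, child in value.items():
            _append(node, key, child)
    else:
        ET.SubElement(parent, name).text = str(value)


def to_xml(records, root_tag="records", record_tag="record"):
    """Serialize a list of (possibly nested) records to an XML string."""
    root = ET.Element(root_tag)
    for record in records:
        _append(root, record_tag, record)
    return ET.tostring(root, encoding="unicode")


record = {
    "parent_lastname": "Villegas",
    "children": [{"firstname": "Jason"}, {"firstname": "Andrea"}],
}
print(to_json([record]))
print(to_xml([record]))
```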

Writers

Writers work with the publishing system to write generated data to a specified destination medium, which could take any form, e.g. S3, Azure Blob Storage or a REST endpoint. Currently the local_disc writer has been implemented.
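One way to picture the writer role is as a small interface with a local-disk implementation, sketched below. The class names, the `write` signature and the batch file naming are hypothetical and are not taken from ODSynth's source.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Writer(ABC):
    """Hypothetical base class: receives one formatted batch at a time."""

    @abstractmethod
    def write(self, batch_id, payload):
        """Persist payload for the given batch and return where it went."""


class LocalDiskWriter(Writer):
    """Writes each batch to its own file under output_dir."""

    def __init__(self, output_dir, extension="txt"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.extension = extension

    def write(self, batch_id, payload):
        path = self.output_dir / f"batch_{batch_id:05d}.{self.extension}"
        path.write_text(payload, encoding="utf-8")
        return path


with tempfile.TemporaryDirectory() as tmp:
    writer = LocalDiskWriter(Path(tmp) / "odsynth_out", extension="xml")
    written = writer.write(0, "<records></records>")
    print(written.name, written.read_text(encoding="utf-8"))
```

A writer for another destination would implement the same `write` method against that destination's client library.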

Plugins and the ODSYNTH_HOME

Developers can plug their own providers, formatters and writers into the data synthesis system by placing them in the ODSYNTH_HOME directory, from which user-added components are loaded. The directory is specified by setting the environment variable ODSYNTH_HOME:

export ODSYNTH_HOME=./sample_home_folder

An ODSYNTH_HOME folder is expected to have the following subfolders for the various developer plugins:

  1. providers for user-added providers
  2. formatters for user-added formatters
  3. writers for user-added writers

The plugin system will load all providers, formatters and writers from the ODSYNTH_HOME folder.
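The folder layout above could be discovered with a loader along these lines. This is only a sketch of the general technique (importing every Python file found in each subfolder); `load_plugins`, the module naming scheme and the rule of skipping files that start with an underscore are assumptions rather than ODSynth's actual behaviour.

```python
import importlib.util
import os
from pathlib import Path

PLUGIN_KINDS = ("providers", "formatters", "writers")


def load_plugins(home=None):
    """Import every *.py file under <home>/<kind>/ and group modules by kind."""
    home = home or os.environ.get("ODSYNTH_HOME")
    loaded = {kind: {} for kind in PLUGIN_KINDS}
    if not home:
        return loaded
    for kind in PLUGIN_KINDS:
        folder = Path(home) / kind
        if not folder.is_dir():
            continue  # a missing subfolder simply contributes no plugins
        for path in sorted(folder.glob("*.py")):
            if path.name.startswith("_"):
                continue  # assumed convention: skip private files
            spec = importlib.util.spec_from_file_location(
                f"odsynth_plugin_{kind}_{path.stem}", path
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[kind][path.stem] = module
    return loaded


plugins = load_plugins()
print({kind: sorted(modules) for kind, modules in plugins.items()})
```

With ODSYNTH_HOME unset, the call returns empty groups, so the snippet is safe to run as-is.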

Development Roadmap

  • Build Data Formatter for Pandas
  • Build Data Formatter for XML
  • Build Data Writer for XML
  • Build Data Formatter for JSON
  • Build Data Writer for JSON
  • Build Formatter for Delimited Text
  • Add a logger (Under consideration)
  • Add some form of support for Py Faker's Locales
  • Improve DOM Validation (Ongoing)
  • Data Transformer for Spark
  • Add support for optional fields
  • Build Data Writers for:
    • S3
    • Kafka
    • (Possibly) to REST APIs
  • Implement a plugin system for users to add their own code (Providers, Writers and Transformers) on their own local system
  • (Possibly) Add examples for Dockerized deployment of Publishers
  • Add Code and User documentation
  • Add a CI/CD pipeline for deploying the Python package to PyPI
  • Improve Local Python Packaging
