ODSynth

ODSynth generates samples of synthetic data based on the expected schema of your data. It can be used for:

  • Seeding your ETL applications
  • Benchmarking of ETL applications
  • Producing data in various formats (json, delimited text, xml, etc)

With the plugin system, developers can use their own 'providers' locally in their own applications.

Core Idea

See core idea for this project here

How it works

  1. Specify a schema. See an example here. The providers specify the type of data to be generated. (For example first_name, last_name etc.)
  2. Use the schema to generate data in memory or publish data to a medium

Installation

A proper Python package for this application is not yet available, so users must clone the repo and install the package locally.

git clone https://github.com/kbaafi/data-synthesizer.git
cd data-synthesizer
# Optional
# python -m venv venv
pip install -e .

Basic Usage

Use 'synth' to generate JSON data

synth --schema-spec-file=../schema.yaml --format=json --num-samples=3

Use 'synth' to generate CSV data

synth --schema-spec-file=../flat_schema.yaml --format=txt --num-samples=3 --formatter-arg delimiter=comma

Delimiter may be one of 'comma', 'tab' or 'pipe'
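
Each delimiter name stands for a single character. As a rough illustration of what a delimited text formatter does with that choice, the sketch below maps the names to characters and renders flat records with Python's csv module. The DELIMITERS table and the to_delimited_text function are hypothetical names invented for this example, not part of ODSynth's API, and the header row is an assumption about the output.

```python
import csv
import io

# Hypothetical mapping from the delimiter names accepted on the command line
# to the characters they stand for (an assumption, not ODSynth's own table).
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def to_delimited_text(records, delimiter="comma"):
    """Render a list of flat dicts as delimited text with a header row."""
    if delimiter not in DELIMITERS:
        raise ValueError(f"delimiter must be one of {sorted(DELIMITERS)}")
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        delimiter=DELIMITERS[delimiter],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {"firstname": "Jason", "lastname": "Rogers"},
    {"firstname": "Andrea", "lastname": "Young"},
]
print(to_delimited_text(rows, delimiter="pipe"))
```

Nested fields have no natural place in a single row, which is why a flat (tabular) schema suits this format.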

Use the API in your own code

from odsynth.schema import Schema


def generate_data():
    num_samples = 3
    batch_size = 5  # Batch size can be greater than num_samples
    output_format = "txt"  # Format can be json, xml, txt or pandas
    # Depending on the formatter, args may need to be provided. Default is None.
    formatter_args = ["delimiter=comma"]
    # The CSV formatter expects a tabular schema. The XML, JSON, Pandas and
    # Base formatters can accept hierarchical data.
    schema_spec_file = "./sample_schema/flat_schema.yaml"

    generator = Schema(schema_file=schema_spec_file).build_generator(
        num_examples=num_samples,
        batch_size=batch_size,
        format=output_format,
        formatter_args=formatter_args,
    )
    data = generator.get_data()

    # Prints the generated data in CSV format
    print(data)


if __name__ == "__main__":
    generate_data()

Use 'publish' to load synthetic data to local disc in XML format

Publish 100 samples of schema specified in flat_schema.yaml, 10 examples per batch.

publish --schema-spec-file=../flat_schema.yaml --format=xml --writer=local_disc --writer-arg output_dir=../odsynth_out --num-samples=100 --batch-size=10
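
With 100 samples and a batch size of 10, the publisher hands 10 batches to the writer. The sketch below shows only that batching arithmetic. batch_sizes is a hypothetical helper, and the smaller final batch produced when the sample count is not a multiple of the batch size is an assumption, not documented ODSynth behavior.

```python
def batch_sizes(num_samples, batch_size):
    """Yield the size of each batch needed to cover num_samples records."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    remaining = num_samples
    while remaining > 0:
        # The last batch may be smaller than batch_size (assumption).
        current = min(batch_size, remaining)
        yield current
        remaining -= current


print(list(batch_sizes(100, 10)))  # ten batches of 10 records each
print(list(batch_sizes(3, 5)))     # batch size larger than the sample count
```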

For more on the data generator and the data publisher, see their help pages: synth --help or publish --help

Schemas and Providers

An example schema is shown below. It simulates a parent responsible for up to 5 children. Providers generate the primitive fields that make up each record. An example of a provider that generates a random integer can be found here.

fields:
  parent_firstname:
    provider: first_name
  parent_lastname:
    provider: last_name
  children:
    fields:
      firstname:
        provider: first_name
      lastname:
        provider: last_name
    max_count: 5
    is_array: true
  parent_age:
    provider: random_int
    provider_args:
      min: 25
      max: 55
  parent_ssn:
    provider: ssn

This schema is expected to generate a data point that looks like this:

{
    "parent_firstname": "Christopher", "parent_lastname": "Villegas",
    "children": [
        {"firstname": "Jason", "lastname": "Rogers"},
        {"firstname": "Andrea", "lastname": "Young"},
        {"firstname": "Michelle", "lastname": "Kaiser"}
    ],
    "parent_age": 43,
    "parent_ssn": "269-11-8507"
}
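
To make the relationship between a schema and a generated record concrete, here is a minimal sketch that walks a Python dict equivalent of the schema and fills each leaf from a provider. It is not ODSynth's implementation: the PROVIDERS table holds toy stand-ins for the Faker-backed providers, the generate function is hypothetical, and drawing between 0 and max_count array items is an assumption.

```python
import random

# Python dict equivalent of the YAML schema (written by hand for this sketch).
SCHEMA = {
    "fields": {
        "parent_firstname": {"provider": "first_name"},
        "parent_lastname": {"provider": "last_name"},
        "children": {
            "fields": {
                "firstname": {"provider": "first_name"},
                "lastname": {"provider": "last_name"},
            },
            "max_count": 5,
            "is_array": True,
        },
        "parent_age": {"provider": "random_int", "provider_args": {"min": 25, "max": 55}},
        "parent_ssn": {"provider": "ssn"},
    }
}


def first_name(rng):
    return rng.choice(["Jason", "Andrea", "Michelle", "Christopher"])


def last_name(rng):
    return rng.choice(["Rogers", "Young", "Kaiser", "Villegas"])


def random_int(rng, min=0, max=9999):
    # Parameter names mirror the provider_args keys used in the schema.
    return rng.randint(min, max)


def ssn(rng):
    return f"{rng.randint(100, 899)}-{rng.randint(10, 99)}-{rng.randint(1000, 9999)}"


# Toy stand-ins for the real providers (assumption: illustration only).
PROVIDERS = {"first_name": first_name, "last_name": last_name,
             "random_int": random_int, "ssn": ssn}


def generate(node, rng):
    """Recursively build one value for a schema node."""
    if "provider" in node:
        return PROVIDERS[node["provider"]](rng, **node.get("provider_args", {}))

    def make_record():
        return {name: generate(child, rng) for name, child in node["fields"].items()}

    if node.get("is_array"):
        # Assumption: an array holds between 0 and max_count records.
        return [make_record() for _ in range(rng.randint(0, node["max_count"]))]
    return make_record()


record = generate(SCHEMA, random.Random(7))
print(record)
```

A node with a provider key is a leaf, while a node with a fields key is a nested record. That distinction is what lets the same schema format describe both flat and hierarchical data.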

Currently ODSynth implements the following Providers from Faker

We hope to be able to develop more Providers in the future.

Formatters

Generated data can be formatted as json, xml, txt (delimited text) or pandas for use in memory or storage on disc.

Writers

Writers work with the publishing system to write generated data to a specified medium. A writer's job is to deliver data to a destination, which could take any form, e.g. S3, Azure Blob Storage or a REST endpoint. Currently the local_disc writer has been implemented.
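
As an illustration of the writer role, the sketch below stores each formatted batch as a numbered file in an output directory. The LocalDirWriter class, its write method and the file naming scheme are all hypothetical. The real local_disc writer and its interface may look quite different.

```python
import json
import tempfile
from pathlib import Path


class LocalDirWriter:
    """Hypothetical writer: stores each batch as a numbered file in output_dir."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.batches_written = 0

    def write(self, payload, extension="json"):
        """Write one already-formatted batch (a string) and return its path."""
        path = self.output_dir / f"batch_{self.batches_written:05d}.{extension}"
        path.write_text(payload, encoding="utf-8")
        self.batches_written += 1
        return path


# Demonstration in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    writer = LocalDirWriter(Path(tmp) / "odsynth_out")
    writer.write(json.dumps([{"firstname": "Jason"}]))
    writer.write(json.dumps([{"firstname": "Andrea"}]))
    written = sorted(p.name for p in writer.output_dir.iterdir())
    print(written)
```

Keeping the destination behind a single write call is what would let an S3 or REST writer replace the local one without changes to the generator.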

Plugins and the ODSYNTH_HOME

Developers can plug their own providers, formatters and writers into the Data Synthesis system. User-added components are loaded from the ODSYNTH_HOME directory, which is specified by setting the environment variable ODSYNTH_HOME:

export ODSYNTH_HOME=./sample_home_folder

An ODSYNTH_HOME folder is expected to have the following subfolders for the various developer plugins:

  1. providers for user-added providers
  2. formatters for user-added formatters
  3. writers for user-added writers

The plugins system will load all providers, formatters and writers from the HOME folder.
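
One common way to implement this kind of discovery is to import every Python file found in the three subfolders. The sketch below does that with importlib. It only illustrates the folder layout described above and is not ODSynth's loader: load_plugins, PLUGIN_KINDS and the constant.py sample provider are hypothetical.

```python
import importlib.util
import os
import tempfile
from pathlib import Path

# Subfolder names taken from the list above.
PLUGIN_KINDS = ("providers", "formatters", "writers")


def load_plugins(home):
    """Import every .py file found in the plugin subfolders of home."""
    loaded = {kind: {} for kind in PLUGIN_KINDS}
    for kind in PLUGIN_KINDS:
        folder = Path(home) / kind
        if not folder.is_dir():
            continue  # a missing subfolder simply contributes no plugins
        for path in sorted(folder.glob("*.py")):
            spec = importlib.util.spec_from_file_location(f"{kind}_{path.stem}", path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[kind][path.stem] = module
    return loaded


# Demonstration with a throwaway home folder holding one hypothetical provider.
with tempfile.TemporaryDirectory() as home:
    (Path(home) / "providers").mkdir()
    (Path(home) / "providers" / "constant.py").write_text(
        "def generate():\n    return 42\n", encoding="utf-8"
    )
    os.environ["ODSYNTH_HOME"] = home
    plugins = load_plugins(os.environ["ODSYNTH_HOME"])
    print({kind: sorted(modules) for kind, modules in plugins.items()})
```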

Development Roadmap

  • Build Data Formatter for Pandas
  • Build Data Formatter for XML
  • Build Data Writer for XML
  • Build Data Formatter for JSON
  • Build Data Writer for JSON
  • Build Formatter for Delimited Text
  • Add a logger (Under consideration)
  • Add some form of support for Py Faker's Locales
  • Improve DOM Validation (Ongoing)
  • Data Transformer for Spark
  • Add support for optional fields
  • Build Data Writers for:
    • S3
    • Kafka
    • (Possibly) to REST APIs
  • Implement a plugin system for users to add their own code (Providers, Writers and Transformers) in their own local system
  • (Possibly) Add examples for Dockerized deployment of Publishers
  • Add Code and User documentation
  • Add CI/CD Pipeline for deploying Python package to PyPI
  • Improve Local Python Packaging
