ODSynth

ODSynth generates samples of synthetic data for you, based on the expected schema of your data. This project may be used for generating data for:

  • Seeding your ETL applications
  • Benchmarking of ETL applications
  • Producing data in various formats (JSON, delimited text, XML, etc.)

With the plugin system, developers can use their 'providers' locally in their own applications.

Core Idea

See core idea for this project here

How it works

  1. Specify a schema. See an example here. The providers specify the type of data to be generated. (For example first_name, last_name etc.)
  2. Use the schema to generate data in memory or publish data to a medium

Installation

pip install odsynth

Basic Usage

Use 'synth' to generate json data

synth --schema-spec-file=./sample_schema/schema.yaml --format=json --num-samples=3

Use 'synth' to generate csv data

synth --schema-spec-file=./sample_schema/flat_schema.yaml --format=txt --num-samples=3 --formatter-arg delimiter=comma

Delimiter may be one of 'comma', 'tab' or 'pipe'
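To make the delimiter names concrete, here is a minimal sketch of how a name such as 'pipe' could be translated into a character and applied to flat records. It uses only the Python standard library. The DELIMITERS table and the to_delimited_text function are hypothetical names for illustration, not part of ODSynth's API.

```python
import csv
import io

# Hypothetical mapping from the delimiter names accepted on the command line
# to the characters they stand for (an assumption, not ODSynth's own table).
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def to_delimited_text(rows, delimiter_name="comma"):
    """Render a list of flat dicts as delimited text with a header row."""
    if delimiter_name not in DELIMITERS:
        raise ValueError(f"delimiter must be one of {sorted(DELIMITERS)}")
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0]),
        delimiter=DELIMITERS[delimiter_name],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"firstname": "Jason", "lastname": "Rogers"},
    {"firstname": "Andrea", "lastname": "Young"},
]
pipe_text = to_delimited_text(rows, "pipe")
print(pipe_text)
```

A lookup table keeps the command line free of characters such as tabs and pipes that shells tend to interpret.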

Use the API in your own code

from odsynth.schema import Schema


def generate_data():
    num_samples = 3
    batch_size = 5  # Batch size can be greater than num_samples
    output_format = "txt"  # Format can be json, xml, txt or pandas
    # Depending on the formatter, args may need to be provided. Default is None
    formatter_args = ["delimiter=comma"]
    # The CSV formatter expects a tabular schema. The XML, JSON, Pandas and
    # Base formatters can accept hierarchical data.
    schema_spec_file = "./sample_schema/flat_schema.yaml"

    generator = Schema(schema_file=schema_spec_file).build_generator(
        num_examples=num_samples,
        batch_size=batch_size,
        format=output_format,
        formatter_args=formatter_args,
    )
    data = generator.get_data()

    # Prints generated data in csv format
    print(data)


if __name__ == "__main__":
    generate_data()

Use 'publish' to load synthetic data to local disc in XML format

Publish 100 samples of schema specified in flat_schema.yaml, 10 examples per batch.

publish --schema-spec-file=./sample_schema/flat_schema.yaml --format=xml --writer=local_disc --writer-arg output_dir=../odsynth_out --num-samples=100 --batch-size=10

For more on the data generator and the data publisher, see the help pages: synth --help or publish --help
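The relationship between --num-samples and --batch-size can be sketched as below. This is an illustration only: the batch_sizes helper is hypothetical, and it assumes the final batch is simply shortened when the batch size does not divide the number of samples, which the source does not confirm.

```python
def batch_sizes(num_samples, batch_size):
    """Yield the size of each batch needed to cover num_samples records."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    remaining = num_samples
    while remaining > 0:
        current = min(batch_size, remaining)  # last batch may be shorter
        yield current
        remaining -= current


# 100 samples at 10 per batch, as in the publish command above
publish_batches = list(batch_sizes(100, 10))
print(len(publish_batches), "batches")

# A batch size larger than the number of samples gives one short batch
small_run = list(batch_sizes(3, 5))
print(small_run)
```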

Schemas and Providers

An example schema is shown below. This schema simulates the scenario of a parent responsible for up to 5 children. Providers are responsible for generating the primitive fields that comprise the record. An example of a provider that generates a random integer can be found here

fields:
  parent_firstname:
    provider: first_name
  parent_lastname:
    provider: last_name
  children:
    fields:
      firstname:
        provider: first_name
      lastname:
        provider: last_name
    max_count: 5
    is_array: true
  parent_age:
    provider: random_int
    provider_args:
      min: 25
      max: 55
  parent_ssn:
    provider: ssn

This schema is expected to generate a data point that looks like this:

{
    "parent_firstname": "Christopher", "parent_lastname": "Villegas",
    "children": [
        {"firstname": "Jason", "lastname": "Rogers"},
        {"firstname": "Andrea", "lastname": "Young"},
        {"firstname": "Michelle", "lastname": "Kaiser"}
    ],
    "parent_age": 43,
    "parent_ssn": "269-11-8507"
}
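The way a nested schema turns into a record can be illustrated with a short recursive walk. This is a sketch under stated assumptions, not ODSynth's implementation: the PROVIDERS table holds hypothetical stand-ins built on the standard random module (the real providers are based on Faker), the schema is written as a Python dict instead of being loaded from YAML, and an array field is assumed to contain between 0 and max_count items.

```python
import random


def first_name(rng):
    return rng.choice(["Jason", "Andrea", "Michelle"])


def last_name(rng):
    return rng.choice(["Rogers", "Young", "Kaiser"])


def random_int(rng, min=0, max=100):
    # Parameter names mirror the provider_args keys used in the schema.
    return rng.randint(min, max)


# Hypothetical stand-ins for providers, keyed by the names used in the schema.
PROVIDERS = {
    "first_name": first_name,
    "last_name": last_name,
    "random_int": random_int,
}


def generate(spec, rng):
    """Build one record from a schema node that has a 'fields' mapping."""
    record = {}
    for name, field in spec["fields"].items():
        if "fields" in field:
            if field.get("is_array"):
                # Assumption: anywhere from 0 to max_count nested records.
                count = rng.randint(0, field.get("max_count", 1))
                record[name] = [generate(field, rng) for _ in range(count)]
            else:
                record[name] = generate(field, rng)
        else:
            provider = PROVIDERS[field["provider"]]
            record[name] = provider(rng, **field.get("provider_args", {}))
    return record


schema = {
    "fields": {
        "parent_firstname": {"provider": "first_name"},
        "parent_lastname": {"provider": "last_name"},
        "children": {
            "fields": {
                "firstname": {"provider": "first_name"},
                "lastname": {"provider": "last_name"},
            },
            "max_count": 5,
            "is_array": True,
        },
        "parent_age": {
            "provider": "random_int",
            "provider_args": {"min": 25, "max": 55},
        },
    }
}

record = generate(schema, random.Random(7))
print(record)
```

Passing a seeded random.Random instance keeps the sketch reproducible, which is convenient when the generated data seeds a test.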

Currently ODSynth implements the following Providers based on Faker's Providers

We hope to be able to develop more Providers in the future.

Formatters

Generated data can be formatted into the following formats for use in memory or storage on disc:

Writers

Writers work with the publishing system to write generated data to a specified medium. Currently the local_disc writer has been implemented. Writers are primarily responsible for writing data to a destination medium which could take any form, e.g. S3, Azure Blob Storage, REST EndPoint, etc.
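As a rough illustration of the writer role, the sketch below stores each batch as one JSON file in a local directory. The LocalDirWriter class, its write method and the file naming are all hypothetical. They are not the interface of ODSynth's local_disc writer, which is not shown in this document.

```python
import json
import tempfile
from pathlib import Path


class LocalDirWriter:
    """Hypothetical writer that stores each batch as one JSON file."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.batches_written = 0

    def write(self, batch):
        """Write one batch (a list of records) and return the file path."""
        path = self.output_dir / f"batch_{self.batches_written:05d}.json"
        path.write_text(json.dumps(batch), encoding="utf-8")
        self.batches_written += 1
        return path


with tempfile.TemporaryDirectory() as tmp:
    writer = LocalDirWriter(Path(tmp) / "odsynth_out")
    first = writer.write([{"parent_age": 43}])
    second = writer.write([{"parent_age": 31}])
    written_names = sorted(p.name for p in writer.output_dir.iterdir())
    round_trip = json.loads(first.read_text(encoding="utf-8"))

print(written_names)
```

A writer for another medium would keep the same write method and swap the file operations for the relevant client calls.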

Plugins and the ODSYNTH_HOME

It is possible for developers to plug their own providers, formatters and writers into the data synthesis system by loading these user-added components from the ODSYNTH_HOME directory. The location is specified by setting the environment variable ODSYNTH_HOME:

export ODSYNTH_HOME=./sample_home_folder

An ODSYNTH_HOME folder is expected to have the following subfolders for the various developer plugins:

  1. providers for user added providers
  2. formatters for user added formatters
  3. writers for user added writers

The plugins system will load all providers, formatters and writers from the HOME folder.
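One way such loading could work is sketched below with importlib from the standard library. The load_plugins function and the module naming are hypothetical, and the sketch assumes each plugin is a single .py file placed directly in one of the three subfolders. How ODSynth actually discovers plugins is not described here.

```python
import importlib.util
import tempfile
from pathlib import Path

PLUGIN_KINDS = ("providers", "formatters", "writers")


def load_plugins(home):
    """Import every .py file found under the plugin subfolders of home."""
    loaded = {kind: {} for kind in PLUGIN_KINDS}
    for kind in PLUGIN_KINDS:
        folder = Path(home) / kind
        if not folder.is_dir():
            continue  # a missing subfolder just means no plugins of that kind
        for path in sorted(folder.glob("*.py")):
            spec = importlib.util.spec_from_file_location(
                f"{kind}_{path.stem}", path
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[kind][path.stem] = module
    return loaded


# Build a throwaway home folder with one provider plugin and load it.
with tempfile.TemporaryDirectory() as home:
    providers_dir = Path(home) / "providers"
    providers_dir.mkdir()
    (providers_dir / "constant.py").write_text(
        "def generate():\n    return 42\n", encoding="utf-8"
    )
    plugins = load_plugins(home)
    value = plugins["providers"]["constant"].generate()

print(sorted(plugins["providers"]), value)
```

In real use the home path would come from the ODSYNTH_HOME environment variable, for example via os.environ.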

Development Roadmap

  • Add a logger (Under consideration)
  • Add some form of support for Py Faker's Locales
  • Data Transformer for Spark
  • Add support for optional fields
  • Build Data Writers for:
    • S3
    • Kafka
    • (Possibly) to REST APIs
  • Implement a plugin system for users to add their own code (Providers, Writers and Transformers) in their own local system
  • (Possibly) Add examples for Dockerized deployment of Publishers
  • Add Code and User documentation
  • Add CI/CD pipeline for deploying the Python package to PyPI
  • Improve Local Python Packaging
