ODSynth

ODSynth generates samples of synthetic data based on the expected schema of your data. This project may be used for:

  • Seeding your ETL applications
  • Benchmarking ETL applications
  • Producing data in various formats (JSON, delimited text, XML, etc.)

With the plugin system, developers can use their 'providers' locally in their own applications.

Core Idea

See core idea for this project here

How it works

  1. Specify a schema. See an example here. The providers specify the type of data to be generated. (For example first_name, last_name etc.)
  2. Use the schema to generate data in memory or publish data to a medium

Installation

A proper Python package for this application is not yet available, so users must clone the repo and install the package locally.

git clone https://github.com/kbaafi/data-synthesizer.git
cd data-synthesizer
# Optional
# python -m venv venv
pip install -e .

Basic Usage

Use 'synth' to generate json data

synth --schema-spec-file=../schema.yaml --format=json --num-samples=3

Use 'synth' to generate csv data

synth --schema-spec-file=../flat_schema.yaml --format=txt --num-samples=3 --formatter-arg delimiter=comma

Delimiter may be one of 'comma', 'tab' or 'pipe'
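
As a rough illustration of what the delimiter option implies, the sketch below maps the three delimiter names to characters and renders flat records as delimited text using Python's standard csv module. The names DELIMITERS and to_delimited_text are hypothetical and chosen for this example only; this is not ODSynth's formatter API.

```python
import csv
import io

# Hypothetical mapping from the delimiter names above to characters.
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def to_delimited_text(records, delimiter="comma"):
    """Render a list of flat dicts as delimited text with a header row."""
    if delimiter not in DELIMITERS:
        raise ValueError(f"delimiter must be one of {sorted(DELIMITERS)}")
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        delimiter=DELIMITERS[delimiter],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {"firstname": "Jason", "lastname": "Rogers"},
    {"firstname": "Andrea", "lastname": "Young"},
]
print(to_delimited_text(rows, delimiter="pipe"))
```

The sketch assumes a tabular input, since a delimited row has no natural way to hold nested fields.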

Use the API in your own code

from odsynth.schema import Schema

def generate_data():
    num_samples=3
    batch_size=5                          # Batch size can be greater than num_samples
    format="txt"                          # Format can be json,xml,txt,pandas
    formatter_args=["delimiter=comma"]    # Depending on formatter, args may need to be provided. Default is None
    schema_spec_file="./sample_schema/flat_schema.yaml" # CSV formatter expects a tabular schema.
                                                        # XML, JSON, Pandas and Base Formatters can accept
                                                        # hierarchical data

    generator = Schema(schema_file=schema_spec_file).build_generator(
        num_examples=num_samples,
        batch_size=batch_size,
        format=format,
        formatter_args=formatter_args,
    )
    data = generator.get_data()

    # Prints generated data in csv format
    print(data)

Use 'publish' to load synthetic data to local disc in XML format

Publish 100 samples of schema specified in flat_schema.yaml, 10 examples per batch.

publish --schema-spec-file=../flat_schema.yaml --format=xml --writer=local_disc --writer-arg output_dir=../odsynth_out --num-samples=100 --batch-size=10
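
The relationship between --num-samples and --batch-size can be sketched as simple arithmetic: 100 samples at 10 per batch gives 10 batches. The helper batch_sizes below is a hypothetical name, and the handling of a final partial batch (truncating it to the remaining samples) is an assumption for illustration, not a statement about how ODSynth batches internally.

```python
def batch_sizes(num_samples, batch_size):
    """Yield the size of each batch needed to cover num_samples records."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    remaining = num_samples
    while remaining > 0:
        # Assumption: the last batch shrinks to whatever is left over.
        current = min(batch_size, remaining)
        yield current
        remaining -= current


sizes = list(batch_sizes(100, 10))
print(len(sizes), sum(sizes))
```

Under the same assumption, a batch size larger than the number of samples yields a single batch containing all samples.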

For more on the data generator and the data publisher, see the help pages for synth and publish by running synth --help or publish --help.

Schemas and Providers

An example schema is shown below. This schema simulates the scenario of a parent responsible for up to 5 children. Providers are responsible for generating the primitive fields that comprise the record. An example of a provider that generates a random integer can be found here

fields:
  parent_firstname:
    provider: first_name
  parent_lastname:
    provider: last_name
  children:
    fields:
      firstname:
        provider: first_name
      lastname:
        provider: last_name
    max_count: 5
    is_array: true
  parent_age:
    provider: random_int
    provider_args:
      min: 25
      max: 55
  parent_ssn:
    provider: ssn

This schema is expected to generate a data point that looks like this:

{
    "parent_firstname": "Christopher", "parent_lastname": "Villegas",
    "children": [
        {"firstname": "Jason", "lastname": "Rogers"},
        {"firstname": "Andrea", "lastname": "Young"},
        {"firstname": "Michelle", "lastname": "Kaiser"}
    ],
    "parent_age": 43,
    "parent_ssn": "269-11-8507"
}
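
To make the relationship between a schema and a generated record concrete, here is a minimal sketch that walks a dict-shaped version of the schema above and calls a provider for each primitive field. Everything in it is an assumption for illustration: the PROVIDERS table, the generate function and the stand-in provider functions are hypothetical, the stand-ins use Python's random module rather than Faker, and an array is assumed to hold between 0 and max_count items. It is not ODSynth's implementation.

```python
import random


# Hypothetical stand-ins for providers; each takes a random.Random instance.
def first_name(rng):
    return rng.choice(["Jason", "Andrea", "Michelle", "Christopher"])


def last_name(rng):
    return rng.choice(["Rogers", "Young", "Kaiser", "Villegas"])


def random_int(rng, min=0, max=9999):  # argument names mirror provider_args
    return rng.randint(min, max)


def ssn(rng):
    return f"{rng.randint(100, 899)}-{rng.randint(10, 99)}-{rng.randint(1000, 9999)}"


PROVIDERS = {"first_name": first_name, "last_name": last_name,
             "random_int": random_int, "ssn": ssn}


def generate(fields, rng):
    """Build one record by walking the schema recursively."""
    record = {}
    for name, spec in fields.items():
        if "fields" in spec:
            if spec.get("is_array"):
                # Assumption: an array holds 0..max_count nested records.
                count = rng.randint(0, spec.get("max_count", 1))
                record[name] = [generate(spec["fields"], rng) for _ in range(count)]
            else:
                record[name] = generate(spec["fields"], rng)
        else:
            provider = PROVIDERS[spec["provider"]]
            record[name] = provider(rng, **spec.get("provider_args", {}))
    return record


schema = {
    "parent_firstname": {"provider": "first_name"},
    "parent_lastname": {"provider": "last_name"},
    "children": {
        "fields": {
            "firstname": {"provider": "first_name"},
            "lastname": {"provider": "last_name"},
        },
        "max_count": 5,
        "is_array": True,
    },
    "parent_age": {"provider": "random_int", "provider_args": {"min": 25, "max": 55}},
    "parent_ssn": {"provider": "ssn"},
}

record = generate(schema, random.Random(7))
print(record)
```

Passing a seeded random.Random makes the sketch reproducible, which is convenient when seeding tests with the same data on every run.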

Currently ODSynth implements the following Providers from Faker

We hope to be able to develop more Providers in the future.

Formatters

Generated data can be formatted for use in memory or for storage on disc. The API example above lists json, xml, txt and pandas as accepted format values.

Writers

Writers work with the publishing system to write generated data to a destination medium, which could take any form, e.g. S3, Azure Blob Storage or a REST endpoint. Currently the local_disc writer has been implemented.
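
One way to picture the writer role is as a small interface that receives one formatted batch at a time. The sketch below is hypothetical throughout: the Writer base class, the LocalDiscWriter class, the write signature and the batch file naming are all assumptions made for this example and are not taken from ODSynth's code.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Writer(ABC):
    """Hypothetical writer interface: accepts one formatted batch at a time."""

    @abstractmethod
    def write(self, batch_id: int, payload: str) -> None:
        ...


class LocalDiscWriter(Writer):
    """Writes each batch to its own file under output_dir."""

    def __init__(self, output_dir: str, extension: str = "xml"):
        self.output_dir = Path(output_dir)
        self.extension = extension
        # Create the destination folder up front so write() can stay simple.
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def write(self, batch_id: int, payload: str) -> None:
        target = self.output_dir / f"batch_{batch_id:05d}.{self.extension}"
        target.write_text(payload, encoding="utf-8")


# Demo against a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    writer = LocalDiscWriter(tmp)
    writer.write(0, "<records></records>")
    written = sorted(p.name for p in Path(tmp).iterdir())

print(written)
```

Under this design, a writer for another medium such as S3 would only need to supply its own write method.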

Plugins and the ODSYNTH_HOME

Developers can plug their own providers, formatters and writers into the data synthesis system; user-added components are loaded from the ODSYNTH_HOME directory, which is specified by setting the environment variable ODSYNTH_HOME:

export ODSYNTH_HOME=./sample_home_folder

An ODSYNTH_HOME folder is expected to have the following subfolders for the various developer plugins:

  1. providers for user-added providers
  2. formatters for user-added formatters
  3. writers for user-added writers

The plugins system will load all providers, formatters and writers from the HOME folder.
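
The folder layout above suggests a discovery step along the following lines. This is a sketch under stated assumptions, not ODSynth's plugin loader: load_plugins and PLUGIN_KINDS are hypothetical names, plugins are assumed to be plain .py files directly inside each subfolder, and the demo provider module is invented for the example. Only the three subfolder names come from the description above.

```python
import importlib.util
import tempfile
from pathlib import Path

PLUGIN_KINDS = ("providers", "formatters", "writers")


def load_plugins(home):
    """Import every .py file found under the plugin subfolders of home."""
    loaded = {kind: {} for kind in PLUGIN_KINDS}
    for kind in PLUGIN_KINDS:
        folder = Path(home) / kind
        if not folder.is_dir():
            continue  # A missing subfolder simply contributes no plugins.
        for path in sorted(folder.glob("*.py")):
            spec = importlib.util.spec_from_file_location(f"{kind}_{path.stem}", path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[kind][path.stem] = module
    return loaded


# Demo: a throwaway home folder with one hypothetical provider module.
# In practice the path would come from the ODSYNTH_HOME environment variable.
with tempfile.TemporaryDirectory() as home:
    providers_dir = Path(home) / "providers"
    providers_dir.mkdir()
    (providers_dir / "constant.py").write_text("def generate():\n    return 42\n")
    plugins = load_plugins(home)
    value = plugins["providers"]["constant"].generate()

print(value)
```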

Development Roadmap

  • Build Data Formatter for Pandas
  • Build Data Formatter for XML
  • Build Data Writer for XML
  • Build Data Formatter for JSON
  • Build Data Writer for JSON
  • Build Formatter for Delimited Text
  • Add a logger (Under consideration)
  • Add some form of support for Py Faker's Locales
  • Improve DOM Validation (Ongoing)
  • Data Transformer for Spark
  • Add support for optional fields
  • Build Data Writers for:
    • S3
    • Kafka
    • (Possibly) to REST APIs
  • Implement a plugin system for users to add their own code (Providers, Writers and Transformers) on their own local system
  • (Possibly) Add examples for Dockerized deployment of Publishers
  • Add Code and User documentation
  • Add CICD Pipeline for deploying python package to PyPi
  • Improve Local Python Packaging
