ODSynth
ODSynth generates samples of synthetic data for you, based on the expected schema of your data. This project may be used for generating data for:
- Seeding your ETL applications
- Benchmarking of ETL applications
- Producing data in various formats (json, delimited text, xml, etc)
With the plugin system, developers can use their 'providers' locally in their own applications.
Core Idea
See core idea for this project here
How it works
- Specify a schema. See an example here. The providers specify the type of data to be generated (for example, first_name, last_name, etc.).
- Use the schema to generate data in memory or publish data to a medium.
Installation
A proper Python package for this application is not yet available, so users must clone the repo and install the package locally.
git clone https://github.com/kbaafi/data-synthesizer.git
cd data-synthesizer
# Optional
# python -m venv venv
pip install -e .
Basic Usage
Use 'synth' to generate json data
synth --schema-spec-file=../schema.yaml --format=json --num-samples=3
Use 'synth' to generate csv data
synth --schema-spec-file=../flat_schema.yaml --format=txt --num-samples=3 --formatter-arg delimiter=comma
Delimiter may be one of 'comma', 'tab' or 'pipe'
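To illustrate what the delimiter option controls, here is a minimal sketch that maps the three delimiter names to characters and renders flat records with Python's csv module. The names DELIMITERS and to_delimited_text are hypothetical and not part of ODSynth's API; the actual formatter may behave differently.

```python
import csv
import io

# Hypothetical mapping from the CLI's delimiter names to characters.
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}


def to_delimited_text(records, delimiter="comma"):
    """Render a list of flat dicts as delimited text with a header row."""
    if delimiter not in DELIMITERS:
        raise ValueError(f"delimiter must be one of {sorted(DELIMITERS)}")
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        delimiter=DELIMITERS[delimiter],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    {"first_name": "Jason", "last_name": "Rogers"},
    {"first_name": "Andrea", "last_name": "Young"},
]
print(to_delimited_text(rows, delimiter="pipe"))
```

Because each record must map onto a single row, a sketch like this only works for flat (tabular) schemas.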
Use the API in your own code
from odsynth.schema import Schema


def generate_data():
    num_samples = 3
    batch_size = 5  # Batch size can be greater than num_samples
    format = "txt"  # Format can be json, xml, txt, pandas
    formatter_args = ["delimiter=comma"]  # Depending on formatter, args may need to be provided. Default is None
    # CSV formatter expects a tabular schema.
    # XML, JSON, Pandas and Base Formatters can accept hierarchical data
    schema_spec_file = "./sample_schema/flat_schema.yaml"

    generator = Schema(schema_file=schema_spec_file).build_generator(
        num_examples=num_samples,
        batch_size=batch_size,
        format=format,
        formatter_args=formatter_args,
    )
    data = generator.get_data()

    # Prints generated data in csv format
    print(data)
Use 'publish' to load synthetic data to local disc in XML format
Publish 100 samples of the schema specified in flat_schema.yaml, 10 examples per batch.
publish --schema-spec-file=../flat_schema.yaml --format=xml --writer=local_disc --writer-arg output_dir=../odsynth_out --num-samples=100 --batch-size=10
For more on the data generator and the data publisher, see the help pages for synth and publish
publish --help
synth --help
Schemas and Providers
An example schema is shown below. This schema simulates the scenario of a parent responsible for up to 5 children. Providers are responsible for generating the primitive fields that comprise the record. An example of a provider that generates a random integer can be found here
fields:
  parent_firstname:
    provider: first_name
  parent_lastname:
    provider: last_name
  children:
    fields:
      firstname:
        provider: first_name
      lastname:
        provider: last_name
    max_count: 5
    is_array: true
  parent_age:
    provider: random_int
    provider_args:
      min: 25
      max: 55
  parent_ssn:
    provider: ssn
This schema is expected to generate a data point that looks like this:
{
  "parent_firstname": "Christopher",
  "parent_lastname": "Villegas",
  "children": [
    {"firstname": "Jason", "lastname": "Rogers"},
    {"firstname": "Andrea", "lastname": "Young"},
    {"firstname": "Michelle", "lastname": "Kaiser"}
  ],
  "parent_age": 43,
  "parent_ssn": "269-11-8507"
}
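To make the relationship between a schema and a generated record more concrete, here is a minimal sketch of how a hierarchical schema could be walked recursively. It is not ODSynth's implementation: the PROVIDERS registry, the generate function and the name lists are hypothetical stand-ins (the real providers come from Faker), and the assumption that an array holds between 0 and max_count items is ours.

```python
import random

# Hypothetical stand-ins for providers; ODSynth's real providers wrap Faker.
FIRST_NAMES = ["Jason", "Andrea", "Michelle", "Christopher"]
LAST_NAMES = ["Rogers", "Young", "Kaiser", "Villegas"]

PROVIDERS = {
    "first_name": lambda rng: rng.choice(FIRST_NAMES),
    "last_name": lambda rng: rng.choice(LAST_NAMES),
    "random_int": lambda rng, min=0, max=100: rng.randint(min, max),
}


def generate(node, rng):
    """Recursively build a value for one schema node."""
    if "fields" in node:
        def make_object():
            return {name: generate(child, rng) for name, child in node["fields"].items()}

        if node.get("is_array"):
            # Assumption: an array holds between 0 and max_count objects.
            count = rng.randint(0, node.get("max_count", 1))
            return [make_object() for _ in range(count)]
        return make_object()
    # Leaf node: look up the provider and pass any provider_args through.
    provider = PROVIDERS[node["provider"]]
    return provider(rng, **node.get("provider_args", {}))


# A trimmed-down version of the parent/children schema, as a Python dict.
schema = {
    "fields": {
        "parent_firstname": {"provider": "first_name"},
        "parent_lastname": {"provider": "last_name"},
        "children": {
            "fields": {
                "firstname": {"provider": "first_name"},
                "lastname": {"provider": "last_name"},
            },
            "max_count": 5,
            "is_array": True,
        },
        "parent_age": {"provider": "random_int", "provider_args": {"min": 25, "max": 55}},
    }
}

record = generate(schema, random.Random(0))
print(record)
```

Nodes that carry a fields key become nested objects (or arrays of objects when is_array is set), while nodes with a provider key become primitive values.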
Currently ODSynth implements the following Providers from Faker
We hope to be able to develop more Providers in the future.
Formatters
Generated data can be formatted into the following formats for use in memory or storage on disc:
Writers
Writers work with the publishing system to write generated data to a specified medium. Currently the local_disc writer has been implemented. Writers are primarily responsible for writing data to a destination medium which could take any form, e.g. S3, Azure Blob Storage, REST EndPoint, etc.
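As a rough illustration of the writer's role, the sketch below splits records into batches and hands each formatted batch to a writer that stores one file per batch. The class LocalDiscWriterSketch, the publish function and the file naming scheme are hypothetical; they are not ODSynth's actual writer interface.

```python
import json
import tempfile
from pathlib import Path


class LocalDiscWriterSketch:
    """Hypothetical writer: stores each batch as a numbered file in output_dir."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.batches_written = 0

    def write(self, batch_text, extension="json"):
        # One file per batch; the naming scheme is an assumption.
        path = self.output_dir / f"batch_{self.batches_written:05d}.{extension}"
        path.write_text(batch_text, encoding="utf-8")
        self.batches_written += 1
        return path


def publish(records, writer, batch_size):
    """Split records into batches and hand each formatted batch to the writer."""
    paths = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        paths.append(writer.write(json.dumps(batch)))
    return paths


records = [{"id": i} for i in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    written = publish(records, LocalDiscWriterSketch(tmp), batch_size=10)
    file_count = len(written)
    first_batch = json.loads(written[0].read_text(encoding="utf-8"))
print(file_count)
```

A writer for another medium (for example S3) would only need to provide its own write method; the batching loop would stay the same.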
Plugins and the ODSYNTH_HOME
Developers can plug their own providers, formatters and writers into the Data Synthesis system by loading user-added components from the ODSYNTH_HOME directory. This directory is specified by setting the environment variable ODSYNTH_HOME:
export ODSYNTH_HOME=./sample_home_folder
An ODSYNTH_HOME folder is expected to have the following subfolders for the various developer plugins:
- providers for user added providers
- formatters for user added formatters
- writers for user added writers
The plugins system will load all providers, formatters and writers from the HOME folder.
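One way such loading could work is sketched below: every .py file in the three subfolders is imported with importlib. This is only an assumption about the mechanism; load_plugins and the example constant.py provider are hypothetical and do not reflect how ODSynth registers plugins internally.

```python
import importlib.util
import os
import tempfile
from pathlib import Path

PLUGIN_KINDS = ("providers", "formatters", "writers")


def load_plugins(home):
    """Import every .py file found in the providers, formatters and writers subfolders."""
    loaded = {kind: {} for kind in PLUGIN_KINDS}
    for kind in PLUGIN_KINDS:
        folder = Path(home) / kind
        if not folder.is_dir():
            continue  # A missing subfolder simply contributes no plugins.
        for path in sorted(folder.glob("*.py")):
            spec = importlib.util.spec_from_file_location(f"{kind}_{path.stem}", path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[kind][path.stem] = module
    return loaded


# Demonstration with a throwaway home folder holding one hypothetical provider.
with tempfile.TemporaryDirectory() as home:
    (Path(home) / "providers").mkdir()
    (Path(home) / "providers" / "constant.py").write_text(
        "def provide():\n    return 42\n", encoding="utf-8"
    )
    os.environ["ODSYNTH_HOME"] = home
    plugins = load_plugins(os.environ["ODSYNTH_HOME"])
    value = plugins["providers"]["constant"].provide()
print(value)
```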
Development Roadmap
- Build Data Formatter for Pandas
- Build Data Formatter for XML
- Build Data Writer for XML
- Build Data Formatter for JSON
- Build Data Writer for JSON
- Build Formatter for Delimited Text
- Add a logger (Under consideration)
- Add some form of support for Py Faker's Locales
- Improve DOM Validation (Ongoing)
- Data Transformer for Spark
- Add support for optional fields
- Build Data Writers for:
  - S3
  - Kafka
  - (Possibly) REST APIs
- Implement a plugin system for users to add their own code (Providers, Writers and Transformers) in their own local system
- (Possibly) Add examples for Dockerized deployment of Publishers
- Add Code and User documentation
- Add CICD Pipeline for deploying python package to PyPi
- Improve Local Python Packaging