Helps generate synthetic data that can be used to test ML models, pipelines, etc.

Project description

PySynic

Synthetic data generating framework for Python.

Introduction

Creating synthetic data is domain specific, but there are often common requirements. For example, you may want:

  • numbers or dates that are anywhere within a range even if you don't care exactly where
  • some values to be randomly null
  • a whole data frame to test your production code

PySynic facilitates that.
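The three requirements above can be sketched with nothing but the standard library. This is not PySynic's implementation; the helper names (int_in_range, date_in_range, maybe_null, make_rows) and the 50% null probability are assumptions made purely to illustrate the idea.

```python
import random
from datetime import date, timedelta

DISEASES = ["cancer", "heart attack", "stroke"]

def int_in_range(rng, low, high):
    """Return an integer somewhere in [low, high)."""
    return rng.randrange(low, high)

def date_in_range(rng, start, max_days):
    """Return a date from start up to (but excluding) start + max_days days."""
    return start + timedelta(days=rng.randrange(max_days))

def maybe_null(rng, value, null_probability=0.5):
    """Return None with the given probability, otherwise the value unchanged."""
    return None if rng.random() < null_probability else value

def make_rows(rng, n):
    """Build n rows of (patient_id, disease_code, admission_date)."""
    start = date(2021, 7, 1)
    return [
        (int_in_range(rng, 0, 100),
         maybe_null(rng, rng.choice(DISEASES)),
         date_in_range(rng, start, 31))
        for _ in range(n)
    ]

rows = make_rows(random.Random(), 1000)
```

A list of tuples like this can be handed to whatever builds your data frame, which is the role the PySynic functions play in the example further down.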

However, what we're really advocating is the philosophy of creating synthetic data to test your pipelines and machine learning models etc.

If you find the code here helpful, then great. It's available on PyPI so just add pysynic to your dependency list. It has no dependencies of its own so you won't have any transitive dependency issues.

Example

You can use this framework to generate data for your tests. For instance, if you want to test PySpark code with 1000 rows of bespoke data, you could write something similar to:

from pyspark.sql import SparkSession
from pysynic.synthetic_data import random_from, randomly_null, random_integer_in_range, random_date

def test_first_diagnosis(spark_session: SparkSession):
    # Build 1000 rows; the loop index i is passed as the seed so every run is repeatable.
    data = []
    for i in range(1000):
        data.append([random_integer_in_range(0, 100, i),
                     randomly_null(random_from(["cancer", "heart attack", "stroke"])),
                     random_date(i, 31, "1/Jul/2021")
                     ])
    df = spark_session.createDataFrame(data,
                                       ["patient_id", "disease_code", "admission_date"])
    # YOUR_PRODUCTION_METHOD is a placeholder for the code under test.
    results = YOUR_PRODUCTION_METHOD(df)
    assert results.count() > 0  # etc, etc

In this pytest snippet, we create a Spark DataFrame that contains synthetic data. We're not too interested in exactly what the data is, just that it is representative. We then pass it to our production code, which presumably does something interesting, and finally make some sensible assertions. These assertions will be domain specific, so we can't tell you what they are, but hopefully you can see that a few lines of Python are enough to produce large, semi-random test data sets.
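To give a flavour of what "sensible assertions" might look like, suppose the production code returns each patient's earliest admission date. The function first_diagnosis below is a hypothetical stand-in, written in plain Python so that the sketch runs without Spark; your own logic and invariants will differ.

```python
import random
from datetime import date, timedelta

def first_diagnosis(rows):
    """Hypothetical production logic: earliest admission date per patient."""
    earliest = {}
    for patient_id, _disease, admitted in rows:
        if patient_id not in earliest or admitted < earliest[patient_id]:
            earliest[patient_id] = admitted
    return earliest

# Semi-random input: we only care that the values are representative.
rng = random.Random(0)
rows = [
    (rng.randrange(100),
     rng.choice(["cancer", "heart attack", "stroke", None]),
     date(2021, 7, 1) + timedelta(days=rng.randrange(31)))
    for _ in range(1000)
]

result = first_diagnosis(rows)

# Invariants that hold whatever the random values turn out to be:
# every patient appears exactly once in the result...
assert set(result) == {patient_id for patient_id, _, _ in rows}
# ...and no admission for a patient is earlier than the reported one.
assert all(result[patient_id] <= admitted for patient_id, _, admitted in rows)
```

The assertions check properties of the result and never specific values, which is what lets the input be random.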

Note that in this example, the data is the same every time we run it. If you want it to be unpredictable, don't provide a seed to the PySynic methods (in the example above, pass None instead of i). Whether you want determinism or true randomness is up to you. There are arguments for both: seeded data makes failures reproducible, while unseeded data exercises different values on each run.
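The general idea of seeded versus unseeded generation can be illustrated with the standard library's random module. This is only a sketch of the concept, not a description of how PySynic handles seeds internally; the helper pick is hypothetical.

```python
import random

def pick(options, seed=None):
    """Hypothetical helper: choose one option; a seed makes the choice repeatable."""
    # random.Random(None) seeds itself from the operating system or the clock,
    # so an unseeded call may differ from run to run.
    rng = random.Random(seed)
    return rng.choice(options)

diseases = ["cancer", "heart attack", "stroke"]

# Seeding with the loop index gives the same sequence on every run.
seeded_run_1 = [pick(diseases, seed=i) for i in range(20)]
seeded_run_2 = [pick(diseases, seed=i) for i in range(20)]
assert seeded_run_1 == seeded_run_2

# Without a seed the values are still valid, but not reproducible.
unseeded_run = [pick(diseases) for _ in range(20)]
```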

If we were to run the same code in a PySpark shell, we could see that the output looks something like:

>>> df.show()
+----------+------------+-------------------+                                   
|patient_id|disease_code|     admission_date|
+----------+------------+-------------------+
|         0|        null|2021-07-01 00:00:00|
|         1|        null|2021-07-02 00:00:00|
|         2|        null|2021-07-03 00:00:00|
|         3|      cancer|2021-07-04 00:00:00|
|         4|      stroke|2021-07-05 00:00:00|
|         5|        null|2021-07-06 00:00:00|
|         6|        null|2021-07-07 00:00:00|
|         7|heart attack|2021-07-08 00:00:00|
|         8|        null|2021-07-09 00:00:00|
|         9|        null|2021-07-10 00:00:00|
|        10|        null|2021-07-11 00:00:00|
|        11|        null|2021-07-12 00:00:00|
|        12|      cancer|2021-07-13 00:00:00|
|        13|      stroke|2021-07-14 00:00:00|
|        14|        null|2021-07-15 00:00:00|
|        15|heart attack|2021-07-16 00:00:00|
|        16|heart attack|2021-07-17 00:00:00|
|        17|      stroke|2021-07-18 00:00:00|
|        18|      cancer|2021-07-19 00:00:00|
|        19|      stroke|2021-07-20 00:00:00|
+----------+------------+-------------------+
only showing top 20 rows

Download files

Source Distribution

pysynic-0.0.7.tar.gz (7.2 kB)

Built Distribution

pysynic-0.0.7-py3-none-any.whl (7.9 kB, Python 3)