
Databricks Labs - PySpark Synthetic Data Generator


Databricks Labs Data Generator (dbldatagen)


Project Description

The dbldatagen Databricks Labs project is a Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses.

It operates by defining a data generation specification in code that controls how the synthetic data is generated. The specification may build on existing schemas or define data in an ad hoc fashion.
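As a rough sketch of the schema-driven style, the function below builds a specification from an existing Spark schema and then constrains one of its columns. The schema, the column names, the ranges, and the function name are hypothetical. `withSchema` and `withColumnSpec` are used here as the online documentation describes them, so check them against your installed version. The imports sit inside the function so that it can be defined before the libraries are available.

```python
def build_spec_from_schema(spark, rows=1000):
    """Sketch: derive a generation spec from an existing schema (all names are hypothetical)."""
    # Imported lazily so the function can be defined outside a Spark environment
    import dbldatagen as dg
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Stand-in for a schema that already exists, e.g. taken from a real table
    schema = StructType([
        StructField("site_id", IntegerType(), True),
        StructField("site_cd", StringType(), True),
    ])

    return (
        dg.DataGenerator(spark, name="schema_based_data", rows=rows, partitions=4)
        .withSchema(schema)
        # Override the default generation rule for one schema column
        .withColumnSpec("site_id", minValue=1, maxValue=20, step=1)
    )
```

In a notebook where `spark` is defined, `build_spec_from_schema(spark).build()` would then produce the data frame.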

It has no dependencies on any libraries that are not already included in the Databricks runtime, and you can use it from Scala, R or other languages by defining a view over the generated data.
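One way to define such a view is PySpark's standard `createOrReplaceTempView` method. The helper below is a minimal sketch; the function name and the default view name are hypothetical.

```python
def expose_as_view(df, view_name="synthetic_test_data"):
    """Register a Spark DataFrame as a temporary view so other languages can query it."""
    # createOrReplaceTempView creates a session-scoped view over the DataFrame
    df.createOrReplaceTempView(view_name)
    return view_name
```

After calling `expose_as_view(df)` on a generated data frame, a Scala cell in the same notebook could read the data with `spark.table("synthetic_test_data")`, and a SQL cell could select from the view by name.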

Feature Summary

It supports:

  • Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
  • Generating repeatable, predictable data supporting the needs for producing multiple tables, Change Data Capture, merge and join scenarios with consistency between primary and foreign keys
  • Generating synthetic data for all of the Spark SQL supported primitive types as a Spark data frame which may be persisted, saved to external storage or used in other computations
  • Generating ranges of dates, timestamps and numeric values
  • Generation of discrete values - both numeric and text
  • Generation of values at random and based on the values of other fields (either based on the hash of the underlying values or the values themselves)
  • Ability to specify a distribution for random data generation
  • Generating arrays of values for ML style feature arrays
  • Applying weights to the occurrence of values
  • Generating values to conform to a schema or independent of an existing schema
  • Use of SQL expressions in test data generation
  • Plugin mechanism to allow use of third-party libraries such as Faker
  • Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
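To illustrate why deriving values from the hash of another field helps with consistency between primary and foreign keys, here is a conceptual sketch in plain Python. It is not the library's implementation, and every name in it is hypothetical. It only shows that a value computed deterministically from a key comes out the same in every table that contains that key.

```python
import hashlib

def derived_value(key, choices):
    """Pick a value from `choices` deterministically from the hash of `key`."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and map it onto the choices
    index = int.from_bytes(digest[:8], "big") % len(choices)
    return choices[index]

regions = ["north", "south", "east", "west"]

# Two independently generated "tables" that share customer ids
customers = {cid: derived_value(cid, regions) for cid in range(1, 6)}
orders = [(order_id, cid, derived_value(cid, regions))
          for order_id, cid in enumerate([3, 1, 3, 5], start=100)]

# Every order carries the same region as its customer row, with no lookup needed
consistent = all(customers[cid] == region for _, cid, region in orders)
```

Because nothing here depends on random state, rerunning the generation reproduces the same rows, which is the property that merge, join and Change Data Capture test scenarios rely on.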

Details of these features can be found in the online documentation.

Documentation

Please refer to the online documentation for details of use and many examples.

Release notes and details of the latest changes for this specific release can be found in the GitHub repository.

Installation

Use pip install dbldatagen to install the PyPI package.

Within a Databricks notebook, invoke the following in a notebook cell:

%pip install dbldatagen

This can be invoked within a Databricks notebook or a Delta Live Tables pipeline, and it also works on the Databricks Community Edition.

The documentation installation notes contain details of installation using alternative mechanisms.

Compatibility

The Databricks Labs data generator framework can be used with PySpark 3.1.2 and Python 3.8 or later. These are compatible with the Databricks runtime 9.1 LTS and later releases.
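If you want to check an environment against these minimums before installing, a small standard-library sketch such as the following may help. The helper names are hypothetical, and the PySpark version string has to be supplied by the caller, for example from `pyspark.__version__`.

```python
import sys

MIN_PYTHON = (3, 8)
MIN_PYSPARK = (3, 1, 2)

def parse_version(text):
    """Turn '3.1.2' (or '3.3.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def meets_minimums(python_version, pyspark_version):
    """Return True when both versions satisfy the minimums stated above."""
    return (tuple(python_version[:2]) >= MIN_PYTHON
            and parse_version(pyspark_version) >= MIN_PYSPARK)

# Check the interpreter that is running this snippet
PYTHON_OK = tuple(sys.version_info[:2]) >= MIN_PYTHON
```

In a notebook, `meets_minimums(sys.version_info, pyspark.__version__)` would combine both checks.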

Older prebuilt releases are tested against PySpark 3.0.1 (compatible with the Databricks runtime 7.3 LTS or later) and built with Python 3.7.5.

For full library compatibility for a specific Databricks Spark release, see the Databricks release notes for library compatibility.

Using the Data Generator

To use the data generator, install the library using the %pip install method or install the Python wheel directly in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

For example:

import dbldatagen as dg
from pyspark.sql.types import IntegerType, FloatType, StringType

# `spark` is the SparkSession that Databricks notebooks provide
column_count = 10
data_rows = 1000 * 1000

df_spec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=data_rows, partitions=4)
    .withIdOutput()
    # numColumns generates column_count columns from this one definition
    .withColumn("r", FloatType(),
                expr="floor(rand() * 350) * (86400 + 3600)",
                numColumns=column_count)
    .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
    .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
    .withColumn("code3", StringType(), values=['a', 'b', 'c'])
    # random=True selects values at random instead of deriving them from the row id
    .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
    # weights skew the random choice towards 'a'
    .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                random=True, weights=[9, 1, 1])
)

df = df_spec.build()
num_rows = df.count()
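
To get a feel for what weights such as `[9, 1, 1]` imply, the plain-Python sketch below computes the expected share of each value and compares it with a simulated weighted draw. It assumes that the weights act as relative frequencies. It uses the standard library's `random.choices` rather than dbldatagen, so it illustrates the arithmetic and not the library's sampling method.

```python
from collections import Counter
import random

values = ["a", "b", "c"]
weights = [9, 1, 1]

# Expected share of each value is its weight divided by the total weight
total = sum(weights)
expected = {v: w / total for v, w in zip(values, weights)}

# Simulate a weighted draw with the standard library to compare against the expectation
rng = random.Random(42)
sample = rng.choices(values, weights=weights, k=100_000)
observed = {v: n / len(sample) for v, n in Counter(sample).items()}
```

Under that assumption, 'a' should make up roughly 9/11 of the rows in a column like `code5`, with 'b' and 'c' at about 1/11 each.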

Refer to the online documentation for further examples.

The GitHub repository also contains further examples in the examples directory.

Project Support

Please note that all projects released under Databricks Labs are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub repo.
They will be reviewed as time permits, but there are no formal SLAs for support.

Feedback

Issues with the application? Found a bug? Have a great idea for an addition? Feel free to file an issue.
