Generate privacy‑safe synthetic data from production Spark DataFrames

ReSpark

Status: Pre-release (0.0.x)

ReSpark is a Python library built on PySpark for generating privacy-preserving synthetic data from existing Spark DataFrames or schemas. It is designed to run in any environment where Spark is available, whether on a local machine, a cluster, or a cloud platform.

Modern data-driven solutions require realistic datasets for development, testing, and analytics. However, using production data introduces privacy risks and governance challenges. ReSpark provides a privacy-first approach to synthetic data generation, preserving the structure and statistical characteristics of your original data while minimising re-identification risk.

Vision

  • Runs Anywhere Spark Runs: Works in any environment where Spark DataFrames are processed, from local setups to large-scale clusters.
  • Python-Friendly: Built on PySpark for seamless integration into Python workflows.
  • Privacy-First Design: Includes validation reporting to check for residual sensitive information or re-identification risk.
  • Relational Integrity: Maintains join consistency with appropriate handling of sensitive and non-sensitive fields.
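One common way to keep joins consistent once sensitive keys have been replaced is keyed hashing: the same input always maps to the same token, so rows that joined before still join afterwards. The sketch below illustrates that general idea in plain Python. It is not ReSpark's API or documented method; `pseudonymise_key`, the secret, and the sample tables are hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# Hypothetical helper (not part of ReSpark): map a sensitive key to a stable token.
def pseudonymise_key(value: str, secret: bytes, length: int = 16) -> str:
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

# Assumed secret for the example; a real one would be stored securely.
secret = b"example-secret-do-not-reuse"

# Two made-up tables that share a sensitive join key.
pupils = [{"pupil_id": "P001", "year": 7}, {"pupil_id": "P002", "year": 8}]
attendance = [{"pupil_id": "P001", "sessions": 180}, {"pupil_id": "P002", "sessions": 175}]

# Replace the key in both tables with the same keyed hash.
synthetic_pupils = [
    {**row, "pupil_id": pseudonymise_key(row["pupil_id"], secret)} for row in pupils
]
synthetic_attendance = [
    {**row, "pupil_id": pseudonymise_key(row["pupil_id"], secret)} for row in attendance
]

# The tables still join on the replaced key.
pupils_by_id = {row["pupil_id"]: row for row in synthetic_pupils}
joined = [{**pupils_by_id[row["pupil_id"]], **row} for row in synthetic_attendance]
```

Keyed hashing on its own is pseudonymisation rather than full synthesis, so in a setting like this it would only cover the join keys; the non-key fields would still need to be generated separately.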

Installation

pip install respark

This package requires PySpark (licensed under Apache-2.0).
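After installing, it can be useful to confirm that both distributions are visible to the current interpreter. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are assumed to match the `pip` names used above.

```python
from importlib import metadata
from typing import Optional

# Hypothetical helper: return the installed version of a distribution, or None.
def installed_version(distribution: str) -> Optional[str]:
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None

# Report whether each distribution is present in this environment.
for name in ("pyspark", "respark"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```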

Licence

© Crown Copyright 2025 Department for Education
Licensed under the MIT Licence.

Acknowledgements

Built on Apache Spark / PySpark (Apache License 2.0).

Download files

Source Distribution

respark-0.0.1.tar.gz (2.4 kB view details)

Built Distribution

respark-0.0.1-py3-none-any.whl (3.2 kB view details)

File details

Details for the file respark-0.0.1.tar.gz.

File metadata

  • Download URL: respark-0.0.1.tar.gz
  • Upload date:
  • Size: 2.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for respark-0.0.1.tar.gz
Algorithm Hash digest
SHA256 d0b320525754055abc61b6bf4afee624d9669d1ac4018354bbfc71cdc00e0993
MD5 3f65ad11ab083ba7d829299c277022b6
BLAKE2b-256 34a53aa3b5c64a573f02da8c8e4a259757b47eb1d6dcb54dacd759ff8f50d2a8
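
A downloaded file can be checked against the published digest before it is installed. The following is a minimal sketch using the standard library; `sha256_of` and `matches_expected` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest listed above for respark-0.0.1.tar.gz.
EXPECTED_SHA256 = "d0b320525754055abc61b6bf4afee624d9669d1ac4018354bbfc71cdc00e0993"

# Hypothetical helper: hash a file in chunks so large files do not load into memory.
def sha256_of(path, chunk_size: int = 65536) -> str:
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Hypothetical helper: compare the computed digest with the expected one.
def matches_expected(path, expected: str) -> bool:
    return hmac.compare_digest(sha256_of(path), expected.lower())

# Assumed download location; the check is skipped if the file is absent.
download = Path("respark-0.0.1.tar.gz")
if download.exists():
    print("digest OK" if matches_expected(download, EXPECTED_SHA256) else "digest MISMATCH")
```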

File details

Details for the file respark-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: respark-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 3.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for respark-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7e2936cf840889f9e2b79cbd9712775995e2c67dbf246689191b994810afb5f4
MD5 0b2d0ab4660a06be180a239503dd3913
BLAKE2b-256 5820cea603b73db33418e2b473d9fa93823b65160ad18a96ed7e33fc6bbf380a
