Generate privacy‑safe synthetic data from production Spark DataFrames
ReSpark
Status: Pre-release (0.0.x)
ReSpark is a Python library built on PySpark for generating privacy-preserving synthetic data from existing Spark DataFrames or schemas. It is designed to run in any environment where Spark is available, whether on a local machine, a cluster, or a cloud platform.
Modern data-driven solutions require realistic datasets for development, testing, and analytics. However, using production data introduces privacy risks and governance challenges. ReSpark provides a privacy-first approach to synthetic data generation, preserving the structure and statistical characteristics of your original data while minimising re-identification risk.
Vision
- Runs Anywhere Spark Runs: Works in any environment where Spark DataFrames are processed, from local setups to large-scale clusters.
- Privacy-First Design: Includes validation reporting to check for residual sensitive information or re-identification risk.
- Relational Integrity: Maintains join consistency with appropriate handling of sensitive and non-sensitive fields.
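The relational-integrity goal above can be illustrated with a small sketch. This is not ReSpark's API, and the source does not say how the library achieves this. The sketch shows one common technique, keyed hashing of a join column, in plain Python rather than Spark. The key, the `pseudonymise` and `mask` helpers, and the sample rows are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key belongs in a secret store.
SECRET_KEY = b"example-only-key"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a sensitive identifier to a stable token using HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask(rows, column):
    """Return copies of the rows with one column replaced by its token."""
    return [{**row, column: pseudonymise(row[column])} for row in rows]


# Made-up sample tables that share a sensitive join key.
pupils = [{"pupil_id": "P001", "year": 7}, {"pupil_id": "P002", "year": 8}]
results = [
    {"pupil_id": "P001", "score": 61},
    {"pupil_id": "P002", "score": 74},
    {"pupil_id": "P001", "score": 58},
]

masked_pupils = mask(pupils, "pupil_id")
masked_results = mask(results, "pupil_id")

# The same input always yields the same token, so the join still lines up.
years = {row["pupil_id"]: row["year"] for row in masked_pupils}
joined = [(years[row["pupil_id"]], row["score"]) for row in masked_results]
```

Because the token depends only on the key and the original value, every table masked with the same key keeps matching join keys, while the original identifiers no longer appear in the output.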
Installation
pip install respark
This package requires PySpark, which is licensed under Apache-2.0.
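After installing, it can be useful to confirm that both distributions are visible to the current interpreter. The following is a minimal sketch that uses only the standard library. The helper name `installed_version` is hypothetical and not part of ReSpark.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the install instructions.
    for name in ("respark", "pyspark"):
        version = installed_version(name)
        print(f"{name}: {version if version is not None else 'not installed'}")
```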
Licence
© Crown Copyright 2025 Department for Education
Licensed under the MIT Licence.
Acknowledgements
Built on Apache Spark / PySpark (Apache License 2.0).
File details
Details for the file respark-0.0.4.tar.gz.
File metadata
- Download URL: respark-0.0.4.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60b62b067d5914e48f91c20d290d45de859290dcfd3a902eed927c7997a04cf9 |
| MD5 | c00814766db74522dc3858914c955e88 |
| BLAKE2b-256 | f0c9d876003b032edaabb8706c9f36ab4b6d79866e6af4b0c00fe411937ca7a6 |
File details
Details for the file respark-0.0.4-py3-none-any.whl.
File metadata
- Download URL: respark-0.0.4-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b011d5362190f276620e32152cc7eaa0c68537c0c7c2ef101d7e339a071dee60 |
| MD5 | 02f3ad3570d44870c8e0c9e519147654 |
| BLAKE2b-256 | 0ff69f8ee7812805399ed794ca7e2ce206b5a6ca4b6008ecc042abc39ac81270 |
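The published SHA256 digests can be used to check a manually downloaded file before installing it. The sketch below is one way to do that with the standard library. The function names `sha256_of` and `matches_published` are hypothetical, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call (assumes the source distribution has been downloaded locally):
# matches_published(
#     "respark-0.0.4.tar.gz",
#     "60b62b067d5914e48f91c20d290d45de859290dcfd3a902eed927c7997a04cf9",
# )
```

When installing with pip, the same check can be enforced automatically by pinning hashes in a requirements file and passing `--require-hashes`.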