A Python package to create synthetic data from locally estimated distributions
synloc: An Algorithm to Create Synthetic Tabular Data
Overview | Data Requirements | Installation | A Quick Example | Documentation | How to cite? | Replication
Overview
synloc is an open-source Python package implementing the Local Resampler (LR) algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.
Two Subsampling Strategies
Both approaches provide effective disclosure control. Choose based on your priorities:
| Approach | Best for | Key advantage |
|---|---|---|
| k-Nearest Neighbors (k-NN) | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |
| Clustering-based | Efficiency & accuracy | Better data utility and computational performance |
Key features:
- Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)
- Accurate replication of complex distributions, including multimodal and non-convex-support data
- Flexible trade-off between data utility and privacy protection
- Built-in quality diagnostics, including Kolmogorov-Smirnov distances, Wasserstein distances, summary statistics, and correlation-difference metrics
- Compatible with parametric and nonparametric distributions
This implementation aligns with statistical agencies' safe data regulations, including the k-anonymity criterion and the Five Safes framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the paper referenced below.
Data Requirements
synloc expects a numeric pandas.DataFrame.
- Categorical variables must be encoded before synthesis, for example with pandas.get_dummies.
- Boolean dummy variables are accepted and converted to 0/1.
- Missing numeric values are filled with column medians during fitting.
- Columns with only missing values, duplicate column names, infinite values, and non-numeric columns raise clear errors.
- Integer-like variables can be rounded after synthesis with round_integers.
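As a rough illustration of the requirements listed above, the following sketch prepares a small mixed-type table using pandas only. The column names and values are hypothetical, and the median fill is shown just to make that step visible (synloc is described as doing it during fitting); none of this is synloc's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a categorical column, a boolean column, one missing value.
raw = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 29],
        "income": [42000.0, 58000.0, 61000.0, 39000.0],
        "employed": [True, False, True, True],
        "region": ["north", "south", "south", "west"],
    }
)

# One-hot encode the categorical column; dtype=int yields 0/1 rather than booleans.
prepared = pd.get_dummies(raw, columns=["region"], dtype=int)

# Convert the remaining boolean column to 0/1 explicitly.
prepared["employed"] = prepared["employed"].astype(int)

# Fill missing numeric values with column medians.
prepared = prepared.fillna(prepared.median())

# Sanity checks mirroring the requirements: unique names, numeric dtypes, finite values.
assert prepared.columns.is_unique
assert all(pd.api.types.is_numeric_dtype(t) for t in prepared.dtypes)
assert np.isfinite(prepared.to_numpy(dtype=float)).all()

print(prepared)
```

Encoding before synthesis keeps every column numeric, which is what the package expects as input.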
Installation
synloc can be installed through PyPI:
pip install synloc
A Quick Example
Assume that we have a sample with three variables with the following distributions:
$$x \sim \mathrm{Beta}(0.1,\,0.1)$$
$$y \sim \mathrm{Beta}(0.1,\,0.5)$$
$$z \sim 10\,y + \mathrm{Normal}(0,\,1)$$
A sample from these distributions can be generated with the tools module in synloc:
from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default.
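For readers who want to see what such a sample looks like without the package, here is a minimal NumPy sketch of the three distributions defined above. The function name `sample_xyz_sketch` and its `seed` argument are hypothetical; this is not the implementation of `sample_trivariate_xyz`, only a stand-in under the stated distributions.

```python
import numpy as np
import pandas as pd


def sample_xyz_sketch(n=1000, seed=0):
    """Hypothetical stand-in: draw x, y, z from the distributions stated above."""
    rng = np.random.default_rng(seed)
    x = rng.beta(0.1, 0.1, size=n)              # x ~ Beta(0.1, 0.1)
    y = rng.beta(0.1, 0.5, size=n)              # y ~ Beta(0.1, 0.5)
    z = 10 * y + rng.normal(0.0, 1.0, size=n)   # z ~ 10 y + Normal(0, 1)
    return pd.DataFrame({"x": x, "y": y, "z": z})


sketch = sample_xyz_sketch()
print(sketch.describe())
```

Both Beta distributions have shape parameters below 1, so most of the mass sits near 0 and 1, which is what makes this sample a demanding test case for a synthesizer.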
Initializing the resampler:
from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)
The subsample size is set to K=30. Next, we estimate a multivariate normal distribution locally for each subsample and draw "synthetic values" from each estimated distribution.
syn_data = resampler.fit()
syn_data is a pandas.DataFrame in which all variables are synthesized. To compare it with the original sample in a 3-D scatter plot:
resampler.comparePlots(['x','y','z'])
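To make the local estimation step more concrete, the sketch below shows one way a k-NN based local resampler could work: for each observation, take its K nearest neighbours, estimate a mean and covariance from them, and draw one value from the resulting multivariate normal. The function `local_resample_sketch` is hypothetical, drawing exactly one synthetic row per original row is a simplifying assumption, and this is not synloc's implementation.

```python
import numpy as np


def local_resample_sketch(values, k=30, seed=0):
    """Illustrative k-NN local resampling (assumes at least 2 columns and k <= n)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n, d = values.shape
    synthetic = np.empty((n, d))
    for i in range(n):
        # Euclidean distance from observation i to every observation.
        dist = np.linalg.norm(values - values[i], axis=1)
        # Indices of the k closest observations (observation i itself included).
        neighbours = np.argpartition(dist, k - 1)[:k]
        local = values[neighbours]
        # Local mean vector and covariance matrix of the subsample.
        mean = local.mean(axis=0)
        cov = np.cov(local, rowvar=False)
        # One synthetic draw from the locally estimated normal distribution.
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic


# Hypothetical demo data: 200 observations of 3 standard normal variables.
demo = np.random.default_rng(1).normal(size=(200, 3))
synthetic_demo = local_resample_sketch(demo, k=30)
print(synthetic_demo[:5])
```

In this sketch a larger k smooths the local estimates toward the global distribution, while a smaller k follows local structure more closely, which is one way to picture the utility and privacy trade-off mentioned in the overview.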
You can also inspect utility diagnostics after fitting:
variable_metrics = resampler.compareStats()
quality = resampler.qualityReport()
print(variable_metrics[["ks_statistic", "wasserstein_distance"]])
print(quality["overall"])
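The two distances named above can also be computed directly with SciPy for a single column, which may help when interpreting the report. The arrays here are hypothetical stand-ins for an original and a synthetic column, and this sketch makes no claim about how synloc computes its own metrics.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=1000)    # hypothetical original column
synthetic = rng.normal(0.1, 1.1, size=1000)   # hypothetical synthetic column

# Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs (0 to 1).
ks = ks_2samp(original, synthetic).statistic

# Wasserstein (earth mover's) distance between the two empirical distributions.
wd = wasserstein_distance(original, synthetic)

print(f"KS statistic: {ks:.3f}, Wasserstein distance: {wd:.3f}")
```

For both metrics, values closer to zero indicate that the synthetic column tracks the original distribution more closely.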
How to cite?
If you use synloc in your research, please cite the following paper:
@article{https://doi.org/10.1111/anzs.70032,
author = {Kalay, Ali Furkan},
title = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},
journal = {Australian \& New Zealand Journal of Statistics},
volume = {68},
number = {1},
pages = {e70032},
doi = {https://doi.org/10.1111/anzs.70032},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/anzs.70032},
year = {2026}
}
Replication
For replication materials of the paper, see the replication folder.
Download files
File details
Details for the file synloc-1.0.0.tar.gz.
File metadata
- Download URL: synloc-1.0.0.tar.gz
- Upload date:
- Size: 19.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0d4d7ab8959d3882a31f35bfa6d9d57fb32fd1fb1137de034da7cc0beff1ee9 |
| MD5 | a134d3fff4df9c9477481848b43df316 |
| BLAKE2b-256 | 91eb6f693a4189e78f91c73fe783e2ca795c3945d222cf4aa781aa7a59cdf206 |
File details
Details for the file synloc-1.0.0-py3-none-any.whl.
File metadata
- Download URL: synloc-1.0.0-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f48be7b3fbb45e1d898c752b5927716590cd17c80994c90b043411d6d6325c0 |
| MD5 | 704cefed0655a2aff5e7221f847cd450 |
| BLAKE2b-256 | 90a56a9f96f6b20988ab20b3c0829371acd942fcb8ef4074072cc656e2bed84f |