
A Python package to create synthetic data from locally estimated distributions


synloc: An Algorithm to Create Synthetic Tabular Data




Overview

synloc is an open-source Python package implementing the Local Resampler (LR) algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.

Two Subsampling Strategies

Both approaches provide effective disclosure control. Choose based on your priorities:

| Approach | Best for | Key advantage |
|---|---|---|
| k-Nearest Neighbors (k-NN) | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |
| Clustering-based | Efficiency & accuracy | Better data utility and computational performance |

Key features:

  • Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)
  • Accurate replication of complex distributions, including multimodal and non-convex-support data
  • Flexible trade-off between data utility and privacy protection
  • Built-in quality diagnostics, including Kolmogorov-Smirnov distances, Wasserstein distances, summary statistics, and correlation-difference metrics
  • Compatible with parametric and nonparametric distributions

This implementation aligns with statistical agencies' safe data regulations, including the k-anonymity criterion and the Five Safes framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the paper referenced below.

Data Requirements

synloc expects a numeric pandas.DataFrame.

  • Categorical variables must be encoded before synthesis, for example with pandas.get_dummies.
  • Boolean dummy variables are accepted and converted to 0/1.
  • Missing numeric values are filled with column medians during fitting.
  • Columns with only missing values, duplicate column names, infinite values, and non-numeric columns raise clear errors.
  • Integer-like variables can be rounded after synthesis with round_integers.
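
As an illustration of the requirements above, the following sketch prepares a small mixed-type table using only pandas and NumPy. The column names and values are made up for the example, and the checks are a rough pre-flight test written for this sketch, not synloc's own validation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a categorical column, a boolean column,
# and a numeric column with one missing value.
raw = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "income": [52000.0, np.nan, 38000.0, 61000.0],
    "employed": [True, False, True, True],
    "region": ["north", "south", "north", "west"],
})

# One-hot encode the categorical column; the other columns pass through.
encoded = pd.get_dummies(raw, columns=["region"])

# Cast boolean columns (including any boolean dummies) to 0/1 integers.
bool_cols = encoded.select_dtypes(include="bool").columns
encoded = encoded.astype({col: int for col in bool_cols})

# Rough checks mirroring the list above (assumed, not synloc's code).
numeric_only = all(pd.api.types.is_numeric_dtype(t) for t in encoded.dtypes)
unique_names = encoded.columns.is_unique
has_inf = bool(np.isinf(encoded.to_numpy(dtype=float)).any())
all_missing = [c for c in encoded.columns if encoded[c].isna().all()]

print(encoded.dtypes)
print(numeric_only, unique_names, has_inf, all_missing)
```

The missing income value is left in place here, since the list above states that missing numeric values are filled with column medians during fitting.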

Installation

synloc can be installed through PyPI:

pip install synloc

A Quick Example

Assume that we have a sample with three variables with the following distributions:

$$x \sim Beta(0.1,\,0.1)$$

$$y \sim Beta(0.1,\,0.5)$$

$$z \sim 10 y + Normal(0,\,1)$$

A sample from these distributions can be generated with the tools module in synloc:

from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default. 
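
For readers who want to see the three distributions written out, here is a NumPy sketch that draws a comparable sample by hand. It is not the source of sample_trivariate_xyz; the seed, the variable name manual_sample, and the reading of Normal(0, 1) as mean 0 and standard deviation 1 are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed fixed only to make the sketch repeatable
n = 1000                        # matches the default sample size noted above

x = rng.beta(0.1, 0.1, size=n)              # x ~ Beta(0.1, 0.1)
y = rng.beta(0.1, 0.5, size=n)              # y ~ Beta(0.1, 0.5)
z = 10 * y + rng.normal(0.0, 1.0, size=n)   # z = 10y + standard normal noise

manual_sample = pd.DataFrame({"x": x, "y": y, "z": z})
print(manual_sample.describe())
```

Both Beta distributions are U-shaped, with most mass near 0 and 1, and z depends on y, which is what makes this a useful test case for a local method.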

Initializing the resampler:

from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)

The subsample size is set to K = 30. Next, we estimate a multivariate normal distribution locally on each subsample and draw the synthetic values from each estimated distribution:

syn_data = resampler.fit() 
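
The idea of local estimation can be sketched in plain NumPy. The function below is a simplified illustration of the k-NN variant under stated assumptions, not synloc's implementation: it uses Euclidean distance on unscaled data, includes each row in its own neighbourhood, and draws exactly one synthetic row per original row. The names local_resample_sketch, toy, and toy_synthetic are hypothetical.

```python
import numpy as np

def local_resample_sketch(values, k, rng):
    """For each row, fit a normal distribution to its k nearest rows and draw one row."""
    values = np.asarray(values, dtype=float)
    n, d = values.shape
    synthetic = np.empty((n, d))
    for i in range(n):
        # Euclidean distance from row i to every row (row i is at distance 0).
        dist = np.linalg.norm(values - values[i], axis=1)
        neighbors = values[np.argsort(dist)[:k]]
        # Local mean and covariance estimated from the k-row subsample.
        mean = neighbors.mean(axis=0)
        cov = np.atleast_2d(np.cov(neighbors, rowvar=False))
        # One draw from the locally estimated multivariate normal.
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic

rng = np.random.default_rng(0)
y = rng.beta(0.1, 0.5, size=200)
toy = np.column_stack([y, 10 * y + rng.normal(size=200)])
toy_synthetic = local_resample_sketch(toy, k=30, rng=rng)
print(toy_synthetic[:5])
```

In this sketch a larger k makes each local estimate smoother and pulls synthetic values further from individual records, while a smaller k follows the original data more closely.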

syn_data is a pandas.DataFrame in which all variables are synthesized. To compare it with the original sample in a 3-D scatter plot:

resampler.comparePlots(['x','y','z'])

You can also inspect utility diagnostics after fitting:

variable_metrics = resampler.compareStats()
quality = resampler.qualityReport()

print(variable_metrics[["ks_statistic", "wasserstein_distance"]])
print(quality["overall"])
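
To make the two distance metrics concrete, here is a NumPy sketch of their textbook definitions for a single variable. It is meant only to show what the numbers measure; how compareStats computes them internally is not covered here, and the helper names ks_distance and wasserstein_1d are hypothetical.

```python
import numpy as np

def ks_distance(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def wasserstein_1d(a, b):
    """1-Wasserstein distance for two equal-size samples: mean gap between sorted values."""
    a, b = np.sort(a), np.sort(b)
    if a.size != b.size:
        raise ValueError("this sketch only handles samples of equal size")
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
original = rng.normal(size=500)
shifted = original + 0.5  # stand-in for a synthetic column, shifted on purpose

print(ks_distance(original, shifted))
print(wasserstein_1d(original, shifted))
```

Both distances are 0 when the two samples coincide and grow as the distributions drift apart, so smaller values indicate that a synthetic variable tracks the original more closely.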

How to cite?

If you use synloc in your research, please cite the following paper:

@article{https://doi.org/10.1111/anzs.70032,
    author = {Kalay, Ali Furkan},
    title = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},
    journal = {Australian \& New Zealand Journal of Statistics},
    volume = {68},
    number = {1},
    pages = {e70032},
    doi = {https://doi.org/10.1111/anzs.70032},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/anzs.70032},
    year = {2026}
}

Replication

For replication materials of the paper, see the replication folder.
