
Non-Parametric Gaussian Copula synthesizer for tabular data


NPGC


Citation

The method underlying this package was first introduced in:

Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk. Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach. Accepted at EDM 2026 (preprint available).

BibTeX:

@misc{diazramos2026npgc,
  title={Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach},
  author={Gabriel Diaz Ramos and Lorenzo Luzi and Debshila Basu Mallick and Richard Baraniuk},
  year={2026},
  note={Accepted at EDM 2026 | Preprint}
}

npgc is a Python package for fitting a non-parametric Gaussian copula to tabular data and generating synthetic tabular samples from the learned distribution. The implementation combines empirical marginal models with a Gaussian copula dependence structure and includes an optional differential privacy mechanism controlled by epsilon.

This package is currently published as version 0.2.0 and should be treated as an alpha-stage research software release.


Overview

NPGC models each column marginally and then couples the marginals through a Pearson correlation matrix in Gaussian latent space:

  1. Each input column is transformed to the unit interval with an empirical CDF.
  2. Uniform scores are mapped into Gaussian latent variables with the probit transform.
  3. A correlation matrix is estimated in latent space.
  4. New latent Gaussian samples are drawn and inverse-transformed back into the original feature domains.
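The four steps above can be sketched with NumPy and SciPy. This is an illustrative outline of the copula pipeline, not the package internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two correlated continuous columns as toy training data.
x = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])

# 1. Empirical CDF: map each column to (0, 1) via mid-ranks.
u = (stats.rankdata(x, axis=0) - 0.5) / x.shape[0]

# 2. Probit transform into Gaussian latent space.
z = stats.norm.ppf(u)

# 3. Estimate the latent Pearson correlation matrix.
corr = np.corrcoef(z, rowvar=False)

# 4. Draw new latent samples and invert through the empirical quantiles.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=200)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(x[:, j], u_new[:, j]) for j in range(x.shape[1])]
)
```

Because step 4 maps through the empirical quantile function, the sketch keeps synthetic values inside the observed range, mirroring the enforce_min_max_values=True behavior described below.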

The implementation supports:

  • continuous numeric columns
  • integer-valued numeric columns
  • categorical columns
  • datetime columns (datetime64, tz-naive and tz-aware)
  • mixed-type pandas.DataFrame inputs
  • missing values (NaN / NaT) through column-wise missingness estimation
  • model persistence with save(...) and load(...)
  • optional differential privacy noise through epsilon

Requirements

  • Python >=3.10
  • NumPy
  • pandas
  • SciPy

Installation

PyPI

python -m pip install -U npgc

Google Colab

!python -m pip install -U --no-cache-dir npgc
from npgc import NPGC

VS Code with uv

Add the package to the active project environment:

uv add npgc

Then import it normally:

from npgc import NPGC

Constructor signature:

NPGC(enforce_min_max_values: bool = True, epsilon: float | None = 1.0)

Initializes an unfitted synthesizer.

Parameters:

  • enforce_min_max_values (bool, default True): Controls tail behavior during inverse ECDF reconstruction. When True, continuous outputs remain within the observed training range and integer outputs are snapped to the observed integer support. When False, continuous and integer-valued variables may extrapolate beyond the observed extrema.
  • epsilon (float | None, default 1.0): Default differential privacy budget used during fit(...) when no per-fit override is provided. If None or non-positive, the privacy mechanism is disabled and empirical statistics are used directly.

fit(data, epsilon=None, random_state=None)

Method signature:

fit(data: pandas.DataFrame, epsilon: float | None = None, random_state: int | None = None) -> None

Fits the synthesizer to a tabular dataset.

Parameters:

  • data (pandas.DataFrame, required): Training table. Must be a non-empty DataFrame.
  • epsilon (float | None, default None): Optional fit-time override for the instance privacy budget. If supplied, it takes precedence over self.epsilon.
  • random_state (int | None, default None): Seed used for reproducible privacy noise and randomized empirical CDF tie-breaking during fitting.

Behavior:

  • Raises ValueError if data is not a pandas.DataFrame.
  • Raises ValueError if data is empty.
  • Stores learned marginals, latent correlation matrix, and column order internally.
  • Marks the model as fitted.
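The documented input checks can be sketched as follows. validate_fit_input is an illustrative helper, not the package source:

```python
import pandas as pd

def validate_fit_input(data):
    # Mirror the documented behavior: reject non-DataFrame inputs first,
    # then reject empty DataFrames (illustrative, not the package code).
    if not isinstance(data, pd.DataFrame):
        raise ValueError("data must be a pandas.DataFrame")
    if data.empty:
        raise ValueError("data must be a non-empty DataFrame")
```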

sample(num_rows, seed=None)

Method signature:

sample(num_rows: int, seed: int | None = None) -> pandas.DataFrame

Generates synthetic rows from a previously fitted model.

Parameters:

  • num_rows (int, required): Number of synthetic rows to generate.
  • seed (int | None, default None): Random seed for reproducible sampling from the latent Gaussian model.

Behavior:

  • Raises RuntimeError if called before fit(...).
  • Returns a pandas.DataFrame with the learned column order.
  • Attempts to cast each generated column back to the original training dtype.

save(filepath)

Method signature:

save(filepath: str | os.PathLike[str]) -> None

Serializes the fitted model as a pickle file. Parent directories are created automatically when needed.

load(filepath)

Method signature:

load(filepath: str | os.PathLike[str]) -> None

Loads model state into the current NPGC instance from a pickle file. The loader supports both object-based checkpoints and a legacy dictionary-based state format.

Minimal Usage

import pandas as pd

from npgc import NPGC

df = pd.DataFrame(
    {
        "signup_date": pd.to_datetime(["2022-01-15", "2022-06-03", "2023-02-20", "2023-11-08"]),
        "age": [21, 34, 45, 52],
        "income": [42000.0, 68000.0, 91000.0, 120000.0],
        "segment": ["A", "B", "B", "C"],
    }
)

model = NPGC(enforce_min_max_values=True, epsilon=1.0)
model.fit(df, random_state=42)

synthetic = model.sample(100, seed=42)
print(synthetic.head())

Persistence Example

from npgc import NPGC

model = NPGC(epsilon=1.0)
model.fit(df, random_state=42)  # df as defined in the Minimal Usage example
model.save("artifacts/npgc_model.pkl")

reloaded = NPGC()
reloaded.load("artifacts/npgc_model.pkl")
synthetic = reloaded.sample(50, seed=7)

Technical Notes

What epsilon does

epsilon is the differential privacy budget.

  • Smaller epsilon means stronger privacy and more perturbation.
  • Larger epsilon means weaker privacy and less perturbation.
  • epsilon=None disables the privacy mechanism.
  • epsilon<=0 is treated as non-private in the current implementation.

The current implementation splits the total privacy budget equally:

  • epsilon / 2 for marginal estimation
  • epsilon / 2 for latent correlation estimation

Mechanistically, the code applies Laplace noise to:

  • integer support counts
  • continuous histograms
  • categorical counts
  • latent-space correlation estimates

This means privacy is not added only at the final sample stage; it is injected during model fitting into the sufficient statistics used to construct the synthetic generator.
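The Laplace mechanism on count statistics can be sketched as follows. This is a simplified illustration under the standard assumption that adding or removing one row changes each count by at most 1; laplace_counts is not the package API:

```python
import numpy as np

def laplace_counts(counts, epsilon, sensitivity=1.0, rng=None):
    # Laplace mechanism: noise scale b = sensitivity / epsilon yields
    # epsilon-DP for count queries with the stated sensitivity.
    rng = np.random.default_rng(rng)
    noisy = counts + rng.laplace(scale=sensitivity / epsilon, size=len(counts))
    # Counts cannot be negative, so clip before renormalizing elsewhere.
    return np.clip(noisy, 0.0, None)

# Half of the total budget goes to the marginals, as described above.
total_eps = 1.0
noisy_hist = laplace_counts(np.array([40.0, 35.0, 25.0]), total_eps / 2, rng=42)
```

Smaller epsilon increases the noise scale, which is why smaller budgets mean stronger privacy and more perturbation.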

What enforce_min_max_values does

enforce_min_max_values controls whether inverse marginal reconstruction is range-constrained.

When True:

  • continuous columns are reconstructed within the observed training range
  • integer columns are mapped to the nearest observed integer support value
  • generated values remain conservative with respect to the observed empirical support

When False:

  • continuous columns may extrapolate beyond the observed minimum and maximum
  • integer columns may extrapolate before final dtype casting
  • the model can generate values outside the original empirical range

This parameter is especially important when synthetic outputs must remain support-faithful for downstream validation, auditing, or schema-constrained pipelines.
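The difference can be illustrated with a crude linear stand-in for the empirical quantile function; inverse_marginal is a sketch, not the package's inverse ECDF:

```python
import numpy as np

train = np.array([10.0, 20.0, 35.0, 50.0])

def inverse_marginal(u, enforce_min_max=True):
    # Linear stand-in for the empirical quantile function: with
    # enforcement on, outputs are clipped to the observed range; with it
    # off, extreme latent draws may extrapolate past the extrema.
    vals = train[0] + (train[-1] - train[0]) * np.asarray(u, dtype=float)
    if enforce_min_max:
        vals = np.clip(vals, train.min(), train.max())
    return vals
```

A uniform score above 1 (an extreme latent draw) extrapolates past the observed maximum only when enforcement is off.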

Missing values

Missingness is modeled per column through the observed missing fraction:

  • numeric columns preserve an estimated nan_frac
  • categorical columns reserve probability mass for missing values
  • synthetic samples may therefore contain missing values when the training data does
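Re-injecting missingness at the estimated per-column rate can be sketched as follows. apply_missingness is an illustrative helper, not the package API:

```python
import numpy as np

def apply_missingness(values, nan_frac, rng=None):
    # Mark each synthetic value missing independently with probability
    # nan_frac, the estimated per-column missing fraction.
    rng = np.random.default_rng(rng)
    out = values.astype(float).copy()
    out[rng.random(len(out)) < nan_frac] = np.nan
    return out
```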

Type handling

Column handling is determined from the training DataFrame:

  • datetime64 columns (tz-naive or tz-aware) are detected first and modeled as continuous float-seconds since epoch; timezone is preserved in the output dtype
  • numeric dtypes are modeled as either integer or continuous
  • non-numeric dtypes are treated as categorical
  • output columns are cast back toward the original dtype after generation

For datetime columns, DP noise is applied to the 100-bin histogram of float-seconds, which avoids the uniqueness problem of nanosecond integer encoding and keeps sensitivity well-defined (±1 per bin count) regardless of timestamp granularity.

For integer-valued numeric columns, the implementation detects integer structure from the observed non-missing values and uses a dedicated inverse ECDF path.

Data Contract

Expected input:

  • a non-empty pandas.DataFrame
  • tabular columns with numeric or categorical-like values
  • optional missing values

Current implementation details worth knowing:

  • column order is preserved
  • correlations are computed in Gaussian latent space with Pearson correlation
  • the correlation matrix is repaired to the nearest valid correlation matrix when noise or numerical issues make it non-PSD
  • categorical labels are sampled from observed label support
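One common repair of this kind is eigenvalue clipping followed by diagonal renormalization. This is a standard sketch; the package's exact projection method may differ:

```python
import numpy as np

def repair_correlation(c, eps=1e-8):
    # Symmetrize, clip negative eigenvalues to a small positive floor to
    # obtain a PSD matrix, then rescale so the diagonal returns to 1.
    w, v = np.linalg.eigh((c + c.T) / 2)
    psd = (v * np.clip(w, eps, None)) @ v.T
    d = np.sqrt(np.diag(psd))
    return psd / np.outer(d, d)
```

Because rescaling by a positive diagonal preserves positive semidefiniteness, the result is a valid correlation matrix.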

Reproducibility

There are two independent random entry points:

  • random_state in fit(...) controls fitting-time randomness, including privacy noise and randomized ECDF operations
  • seed in sample(...) controls synthetic sample generation after fitting

For exact reproducibility, set both.
