
Non-Parametric Gaussian Copula synthesizer for tabular data


NPGC


Citation

The method underlying this package was first introduced in:

Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk. Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach. Accepted at EDM 2026 (preprint available).

BibTeX:

@misc{diazramos2026npgc,
  title={Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach},
  author={Gabriel Diaz Ramos and Lorenzo Luzi and Debshila Basu Mallick and Richard Baraniuk},
  year={2026},
  note={Accepted at EDM 2026 | Preprint}
}

npgc is a Python package for fitting a non-parametric Gaussian copula to tabular data and generating synthetic tabular samples from the learned distribution. The implementation combines empirical marginal models with a Gaussian copula dependence structure and includes an optional differential privacy mechanism controlled by epsilon.

This package is currently published as version 0.2.0 and should be treated as an alpha-stage research software release.


Overview

NPGC models each column marginally and then couples the marginals through a Pearson correlation matrix in Gaussian latent space:

  1. Each input column is transformed to the unit interval with an empirical CDF.
  2. Uniform scores are mapped into Gaussian latent variables with the probit transform.
  3. A correlation matrix is estimated in latent space.
  4. New latent Gaussian samples are drawn and inverse-transformed back into the original feature domains.
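The four steps above can be sketched with NumPy and SciPy. This is an illustrative outline of the copula pipeline, not the package internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two correlated continuous columns as toy training data.
x = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])

# 1. Empirical CDF: map each column to (0, 1) via mid-ranks.
u = (stats.rankdata(x, axis=0) - 0.5) / x.shape[0]

# 2. Probit transform into Gaussian latent space.
z = stats.norm.ppf(u)

# 3. Estimate the latent Pearson correlation matrix.
corr = np.corrcoef(z, rowvar=False)

# 4. Draw new latent samples and invert through the empirical quantiles.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=200)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(x[:, j], u_new[:, j]) for j in range(x.shape[1])]
)
```

Because step 4 maps through the empirical quantile function, the sketch keeps synthetic values inside the observed range, mirroring the enforce_min_max_values=True behavior described below.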

The implementation supports:

  • continuous numeric columns
  • integer-valued numeric columns
  • categorical columns
  • datetime columns (datetime64, tz-naive and tz-aware)
  • mixed-type pandas.DataFrame inputs
  • missing values (NaN / NaT) through column-wise missingness estimation
  • model persistence with save(...) and load(...)
  • optional differential privacy noise through epsilon

Requirements

  • Python >=3.10
  • NumPy
  • pandas
  • SciPy

Installation

PyPI

python -m pip install -U npgc

Google Colab

!python -m pip install -U --no-cache-dir npgc
from npgc import NPGC

VS Code with uv

Add the package to the active project environment:

uv add npgc

Then import it normally:

from npgc import NPGC

Constructor signature:

NPGC(enforce_min_max_values: bool = True, epsilon: float | None = 1.0)

Initializes an unfitted synthesizer.

Parameters:

  • enforce_min_max_values (bool, default True): Controls tail behavior during inverse ECDF reconstruction. When True, continuous outputs remain within the observed training range and integer outputs are snapped to the observed integer support. When False, continuous and integer-valued variables may extrapolate beyond the observed extrema.
  • epsilon (float | None, default 1.0): Default differential privacy budget used during fit(...) when no per-fit override is provided. If None or non-positive, the privacy mechanism is disabled and empirical statistics are used directly.

fit(data, epsilon=None, random_state=None)

Method signature:

fit(data: pandas.DataFrame, epsilon: float | None = None, random_state: int | None = None) -> None

Fits the synthesizer to a tabular dataset.

Parameters:

  • data (pandas.DataFrame, required): Training table. Must be a non-empty DataFrame.
  • epsilon (float | None, default None): Optional fit-time override for the instance privacy budget. If supplied, it takes precedence over self.epsilon.
  • random_state (int | None, default None): Seed used for reproducible privacy noise and randomized empirical CDF tie-breaking during fitting.

Behavior:

  • Raises ValueError if data is not a pandas.DataFrame.
  • Raises ValueError if data is empty.
  • Stores learned marginals, latent correlation matrix, and column order internally.
  • Marks the model as fitted.
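The documented input checks can be sketched as follows. validate_fit_input is an illustrative helper, not the package source:

```python
import pandas as pd

def validate_fit_input(data):
    # Mirror the documented behavior: reject non-DataFrame inputs first,
    # then reject empty DataFrames (illustrative, not the package code).
    if not isinstance(data, pd.DataFrame):
        raise ValueError("data must be a pandas.DataFrame")
    if data.empty:
        raise ValueError("data must be a non-empty DataFrame")
```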

sample(num_rows, seed=None)

Method signature:

sample(num_rows: int, seed: int | None = None) -> pandas.DataFrame

Generates synthetic rows from a previously fitted model.

Parameters:

  • num_rows (int, required): Number of synthetic rows to generate.
  • seed (int | None, default None): Random seed for reproducible sampling from the latent Gaussian model.

Behavior:

  • Raises RuntimeError if called before fit(...).
  • Returns a pandas.DataFrame with the learned column order.
  • Attempts to cast each generated column back to the original training dtype.

save(filepath)

Method signature:

save(filepath: str | os.PathLike[str]) -> None

Serializes the fitted model as a pickle file. Parent directories are created automatically when needed.

load(filepath)

Method signature:

load(filepath: str | os.PathLike[str]) -> None

Loads model state into the current NPGC instance from a pickle file. The loader supports both object-based checkpoints and a legacy dictionary-based state format.

Minimal Usage

import pandas as pd

from npgc import NPGC

df = pd.DataFrame(
    {
        "signup_date": pd.to_datetime(["2022-01-15", "2022-06-03", "2023-02-20", "2023-11-08"]),
        "age": [21, 34, 45, 52],
        "income": [42000.0, 68000.0, 91000.0, 120000.0],
        "segment": ["A", "B", "B", "C"],
    }
)

model = NPGC(enforce_min_max_values=True, epsilon=1.0)
model.fit(df, random_state=42)

synthetic = model.sample(100, seed=42)
print(synthetic.head())

Persistence Example

from npgc import NPGC

model = NPGC(epsilon=1.0)
model.fit(df, random_state=42)  # df as defined in the Minimal Usage example
model.save("artifacts/npgc_model.pkl")

reloaded = NPGC()
reloaded.load("artifacts/npgc_model.pkl")
synthetic = reloaded.sample(50, seed=7)

Technical Notes

What epsilon does

epsilon is the differential privacy budget.

  • Smaller epsilon means stronger privacy and more perturbation.
  • Larger epsilon means weaker privacy and less perturbation.
  • epsilon=None disables the privacy mechanism.
  • epsilon<=0 is treated as non-private in the current implementation.

The current implementation splits the total privacy budget equally:

  • epsilon / 2 for marginal estimation
  • epsilon / 2 for latent correlation estimation

Mechanistically, the code applies Laplace noise to:

  • integer support counts
  • continuous histograms
  • categorical counts
  • latent-space correlation estimates

This means privacy is not added only at the final sample stage; it is injected during model fitting into the sufficient statistics used to construct the synthetic generator.
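The Laplace mechanism on count statistics can be sketched as follows. This is a simplified illustration under the standard assumption that adding or removing one row changes each count by at most 1; laplace_counts is not the package API:

```python
import numpy as np

def laplace_counts(counts, epsilon, sensitivity=1.0, rng=None):
    # Laplace mechanism: noise scale b = sensitivity / epsilon yields
    # epsilon-DP for count queries with the stated sensitivity.
    rng = np.random.default_rng(rng)
    noisy = counts + rng.laplace(scale=sensitivity / epsilon, size=len(counts))
    # Counts cannot be negative, so clip before renormalizing elsewhere.
    return np.clip(noisy, 0.0, None)

# Half of the total budget goes to the marginals, as described above.
total_eps = 1.0
noisy_hist = laplace_counts(np.array([40.0, 35.0, 25.0]), total_eps / 2, rng=42)
```

Smaller epsilon increases the noise scale, which is why smaller budgets mean stronger privacy and more perturbation.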

What enforce_min_max_values does

enforce_min_max_values controls whether inverse marginal reconstruction is range-constrained.

When True:

  • continuous columns are reconstructed within the observed training range
  • integer columns are mapped to the nearest observed integer support value
  • generated values remain conservative with respect to the observed empirical support

When False:

  • continuous columns may extrapolate beyond the observed minimum and maximum
  • integer columns may extrapolate before final dtype casting
  • the model can generate values outside the original empirical range

This parameter is especially important when synthetic outputs must remain support-faithful for downstream validation, auditing, or schema-constrained pipelines.
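The difference can be illustrated with a crude linear stand-in for the empirical quantile function; inverse_marginal is a sketch, not the package's inverse ECDF:

```python
import numpy as np

train = np.array([10.0, 20.0, 35.0, 50.0])

def inverse_marginal(u, enforce_min_max=True):
    # Linear stand-in for the empirical quantile function: with
    # enforcement on, outputs are clipped to the observed range; with it
    # off, extreme latent draws may extrapolate past the extrema.
    vals = train[0] + (train[-1] - train[0]) * np.asarray(u, dtype=float)
    if enforce_min_max:
        vals = np.clip(vals, train.min(), train.max())
    return vals
```

A uniform score above 1 (an extreme latent draw) extrapolates past the observed maximum only when enforcement is off.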

Missing values

Missingness is modeled per column through the observed missing fraction:

  • numeric columns preserve an estimated nan_frac
  • categorical columns reserve probability mass for missing values
  • synthetic samples may therefore contain missing values when the training data does
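Re-injecting missingness at the estimated per-column rate can be sketched as follows. apply_missingness is an illustrative helper, not the package API:

```python
import numpy as np

def apply_missingness(values, nan_frac, rng=None):
    # Mark each synthetic value missing independently with probability
    # nan_frac, the estimated per-column missing fraction.
    rng = np.random.default_rng(rng)
    out = values.astype(float).copy()
    out[rng.random(len(out)) < nan_frac] = np.nan
    return out
```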

Type handling

Column handling is determined from the training DataFrame:

  • datetime64 columns (tz-naive or tz-aware) are detected first and modeled as continuous float-seconds since epoch; timezone is preserved in the output dtype
  • numeric dtypes are modeled as either integer or continuous
  • non-numeric dtypes are treated as categorical
  • output columns are cast back toward the original dtype after generation

For datetime columns, DP noise is applied to the 100-bin histogram of float-seconds, which avoids the uniqueness problem of nanosecond integer encoding and keeps sensitivity well-defined (±1 per bin count) regardless of timestamp granularity.

For integer-valued numeric columns, the implementation detects integer structure from the observed non-missing values and uses a dedicated inverse ECDF path.

Data Contract

Expected input:

  • a non-empty pandas.DataFrame
  • tabular columns with numeric or categorical-like values
  • optional missing values

Current implementation details worth knowing:

  • column order is preserved
  • correlations are computed in Gaussian latent space with Pearson correlation
  • the correlation matrix is repaired to the nearest valid correlation matrix when noise or numerical issues make it non-PSD
  • categorical labels are sampled from observed label support
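One common repair of this kind is eigenvalue clipping followed by diagonal renormalization. This is a standard sketch; the package's exact projection method may differ:

```python
import numpy as np

def repair_correlation(c, eps=1e-8):
    # Symmetrize, clip negative eigenvalues to a small positive floor to
    # obtain a PSD matrix, then rescale so the diagonal returns to 1.
    w, v = np.linalg.eigh((c + c.T) / 2)
    psd = (v * np.clip(w, eps, None)) @ v.T
    d = np.sqrt(np.diag(psd))
    return psd / np.outer(d, d)
```

Because rescaling by a positive diagonal preserves positive semidefiniteness, the result is a valid correlation matrix.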

Reproducibility

There are two independent random entry points:

  • random_state in fit(...) controls fitting-time randomness, including privacy noise and randomized ECDF operations
  • seed in sample(...) controls synthetic sample generation after fitting

For exact reproducibility, set both.
