Non-Parametric Gaussian Copula synthesizer for tabular data
NPGC
Citation
The method underlying this package was first introduced in:
Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk. Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach. Accepted at EDM 2026 | Preprint
BibTeX:
@misc{diazramos2026npgc,
title={Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach},
author={Gabriel Diaz Ramos and Lorenzo Luzi and Debshila Basu Mallick and Richard Baraniuk},
year={2026},
note={Accepted at EDM 2026 | Preprint}
}
npgc is a Python package for fitting a non-parametric Gaussian copula to tabular data and generating synthetic tabular samples from the learned distribution. The implementation combines empirical marginal models with a Gaussian copula dependence structure and includes an optional differential privacy mechanism controlled by epsilon.
This package is currently published as version 0.2.0 and should be treated as an alpha-stage research software release.
Table of Contents
- Overview
- Requirements
- Installation
- Minimal Usage
- Persistence Example
- Technical Notes
- Data Contract
- Reproducibility
- Project Metadata
Overview
NPGC models each column marginally and then couples the marginals through a Pearson correlation matrix in Gaussian latent space:
- Each input column is transformed to the unit interval with an empirical CDF.
- Uniform scores are mapped into Gaussian latent variables with the probit transform.
- A correlation matrix is estimated in latent space.
- New latent Gaussian samples are drawn and inverse-transformed back into the original feature domains.
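The four steps above can be sketched for two numeric columns using only the Python standard library. This is an illustrative sketch, not the package's code: the helper names (`to_latent`, `from_latent`, `sample_pairs`) are hypothetical, ranks are scaled by `n + 1` to keep scores strictly inside (0, 1), and the inverse transform is a plain step function over the sorted training values with no interpolation.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf() and inv_cdf() form the probit pair


def to_latent(column):
    """Map a column to Gaussian scores: rank -> (0, 1) -> probit."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    latent = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u = rank / (n + 1)  # empirical CDF value, strictly inside (0, 1)
        latent[i] = STD_NORMAL.inv_cdf(u)
    return latent


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def from_latent(z, sorted_column):
    """Invert the ECDF: latent score -> uniform -> observed order statistic."""
    u = STD_NORMAL.cdf(z)
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_pairs(x, y, num_rows, seed=None):
    """Fit a two-column Gaussian copula and draw synthetic (x, y) pairs."""
    rng = random.Random(seed)
    rho = pearson(to_latent(x), to_latent(y))
    xs, ys = sorted(x), sorted(y)
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0.0, 1.0)
        # Correlate z2 with z1 (the 2x2 Cholesky factor written out by hand).
        z2 = rho * z1 + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        rows.append((from_latent(z1, xs), from_latent(z2, ys)))
    return rows


ages = [21, 34, 45, 52, 29, 61]
incomes = [42000.0, 68000.0, 91000.0, 120000.0, 55000.0, 130000.0]
synthetic = sample_pairs(ages, incomes, num_rows=5, seed=0)
```

Because the step-function inverse only returns training values, every synthetic entry here comes from the observed support; a real implementation may interpolate between order statistics instead.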
The implementation supports:
- continuous numeric columns
- integer-valued numeric columns
- categorical columns
- datetime columns (datetime64, tz-naive and tz-aware)
- mixed-type pandas.DataFrame inputs
- missing values (NaN/NaT) through column-wise missingness estimation
- model persistence with save(...) and load(...)
- optional differential privacy noise through epsilon
Requirements
- Python >=3.10
- NumPy
- pandas
- SciPy
Installation
PyPI
python -m pip install -U npgc
Google Colab
!python -m pip install -U --no-cache-dir npgc
from npgc import NPGC
VS Code with uv
Add the package to the active project environment:
uv add npgc
Then import it normally:
from npgc import NPGC
Constructor signature:
NPGC(enforce_min_max_values: bool = True, epsilon: float | None = 1.0)
NPGC.__init__(enforce_min_max_values=True, epsilon=1.0)
Initializes an unfitted synthesizer.
| Parameter | Type | Default | Technical meaning |
|---|---|---|---|
| enforce_min_max_values | bool | True | Controls tail behavior during inverse ECDF reconstruction. When True, continuous outputs remain within the observed training range and integer outputs are snapped to the observed integer support. When False, continuous and integer-valued variables may extrapolate beyond the observed extrema. |
| epsilon | float \| None | 1.0 | Default differential privacy budget used during fit(...) if no per-fit override is provided. If None or non-positive, the privacy mechanism is disabled and empirical statistics are used directly. |
fit(data, epsilon=None, random_state=None)
Method signature:
fit(data: pandas.DataFrame, epsilon: float | None = None, random_state: int | None = None) -> None
Fits the synthesizer to a tabular dataset.
| Parameter | Type | Default | Technical meaning |
|---|---|---|---|
| data | pandas.DataFrame | required | Training table. The implementation requires a non-empty DataFrame. |
| epsilon | float \| None | None | Optional fit-time override for the instance privacy budget. If supplied, it takes precedence over self.epsilon. |
| random_state | int \| None | None | Seed used for reproducible privacy noise and randomized empirical CDF tie-breaking during fitting. |
Behavior:
- Raises ValueError if data is not a pandas.DataFrame.
- Raises ValueError if data is empty.
ValueErrorifdatais empty. - Stores learned marginals, latent correlation matrix, and column order internally.
- Marks the model as fitted.
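The randomized empirical CDF tie-breaking that random_state seeds can be pictured as follows. This is one common way to randomize ties, shown as an assumption rather than as the package's actual routine; `randomized_ecdf` is a hypothetical name, and the quadratic-time counting is kept only for readability.

```python
import random


def randomized_ecdf(values, random_state=None):
    """Return one uniform score per value, spreading ties at random.

    A value with `below` strictly smaller entries and `at_or_below` entries
    less than or equal to it gets a score drawn uniformly from
    [below / n, at_or_below / n], so repeated values do not collapse
    onto a single point of the unit interval.
    """
    rng = random.Random(random_state)
    n = len(values)
    scores = []
    for v in values:
        below = sum(1 for w in values if w < v)
        at_or_below = sum(1 for w in values if w <= v)
        scores.append(rng.uniform(below / n, at_or_below / n))
    return scores


segment_codes = [0, 1, 1, 2]  # the two tied 1s share the interval [0.25, 0.75]
u_first = randomized_ecdf(segment_codes, random_state=42)
u_second = randomized_ecdf(segment_codes, random_state=42)
```

With the same random_state the two calls return identical scores, which is why fitting is only repeatable when the seed is fixed.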
sample(num_rows, seed=None)
Method signature:
sample(num_rows: int, seed: int | None = None) -> pandas.DataFrame
Generates synthetic rows from a previously fitted model.
| Parameter | Type | Default | Technical meaning |
|---|---|---|---|
| num_rows | int | required | Number of synthetic rows to generate. |
| seed | int \| None | None | Random seed for reproducible sampling from the latent Gaussian model. |
Behavior:
- Raises RuntimeError if called before fit(...).
- Returns a pandas.DataFrame with the learned column order.
- Attempts to cast each generated column back to the original training dtype.
save(filepath)
Method signature:
save(filepath: str | os.PathLike[str]) -> None
Serializes the fitted model as a pickle file. Parent directories are created automatically when needed.
load(filepath)
Method signature:
load(filepath: str | os.PathLike[str]) -> None
Loads model state into the current NPGC instance from a pickle file. The loader supports both object-based checkpoints and a legacy dictionary-based state format.
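As a rough picture of the persistence behavior described for save(...) and load(...), the sketch below creates parent directories before pickling and accepts either a dictionary checkpoint or an object checkpoint on load. The function names `save_state` and `load_state` are hypothetical, and `types.SimpleNamespace` merely stands in for a pickled model object; the package's real checkpoint layout is not documented here.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_state(obj, filepath):
    """Pickle obj to filepath, creating parent directories first."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)


def load_state(filepath):
    """Return a plain dict of state from either checkpoint layout."""
    with Path(filepath).open("rb") as handle:
        payload = pickle.load(handle)
    if isinstance(payload, dict):  # legacy dictionary-based format
        return dict(payload)
    return dict(vars(payload))  # object-based format: copy its attributes


with tempfile.TemporaryDirectory() as tmp:
    legacy_path = Path(tmp) / "artifacts" / "legacy.pkl"
    object_path = Path(tmp) / "artifacts" / "object.pkl"
    save_state({"columns": ["age", "income"], "fitted": True}, legacy_path)
    save_state(SimpleNamespace(columns=["age", "income"], fitted=True), object_path)
    legacy_state = load_state(legacy_path)
    object_state = load_state(object_path)
```

Unpickling executes code embedded in the file, so pickle checkpoints should only be loaded from trusted sources.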
Minimal Usage
import pandas as pd
from npgc import NPGC

df = pd.DataFrame(
    {
        "signup_date": pd.to_datetime(["2022-01-15", "2022-06-03", "2023-02-20", "2023-11-08"]),
        "age": [21, 34, 45, 52],
        "income": [42000.0, 68000.0, 91000.0, 120000.0],
        "segment": ["A", "B", "B", "C"],
    }
)

model = NPGC(enforce_min_max_values=True, epsilon=1.0)
model.fit(df, random_state=42)
synthetic = model.sample(100, seed=42)
print(synthetic.head())
Persistence Example
from npgc import NPGC
model = NPGC(epsilon=1.0)
model.fit(df, random_state=42)  # df is the DataFrame built in Minimal Usage
model.save("artifacts/npgc_model.pkl")
reloaded = NPGC()
reloaded.load("artifacts/npgc_model.pkl")
synthetic = reloaded.sample(50, seed=7)
Technical Notes
What epsilon does
epsilon is the differential privacy budget.
- Smaller epsilon means stronger privacy and more perturbation.
- Larger epsilon means weaker privacy and less perturbation.
- epsilon=None disables the privacy mechanism.
- epsilon<=0 is treated as non-private in the current implementation.
The current implementation splits the total privacy budget equally:
- epsilon / 2 for marginal estimation
- epsilon / 2 for latent correlation estimation
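The budget rules above reduce to a few lines of arithmetic. The sketch below uses hypothetical helper names (`split_budget`, `laplace_scale`) and the standard Laplace-mechanism relation, scale = sensitivity / epsilon; how the marginal half is further shared across columns is not specified here and is left out.

```python
def split_budget(epsilon):
    """Return (marginal_eps, correlation_eps), or None when privacy is off."""
    if epsilon is None or epsilon <= 0:
        return None  # non-private: empirical statistics are used directly
    half = epsilon / 2.0
    return half, half


def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise for a query with the given L1 sensitivity."""
    return sensitivity / epsilon


budget = split_budget(1.0)                # half for marginals, half for correlations
marginal_scale = laplace_scale(1.0, budget[0])
```

Halving the budget doubles the noise scale for each stage, which is the cost of protecting both the marginals and the correlation estimate.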
Mechanistically, the code applies Laplace noise to:
- integer support counts
- continuous histograms
- categorical counts
- latent-space correlation estimates
This means privacy is not added only at the final sample stage; it is injected during model fitting into the sufficient statistics used to construct the synthetic generator.
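To make the idea of noising sufficient statistics concrete, the following sketch perturbs a vector of counts with Laplace noise and turns the result back into probabilities. The names are hypothetical, and the clipping of negative counts, the renormalization, and the uniform fallback are assumptions chosen for illustration; the package may post-process noisy counts differently.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_probabilities(counts, epsilon, random_state=None, sensitivity=1.0):
    """Add Laplace noise to counts, clip at zero, and renormalize."""
    rng = random.Random(random_state)
    scale = sensitivity / epsilon
    noisy = [max(c + laplace_noise(scale, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # Assumed fallback: every count was clipped, so use a uniform distribution.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in noisy]


segment_counts = [1, 2, 1]  # counts of "A", "B", "C" in the Minimal Usage table
probs = noisy_probabilities(segment_counts, epsilon=0.5, random_state=42)
```

Sampling then draws from `probs` rather than from the raw frequencies, so no later step touches the original counts again.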
What enforce_min_max_values does
enforce_min_max_values controls whether inverse marginal reconstruction is range-constrained.
When True:
- continuous columns are reconstructed within the observed training range
- integer columns are mapped to the nearest observed integer support value
- generated values remain conservative with respect to the observed empirical support
When False:
- continuous columns may extrapolate beyond the observed minimum and maximum
- integer columns may extrapolate before final dtype casting
- the model can generate values outside the original empirical range
This parameter is especially important when synthetic outputs must remain support-faithful for downstream validation, auditing, or schema-constrained pipelines.
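The range-constrained branch can be illustrated with two small helpers. Both names (`clip_to_range`, `snap_to_support`) are hypothetical, and resolving ties toward the smaller support value is an assumption; only the clipping and nearest-support behavior themselves come from the description above.

```python
import bisect


def clip_to_range(value, observed_min, observed_max):
    """Keep a continuous value inside the observed training range."""
    return min(max(value, observed_min), observed_max)


def snap_to_support(value, support):
    """Map value to the nearest element of a sorted list of observed integers."""
    i = bisect.bisect_left(support, value)
    candidates = support[max(i - 1, 0): i + 1]  # neighbors on each side, if any
    return min(candidates, key=lambda s: abs(s - value))


age_support = [21, 34, 45, 52]
clipped_income = clip_to_range(130000.0, 42000.0, 120000.0)
snapped_age = snap_to_support(40, age_support)
```

With enforce_min_max_values=False neither helper would be applied, so a value such as 130000.0 could pass through unchanged.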
Missing values
Missingness is modeled per column through the observed missing fraction:
- numeric columns preserve an estimated nan_frac
- categorical columns reserve probability mass for missing values
- synthetic samples may therefore contain missing values when the training data does
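A minimal way to picture per-column missingness is to estimate the missing fraction at fit time and re-apply it independently per row at sample time. The helper names are hypothetical, missing entries are represented by None or NaN for simplicity, and the independent Bernoulli draw is an assumption about how the fraction is used.

```python
import math
import random


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_nan_frac(column):
    """Observed fraction of missing entries in a column."""
    return sum(1 for v in column if is_missing(v)) / len(column)


def apply_missingness(values, nan_frac, seed=None):
    """Blank out each generated value independently with probability nan_frac."""
    rng = random.Random(seed)
    return [None if rng.random() < nan_frac else v for v in values]


nan_frac = fit_nan_frac([42000.0, None, 91000.0, float("nan")])
masked = apply_missingness([50000.0, 60000.0, 70000.0, 80000.0], nan_frac, seed=7)
```

A column with no missing training values has nan_frac equal to 0, so its synthetic counterpart never receives a missing entry under this scheme.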
Type handling
Column handling is determined from the training DataFrame:
- datetime64 columns (tz-naive or tz-aware) are detected first and modeled as continuous float-seconds since epoch; timezone is preserved in the output dtype
- numeric dtypes are modeled as either integer or continuous
- non-numeric dtypes are treated as categorical
- output columns are cast back toward the original dtype after generation
For datetime columns, DP noise is applied to the 100-bin histogram of float-seconds, which avoids the uniqueness problem of nanosecond integer encoding and keeps sensitivity well-defined (±1 per bin count) regardless of timestamp granularity.
For integer-valued numeric columns, the implementation detects integer structure from the observed non-missing values and uses a dedicated inverse ECDF path.
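The two type-handling ideas above, float-seconds encoding for datetimes and integer detection from observed values, can be sketched with the standard library alone. The function names are hypothetical, and reading tz-naive timestamps as UTC is an assumption made so the round trip is well-defined; the package works on pandas dtypes rather than on `datetime` objects.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_seconds(stamp):
    """Float seconds since the Unix epoch; naive stamps are read as UTC."""
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp.timestamp()


def from_epoch_seconds(seconds, tz=None):
    """Inverse transform; tz=None returns a naive datetime, otherwise an aware one."""
    aware = datetime.fromtimestamp(seconds, tz=timezone.utc)
    if tz is None:
        return aware.replace(tzinfo=None)
    return aware.astimezone(tz)


def looks_integer(values):
    """True when every non-missing value is a whole number."""
    present = [v for v in values if v is not None]
    return bool(present) and all(float(v).is_integer() for v in present)


signup = datetime(2022, 1, 15)
restored = from_epoch_seconds(to_epoch_seconds(signup))

eastern = timezone(timedelta(hours=-5))
signup_aware = datetime(2022, 1, 15, 9, 30, tzinfo=eastern)
restored_aware = from_epoch_seconds(to_epoch_seconds(signup_aware), tz=eastern)
```

Passing the original timezone back into the inverse step is what keeps the output dtype timezone-preserving in this sketch.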
Data Contract
Expected input:
- a non-empty pandas.DataFrame
- tabular columns with numeric, datetime, or categorical-like values
- optional missing values
Current implementation details worth knowing:
- column order is preserved
- correlations are computed in Gaussian latent space with Pearson correlation
- the correlation matrix is repaired to the nearest valid correlation matrix when noise or numerical issues make it non-PSD
- categorical labels are sampled from observed label support
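To show why a repair step is needed after noise is added, the sketch below detects a non-positive-definite matrix with a hand-written Cholesky factorization and then repairs it. Note the substitution: it shrinks off-diagonal entries toward zero until factorization succeeds, which is a simpler stand-in and not a nearest-correlation-matrix projection. The function names and the step size are hypothetical, and the input is assumed to be symmetric.

```python
import math


def cholesky(matrix):
    """Return the lower Cholesky factor, or None if matrix is not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 1e-12:
                    return None
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def repair_correlation(matrix, step=0.05):
    """Shrink off-diagonal entries toward zero until Cholesky succeeds."""
    n = len(matrix)
    weight = 1.0
    while True:
        candidate = [
            [1.0 if i == j else weight * max(-1.0, min(1.0, matrix[i][j])) for j in range(n)]
            for i in range(n)
        ]
        if cholesky(candidate) is not None:
            return candidate
        weight = max(weight - step, 0.0)  # weight 0 gives the identity, which always works


# Pairwise-plausible but jointly inconsistent correlations, e.g. after noise.
noisy = [
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
]
repaired = repair_correlation(noisy)
```

The repaired matrix has a valid Cholesky factor, which is what drawing correlated latent Gaussian samples requires.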
Reproducibility
There are two independent random entry points:
- random_state in fit(...) controls fitting-time randomness, including privacy noise and randomized ECDF operations
- seed in sample(...) controls synthetic sample generation after fitting
For exact reproducibility, set both.
Project Metadata
- Package name: npgc
- Current version: 0.2.0
- Issue tracker: https://github.com/gdiaz95/NPGC/issues