Zero-dependency sequential CART synthesis for secure research (synthpop tradition), with relational support. An OIS tool.
oissyntheticdata
Pure-Python sequential CART synthesis — in the synthpop tradition, with zero third-party dependencies.
An OIS tool · ois.co.il · maintained by Dr Yohanan Ouaknine (ORCID 0000-0002-4186-7351)
oissyntheticdata generates a synthetic copy of a sensitive dataset that preserves the
relationships between variables, not just each column's marginal shape. It is
built for the secure-research workflow used by statistical agencies: develop
and debug your analysis on the synthetic data off-site, then run the final code
on the real data on-premises and release only vetted aggregate results.
It imports only the Python standard library (csv, json, math, random,
statistics, zipfile, xml.etree), so it can run inside a locked secure
environment with no pip install and is small enough to read and audit in full.
The approach was first deployed in a secure justice-research setting (a study of terrorist recidivism after the 2011 Shalit prisoner exchange, run on-premises at the Israel Prison Service under Research Committee authorization); this package generalises and opens it. OIS offers deployment, validation, and training services to government research units and academic researchers around the open core.
Why this exists
The workflow above follows a well-established paradigm in statistical disclosure control. The
synthetic data is test data that should resemble the real data closely but is
never used for final inference; the code developed on it is what gets run on the
confidential data (Nowok, Raab & Dibben 2016; US Census Bureau SIPP Synthetic
Beta). oissyntheticdata is a dependency-free re-implementation of the core engine those
tools use — sequential CART synthesis (Reiter 2005) — packaged for locked
environments.
It complements a metadata-only synthesizer (which preserves each column's shape
but not the joint structure): oissyntheticdata fits on the real microdata on-premises
and therefore reproduces conditional relationships, at the cost of touching raw
records (so it must run inside the secure environment).
How it works (the engine)
Synthesis proceeds one column at a time in a chosen visit order:
- First column: drawn from its own empirical marginal, with cells smaller than min_leaf suppressed.
- Each later column Y: a CART (classification tree if Y is categorical, regression tree if continuous) is grown on the real data to predict Y from the columns already synthesized. Every leaf keeps the list of real Y values that reached it (its "donors").
- Drawing: for each synthetic row, route it down the tree using the values already generated for that row, reach a leaf, and sample a donor from that leaf (optionally jittered for continuous columns). Sampling from donors, rather than predicting a point, is what reproduces the conditional distribution.
Because each column is predicted from the previously synthesized columns, the
joint distribution is assembled sequentially (the standard synthpop approach).
visit: c1 -> c2 -> c3 -> ...
c1 ~ marginal(c1)
c2 ~ leaf_donor( CART(c2 ~ c1) , synthetic c1 )
c3 ~ leaf_donor( CART(c3 ~ c1,c2) , synthetic c1,c2 )
...
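The two-column case of the scheme above can be sketched in a few lines of standard-library Python. This is an illustration only, not the package's code: the helper names (sse, fit_stump, draw, synthesize_two_columns) are hypothetical, the tree is limited to a single split, and the first column is resampled directly without the small-cell suppression the real engine applies.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(x, y, min_leaf=5):
    """Fit a depth-1 regression tree of y on x; each leaf keeps its real y values (donors)."""
    pairs = sorted(zip(x, y))
    best_sse, best_cut = None, None
    # Only consider splits that leave at least min_leaf records on each side.
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal x values
        total = sse([p[1] for p in pairs[:i]]) + sse([p[1] for p in pairs[i:]])
        if best_sse is None or total < best_sse:
            best_sse, best_cut = total, (pairs[i - 1][0] + pairs[i][0]) / 2
    if best_cut is None:  # no admissible split: a single leaf holds every donor
        return {"cut": None, "left": list(y), "right": []}
    return {
        "cut": best_cut,
        "left": [b for a, b in pairs if a <= best_cut],
        "right": [b for a, b in pairs if a > best_cut],
    }

def draw(stump, x_value, rng):
    """Route one synthetic row to its leaf and sample a donor from that leaf."""
    if stump["cut"] is None or x_value <= stump["cut"]:
        return rng.choice(stump["left"])
    return rng.choice(stump["right"])

def synthesize_two_columns(c1, c2, n, min_leaf=5, seed=0):
    rng = random.Random(seed)
    stump = fit_stump(c1, c2, min_leaf)             # CART(c2 ~ c1), fitted on real data
    syn_c1 = [rng.choice(c1) for _ in range(n)]     # c1 ~ marginal(c1)
    syn_c2 = [draw(stump, v, rng) for v in syn_c1]  # c2 ~ leaf donor given synthetic c1
    return syn_c1, syn_c2

# Toy "real" data: c2 is low when c1 is below 50 and high otherwise.
real_c1 = [20, 25, 30, 35, 40, 45, 55, 60, 65, 70, 75, 80]
real_c2 = [1, 2, 1, 3, 2, 1, 10, 12, 11, 13, 12, 10]
syn_c1, syn_c2 = synthesize_two_columns(real_c1, real_c2, n=200)
```

Because every synthetic c2 is a donor from the leaf that its synthetic c1 falls into, the low/high relationship between the two columns carries over to the synthetic rows.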
Confidentiality model
- min_leaf (k, default 5): no leaf and no marginal cell is built from fewer than k real records, so every drawn value comes from a pool of at least k individuals rather than from a cell that isolates one person. This also caps tree depth and prevents the tree from memorizing individuals.
- smoothing (default 0): optional Gaussian jitter on continuous donors, bounded to the leaf's range, so exact real values are not echoed verbatim.
- drop: direct identifiers (national ID, names, record keys) should be dropped before synthesis; oissyntheticdata does not attempt to anonymize them.
- Only synthetic data leaves; the real data never does. The intended use is to take the synthetic file off-site for development and re-run final code on the real data in place.
oissyntheticdata is a disclosure-control aid, not a formal privacy guarantee. For a
mathematical guarantee, combine it with differential privacy or apply output
checking (statistical disclosure control) to anything released.
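The two numeric controls described above can be illustrated with a short sketch. It rests on assumptions: safe_marginal and jitter_within_leaf are hypothetical helpers, not functions of the package; suppressed cells are simply dropped here, which may differ from how the package handles them; and smoothing is treated as a standard deviation, which the text above does not specify.

```python
import random
from collections import Counter

def safe_marginal(values, min_leaf=5):
    """Keep only categories backed by at least min_leaf real records."""
    counts = Counter(values)
    return {cat: n for cat, n in counts.items() if n >= min_leaf}

def jitter_within_leaf(donor, leaf_values, smoothing, rng):
    """Add Gaussian noise to a donor value, clamped to the leaf's observed range."""
    if smoothing <= 0:
        return donor  # smoothing disabled: return the donor unchanged
    lo, hi = min(leaf_values), max(leaf_values)
    noisy = donor + rng.gauss(0.0, smoothing)  # assumption: smoothing is the noise std dev
    return min(hi, max(lo, noisy))

# Category "c" has only 2 records, so it falls below the k = 5 floor.
marginal = safe_marginal(["a"] * 6 + ["b"] * 5 + ["c"] * 2, min_leaf=5)

rng = random.Random(1)
leaf = [30.0, 32.0, 35.0, 38.0, 40.0]
jittered = [jitter_within_leaf(rng.choice(leaf), leaf, 2.0, rng) for _ in range(500)]
```

Clamping keeps every jittered value inside the range already spanned by the leaf's donors, so the noise cannot produce values outside what the leaf's real records span.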
Design decisions and trade-offs
The value of oissyntheticdata is in its design choices, which are deliberately narrow:
- Where the synthesizer may run is a first-class concern. oissyntheticdata fits on real microdata to preserve joint structure, so it runs on-premises; only the synthetic output leaves. A metadata-only synthesizer can run off-site but preserves only per-column structure. Choosing fidelity-with-on-prem-execution over portability-with-lower-fidelity is intentional, and the two roles are kept as separate tools so the confidentiality reasoning stays explicit.
- Donor-leaf sampling, not point prediction. Drawing a real value from the matching leaf reproduces the conditional distribution; predicting a mean would not.
- One confidentiality invariant. min_leaf (k) applies the same k-record floor to every marginal cell, tree leaf, fan-out estimate, and surrogate key, instead of scattering ad hoc thresholds.
- Relational by conditioning, not joining. Children are synthesized conditioned on the parent's synthetic attributes and linked by surrogate keys, preserving referential integrity without materialising a real join.
- Build on, don't reinvent. The estimator is the established CART-synthesis method; the new work is the dependency-free, auditable, relational realisation for locked environments.
Scope boundaries are equally deliberate: single-parent schemas only, no enforced high-order interactions or arithmetic identities, and no formal privacy guarantee (see Limitations).
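The difference between donor-leaf sampling and point prediction is easy to see on a single leaf. The sketch below is purely illustrative: the donor values are made up, and it compares the spread of 1000 draws under each strategy.

```python
import random
import statistics

# Hypothetical real y values that reached one leaf of the tree.
leaf_donors = [2, 3, 3, 4, 9, 10, 11, 30]

rng = random.Random(0)
donor_draws = [rng.choice(leaf_donors) for _ in range(1000)]  # donor-leaf sampling
mean_draws = [statistics.mean(leaf_donors)] * 1000            # point prediction

spread_donor = statistics.pstdev(donor_draws)  # positive: the leaf's variation is kept
spread_mean = statistics.pstdev(mean_draws)    # zero: every row gets the same value
```

Point prediction collapses the whole leaf to one number, so the synthetic column would understate variance and lose the skew from the outlying donor; sampling donors keeps both.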
Governance, support & contributing
oissyntheticdata is maintained in the open under the MIT license. Questions, bug reports,
and change proposals go through public GitHub Issues and Pull Requests; see
CONTRIBUTING.md. Decisions are made by the maintainer(s)
listed in CITATION.cff via the public issue/PR process. There is
no private support channel — keeping development and discussion public is part of
the project's auditability goal. Releases are versioned and recorded in
CHANGELOG.md.
Generative AI disclosure
A generative AI assistant (Claude, Anthropic) was used to help draft and refactor
parts of the code and documentation. All output was reviewed, tested, and edited
by the author(s), who take full responsibility for the design, correctness, and
integrity of the software. The design decisions and abstractions above, and the
testing and documentation practices, are the author(s)' own. Contributors are
asked to disclose non-trivial AI assistance (see CONTRIBUTING.md).
Install
pip install oissyntheticdata # once published
# or, in a locked environment, just copy the oissyntheticdata/ folder next to your code
No dependencies. Python 3.7+.
Usage
Command line:
python -m oissyntheticdata real.csv -o synthetic.csv --drop national_id --min-leaf 5
python -m oissyntheticdata data.xlsx -o synthetic.csv --visit "age,offense,violent" --smoothing 0.5
Library:
import oissyntheticdata

# one call
oissyntheticdata.synthesize_file("real.csv", "synthetic.csv",
                                 drop=["national_id"], min_leaf=5)

# or step by step
header, cols = oissyntheticdata.read_table("real.xlsx")
out_header, out_cols = oissyntheticdata.synthesize(header, cols,
                                                   drop=["national_id"], min_leaf=5)
oissyntheticdata.write_table("synthetic.csv", out_header, out_cols)
Key parameters: n (rows, default = real), visit (column order),
drop (identifiers to exclude), min_leaf (k), max_depth, smoothing, seed.
Related tables (multi-table synthesis)
For data split across linked tables (e.g. one row per inmate, many judgements per
inmate), synthesize_relational keeps referential integrity and the
parent → child structure:
import oissyntheticdata

oissyntheticdata.synthesize_relational_files(
    {"inmates": "inmates.csv", "judgements": "judgements.csv"},
    schema={
        "inmates": {"key": "prisoner_id"},
        "judgements": {"key": "judgement_id",
                       "parent": "inmates", "foreign_key": "prisoner_id"},
    },
    out_dir="out", min_leaf=5,
)
# -> out/synthetic_inmates.csv, out/synthetic_judgements.csv
How it works: the parent is synthesized first and given fresh surrogate keys; a regression CART models how many children each parent has (the fan-out) from the parent's attributes; and each child's attributes are synthesized conditioned on its parent's synthetic attributes. The result: every synthetic foreign key points at a synthetic parent (0 orphan joins), the number of children per parent is realistic, and parent → child relationships survive (e.g. high-risk parents keep their child-row patterns). Supports a single-parent DAG — star, snowflake, and parent → child → grandchild chains.
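The parent-then-children logic described above can be sketched with the standard library alone. This is a simplified illustration, not the package's implementation: synthesize_children is a hypothetical helper, a single parent attribute ("group") stands in for the leaf a regression CART would assign, and the min_leaf floor is omitted for brevity.

```python
import random
from collections import Counter, defaultdict

def synthesize_children(real_parents, real_children, syn_parents, seed=0):
    """real_parents: (key, group) pairs; real_children: (parent_key, value) pairs;
    syn_parents: (surrogate_key, group) pairs.
    Returns synthetic children as (child_key, parent_surrogate_key, value) tuples."""
    rng = random.Random(seed)
    group_of = dict(real_parents)
    n_children = Counter(pk for pk, _ in real_children)
    fanout_donors = defaultdict(list)  # group -> real child counts per parent
    value_donors = defaultdict(list)   # group -> real child values
    for key, group in real_parents:
        fanout_donors[group].append(n_children.get(key, 0))
    for pk, value in real_children:
        value_donors[group_of[pk]].append(value)
    out = []
    for skey, group in syn_parents:
        k = rng.choice(fanout_donors[group])  # fan-out drawn from similar real parents
        for _ in range(k):
            value = rng.choice(value_donors[group])  # child conditioned on its parent
            out.append((f"C{len(out) + 1}", skey, value))
    return out

# Toy data: "high" parents have more children and different child values.
real_parents = [(1, "low"), (2, "low"), (3, "high"), (4, "high")]
real_children = [(1, "theft"), (3, "assault"), (3, "assault"),
                 (4, "assault"), (4, "robbery"), (4, "assault")]
syn_parents = [("P1", "low"), ("P2", "high"), ("P3", "high"), ("P4", "low")]
syn_children = synthesize_children(real_parents, real_children, syn_parents)
```

Every synthetic child is created under a synthetic parent's surrogate key, so orphan foreign keys cannot occur by construction, and no real key ever appears in the output.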
Limitations
- Fits on real microdata, so run it on-premises; the synthetic output is what you take off-site.
- Relational synthesis covers a single-parent DAG (star / snowflake / chains). Many-to-many relationships and compound keys are not modelled — pre-join or pre-resolve them to a surrogate key first.
- CART captures pairwise/low-order structure well; very high-order interactions and exact arithmetic identities (e.g. rate = a/b) are not enforced.
- Pure Python: comfortable up to a few thousand rows × a few dozen columns; larger data is slower than a compiled implementation.
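One possible workaround for the arithmetic-identity limitation, offered here as a suggestion rather than a feature of the package, is to recompute derived columns from their synthetic components after synthesis. The helper recompute_rate and the column names a, b, and rate are hypothetical.

```python
def recompute_rate(rows, num="a", den="b", out="rate"):
    """Return copies of the rows in which out == num / den holds exactly.

    Rows with a zero or missing denominator get None instead of a rate.
    """
    fixed = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are not mutated
        row[out] = row[num] / row[den] if row[den] else None
        fixed.append(row)
    return fixed

# Synthetic rows whose 'rate' no longer matches a / b after synthesis.
synthetic_rows = [
    {"a": 3, "b": 4, "rate": 0.9},
    {"a": 1, "b": 0, "rate": 0.5},
]
consistent_rows = recompute_rate(synthetic_rows)
```

The same idea applies before synthesis: dropping the derived column and rebuilding it afterwards avoids asking the tree to learn an identity it cannot reproduce exactly.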
Lineage & sources
- Rubin, D.B. (1993). Statistical disclosure limitation. J. Official Statistics 9(2).
- Little, R.J.A. (1993). Statistical analysis of masked data. J. Official Statistics 9(2).
- Reiter, J.P. (2005). Using CART to generate partially synthetic public use microdata. J. Official Statistics 21(3).
- Reiter, Oganian & Karr (2009). Verification servers. Comput. Stat. Data Anal. 53(4):1475–1482. https://doi.org/10.1016/j.csda.2008.10.006
- Nowok, Raab & Dibben (2016). synthpop: Bespoke Creation of Synthetic Data in R. J. Statistical Software 74(11). https://doi.org/10.18637/jss.v074.i11
- Drechsler, J. (2011). Synthetic Datasets for Statistical Disclosure Control. Springer.
- US Census Bureau, SIPP Synthetic Beta + Cornell Synthetic Data Server (synthetic development data + validation on confidential files).
Maintainer
Dr Yohanan Ouaknine — OIS (ois.co.il), yohanan.ouaknine@ois.co.il, ORCID 0000-0002-4186-7351. Department of Criminology, Ashkelon Academic College; formerly Head of the Research Branch, Israel Prison Service.
License
MIT — see LICENSE.
Citation
If you use oissyntheticdata, please cite this software (see CITATION.cff) and
the methodological lineage above (Reiter 2005; Nowok, Raab & Dibben 2016). The
method was first applied in Ouaknine, Elisha & Hasisi (2026), The Effect of Mass
Prisoner Release on Terrorist Recidivism: A Propensity Score Analysis of the Shalit
Deal (in publication).