Profile-based synthetic data for secure research: a disclosure-safe profile crosses the boundary, and the synthesizer never sees the real data. Zero dependencies. An OIS tool.

oissyntheticdata

Profile-based synthetic data for secure research environments. Zero third-party dependencies; Python standard library only.

An OIS tool · ois.co.il · maintained by Dr Yohanan Ouaknine (ORCID 0000-0002-4186-7351)

A sensitive dataset never leaves the secure environment. Instead, a disclosure-safe profile of it crosses the boundary, and the synthesizer rebuilds a structurally faithful synthetic copy from the profile alone, so it never sees the real data. You develop and debug your analysis off-site on the synthetic copy, then run the final, unchanged script on the real data on-premises and release only vetted aggregate results.

New to this and just want to use it? Start with the researcher tutorial, a plain-language, end-to-end walk-through.

READ THIS FIRST

The synthetic data is ONLY for testing that your code runs. It exists so your script executes end to end: types line up, joins resolve, every category and edge case appears, nothing throws an error. Do not analyse it. Do not run statistics or regressions on it, do not fit or train models on it, and do not report any number from it. The numbers are deliberately meaningless; only their structure is real. Every result you report must come from running your finished, unchanged code on the real data, on-premises.

   INSIDE (real data)            OUTSIDE (no real data)          INSIDE (control)
   ..................            ......................          ................
   01 profile        --->        02 synthesize        ..>        03 compare
   real  -->  profile            profile  -->  synthetic         real vs synthetic
   (the ONLY artefact            (never reads the real           structural fidelity
    that leaves)                  data)                           + referential integrity

Only two things ever cross the boundary, each after the data owner authorises it: the profile (going out) and, at the very end, your aggregate results (coming back). The microdata stays put.
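A minimal sketch of what "the final, unchanged script" can look like in practice: the file path is the only thing that differs between the off-site run and the on-premises run. The function names and the `status` column are hypothetical, chosen for illustration; they are not part of oissyntheticdata.

```python
import csv
import io
from collections import Counter


def count_by_level(handle, column):
    """Return {level: row count} for one column of an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row[column]] += 1
    return dict(counts)


def run_analysis(csv_path, column="status"):
    # The path is the only thing that differs between the off-site run
    # (synthetic file) and the on-premises run (real file).
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return count_by_level(handle, column)


# Off-site smoke test on a tiny in-memory stand-in for a synthetic file.
# The counts it prints are meaningless; only the fact that it runs matters.
sample = io.StringIO("prisoner_id,status\n1,A\n2,B\n3,A\n")
print(count_by_level(sample, "status"))
```

Keeping the path as the single parameter means nothing in the analysis logic has to be edited when the script moves inside.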

Provenance

The disclosure-control concept this tool implements (develop your analysis on disclosure-safe synthetic data, then run the final code on the real data in place) was first applied in research at the Israel Prison Service research unit, in a study of terrorist recidivism following the 2011 Shalit prisoner exchange, under Research Committee authorization. This package is a later, general, open implementation of that concept; the package itself was not used in that research. OIS offers deployment, validation, and training services to government research units and academic researchers around the open core.

Why

Secure research environments typically forbid pip/conda and have no internet access. This package installs by copying one directory, runs on the standard library alone, and is small enough for a data owner to read and audit in full. The synthetic data is built for code-path coverage, not statistical realism: it exercises every branch, filter, join and edge case your analysis will hit on the real data. Synthetic numbers are never reported.

A higher-fidelity, joint-distribution synthesizer (sequential CART, in the synthpop tradition) reproduces conditional relationships but must read the real microdata, so it can only run on-premises. This profile pipeline takes the opposite trade: lower joint fidelity in exchange for a stronger, simpler boundary, because the component that leaves the environment never touched a real record.

Install

pip install oissyntheticdata

Or, for a locked environment with no internet: copy the four files in scripts/ onto the machine and run them directly. No install, no dependencies. Python 3.7+.

The four stages

Stage	Where	Reads real data?	Output
`00 add-month`	anywhere	no	`<file>_with_month.csv` (derives `<date>_month`)
`01 profile`	inside	yes	`profile_<base>.json`, `profile_summary.md`
`02 synthesize`	outside	no	`synthetic_<base>.csv`
`03 compare`	inside only	yes	`comparison_report.md`, `comparison_<base>.csv`

03 compare is an inside-the-premises control, not a researcher step: it reads the real data to confirm the synthetic bed is faithful enough, and only column-level scores leave. It is never run off-site.
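For stage 00 in the table above, the idea of deriving a `<date>_month` column can be sketched as follows. This is an illustration only, not the package's script: it assumes ISO `YYYY-MM-DD` dates and an integer month 1 to 12, neither of which is confirmed here, and `add_month` is a hypothetical name.

```python
from datetime import date


def add_month(rows, date_column):
    """Return copies of rows with an extra '<date_column>_month' field.

    Assumption: dates are ISO 'YYYY-MM-DD' strings. Unparseable or
    missing dates yield an empty month instead of raising.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        raw = row.get(date_column) or ""
        try:
            new_row[date_column + "_month"] = date.fromisoformat(raw).month
        except ValueError:
            new_row[date_column + "_month"] = ""
        out.append(new_row)
    return out


rows = [{"incident_date": "2021-03-14"}, {"incident_date": "n/a"}]
with_month = add_month(rows, "incident_date")
```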

Command line

# INSIDE: write a disclosure-safe profile of the real data (one or many files)
oissyntheticdata profile inmates.csv incidents.csv judgements.csv

# OUTSIDE: rebuild synthetic data from the profile only (acts on the newest run)
oissyntheticdata synthesize

# INSIDE-ONLY control: structural fidelity + cross-file referential integrity
oissyntheticdata compare

Stage 01 creates output/run_NNN_YYYY-MM-DD/; stages 02 and 03 act on the newest run folder by default (or pass an explicit one). oissd is a short alias, and python -m oissyntheticdata ... works too.
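How "newest run folder" is decided is not spelled out above. One plausible reading, sketched here under the assumption that the highest `NNN` in `run_NNN_YYYY-MM-DD` wins, looks like this. `newest_run` is a hypothetical helper, not the package's API.

```python
import re

# Matches folder names such as run_003_2026-06-11.
RUN_PATTERN = re.compile(r"^run_(\d+)_(\d{4}-\d{2}-\d{2})$")


def newest_run(folder_names):
    """Return the run folder with the highest run number, or None."""
    runs = []
    for name in folder_names:
        match = RUN_PATTERN.match(name)
        if match:
            runs.append((int(match.group(1)), name))
    return max(runs)[1] if runs else None


print(newest_run(["run_001_2026-06-10", "run_002_2026-06-11", "notes"]))
```

Passing an explicit run folder, as the text allows, avoids depending on any such ordering rule.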

Python API

import oissyntheticdata as oisd

# INSIDE
run_dir, reports = oisd.profile(["inmates.csv", "incidents.csv"], base_dir="work")

# OUTSIDE (profile only)
oisd.synthesize(run_dir=run_dir)

# INSIDE-ONLY control
run_dir, fidelity, integrity = oisd.compare(run_dir=run_dir, base_dir="work")

What the profile keeps (and hides)

The profile is the whole privacy posture. Per column it keeps a shape, never identifiable values:

Unique integer key: only "this is a unique key" plus a length range.
Fan-out / foreign key (e.g. prisoner_id): only the distribution of group sizes; never an id tied to its count.
Numeric: mean, standard deviation, and a quantile grid, with robust bounds (P1/P99 stand in for the true min/max so extremes do not leak).
Categorical: level frequencies, but any level with fewer than k records (default k = 5) is relabelled RARE_###, count kept, label dropped.
High-cardinality text / id: only a format signature (e.g. DD-DDDDDD) and a length range. Values are never enumerated.
Dates: format, range, per-year and per-month shape (for seasonality).
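Two of the rules above, rare-level suppression and format signatures, can be illustrated with a short sketch. It is not the package's code: the three-digit `RARE_###` numbering, the `A` symbol for letters, and both function names are assumptions made for the example.

```python
from collections import Counter


def suppress_rare_levels(values, k=5):
    """Return {label: count}; levels seen fewer than k times lose their label."""
    profile = {}
    rare_index = 0
    # most_common() orders by count, so RARE numbering follows frequency.
    for label, count in Counter(values).most_common():
        if count < k:
            rare_index += 1
            profile["RARE_{:03d}".format(rare_index)] = count  # count kept
        else:
            profile[label] = count
    return profile


def format_signature(value):
    """Map digits to D and letters to A; keep punctuation as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


levels = ["open"] * 6 + ["closed"] * 2 + ["appeal"]
print(suppress_rare_levels(levels))
print(format_signature("12-345678"))
```

In both cases only a shape survives: a count without its label, or a pattern without its value.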

Across files, columns that are key-like in two or more profiles become shared relational keys, so the synthetic children join correctly onto the synthetic parent.

Relational example

oissyntheticdata profile inmates.csv incidents.csv judgements.csv
oissyntheticdata synthesize
oissyntheticdata compare

inmates is the parent (prisoner_id unique); incidents and judgements are children sharing prisoner_id and incident_id. The synthesizer mints one shared key pool per shared key, so synthetic child keys are a subset of the synthetic parent keys. compare reports referential integrity (orphan keys, if any) alongside per-column fidelity.
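The orphan-key idea behind the referential-integrity report can be sketched in a few lines. This is an independent illustration, not the logic of `compare`; `orphan_keys` is a hypothetical name and the rows are made-up synthetic stand-ins.

```python
def orphan_keys(parent_rows, child_rows, key):
    """Return child key values that have no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    child_keys = {row[key] for row in child_rows}
    return sorted(child_keys - parent_keys)


inmates = [{"prisoner_id": "P1"}, {"prisoner_id": "P2"}]
incidents = [
    {"prisoner_id": "P1", "incident_id": "I1"},
    {"prisoner_id": "P1", "incident_id": "I2"},
    {"prisoner_id": "P9", "incident_id": "I3"},  # deliberately broken link
]
print(orphan_keys(inmates, incidents, "prisoner_id"))
```

An empty list is the result you want: every child key is a member of the parent key set.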

Rebuilding joined or merged files

Stage 01 detects the relationships between your files from the data itself. A shared column is treated as a link only when one file holds it uniquely (the parent) and another file's values are a repeating subset of it (the child); this is type-agnostic, so integer keys, string ids, and dates are all found, and a column that merely shares a name but is a plain attribute is ignored. The detected schema is printed and written to schema.json (names and fan-out quantiles only, so it is disclosure-safe). If detection ever picks the wrong parent on an unusual schema, you can override it.
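The detection rule described above (unique in the parent, a repeating subset in the child) can be sketched as follows. This is one reading of that rule rather than the package's implementation; `detect_links` and its input shape are hypothetical.

```python
def is_unique(values):
    """True when the column is non-empty and has no repeated value."""
    return len(values) > 0 and len(set(values)) == len(values)


def detect_links(tables, column=None):
    """Find (parent, child) pairs for one shared column.

    tables maps a file name to the list of values that file holds in the
    shared column; `column` is accepted only for readability at call sites.
    """
    links = []
    for parent, parent_values in tables.items():
        if not is_unique(parent_values):
            continue  # a parent must hold the column uniquely
        parent_set = set(parent_values)
        for child, child_values in tables.items():
            if child == parent:
                continue
            child_set = set(child_values)
            repeats = len(child_set) < len(child_values)
            if repeats and child_set <= parent_set:
                links.append((parent, child))
    return links


shared = {"inmates": ["1", "2", "3"], "incidents": ["1", "1", "3"]}
print(detect_links(shared, "prisoner_id"))
```

Because the test compares values rather than types, string ids and dates behave the same as integers, which matches the type-agnostic behaviour described above.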

Stage 02 then synthesizes in parent-before-child order and attaches each child row to an existing synthetic parent, copying the link column and any inherited columns from that parent. This gives three things at once:

Referential integrity: every synthetic child key resolves to a synthetic parent, so a group-by or a single-key join runs with zero orphan keys, and you can rebuild a merged or aggregated table from the synthetic files exactly as you would from the real ones.
Realistic fan-out: the number of children per parent follows the real group-size distribution.
Within-row key pairing: when a child carries a second key that the real data shows is fixed by its parent (for example a judgement's prisoner_id is fixed once its incident_id is known), that key is inherited from the matched parent, so the pairing is exact. A judgement's incident therefore belongs to that judgement's prisoner, not just to some valid prisoner.
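The three properties above can be seen together in a small sketch of parent-before-child synthesis. It is illustrative only: the function names, the key formats, and the use of a plain list of group sizes as the fan-out distribution are assumptions, not the package's internals.

```python
import random


def synthesize_incidents(parent_keys, group_sizes, rng):
    """Attach children to existing synthetic parents.

    group_sizes stands in for the profiled fan-out distribution: each
    parent draws its number of children from this list.
    """
    incidents = []
    for prisoner_id in parent_keys:
        for _ in range(rng.choice(group_sizes)):
            incidents.append({
                "incident_id": "I{:05d}".format(len(incidents) + 1),
                "prisoner_id": prisoner_id,  # link column copied from parent
            })
    return incidents


def synthesize_judgements(incidents, n_rows, rng):
    """Each judgement inherits prisoner_id from its matched incident."""
    if not incidents:
        return []
    judgements = []
    for _ in range(n_rows):
        incident = rng.choice(incidents)
        judgements.append({
            "incident_id": incident["incident_id"],
            "prisoner_id": incident["prisoner_id"],
        })
    return judgements


rng = random.Random(0)  # fixed seed so the sketch is repeatable
prisoners = ["P{:04d}".format(i) for i in range(1, 6)]
incidents = synthesize_incidents(prisoners, [1, 2, 3], rng)
judgements = synthesize_judgements(incidents, 4, rng)
```

Because a judgement copies both keys from one incident row, the (incident_id, prisoner_id) pair can never disagree with the incidents file.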

This supports both the hierarchical model (files describing facets of one object, merged at synthesis time) and the simple shared-id model (several files that the research unit merges by grouping on a common id to check their interpretation).

Confidentiality model

k (min cell count, default 5): no categorical level, fan-out estimate, or surrogate key is reported from fewer than k real records, so nothing in the profile is traceable to one person.
Robust bounds: numeric extremes are reported at P1/P99, not the true min/max, so a single outlier cannot leak through the range.
Identifiers are never enumerated: only a format signature and length range leave; the synthesizer mints fresh keys.
Only the profile leaves; the real data never does. Develop on the synthetic copy off-site, run final code on the real data in place, release only vetted aggregates.
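To make the robust-bounds rule concrete, here is a minimal sketch that reports P1 and P99 in place of the true minimum and maximum. It uses a nearest-rank percentile, which is an assumption; the package may compute its quantiles differently, and both function names are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of a non-empty sorted list, p in (0, 100]."""
    rank = max(1, math.ceil(p / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def robust_bounds(values):
    """Return (P1, P99) so that single extreme records stay hidden."""
    ordered = sorted(values)
    return percentile(ordered, 1), percentile(ordered, 99)


# One extreme outlier among otherwise ordinary values.
sentence_days = list(range(1, 101)) + [100000]
print(robust_bounds(sentence_days))
```

The outlier at 100000 does not appear in the reported range, which is the point of using P1/P99.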

oissyntheticdata is a disclosure-control aid, not a formal privacy guarantee. For a mathematical guarantee, combine it with differential privacy; in either case, apply output checking (statistical disclosure control) to anything released.

How the standalone scripts are built

scripts/00 to scripts/03 are auto-generated from src/oissyntheticdata/ by tools/build_standalone.py, inlining the shared _common and _io modules so each script is a single self-contained file. The package and the standalone scripts therefore produce identical output; tests/test_roundtrip.py enforces that they stay in sync.

Government use of synthetic data

The develop-on-synthetic, run-on-real workflow is established practice at national statistical agencies:

U.S. Census Bureau, SIPP Synthetic Beta (SSB): researchers develop code on synthetic linked survey and administrative data, and Census staff run the validated code on the confidential data and release only vetted output. https://census.gov/programs-surveys/sipp/guidance/sipp-synthetic-beta-data-product.html
U.S. Census Bureau, OnTheMap and LEHD LODES: the first production deployment of formal privacy (2008), built on partially synthetic origin-destination data. https://lehd.ces.census.gov/
U.S. Census Bureau, "What Are Synthetic Data?" (2021 factsheet), covering the Longitudinal Business Database, SIPP, OnTheMap, and 2020 Decennial Census uses. https://www.census.gov/content/dam/Census/library/factsheets/2021/what-are-synthetic-data/what-are-synthetic-data.pdf

Governance, support and contributing

oissyntheticdata is maintained in the open under the MIT license. Questions, bug reports, and change proposals go through public GitHub Issues and Pull Requests; see CONTRIBUTING.md. Decisions are made by the maintainer(s) listed in CITATION.cff via the public issue/PR process. There is no private support channel; keeping development and discussion public is part of the project's auditability goal. Releases are versioned and recorded in CHANGELOG.md.

Generative AI disclosure

A generative AI assistant (Claude, Anthropic) was used to help draft and refactor parts of the code and documentation. All output was reviewed, tested, and edited by the author(s), who take full responsibility for the design, correctness, and integrity of the software. Contributors are asked to disclose non-trivial AI assistance (see CONTRIBUTING.md).

Maintainer

Dr Yohanan Ouaknine, OIS (ois.co.il), yohanan.ouaknine@ois.co.il, ORCID 0000-0002-4186-7351. Formerly Head of the Research Branch, Israel Prison Service.

License

Release history

2.2.0 (this version): Jun 11, 2026
2.1.0: Jun 11, 2026
1.0.0: Jun 11, 2026
0.2.0: Jun 10, 2026
