Profile-based synthetic data for secure research: a disclosure-safe profile crosses the boundary, and the synthesizer never sees the real data. Zero dependencies. An OIS tool.

oissyntheticdata

Profile-based synthetic data for secure research environments. Zero third-party dependencies; Python standard library only.

An OIS tool · ois.co.il · maintained by Dr Yohanan Ouaknine (ORCID 0000-0002-4186-7351)

A sensitive dataset never leaves the secure environment. Instead, a disclosure-safe profile of it crosses the boundary, and the synthesizer rebuilds a structurally faithful synthetic copy from the profile alone, so it never sees the real data. You develop and debug your analysis off-site on the synthetic copy, then run the final, unchanged script on the real data on-premises and release only vetted aggregate results.

New to this and just want to use it? Start with the researcher tutorial, a plain-language, end-to-end walk-through.

   INSIDE (real data)            OUTSIDE (no real data)          INSIDE (control)
   ..................            ......................          ................
   01 profile        --->        02 synthesize        ..>        03 compare
   real  -->  profile            profile  -->  synthetic         real vs synthetic
   (the ONLY artefact            (never reads the real           structural fidelity
    that leaves)                  data)                           + referential integrity

Only two things ever cross the boundary, each after the data owner authorises it: the profile (going out) and, at the very end, your aggregate results (coming back). The microdata stays put.

Provenance

The disclosure-control concept this tool implements (develop your analysis on disclosure-safe synthetic data, then run the final code on the real data in place) was first applied in research at the Israel Prison Service research unit, in a study of terrorist recidivism following the 2011 Shalit prisoner exchange, under Research Committee authorization. This package is a later, general, open implementation of that concept; the package itself was not used in that research. OIS offers deployment, validation, and training services to government research units and academic researchers around the open core.

Why

Secure research environments typically forbid pip/conda and have no internet access. This package installs by copying one directory, runs on the standard library alone, and is small enough for a data owner to read and audit in full. The synthetic data is built for code-path coverage (every branch, filter, join and edge case your analysis will hit on the real data), not for statistical realism. Synthetic numbers are never reported.

A higher-fidelity, joint-distribution synthesizer (sequential CART, in the synthpop tradition) reproduces conditional relationships but must read the real microdata, so it can only run on-premises. This profile pipeline takes the opposite trade: lower joint fidelity in exchange for a stronger, simpler boundary, because the component that leaves the environment never touched a real record.

Install

pip install oissyntheticdata

Or, for a locked environment with no internet: copy the four files in scripts/ onto the machine and run them directly. No install, no dependencies. Python 3.7+.

The four stages

Stage          Where        Reads real data?   Output
00 add-month   anywhere     no                 <file>_with_month.csv (derives <date>_month)
01 profile     inside       yes                profile_<base>.json, profile_summary.md
02 synthesize  outside      no                 synthetic_<base>.csv
03 compare     inside only  yes                comparison_report.md, comparison_<base>.csv

03 compare is an inside-the-premises control, not a researcher step: it reads the real data to confirm the synthetic bed is faithful enough, and only column-level scores leave. It is never run off-site.

Command line

# INSIDE: write a disclosure-safe profile of the real data (one or many files)
oissyntheticdata profile inmates.csv incidents.csv judgements.csv

# OUTSIDE: rebuild synthetic data from the profile only (acts on the newest run)
oissyntheticdata synthesize

# INSIDE-ONLY control: structural fidelity + cross-file referential integrity
oissyntheticdata compare

Stage 01 creates output/run_NNN_YYYY-MM-DD/; stages 02 and 03 act on the newest run folder by default (or pass an explicit one). oissd is a short alias, and python -m oissyntheticdata ... works too.
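
How "newest" is decided is not spelled out here. As a rough sketch only, the following assumes it means the highest run number in the folder name; the RUN_PATTERN regex and the newest_run helper are hypothetical illustrations, not part of the package.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern for run folders named run_NNN_YYYY-MM-DD.
RUN_PATTERN = re.compile(r"^run_(\d+)_(\d{4}-\d{2}-\d{2})$")


def newest_run(output_dir):
    """Return the run folder with the highest run number, or None.

    Assumption: "newest" means the largest NNN, not the latest date.
    """
    candidates = []
    for path in Path(output_dir).iterdir():
        match = RUN_PATTERN.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_001_2026-01-05", "run_002_2026-01-09", "notes"):
        (Path(tmp) / name).mkdir()
    picked = newest_run(tmp).name

print(picked)  # run_002_2026-01-09
```

If the default ever picks the wrong folder, passing an explicit run folder avoids the question altogether.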

Python API

import oissyntheticdata as oisd

# INSIDE
run_dir, reports = oisd.profile(["inmates.csv", "incidents.csv"], base_dir="work")

# OUTSIDE (profile only)
oisd.synthesize(run_dir=run_dir)

# INSIDE-ONLY control
run_dir, fidelity, integrity = oisd.compare(run_dir=run_dir, base_dir="work")

What the profile keeps (and hides)

The profile is the whole privacy posture. Per column it keeps a shape, never identifiable values:

  • Unique integer key: only "this is a unique key" plus a length range.
  • Fan-out / foreign key (e.g. prisoner_id): only the distribution of group sizes; never an id tied to its count.
  • Numeric: mean, standard deviation, and a quantile grid, with robust bounds (P1/P99 stand in for the true min/max so extremes do not leak).
  • Categorical: level frequencies, but any level with fewer than k records (default k = 5) is relabelled RARE_###: the count is kept and the label is dropped.
  • High-cardinality text / id: only a format signature (e.g. DD-DDDDDD) and a length range. Values are never enumerated.
  • Dates: format, range, per-year and per-month shape (for seasonality).
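
To make two of the rules above concrete, here is a minimal sketch of a format signature and of rare-level relabelling. It illustrates the idea only and is not the package's own code: the function names, the use of "A" for letters, and the numbering order of the RARE_### labels are all assumptions.

```python
from collections import Counter


def format_signature(value):
    """Replace every digit with 'D' and every letter with 'A'.

    Punctuation is kept, so '12-345678' becomes 'DD-DDDDDD'.
    The 'A' code for letters is an assumption of this sketch.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_categorical(values, k=5):
    """Return {level: count}, hiding the label of any level seen < k times."""
    counts = Counter(values)
    profile = {}
    rare_index = 0
    for level, n in sorted(counts.items()):
        if n < k:
            rare_index += 1
            profile["RARE_{:03d}".format(rare_index)] = n  # count kept, label dropped
        else:
            profile[level] = n
    return profile


# Made-up values, for illustration only.
ids = ["12-345678", "98-765432"]
print({format_signature(v) for v in ids})
# {'DD-DDDDDD'}

offences = ["theft"] * 7 + ["fraud"] * 5 + ["arson"] * 2
print(profile_categorical(offences))
# {'RARE_001': 2, 'fraud': 5, 'theft': 7}
```

In both cases the profile records a shape (a pattern, a count) while the underlying value or rare label stays inside.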

Across files, columns that are key-like in two or more profiles become shared relational keys, so the synthetic children join correctly onto the synthetic parent.

Relational example

oissyntheticdata profile inmates.csv incidents.csv judgements.csv
oissyntheticdata synthesize
oissyntheticdata compare

inmates is the parent (prisoner_id unique); incidents and judgements are children sharing prisoner_id and incident_id. The synthesizer mints one shared key pool per shared key, so synthetic child keys are a subset of the synthetic parent keys. compare reports referential integrity (orphan keys, if any) alongside per-column fidelity.
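
The shared key pool can be pictured with the sketch below. It is an illustration under stated assumptions, not the package's implementation: mint_key_pool, synth_child_keys, orphan_keys, the six-digit key range and the example group-size weights are all hypothetical.

```python
import random


def mint_key_pool(n_parents, seed=0):
    """Mint fresh, unique surrogate keys; none derives from a real identifier."""
    rng = random.Random(seed)
    return rng.sample(range(100000, 1000000), n_parents)


def synth_child_keys(pool, size_weights, seed=0):
    """Give each parent key a group size drawn from a fan-out distribution.

    size_weights maps a group size (children per parent) to its relative
    frequency, which is the only fan-out information assumed to be available.
    """
    rng = random.Random(seed)
    sizes = list(size_weights)
    weights = [size_weights[s] for s in sizes]
    child_keys = []
    for key in pool:
        n = rng.choices(sizes, weights=weights, k=1)[0]
        child_keys.extend([key] * n)
    return child_keys


def orphan_keys(child_keys, parent_keys):
    """Child keys with no matching parent; an empty set means integrity holds."""
    return set(child_keys) - set(parent_keys)


pool = mint_key_pool(50)
incident_keys = synth_child_keys(pool, {0: 10, 1: 25, 2: 10, 3: 5})
print(orphan_keys(incident_keys, pool))  # set()
```

Because every child key is drawn from the same pool as the parent keys, the orphan set is empty by construction. A check of this kind is what a referential-integrity report would flag if the two ever diverged.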

Confidentiality model

  • k (min cell count, default 5): no categorical level, fan-out estimate, or surrogate key is reported from fewer than k real records, so nothing in the profile is traceable to one person.
  • Robust bounds: numeric extremes are reported at P1/P99, not the true min/max, so a single outlier cannot leak through the range.
  • Identifiers are never enumerated: only a format signature and length range leave; the synthesizer mints fresh keys.
  • Only the profile leaves; the real data never does. Develop on the synthetic copy off-site, run final code on the real data in place, release only vetted aggregates.
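
The robust-bounds rule can be sketched with the standard library as follows. This is a minimal illustration, assuming Python 3.8+ for statistics.quantiles and statistics.fmean; the function name, the returned keys and the choice of the "inclusive" quantile method are assumptions, and the package may compute its percentiles differently.

```python
import statistics


def robust_numeric_profile(values):
    """Summarise a numeric column without exposing its true min or max."""
    # 99 cut points: cuts[0] is the 1st percentile, cuts[-1] the 99th.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "lower": cuts[0],   # P1 stands in for the true minimum
        "upper": cuts[-1],  # P99 stands in for the true maximum
    }


# Made-up column with one extreme outlier.
sentence_days = list(range(1, 101)) + [10000]
prof = robust_numeric_profile(sentence_days)

print(prof["lower"], prof["upper"])  # 2.0 100.0
print(max(sentence_days) > prof["upper"])  # True: the outlier is not in the bounds
```

The mean and standard deviation are still influenced by the outlier, but the reported range no longer reveals its exact value.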

oissyntheticdata is a disclosure-control aid, not a formal privacy guarantee. For a mathematical guarantee, combine it with differential privacy or apply output checking (statistical disclosure control) to anything released.

How the standalone scripts are built

scripts/00 to scripts/03 are auto-generated from src/oissyntheticdata/ by tools/build_standalone.py, inlining the shared _common and _io modules so each script is a single self-contained file. The package and the standalone scripts therefore produce identical output; tests/test_roundtrip.py enforces that they stay in sync.

Government use of synthetic data

The develop-on-synthetic, run-on-real workflow is established practice at national statistical agencies.

Governance, support and contributing

oissyntheticdata is maintained in the open under the MIT license. Questions, bug reports, and change proposals go through public GitHub Issues and Pull Requests; see CONTRIBUTING.md. Decisions are made by the maintainer(s) listed in CITATION.cff via the public issue/PR process. There is no private support channel; keeping development and discussion public is part of the project's auditability goal. Releases are versioned and recorded in CHANGELOG.md.

Generative AI disclosure

A generative AI assistant (Claude, Anthropic) was used to help draft and refactor parts of the code and documentation. All output was reviewed, tested, and edited by the author(s), who take full responsibility for the design, correctness, and integrity of the software. Contributors are asked to disclose non-trivial AI assistance (see CONTRIBUTING.md).

Maintainer

Dr Yohanan Ouaknine, OIS (ois.co.il), yohanan.ouaknine@ois.co.il, ORCID 0000-0002-4186-7351. Formerly Head of the Research Branch, Israel Prison Service.

License

MIT (c) 2026 Yohanan Ouaknine and OIS. See LICENSE. If you use this software in research, please cite it; see CITATION.cff.
