
Plugin package for metasyn that applies statistical disclosure control.


Metasyn disclosure control

Project Status: WIP. Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

A privacy plugin for metasyn, based on statistical disclosure control (SDC) rules of thumb as found in the following documents:

Producing synthetic data with metasyn is already a great first step towards protecting privacy, but it doesn't adhere to official standards. For example, fitting a uniform distribution will disclose the lowest and highest values in the dataset, which may be a privacy issue in particularly sensitive data. This plugin solves these kinds of problems.
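To make the uniform example concrete, here is a minimal sketch in plain Python. It does not use metasyn: `fit_uniform` is a hypothetical stand-in for any fitter that estimates the bounds of a uniform distribution from the sample minimum and maximum, and the ages are made-up values.

```python
def fit_uniform(values):
    """Naive uniform fit: the estimated bounds are the sample extremes.

    Hypothetical helper for illustration; not metasyn's implementation.
    """
    return {"lower": min(values), "upper": max(values)}


# Made-up ages; 93 is a rare value that could point to a single person.
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 93]

params = fit_uniform(ages)
print(params)  # {'lower': 23, 'upper': 93}
```

The fitted parameters contain the exact lowest and highest values from the data, so publishing them publishes those two records' values as well.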

Warning: The disclosure control plugin is currently a work in progress. Especially in light of this, we disclaim any responsibility for the consequences of using this plugin.

Installing the plugin

To install the package with pip, run the following:

pip install metasyn-disclosure

For the development version, install the package directly from git with the following command:

pip install git+https://github.com/sodascience/metasyn-disclosure-control.git

Usage

Basic usage for our built-in titanic dataset is as follows:

from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.string import DisclosureFaker

from metasyn import MetaFrame, VarSpec, demo_dataframe

df = demo_dataframe("titanic")

spec = [
    VarSpec(name="PassengerId", unique=True),
    VarSpec(name="Name", distribution=DisclosureFaker("name")),
]

mf = MetaFrame.fit_dataframe(
    df=df,
    var_specs=spec,
    privacy=DisclosurePrivacy(),
)

mf.synthesize(5)
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name               ┆ Sex    ┆ Age  ┆ … ┆ Birthday   ┆ Board time ┆ Married since       ┆ all_NA │
│ ---         ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---        ┆ ---        ┆ ---                 ┆ ---    │
│ i64         ┆ str                ┆ cat    ┆ i64  ┆   ┆ date       ┆ time       ┆ datetime[μs]        ┆ f32    │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0           ┆ Benjamin Cox       ┆ female ┆ 27   ┆ … ┆ 1931-12-01 ┆ 14:33:06   ┆ 2022-07-30 02:16:37 ┆ null   │
│ 1           ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null       ┆ 2022-08-03 13:09:19 ┆ null   │
│ 2           ┆ Randy Mosley       ┆ male   ┆ 24   ┆ … ┆ 1933-01-06 ┆ 15:52:54   ┆ 2022-07-18 18:52:05 ┆ null   │
│ 3           ┆ Vincent Maddox     ┆ female ┆ 24   ┆ … ┆ 1937-02-10 ┆ 16:58:30   ┆ 2022-07-23 20:29:49 ┆ null   │
│ 4           ┆ Kristin Holland    ┆ male   ┆ 17   ┆ … ┆ 1939-12-09 ┆ 18:07:45   ┆ 2022-08-05 02:41:51 ┆ null   │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘

Implementation details

The rules of thumb, roughly, are:

  • at least 10 units
  • at least 10 degrees of freedom
  • no group disclosure
  • no dominance
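As a rough illustration of the first and last rules, the sketch below checks a group of values in plain Python. The function names and the 50% dominance threshold are assumptions made for this example and are not taken from the plugin; only the minimum of 10 units comes from the list above.

```python
def has_enough_units(values, min_units=10):
    """Check that a statistic is based on at least `min_units` observations."""
    return len(values) >= min_units


def is_dominated(values, max_share=0.5):
    """Check whether one unit contributes more than `max_share` of the total.

    Assumes non-negative values. The 0.5 threshold is an illustrative
    assumption, not an official value.
    """
    total = sum(values)
    if total <= 0:
        return False
    return max(values) / total > max_share


# Made-up incomes: nine similar values and one very large one.
incomes = [10, 12, 9, 11, 10, 13, 8, 12, 11, 900]

print(has_enough_units(incomes))  # True: 10 observations
print(is_dominated(incomes))      # True: 900 of the total 996 comes from one unit
```

A group can therefore pass the unit-count rule and still be unsafe, which is why the rules are checked together.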

For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which is then supplied to the original fitting mechanism. The idea is that the pre-averaging step ensures the rules of thumb are followed, so the fitting method itself needs no special handling. From a statistical point of view we lose more information than is probably necessary, but this should ensure the safety of the data.
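The general idea of micro-aggregation can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not the plugin's actual code: `micro_aggregate` and its `min_bin` parameter are hypothetical names, and the grouping strategy (near-equal groups of at least `min_bin` sorted values) is one of several possible choices.

```python
def micro_aggregate(values, min_bin=10):
    """Replace each sorted group of at least `min_bin` values by its mean.

    Hypothetical helper for illustration; not the plugin's implementation.
    """
    if len(values) < min_bin:
        raise ValueError(f"need at least {min_bin} values, got {len(values)}")
    ordered = sorted(values)
    # Use as many groups as possible while keeping every group >= min_bin.
    n_groups = len(ordered) // min_bin
    base, extra = divmod(len(ordered), n_groups)
    averaged = []
    start = 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        group = ordered[start:start + size]
        # Every value in the group is replaced by the group mean.
        averaged.extend([sum(group) / size] * size)
        start += size
    return averaged


data = list(range(1, 26))  # 25 values: 1..25
safe = micro_aggregate(data)
print(min(safe), max(safe))  # 7.0 19.5
```

The averaged values would then be passed to the usual fitting routine. In this example the true extremes 1 and 25 no longer appear, so a fit based on the minimum and maximum cannot reveal them.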

Contributing

You can contribute to this metasyn plugin by giving feedback in the "Issues" tab, or by creating a pull request.

To create a pull request:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

This is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Raoul Schram or Erik-Jan van Kesteren.

