Plugin package for metasyn that applies statistical disclosure control.
Metasyn disclosure control
A privacy plugin for metasyn, based on statistical disclosure control (SDC) rules of thumb as found in the following documents:
- The SDC handbook of the Secure Data group in the UK
- The Data Without Boundaries document Guidelines for output checking (pdf)
- Statistics Netherlands' output guidelines
Producing synthetic data with metasyn is already a great first step towards protecting privacy, but it doesn't adhere to official standards. For example, fitting a uniform distribution will disclose the lowest and highest values in the dataset, which may be a privacy issue in particularly sensitive data. This plugin solves these kinds of problems.
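To make the uniform example concrete, here is a minimal sketch that uses only the Python standard library. It is not metasyn's fitting code: `fit_uniform` is a hypothetical helper that mimics a naive maximum-likelihood fit, where the estimated bounds are simply the sample minimum and maximum. The `ages` data is made up for illustration.

```python
import random


def fit_uniform(values):
    """Naive uniform fit (hypothetical helper, not metasyn's implementation).

    The maximum-likelihood bounds of a uniform distribution are the
    sample minimum and maximum, so both are copied from real records.
    """
    return min(values), max(values)


random.seed(0)
# Made-up sensitive column: ages of 50 participants between 18 and 90.
ages = [random.randint(18, 90) for _ in range(50)]
# One unusually old participant joins the dataset.
ages.append(103)

lower, upper = fit_uniform(ages)
# The fitted upper bound is exactly the real age of that one participant.
print(lower, upper)
```

Anyone who receives the fitted parameters learns that a 103-year-old is in the data, even though no row of the original dataset was shared.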
[!WARNING] Currently, the disclosure control plugin is a work in progress. Especially in light of this, we disclaim any responsibility for the consequences of using this plugin.
Installing the plugin
To install the package with pip, run the following:
pip install metasyn-disclosure
For the development version, install the package directly from git with the following command:
pip install git+https://github.com/sodascience/metasyn-disclosure-control.git
Usage
Basic usage for our built-in titanic dataset is as follows:
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.string import DisclosureFaker
from metasyn import MetaFrame, VarSpec, demo_dataframe
df = demo_dataframe("titanic")
spec = [
    VarSpec(name="PassengerId", unique=True),
    VarSpec(name="Name", distribution=DisclosureFaker("name")),
]
mf = MetaFrame.fit_dataframe(
    df=df,
    var_specs=spec,
    privacy=DisclosurePrivacy(),
)
mf.synthesize(5)
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name               ┆ Sex    ┆ Age  ┆ … ┆ Birthday   ┆ Board time ┆ Married since       ┆ all_NA │
│ ---         ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---        ┆ ---        ┆ ---                 ┆ ---    │
│ i64         ┆ str                ┆ cat    ┆ i64  ┆   ┆ date       ┆ time       ┆ datetime[μs]        ┆ f32    │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0           ┆ Benjamin Cox       ┆ female ┆ 27   ┆ … ┆ 1931-12-01 ┆ 14:33:06   ┆ 2022-07-30 02:16:37 ┆ null   │
│ 1           ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null       ┆ 2022-08-03 13:09:19 ┆ null   │
│ 2           ┆ Randy Mosley       ┆ male   ┆ 24   ┆ … ┆ 1933-01-06 ┆ 15:52:54   ┆ 2022-07-18 18:52:05 ┆ null   │
│ 3           ┆ Vincent Maddox     ┆ female ┆ 24   ┆ … ┆ 1937-02-10 ┆ 16:58:30   ┆ 2022-07-23 20:29:49 ┆ null   │
│ 4           ┆ Kristin Holland    ┆ male   ┆ 17   ┆ … ┆ 1939-12-09 ┆ 18:07:45   ┆ 2022-08-05 02:41:51 ┆ null   │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘
Implementation details
The rules of thumb, roughly, are:
- at least 10 units
- at least 10 degrees of freedom
- no group disclosure
- no dominance
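As a rough illustration of what two of these rules can look like in code, the sketch below checks a minimum unit count and a dominance condition. It is not taken from the plugin. The function names are hypothetical, and the dominance check uses the (n, k) formulation that is common in the SDC literature, with illustrative thresholds (`n=1`, `k=0.5`) that are assumptions rather than the values this plugin uses.

```python
def has_enough_units(values, min_units=10):
    """Return True if the group contains at least `min_units` observations."""
    return len(values) >= min_units


def is_dominated(contributions, n=1, k=0.5):
    """(n, k) dominance check with illustrative, assumed thresholds.

    A group is flagged when its `n` largest contributions together make up
    more than a fraction `k` of the group total.
    """
    total = sum(contributions)
    if total <= 0:
        # Nothing to dominate in an empty or all-zero group.
        return False
    largest = sorted(contributions, reverse=True)[:n]
    return sum(largest) / total > k


# One contributor accounts for 90% of the total: flagged as dominated.
print(is_dominated([90, 5, 5]))
# Ten equal contributors: no single unit dominates.
print(is_dominated([10] * 10))
# Nine observations are fewer than the required ten units.
print(has_enough_units([1] * 9))
```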
For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which is then supplied to the original fitting mechanism. During this pre-averaging step, we ensure that the rules of thumb are followed, so that the fitting method itself doesn't need to do anything in particular. From a statistical point of view we lose more information than we probably need to, but this should ensure the safety of the data.
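The idea can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not the plugin's actual code: `micro_aggregate` and its `min_bin` parameter are hypothetical names, and the real implementation may group and average differently.

```python
from statistics import mean


def micro_aggregate(values, min_bin=10):
    """Replace every value by the mean of its sorted group.

    Hypothetical sketch: values are sorted, split into consecutive groups
    of at least `min_bin` items, and each item is replaced by its group mean.
    """
    ordered = sorted(values)
    if len(ordered) < min_bin:
        raise ValueError(f"need at least {min_bin} values, got {len(ordered)}")

    n_bins = len(ordered) // min_bin
    aggregated = []
    for i in range(n_bins):
        start = i * min_bin
        # The last group absorbs the remainder, so no group is too small.
        end = (i + 1) * min_bin if i < n_bins - 1 else len(ordered)
        group = ordered[start:end]
        aggregated.extend([mean(group)] * len(group))
    return aggregated


values = list(range(1, 26))  # 25 made-up observations: 1, 2, ..., 25
safe = micro_aggregate(values)
# An ordinary fit on `safe` only ever sees group means, never a single record.
print(min(safe), max(safe))
```

In this sketch, a uniform fit on `safe` would get its bounds from group means of at least 10 records each, instead of from the single lowest and highest values in the data.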
Contributing
You can contribute to this metasyn plugin by giving feedback in the "Issues" tab, or by creating a pull request.
To create a pull request:
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Contact
This is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Raoul Schram or Erik-Jan van Kesteren.