Command line tool to generate anonymised demonstrator data

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Exhibit: Command line tool to create anonymised demonstrator data


The goal of Exhibit is to make it easier to generate anonymised data at scale in a controlled and reproducible way.

Key features:

  • Control all aspects of the anonymisation process: which columns to anonymise and to what degree
  • Rapidly iterate on the anonymisation options
  • Set categorical weights to create custom distributions
  • Use regular expressions to bulk-anonymise identifiers
  • Add columns derived from newly anonymised data
  • Preserve important relationships between your columns (paired, hierarchical, custom)
  • Add outliers to any subset of the generated data
  • Generate and manipulate missing data and timeseries
  • Generate geo-spatial data using H3 hexes
  • Augment your synthetic data with compiled machine learning models and custom functions

Installation:

To install using pip, enter the following command at a Bash or Windows command prompt:

pip install exhibit

Alternatively, download or clone the repository and run pip install . from the root folder.


Quickstart

Exhibit has two principal modes of operation:

  • fromdata, which produces a detailed, user-editable .yml specification from the source data
  • fromspec, which produces the anonymised dataset from the supplied specification

See the -h listing for the full list of optional command line parameters.

The repository includes a few sample datasets and specifications.
You can find them in exhibit/sample/_data and exhibit/sample/_spec

To create a demo dataset, run:
exhibit fromspec exhibit/sample/_spec/inpatients_demo.yml -o demo.csv

To create a demo specification that equalises all probabilities and weights, run:
exhibit fromdata exhibit/sample/_data/inpatients.csv -ew -o demo.yml
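
The two commands above can also be driven from a Python script, for instance when demo data needs to be regenerated as part of a build. The sketch below only assembles the argument lists shown above and passes them to subprocess.run. The helper names (fromdata_cmd, fromspec_cmd, run) are illustrative and not part of Exhibit, and the sketch assumes the exhibit command is available on your PATH.

```python
import subprocess


def fromdata_cmd(data_path, spec_out, equal_weights=False):
    """Build the `exhibit fromdata` argument list (hypothetical helper)."""
    cmd = ["exhibit", "fromdata", data_path]
    if equal_weights:
        # -ew is the flag used in the Quickstart example above
        cmd.append("-ew")
    cmd += ["-o", spec_out]
    return cmd


def fromspec_cmd(spec_path, data_out):
    """Build the `exhibit fromspec` argument list (hypothetical helper)."""
    return ["exhibit", "fromspec", spec_path, "-o", data_out]


def run(cmd):
    """Run a command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


# Example usage, mirroring the Quickstart commands:
# run(fromspec_cmd("exhibit/sample/_spec/inpatients_demo.yml", "demo.csv"))
# run(fromdata_cmd("exhibit/sample/_data/inpatients.csv", "demo.yml",
#                  equal_weights=True))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file paths that contain spaces.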


Database

Exhibit is bundled with a SQLite3 database and a Python utility tool to interact with it. Alternatively, you can connect directly to exhibit/db/anon.db. The database contains three sample aliasing datasets (mountains, birds and patients) designed to help you quickly alias original values without manually editing individual column values.

  • mountains has 15 mountain ranges and their top 10 peaks, making it useful for aliasing hierarchical pairs, like NHS Boards and Hospitals.
  • birds has 150 pairs of common / scientific bird names. This can be useful for 1:1 paired columns.
  • patients has 360 made-up patient records with details such as gender, 5-year age band, date of birth and CHI number. Fields from this dataset can be selectively pulled in when linked data is required.

The database is also used to store temporary data for columns where the number of unique values exceeds the user-defined threshold and which are therefore not available for editing directly in the .yml file.

Note that original, confidential data might be saved in the exhibit/db/anon.db file on your local machine. You can purge all temporary tables by calling the --purge command from the included utility tool or by interfacing with the database directly.
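
If you choose to interface with the database directly, Python's built-in sqlite3 module is enough to see what is currently stored, including any temporary tables that may hold original values. The following is a minimal sketch: list_tables is an illustrative helper rather than part of Exhibit, and the actual table names inside anon.db are not assumed here, only discovered.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Example usage, with the path taken relative to the repository root:
# print(list_tables("exhibit/db/anon.db"))
```

Listing the tables before and after a purge is a simple way to confirm that no temporary data has been left behind.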


Disclaimer

Please note that the degree of anonymisation for each dataset produced by the tool will depend heavily on user choices in the specification. As such, there is no guarantee that confidential data will be suitably masked under all scenarios. If you intend to work with sensitive data, make sure to thoroughly evaluate the output before making it public.
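
One basic check that can form part of such an evaluation is to confirm that no original identifier values survive in the anonymised output. The sketch below is only a starting point and is not a feature of Exhibit: leaked_values is a hypothetical helper, the column name is whatever identifier column your own data uses, and both files are assumed to be CSVs with a header row.

```python
import csv


def leaked_values(original_csv, anonymised_csv, column):
    """Return the set of non-empty values of `column` present in both files."""

    def column_values(path):
        with open(path, newline="") as f:
            return {row[column] for row in csv.DictReader(f) if row[column]}

    return column_values(original_csv) & column_values(anonymised_csv)


# Example usage with a hypothetical identifier column:
# overlap = leaked_values("original.csv", "demo.csv", "patient_id")
# if overlap:
#     print(f"{len(overlap)} original values appear in the output")
```

An empty result does not prove that the output is safe. Overlap is also expected for categorical columns that were deliberately left unchanged, so the check is most informative for identifier columns.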
