Synthetic data generation pipeline leveraging a Differentially Private Variational Auto Encoder assessed using a variety of metrics

NHS Synth

About

This repository currently consists of a Python package alongside research and investigative materials covering the effectiveness of the package and synthetic data more generally when applied to NHS use cases. See the internal project description for more information.

Note: No data, public or private, is shared in this repository.

Getting Started

Project Structure

  • The main package and codebase are found in src/nhssynth (see Usage below for more information)
  • Accompanying materials are available in the docs folder:
    • The components used to create the GitHub Pages documentation site
    • A report summarising the previous iteration of this project
    • A model card providing more information about the VAE with Differential Privacy
  • Numerous exemplar configurations are found in config
  • Empty data and experiments folders are provided; these are the default locations for inputs and outputs when running the project using the provided CLI module
  • Pre-processing notebooks for specific datasets used to assess the approach and other non-core code can be found in auxiliary

Installation

For general usage, we recommend installing the package via pip install nhssynth in an environment running a supported Python version. You can then import the package's modules and use them in your projects, or interact with the package directly via the CLI, which is accessed using the nhssynth command (see Usage for more information).

Secure Mode

Note that in order to train a generator in secure mode (see the documentation for details) you will need to install the PyTorch extension package csprng separately. Currently this package's dependencies are not compatible with recent versions of PyTorch (the authors plan to rectify this; watch this space), so you will need to install it manually. We recommend following the instructions below:

git clone git@github.com:pytorch/csprng.git
cd csprng
git branch release "v0.2.2-rc1"
git checkout release
python setup.py install
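
Once the install has finished, it can be useful to confirm that the extension is visible to the active environment before attempting a secure-mode run. The following is a minimal sketch, assuming the extension is importable under the module name torchcsprng (the name used in the csprng project's own examples); the helper csprng_available is hypothetical and not part of nhssynth.

```python
import importlib.util


def csprng_available(module_name: str = "torchcsprng") -> bool:
    """Return True if the named module can be found in this environment.

    The default module name is an assumption based on the csprng project's
    own examples; adjust it if your build exposes a different name.
    """
    # find_spec locates a top-level module on sys.path without importing it.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("csprng installed:", csprng_available())
```

If this reports False, the install most likely targeted a different Python environment from the one nhssynth is installed in.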

Advanced Usage

If you intend to contribute to or work with the codebase directly, or if you want to reproduce the results of this project, follow the steps below:

  1. Clone the repo
  2. Ensure one of the required versions of Python is installed
  3. Install poetry
  4. Instantiate a virtual environment, e.g. via python -m venv nhssynth
  5. Activate the virtual environment, e.g. via source nhssynth/bin/activate
  6. Install the project dependencies with poetry install (optionally add --with dev to install the dev dependencies needed for the auxiliary notebooks, or --with docs to work with the documentation)
  7. You can then interact with the package in one of two ways:
    • Via the CLI module using nhssynth ...
    • Through building the package with poetry build and using it in an existing project (import nhssynth). You can then actively develop the package and test it.

Usage

This package comprises a set of modules that can be run individually, as part of a pipeline, or via a configuration file. These options are available via the nhssynth command:

nhssynth <module name> --<args>
nhssynth pipeline --<args>
nhssynth config -c <name> --<overrides>

To see the modules that are available and their corresponding arguments, run nhssynth --help and nhssynth <module name> --help respectively.

Configuration files can be used to run the pipeline or a chosen set of modules. They should be placed in the config folder and their layout should match that of the examples provided. They are run as in the last case above, by providing their filename (without the .yaml extension). You can also override any of the arguments provided in the configuration file by passing them as arguments on the command line.

To ensure reproducibility, you should always specify a --seed value and provide the --save-config flag to dump the exact configuration specified / inferred for the run. This configuration file can then be used in the future to reproduce the exact same run or shared with others to run the same configuration on their dataset, etc.
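
When launching runs from a script, the reproducibility advice above can be baked into a small helper so that the seed and the --save-config flag are never forgotten. This is a minimal sketch under stated assumptions: the helper reproducible_command and the configuration name my_experiment are hypothetical, --save-config is treated as a flag that takes no value, and the overrides are assumed to be accepted after -c <name> as in the command forms shown earlier. Check nhssynth config --help for the exact argument layout.

```python
import shlex
from typing import List


def reproducible_command(config_name: str, seed: int) -> List[str]:
    """Build an `nhssynth config` invocation with a fixed seed and saved config.

    Only the command line is built here; nothing is executed.
    """
    return [
        "nhssynth",
        "config",
        "-c",
        config_name,  # configuration filename without the .yaml extension
        "--seed",
        str(seed),  # fixed seed so the run can be repeated
        "--save-config",  # assumed to be a value-less flag
    ]


# "my_experiment" is a hypothetical configuration name used for illustration.
cmd = reproducible_command("my_experiment", seed=42)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) in an environment where nhssynth is installed.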

The figure below shows the structure and workflow of the package and its modules.

View a visualisation of the codebase here!

Roadmap

See the open issues for a list of proposed features (and known bugs). Our milestones represent longer term goals for the project.

Contributing

Any contributions you wish to make are greatly appreciated. We encourage you to first raise an issue to discuss your idea with the maintainers. If you are interested in contributing, please follow these steps:

  1. Fork the project
  2. Create your branch (git checkout -b <yourusername>/<featurename>)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin <yourusername>/<featurename>)
  5. Open a PR and we will try to get it merged!

See CONTRIBUTING.md for detailed guidance.

Thanks to everyone who has contributed so far!

This codebase builds on previous NHSX Analytics Unit PhD internships contextualising and investigating the potential use of Variational Auto Encoders (VAEs) for synthetic data generation. These were undertaken by Dominic Danks and David Brind.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

This project is under active development by @HarrisonWilde. For feature requests and bugs, please raise an issue; for security concerns, please open a draft security advisory. Alternatively, contact NHS England TDAU.

To find out more about the Analytics Unit visit our project website or get in touch at england.tdau@nhs.net.
