Synthetic data generation pipeline leveraging a Differentially Private Variational Auto Encoder assessed using a variety of metrics
NHS Synth
About
This repository currently consists of a Python package alongside research and investigative materials covering the effectiveness of the package and synthetic data more generally when applied to NHS use cases. See the internal project description for more information.
Note: No data, public or private, is shared in this repository.
Getting Started
Project Structure
- The main package and codebase is found in `src/nhssynth` (see Usage below for more information)
- Accompanying materials are available in the `docs` folder:
  - The components used to create the GitHub Pages documentation site
  - A report summarising the previous iteration of this project
  - A model card providing more information about the VAE with Differential Privacy
- Numerous exemplar configurations are found in `config`
- Empty `data` and `experiments` folders are provided; these are the default locations for inputs and outputs when running the project using the provided CLI module
- Pre-processing notebooks for specific datasets used to assess the approach and other non-core code can be found in `auxiliary`
Installation
For general usage, we recommend installing the package via `pip install nhssynth` in an environment running a supported Python version. You can then import the package's modules and use them in your projects, or interact with the package directly via the CLI, which is accessed using the `nhssynth` command (see Usage for more information).
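After installing, one quick way to confirm that both entry points mentioned above are available is to look for the importable `nhssynth` module and the `nhssynth` command on the `PATH`. The sketch below uses only the Python standard library; the helper name `check_nhssynth_install` is illustrative and not part of the package.

```python
import importlib.util
import shutil


def check_nhssynth_install() -> dict:
    """Report whether the nhssynth module and CLI command can be found."""
    return {
        # True if `import nhssynth` would locate the package
        "module_importable": importlib.util.find_spec("nhssynth") is not None,
        # True if the `nhssynth` command is available on the PATH
        "cli_on_path": shutil.which("nhssynth") is not None,
    }


status = check_nhssynth_install()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, the most likely cause is that the package was installed into a different environment from the one currently active.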
Secure Mode
Note that in order to train a generator in secure mode (see the documentation for details) you will need to install the PyTorch extension package `csprng` separately. Currently this package's dependencies are not compatible with recent versions of PyTorch (its authors plan to rectify this; watch this space), so you will need to install it manually. We recommend following the instructions below:
git clone git@github.com:pytorch/csprng.git
cd csprng
git branch release "v0.2.2-rc1"
git checkout release
python setup.py install
Advanced Usage
If you intend on contributing or working with the codebase directly, or if you want to reproduce the results of this project, follow the steps below:
- Clone the repo
- Ensure one of the required versions of Python is installed
- Install `poetry`
- Instantiate a virtual environment, e.g. via `python -m venv nhssynth`
- Activate the virtual environment, e.g. via `source nhssynth/bin/activate`
- Install the project dependencies with `poetry install` (optionally install the dev dependencies with `--with dev` to work with the auxiliary notebooks, or use `--with docs` to work with the documentation)
- You can then interact with the package in one of two ways:
  - Via the CLI module using `nhssynth ...`
  - Through building the package with `poetry build` and using it in an existing project (`import nhssynth`). You can then actively develop the package and test it.
Usage
This package comprises a set of modules that can be run individually, as part of a pipeline, or via a configuration file. These options are available via the `nhssynth` command:
nhssynth <module name> --<args>
nhssynth pipeline --<args>
nhssynth config -c <name> --<overrides>
To see the modules that are available and their corresponding arguments, run `nhssynth --help` and `nhssynth <module name> --help` respectively.
Configuration files can be used to run the pipeline or a chosen set of modules. They should be placed in the `config` folder and their layout should match that of the examples provided. They are run using the third form above (`nhssynth config -c <name>`), where `<name>` is the filename without the `.yaml` extension. You can also override any of the arguments provided in the configuration file by passing them as arguments in the command line.
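The precedence described above, where command-line arguments take priority over values in the configuration file, can be pictured with a small sketch. This is not the package's implementation; the function `apply_overrides` and every key except `seed` are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Optional


def apply_overrides(
    file_config: Dict[str, Any], cli_args: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Return a new config in which any CLI value that was actually given wins."""
    merged = dict(file_config)  # copy so the loaded file values stay untouched
    for key, value in cli_args.items():
        if value is not None:  # None stands for "not passed on the command line"
            merged[key] = value
    return merged


# Hypothetical values as they might be read from a YAML file in `config`
file_config = {"seed": 1, "dataset": "example", "epochs": 50}
# Hypothetical parsed command-line arguments; only `seed` was supplied
cli_args = {"seed": 42, "epochs": None}

print(apply_overrides(file_config, cli_args))
```

Here the file's `seed` of 1 is replaced by the command-line value 42, while `dataset` and `epochs` keep the values from the file.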
To ensure reproducibility, you should always specify a `--seed` value and provide the `--save-config` flag to dump the exact configuration specified / inferred for the run. This configuration file can then be used in the future to reproduce the exact same run or shared with others to run the same configuration on their dataset, etc.
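For scripted experiments, the advice above can be enforced by always assembling the command with both options present. The sketch below only builds the argument list and does not run anything. It assumes that `--seed` takes an integer value and that `--save-config` is a valueless flag accepted alongside `nhssynth config -c <name>`, as the text suggests; the helper name and the configuration name `my_experiment` are hypothetical.

```python
import shlex
from typing import List


def reproducible_command(config_name: str, seed: int) -> List[str]:
    """Build an `nhssynth config` invocation that pins the seed and saves the config."""
    return [
        "nhssynth", "config",
        "-c", config_name,    # configuration filename without the .yaml extension
        "--seed", str(seed),  # fix the random seed for the run
        "--save-config",      # dump the exact configuration used
    ]


command = reproducible_command("my_experiment", seed=42)
print(shlex.join(command))
```

Once the package is installed, the resulting list could be passed to `subprocess.run(command, check=True)`.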
The figure below shows the structure and workflow of the package and its modules.
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Any contributions you wish to make are greatly appreciated. We encourage you to first raise an issue to discuss your idea with the maintainers. If you are interested in contributing, please follow these steps:
- Fork the project
- Create your branch (`git checkout -b <yourusername>/<featurename>`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin <yourusername>/<featurename>`)
- Open a PR and we will try to get it merged!
See CONTRIBUTING.md for detailed guidance.
Thanks to everyone that has contributed so far!
This codebase builds on previous NHSX Analytics Unit PhD internships contextualising and investigating the potential use of Variational Auto Encoders (VAEs) for synthetic data generation. These were undertaken by Dominic Danks and David Brind.
License
Distributed under the MIT License. See LICENSE for more information.
Contact
This project is under active development by @HarrisonWilde. For any questions or security concerns, contact him or raise an issue. Alternatively, contact NHS England TDAU.
To find out more about the Analytics Unit visit our project website or get in touch at england.tdau@nhs.net.