Synthetic data generation pipeline leveraging a Differentially Private Variational Auto Encoder assessed using a variety of metrics
NHS Synth
About
This repository currently consists of a Python package alongside research and investigative materials covering the effectiveness of the package and synthetic data more generally when applied to NHS use cases. See the internal project description for more information.
Getting Started
Project Structure
- The main package and codebase is found in `src/nhssynth` (see Usage below for more information)
- Accompanying materials are available in the `docs` folder:
  - The components used to create the GitHub Pages documentation site
  - A report summarising the previous iteration of this project
  - A model card providing more information about the VAE with Differential Privacy
- Numerous exemplar configurations are found in `config`
- Empty `data` and `experiments` folders are provided; these are the default locations for inputs and outputs when running the project using the provided CLI module
- Pre-processing notebooks for specific datasets used to assess the approach and other non-core code can be found in `auxiliary`
Installation
For general usage, we recommend installing the package via `pip install nhssynth` in an environment running a supported version of Python. You can then `import` the package's modules and use them in your projects, or interact with the package directly via the CLI, which is accessed using the `nhssynth` command (see Usage for more information).
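After installing, it can be worth confirming that the distribution is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "nhssynth" is the distribution name used in `pip install nhssynth`.
found = installed_version("nhssynth")
if found is None:
    print("nhssynth is not installed in this environment")
else:
    print(f"nhssynth {found} is installed")
```

If this reports that the package is missing, the most likely cause is that `pip` installed into a different environment from the one running the check.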
Secure Mode
Note that in order to train a generator in secure mode (see the documentation for details) you will need to install the PyTorch extension package `csprng` separately. Currently this package's dependencies are not compatible with recent versions of PyTorch (the authors plan to rectify this; watch this space), so you will need to install it manually. We recommend following the instructions below:
git clone git@github.com:pytorch/csprng.git
cd csprng
git branch release "v0.2.2-rc1"
git checkout release
python setup.py install
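Once the build finishes, you may want to confirm that the extension can be found before attempting a secure-mode run. The sketch below assumes the extension is importable as `torchcsprng`, which is the import name used in the csprng project's own README; verify this against the version you built. The helper name `is_module_available` is illustrative.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Report whether a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the csprng extension installs under the import name "torchcsprng".
SECURE_MODE_MODULE = "torchcsprng"

if is_module_available(SECURE_MODE_MODULE):
    print("csprng extension found; secure mode prerequisites look satisfied")
else:
    print("csprng extension not found; follow the manual installation steps")
```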
Advanced Installation
If you intend on contributing or working with the codebase directly, or if you want to reproduce the results of this project, follow the steps below:
1. Clone the repo
2. Ensure one of the required versions of Python is installed
3. Install `poetry` and either:
   - Skip to step four (and have `poetry` control the installation's virtual environment in its own way)
   - Change `poetry`'s configuration so that you manage your own virtual environments, by running `poetry config virtualenvs.create false` and then `poetry config virtualenvs.in-project false`. You can now instantiate a virtual environment in the usual way (e.g. via `python -m venv nhssynth`) and activate it via `source nhssynth/bin/activate` before moving to the next step
4. Install the project dependencies with `poetry install` (add optional flags: `--with dev` when developing and testing the package, `--with aux` to work with the auxiliary notebooks, `--with docs` to work with the documentation)
5. You can then interact with the package in one of two ways:
   - Via the CLI module, which is accessed using the `nhssynth` command, e.g. `poetry run nhssynth ...`. Note that you can omit the `poetry run` part and just type `nhssynth` if you followed the optional steps above to manage and activate your own virtual environment, or if you have executed `poetry shell` beforehand.
   - Through directly importing parts of the package to use in an existing project (`from nhssynth.modules... import ...`).
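If you script the CLI from Python, the choice between `nhssynth` and `poetry run nhssynth` described above can be made automatically. The following is a rough sketch, assuming that finding `nhssynth` on `PATH` means an environment containing it is already active; the helper name `cli_prefix` is hypothetical.

```python
import shutil
from typing import Callable, List, Optional


def cli_prefix(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Pick the command prefix for invoking the nhssynth CLI.

    If the `nhssynth` executable is on PATH (for example inside an activated
    virtual environment), call it directly; otherwise go through `poetry run`.
    """
    if which("nhssynth"):
        return ["nhssynth"]
    return ["poetry", "run", "nhssynth"]


# Build (but do not execute) the help command for the current environment.
print(" ".join(cli_prefix() + ["--help"]))
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default is sufficient.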
Usage
CLI
This package comprises a set of modules that can be run using the CLI individually, as part of a pipeline, or via a configuration file. These options are available via the aforementioned (`poetry run`) `nhssynth` command:
nhssynth <module name> --<args>
nhssynth pipeline --<args>
nhssynth config -c <name> --<overrides>
To see the modules that are available and their corresponding arguments, run `nhssynth --help` and `nhssynth <module name> --help` respectively.
Configuration files can be used to run the pipeline or a chosen set of modules. They should be placed in the `config` folder and their layout should match that of the examples provided. They are run using the `nhssynth config -c <name>` form shown above, where `<name>` is the filename without the `.yaml` extension. You can also override any of the arguments provided in the configuration file by passing them as arguments on the command line.
To ensure reproducibility, you should always specify a `--seed` value and provide the `--save-config` flag to dump the exact configuration specified or inferred for the run (missing options are populated in the output config, so it may be larger than one you would write yourself). This configuration file can then be used later to reproduce the same run, or shared with others so they can run the same configuration on their own dataset.
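When runs are launched from a script, assembling the command in one place makes it harder to forget the seed or the config dump. The sketch below only builds the argument list and does not execute anything. It assumes that `--seed` takes an integer and that `--save-config` takes no value, as the wording above suggests; `my_config.yaml` is a hypothetical filename and `build_config_command` is an illustrative helper, not part of the package.

```python
from typing import List, Sequence


def build_config_command(
    config_name: str,
    seed: int,
    prefix: Sequence[str] = ("nhssynth",),
) -> List[str]:
    """Assemble a reproducible `nhssynth config` invocation as an argument list."""
    # The CLI expects the config filename without its .yaml extension.
    suffix = ".yaml"
    name = config_name[: -len(suffix)] if config_name.endswith(suffix) else config_name
    return [*prefix, "config", "-c", name, "--seed", str(seed), "--save-config"]


# "my_config.yaml" is a placeholder; substitute a file from your config folder.
command = build_config_command("my_config.yaml", seed=42)
print(" ".join(command))
# To run it for real: subprocess.run(command, check=True)
```

Passing `prefix=("poetry", "run", "nhssynth")` produces the equivalent command for a `poetry`-managed environment.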
Python API
Alternatively, you may want to import parts of the package into your own project or notebook. There is a minimum working example of this in the auxiliary folder. You can learn more about the API and structure of the package and its modules in the docs to reuse components as you see fit.
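Before reading the API docs, it can help to see which submodules a package exposes. The sketch below is a generic standard-library approach rather than anything specific to this project, and it is demonstrated on `json` so that it runs without `nhssynth` installed. The package path `nhssynth.modules` is inferred from the import pattern shown earlier; its submodule names are not listed here and should be taken from the docs.

```python
import importlib
import pkgutil
from typing import List


def list_submodules(package_name: str) -> List[str]:
    """Return the sorted names of the immediate submodules of a package."""
    package = importlib.import_module(package_name)
    # Only packages have __path__; plain modules have no submodules to list.
    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return []
    return sorted(info.name for info in pkgutil.iter_modules(search_path))


# Demonstrated on a standard-library package so the example runs anywhere.
print(list_submodules("json"))

# With the package installed, the same call could be pointed at it, e.g.
# list_submodules("nhssynth.modules")  # path assumed from the import pattern
```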
Package Structure
The figure below shows the structure and workflow of the package and its modules.
View a visualisation of the codebase here!
Roadmap
See the open issues for a list of proposed features (and known bugs). Our milestones represent longer term goals for the project.
Contributing
Contributions are welcome! We encourage you to first raise an issue with your proposed contribution to enable discussion with the maintainers. After that, please follow these steps:
- Fork the project
- Create your branch (`git checkout -b <yourusername>/<featurename>`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin <yourusername>/<featurename>`)
- Open a PR and we will try to get it merged!
See CONTRIBUTING.md for detailed guidance.
Thanks to everyone that has contributed so far!
This codebase builds on previous NHSX Analytics Unit PhD internships contextualising and investigating the potential use of Variational Auto Encoders (VAEs) for synthetic data generation. These were undertaken by Dominic Danks and David Brind.
License
Distributed under the MIT License. See LICENSE for more information.
Contact
This project is under active development by @HarrisonWilde. For feature requests and bugs, please raise an issue; for security concerns, please open a draft security advisory. Alternatively, contact NHS England TDAU.
To find out more about the Analytics Unit visit our project website or get in touch at england.tdau@nhs.net.
File details
Details for the file `nhssynth-0.10.2.tar.gz`.
File metadata
- Download URL: nhssynth-0.10.2.tar.gz
- Upload date:
- Size: 69.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.10 Darwin/24.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6724d5834fc0a3a4eec014cf0d643e3a75e3a99c39d7284686c3fceac07af2c9
MD5 | 8ea1adeba2f1728090df1afb2ddc054e
BLAKE2b-256 | 711d9b44f66897787d2826e881f4f6d4643cccf0e4d22c95e9a8b78027043028
File details
Details for the file `nhssynth-0.10.2-py3-none-any.whl`.
File metadata
- Download URL: nhssynth-0.10.2-py3-none-any.whl
- Upload date:
- Size: 85.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.11.10 Darwin/24.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | f35247386cd526cc635ab12446bca51e623512cb607afb4a56650228f5f40d12
MD5 | 9fdcc4ba412734de763207d1daf20b63
BLAKE2b-256 | 89b30f67bcde69c45f865c8effa6f995b025978e0b5386821440835b6de43b4c
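The SHA256 digests listed above can be used to check a manually downloaded file before installing it. The following is a minimal sketch using the standard library; the digests are copied from the file details above, while the helper names `sha256_of_file` and `matches_published_digest` are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digests as listed in the file details for release 0.10.2.
PUBLISHED_SHA256 = {
    "nhssynth-0.10.2.tar.gz": "6724d5834fc0a3a4eec014cf0d643e3a75e3a99c39d7284686c3fceac07af2c9",
    "nhssynth-0.10.2-py3-none-any.whl": "f35247386cd526cc635ab12446bca51e623512cb607afb4a56650228f5f40d12",
}


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Union[str, Path]) -> bool:
    """Check a downloaded file against the published digest for its filename."""
    expected = PUBLISHED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of_file(path) == expected
```

For example, `matches_published_digest("nhssynth-0.10.2.tar.gz")` returns `True` only if a file with that name exists in the working directory and its contents hash to the listed value. Files with other names are rejected because no digest is known for them.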