Synthetic data generation pipeline leveraging a Differentially Private Variational Auto Encoder assessed using a variety of metrics
NHS Synth
About the Project
The project currently consists of a Python package alongside research and investigative materials covering the effectiveness of the package and synthetic data more generally when applied to NHS use cases.
Project Description - Synthetic Data Exploration: Variational Autoencoders
The codebase builds on previous NHSX Analytics Unit PhD internships contextualising and investigating the potential use of Variational Auto Encoders (VAEs) for synthetic data generation. These were undertaken by Dominic Danks (last commit to the repository: 88a4bdf) and David Brind (last commit to the repository: ).
Note: No data, public or private, is shared in this repository.
Getting Started
Project Structure
- The main package and codebase is found in `src/nhssynth` (see Usage below for more information)
- Accompanying materials are available in the `docs` folder:
  - A report summarising the previous iteration of this project
  - A model card providing more information about the VAE with Differential Privacy
- Numerous exemplar configurations are found in `config`
- Empty `data` and `experiments` folders are provided; these are the default locations for inputs and outputs when running the project using the provided `cli` module
- Pre-processing notebooks for specific datasets used to assess the approach and other non-core code can be found in `auxiliary`
Installation
As it stands, we recommend the following steps to reproduce our experiments and fully work with this project:
- Clone the repo
- Ensure one of the required versions of Python is installed
- Install `poetry`
- Instantiate a virtual environment, e.g. via `python -m venv nhssynth`
- Activate the virtual environment, e.g. via `source nhssynth/bin/activate`
- Install project dependencies with `poetry install` (optionally install `jupyter` and `notebook` to work with some of the preprocessing files in `auxiliary`)
- Interact with the package in one of two ways:
  - Via the `cli` module using `poetry run cli`
  - Through building the package with `poetry build` and using it in an existing project (`import nhssynth`). However, if you intend on doing the latter it may be preferable to instead follow the second, simpler setup below.
For more standard usage of the package:
- Run `pip install nhssynth` within a supported Python installation
- Use the modules exported by the package as you would any other. Note that in this setup you will have to work more closely with the configuration and code to ensure you are handling inputs and outputs for each module appropriately. The `cli` handles a lot of this complexity, and interacting with the modules directly is considered advanced usage.
Usage
This package comprises a pipeline that is runnable via `poetry run cli pipeline <args>` or `poetry run cli config <config filepath>`. You can run the modules that make up this pipeline independently via `poetry run cli <module name>`. To see the modules that are available and their corresponding arguments and function, run `poetry run cli --help` / `poetry run cli <module name> --help`.
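If you want to drive the CLI from a script rather than a terminal, the invocations above can be assembled as argument lists. The following is only a sketch: the helper names (`pipeline_command`, `config_command`, `module_command`, `help_command`, `run_command`) are hypothetical and not part of `nhssynth`, and any arguments or module names you pass through are your own and are not validated here.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence

# Prefix shared by every invocation described in the Usage section.
BASE = ["poetry", "run", "cli"]


def pipeline_command(args: Sequence[str] = ()) -> List[str]:
    """Build `poetry run cli pipeline <args>`."""
    return [*BASE, "pipeline", *args]


def config_command(config_path: str) -> List[str]:
    """Build `poetry run cli config <config filepath>`."""
    return [*BASE, "config", config_path]


def module_command(module_name: str, args: Sequence[str] = ()) -> List[str]:
    """Build `poetry run cli <module name>` with optional extra arguments."""
    return [*BASE, module_name, *args]


def help_command(module_name: Optional[str] = None) -> List[str]:
    """Build the top-level or per-module `--help` invocation."""
    middle = [module_name] if module_name else []
    return [*BASE, *middle, "--help"]


def run_command(command: Sequence[str], dry_run: bool = True) -> int:
    """Print the command; only execute it when dry_run is False."""
    print(shlex.join(command))
    if dry_run:
        return 0
    return subprocess.run(list(command), check=False).returncode
```

Calling `run_command(help_command(), dry_run=False)` from the project root would then execute `poetry run cli --help`, provided the installation steps above have been completed.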
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the project
- Create your branch (`git checkout -b <yourusername>/<featurename>`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin <yourusername>/<featurename>`)
- Open a PR and we will try to get it merged!
See CONTRIBUTING.md for detailed guidance.
License
Distributed under the MIT License. See LICENSE for more information.
Contact
To find out more about the Analytics Unit visit our project website or get in touch at analytics-unit@nhsx.nhs.uk.
Modules
This folder contains all of the modules in this package. They can be used together or independently, either by importing them into your existing codebase or by using the `cli` module and `runner.py` to select which modules (or all of them) to run.
Importing a module from this package
After installing the package, you can simply do:
from nhssynth import <module>
and you will be able to use it in your code!
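To find out which module names are importable in your installed version, one option is to inspect the package with the standard library. This is a sketch only: `list_submodules` is a hypothetical helper, not something `nhssynth` provides, and it makes no assumption about which modules your installation actually contains.

```python
import importlib.util
import pkgutil
from typing import List


def list_submodules(package_name: str) -> List[str]:
    """Return the names of the direct submodules of an installed package.

    Returns an empty list if the package is not installed or if the name
    refers to a plain module rather than a package.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.submodule_search_locations is None:
        return []
    return sorted(
        info.name for info in pkgutil.iter_modules(spec.submodule_search_locations)
    )


if __name__ == "__main__":
    # Prints an empty list if nhssynth has not been installed.
    print(list_submodules("nhssynth"))
```

Any name printed here can then be substituted for `<module>` in the import statement above.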
Creating a new module and folding it into the CLI
The following instructions specify how to extend this package with a new module:
- Create a folder for your module within the package, i.e. `src/nhssynth/mymodule`
- Include within it a main executor that accepts arguments from the `cli` module, e.g. `def myexecutor(args): ...` in `mymodule/executor.py`, and export this by adding `from .executor import myexecutor` in `mymodule/__init__.py`.
- In the `cli` module folder, add the following code blocks to `run.py` (the second is optional depending on whether this module should be executed as part of a full pipeline run):

```python
from modules import ..., mymodule, ...
...
def run():
    ...
    parser_mymodule = subparsers.add_parser(
        name="mymodule",
        description=...,
        help=...,
    )
    add_mymodule_args(parser_mymodule)
    parser_mymodule.set_defaults(func=mymodule.executor)
    ...
```

```python
def run_pipeline(args):
    ...
    mymodule.executor(args)
    ...
```

- Similarly, add the following code blocks to `arguments.py` (again, the second block is optional):

```python
def add_mymodule_args(parser: argparse.ArgumentParser):
    ...
```

```python
def add_all_module_args(parser: argparse.ArgumentParser):
    ...
    mymodule_group = parser.add_argument_group(title="mymodule")
    add_mymodule_args(mymodule_group)
    ...

...
def add_mymodule_args(parser: argparse.ArgumentParser, override=False):
    ...
    add_mymodule_args(overrides_group)
    ...
```

- After populating the functions in a similar fashion to the existing modules, your module will work as part of the CLI!
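The wiring described in these steps follows the usual `argparse` subcommand pattern. The sketch below shows that pattern in a single self-contained file so it can be run on its own; it is not the package's actual `run.py` or `arguments.py`. The `--example-option` flag, the executor body, and `build_parser` are hypothetical and exist only for illustration.

```python
import argparse
from typing import List, Optional


def myexecutor(args: argparse.Namespace) -> str:
    """Hypothetical executor: receives the parsed CLI arguments."""
    return f"mymodule ran with example_option={args.example_option}"


def add_mymodule_args(parser) -> None:
    """Register this module's arguments (the flag shown is made up).

    Works with a parser or an argument group, since both expose add_argument.
    """
    parser.add_argument("--example-option", type=int, default=1,
                        help="illustrative option, not a real nhssynth flag")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli")
    subparsers = parser.add_subparsers(dest="module", required=True)

    # One subparser per module, as in the run.py step above.
    parser_mymodule = subparsers.add_parser(
        "mymodule",
        description="Run the hypothetical mymodule.",
        help="run mymodule on its own",
    )
    add_mymodule_args(parser_mymodule)
    # set_defaults attaches the callable that the chosen subcommand dispatches to.
    parser_mymodule.set_defaults(func=myexecutor)
    return parser


def run(argv: Optional[List[str]] = None) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)


if __name__ == "__main__":
    print(run())
```

Saved as a standalone script, this responds to `mymodule --example-option 3` by dispatching to `myexecutor`. Passing a callable to `set_defaults(func=...)` is what lets a single `args.func(args)` call serve every subcommand.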