
CatBoost Classifier for predicting outpatient non-attendance.

Project description

DNAttend - ML framework for predicting patient non-attendance

Train, test and validate a CatBoost Classifier for predicting patient non-attendance (DNA)

Status: experimental


Installation

DNAttend can be installed via pip as shown below. To manage dependencies and avoid conflicts, it is recommended to install within a virtual environment or a Docker container, as described in the following sections.

pip install dnattend

Virtual Environment

Unix/macOS

Run the following commands via Terminal.

python -m venv dnattend
source dnattend/bin/activate
pip install dnattend

Windows

Run the following commands via PowerShell.

py -m venv dnattend
dnattend/Scripts/Activate.ps1
pip install dnattend

If running scripts is disabled on your system then run the following command before activating your environment.

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Docker

git clone --depth 1 https://github.com/nhsx/dna-risk-predict.git
cd dna-risk-predict/
docker build -t dnattend .
docker run dnattend --help

To run the following example via Docker, we recommend using Docker volumes to access local data from the container. The following command mounts the current directory to the directory /out within the container, sets the container working directory to that path (-w /out), and runs the command as normal.

docker run -v $(pwd):/out -w /out \
  dnattend process config.yaml

Workflow

Figure: Overview of DNAttend workflow

Refer to the additional documentation for further details of the underlying classifier framework.

Usage

The following sections document the built-in example workflow provided. It is recommended that users follow this workflow to verify proper installation.

Generate Example Data

The simulate sub-command generates suitably formatted input data for testing functionality. It also writes an example config file in YAML format. Both of these output files can serve as templates for building real-world models.

dnattend simulate --config config.yaml > DNAttend-example.csv
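Before training, it can be useful to check how imbalanced the target column is. The following is a minimal sketch, assuming the simulated CSV has a header row and that the target column is named status with DNA encoded as 1, as in the example configuration. The dna_rate helper and the inline sample rows are hypothetical and are not part of DNAttend.

```python
import csv
import io


def dna_rate(rows, target="status", dna_class="1"):
    """Return (number of rows, fraction of rows whose target equals dna_class).

    `rows` is any iterable of CSV lines, e.g. an open file handle.
    Values are compared as strings because csv.DictReader does not parse types.
    """
    total = 0
    dna = 0
    for row in csv.DictReader(rows):
        total += 1
        if row[target] == dna_class:
            dna += 1
    if total == 0:
        raise ValueError("no data rows found")
    return total, dna / total


# Tiny inline stand-in for DNAttend-example.csv (hypothetical values).
sample = io.StringIO(
    "age,firstAppointment,status\n"
    "34,1,0\n"
    "71,0,1\n"
    "52,0,0\n"
    "19,1,1\n"
)
n_rows, rate = dna_rate(sample)
print(f"{n_rows} rows, DNA rate {rate:.2f}")
```

To run this against the generated file, open it with open("DNAttend-example.csv", newline="") and pass the file handle to dna_rate.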

Train Model

DNAttend trains two models independently: a baseline logistic regression model and a CatBoost model. The baseline is a simple model that acts as a reference for assessing the performance improvement of CatBoost. Refer to the additional documentation for further details of the model workflow.

dnattend train config.yaml

Evaluate Model

Following initial training, the dnattend test command can be used to assess performance of both the logistic regression and CatBoost models against the hold-out testing data set. Refer to the additional documentation for example output visualisation and performance metrics.

dnattend test config.yaml
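The hold-out test set comes from partitioning the input data according to the train_size, test_size and val_size options. How DNAttend performs this split internally is not described here, so the following is only a sketch of a seeded three-way split under those proportions. The three_way_split function is hypothetical.

```python
import random


def three_way_split(n_rows, train_size=0.7, test_size=0.15, val_size=0.15, seed=42):
    """Shuffle row indices and partition them into train/test/validation lists."""
    if abs(train_size + test_size + val_size - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_size)
    n_test = round(n_rows * test_size)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # remainder, absorbs any rounding
    return train, test, val


train, test, val = three_way_split(1000)
print(len(train), len(test), len(val))
```

Fixing the seed means the same rows land in the test set on every run, which is what makes a held-out evaluation repeatable.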

Refit Model with All Data

The previous steps have trained two models: a baseline logistic regression model and a more advanced CatBoost model. Following parameterisation and assessment of model performance, a final model can be retrained using the entire data set. The user may build a logistic regression or CatBoost model depending on the performance metrics. This choice must be specified by the user in the finalModel option of the configuration file.

dnattend retrain config.yaml

Generate Predictions

The trained model is now ready to be used. Predictions should be made with the predict module, which ensures the tuned decision threshold is correctly applied when assigning classes. The output of predict includes the decision class (i.e. Attend or DNA) and the underlying probabilities of these classes. The output results of this example can be found here.

dnattend predict --verify DNAttend-example.csv catboost-final.pkl > FinalPredictions.csv

Note: the --verify flag is only required when running the example workflow (see below).
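To illustrate why a tuned decision threshold matters, the following sketch selects the threshold that maximises F1 on a small set of labelled probabilities and then applies it to new probabilities. It is not DNAttend's implementation: the function names, the candidate grid, the "probability >= threshold means DNA" convention and the data are all assumptions made for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_true, dna_probs, candidates):
    """Return (threshold, F1) for the best candidate; the first wins on ties."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in dna_probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


def assign_classes(dna_probs, threshold):
    """Map DNA probabilities to class labels using the tuned threshold."""
    return ["DNA" if p >= threshold else "Attend" for p in dna_probs]


# Hypothetical validation labels (1 = DNA) and predicted DNA probabilities.
y_val = [0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.45, 0.30]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, score = tune_threshold(y_val, p_val, candidates)
labels = assign_classes([0.15, 0.55], threshold)
print(threshold, score, labels)
```

With imbalanced classes the best threshold is often well away from 0.5, so applying a default cut-off to raw probabilities can assign different classes than the tuned model would.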

Example Workflow Verification

Following initial installation, it is recommended that users run the example workflow, as described, to verify that the pipeline is functioning as expected. The --verify flag of dnattend predict, as shown above, checks the results against the expected output and reports whether or not they match.
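How --verify performs its comparison is not documented here. As a rough illustration of the idea, the sketch below compares a produced output with an expected one by hashing their lines after normalising line endings. The outputs_match helper and the column names in the sample rows are hypothetical.

```python
import hashlib


def digest_lines(lines):
    """SHA-256 over the lines, with trailing whitespace and newlines normalised."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.rstrip().encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def outputs_match(produced, expected):
    """True if both iterables of lines have identical normalised content."""
    return digest_lines(produced) == digest_lines(expected)


# Hypothetical prediction rows; the second copy differs only in line endings.
expected = ["id,class,probability\n", "1,DNA,0.81\n", "2,Attend,0.12\n"]
produced = ["id,class,probability\r\n", "1,DNA,0.81\r\n", "2,Attend,0.12\r\n"]
print(outputs_match(produced, expected))
```

An exact comparison like this only makes sense when the workflow is fully seeded, as the example configuration is.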

Configuration

DNAttend uses a single configuration file, in YAML, which documents all model parameters and ensures reproducibility of the analysis. The dnattend simulate command writes a documented example configuration file that the user can use as a template. A copy of this file is shown below and is available to download here.

input: DNAttend-example.csv    # Path to input data (Mandatory).
target: status                 # Column name of target (Mandatory).
DNAclass: 1                    # Value of target corresponding to DNA.
out: .                         # Output directory to save results.
finalModel: catboost           # Method to train final model (catboost or logistic).
catCols:                       # Column names of categorical features.
    - day
    - priority
    - speciality
    - consultationMedia
    - site
boolCols:                      # Column names of boolean features.
    - firstAppointment
numericCols:                   # Column names of numeric features.
    - age
train_size: 0.7                # Proportion of data for training.
test_size: 0.15                # Proportion of data for testing.
val_size: 0.15                 # Proportion of data for validation.
tuneThresholdBy: f1            # Metric to tune decision threshold (f1 or roc).
cvFolds: 5                     # Hyper-tuning cross-validations.
catboostIterations: 100        # Hyper-tuning CatBoost iterations.
hypertuneIterations: 5         # Hyper-tuning parameter samples.
evalIterations: 10_000         # Upper-limit over-fit iterations.
earlyStoppingRounds: 10        # Over-fit detection early stopping rounds.
seed: 42                       # Seed to ensure workflow reproducibility.
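A few of these options can be sanity-checked before starting a long training run. The sketch below works on a plain dict standing in for the parsed YAML (for instance, as returned by a YAML parser such as PyYAML's yaml.safe_load). The check_config function is hypothetical, and the rule that the three split sizes sum to 1 is an assumption based on the example values rather than a documented requirement.

```python
ALLOWED_MODELS = {"catboost", "logistic"}
ALLOWED_THRESHOLD_METRICS = {"f1", "roc"}


def check_config(config):
    """Return a list of problems found in a DNAttend-style config dict."""
    problems = []
    # The example config marks these two options as mandatory.
    for key in ("input", "target"):
        if not config.get(key):
            problems.append(f"missing mandatory option: {key}")
    if "finalModel" in config and config["finalModel"] not in ALLOWED_MODELS:
        problems.append(f"finalModel must be one of {sorted(ALLOWED_MODELS)}")
    if ("tuneThresholdBy" in config
            and config["tuneThresholdBy"] not in ALLOWED_THRESHOLD_METRICS):
        problems.append(
            f"tuneThresholdBy must be one of {sorted(ALLOWED_THRESHOLD_METRICS)}"
        )
    # Assumption: the three proportions are expected to cover the whole data set.
    sizes = [config[k] for k in ("train_size", "test_size", "val_size") if k in config]
    if len(sizes) == 3 and abs(sum(sizes) - 1.0) > 1e-9:
        problems.append("train_size, test_size and val_size should sum to 1")
    return problems


example = {
    "input": "DNAttend-example.csv",
    "target": "status",
    "finalModel": "catboost",
    "train_size": 0.7,
    "test_size": 0.15,
    "val_size": 0.15,
    "tuneThresholdBy": "f1",
}
print(check_config(example))
```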

Further Documentation

Refer to the additional documentation for further technical details of the modelling framework and visualisations from the example data set.

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

If you have any other questions, please contact the author, Stephen Richer, at stephen.richer@proton.me.

