Text Classification of Patient Experience feedback

pxtextmining: Text Classification of Patient Experience feedback

Project description

Nottinghamshire Healthcare NHS Foundation Trust hold patient feedback that is currently manually labelled by our "coders" (i.e. the staff who read the feedback and decide what it is about). As we hold thousands of patient feedback records, we (the Data Science team) are running this project to aid the coders with a text classification pipeline that will semi-automate the labelling process. We are also working in partnership with other NHS trusts who hold patient feedback text. Read more here and here.

We are working openly by open-sourcing the analysis code and data where possible to promote replication, reproducibility and further developments (pull requests are more than welcome!). We are also automating common steps in our workflow by shipping the pipeline as a Python package broken down into sub-modules and helper functions to increase usability and documentation.

Documentation

Installation;
Execution;
Pipeline description;
Function/class documentation;

Installation

We will show how to install pxtextmining from either PyPI or the GitHub repo.

Before doing so, it is best to create a Python Virtual Environment[^1] in which to install pxtextmining and its dependencies. Let's call the virtual environment text_venv:

Open a terminal, navigate to the folder where you want to put the virtual environment and run:
- python3 -m venv text_venv (Linux & MacOS);
- python -m venv text_venv (Windows);
Activate the virtual environment. In the folder that contains text_venv, run:
- source text_venv/bin/activate (Linux & MacOS);
- text_venv\Scripts\activate (Windows);
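
To confirm that the virtual environment is active before installing anything, a quick check can be run from Python. This is a minimal sketch using only the standard library; the function name in_virtual_environment is our own and is not part of pxtextmining.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
```

Note that this check covers environments created with venv as above; Conda environments are managed differently and may not be detected this way.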

Install from PyPI

Install pxtextmining and its PyPI dependencies:
- pip3 install pxtextmining==0.3.4 (Linux & MacOS);
- pip install pxtextmining==0.3.4 (Windows);

We also need to install a couple of spaCy models.

These are obtained from URL links and thus need to be installed separately:

Linux & MacOS

pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz

Windows

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz

Note that the second model is pretty large, so the installation may take a while.

All steps in one go:

Linux & MacOS

python3 -m venv text_venv
source text_venv/bin/activate
pip3 install pxtextmining==0.3.4
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz

Windows

python -m venv text_venv
text_venv\Scripts\activate
pip install pxtextmining==0.3.4
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.3.1/en_core_web_lg-2.3.1.tar.gz
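
After the installation, it can be useful to check that both spaCy models are visible to the interpreter. The sketch below assumes the models are installed as importable packages named en_core_web_sm and en_core_web_lg (as the URLs above suggest); the helper installed_models is hypothetical and not part of pxtextmining.

```python
import importlib.util

# Package names taken from the model archives installed above.
SPACY_MODELS = ("en_core_web_sm", "en_core_web_lg")


def installed_models(names=SPACY_MODELS):
    """Map each package name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in installed_models().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

The check only looks the packages up; it does not load the models, so it runs quickly even for the large model.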

Install from GitHub

To begin with, download the repo.
Install wheel:
- pip3 install wheel (Linux & MacOS);
- pip install wheel (Windows);
Install all the dependencies of pxtextmining. Inside the repo's folder, run:
- pip3 install -r requirements.txt (Linux & MacOS);
- pip install -r requirements.txt (Windows);
This will also install the spaCy models, so, unlike when installing from PyPI, no additional commands are required. Note that the second model is pretty large, so the installation may take a while.
Install pxtextmining as a Python package. Inside the repo's folder, run:
- python3 setup.py install (Linux & MacOS);
- python setup.py install (Windows);

All steps in one go:

Linux & MacOS

python3 -m venv text_venv
source text_venv/bin/activate
pip3 install wheel
pip3 install -r requirements.txt
python3 setup.py install

Windows

python -m venv text_venv
text_venv\Scripts\activate
pip install wheel
pip install -r requirements.txt
python setup.py install

Execution

Our example scripts are saved in folder execution. Each execution script is nothing more than a call to function pxtextmining.pipelines.text_classification_pipeline with user-specified arguments. The two example scripts, execution_label.py and execution_criticality.py, run the pipeline for each of the two target variables in datasets. Note that execution_criticality.py runs ordinal classification (ordinal=True).

Users can create their own execution script(s). Run the script in a Python IDE (Integrated Development Environment) or on the terminal (do not forget to activate the virtual environment first) with:

python3 execution/<script_name.py> (Linux & MacOS);
python execution/<script_name.py> (Windows);

For example:

python3 execution/execution_label.py (Linux & MacOS);
python execution/execution_label.py (Windows);

The results will be saved in a "results" folder such as results_label.

Pipeline

The pipeline is built with Python's scikit-learn (Pedregosa et al., 2011). During fitting, both the "Bag-of-Words" approach and a word embedding-based approach are tried out. The pipeline performs a randomized search over the (hyper)parameter grid (RandomizedSearchCV()) to identify the best-performing learner and (hyper)parameter values. The process also involves a few pre- and post-fitting steps:

Data load and split into training and test sets (factory_data_load_and_split.py).
Text pre-processing (e.g. remove special characters, whitespaces and line breaks) and tokenization, token lemmatization, calculation of Term Frequency-Inverse Document Frequencies (TF-IDFs), up-balancing of rare classes, feature selection, pipeline training and learner benchmarking (factory_pipeline.py).
Evaluation of pipeline performance on test set, production of evaluation metrics (Accuracy score, Class Balance Accuracy (Mosley, 2013), Balanced Accuracy (Guyon et al., 2015; Kelleher et al., 2015) or Matthews Correlation Coefficient (Baldi et al., 2000; Matthews, 1975)) and plots, and fitting of best performer on whole dataset (factory_model_performance.py).
Writing the results: fitted pipeline, tuning results, predictions, accuracy per class, model comparison bar plot, training data index, and test data index (factory_write_results.py).
Predicting unlabelled text (factory_predict_unlabelled_text.py).
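
The idea of searching over learners and (hyper)parameter values at once can be sketched with plain scikit-learn. This is not the code in factory_pipeline.py: the toy texts, the two learners and the parameter values below are all assumptions chosen for illustration, and the pre-processing, up-balancing and feature selection steps are left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for labelled patient feedback.
TEXTS = [
    "The nurse was kind and caring",
    "Staff were friendly and helpful",
    "Doctors listened to me carefully",
    "Very caring and supportive team",
    "The waiting time was far too long",
    "I waited hours for my appointment",
    "Long delays before being seen",
    "Appointment was cancelled and delayed",
]
LABELS = ["care"] * 4 + ["access"] * 4


def run_search(texts=TEXTS, labels=LABELS):
    """Fit a randomized search over two learners and their parameters."""
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
    # One dictionary per learner: the "clf" entry swaps the final step,
    # and the "clf__*" entries tune that learner's own parameters.
    param_distributions = [
        {
            "clf": [LogisticRegression(max_iter=1000)],
            "clf__C": [0.1, 1.0, 10.0],
            "tfidf__ngram_range": [(1, 1), (1, 2)],
        },
        {
            "clf": [MultinomialNB()],
            "clf__alpha": [0.1, 0.5, 1.0],
            "tfidf__ngram_range": [(1, 1), (1, 2)],
        },
    ]
    search = RandomizedSearchCV(
        pipe,
        param_distributions,
        n_iter=4,  # number of sampled parameter settings
        cv=2,  # tiny value, only because the toy dataset is tiny
        scoring="balanced_accuracy",
        random_state=0,
    )
    return search.fit(texts, labels)


if __name__ == "__main__":
    fitted = run_search()
    print(type(fitted.best_params_["clf"]).__name__, fitted.best_score_)
```

After fitting, best_params_ holds the winning learner and settings, and cv_results_ holds one row per sampled setting, which is the kind of information a tuning results table would contain.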

There are a few helper functions and classes available in the helpers folder that the aforementioned factories make use of.
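
For readers unfamiliar with the evaluation metrics named in the evaluation step above, the sketch below computes them from a confusion matrix in plain Python. It is an illustration under our own assumptions, not the package's helper code: function names are hypothetical, Class Balance Accuracy is taken as the mean over classes of the diagonal count divided by the larger of the class's row and column totals, and the Matthews Correlation Coefficient uses the usual multiclass generalisation.

```python
from collections import Counter
from math import sqrt


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(cm):
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


def balanced_accuracy(cm):
    # Mean of per-class recall, skipping classes with no true samples.
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i])]
    return sum(recalls) / len(recalls)


def class_balance_accuracy(cm):
    k = len(cm)
    col_sums = [sum(row[j] for row in cm) for j in range(k)]
    terms = []
    for i in range(k):
        denom = max(sum(cm[i]), col_sums[i])
        terms.append(cm[i][i] / denom if denom else 0.0)
    return sum(terms) / k


def matthews_corrcoef(cm):
    k = len(cm)
    t = [sum(cm[i]) for i in range(k)]  # true count per class
    p = [sum(row[j] for row in cm) for j in range(k)]  # predicted count per class
    c = sum(cm[i][i] for i in range(k))  # correctly classified samples
    s = sum(t)  # all samples
    num = c * s - sum(ti * pi for ti, pi in zip(t, p))
    den = sqrt(s * s - sum(pi * pi for pi in p)) * sqrt(s * s - sum(ti * ti for ti in t))
    return num / den if den else 0.0


if __name__ == "__main__":
    y_true = ["a", "a", "a", "b", "b", "c"]
    y_pred = ["a", "a", "b", "b", "b", "c"]
    cm = confusion_matrix(y_true, y_pred, ["a", "b", "c"])
    for metric in (accuracy, balanced_accuracy, class_balance_accuracy, matthews_corrcoef):
        print(metric.__name__, round(metric(cm), 4))
```

Unlike plain accuracy, the other three metrics penalise a model that does well only on the frequent classes, which matters when some feedback categories are rare.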

The factories are brought together in a single function, text_classification_pipeline (in text_classification_pipeline.py), that runs the whole process. This function can be run in a user-made script such as execution/execution_label.py. The text dataset is loaded either as a CSV file from folder datasets or directly from the database. (Loading from/writing to the database is for internal use only.) Because Excel can cause all sorts of issues with text encodings, it may be best to use LibreOffice when opening or editing the CSV files. The results folders (e.g. results_label) always contain a SAV file of the fitted model and a PNG of the learner comparison bar plot. Results tables are written as CSV files in a "results_" folder. All results files and folders have a "_target_variable_name" suffix, for example "tuning_results_label.csv" if the dependent variable is label.
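
The naming convention for results can be expressed as a small path helper. This is a hypothetical convenience function, not part of pxtextmining; it assumes the folder is named "results_" plus the target variable and that the file carries the same suffix, as in the tuning_results_label.csv example. File stems other than tuning_results are not confirmed here.

```python
from pathlib import Path


def results_file(stem: str, target: str, ext: str = "csv", root: Path = Path(".")) -> Path:
    """Build the expected path of a results file for a target variable.

    For stem "tuning_results" and target "label" this gives
    results_label/tuning_results_label.csv (relative to root).
    """
    folder = root / f"results_{target}"
    return folder / f"{stem}_{target}.{ext}"


if __name__ == "__main__":
    print(results_file("tuning_results", "label"))
```

A path built this way can be passed to any CSV reader to inspect the tuning results after a run.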

Here is a visual display of the process:

References

Baldi P., Brunak S., Chauvin Y., Andersen C.A.F. & Nielsen H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412--424.

Guyon I., Bennett K., Cawley G., Escalante H.J., Escalera S., Ho T.K., Macià N., Ray B., Saeed M., Statnikov A.R. & Viegas E. (2015). Design of the 2015 ChaLearn AutoML Challenge. International Joint Conference on Neural Networks (IJCNN).

Kelleher J.D., Mac Namee B. & D’Arcy A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press.

Matthews B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442--451.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825--2830.

[^1]: A virtual environment can also be created using Conda, where the commands for creating and activating it are a little different. See this.

Release history

1.0.1

Feb 15, 2024

1.0.0

Oct 9, 2023

0.5.5

Sep 1, 2023

0.5.4

Jun 15, 2023

0.5.3

May 31, 2023

0.5.2

Apr 27, 2023

0.5.1

Apr 27, 2023

0.5.0

Apr 3, 2023

0.4.2

Jan 6, 2023

0.4.1

Jan 5, 2023

0.4.0

Dec 9, 2022

0.3.4

Sep 28, 2021

0.3.3

Sep 28, 2021

0.3.2

Sep 15, 2021

0.3.1

Sep 15, 2021

0.3.0

Aug 27, 2021

0.2.16

Aug 24, 2021

0.2.15

Aug 20, 2021

0.2.14

Aug 20, 2021

0.2.13

Aug 20, 2021

0.2.12

Aug 19, 2021

0.2.11

Aug 19, 2021

0.2.10

Aug 19, 2021

0.2.9

Aug 9, 2021

0.2.8

Aug 9, 2021

0.2.7.36 yanked

Aug 9, 2021

Reason this release was yanked:

TestPyPI version mistakenly left in dist folder.

0.2.7

Aug 9, 2021

0.2.6

Jul 14, 2021

0.2.5

Jul 13, 2021

0.2.0

Jul 13, 2021

Download files

Source Distribution

pxtextmining-0.3.4.tar.gz (33.3 kB)

Uploaded Sep 28, 2021 Source

Built Distribution

pxtextmining-0.3.4-py3-none-any.whl (39.1 kB)

Uploaded Sep 28, 2021 Python 3

Hashes for pxtextmining-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`12b4a56d6e5e372ca594d4c43cc47d1d007102bd3daa504e3447b25442d3abd0`
MD5	`889214b95205ff72e8c0d1df513b7b51`
BLAKE2b-256	`a87f55161521fdd8facbd3be81f40717132aed2f265f0e8bb03bd2b5fdb73944`

Hashes for pxtextmining-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`426ca31787bbf3a92d0f1cc52bd4a57ac340334a70418fe020d556791921f0c2`
MD5	`05089effc33b0a835d0ba343fee8f5a2`
BLAKE2b-256	`6f17c6e3806b78d1b3d9852673ad639018be32ab7030f9e30682163dc5fc4060`