pxtextmining: Text Classification of Patient Experience feedback
Project description
Nottinghamshire Healthcare NHS Foundation Trust hold patient feedback that is currently manually labelled by our "coders" (i.e. the staff who read the feedback and decide what it is about). As we hold thousands of patient feedback records, we (the Data Science team) are running this project to aid the coders with a text classification pipeline that will semi-automate the labelling process. We are also working in partnership with other NHS trusts who hold patient feedback text. Read more here and here.
We are working openly by open-sourcing the analysis code and data where possible to promote replication, reproducibility and further developments (pull requests are more than welcome!). We are also automating common steps in our workflow by shipping the pipeline as a Python package broken down into sub-modules and helper functions, which makes it easier to use and to document.
Documentation
- Installation;
- Execution;
- Pipeline description;
- Function/class documentation;
Installation
- To begin with, download the repo.
- It is best to create a Python virtual environment. Let's call it `text_venv`. Open a terminal, navigate to the folder where you want to put the virtual environment and run:
  - `python3 -m venv text_venv` (Linux & MacOS);
  - `python -m venv text_venv` (Windows);
- Activate the virtual environment. In the folder containing the `text_venv` folder, run:
  - `source text_venv/bin/activate` (Linux & MacOS);
  - `source text_venv/Scripts/activate` (Windows, from a Bash-style shell such as Git Bash, since `source` is a Bash built-in);
- Install `wheel`:
  - `pip3 install wheel` (Linux & MacOS);
  - `pip install wheel` (Windows);
- Install all the necessary Python packages in the virtual environment (`text_venv`). Inside the repo's folder, run:
  - `pip3 install -r requirements.txt` (Linux & MacOS);
  - `pip install -r requirements.txt` (Windows);
- Install `pxtextmining` as a Python package. Inside the repo's folder, run:
  - `python3 setup.py install` (Linux & MacOS);
  - `python setup.py install` (Windows);
All steps in one go:
- Linux & MacOS:

      python3 -m venv text_venv
      source text_venv/bin/activate
      pip3 install wheel
      pip3 install -r requirements.txt
      python3 setup.py install

- Windows:

      python -m venv text_venv
      source text_venv/Scripts/activate
      pip install wheel
      pip install -r requirements.txt
      python setup.py install

Execution
All execution scripts may be saved in folder `execution`. The execution scripts are nothing more than a call of function `pxtextmining.pipelines.text_classification_pipeline` with user-specified arguments. The two example scripts, `execution_label.py` and `execution_criticality.py`, run the pipeline for each of the two target variables in `datasets/text_data.csv`. Note that `execution_criticality.py` runs ordinal classification (`ordinal=True`).
The user can create their own execution script(s). Run the script in a Python IDE (Integrated Development Environment) or on the terminal (do not forget to activate the virtual environment first) with:
- `python3 execution/<script_name.py>` (Linux & MacOS);
- `python execution/<script_name.py>` (Windows);

For example:
- `python3 execution/execution_label.py` (Linux & MacOS);
- `python execution/execution_label.py` (Windows);
Pipeline
The pipeline is built with Python's Scikit-learn (Pedregosa et al., 2011). During fitting, both the "Bag-of-Words" approach and a word embedding-based approach are tried out. The pipeline performs a randomized search over learners and (hyper)parameter values (`RandomizedSearchCV()`) to identify the best-performing combination. The process also involves a few pre- and post-fitting steps:
- Data load and split into training and test sets (`factory_data_load_and_split.py`).
- Text pre-processing (e.g. remove special characters, whitespaces and line breaks) and tokenization, token lemmatization, calculation of Term Frequency–Inverse Document Frequencies (TF-IDFs), up-balancing of rare classes, feature selection, pipeline training and learner benchmarking (`factory_pipeline.py`).
- Evaluation of pipeline performance on test set, production of evaluation metrics (Accuracy score, Class Balance Accuracy (Mosley, 2013), Balanced Accuracy (Guyon et al., 2015, Kelleher et al., 2015) or Matthews Correlation Coefficient (Baldi et al., 2000, Matthews, 1975)) and plots, and fitting of best performer on whole dataset (`factory_model_performance.py`).
- Writing the results: fitted pipeline, tuning results, predictions, accuracy per class, model comparison bar plot, training data index, and test data index (`factory_write_results.py`).
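The steps above can be sketched with Scikit-learn alone. The snippet below is a minimal illustration under stated assumptions, not the package's own code: the toy texts, the `care`/`access` labels, the choice of `LogisticRegression`, the parameter distributions and the names `run_toy_search` and `class_balance_accuracy` are all made up for the example. Lemmatization, word embeddings, up-balancing of rare classes and the final refit on the whole dataset are left out to keep it short. Class Balance Accuracy is computed here as the mean over classes of the correctly classified count divided by the larger of the class's row and column totals in the confusion matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             matthews_corrcoef)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def class_balance_accuracy(y_true, y_pred):
    """Mean over classes of c_ii / max(row total, column total)."""
    cm = confusion_matrix(y_true, y_pred)
    denom = np.maximum(cm.sum(axis=0), cm.sum(axis=1))
    return float(np.mean(cm.diagonal() / denom))


def run_toy_search(random_state=0):
    # Hypothetical toy feedback; a real run would read datasets/text_data.csv.
    care = ["staff were kind", "nurses were caring and kind",
            "the doctor listened to me", "very caring team",
            "kind and helpful staff", "the nurses listened",
            "caring doctor and kind nurses", "staff listened and helped"]
    access = ["waiting time was too long", "could not get an appointment",
              "long wait for an appointment", "waiting list is too long",
              "appointment was cancelled again", "too long to wait",
              "no appointment available", "cancelled after a long wait"]
    texts = care + access
    labels = ["care"] * len(care) + ["access"] * len(access)

    # Step 1: split into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels,
        random_state=random_state)

    # Step 2: TF-IDF features, feature selection and a learner in one pipeline.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectPercentile(chi2)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Assumed search space, sampled at random rather than searched exhaustively.
    param_distributions = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "tfidf__sublinear_tf": [True, False],
        "select__percentile": [50, 100],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = RandomizedSearchCV(pipe, param_distributions, n_iter=4, cv=2,
                                scoring="balanced_accuracy",
                                random_state=random_state)
    search.fit(X_train, y_train)

    # Step 3: evaluate the best pipeline on the held-out test set.
    y_pred = search.predict(X_test)
    return {
        "best_params": search.best_params_,
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "class_balance_accuracy": class_balance_accuracy(y_test, y_pred),
    }


if __name__ == "__main__":
    print(run_toy_search())
```

Keeping the vectorizer and feature selection inside the `Pipeline` means they are refitted on each cross-validation fold, so the search scores are not inflated by vocabulary learned from held-out text.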
There are a few helper functions and classes available in the helpers folder that the aforementioned factories make use of.
The factories are brought together in a single function `text_classification_pipeline.py` that runs the whole process. This function can be run in a user-made script such as `execution/execution_label.py`.
The text dataset is loaded either as a CSV file from folder `datasets` or directly from the database. (Loading from/writing to the database is for internal use only and this feature will be removed when a proper API is developed for the functions that read/write data.) Because Excel can cause all sorts of issues with text encodings, it may be best to use LibreOffice when opening or editing the CSV file.
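One way to keep encoding problems visible is to read the CSV with an explicit encoding, so that a file saved in the wrong encoding fails loudly instead of silently garbling the text. The sketch below uses only the standard library and is not how the package itself loads data; the function name `load_feedback_csv`, the `feedback` column name and the demo file contents are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def load_feedback_csv(path, text_col="feedback", target_col="label",
                      encoding="utf-8"):
    """Return (texts, targets) read from a CSV file with an explicit encoding."""
    texts, targets = [], []
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            texts.append(row[text_col])
            targets.append(row[target_col])
    return texts, targets


def demo():
    # Write a tiny hypothetical dataset containing non-ASCII characters and
    # an embedded line break, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "text_data.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["feedback", "label"])
            writer.writerow(["Parking cost £5 but staff were kind", "care"])
            writer.writerow(["Waited too long\nfor the café", "access"])
        return load_feedback_csv(path)


if __name__ == "__main__":
    print(demo())
```

Passing the wrong `encoding` (for example `"ascii"` for the demo file) raises a `UnicodeDecodeError`, which is usually easier to diagnose than mis-decoded characters reaching the classifier.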
The results folders (e.g. `results_label`) always contain a SAV file of the fitted model and a PNG of the learner comparison bar plot. Results tables are written as CSV files in a "results_" folder. All results files and folders have a "_target_variable_name" suffix, for example "tuning_results_label.csv" if the dependent variable is `label`.
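The naming convention can be expressed as a small path helper. This is only a sketch of the suffix rule described above: `results_folder` and `results_file` are hypothetical names, not functions of the package, and placing the CSV inside the target's results folder is an assumption based on the description.

```python
from pathlib import Path


def results_folder(target):
    """Folder for one target variable, e.g. results_label."""
    return Path(f"results_{target}")


def results_file(stem, target, extension):
    """Path of a results file carrying the target-variable suffix."""
    return results_folder(target) / f"{stem}_{target}.{extension}"


if __name__ == "__main__":
    print(results_file("tuning_results", "label", "csv").as_posix())
```

With `label` as the dependent variable, `results_file("tuning_results", "label", "csv")` points at `tuning_results_label.csv` inside `results_label`.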
Here is a visual display of the process:
References
Baldi P., Brunak S., Chauvin Y., Andersen C.A.F. & Nielsen H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424.
Guyon I., Bennett K., Cawley G., Escalante H.J., Escalera S., Ho T.K., Macià N., Ray B., Saeed M., Statnikov A.R. & Viegas E. (2015). Design of the 2015 ChaLearn AutoML Challenge. International Joint Conference on Neural Networks (IJCNN).
Kelleher J.D., Mac Namee B. & D’Arcy A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press.
Matthews B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442–451.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.