Text classification of patient experience feedback.

Project description

pxtextmining: Text Classification of Patient Experience feedback

pxtextmining is a Python package for classification and sentiment analysis of patient feedback comments collected via the NHS England Friends and Family Test (FFT).

There are two parts to the package. The first, comprising the majority of the codebase, is a machine learning pipeline that trains a model using labelled data. This pipeline outputs a fully trained model which can predict either 'criticality' scores or a thematic 'label' category for some feedback text. Examples of this can be found in the 'execution' folder.

The second part utilises the trained model to make predictions on unlabelled feedback text, outputting predicted labels or criticality scores. An example of how this works using a model trained to predict 'label' is given below:

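import pandas as pd
from pxtextmining.factories.factory_predict_unlabelled_text import factory_predict_unlabelled_text  # import path assumed from the package layout

dataset = pd.read_csv('datasets/text_data.csv')
predictions = factory_predict_unlabelled_text(dataset=dataset, predictor="feedback", pipe_path_or_object="results_label/pipeline_label.sav")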

We work openly, open-sourcing the analysis code and data where possible to promote replication, reproducibility and further development (pull requests are more than welcome!). We are also automating common steps in our workflow by shipping the pipeline as a Python package, broken down into sub-modules and helper functions to improve usability and documentation.

Documentation

Full documentation, including installation instructions, is available on our documentation page.

Pipeline to train a new model

The pipeline is built with scikit-learn (Pedregosa et al., 2011). It performs a randomized search (RandomizedSearchCV()) to identify the best-performing learner and hyperparameter values.
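The idea behind a randomized search can be sketched in plain Python: sample a fixed number of (learner, hyperparameter) candidates at random, score each one, and keep the best. This is only an illustration, not the package's code. The two toy learners, the search space and the data are hypothetical, and candidates are scored on a single toy dataset, whereas RandomizedSearchCV() cross-validates each candidate.

```python
import random


def threshold_learner(cutoff):
    """Toy learner: predict 1 when the feature exceeds the cutoff."""
    return lambda x: int(x > cutoff)


def constant_learner(label):
    """Toy baseline: always predict the same label."""
    return lambda x: label


# Hypothetical search space: each learner samples its own hyperparameters.
SEARCH_SPACE = {
    "threshold": lambda rng: {"cutoff": rng.uniform(-1.0, 1.0)},
    "constant": lambda rng: {"label": rng.choice([0, 1])},
}
BUILDERS = {"threshold": threshold_learner, "constant": constant_learner}


def accuracy(model, data):
    """Share of (feature, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def randomized_search(data, n_iter=30, seed=0):
    """Score n_iter randomly sampled candidates and return the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        name = rng.choice(sorted(SEARCH_SPACE))
        params = SEARCH_SPACE[name](rng)
        model = BUILDERS[name](**params)
        trials.append({"learner": name, "params": params,
                       "score": accuracy(model, data)})
    best = max(trials, key=lambda trial: trial["score"])
    return best, trials


# Toy data: one numeric feature; the label is 1 when the feature exceeds 0.3.
data = [(x / 10, int(x / 10 > 0.3)) for x in range(-10, 11)]
best, trials = randomized_search(data)
print(best)
```

Unlike an exhaustive grid search, the cost is fixed by n_iter rather than by the size of the search space, which matters when several learners, each with its own hyperparameters, are being compared.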

Breakdown of the pipeline process, built by the functions in pxtextmining.factories:

  1. The data is loaded and split into training and test sets by factory_data_load_and_split. This step also conducts some basic text preprocessing, such as removing special characters, excess whitespace and line breaks. It produces additional features: 'text_length', and sentiment scores computed using vaderSentiment and textblob.

  2. The function in factory_pipeline creates an sklearn pipeline with the following steps. First, the preprocessed text input is upsampled to help compensate for the unbalanced dataset. The text is then tokenised and vectorised using either spaCy or WordNet. Feature selection is then conducted. A hyperparameter grid is constructed with potential hyperparameter values, depending on the learners/classification models to be tested in the randomized search. The pipeline is then fitted on the dataset to identify the best model.

  3. The fitted pipeline is then evaluated on the test set in factory_model_performance. The evaluation metrics used are Accuracy, Class Balance Accuracy (Mosley, 2013), Balanced Accuracy (Guyon et al., 2015; Kelleher et al., 2015) and Matthews Correlation Coefficient (Baldi et al., 2000; Matthews, 1975). A visual representation of the performance evaluation is output as a bar chart.

  4. The results are written by factory_write_results: the fitted pipeline, tuning results, predictions, accuracy per class, model comparison bar chart, training data index and test data index.
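The text cleaning in step 1 and the upsampling in step 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation. The regular expressions, the definition of 'text_length' as a character count, the random-duplication approach to upsampling and the example label names are all assumptions. The sentiment features from vaderSentiment and textblob are omitted.

```python
import random
import re
from collections import Counter


def clean_text(text):
    """Replace line breaks and special characters with spaces, then collapse whitespace."""
    text = re.sub(r"[\r\n]+", " ", text)
    # Assumed definition of "special": anything except letters, digits,
    # whitespace and basic punctuation.
    text = re.sub(r"[^A-Za-z0-9\s.,!?'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def add_text_length(rows):
    """Add a 'text_length' feature (assumed here to be the character count)."""
    return [{**row, "text_length": len(row["feedback"])} for row in rows]


def upsample(rows, label_key="label", seed=0):
    """Randomly duplicate minority-class rows until every class is equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical feedback rows; the label names are made up for illustration.
raw = [
    {"feedback": "Staff were   kind\nand helpful :)", "label": "Staff"},
    {"feedback": "Great nurses!", "label": "Staff"},
    {"feedback": "Friendly team.", "label": "Staff"},
    {"feedback": "Parking was £5 ## too expensive", "label": "Access"},
]
cleaned = add_text_length(
    [{**row, "feedback": clean_text(row["feedback"])} for row in raw]
)
balanced = upsample(cleaned)
print(Counter(row["label"] for row in balanced))
```

Upsampling of this kind should only ever be applied to the training data, so that duplicated rows cannot leak into the test set.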

The four steps above are all pulled together in pxtextmining.pipelines.text_classification_pipeline.
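The four evaluation metrics named in step 3 can all be computed from the confusion matrix. The sketch below is a plain-Python illustration, not the code in factory_model_performance. It assumes the usual definitions: Balanced Accuracy as the mean per-class recall, Class Balance Accuracy as each diagonal cell divided by the larger of its row and column totals and then averaged over classes, and the multiclass generalisation of the Matthews Correlation Coefficient. The label names are hypothetical.

```python
from collections import Counter
from math import sqrt


def evaluate(y_true, y_pred):
    """Return the four metrics for two equal-length label sequences."""
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    cell = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    true_total = Counter(y_true)         # confusion-matrix row totals
    pred_total = Counter(y_pred)         # confusion-matrix column totals
    correct = sum(cell[(c, c)] for c in classes)

    accuracy = correct / n

    # Balanced accuracy: mean recall over the classes that occur in y_true.
    recalls = [cell[(c, c)] / true_total[c] for c in classes if true_total[c]]
    balanced_accuracy = sum(recalls) / len(recalls)

    # Class Balance Accuracy: diagonal cell over max(row total, column total).
    cba = sum(
        cell[(c, c)] / max(true_total[c], pred_total[c]) for c in classes
    ) / len(classes)

    # Multiclass Matthews correlation coefficient.
    numerator = correct * n - sum(true_total[c] * pred_total[c] for c in classes)
    denominator = sqrt(
        (n * n - sum(pred_total[c] ** 2 for c in classes))
        * (n * n - sum(true_total[c] ** 2 for c in classes))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "class_balance_accuracy": cba,
        "mcc": mcc,
    }


y_true = ["Access", "Access", "Access", "Care", "Care", "Staff"]
y_pred = ["Access", "Access", "Care", "Care", "Staff", "Staff"]
scores = evaluate(y_true, y_pred)
print(scores)
```

On unbalanced data the metrics diverge: here plain accuracy is 4/6, while Class Balance Accuracy is 5/9 because it penalises both missed and over-predicted classes. scikit-learn provides accuracy_score, balanced_accuracy_score and matthews_corrcoef in sklearn.metrics for three of the four.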

References

Baldi P., Brunak S., Chauvin Y., Andersen C.A.F. & Nielsen H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412--424.

Guyon I., Bennett K., Cawley G., Escalante H.J., Escalera S., Ho T.K., Macià N., Ray B., Saeed M., Statnikov A.R. & Viegas E. (2015). Design of the 2015 ChaLearn AutoML Challenge. International Joint Conference on Neural Networks (IJCNN).

Kelleher J.D., Mac Namee B. & D’Arcy A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press.

Matthews B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442--451.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825--2830.

[^1]: A virtual environment can also be created using Conda, where the commands for creating and activating it are a little different. See this.
