Text classification of patient experience feedback.
Project description
pxtextmining: Text Classification of Patient Experience feedback
pxtextmining is a Python package for classifying and conducting sentiment analysis of patient feedback comments, collected via the NHS England Friends and Family Test (FFT).
There are two parts to the package. The first, comprising the majority of the codebase, is a machine learning pipeline that trains a model using labelled data. This pipeline outputs a fully trained model which can predict either 'criticality' scores or a thematic 'label' category for some feedback text. Examples of this can be found in the 'execution' folder.
The second part utilises the trained model to make predictions on unlabelled feedback text, outputting predicted labels or criticality scores. An example of how this works using a model trained to predict 'label' is given below:
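import pandas as pd
from pxtextmining.factories.factory_predict_unlabelled_text import factory_predict_unlabelled_text  # import path assumed from the package layout
dataset = pd.read_csv('datasets/text_data.csv')
predictions = factory_predict_unlabelled_text(dataset=dataset, predictor="feedback", pipe_path_or_object="results_label/pipeline_label.sav")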
We are working openly by open-sourcing the analysis code and data where possible to promote replication, reproducibility and further developments (pull requests are more than welcome!). We are also automating common steps in our workflow by shipping the pipeline as a Python package broken down into sub-modules and helper functions to increase usability and documentation.
Documentation
Full documentation, including installation instructions, is available on our documentation page.
Pipeline to train a new model
The pipeline is built with Python's Scikit-learn (Pedregosa et al., 2011). It performs a randomized search (RandomizedSearchCV()) to identify the best-performing learner and (hyper)parameter values.
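As a rough illustration of how a randomized search can choose between learners as well as their hyperparameters, the sketch below uses scikit-learn directly. It is not the package's own pipeline: the toy comments, the labels, the TF-IDF step and the two candidate learners are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "The nurses were kind and caring",
    "Staff were friendly and helpful",
    "Excellent care from the whole team",
    "Doctors listened and explained everything clearly",
    "Waited hours and nobody told us anything",
    "The ward was dirty and noisy",
    "Appointment was cancelled without any notice",
    "Staff were rude and dismissive",
]
labels = ["positive"] * 4 + ["negative"] * 4

# A two-step pipeline; the "clf" step is a placeholder that the search replaces.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# One dictionary per candidate learner, each with its own hyperparameter values.
param_distributions = [
    {
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0, 10.0],
        "tfidf__ngram_range": [(1, 1), (1, 2)],
    },
    {
        "clf": [MultinomialNB()],
        "clf__alpha": [0.1, 0.5, 1.0],
        "tfidf__ngram_range": [(1, 1), (1, 2)],
    },
]

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,                      # number of sampled configurations
    cv=2,                          # small fold count because the toy data is tiny
    scoring="balanced_accuracy",
    random_state=0,
)
search.fit(texts, labels)

# The refitted best pipeline can be used directly for prediction.
predictions = search.predict(["Very caring staff", "Long wait and rude staff"])
print(search.best_params_)
```

Passing a list of dictionaries lets each learner carry its own hyperparameter space, so a single search compares learners and tunes them at the same time.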
Breakdown of the pipeline process, built by the functions in pxtextmining.factories:
- The data is loaded and split into training and test sets in factory_data_load_and_split. This module also conducts some basic text preprocessing, such as removing special characters, excess whitespace and line breaks. It produces additional features by computing 'text_length' and sentiment scores using vaderSentiment and textblob.
- The function in factory_pipeline creates an sklearn pipeline with the following steps: first, the preprocessed text input is upsampled to help compensate for the unbalanced dataset. The text is then tokenized and vectorised using either spacy or wordnet. Feature selection is then conducted. A hyperparameter grid is constructed with candidate hyperparameter values, depending on the learners/classification models to be tested in the randomized search. The pipeline is then fitted on the dataset to identify the best model.
- The fitted pipeline is then evaluated on the test set in factory_model_performance. The evaluation metrics used are Accuracy, Class Balance Accuracy (Mosley, 2013), Balanced Accuracy (Guyon et al., 2015; Kelleher et al., 2015) and Matthews Correlation Coefficient (Baldi et al., 2000; Matthews, 1975). A visual representation of the performance evaluation is output as a bar chart.
- Writing the results: the fitted pipeline, tuning results, predictions, accuracy per class, model comparison bar chart, training data index, and test data index are output by factory_write_results.
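The preprocessing described in the first step can be sketched with the standard library alone. This is a minimal illustration, not the code in factory_data_load_and_split: the function names clean_text and add_text_length are hypothetical, the exact set of characters removed is an assumption, and 'text_length' is assumed here to be a character count. The sentiment features (vaderSentiment, textblob) are not shown.

```python
import re


def clean_text(text: str) -> str:
    """Remove line breaks and special characters, then collapse whitespace."""
    text = re.sub(r"[\r\n]+", " ", text)           # line breaks become spaces
    text = re.sub(r"[^A-Za-z0-9' ]+", " ", text)   # assumed definition of "special characters"
    return re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace


def add_text_length(comments):
    """Return one record per comment with the cleaned text and its length."""
    records = []
    for comment in comments:
        cleaned = clean_text(comment)
        # Assumption: text_length counts characters of the cleaned comment.
        records.append({"feedback": cleaned, "text_length": len(cleaned)})
    return records


rows = add_text_length(["Great care!\nThank  you :)"])
print(rows)
```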
The four steps above are all pulled together in pxtextmining.pipelines.text_classification_pipeline.
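To make the four evaluation metrics concrete, the sketch below computes them from scratch for a small multi-class example. It is an illustration under stated assumptions rather than the code in factory_model_performance: the function name evaluate and the toy labels are hypothetical, Class Balance Accuracy is taken as the mean over classes of the diagonal count divided by the larger of the class's true and predicted totals, and the Matthews Correlation Coefficient uses its standard multi-class form.

```python
from collections import Counter
from math import sqrt


def evaluate(y_true, y_pred):
    """Compute accuracy, balanced accuracy, class balance accuracy and MCC."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))   # confusion-matrix cells
    true_n = Counter(y_true)               # row totals (true class counts)
    pred_n = Counter(y_pred)               # column totals (predicted class counts)
    correct = sum(pairs[(c, c)] for c in classes)

    accuracy = correct / n

    # Balanced accuracy: mean recall over classes that occur in y_true.
    recalls = [pairs[(c, c)] / true_n[c] for c in classes if true_n[c] > 0]
    balanced_accuracy = sum(recalls) / len(recalls)

    # Class Balance Accuracy: diagonal over max(row total, column total), averaged.
    cba = sum(pairs[(c, c)] / max(true_n[c], pred_n[c]) for c in classes) / len(classes)

    # Multi-class Matthews Correlation Coefficient.
    numerator = correct * n - sum(true_n[c] * pred_n[c] for c in classes)
    denominator = sqrt(
        (n * n - sum(pred_n[c] ** 2 for c in classes))
        * (n * n - sum(true_n[c] ** 2 for c in classes))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "class_balance_accuracy": cba,
        "mcc": mcc,
    }


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
scores = evaluate(y_true, y_pred)
print(scores)
```

Because Class Balance Accuracy divides by the larger of the two totals, it penalises both missed cases and over-prediction of a class, which is why it comes out lower than balanced accuracy on this example.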
References
Baldi P., Brunak S., Chauvin Y., Andersen C.A.F. & Nielsen H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424.
Guyon I., Bennett K., Cawley G., Escalante H.J., Escalera S., Ho T.K., Macià N., Ray B., Saeed M., Statnikov A.R. & Viegas E. (2015). Design of the 2015 ChaLearn AutoML Challenge. International Joint Conference on Neural Networks (IJCNN).
Kelleher J.D., Mac Namee B. & D’Arcy A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press.
Matthews B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405(2):442–451.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.
[^1]: A virtual environment can also be created using Conda, where the commands for creating and activating it are slightly different. See the Conda documentation.