Use the MedSecId section annotations with MIMIC-III corpus parsing.

MIMIC-III corpus parsing and section prediction with MedSecId

This repository contains a Python package to automatically segment and identify sections of clinical notes, such as electronic health record (EHR) medical documents. It also provides access to the MedSecId section annotations with MIMIC-III corpus parsing from the paper A New Public Corpus for Clinical Section Identification: MedSecId. See the medsecid repository to reproduce the results from the paper.

This package provides the following:

The same access to MIMIC-III data as provided in the mimic package.
Access to the annotated MedSecId notes as an easy-to-use Python object graph.
Pretrained model inferencing, which produces a Python object graph similar to the annotations (it provides the class PredictedNote instead of an AnnotatedNote class).

Obtaining
Documentation
Installation
Usage
- Prediction Usage
- Annotation Access
Differences from the Paper Repository
Training
- Preprocessing Step
- Training and Testing
Training Production Models
Models
- MedCAT Models
- Performance Metrics
  - Version 0.0.2
  - Version 0.0.3
Citation
Docker
Changelog
Community
License

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.mimicsid

Binaries are also available on PyPI.

A docker image is now available as well.

Documentation

See the full documentation. The API reference is also available.

Installation

If you only want to predict sections using the pretrained model, you only need to install the package. However, if you want to access the annotated notes, you must install a Postgres MIMIC-III database as described in the install section of the mimic package.

Usage

This package provides models to predict sections of a medical note and access to the MIMIC-III section annotations available on Zenodo. The first run takes a while because it downloads the annotation set and the pretrained models.

See the examples for the complete code and additional documentation.

Prediction Usage

The SectionPredictor class creates section annotation span IDs/types and header token spans. See the example below:

from zensols.nlp import FeatureToken
from zensols.mimic import Section
from zensols.mimicsid import PredictedNote, ApplicationFactory
from zensols.mimicsid.pred import SectionPredictor

if (__name__ == '__main__'):
    # get the section predictor from the application context in the app
    section_predictor: SectionPredictor = ApplicationFactory.section_predictor()

    # read in a test note to predict
    with open('../../test-resources/note.txt') as f:
        content: str = f.read().strip()

    # predict the sections of read in note and print it
    note: PredictedNote = section_predictor.predict([content])[0]
    note.write()

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)

    # concepts or special MIMIC tokens from the addendum section
    sec = note.sections_by_name['addendum'][0]
    tok: FeatureToken
    for tok in sec.body_doc.token_iter():
        print(tok, tok.mimic_, tok.cui_)
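
The object graph traversal above can be wrapped in a small helper. The following is a minimal sketch that relies only on the attributes used in the example (note.sections, sec.id and sec.name). The function name section_ids_by_name and the stand-in note are hypothetical and only serve to make the sketch runnable without downloading the models.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, List


def section_ids_by_name(note: Any) -> Dict[str, List[int]]:
    """Map each section name to the IDs of the sections with that name.

    Only ``note.sections``, ``sec.id`` and ``sec.name`` are accessed, which
    are the attributes shown in the prediction example.
    """
    by_name: Dict[str, List[int]] = defaultdict(list)
    for sec in note.sections.values():
        by_name[sec.name].append(sec.id)
    return dict(by_name)


# stand-in for a PredictedNote (hypothetical data, for illustration only)
fake_note = SimpleNamespace(sections={
    0: SimpleNamespace(id=0, name='history-of-present-illness'),
    1: SimpleNamespace(id=1, name='addendum'),
    2: SimpleNamespace(id=2, name='addendum'),
})
summary = section_ids_by_name(fake_note)
print(summary)
```

With a real note, passing the PredictedNote returned by section_predictor.predict should work the same way, assuming the attributes behave as in the example.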

Annotation Access

Annotated notes are provided as a Python Note class, which contains most of the MIMIC-III data from the NOTEEVENTS table. This includes not only the text, but parsed FeatureDocument instances. However, you must build a Postgres database and provide a login to it in the application as detailed below:

from zensols.config import IniConfig
from zensols.mimic import Section
from zensols.mimicsid import ApplicationFactory
from zensols.mimic import Note
from zensols.mimicsid import AnnotatedNote, NoteStash

if (__name__ == '__main__'):
    # create a configuration with the Postgres database login
    config = IniConfig('db.conf')
    # get the `dict` like data structure that has notes by `row_id`
    note_stash: NoteStash = ApplicationFactory.note_stash(
        **config.get_options(section='mimic_postgres_conn_manager'))

    # get a note by `row_id`
    note: Note = note_stash[14793]

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)
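
The example above reads the database login from db.conf. As a rough sketch of what such a file could look like, the snippet below writes and reads an INI file with the standard library. Only the section name mimic_postgres_conn_manager comes from the example. The option names and values (host, port, db_name, user, password) are placeholders and assumptions, so check the mimic package documentation for the keys it actually expects.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Dict

# section name taken from the annotation access example; the option names
# passed below are placeholders (assumptions), not documented keys
SECTION = 'mimic_postgres_conn_manager'


def write_db_config(path: Path, **options: str) -> None:
    """Write the login options to an INI file under ``SECTION``."""
    parser = configparser.ConfigParser()
    parser[SECTION] = options
    with open(path, 'w') as f:
        parser.write(f)


def read_db_options(path: Path) -> Dict[str, str]:
    """Read the options back as a plain ``dict`` of keyword arguments."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser[SECTION])


# write to a temporary directory so the sketch does not touch real files
conf_path = Path(tempfile.mkdtemp()) / 'db.conf'
write_db_config(conf_path, host='localhost', port='5432',
                db_name='mimic3', user='mimic', password='changeme')
db_options = read_db_options(conf_path)
print(sorted(db_options))
```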

Differences from the Paper Repository

The paper's medsecid repository differs from this one in quite a few ways, mostly around reproducibility, whereas this repository is designed as a package for research that applies the model. To reproduce the results of the paper, please refer to the medsecid repository. To use the best performing model from that paper (the BiLSTM-CRF token model), use this repository.

Perhaps the largest difference is that this repository has a pretrained model and code for header tokens. This is a separate model whose header token predictions are "merged" with the section ID/type predictions.

The differences in performance between the section ID/type models and the metrics reported in the paper involve several factors. The primary difference is that the released models were also trained on the test data to increase the performance of the pretrained models, so only validation performance metrics are reported. Other changes include:

Uses the mednlp package, which uses MedCAT to parse clinical medical text. This includes changes such as fixing misspellings and expanding acronyms.
Uses the mimic package, which builds on the mednlp package and parses MIMIC-III text by configuring the spaCy tokenizer to deal with pseudo tokens (e.g. [**First Name**]). This is a significant change given how these tokens are treated between the models and term mapping (Pt. becomes patient). This was changed so the model will work well on non-MIMIC data.
Feature set differences, such as those provided by the Zensols Deep NLP package.
Model changes include LSTM hidden layer parameter size and activation function.
White space tokens are removed in the medsecid repository and added back in this package to give the model additional cues on when to break a section. However, this might have had the opposite effect.
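
To make the pseudo token point concrete, the following is a minimal sketch that locates MIMIC-III style pseudo tokens such as [**First Name**] with a regular expression. It is not how the mimic package configures the spaCy tokenizer. The pattern, the function name and the sample sentence are assumptions for illustration only.

```python
import re
from typing import List, Tuple

# matches the de-identification markup ``[** ... **]`` non-greedily
PSEUDO_TOKEN = re.compile(r'\[\*\*(.*?)\*\*\]')


def find_pseudo_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Return ``(start, end, inner text)`` for each pseudo token in ``text``."""
    return [(m.start(), m.end(), m.group(1))
            for m in PSEUDO_TOKEN.finditer(text)]


# hypothetical sample sentence, not taken from the corpus
sample = 'Pt. seen by Dr. [**Last Name**] on [**2118-6-2**].'
spans = find_pseudo_tokens(sample)
for start, end, inner in spans:
    print(sample[start:end], '->', inner)
```

A tokenizer that treats each matched span as a single token avoids splitting the markup into brackets, asterisks and words.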

There are also changes in the libraries used:

PyTorch was upgraded from 1.9.1 to 1.12.1
spaCy was upgraded from 3.0.7 to 3.2.4
Python was upgraded from 3.9 to 3.10.

Training

This section explains how to create and package models for distribution.

Preprocessing Step

To train the model, first install the MIMIC-III Postgres database per the mimic package instructions in the Installation section.
Add the MIMIC-III Postgres credentials and database configuration to etc/batch.conf.
Comment out the line resource(zensols.mimicsid): resources/model/adm.conf in resources/app.conf.
Vectorize the batches using the preprocessing script: $ ./src/bin/preprocess.sh. This also creates cached hospital admission and spaCy data parse files.

Training and Testing

To get performance metrics on the test set after training on the training set, use the command: ./mimicsid traintest -c models/glove300.conf for the section ID model. The configuration file can be any of those in the models directory. For the header model use:

./mimicsid traintest -c models/glove300.conf --override mimicsid_default.model_type=header
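
When running both model types over several configuration files, the two commands above can be generated programmatically. This is a small sketch that only assembles the argument lists from the commands shown in this section. The helper name traintest_command is hypothetical.

```python
from typing import List, Optional


def traintest_command(config: str,
                      model_type: Optional[str] = None) -> List[str]:
    """Build the ``traintest`` argument list for a model configuration.

    ``model_type`` is added as the ``mimicsid_default.model_type`` override;
    leave it as ``None`` for the section ID model.
    """
    cmd = ['./mimicsid', 'traintest', '-c', config]
    if model_type is not None:
        cmd += ['--override', f'mimicsid_default.model_type={model_type}']
    return cmd


section_cmd = traintest_command('models/glove300.conf')
header_cmd = traintest_command('models/glove300.conf', 'header')
print(' '.join(section_cmd))
print(' '.join(header_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root to execute it.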

Training Production Models

To train models for use in your projects, train on both the training and test sets. The validation set is still held out and determines when to save the model, namely on epochs where the loss decreases. The steps are:

Update the deeplearn_model_packer:version in resources/app.conf.
Preprocess (see the Preprocessing Step section).
Run the script that trains the models and packages them: src/bin/package.sh.
Check for errors and verify models: $ ./src/bin/verify-model.py.
Don't forget to revert files etc/batch.conf and resources/app.conf.
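
The scripted part of these steps can be chained so that a failure stops the run. The sketch below assumes it is started from the repository root and uses only the script paths named in this document. The run_steps helper and its dry_run flag are hypothetical. Editing resources/app.conf and reverting the configuration files remain manual steps.

```python
import subprocess
from typing import List, Sequence

# scripts named in the preprocessing and production training steps
PRODUCTION_STEPS: List[List[str]] = [
    ['./src/bin/preprocess.sh'],
    ['./src/bin/package.sh'],
    ['./src/bin/verify-model.py'],
]


def run_steps(steps: Sequence[Sequence[str]],
              dry_run: bool = True) -> List[str]:
    """Run each step in order and return the command lines handled.

    With ``dry_run=True`` nothing is executed; otherwise a non-zero exit
    status raises ``subprocess.CalledProcessError`` and stops the run.
    """
    handled: List[str] = []
    for step in steps:
        if not dry_run:
            subprocess.run(list(step), check=True)
        handled.append(' '.join(step))
    return handled


planned = run_steps(PRODUCTION_STEPS)
print('\n'.join(planned))
```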

Models

You can mix and match models across section vs. header models (see Performance Metrics). By default the package uses the best performing models but you can select the model you want by adding a configuration file and specifying it on the command line with -c:

[mimicsid_default]
section_prediction_model = bilstm-crf-tok-fasttext
header_prediction_model = bilstm-crf-tok-glove-300d

The resources live on Zenodo and are automatically downloaded to the ~/.cache directory (or a similar home directory on Windows) the first time the program is used.
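
The model selection file shown above can also be generated, for example when comparing model combinations. This is a minimal sketch using the standard library configparser. The section and option names come from the configuration above, while the helper name model_selection_config is hypothetical.

```python
import configparser
import io


def model_selection_config(section_model: str, header_model: str) -> str:
    """Render a ``[mimicsid_default]`` section selecting the two models."""
    parser = configparser.ConfigParser()
    parser['mimicsid_default'] = {
        'section_prediction_model': section_model,
        'header_prediction_model': header_model,
    }
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


config_text = model_selection_config(
    'bilstm-crf-tok-fasttext', 'bilstm-crf-tok-glove-300d')
print(config_text)
```

Write the returned text to a file of your choosing and pass that file to the command line program with -c.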

MedCAT Models

The mednlp package dependency uses the default MedCAT model.

Performance Metrics

The distributed models add the test set to the training set to improve inferencing performance, which is why only the validation metrics are given. The validation set performance of the pretrained models is given below, where:

wF1 is the weighted F1
mF1 is the micro F1
MF1 is the macro F1
acc is the accuracy
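
For readers unfamiliar with the averaging schemes, the sketch below computes the four metrics for a single-label classification task in plain Python. It illustrates the standard definitions and is not the evaluation code used by this package. The function name and the toy labels are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_metrics(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute weighted, micro and macro F1 plus accuracy."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_label = {lb: f1(tp[lb], fp[lb], fn[lb]) for lb in labels}
    support = Counter(gold)
    n = len(gold)
    return {
        # per-label F1 averaged by the number of gold instances per label
        'wF1': sum(per_label[lb] * support[lb] for lb in labels) / n,
        # F1 over the pooled counts of all labels
        'mF1': f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # unweighted mean of the per-label F1 scores
        'MF1': sum(per_label.values()) / len(labels),
        'acc': sum(tp.values()) / n,
    }


# toy labels for illustration only
gold = ['a', 'a', 'a', 'b', 'b', 'c']
pred = ['a', 'a', 'b', 'b', 'b', 'c']
metrics = f1_metrics(gold, pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

When every item has exactly one gold and one predicted label, micro F1 equals accuracy, which is consistent with the identical mF1 and acc columns in the tables below.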

Fundamental API changes have necessitated subsequent versions of the model. Each version of this package is tied to a model version. While minor changes in each version might introduce language parsing differences such as sentence chunking, the resulting differences in metrics are most likely statistically insignificant.

Version 0.0.2

Name	Type	Id	wF1	mF1	MF1	acc
`BiLSTM-CRF_tok (fastText)`	Section	bilstm-crf-tok-fasttext-section-type	0.918	0.925	0.797	0.925
`BiLSTM-CRF_tok (GloVE 300D)`	Section	bilstm-crf-tok-glove-300d-section-type	0.917	0.922	0.809	0.922
`BiLSTM-CRF_tok (fastText)`	Header	bilstm-crf-tok-fasttext-header	0.996	0.996	0.959	0.996
`BiLSTM-CRF_tok (GloVE 300D)`	Header	bilstm-crf-tok-glove-300d-header	0.996	0.996	0.962	0.996

Version 0.0.3

Name	Type	Id	wF1	mF1	MF1	acc
`BiLSTM-CRF_tok (fastText)`	Section	bilstm-crf-tok-fasttext-section-type	0.911	0.917	0.792	0.917
`BiLSTM-CRF_tok (GloVE 300D)`	Section	bilstm-crf-tok-glove-300d-section-type	0.929	0.933	0.810	0.933
`BiLSTM-CRF_tok (fastText)`	Header	bilstm-crf-tok-fasttext-header	0.996	0.996	0.965	0.996
`BiLSTM-CRF_tok (GloVE 300D)`	Header	bilstm-crf-tok-glove-300d-header	0.996	0.996	0.962	0.996

Citation

If you use this project in your research please use the following BibTeX entry:

@inproceedings{landes-etal-2022-new,
    title = "A New Public Corpus for Clinical Section Identification: {M}ed{S}ec{I}d",
    author = "Landes, Paul  and
      Patel, Kunal  and
      Huang, Sean S.  and
      Webb, Adam  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.326",
    pages = "3709--3721"
}

Also please cite the Zensols Framework:

@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}

Docker

A docker image is now available as well.

To use the docker image, do the following:

Create (or obtain) the Postgres docker image.
Clone this repository: git clone --recurse-submodules https://github.com/plandes/mimicsid
Set the working directory to the repo: cd mimicsid
Copy the configuration from the installed mimicdb image configuration: make -C docker/mimicdb SRC_DIR=<cloned mimicdb directory> cpconfig
Start the container: make -C docker/app up
Test sectioning a document: make -C docker/app testdumpsec
Log in to the container: make -C docker/app devlogin
Output a note to a temporary file: mimic note 1118471 > note.txt
Predict the sections on the note: mimicsid predict note.txt
Look at the section predictions: cat preds/note-pred.txt

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input are welcome.

License

MIT License
