MIMIC III Corpus Parsing

A utility library for parsing the MIMIC-III corpus. It uses spaCy and extends zensols.mednlp to parse the MIMIC-III medical note dataset. Features include:

  • Creates both natural language and medical features from medical notes. The latter are generated from linked entity concepts parsed with MedCAT via zensols.mednlp.
  • Modifies the spaCy tokenizer to chunk masked tokens. For example, the tokens [, **, First, Name, **, ] become the single token [**First Name**].
  • Provides a clean, Pythonic, object-oriented representation of MIMIC-III admissions and medical notes.
  • Interfaces MIMIC-III data as a relational database (either PostgreSQL or SQLite).
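
To illustrate the mask chunking feature listed above, here is a minimal sketch in plain Python. It is not the spaCy tokenizer modification this package makes; the names `MASK_RE`, `TOKEN_RE` and `tokenize_keep_masks` are hypothetical and exist only for this example. It assumes masks always have the form `[**...**]`.

```python
import re
from typing import List

# assumption: every MIMIC-III mask looks like [**...**]
MASK_RE = re.compile(r'\[\*\*.*?\*\*\]')
# fallback pattern: a run of word characters or one non-space symbol
TOKEN_RE = re.compile(r'\w+|[^\w\s]')


def tokenize_keep_masks(text: str) -> List[str]:
    """Split ``text`` into tokens, keeping each mask as one token."""
    tokens: List[str] = []
    pos = 0
    for m in MASK_RE.finditer(text):
        # tokenize the unmasked text before this mask normally
        tokens.extend(TOKEN_RE.findall(text[pos:m.start()]))
        # emit the mask itself as a single token
        tokens.append(m.group())
        pos = m.end()
    # tokenize whatever follows the last mask
    tokens.extend(TOKEN_RE.findall(text[pos:]))
    return tokens
```

Without the mask pass, the fallback pattern alone would break a mask into its brackets, asterisks and words, which is the fragmentation the chunking avoids.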

Documentation

See the full documentation. The API reference is also available.

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install --use-deprecated=legacy-resolver zensols.mimic

Binaries are also available on PyPI.

Installation

  1. Install the package: pip3 install zensols.mimic
  2. Install the database (either PostgreSQL or SQLite).

MedCAT Models

The zensols.mednlp dependency uses the default MedCAT model.

PostgreSQL

For PostgreSQL, load MIMIC-III by following the PostgreSQL instructions or consider the PostgreSQL Docker image. The Python PostgreSQL client package is also needed, which can be installed with pip3 install zensols.dbpg.

SQLite Configuration

SQLite can also be used, but it is slower and not as well tested. However, it is faster to set up and can be useful when a database server is not available. I have also created a repository that builds the SQLite database file following the SQLite instructions.

The following additional configuration is also necessary, either in the file given with --config or in ~/.mimicrc:

[import]
sections = list: mimic_sqlite_res_imp

[mimic_sqlite_res_imp]
type = import
config_file = resource(zensols.mednlp): resources/sqlite.conf

[mimic_sqlite_conn_manager]
db_file = path: <some directory>/mimic3.sqlite3
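
Before pointing db_file at a SQLite file, it can help to confirm the file exists and holds the expected tables. The following is a rough sketch using only the standard library; it is not part of this package. The helper names `list_tables` and `missing_tables` are hypothetical, and the table names in the usage note assume the import kept the standard MIMIC-III names.

```python
import sqlite3
from pathlib import Path
from typing import Set


def list_tables(db_file: Path) -> Set[str]:
    """Return the lower-cased names of all tables in a SQLite file."""
    # open read-only so a mistyped path fails instead of creating a new file
    uri = db_file.resolve().as_uri() + '?mode=ro'
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")
        return {name.lower() for (name,) in rows}
    finally:
        conn.close()


def missing_tables(db_file: Path, expected: Set[str]) -> Set[str]:
    """Return the members of ``expected`` that the database lacks."""
    return expected - list_tables(db_file)
```

For example, `missing_tables(Path('mimic3.sqlite3'), {'admissions', 'noteevents'})` returns an empty set when both tables are present. Names are compared in lower case because the casing depends on how the file was built.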

Usage

The Corpus class is the data access object used to read and parse the corpus:

# get the MIMIC-III corpus data access object
>>> from zensols.mimic import ApplicationFactory
>>> corpus = ApplicationFactory.get_corpus()

# get an admission by hadm_id
>>> adm = corpus.hospital_adm_stash['165315']

# get the first discharge note (some admissions have addendums)
>>> from zensols.mimic.regexnote import DischargeSummaryNote
>>> ds = adm.notes_by_category[DischargeSummaryNote.CATEGORY][0]

# dump the note in a human-readable, section-by-section format
>>> ds.write()
row_id: 12144
category: Discharge summary
description: Report
annotator: regular_expression
----------------------0:chief-complaint (CHIEF COMPLAINT)-----------------------
Unresponsiveness
-----------1:history-of-present-illness (HISTORY OF PRESENT ILLNESS)------------
The patient is a ...

# get features of the note useful in ML models as a Pandas dataframe
>>> df = ds.feature_dataframe

# get only medical features (CUI, entity, NER and POS tag) for the HPI section
>>> df[(df['section'] == 'history-of-present-illness') & (df['cui_'] != '-<N>-')]['norm cui_ detected_name_ ent_ tag_'.split()]
             norm      cui_           detected_name_     ent_ tag_
15        history  C0455527  history~of~hypertension  concept   NN

See the application example, which gives a fine-grained way of configuring the API.

Medical Note Segmentation

This package uses regular expressions to segment notes. However, the zensols.mimicsid package uses annotations and a model trained by clinical informatics physicians. Using that package gives this enhanced segmentation without any API changes.
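
As a rough illustration of regular expression based segmentation, the sketch below splits a note on header lines and builds section identifiers such as chief-complaint. The pattern and the names `HEADER_RE` and `segment` are hypothetical and much simpler than whatever this package uses; in particular, any body line shaped like `Word:` would be mistaken for a header.

```python
import re
from typing import List, Tuple

# hypothetical header pattern: a line starting with letters and spaces,
# then a colon, optionally followed by text on the same line
HEADER_RE = re.compile(r'^([A-Za-z][A-Za-z ]*):[ \t]*(.*)$', re.MULTILINE)


def segment(note: str) -> List[Tuple[str, str, str]]:
    """Split ``note`` into ``(section_id, header, body)`` tuples."""
    matches = list(HEADER_RE.finditer(note))
    sections: List[Tuple[str, str, str]] = []
    for i, m in enumerate(matches):
        # a section body runs until the next header or the end of the note
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        header = m.group(1).strip()
        body = note[m.start(2):end].strip()
        # lower-case the header and join its words with hyphens
        section_id = '-'.join(header.lower().split())
        sections.append((section_id, header.upper(), body))
    return sections
```

Given a note whose first line is `Chief Complaint:` followed by `Unresponsiveness`, the first tuple is `('chief-complaint', 'CHIEF COMPLAINT', 'Unresponsiveness')`.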

Citation

If you use this project in your research, please use the following BibTeX entry:

@inproceedings{landes-etal-2023-deepzensols,
    title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
    author = "Landes, Paul  and
      Di Eugenio, Barbara  and
      Caragea, Cornelia",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.16",
    pages = "141--146"
}

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any input are welcome.

License

MIT License

Copyright (c) 2022 - 2023 Paul Landes
