MIMIC III Corpus Parsing
A utility library for parsing the MIMIC-III corpus. It uses spaCy and extends zensols.mednlp to parse the MIMIC-III medical note dataset. Features include:
- Creates both natural language and medical features from medical notes. The latter is generated using linked entity concepts parsed with MedCAT via zensols.mednlp.
- Modifies the spaCy tokenizer to chunk masked tokens. For example, the token sequence [, **, First, Name, **, ] becomes the single token [**First Name**] (see the sketch after this list).
- Provides a clean, Pythonic, object oriented representation of MIMIC-III admissions and medical notes.
- Interfaces MIMIC-III data as a relational database (either PostgreSQL or SQLite).
- Paragraph chunking using the most common syntax/physician templates provided in the MIMIC-III dataset.
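The masked-token chunking above can be illustrated with a short, standalone sketch in plain spaCy. This is not the library's actual tokenizer modification (that lives in the zensols.mednlp/zensols.mimic pipeline); it only shows the idea of merging each [** ... **] masked span into a single token:

import re
import spacy

# plain spaCy pipeline used only for illustration
nlp = spacy.blank('en')
text = 'Seen by [**First Name**] [**Last Name**] on [**2151-7-16**].'
doc = nlp(text)

# merge every [** ... **] masked span into one token
with doc.retokenize() as retokenizer:
    for match in re.finditer(r'\[\*\*.*?\*\*\]', doc.text):
        span = doc.char_span(match.start(), match.end(), alignment_mode='expand')
        if span is not None:
            retokenizer.merge(span)

print([t.text for t in doc])
# masked spans now print as single tokens, e.g. '[**First Name**]'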
Documentation
See the full documentation. The API reference is also available.
Obtaining
The easiest way to install the command line program is via the pip
installer:
pip3 install zensols.mimic
Binaries are also available on pypi.
Installation
- Install the package:
pip3 install zensols.mimic
- Install the database (either PostgreSQL or SQLite).
Configuration
After a database is installed, it must be configured in a new ~/.mimicrc file that you create. This INI formatted file also specifies where to cache data:
[default]
# the directory where cached data is stored
data_dir = ~/directory/to/cached/data
If this file is not created at ~/.mimicrc, a configuration file must be specified with the --config
option.
SQLite
SQLite is the default database used for MIMIC-III access, but it is slower and not as well tested as the PostgreSQL driver. If you need database access, follow the SQLite instructions to create the SQLite database file from the MIMIC-III corpus.
Once the file is created, add the following configuration to the file given with --config (or to ~/.mimicrc):
[mimic_sqlite_conn_manager]
db_file = path: <some directory>/mimic3.sqlite3
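As a quick sanity check of the generated file, the tables it contains can be listed with Python's standard sqlite3 module (the path is the same placeholder used for db_file above):

import sqlite3

# use the same path configured for db_file above (placeholder here)
conn = sqlite3.connect('<some directory>/mimic3.sqlite3')
try:
    rows = conn.execute(
        "select name from sqlite_master where type = 'table' order by name")
    for (table_name,) in rows:
        print(table_name)
finally:
    conn.close()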
PostgreSQL
PostgreSQL is the preferred way to access MIMIC-III for this API. The MIMIC-III
database can be loaded by following the PostgreSQL instructions, or consider
the PostgreSQL Docker image. Then configure the database by adding the
following to ~/.mimicrc
:
[mimic_default]
resources_dir = resource(zensols.mimic): resources
sql_resources = ${resources_dir}/postgres
conn_manager = mimic_postgres_conn_manager
[mimic_db]
database = <needs a value>
host = <needs a value>
port = <needs a value>
user = <needs a value>
password = <needs a value>
The Python PostgreSQL client package is also needed (it is not needed for SQLite installs); it can be installed with:
pip3 install zensols.dbpg
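Before running the API, the connection parameters can be checked with a minimal sketch that uses psycopg2 directly (installed separately, e.g. with pip3 install psycopg2-binary). This is only an independent sanity check, not part of the zensols.mimic API, and the values are the same placeholders as the [mimic_db] section above:

import psycopg2

# placeholders mirror the [mimic_db] section above
conn = psycopg2.connect(
    dbname='<needs a value>',
    host='<needs a value>',
    port='<needs a value>',
    user='<needs a value>',
    password='<needs a value>')
try:
    with conn.cursor() as cur:
        # a trivial query proves the credentials and network path work
        cur.execute('select version()')
        print(cur.fetchone()[0])
finally:
    conn.close()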
Usage
The Corpus class is the data access object used to read and parse the corpus:
# get the MIMIC-III corpus data access object
>>> from zensols.mimic import ApplicationFactory
>>> corpus = ApplicationFactory.get_corpus()
# get an admission by hadm_id
>>> adm = corpus.hospital_adm_stash['165315']
# get the first discharge note (some admissions have addenda)
>>> from zensols.mimic.regexnote import DischargeSummaryNote
>>> ds = adm.notes_by_category[DischargeSummaryNote.CATEGORY][0]
# dump the note section-by-section in a human readable format
>>> ds.write()
row_id: 12144
category: Discharge summary
description: Report
annotator: regular_expression
----------------------0:chief-complaint (CHIEF COMPLAINT)-----------------------
Unresponsiveness
-----------1:history-of-present-illness (HISTORY OF PRESENT ILLNESS)------------
The patient is a ...
# get features of the note useful in ML models as a Pandas dataframe
>>> df = ds.feature_dataframe
# get only medical features (CUI, entity, NER and POS tag) for the HPI section
>>> df[(df['section'] == 'history-of-present-illness') & (df['cui_'] != '-<N>-')]['norm cui_ detected_name_ ent_ tag_'.split()]
norm cui_ detected_name_ ent_ tag_
15 history C0455527 history~of~hypertension concept NN
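Because the feature dataframe is a regular Pandas DataFrame, standard Pandas operations apply; for example, a rough token count per note section (the section and norm columns are taken from the dataframe shown above):

# count feature tokens per section (plain Pandas, not a zensols.mimic API)
>>> df.groupby('section')['norm'].count().sort_values(ascending=False).head()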
See the application example, which gives a fine-grained way of configuring the API.
Medical Note Segmentation
This package uses regular expressions to segment notes. However, the zensols.mimicsid package uses annotations and a model trained by clinical informatics physicians. Using that package gives enhanced segmentation without any API changes.
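Assuming zensols.mimicsid is published on PyPI under that name, switching to the model based segmentation should only require installing it alongside this package:

pip3 install zensols.mimicsid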
Citation
If you use this project in your research please use the following BibTeX entry:
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
Changelog
An extensive changelog is available here.
Community
Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback, and any other input are welcome.
License
Copyright (c) 2022 - 2024 Paul Landes