Document section detector using spaCy for clinical NLP
Clinical Sectionizer
This package offers a component for tagging clinical section titles in documents. There are two flavors of the sectionizer:
Sectionizer: A spaCy component which is run on a Doc object and adds attributes to spaCy objects. It can be added to an NLP pipeline and executed as part of nlp(text).
TextSectionizer: A stand-alone object, independent of spaCy, which takes a text and returns a list of tuples, where each tuple corresponds to a section in the text.
The sectionizer takes a list of patterns for section titles and searches for matches in a doc. When a section is found, it generates three outputs:
section_title: The normalized name of a section, a string
section_header: The span of the doc containing the header, a Span
section_span: The entire span of the doc containing the section, a Span
When using the spaCy Sectionizer, calling sectionizer(doc) adds the following extensions to spaCy objects:
Doc.sections: A list of 3-tuples of (name, header, section)
Token.section_span: The span of the entire section which the token occurs in
Token.section_header: The span of the section header of the section a token occurs in
Token.section_title: The name of the section header defined by a pattern
Span.section_span, Span.section_header, and Span.section_title: The corresponding attributes of the first token in the span
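The relationship between token-level and span-level attributes can be illustrated without spaCy. The following is a minimal sketch, not the package's implementation: the token offsets and the helper names section_title_for_token and section_title_for_span are hypothetical, and sections are represented as plain (title, start, end) tuples rather than Span objects.

```python
from bisect import bisect_right

# Hypothetical sections as (title, start_token, end_token), end exclusive,
# sorted by start_token. Real sections would be spaCy Span objects.
sections = [
    ("family_history", 0, 5),
    ("past_medical_history", 5, 11),
    ("assessment_and_plan", 11, 25),
]
starts = [start for _, start, _ in sections]


def section_title_for_token(i):
    """Return the title of the section containing token index i, or None."""
    idx = bisect_right(starts, i) - 1  # last section starting at or before i
    if idx < 0:
        return None
    title, start, end = sections[idx]
    return title if start <= i < end else None


def section_title_for_span(span_start, span_end):
    """A span takes the section attributes of its first token."""
    return section_title_for_token(span_start)


print(section_title_for_token(6))
print(section_title_for_span(4, 7))
```

Here a span covering tokens 4 to 6 straddles two sections, and the lookup reports the section of its first token, mirroring the rule described above.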
When using TextSectionizer, calling sectionizer(text) returns a list of 3-tuples which correspond to the outputs described above, but each as text rather than spaCy objects: (section_title, section_header, section_text).
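To make the shape of that return value concrete, here is a rough stand-alone sketch of the idea using only Python's re module. It is not the TextSectionizer implementation: the function name split_sections, the PATTERNS list, and the choice to emit a (None, None, text) tuple for text before the first header are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for this sketch: (normalized title, header regex).
PATTERNS = [
    ("family_history", r"family history:"),
    ("past_medical_history", r"(?:past )?medical history:"),
    ("assessment_and_plan", r"assessment and plan:"),
]


def split_sections(text, patterns=PATTERNS):
    """Return a list of (section_title, section_header, section_text) tuples."""
    # Collect every header match as (start, end, title), ordered by position.
    matches = []
    for title, regex in patterns:
        for m in re.finditer(regex, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), title))
    matches.sort()

    sections = []
    # Assumption: text before the first header gets no title and no header.
    first_start = matches[0][0] if matches else len(text)
    if first_start > 0:
        sections.append((None, None, text[:first_start]))
    # Each section runs from its header to the start of the next header.
    for i, (start, end, title) in enumerate(matches):
        next_start = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        sections.append((title, text[start:end], text[start:next_start]))
    return sections


text = "Family History:\nDiabetes\nPast Medical History:\nPneumonia\n"
for title, header, section_text in split_sections(text):
    print(title, repr(header), repr(section_text))
```

As in the package's description, the section text includes the header itself, so the second and third elements of each tuple overlap.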
Sectionizing works best when you have robust and complete patterns designed specifically for your data. Each EHR uses different section titles, so you should adjust your patterns accordingly. A default list of patterns is loaded when using patterns="default". You can also set max_scope, which limits the size of a section to a certain number of tokens. This can be useful to prevent sections from running on too far if following section headers aren't recognized.
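The effect of max_scope can be sketched with plain token offsets. This is an illustration only: the function clip_section is hypothetical, and counting the limit from the end of the header (rather than from its start) is an assumption, not something the package documentation states.

```python
def clip_section(header_start, header_end, next_header_start, max_scope=None):
    """Return (start, end) token offsets of a section, end exclusive.

    The section starts at its header and normally ends where the next
    recognized header begins. If max_scope is set, the section is cut off
    at most max_scope tokens after the header (an assumption of this sketch).
    """
    end = next_header_start
    if max_scope is not None:
        end = min(end, header_end + max_scope)
    return header_start, end


# A header at tokens 0-3 whose next recognized header is 500 tokens away.
print(clip_section(0, 3, 500))                 # no limit
print(clip_section(0, 3, 500, max_scope=100))  # limited to 100 tokens
print(clip_section(0, 3, 20, max_scope=100))   # next header comes first
```

When an intermediate header is missed by the patterns, the limit keeps the preceding section from absorbing everything up to the next recognized header.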
Installation
You can install clinical_sectionizer via pip:
pip install clinical-sectionizer
Or by cloning this repository and running:
python setup.py install
Example
See notebooks/ for more detailed examples.
>>> text = """Family History:
Diabetes
Past Medical History:
Pneumonia
Assessment and Plan:
Atrial fibrillation. There is no evidence of pneumonia.
"""
>>> import spacy
>>> nlp = spacy.load(...) # Load a model which will match clinical concepts
>>> from clinical_sectionizer import Sectionizer
>>> sectionizer = Sectionizer(nlp)  # Create the component; it is added to the pipeline below
>>> section_patterns = [
    {"section_title": "family_history", "pattern": "Family History:"},
    {"section_title": "past_medical_history",
     "pattern": [
         {"LOWER": "past", "OP": "?"},
         {"LOWER": "medical"},
         {"LOWER": "history"},
         {"LOWER": ":"},
     ]
    },
    {"section_title": "assessment_and_plan", "pattern": "Assessment and Plan:"},
]
>>> sectionizer.add(section_patterns)
>>> nlp.add_pipe(sectionizer)
>>> doc = nlp(text)
>>> print(doc.ents)
(Diabetes, Pneumonia, Atrial fibrillation, pneumonia)
>>> for (section_name, section_header, section) in doc._.sections:
        print(section_name, section_header, section, sep="\n")
family_history
Family History:
Family History:
Diabetes
past_medical_history
Past Medical History:
Past Medical History:
Pneumonia
assessment_and_plan
Assessment and Plan:
Assessment and Plan:
Atrial fibrillation. There is no evidence of pneumonia.
>>> for ent in doc.ents:
        print(ent, ent._.section_title)
Diabetes family_history
Pneumonia past_medical_history
Atrial fibrillation assessment_and_plan
pneumonia assessment_and_plan
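A common next step is to group extracted entities by the section they occur in. The sketch below uses the entity and section title pairs from the output above as plain tuples so that it runs without spaCy; with a real doc, the pairs could presumably be built from doc.ents and ent._.section_title as in the loop above.

```python
from collections import defaultdict

# (entity text, section title) pairs, copied from the example output.
# With a processed doc these would come from:
#   [(ent.text, ent._.section_title) for ent in doc.ents]
pairs = [
    ("Diabetes", "family_history"),
    ("Pneumonia", "past_medical_history"),
    ("Atrial fibrillation", "assessment_and_plan"),
    ("pneumonia", "assessment_and_plan"),
]

# Map each section title to the entities found in that section.
by_section = defaultdict(list)
for ent_text, title in pairs:
    by_section[title].append(ent_text)

for title, ents in by_section.items():
    print(title, ents)
```

Grouping this way makes it easy to treat, for example, family history mentions differently from findings in the assessment and plan.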
Using cycontext, you can also use a visualizer which shows section headers, along with any extracted entities and optionally cycontext modifiers, in an NER-style visualization.
from cycontext.viz import visualize_ent
visualize_ent(doc, sections=True, context=False)