Document section detector using spaCy for clinical NLP
Project description
Clinical Sectionizer
This package offers a component for tagging clinical section titles in docs. There are two different flavors of the sectionizer:
Sectionizer: A spaCy component which is run on a Doc object and adds attributes to spaCy objects. It can be added to an NLP pipeline and executed as part of nlp(text).
TextSectionizer: A stand-alone object, independent of spaCy, which takes a text and returns a list of tuples, where each tuple corresponds to a section in the text.
The sectionizer takes a list of patterns for section titles and searches for matches in a doc. When a section is found, it generates three outputs:
section_title: The normalized name of a section, a string
section_header: The span of the doc containing the header, a Span
section_span: The entire span of the doc containing the section, a Span
When using the spaCy Sectionizer, calling sectionizer(doc) adds the following extensions to spaCy objects (accessed through the ._ namespace, for example doc._.sections):
Doc.sections: A list of 3-tuples of (name, header, section)
Token.section_span: The span of the entire section which the token occurs in
Token.section_header: The span of the section header of the section a token occurs in
Token.section_title: The name of the section header defined by a pattern
Span attributes section_span, section_header, and section_title, which correspond to those of the first token in the span
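To make the token-level attributes concrete, here is a minimal, stdlib-only sketch of how a token position could be mapped back to its enclosing section. It does not use spaCy or this package and is not the package's implementation; the Section tuple and the section_for_token helper are hypothetical names chosen for illustration, and sections are modelled as half-open ranges of token indices.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Section(NamedTuple):
    """Hypothetical stand-in for one detected section."""
    title: str       # normalized section name, e.g. "family_history"
    start: int       # token index where the header (and section) begins
    header_end: int  # token index one past the end of the header
    end: int         # token index one past the end of the section


def section_for_token(sections: List[Section], i: int) -> Optional[Section]:
    """Return the section containing token index i, or None if there is none.

    Assumes sections are sorted by start and do not overlap.
    """
    starts = [s.start for s in sections]
    pos = bisect_right(starts, i) - 1  # last section starting at or before i
    if pos >= 0 and sections[pos].start <= i < sections[pos].end:
        return sections[pos]
    return None


sections = [
    Section("family_history", start=0, header_end=3, end=5),
    Section("past_medical_history", start=5, header_end=9, end=11),
]
found = section_for_token(sections, 4)
print(found.title if found else None)
```

In this model, a span-level attribute would simply be the lookup result for the span's first token index.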
When using TextSectionizer, calling sectionizer(text) returns a list of 3-tuples which correspond to the outputs described above, but each as text rather than spaCy objects: (section_title, section_header, section_text)
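The shape of that return value can be illustrated with a small regex-based sketch. This is not how TextSectionizer is implemented; split_sections and its (title, regex) pattern format are assumptions made for illustration, and the sketch does not handle overlapping matches.

```python
import re
from typing import List, Optional, Tuple

SectionTuple = Tuple[Optional[str], Optional[str], str]


def split_sections(text: str, patterns: List[Tuple[str, str]]) -> List[SectionTuple]:
    """Split text into (section_title, section_header, section_text) tuples.

    patterns is a list of (section_title, regex) pairs (hypothetical format).
    """
    # Collect every header match as (start, end, title), in document order.
    matches = []
    for title, regex in patterns:
        for m in re.finditer(regex, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), title))
    matches.sort()

    sections: List[SectionTuple] = []
    # Text before the first header gets no title and no header in this sketch.
    first = matches[0][0] if matches else len(text)
    if text[:first].strip():
        sections.append((None, None, text[:first]))
    # Each section runs from its header to the start of the next header.
    for k, (start, end, title) in enumerate(matches):
        next_start = matches[k + 1][0] if k + 1 < len(matches) else len(text)
        sections.append((title, text[start:end], text[start:next_start]))
    return sections


text = "Family History:\nDiabetes\nPast Medical History:\nPneumonia\n"
patterns = [
    ("family_history", r"family history:"),
    ("past_medical_history", r"(past )?medical history:"),
]
for title, header, body in split_sections(text, patterns):
    print(title, repr(header), repr(body))
```

As in the description above, the section text in each tuple includes the header itself.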
Example
See notebooks/ for more detailed examples.
>>> text = """Family History:
Diabetes
Past Medical History:
Pneumonia
Assessment and Plan:
Atrial fibrillation. There is no evidence of pneumonia.
"""
>>> import spacy
>>> nlp = spacy.load(...) # Load a model which will match clinical concepts
>>> from clinical_sectionizer import Sectionizer
>>> sectionizer = Sectionizer(nlp)  # Create the component; it is added to the pipeline below
>>> section_patterns = [
...     {"section_title": "family_history", "pattern": "Family History:"},
...     {"section_title": "past_medical_history",
...      "pattern": [
...          {"LOWER": "past", "OP": "?"},
...          {"LOWER": "medical"},
...          {"LOWER": "history"},
...          {"LOWER": ":"},
...      ]
...     },
...     {"section_title": "assessment_and_plan", "pattern": "Assessment and Plan:"},
... ]
>>> sectionizer.add(section_patterns)
>>> nlp.add_pipe(sectionizer)
>>> doc = nlp(text)
>>> print(doc.ents)
(Diabetes, Pneumonia, Atrial fibrillation, pneumonia)
>>> for (section_name, section_header, section) in doc._.sections:
...     print(section_name, section_header, section, sep="\n")
family_history
Family History:
Family History:
Diabetes
past_medical_history
Past Medical History:
Past Medical History:
Pneumonia
assessment_and_plan
Assessment and Plan:
Assessment and Plan:
Atrial fibrillation. There is no evidence of pneumonia.
>>> for ent in doc.ents:
...     print(ent, ent._.section_title)
Diabetes family_history
Pneumonia past_medical_history
Atrial fibrillation assessment_and_plan
pneumonia assessment_and_plan
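A common follow-up is to collect entities per section. The sketch below relies only on the text attribute and the _.section_title extension shown above; group_by_section and fake_ent are hypothetical helpers, and the stand-in objects exist only so the snippet runs without spaCy or a loaded model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_section(ents):
    """Map each section title to the texts of the entities found in it."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent._.section_title].append(ent.text)
    return dict(grouped)


def fake_ent(text, title):
    """Build a stand-in exposing only `text` and `_.section_title`."""
    return SimpleNamespace(text=text, _=SimpleNamespace(section_title=title))


ents = [
    fake_ent("Diabetes", "family_history"),
    fake_ent("Pneumonia", "past_medical_history"),
    fake_ent("Atrial fibrillation", "assessment_and_plan"),
    fake_ent("pneumonia", "assessment_and_plan"),
]
print(group_by_section(ents))
```

With a processed doc from the example above, the same helper would be called as group_by_section(doc.ents).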
With cycontext, you can also visualize section headers, along with any extracted entities and optionally cycontext modifiers, in an NER-style display.
from cycontext.viz import visualize_ent
visualize_ent(doc, sections=True, context=False)