Document section detector using spaCy for clinical NLP

Project description

Clinical Sectionizer

This package offers a component for tagging clinical section titles in docs. There are two different flavors of the sectionizer:

Sectionizer: A spaCy component which runs on a Doc object and adds attributes to spaCy objects. It can be added to an NLP pipeline and executed as part of nlp(text).
TextSectionizer: A stand-alone object, independent of spaCy, which takes a text and returns a list of tuples, where each tuple corresponds to a section in the text.

The sectionizer takes a list of patterns for section titles and searches for matches in a doc. When a section is found, it generates three outputs:

section_title: The normalized name of a section, a string
section_header: The span of the doc containing the header, a Span
section_span: The entire span of the doc containing the section, a Span

When using the spaCy Sectionizer, calling sectionizer(doc) adds the following extensions to spaCy objects:

Doc.sections: A list of 3-tuples of (name, header, section)
Token.section_span: The span of the entire section which the token occurs in
Token.section_header: The span of the section header of the section a token occurs in
Token.section_title: The name of the section header defined by a pattern
Span attributes section_span, section_header, and section_title, which take the values of the first token in the span
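
The relationship between the document-level section list and the token- and span-level attributes can be sketched without spaCy. This is only an illustration of the lookup described above, not the library's code: the (title, start, end) tuples and both helper functions are hypothetical stand-ins.

```python
# Hypothetical stand-in for Doc.sections: (title, start_token, end_token),
# with end_token exclusive. The real sections hold spaCy Spans instead.
SECTIONS = [
    ("family_history", 0, 4),        # Family History : Diabetes
    ("past_medical_history", 4, 9),  # Past Medical History : Pneumonia
]


def title_for_token(index, sections=SECTIONS):
    """Return the title of the section containing the token, or None."""
    for title, start, end in sections:
        if start <= index < end:
            return title
    return None


def title_for_span(start, end, sections=SECTIONS):
    """A span takes the value of its first token, so `end` is not consulted."""
    return title_for_token(start, sections)


print(title_for_token(3))    # token "Diabetes"
print(title_for_span(3, 6))  # span that crosses into the next section
```

Both calls print family_history: the span starting at token 3 is assigned to the section of its first token even though it extends into the next section.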

When using TextSectionizer, calling sectionizer(text) returns a list of 3-tuples which correspond to the outputs described above, but each as texts rather than spaCy objects: (section_title, section_header, section_text).
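
To make the shape of that return value concrete, here is a minimal, library-free sketch of a text-based sectionizer. It is not the TextSectionizer implementation: the function name split_sections, the (title, regex) pattern format, and the use of regular expressions are all assumptions made for illustration.

```python
import re


def split_sections(text, patterns):
    """Return a list of (section_title, section_header, section_text) tuples.

    `patterns` is a list of (section_title, regex) pairs. Both the name and
    the format are hypothetical and do not mirror the library's API.
    """
    # Collect every header match as (start, end, title), ordered by position.
    matches = []
    for title, regex in patterns:
        for m in re.finditer(regex, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), title))
    matches.sort()

    sections = []
    for i, (start, end, title) in enumerate(matches):
        # A section runs from its header to the next header or the end of text.
        next_start = matches[i + 1][0] if i + 1 < len(matches) else len(text)
        sections.append((title, text[start:end], text[start:next_start]))
    return sections


text = "Family History:\nDiabetes\n\nPast Medical History:\nPneumonia\n"
patterns = [
    ("family_history", r"family history:"),
    ("past_medical_history", r"(past )?medical history:"),
]
sections = split_sections(text, patterns)
for section_title, section_header, section_text in sections:
    print(section_title, repr(section_header), repr(section_text))
```

Each tuple carries plain strings, and the section text includes the header itself, matching the example output shown later for the spaCy component.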

Sectionizing works best when you have robust and complete patterns designed specifically for your data. Each EHR uses different section titles, so you should adjust your patterns accordingly. A default list of patterns is loaded when using patterns="default". You can also set max_scope, which limits the size of a section to a certain number of tokens. This can be useful to prevent sections from running on too far if the following section headers aren't recognized.
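
The effect of max_scope can be pictured with a small sketch over a plain list of tokens. It is an illustration under stated assumptions rather than the library's code: section_bounds and its arguments are hypothetical, and the text above does not say whether header tokens count toward the limit, so this sketch counts max_scope from the end of the header.

```python
def section_bounds(header_starts, header_ends, n_tokens, max_scope=None):
    """Return (start, end) token offsets for each section, end exclusive.

    header_starts[i] and header_ends[i] delimit the i-th section header.
    All names here are hypothetical.
    """
    bounds = []
    for i, (start, header_end) in enumerate(zip(header_starts, header_ends)):
        # By default a section runs until the next header or the end of the doc.
        end = header_starts[i + 1] if i + 1 < len(header_starts) else n_tokens
        # Assumption: max_scope caps the number of tokens after the header.
        if max_scope is not None:
            end = min(end, header_end + max_scope)
        bounds.append((start, end))
    return bounds


# Only the first header is recognized, so its section would swallow the rest.
tokens = ("Family History : Diabetes . "
          "Unrecognized Header : lots of unrelated text").split()

unlimited = section_bounds([0], [3], len(tokens))
capped = section_bounds([0], [3], len(tokens), max_scope=2)
print(tokens[unlimited[0][0]:unlimited[0][1]])
print(tokens[capped[0][0]:capped[0][1]])
```

Without a limit the single recognized section covers all twelve tokens; with max_scope=2 it stops two tokens after the header, before the unrecognized header begins.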

Installation

You can install clinical_sectionizer via pip:

pip install clinical-sectionizer

Or by cloning this repository and running:

python setup.py install

Example

See notebooks/ for more detailed examples.

>>> text = """Family History:
    Diabetes
    
    Past Medical History:
    Pneumonia
    
    Assessment and Plan:
    Atrial fibrillation. There is no evidence of pneumonia.
    """
>>> import spacy
>>> nlp = spacy.load(...) # Load a model which will match clinical concepts

>>> from clinical_sectionizer import Sectionizer
>>> sectionizer = Sectionizer(nlp)  # Create the component; it is added to the pipeline below


>>> section_patterns = [
        {"section_title": "family_history", "pattern": "Family History:"},
        {"section_title": "past_medical_history", 
            "pattern": [
                {"LOWER": "past", "OP": "?"}, 
                {"LOWER": "medical"},
                {"LOWER": "history"}, 
                {"LOWER": ":"},
            ]
            
        },
        {"section_title": "assessment_and_plan", "pattern": "Assessment and Plan:"},
    ]
>>> sectionizer.add(section_patterns)

>>> nlp.add_pipe(sectionizer)
>>> doc = nlp(text)
>>> print(doc.ents)
(Diabetes, Pneumonia, Atrial fibrillation, pneumonia)



>>> for (section_name, section_header, section) in doc._.sections:
        print(section_name, section_header, section, sep="\n")

family_history
Family History:
Family History:
Diabetes

past_medical_history
Past Medical History:
Past Medical History:
Pneumonia

assessment_and_plan
Assessment and Plan:
Assessment and Plan:
Atrial fibrillation. There is no evidence of pneumonia.

>>> for ent in doc.ents:
        print(ent, ent._.section_title)
    
Diabetes family_history
Pneumonia past_medical_history
Atrial fibrillation assessment_and_plan
pneumonia assessment_and_plan

With cycontext, you can also visualize section headers, along with any extracted entities and optionally cycontext modifiers, in an NER-style visualization.

from cycontext.viz import visualize_ent
visualize_ent(doc, sections=True, context=False)

Project details

Release history

1.0.0.1 (Nov 14, 2020)
1.0.0.0 (Oct 24, 2020)
0.1.3 (Jul 4, 2020)
0.1.2 (Jun 27, 2020)
0.1.1 (Jun 19, 2020), this version
0.1.0 (Apr 1, 2020)
0.0.1 (Apr 1, 2020)

Download files

Source Distribution

clinical_sectionizer-0.1.1.tar.gz (15.8 kB)

Uploaded Jun 19, 2020 Source

Hashes for clinical_sectionizer-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`73e0651c62065b4f0e19e4d6a72fa457a7bb00f454fda03a2be5952b387c663c`
MD5	`b4d1f730c93cbb1a0ee5627ca7993c8f`
BLAKE2b-256	`bc74fd443ba3e84db0ecaabc100fc9cf6e08a58d2a16f6a2fd6a7c0271e0c6e6`