Performant and production-ready NLP pipelines for clinical text written in Dutch

clinlp

:hospital: clinical + :netherlands: nl + :clipboard: NLP = :sparkles: clinlp
:star: Performant and production-ready NLP pipelines for clinical text written in Dutch
:rocket: Open source, created and maintained by the Dutch Clinical NLP community
:triangular_ruler: Useful out of the box, but customization highly recommended

Read the principles and goals further down :arrow_down:

Contact and contributing

clinlp is still very much being shaped, so if you are enthusiastic about using or contributing to clinlp, please don't hesitate to get in touch (email | issue). We would be very happy to discuss your ideas and needs, whether from the perspective of an (end) user, engineer or clinician, and to formulate a roadmap with next steps together.

Getting started

Installation

pip install clinlp

Example

import clinlp
import spacy

nlp = spacy.blank("clinlp")

# Normalization
nlp.add_pipe('clinlp_normalizer')

# Sentences
nlp.add_pipe('clinlp_sentencizer')

# Entities
ruler = nlp.add_pipe('entity_ruler', config={'phrase_matcher_attr': "NORM"})

terms = {
    'covid_19_symptomen': [
        'verkoudheid', 'neusverkoudheid', 'loopneus', 'niezen', 'vermoeidheid',
        'keelpijn', 'hoesten', 'benauwdheid', 'kortademigheid', 'verhoging', 
        'koorts', 'verlies van reuk', 'verlies van smaak'
    ]
}

# Use a separate loop variable so the `terms` dict is not overwritten
for term_description, term_list in terms.items():
    ruler.add_patterns([{'label': term_description, 'pattern': term} for term in term_list])

# Qualifiers
nlp.add_pipe('clinlp_context_algorithm', config={'phrase_matcher_attr': 'NORM'})

text = (
    "Patiente bij mij gezien op spreekuur, omdat zij vorige maand verlies van "
    "reuk na covid infectie aangaf. Zij had geen last meer van kortademigheid, "
    "wel was er nog sprake van hoesten, geen afname vermoeidheid."
)


doc = nlp(text)

Find information in the doc object:

from spacy import displacy

displacy.render(doc, style='ent')

With relevant qualifiers:

for ent in doc.ents:
  print(ent.start, ent.end, ent, ent._.qualifiers)

11 14 verlies van reuk {'Temporality.HISTORICAL'}
25 26 kortademigheid {'Negation.NEGATED'}
33 34 hoesten {}
37 38 vermoeidheid {}

Documentation

Introduction

clinlp is built on top of spaCy, a widely used library for Natural Language Processing. Before getting started with clinlp, it may be useful to read spaCy 101: Everything you need to know (~10 mins). The main thing to know is that a spaCy pipeline consists of a tokenizer, which breaks a text up into small pieces (tokens, i.e. roughly words), and various components that further process the text.

Currently, clinlp offers the following components, tailored to Dutch Clinical text, further discussed below:

Tokenizer
Normalizer
Sentence splitter
Entity matcher (built into spaCy)
Qualifier detection (=context)
- Context Algorithm
- Transformer-based negation detection

Tokenizer

The clinlp tokenizer is built into the blank model:

nlp = spacy.blank('clinlp')

It employs some custom rule-based logic, including:

Clinical text-specific logic for splitting punctuation, units, dosages (e.g. 20mg/dag :arrow_right: 20 mg / dag)
Custom lists of abbreviations, units (e.g. pt., zn., mmHg)
Custom tokenizing rules (e.g. xdd :arrow_right: x dd)
Treating DEDUCE tags as a single token (e.g. [DATUM-1]).
- De-identification is not built into clinlp and should be done as a preprocessing step.
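
To make the dosage example above concrete, here is a minimal sketch using only the Python standard library. It is not the clinlp tokenizer and does not use its rules; the function name `split_dosage` and both regular expressions are assumptions made purely for illustration.

```python
import re


def split_dosage(text: str) -> list[str]:
    """Illustrative only: split a number from an attached unit and isolate slashes."""
    # Insert a space between a digit and a directly following letter: 20mg -> 20 mg
    text = re.sub(r"(\d)([^\W\d_])", r"\1 \2", text)
    # Surround every slash with single spaces: mg/dag -> mg / dag
    text = re.sub(r"\s*/\s*", " / ", text)
    return text.split()


print(split_dosage("20mg/dag"))
```

The real tokenizer handles many more cases (abbreviations, DEDUCE tags, custom rules), so this sketch only shows the kind of splitting involved.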

Normalizer

The normalizer sets the token.norm attribute, which can be used by further components (entity recognition, qualification) for matching. It currently has two options (enabled by default):

Lowercasing
Removing diacritics, where possible. For instance, it will map ë -> e, but keeps most other non-ASCII characters intact (e.g. µ, ²).

Note that this component only has an effect when successor components are explicitly configured to match on the token.norm attribute.
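
The two normalization steps can be approximated with the standard library, as in the sketch below. This is an assumption about one way to get the described behaviour, not clinlp's own implementation; the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Illustrative only: lowercase the text and drop combining diacritical marks."""
    # NFD splits 'ë' into 'e' plus a combining diaeresis. Characters without a
    # canonical decomposition, such as 'µ' and '²', are left untouched.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (Unicode category Mn) and recompose the rest.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(normalize("Patiënte kreeg 10 µg/m²"))
```

Using canonical (NFD) rather than compatibility (NFKD) decomposition is what keeps characters such as µ and ² intact in this sketch.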

Sentence splitter

The sentence splitter can be added as follows:

nlp.add_pipe('clinlp_sentencizer')

It is designed to detect sentence boundaries in clinical text, and splits whenever a character that marks a sentence ending is matched (e.g. newline, period, question mark). It also correctly detects items in an enumeration (e.g. lines starting with - or *).
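
As a rough illustration of the boundary characters mentioned above, the following standard-library sketch splits on sentence-final punctuation and on newlines. It is a simplification and not how clinlp_sentencizer is implemented; `split_sentences` is a hypothetical name, and the sketch ignores abbreviations such as "pt.".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Illustrative only: split after ., ? or ! and at every newline."""
    # A boundary is whitespace preceded by sentence-final punctuation,
    # or one or more newlines (which also separates enumeration items).
    parts = re.split(r"(?<=[.?!])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


note = "Geen koorts. Wel hoesten?\n- paracetamol\n- rust"
for sentence in split_sentences(note):
    print(sentence)
```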

Entity matcher

Currently, the spaCy built-in PhraseMatcher and Matcher can be used for finding (named) entities in text. The former accepts only literal phrases, which are matched against the tokenized text, while the latter also accepts spaCy token patterns. Neither is tailored to the clinical domain, but both are nevertheless useful when a reasonably coherent list of relevant patterns can be generated or obtained.

For instance, a matcher that helps recognize COVID-19 symptoms:

ruler = nlp.add_pipe('entity_ruler', config={'phrase_matcher_attr': "NORM"})

terms = {
    'covid_19_symptomen': [
        'verkouden', 'neusverkouden', 'loopneus', 'niezen', 
        'keelpijn', 'hoesten', 'benauwd', 'kortademig', 'verhoging', 
        'koorts', 'verlies van reuk', 'verlies van smaak'
    ]
}

# Use a separate loop variable so the `terms` dict is not overwritten
for term_description, term_list in terms.items():
    ruler.add_patterns([{'label': term_description, 'pattern': term} for term in term_list])

For more info, it's useful to check out the spaCy documentation on rule-based matching (Matcher, PhraseMatcher and EntityRuler).

Note that the DependencyMatcher cannot be used, and part-of-speech tags are not available either, because no good models for determining this information in clinical text exist (yet).
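
The difference between literal phrases and token patterns can be seen with plain spaCy, as in the sketch below. It assumes spaCy 3.x is installed and uses the generic blank Dutch pipeline (`spacy.blank("nl")`) rather than the clinlp one, so tokenization may differ from clinlp; the labels SYMPTOOM and DUUR and the example sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Generic blank Dutch pipeline; this is an assumption, not the clinlp tokenizer.
nlp = spacy.blank("nl")
doc = nlp("Patient heeft koorts en verlies van reuk, 2 dagen hoesten.")

# PhraseMatcher: literal phrases, compared here on the lowercased token text.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrases = ["koorts", "verlies van reuk"]
phrase_matcher.add("SYMPTOOM", [nlp.make_doc(phrase) for phrase in phrases])
phrase_hits = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

# Matcher: token patterns, here "a number followed by a unit of time".
matcher = Matcher(nlp.vocab)
matcher.add("DUUR", [[{"IS_DIGIT": True}, {"LOWER": {"IN": ["dag", "dagen", "week", "weken"]}}]])
pattern_hits = [doc[start:end].text for _, start, end in matcher(doc)]

print(phrase_hits)
print(pattern_hits)
```

Neither matcher needs part-of-speech tags or a dependency parse, which is why both remain usable on clinical text.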

Qualifier detection

After finding entities, it's often useful to qualify these entities, e.g.: are they negated or affirmed, historical or current? clinlp currently implements the rule-based Context algorithm, and a transformer-based negation detector.

Context Algorithm

The rule-based Context Algorithm is fairly accurate, transparent and fast. A set of rules that checks for negation, temporality, plausibility and experiencer is loaded by default:

nlp.add_pipe('clinlp_context_algorithm', config={'phrase_matcher_attr': 'NORM'})

A custom set of rules, including different types of qualifiers, can easily be defined. See clinlp/resources/psynlp_context_rules.json for an example, and load it as follows:

cm = nlp.add_pipe('clinlp_context_algorithm', config={'rules': '/path/to/my_own_ruleset.json'})
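
To give an intuition for how a rule-based qualifier works, the sketch below implements a heavily simplified version of the idea: a trigger word opens a scope to its right, and a termination word closes it again. It is not clinlp's implementation and does not use its rule format; the trigger lists and the function `negated_entities` are hypothetical and far smaller than a real rule set.

```python
# Hypothetical, minimal trigger lists for illustration only.
NEGATION_TRIGGERS = {"geen", "niet"}
TERMINATION_TRIGGERS = {"wel", "maar"}


def negated_entities(tokens: list[str], entities: set[str]) -> set[str]:
    """Return the entities that fall inside the scope of a preceding negation trigger."""
    negated = set()
    in_scope = False
    for token in tokens:
        word = token.lower().strip(",.")
        if word in NEGATION_TRIGGERS:
            in_scope = True  # scope opens to the right of the trigger
        elif word in TERMINATION_TRIGGERS:
            in_scope = False  # scope is closed by a termination trigger
        elif in_scope and word in entities:
            negated.add(word)
    return negated


sentence = "Zij had geen last meer van kortademigheid, wel was er nog sprake van hoesten"
result = negated_entities(sentence.split(), {"kortademigheid", "hoesten"})
print(result)
```

A real rule set also distinguishes trigger direction, multi-word triggers and other qualifier types such as temporality, which this sketch leaves out.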

Transformer-based negation detection

clinlp also includes a wrapper around the transformer-based negation detector described in van Es et al., 2022. The underlying transformer can be found on Hugging Face. It is reported to be more accurate than the rule-based version (see the paper for details), at the cost of reduced transparency and additional computation.

First, install the additional dependencies:

pip install "clinlp[transformers]"

Then add it using:

tn = nlp.add_pipe('clinlp_negation_transformer')

Some configuration options, like the number of tokens to consider, can be specified in the config argument.

Where to go from here

We hope to extend clinlp with new functionality and more complete documentation in the near future. In the meantime, if any questions or problems arise, we recommend:

Checking the source code
Getting in touch (email | issue)

Principles and goals

Functional:

Provides NLP pipelines optimized for Dutch clinical text
- Performant and production-ready
- Useful out-of-the-box, but highly configurable
A single place to visit for your Dutch clinical NLP needs
(Re-)uses existing components where possible, implements new components where needed
Not intended for annotating, training, and analysis — already covered by existing packages

Development:

Free and open source
Targeted towards the technical user
Curated and maintained by the Dutch Clinical NLP community
Built using the spaCy framework (>3.0.0)
- Therefore non-destructive
Work towards some level of standardization of components (abstraction, protocols)
Follows industry best practices (system design, code, documentation, testing, CI/CD)

Overarching goals:

Improve the quality of Dutch Clinical NLP pipelines
Enable easier (re)use/valorization of efforts
Help mature the field of Dutch Clinical NLP
Help develop the Dutch Clinical NLP community

Release history

Version	Release date
0.7.0	May 16, 2024
0.6.6	Apr 24, 2024
0.6.5	Apr 24, 2024
0.6.4	Feb 13, 2024
0.6.3	Jan 18, 2024
0.6.2	Oct 6, 2023
0.6.1	Oct 6, 2023
0.6.0	Oct 3, 2023
0.5.3	Oct 2, 2023
0.5.2	Sep 27, 2023
0.5.1	Sep 27, 2023
0.5.0	Aug 17, 2023
0.4.0	Aug 5, 2023
0.3.1 (this version)	Jun 30, 2023
0.3.0	Jun 30, 2023
0.2.0	Jun 7, 2023
0.1.1	May 23, 2023
0.1.0	May 23, 2023
0.0.1	Dec 16, 2022

Download files

Source Distribution

clinlp-0.3.1.tar.gz (32.4 kB)

Uploaded Jun 30, 2023 Source

Built Distribution

clinlp-0.3.1-py3-none-any.whl (31.7 kB)

Uploaded Jun 30, 2023 Python 3

Hashes for clinlp-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`ddd61e68813b4b59623f1d341cbd84be78f883d61c823e799d6286f60fd80ecb`
MD5	`a76e6baea50cf1a4ee9d76effba42f22`
BLAKE2b-256	`852b39cd4fd47decee2d710846cbc7c6cb2c9d8905a7cdd2ba32862551cacb08`

Hashes for clinlp-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7f51870a0185a89470c2189b423d8a7addcbe097a5212d0b6294216cf623b401`
MD5	`6e3ae9480623e7072909698cc1a8df82`
BLAKE2b-256	`4ad19d00ac7d22b4d90868fdf23d4b23552907769a24fec22d0ee9a53b6e70c5`