clinlp

:hospital: clinical + :netherlands: nl + :clipboard: NLP = :sparkles: clinlp

- :star: Performant and production-ready NLP pipelines for clinical text written in Dutch
- :rocket: Open source, created and maintained by the Dutch Clinical NLP community
- :triangular_ruler: Useful out of the box, but customization highly recommended
Read the principles and goals further down :arrow_down:
Contact and contributing
clinlp is very much still being shaped, so if you are enthusiastic about using or contributing to clinlp, please don't hesitate to get in touch (email | issue). We would be very happy to discuss your ideas and needs, whether it's from the perspective of an (end) user, engineer, or clinician, and to formulate a roadmap with next steps together.
Getting started
Installation
```
pip install clinlp
```
Example
```python
import clinlp
import spacy

nlp = spacy.blank("clinlp")

# Normalization
nlp.add_pipe("clinlp_normalizer")

# Sentences
nlp.add_pipe("clinlp_sentencizer")

# Entities
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "NORM"})

terms = {
    "covid_19_symptomen": [
        "verkoudheid", "neusverkoudheid", "loopneus", "niezen", "vermoeidheid",
        "keelpijn", "hoesten", "benauwdheid", "kortademigheid", "verhoging",
        "koorts", "verlies van reuk", "verlies van smaak",
    ]
}

for term_description, term_list in terms.items():
    ruler.add_patterns([{"label": term_description, "pattern": term} for term in term_list])

# Qualifiers
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})

text = (
    "Patiente bij mij gezien op spreekuur, omdat zij vorige maand verlies van "
    "reuk na covid infectie aangaf. Zij had geen last meer van kortademigheid, "
    "wel was er nog sprake van hoesten, geen afname vermoeidheid."
)

doc = nlp(text)
```
Find information in the doc object:
```python
from spacy import displacy

displacy.render(doc, style="ent")
```
With relevant qualifiers:
```python
for ent in doc.ents:
    print(ent.start, ent.end, ent, ent._.qualifiers_str)
```
```
11 14 verlies van reuk {'Temporality.HISTORICAL'}
25 26 kortademigheid {'Negation.NEGATED'}
33 34 hoesten {}
37 38 vermoeidheid {}
```
Documentation
Introduction
clinlp is built on top of spaCy, a widely used library for Natural Language Processing. Before getting started with clinlp, it may be useful to read spaCy 101: Everything you need to know (~10 mins). The main things to know are that spaCy consists of a tokenizer (which breaks a text up into small pieces, i.e. words) and various components that further process the text.
Currently, clinlp offers the following components, tailored to Dutch clinical text, discussed further below:

- Tokenizer
- Normalizer
- Sentence splitter
- Entity matcher (spaCy built-in)
- Qualifier detection (context)
Tokenizer
The clinlp tokenizer is built into the blank model:

```python
nlp = spacy.blank("clinlp")
```
It employs some custom rule-based logic, including:

- Clinical text-specific logic for splitting punctuation, units and dosages (e.g. `20mg/dag` :arrow_right: `20` `mg` `/` `dag`)
- Custom lists of abbreviations and units (e.g. `pt.`, `zn.`, `mmHg`)
- Custom tokenizing rules (e.g. `xdd` :arrow_right: `x` `dd`)
- Regarding DEDUCE tags as a single token (e.g. `[DATUM-1]`). Note that deidentification is not built into clinlp and should be done as a preprocessing step.
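To illustrate the kind of rule-based splitting described above, here is a standalone sketch; this is not clinlp's actual tokenizer, and the regex is an assumption chosen for illustration only. It separates a dosage expression into number, word and separator tokens:

```python
import re

# Illustrative only: split a dosage like "20mg/dag" into number,
# word and slash tokens, similar in spirit to the clinlp tokenizer rules.
DOSAGE_PATTERN = re.compile(r"\d+|[a-zA-Z]+|/")

def split_dosage(text: str) -> list[str]:
    """Split a dosage expression into separate tokens."""
    return DOSAGE_PATTERN.findall(text)

print(split_dosage("20mg/dag"))  # ['20', 'mg', '/', 'dag']
```

A real tokenizer needs far more than this (abbreviation lists, special cases like `xdd`), which is exactly what the built-in clinlp tokenizer provides.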
Normalizer
The normalizer sets the `token.norm` attribute, which can be used by subsequent components (entity recognition, qualification) for matching. It currently applies two normalizations (both enabled by default):

- Lowercasing
- Removing diacritics, where possible. For instance, it maps `ë` to `e`, but keeps most other non-ASCII characters (e.g. `µ`, `²`) intact.

Note that this component only has an effect when successor components are explicitly configured to match on the `token.norm` attribute.
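The diacritic-removal behavior described above can be approximated with the Python standard library. This is a sketch of the general technique (Unicode decomposition), not clinlp's actual implementation:

```python
import unicodedata

def remove_diacritics(text: str) -> str:
    # Decompose characters (e.g. "ë" -> "e" + combining diaeresis),
    # then drop the combining marks. Characters without a canonical
    # decomposition (e.g. "µ", "²") pass through unchanged.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics("patiënt"))  # patient
print(remove_diacritics("µg"))      # µg (kept intact)
```

Using canonical (NFD) rather than compatibility (NFKD) decomposition is what keeps characters like `µ` and `²` intact.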
Sentence splitter
The sentence splitter can be added as follows:

```python
nlp.add_pipe("clinlp_sentencizer")
```

It is designed to detect sentence boundaries in clinical text whenever a character that demarcates a sentence ending (e.g. a newline, period or question mark) is matched. It also correctly detects items in enumerations (e.g. those starting with `-` or `*`).
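As a rough sketch of the rule-based idea (not clinlp's implementation), sentence boundaries can be found by splitting on sentence-ending characters:

```python
import re

# Naive illustration: break text on characters that typically end a
# sentence in clinical notes (newline, period, question/exclamation mark).
def split_sentences(text: str) -> list[str]:
    parts = re.split(r"[.\n?!]+", text)
    return [part.strip() for part in parts if part.strip()]

print(split_sentences("Geen koorts. Wel hoesten?\n- verlies van reuk"))
```

A real splitter must also avoid splitting on abbreviations like `pt.`, which this naive sketch does not handle; that is why a dedicated component like `clinlp_sentencizer` is needed.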
Entity matcher
Currently, the spaCy built-in `EntityRuler` can be used for finding (named) entities in text. It accepts both literal phrases (single terms or multi-word expressions) and spaCy patterns, which give more control over the specific sequence of tokens to match. The `EntityRuler` is not specifically tailored to the clinical domain, but it is nevertheless useful when a reasonably coherent list of relevant patterns can be generated or obtained. A better or more specific NER module will hopefully be added in the future.
For instance, a matcher that helps recognize COVID-19 symptoms:
```python
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "NORM"})

terms = {
    "covid_19_symptomen": [
        "verkouden", "neusverkouden", "loopneus", "niezen",
        "keelpijn", "hoesten", "benauwd", "kortademig", "verhoging",
        "koorts", "verlies van reuk", "verlies van smaak",
    ]
}

for term_description, term_list in terms.items():
    ruler.add_patterns([{"label": term_description, "pattern": term} for term in term_list])
```
For more information, see the spaCy documentation pages on rule-based matching and the `EntityRuler`.

Note that part-of-speech tags and dependency trees cannot be used in clinlp, as no good models for determining this information for clinical text exist (yet).
Qualifier detection
After finding entities, it's often useful to qualify these entities, e.g.: are they negated or affirmed, historical or current? clinlp currently implements two options: the rule-based Context Algorithm, and a transformer-based negation detector.
Context Algorithm
The rule-based Context Algorithm is fairly accurate, and quite transparent and fast. A set of rules that checks for negation, temporality, plausibility and experiencer is loaded by default:

```python
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})
```
A custom set of rules, including different types of qualifiers, can easily be defined. See `clinlp/resources/psynlp_context_rules.json` for an example, and load it as follows:

```python
cm = nlp.add_pipe("clinlp_context_algorithm", config={"rules": "/path/to/my_own_ruleset.json"})
```
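The Context Algorithm works by matching trigger phrases and applying the corresponding qualifier to entities within their scope. As a rough illustration of that idea only (not clinlp's implementation; the trigger list and function below are made up for this sketch), an entity can be marked negated when a preceding negation trigger occurs earlier in the sentence:

```python
# Hypothetical, heavily simplified ConText-style check: a preceding
# negation trigger (e.g. Dutch "geen" or "niet") negates entities
# that follow it within the same sentence.
PRECEDING_NEGATION_TRIGGERS = {"geen", "niet"}

def is_negated(tokens: list[str], entity_index: int) -> bool:
    """Return True if a negation trigger precedes the entity token."""
    return any(token in PRECEDING_NEGATION_TRIGGERS for token in tokens[:entity_index])

tokens = "zij had geen last van kortademigheid".split()
print(is_negated(tokens, 5))  # True: "geen" precedes "kortademigheid"
```

The real algorithm additionally handles scope termination, pseudo-triggers, and qualifiers beyond negation (temporality, plausibility, experiencer), all driven by the loaded rule set.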
Transformer-based negation detection

clinlp also includes a wrapper around the transformer-based negation detector described in van Es et al., 2022. The underlying transformer can be found on Hugging Face. It is reported to be more accurate than the rule-based version (see the paper for details), at the cost of less transparency and additional computational cost.

First, install the additional dependencies:

```
pip install "clinlp[transformers]"
```

Then add it using:

```python
tn = nlp.add_pipe("clinlp_negation_transformer")
```

Some configuration options, like the number of tokens to consider, can be specified in the `config` argument.
Where to go from here

We hope to extend clinlp with new functionality and more complete documentation in the near future. In the meantime, if any questions or problems arise, we recommend getting in touch (see Contact and contributing above).
Principles and goals
Functional:
- Provides NLP pipelines optimized for Dutch clinical text
- Performant and production-ready
- Useful out-of-the-box, but highly configurable
- A single place to visit for your Dutch clinical NLP needs
- (Re-)uses existing components where possible, implements new components where needed
- Not intended for annotating, training, and analysis — already covered by existing packages
Development:
- Free and open source
- Targeted towards the technical user
- Curated and maintained by the Dutch Clinical NLP community
- Built using the spaCy framework (`>3.0.0`)
  - Therefore non-destructive
- Work towards some level of standardization of components (abstraction, protocols)
- Follows industry best practices (system design, code, documentation, testing, CI/CD)
Overarching goals:
- Improve the quality of Dutch Clinical NLP pipelines
- Enable easier (re)use/valorization of efforts
- Help mature the field of Dutch Clinical NLP
- Help develop the Dutch Clinical NLP community