De-identify free-text medical records

deidentify

A Python library to de-identify medical records with state-of-the-art NLP methods. Pre-trained models for the Dutch language are available.

This repository shares the resources developed in the following paper:

J. Trienes, D. Trieschnigg, C. Seifert, and D. Hiemstra. Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records. In: Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop (HSDM), 2020.

Read more about the work in our paper or blog post.

Quick Start

Installation

Create a new virtual environment with an environment manager of your choice. Then, install deidentify:

pip install deidentify

We use the spaCy tokenizer. For good compatibility with the pre-trained models, we recommend using the same spaCy version that we used to train the de-identification models.

pip install -U "spacy<3" https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.3.0/nl_core_news_sm-2.3.0.tar.gz#egg=nl_core_news_sm==2.3.0
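To confirm which spaCy version ended up in the environment, a small check along the following lines may help. It is only a sketch: the helper name is_below_3 is made up for this example and is not part of deidentify or spaCy.

```python
from importlib import metadata


def is_below_3(version):
    """Return True if a version string such as '2.3.5' has a major version below 3."""
    return int(version.split(".")[0]) < 3


# Look up the installed spaCy distribution, if there is one.
try:
    installed = metadata.version("spacy")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("spaCy is not installed in this environment")
elif is_below_3(installed):
    print(f"spaCy {installed} satisfies the 'spacy<3' requirement")
else:
    print(f"spaCy {installed} is newer than the version used to train the models")
```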

Example Usage

The code below shows how to apply a pre-trained de-identification pipeline to an example document. We provide a list of available models below.

from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory

# Create some text
text = (
    "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: "
    "j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 "
    "oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
)

# Wrap text in document
documents = [
    Document(name='doc_01', text=text)
]

# Select downloaded model
model = 'model_bilstmcrf_ons_fast-v0.2.0'

# Instantiate tokenizer
tokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=("tagger", "ner"))

# Load tagger with a downloaded model file and tokenizer
tagger = FlairTagger(model=model, tokenizer=tokenizer, verbose=False)

# Annotate your documents
annotated_docs = tagger.annotate(documents)

This completes the annotation stage. Let's inspect the entities that the tagger found:

from pprint import pprint

first_doc = annotated_docs[0]
pprint(first_doc.annotations)

This should print the entities of the first document.

[Annotation(text='Jan Jansen', start=39, end=49, tag='Name', doc_id='', ann_id='T0'),
 Annotation(text='J. Jansen', start=62, end=71, tag='Name', doc_id='', ann_id='T1'),
 Annotation(text='j.jnsen@email.com', start=76, end=93, tag='Email', doc_id='', ann_id='T2'),
 Annotation(text='06-12345678', start=98, end=109, tag='Phone_fax', doc_id='', ann_id='T3'),
 Annotation(text='64 jaar', start=114, end=121, tag='Age', doc_id='', ann_id='T4'),
 Annotation(text='Utrecht', start=143, end=150, tag='Address', doc_id='', ann_id='T5'),
 Annotation(text='10 oktober', start=164, end=174, tag='Date', doc_id='', ann_id='T6'),
 Annotation(text='Peter de Visser', start=185, end=200, tag='Name', doc_id='', ann_id='T7'),
 Annotation(text='UMCU', start=234, end=238, tag='Hospital', doc_id='', ann_id='T8')]
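
Judging from the output above, start and end are character offsets into the document text, with end being exclusive. The sketch below checks this for three of the printed entities. Span is a stand-in defined for this example, not the library's Annotation class.

```python
from collections import namedtuple

# Stand-in for the annotation objects printed above (assumption: only the
# fields needed for this check are modelled).
Span = namedtuple("Span", ["text", "start", "end", "tag"])

text = (
    "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: "
    "j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 "
    "oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
)

spans = [
    Span("Jan Jansen", 39, 49, "Name"),
    Span("j.jnsen@email.com", 76, 93, "Email"),
    Span("UMCU", 234, 238, "Hospital"),
]


def check_offsets(text, spans):
    """Return the spans whose offsets do not reproduce their text."""
    return [s for s in spans if text[s.start:s.end] != s.text]


mismatches = check_offsets(text, spans)
print(mismatches)  # an empty list means every offset matches its text
```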

Mask Annotations

Use masking to replace annotations with placeholders. Example: Jan Jansen -> [NAME]

from deidentify.util import mask_annotations

masked_doc = mask_annotations(first_doc)
print(masked_doc.text)

This should print:

Dit is stukje tekst met daarin de naam [NAME]. De patient [NAME] (e: [EMAIL], t: [PHONE_FAX]) is [AGE] oud en woonachtig in [ADDRESS]. Hij werd op [DATE] door arts [NAME] ontslagen van de kliniek van het [HOSPITAL].
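
The idea behind masking can be sketched in a few lines. This is not the implementation of mask_annotations, only an illustration under the assumption that spans do not overlap. Span and mask_spans are hypothetical names introduced for this example.

```python
from collections import namedtuple

# Hypothetical minimal span type: character offsets plus a tag.
Span = namedtuple("Span", ["start", "end", "tag"])


def mask_spans(text, spans):
    """Replace each span with an upper-cased [TAG] placeholder."""
    # Work from the end of the text so that earlier offsets stay valid
    # after each replacement changes the length of the string.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + "[" + span.tag.upper() + "]" + text[span.end:]
    return text


example = "De patient J. Jansen is 64 jaar oud."
spans = [Span(11, 20, "Name"), Span(24, 31, "Age")]
masked = mask_spans(example, spans)
print(masked)  # De patient [NAME] is [AGE] oud.
```

Replacing from right to left avoids having to recompute offsets after every substitution.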

Replace Annotations with Surrogates [experimental]

Use surrogate generation to replace annotations with random but realistic alternatives. Example: Jan Jansen -> Bart Bakker. The surrogate replacement strategy follows Stubbs et al. (2015).

from deidentify.util import surrogate_annotations

# The surrogate generation process involves some randomness.
# You can set a seed to make the process deterministic.
iter_docs = surrogate_annotations(docs=[first_doc], seed=1)
surrogate_doc = list(iter_docs)[0]
print(surrogate_doc.text)

This code should print:

Dit is stukje tekst met daarin de naam Gijs Hermelink. De patient G. Hermelink (e: n.qvgjj@spqms.com, t: 06-83662585) is 64 jaar oud en woonachtig in Cothen. Hij werd op 28 juni door arts Jullian van Troost ontslagen van de kliniek van het UMCU.
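
The role of the seed can be illustrated with a toy example. It does not reproduce how surrogate_annotations works internally. CITY_POOL and pick_surrogates are hypothetical and exist only to show that a fixed seed gives repeatable replacements.

```python
import random

# Hypothetical replacement pool, for illustration only.
CITY_POOL = ["Cothen", "Zwolle", "Delft", "Assen", "Venlo"]


def pick_surrogates(originals, seed=None):
    """Map each original value to a randomly chosen replacement."""
    # A local generator keeps the choice reproducible without touching
    # the global random state.
    rng = random.Random(seed)
    return {original: rng.choice(CITY_POOL) for original in originals}


first = pick_surrogates(["Utrecht", "Enschede"], seed=1)
second = pick_surrogates(["Utrecht", "Enschede"], seed=1)
print(first == second)  # True
```

Storing the choice per original value, as the dictionary does here, also means that repeated mentions of the same value receive the same replacement in this sketch.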

Available Taggers

There are currently three taggers that you can use:

DeduceTagger: A wrapper around the DEDUCE tagger by Menger et al. (2018, code, paper)
CRFTagger: A CRF tagger using the feature set by Liu et al. (2015, paper)
FlairTagger: A wrapper around the Flair SequenceTagger allowing the use of neural architectures such as BiLSTM-CRF. The pre-trained models below use contextualized string embeddings by Akbik et al. (2018, paper)

All taggers implement the deidentify.taggers.TextTagger interface which you can implement to provide your own taggers.
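
As a rough sketch of what a custom tagger could look like, the example below defines a toy rule-based tagger that only finds e-mail addresses. To stay self-contained it does not import deidentify: Tagger, Doc and Span are stand-ins that mirror only what this README shows (an annotate method and a tags attribute). The actual TextTagger interface may require more than this, so check the library source before subclassing it.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Span:
    text: str
    start: int
    end: int
    tag: str


@dataclass
class Doc:
    name: str
    text: str
    annotations: list = field(default_factory=list)


class Tagger(ABC):
    """Stand-in interface with `annotate` and `tags` (assumed shape)."""

    @abstractmethod
    def annotate(self, documents):
        ...

    @property
    @abstractmethod
    def tags(self):
        ...


class EmailRegexTagger(Tagger):
    """Toy rule-based tagger that only finds e-mail addresses."""

    _pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def annotate(self, documents):
        annotated = []
        for doc in documents:
            spans = [
                Span(m.group(), m.start(), m.end(), "Email")
                for m in self._pattern.finditer(doc.text)
            ]
            annotated.append(Doc(doc.name, doc.text, spans))
        return annotated

    @property
    def tags(self):
        return ["Email"]


docs = EmailRegexTagger().annotate([Doc("doc_01", "Mail: j.jnsen@email.com, t: 06-12345678")])
print(docs[0].annotations)
```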

Tag Set

Use the TextTagger.tags attribute to get a list of supported tags. For the FlairTagger in the demo above, this looks as follows:

>>> tagger.tags
['Internal_Location', 'Age', 'Phone_fax', 'Name', 'SSN', 'Hospital', 'Email', 'Initials', 'O',
'Organization_Company', 'ID', 'Profession', 'Care_Institute', 'Other', 'Date', 'URL_IP', 'Address']

Pre-trained Models

We provide a number of pre-trained models for the Dutch language. The models were developed on the Nedap/University of Twente (NUT) dataset. The dataset consists of 1260 documents from three domains of Dutch healthcare: elderly care, mental care and disabled care (note: in the codebase we sometimes also refer to this dataset as ons). More information on the design of the dataset can be found in our paper.

Name	Tagger	Lang	Dataset	F1*	Precision*	Recall*	Tags
DEDUCE (Menger et al., 2018)**	`DeduceTagger`	NL	NUT	0.6649	0.8192	0.5595	8 PHI Tags
model_crf_ons_tuned-v0.2.0	`CRFTagger`	NL	NUT	0.8511	0.9337	0.7820	15 PHI Tags
model_bilstmcrf_ons_fast-v0.2.0	`FlairTagger`	NL	NUT	0.8914	0.9101	0.8735	15 PHI Tags
model_bilstmcrf_ons_large-v0.2.0	`FlairTagger`	NL	NUT	0.8990	0.9240	0.8754	15 PHI Tags

*All scores are micro-averaged entity-level precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.

**DEDUCE was developed on a dataset of psychiatric nursing notes and treatment plans. The numbers reported here were obtained by applying DEDUCE to our NUT dataset. For more information on the development of DEDUCE, see the paper by Menger et al. (2018).
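
To make the metric concrete, here is a minimal sketch of micro-averaged entity-level scoring. It assumes that an entity counts as correct only when document, span and tag match exactly. The evaluation code used for the paper may apply different matching rules. micro_prf and f1_from are names made up for this example.

```python
def micro_prf(gold, predicted):
    """Entity-level micro-averaged precision, recall and F1.

    `gold` and `predicted` are sets of (doc_id, start, end, tag) tuples
    pooled over all documents.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gold = {("d1", 39, 49, "Name"), ("d1", 143, 150, "Address"), ("d2", 0, 4, "Date")}
predicted = {("d1", 39, 49, "Name"), ("d2", 0, 4, "Name")}
precision, recall, f1 = micro_prf(gold, predicted)

# The F1 column of the table is consistent with its precision and recall
# columns, for example for model_bilstmcrf_ons_fast-v0.2.0:
table_f1 = f1_from(0.9101, 0.8735)
print(round(table_f1, 4))
```

In the toy data, one of the two predictions is an exact match, so precision is 1/2 and recall is 1/3. The second prediction has the right span but the wrong tag and therefore does not count.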

Running Experiments and Training Models

If you have your own dataset of annotated documents and you want to train your own models on it, you can take a look at the following guides:

If you want more information on the experiments in our paper, have a look here:

Computational Environment

To run your own experiments, clone this code base locally and execute all scripts under deidentify/ within the following conda environment:

# Install package dependencies and add local files to the Python path of that environment.
conda env create -f environment.yml
conda activate deidentify && export PYTHONPATH="${PYTHONPATH}:$(pwd)"

Citation

Please cite the following paper when using deidentify:

@inproceedings{Trienes:2020:CRF,
  title = {Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records},
  author = {Trienes, Jan and Trieschnigg, Dolf and Seifert, Christin and Hiemstra, Djoerd},
  booktitle = {Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop},
  series = {{HSDM} 2020},
  year = {2020}
}

Contact

If you have any questions, please contact Jan Trienes at jan.trienes@gmail.com.

Release history

Version	Release date
0.7.3 (this version)	May 5, 2022
0.7.2	Jun 3, 2021
0.7.1	Feb 15, 2021
0.7.0	Dec 16, 2020
0.6.1	Oct 13, 2020
0.6.0	Sep 10, 2020
0.5.2	Sep 7, 2020
0.5.1	Sep 4, 2020
0.5.0	Sep 4, 2020
0.4.0	Sep 4, 2020
0.3.3	Aug 7, 2020
0.3.2	Jan 16, 2020
0.3.1	Jan 16, 2020
0.3.0	Jan 16, 2020
