Deduce: de-identification method for Dutch medical text
Installation - Versions - Getting Started - Documentation - Contributing - Authors - License
Deduce 2.0.0 has been released! It includes a 10x speedup and many more options for customizing and tailoring. Migrating from version 1 requires some small changes; read more about them here: docs/migrating-to-v2
De-identify clinical text written in Dutch using deduce, a rule-based de-identification method for Dutch clinical text.
The development, principles and validation of deduce were initially described in Menger et al. (2017). De-identification of clinical text is needed for using text data for analysis, to comply with legal requirements and to protect the privacy of patients. Our rule-based method removes Protected Health Information (PHI) in the following categories:
- Person names, including initials
- Geographical locations smaller than a country
- Names of institutions that are related to patient treatment
- Dates
- Ages
- Patient numbers
- Telephone numbers
- E-mail addresses and URLs
If you use deduce, please cite the following paper:
Installation
pip install deduce
Versions
For most cases the latest version is suitable, but some specific milestones are:
- 2.0.0 - Major refactor, with speedups and many new options for customizing; functionally very similar to the original
- 1.0.8 - Small bugfixes compared to the original release
- 1.0.1 - Original release with Menger et al. (2017)
Detailed versioning information is accessible in the changelog.
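Because code written for version 1 needs small changes to run on version 2, it can help to check which release is installed before running a pipeline. The following is a minimal sketch that uses only the Python standard library; the helper names `installed_version` and `major_version` are illustrative and not part of deduce.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "deduce") -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string: str) -> int:
    """Extract the leading major number from a version such as '2.0.2'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("deduce is not installed; run: pip install deduce")
    elif major_version(found) < 2:
        print(f"deduce {found} is a version 1 release; see docs/migrating-to-v2")
    else:
        print(f"deduce {found} is installed")
```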
Getting started
The basic way to use deduce is to pass text to the deidentify method of a Deduce object:
from deduce import Deduce
deduce = Deduce()
text = """Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen
(e: j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig
in Utrecht. Hij werd op 10 oktober door arts Peter de Visser ontslagen
van de kliniek van het UMCU."""
doc = deduce.deidentify(text)
The output is available in the Document object:
from pprint import pprint
pprint(doc.annotations)
AnnotationSet({Annotation(text='Jan Jansen', start_char=39, end_char=49, tag='persoon', length=10),
Annotation(text='Peter de Visser', start_char=185, end_char=200, tag='persoon', length=15),
Annotation(text='j.jnsen@email.com', start_char=76, end_char=93, tag='url', length=17),
Annotation(text='10 oktober', start_char=164, end_char=174, tag='datum', length=10),
Annotation(text='patient J. Jansen', start_char=54, end_char=71, tag='persoon', length=17),
Annotation(text='64', start_char=114, end_char=116, tag='leeftijd', length=2),
Annotation(text='UMCU', start_char=234, end_char=238, tag='instelling', length=4),
Annotation(text='06-12345678', start_char=98, end_char=109, tag='telefoonnummer', length=11),
Annotation(text='Utrecht', start_char=143, end_char=150, tag='locatie', length=7)})
print(doc.deidentified_text)
"""Dit is stukje tekst met daarin de naam <PERSOON-1>. De <PERSOON-2>
(e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig
in <LOCATIE-1>. Hij werd op <DATUM-1> door arts <PERSOON-3> ontslagen
van de kliniek van het <INSTELLING-1>."""
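Each annotation in the output above carries character offsets into the original text, so the spans can be inspected or post-processed directly. The sketch below illustrates this with a stand-in `Span` dataclass that mirrors the fields shown in the printed annotations; `Span`, `group_by_tag` and `mask` are hypothetical helpers, not deduce's own classes or functions, and the masking is deliberately simpler than deduce's (it does not number the placeholders).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the fields shown in the annotation output (not deduce's class)."""
    text: str
    start_char: int
    end_char: int
    tag: str


def group_by_tag(spans):
    """Map each tag to its annotated texts, ordered by position in the document."""
    grouped = defaultdict(list)
    for span in sorted(spans, key=lambda s: s.start_char):
        grouped[span.tag].append(span.text)
    return dict(grouped)


def mask(text, spans):
    """Replace every span with an upper-cased tag placeholder.

    Spans are processed right to left so that earlier offsets stay valid.
    """
    for span in sorted(spans, key=lambda s: s.start_char, reverse=True):
        text = text[:span.start_char] + f"<{span.tag.upper()}>" + text[span.end_char:]
    return text


# First sentence of the example text, with two of its annotations.
text = "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen"
spans = [
    Span("patient J. Jansen", 54, 71, "persoon"),
    Span("Jan Jansen", 39, 49, "persoon"),
]

# The offsets index directly into the original text.
assert all(text[s.start_char:s.end_char] == s.text for s in spans)

print(group_by_tag(spans))
print(mask(text, spans))
```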
Additionally, if the names of the patient are known, they may be added as metadata, where they will be picked up by deduce:
from deduce.person import Person
patient = Person(first_names=["Jan"], initials="JJ", surname="Jansen")
doc = deduce.deidentify(text, metadata={'patient': patient})
print(doc.deidentified_text)
"""Dit is stukje tekst met daarin de naam <PATIENT>. De <PATIENT>
(e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig
in <LOCATIE-1>. Hij werd op <DATUM-1> door arts <PERSOON-1> ontslagen
van de kliniek van het <INSTELLING-1>."""
As you can see, adding known names keeps references to <PATIENT> in text. It also increases recall, as not all known names are contained in the lookup lists.
Documentation
A more extensive tutorial on using, configuring and modifying deduce is available at: docs/tutorial
Basic documentation and API are available at: docs
Contributing
For setting up the dev environment and contributing guidelines, see: docs/contributing
Authors
- Vincent Menger - Initial work
- Jonathan de Bruin - Code review
- Pablo Mosteiro - Bug fixes, structured annotations
License
This project is licensed under the GNU LGPLv3 license - see the LICENSE.md file for details