Deduce: de-identification method for Dutch medical text

If you are looking for the version of DEDUCE as published with Menger et al. (2017), please visit vmenger/deduce-classic, where the original is archived. This version is actively maintained and improved, so it may differ from the validated original.

This project contains the code for DEDUCE: de-identification method for Dutch medical text, initially described in Menger et al. (2017). De-identification of medical text is needed to use text data for analysis, to comply with legal requirements, and to protect the privacy of patients. Our pattern-matching-based method removes Protected Health Information (PHI) in the following categories:

  1. Person names, including initials
  2. Geographical locations smaller than a country
  3. Names of institutions that are related to patient treatment
  4. Dates
  5. Ages
  6. Patient numbers
  7. Telephone numbers
  8. E-mail addresses and URLs

Details of the development, workings, and validation of the initial method can be found in:

Menger, V.J., Scheepers, F., van Wijk, L.M., Spruit, M. (2017). DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text, Telematics and Informatics, 2017, ISSN 0736-5853

Prerequisites

  • nltk

Installing

Install the released package from PyPI with pip:

$ pip install deduce

Or install from source: download (or clone) the repository and run the following from its root directory:

$ python setup.py install
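To confirm that the installation succeeded, it can help to check that both deduce and its prerequisite nltk are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of DEDUCE.

```python
import importlib.util


def missing_modules(names=("deduce", "nltk")):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result means both packages can be imported by the interpreter that ran the check.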

Getting started

The package has one function for annotating text (annotate_text) and one for replacing the annotated spans with placeholder tags (deidentify_annotations).

import deduce

deduce.annotate_text(
    text,                       # The text to be annotated
    patient_first_names="",     # First names (separated by whitespace)
    patient_initials="",        # Initials
    patient_surname="",         # Surname(s)
    patient_given_name="",      # Given name
    names=True,                 # Person names, including initials
    locations=True,             # Geographical locations
    institutions=True,          # Institutions
    dates=True,                 # Dates
    ages=True,                  # Ages
    patient_numbers=True,       # Patient numbers
    phone_numbers=True,         # Phone numbers
    urls=True,                  # URLs and e-mail addresses
    flatten=True                # Debug option
)

deduce.deidentify_annotations(
    text                        # The annotated text that should be de-identified
)
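annotate_text returns the input text with each detected span wrapped in an angle-bracket tag, for example `<LOCATIE Utrecht>`. If you want to inspect what was found before de-identifying, the tags can be pulled out with a regular expression. The sketch below is not part of DEDUCE: the helper `extract_annotations` is hypothetical, and it assumes flat (non-nested) tags made of an uppercase label followed by the matched text.

```python
import re

# Assumed tag shape: <LABEL matched text>, with no nesting.
ANNOTATION = re.compile(r"<([A-Z]+) ([^<>]+)>")


def extract_annotations(annotated):
    """Return a list of (label, text) pairs for every flat annotation."""
    return ANNOTATION.findall(annotated)


sample = (
    "De <PATIENT patient J. Jansen> (e: <URL j.jnsen@email.com>, "
    "t: <TELEFOONNUMMER 06-12345678>) is <LEEFTIJD 64> jaar oud."
)
for label, value in extract_annotations(sample):
    print(label, "->", value)
```

Nested annotations would need a proper parser rather than a single regular expression.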
    

Examples

>>> import deduce

>>> text = (
...     "Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen "
...     "(e: j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. "
...     "Hij werd op 10 oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU."
... )
>>> annotated = deduce.annotate_text(text, patient_first_names="Jan", patient_surname="Jansen")
>>> deidentified = deduce.deidentify_annotations(annotated)

>>> print(annotated)
Dit is stukje tekst met daarin de naam <PATIENT Jan Jansen>. De <PATIENT patient J. Jansen> (e: <URL j.jnsen@email.com>, t: <TELEFOONNUMMER 06-12345678>) is <LEEFTIJD 64> jaar oud en woonachtig in <LOCATIE Utrecht>. Hij werd op <DATUM 10 oktober> door arts <PERSOON Peter de Visser> ontslagen van de kliniek van het <INSTELLING umcu>.
>>> print(deidentified)
Dit is stukje tekst met daarin de naam <PATIENT>. De <PATIENT> (e: <URL-1>, t: <TELEFOONNUMMER-1>) is <LEEFTIJD-1> jaar oud en woonachtig in <LOCATIE-1>. Hij werd op <DATUM-1> door arts <PERSOON-1> ontslagen van de kliniek van het <INSTELLING-1>.
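The de-identified output above replaces each annotated span with a placeholder such as `<PERSOON-1>`, while patient mentions become a plain `<PATIENT>`. The sketch below illustrates one way such a replacement step could work. It is not DEDUCE's actual implementation: the function `replace_annotations` is hypothetical, and the numbering rule (each distinct text within a category gets its own number, repeats reuse it) is an assumption that the example above neither confirms nor rules out.

```python
import re

# Assumed tag shape: <LABEL matched text>, with no nesting.
ANNOTATION = re.compile(r"<([A-Z]+) ([^<>]+)>")


def replace_annotations(annotated):
    """Replace each <LABEL text> span with <LABEL-n>; patient spans become <PATIENT>."""
    seen = {}  # label -> {text: number}

    def substitute(match):
        label, value = match.group(1), match.group(2)
        if label == "PATIENT":
            return "<PATIENT>"  # patient mentions share one unnumbered placeholder
        numbers = seen.setdefault(label, {})
        # Assumption: a repeated text reuses its number, a new text gets the next one.
        number = numbers.setdefault(value, len(numbers) + 1)
        return f"<{label}-{number}>"

    return ANNOTATION.sub(substitute, annotated)


print(replace_annotations("Arts <PERSOON Peter de Visser> sprak met <PATIENT Jan Jansen>."))
```

Numbering per distinct text keeps repeated mentions of the same person linked in the output without revealing the name.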

Configuring

The lookup lists in the data/ folder can be tailored to the user's specific needs. This is especially recommended for the list of institution names, since by default it is tailored to the location where the method was developed and tested. Regular expressions can be modified in annotate.py; for the same reason, this is recommended for detecting patient numbers.
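As an illustration of the kind of pattern you might adapt for patient numbers, the sketch below matches a purely hypothetical local convention of exactly seven digits. It does not reproduce the expression in annotate.py; the tag name `PATIENTNUMMER` and the function `annotate_patient_numbers` are assumptions made for this example.

```python
import re

# Hypothetical convention: a patient number is exactly 7 digits,
# not embedded in a longer run of digits.
PATIENT_NUMBER = re.compile(r"(?<!\d)\d{7}(?!\d)")


def annotate_patient_numbers(text):
    """Wrap every 7-digit number in an (assumed) <PATIENTNUMMER ...> tag."""
    return PATIENT_NUMBER.sub(lambda m: f"<PATIENTNUMMER {m.group(0)}>", text)


print(annotate_patient_numbers("Patient 1234567 belde 06-12345678."))
```

The lookarounds keep the pattern from firing inside longer digit sequences such as telephone numbers, which is the main risk when tightening or loosening a patient-number expression.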

Contributing

Thank you for considering a contribution to DEDUCE; we are very open to your help!

  • If you need support, have a question, or have found a bug, please get in touch by creating a New Issue. We don't have an issue template; just be specific and complete so we can tackle it.
  • If you want to contribute to the code or the docs, please take a few minutes to read our contribution guidelines. This greatly improves the chances of your work being merged into the repository.

Changelog

You may find detailed versioning information in the changelog.

Authors

  • Vincent Menger - Initial work
  • Jonathan de Bruin - Code review
  • Pablo Mosteiro - Bug fixes, structured annotations

License

This project is licensed under the GNU LGPLv3 license; see the LICENSE.md file for details.
