Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

docdeid

Installation - Getting started - Features - Documentation - Development and contributing - Authors - License

Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid
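
Since breaking changes may occur between releases, it can be useful to confirm which version ended up installed. The following is a minimal sketch using only the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of docdeid.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed docdeid version string, or None if it is missing.
print(installed_version("docdeid"))
```

Pinning the reported version in your requirements file is one way to guard against unexpected upgrades.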

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor", 
    SimpleRedactor()
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

Find the relevant info in the Document object:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
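
To make the numbering in the output above easier to follow, here is a small library-independent sketch of how such a result could be derived from character spans. It is an illustration of the visible behavior (identical texts under the same tag share a number), not docdeid's SimpleRedactor implementation; the Span type and the redact function are hypothetical names, and spans are assumed not to overlap.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical stand-in for an annotation: a tagged character range."""
    start: int
    end: int
    tag: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with [TAG-n]; identical texts under one tag share n."""
    numbering: dict[str, dict[str, int]] = defaultdict(dict)
    pieces: list[str] = []
    cursor = 0
    for span in sorted(spans):  # left to right; spans assumed non-overlapping
        entity = text[span.start:span.end]
        seen = numbering[span.tag]
        # Reuse the number of an entity seen before, else assign the next one.
        index = seen.setdefault(entity, len(seen) + 1)
        pieces.append(text[cursor:span.start])
        pieces.append(f"[{span.tag.upper()}-{index}]")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
spans = [
    Span(0, 4, "name"),
    Span(11, 15, "name"),
    Span(21, 25, "name"),
    Span(32, 39, "name"),
]
print(redact(text, spans))
# [NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].
```

Keeping one counter per tag means that a second tag (for example "location") would start its own numbering at 1.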

Features

Additionally, docdeid features:

  • Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
  • Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
  • Callable from one interface (DocDeid.deidentify())
  • String processing and filtering
  • Fast lookup based on sets or tries
  • Anything you add! PRs welcome.
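
The difference between set-based and trie-based lookup mentioned in the list above can be sketched in plain Python. A set answers whether a single token is a known value, while a trie over token sequences can find the longest known multi-token phrase starting at a position. The TokenTrie class below is a hypothetical illustration and does not reproduce docdeid's own data structures or API.

```python
class TokenTrie:
    """A minimal trie over token sequences (illustrative, not docdeid's class)."""

    _END = object()  # marker key: a complete phrase ends at this node

    def __init__(self) -> None:
        self._root: dict = {}

    def add(self, tokens: list[str]) -> None:
        """Store one phrase, given as a list of tokens."""
        node = self._root
        for token in tokens:
            node = node.setdefault(token, {})
        node[self._END] = True

    def longest_match(self, tokens: list[str], start: int = 0) -> int:
        """Return the token length of the longest stored phrase at `start`, or 0."""
        node = self._root
        best = 0
        for offset, token in enumerate(tokens[start:], start=1):
            node = node.get(token)
            if node is None:
                break
            if self._END in node:
                best = offset
        return best


single_names = {"John", "Mary"}  # set: constant-time lookup of single tokens

phrases = TokenTrie()
phrases.add(["Mary"])
phrases.add(["Mary", "Ann"])
phrases.add(["New", "York", "City"])

tokens = ["Mary", "Ann", "visited", "New", "York"]
print("Mary" in single_names)            # True
print(phrases.longest_match(tokens, 0))  # 2, "Mary Ann" beats "Mary"
print(phrases.longest_match(tokens, 3))  # 0, "New York" alone is not stored
```

A set is enough when every lookup value is one token; the trie earns its extra complexity only when phrases span several tokens.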

For a more in-depth tutorial, see: docs/tutorial

Documentation

For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/

Development and contributing

For setting up a development environment, see: docs/environment

For contributing, see: docs/contributing

Authors

Vincent Menger - Author, maintainer

License

This project is licensed under the MIT license - see the LICENSE.md file for details.
