Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

docdeid

Installation - Getting started - Features - Documentation - Development and contributing - Authors - License

Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor", 
    SimpleRedactor()
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

Find the relevant info in the Document object:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
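
To make the numbering in the redacted output easier to follow, here is a minimal standard-library sketch of one way such a result could be produced from annotation spans. It is not docdeid's implementation: the `redact` function and the `(start, end, tag)` tuple layout are hypothetical names chosen for illustration, and the sketch assumes non-overlapping spans. Repeated mentions of the same text with the same tag share a number.

```python
import re


def redact(text, spans):
    """Replace each (start, end, tag) span with a numbered placeholder.

    Hypothetical helper for illustration; assumes spans do not overlap.
    """
    ids = {}      # (tag, surface text) -> number assigned to that mention
    counts = {}   # tag -> how many distinct mentions have been seen so far
    pieces = []
    cursor = 0
    for start, end, tag in sorted(spans):
        key = (tag, text[start:end])
        if key not in ids:
            counts[tag] = counts.get(tag, 0) + 1
            ids[key] = counts[tag]
        pieces.append(text[cursor:start])             # untouched text before the span
        pieces.append(f"[{tag.upper()}-{ids[key]}]")  # placeholder for the span
        cursor = end
    pieces.append(text[cursor:])                      # remainder after the last span
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
# Same pattern as the RegexpAnnotator in the example above.
spans = [(m.start(), m.end(), "name") for m in re.finditer(r"[A-Z]\w+", text)]
redacted = redact(text, spans)
print(redacted)
```

Both occurrences of "Mary" map to the same placeholder because the sketch keys its counter on the tag and the matched text together.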

Features

Additionally, docdeid features:

  • Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
  • Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
  • Callable from one interface (DocDeid.deidentify())
  • String processing and filtering
  • Fast lookup based on sets or tries
  • Anything you add! PRs welcome.
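
As an illustration of the trie-based lookup mentioned in the list above, the following is a small standard-library sketch of a token trie that finds the longest multi-token phrase starting at a given position. It is a generic sketch and does not use docdeid's own lookup classes; `TokenTrie`, `add` and `longest_match` are hypothetical names.

```python
class TokenTrie:
    """Trie keyed on whole tokens rather than characters (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.is_terminal = False

    def add(self, tokens):
        """Store a phrase given as a list of tokens."""
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens, start=0):
        """Return the token count of the longest stored phrase at `start` (0 if none)."""
        node = self
        best = 0
        for offset, tok in enumerate(tokens[start:], 1):
            node = node.children.get(tok)
            if node is None:
                break
            if node.is_terminal:
                best = offset
        return best


trie = TokenTrie()
trie.add(["Mary"])
trie.add(["Mary", "Ann", "Smith"])

tokens = ["Mary", "Ann", "Smith", "loves", "Mary", "Ann"]
print(trie.longest_match(tokens, 0))  # full phrase "Mary Ann Smith"
print(trie.longest_match(tokens, 3))  # "loves" is not stored
print(trie.longest_match(tokens, 4))  # only "Mary" matches before the input ends
```

A plain set is enough when every lookup value is a single token; a trie becomes useful once phrases span several tokens, because a partial match can be abandoned as soon as the next token has no child node.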

For a more in-depth tutorial, see: docs/tutorial

Documentation

For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/

Development and contributing

For setting up a development environment, see: docs/environment

For contributing, see: docs/contributing

Authors

Vincent Menger - Author, maintainer

License

This project is licensed under the MIT license - see the LICENSE.md file for details.
