Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

Project description

docdeid

Installation - Getting started - Features - Documentation - Development and contributing - Authors - License

Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor", 
    SimpleRedactor()
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

Find the relevant info in the Document object:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
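
To make the numbering in the output above concrete, here is a minimal sketch in plain Python of how placeholders like [NAME-1] could be derived from a set of annotations. It does not use docdeid and is not how SimpleRedactor is implemented; the Span class and the redact function are hypothetical stand-ins, and the sketch assumes the spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an annotation (not docdeid's Annotation class)."""
    text: str
    start_char: int
    end_char: int
    tag: str


def redact(text, spans):
    """Replace each span with [TAG-n]; equal texts under one tag share the same n.

    Assumes the spans do not overlap.
    """
    numbering = {}  # tag -> {annotated text -> index}
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start_char):
        seen = numbering.setdefault(span.tag, {})
        # A text seen before keeps its index; a new text gets the next one.
        index = seen.setdefault(span.text, len(seen) + 1)
        pieces.append(text[cursor:span.start_char])
        pieces.append(f"[{span.tag.upper()}-{index}]")
        cursor = span.end_char
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
spans = {
    Span("John", 0, 4, "name"),
    Span("Mary", 11, 15, "name"),
    Span("Mary", 21, 25, "name"),
    Span("William", 32, 39, "name"),
}
print(redact(text, spans))
```

Because both occurrences of Mary map to the same index, the relationships between people stay readable in the redacted text.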

Features

Additionally, docdeid features:

Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
Callable from one interface (DocDeid.deidentify())
String processing and filtering
Fast lookup based on sets or tries
Anything you add! PRs welcome.
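
As an illustration of the "sets or tries" item above, the following sketch shows the general idea in plain Python: a set answers single-token membership in constant time on average, while a trie keyed on tokens finds the longest multi-token phrase starting at a given position. The TokenTrie class and its methods are hypothetical and written for this example; they are not docdeid's own data structures.

```python
class TokenTrie:
    """Hypothetical token-level trie for multi-token phrase lookup."""

    def __init__(self):
        self.children = {}
        self.is_terminal = False

    def add(self, tokens):
        """Store one phrase, given as a list of tokens."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens, start=0):
        """Return the length of the longest stored phrase beginning at tokens[start], or 0."""
        node = self
        best = 0
        for offset, token in enumerate(tokens[start:], start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.is_terminal:
                best = offset
        return best


single_names = {"John", "Mary"}  # set lookup for single tokens

trie = TokenTrie()
trie.add(["Mary", "Ann"])
trie.add(["Mary", "Ann", "Smith"])

tokens = ["Dear", "Mary", "Ann", "Smith", "and", "John"]
print([t for t in tokens if t in single_names])  # ['Mary', 'John']
print(trie.longest_match(tokens, start=1))       # 3
```

The trie walks the token list once per start position and stops at the first token that cannot extend any stored phrase, so lookup cost depends on phrase length rather than on the size of the lookup list.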

For a more in-depth tutorial, see: docs/tutorial

Documentation

For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/

Development and contributing

For setting up a development environment, see: docs/environment

For contributing, see: docs/contributing

Authors

Vincent Menger - Author, maintainer

License

This project is licensed under the MIT license - see the LICENSE.md file for details.

Release history

Version	Release date
1.0.0 (this version)	Dec 20, 2023
0.1.10	Nov 28, 2023
0.1.9	Oct 20, 2023
0.1.8	Aug 1, 2023
0.1.7	Jul 26, 2023
0.1.6	Mar 28, 2023
0.1.5	Feb 15, 2023
0.1.4	Nov 29, 2022
0.1.3	Nov 28, 2022
0.1.2	Nov 28, 2022
0.1.1	Nov 18, 2022
0.1.0	Nov 18, 2022
0.0.13	Oct 18, 2022
0.0.12	Oct 7, 2022
0.0.11	Oct 6, 2022
0.0.10	Sep 20, 2022
0.0.9	Sep 13, 2022
0.0.8	Sep 12, 2022
0.0.7	Aug 29, 2022
0.0.6	Aug 25, 2022
0.0.5	Aug 18, 2022
0.0.4	Aug 16, 2022
0.0.3	Aug 5, 2022
0.0.2	Aug 5, 2022
0.0.1	Aug 5, 2022

Download files

Source Distribution

docdeid-1.0.0.tar.gz (21.2 kB)

Uploaded Dec 20, 2023 Source

Built Distribution

docdeid-1.0.0-py3-none-any.whl (26.3 kB)

Uploaded Dec 20, 2023 Python 3

Hashes for docdeid-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`fea630e1dff140eb939c6474df8fcebe428c28c94eed5a5b9ae5c218205b0948`
MD5	`59b825f349f551f2f2339a95a3dbe89c`
BLAKE2b-256	`001ea725d1d012bcc14dd671c6e456f12cadc9c37cd6ca11ffd0d02bcaaa6a58`

Hashes for docdeid-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d5d93ec3fbd8557a9cd41b56ec3774bc3a86575d8dc6a3becd486cdf2190993b`
MD5	`5ee59fff68ea632243e5891585f4dc23`
BLAKE2b-256	`1f3e33aec857ccd7739c2b8f8744384bf450f3ca6c498eadcacd4a79e2acc6b8`
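
The digests listed above can be used to check a downloaded file before installing it. The sketch below uses only Python's standard library (hashlib and hmac); the helper names sha256_of_file and matches_published_digest are hypothetical, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage (assumes the source distribution sits in the current directory):
# matches_published_digest(
#     "docdeid-1.0.0.tar.gz",
#     "fea630e1dff140eb939c6474df8fcebe428c28c94eed5a5b9ae5c218205b0948",
# )
```

SHA256 is the digest to prefer here; MD5 is listed for completeness but is not considered collision-resistant.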